© 2010 IBM Corporation

InfoSphere CDC Flat file for DataStage

Configuration and Best Practices

Information Management Software

Understanding the Flat File Workflow

Landing Location

1. Source Database

• Configure CDC on the source database, where the CDC service for the database reads the transaction log to capture changes

2. Defining the Replication Definition

• CDC for DataStage transfers the change data according to the replication definition

• To configure:

• Define the table structure that will be sent to DataStage

• Define the DataStage connection method for Flat Files

• Define the single or multiple record format to determine how DataStage will process the incoming records

Map Table for Flat File Output (1)

• Map table as usual, select WebSphere DataStage as the target

• Select Flat File for method

• Specify the directory to which the flat files will be written and from which they will be picked up by the DataStage job (the directory resides on the DS server)

• Initial status of table will be Active (picking up changes from the moment it was mapped)

Map Table for Flat File Output (2)

Defining the DataStage Record Format (1)

• Standard columns containing information about the change:

• DM_TIMESTAMP - The timestamp obtained from the log of when the operation occurred (contains the value from the &TIMSTAMP journal control field)

• DM_TXID - Transaction identifier (contains the value from the &CCID journal control field)

• DM_OPERATION_TYPE - A single character indicating the type of operation:
    • "I" for an insert.
    • "D" for a delete.
    • For Single Record Format there is one type that represents the update image:
        • "U" represents an update.
    • For Multiple Record Format there are two separate types that represent the before and after images:
        • "B" for the row containing the before image of an update.
        • "A" for the row containing the after image of an update.

• DM_USER - The user that performed the operation (contains the value from the &USER journal control field)
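
As a rough illustration of the layout above, the following Python sketch reads one flat-file line with the standard `csv` module and separates the four standard columns from the table columns. The column order (timestamp, transaction ID, operation type, user, then table data) is inferred from the sample records shown later in this deck; `ChangeRecord` and `parse_record` are hypothetical names, not part of CDC or DataStage.

```python
import csv
from dataclasses import dataclass
from typing import List

# Operation codes as listed on this slide.
OPERATION_TYPES = {
    "I": "insert",
    "D": "delete",
    "U": "update",
    "B": "update before image",
    "A": "update after image",
}


@dataclass
class ChangeRecord:  # hypothetical container for one flat-file row
    dm_timestamp: str
    dm_txid: str
    dm_operation_type: str
    dm_user: str
    columns: List[str]  # the replicated table columns, in file order


def parse_record(line: str) -> ChangeRecord:
    """Split one quoted, comma-separated line into standard and table columns."""
    fields = next(csv.reader([line]))
    if len(fields) < 4:
        raise ValueError("expected at least the four standard columns")
    ts, txid, op, user, *data = fields
    if op not in OPERATION_TYPES:
        raise ValueError(f"unknown DM_OPERATION_TYPE: {op!r}")
    return ChangeRecord(ts, txid, op, user, data)


record = parse_record('"2010-11-23 21:46:15","0","A","EPANG","1","hello "')
print(record.dm_operation_type, "=", OPERATION_TYPES[record.dm_operation_type])
```

The trailing space in `"hello "` is kept deliberately, since the sample data in this deck appears to be padded.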

Defining the DataStage Record Format (2)

• Single record format

• In this format an update operation is sent as a single row

• The before and after images are contained in the same record

• E.g. Updating 3 records

"2010-11-23 21:43:24","0","U","EPANG","1","elaine ","1","update "

"2010-11-23 21:43:24","0","U","EPANG","2","elaine ","2","update "

"2010-11-23 21:43:24","0","U","EPANG","3","abc ","3","update "

• Multiple record format

• An update operation is sent as two rows, the first row being the before image and the second row containing the after image.

• E.g. Updating 3 records

"2010-11-23 21:46:15","0","B","EPANG","1","update "

"2010-11-23 21:46:15","0","A","EPANG","1","hello "

"2010-11-23 21:46:15","0","B","EPANG","2","update "

"2010-11-23 21:46:15","0","A","EPANG","2","hello "

"2010-11-23 21:46:15","0","B","EPANG","3","update "

"2010-11-23 21:46:15","0","A","EPANG","3","hello "
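
A consumer of the multiple record format has to stitch each "B" row to the "A" row that follows it. The sketch below shows one way this might be done in Python; it assumes the column positions seen in the samples above (operation type in the third field, table columns from the fifth field on) and that a before image is always immediately followed by its after image. `pair_update_images` is a hypothetical helper, and rows with other operation types are simply skipped here.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def pair_update_images(
    lines: Iterable[str],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (before_columns, after_columns) for each B/A row pair."""
    before = None
    for fields in csv.reader(lines):
        op, data = fields[2], fields[4:]
        if op == "B":
            before = data
        elif op == "A":
            if before is None:
                raise ValueError("after image without a preceding before image")
            yield before, data
            before = None
        # "I" and "D" rows are ignored in this sketch.


sample = [
    '"2010-11-23 21:46:15","0","B","EPANG","1","update "',
    '"2010-11-23 21:46:15","0","A","EPANG","1","hello "',
    '"2010-11-23 21:46:15","0","B","EPANG","2","update "',
    '"2010-11-23 21:46:15","0","A","EPANG","2","hello "',
]
pairs = list(pair_update_images(sample))
for old, new in pairs:
    print(old, "->", new)
```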

Naming Convention of Flat Files

• CDC uses the following convention to name the flat files that are produced during replication.

• [Table].x[Date].[Time][# Records]

• x = D for completed flat files, @ for currently open flat file

• [Date] = Julian date (year, day number within year)

• [Time] = hh24mmss when flat file was created (in GMT)

• [# Records] = Optionally the number of records can be added

• [Table].STOPPED

• When subscription is stopped, this file is generated

The timestamp format can be configured using the system parameter ds_output_timestamp_format. E.g. ds_output_timestamp_format="yyyy-MM-dd HH:mm:ss.SSS" (to include milliseconds)
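
To make the naming convention concrete, here is a small Python sketch that classifies a file name as completed, open, or a STOPPED marker. It is only an illustration under stated assumptions: the example file names are invented, the exact digit layout of [Date] and [Time] is not given in this deck (so the sketch does not validate it), and `classify_flat_file` is a hypothetical helper. Splitting from the right keeps the sketch working if the table name itself contains a dot.

```python
from typing import NamedTuple, Optional


class FlatFileName(NamedTuple):  # hypothetical helper type
    table: str
    status: str               # "completed", "open" or "stopped"
    date_part: Optional[str]  # [Date] exactly as written in the name
    tail: Optional[str]       # [Time] plus the optional [# Records]


def classify_flat_file(name: str) -> FlatFileName:
    """Classify a name of the form [Table].x[Date].[Time][# Records]."""
    suffix = ".STOPPED"
    if name.endswith(suffix):
        return FlatFileName(name[: -len(suffix)], "stopped", None, None)
    parts = name.rsplit(".", 2)
    if len(parts) != 3 or not parts[1]:
        raise ValueError(f"not a CDC flat file name: {name!r}")
    table, marker_and_date, tail = parts
    statuses = {"D": "completed", "@": "open"}
    marker = marker_and_date[0]
    if marker not in statuses:
        raise ValueError(f"unexpected status marker {marker!r} in {name!r}")
    return FlatFileName(table, statuses[marker], marker_and_date[1:], tail)


# Invented example names; real names depend on the CDC configuration.
for example in (
    "CUSTOMER.D2010327.214324",
    "CUSTOMER.@2010327.214500",
    "CUSTOMER.STOPPED",
):
    print(example, "->", classify_flat_file(example).status)
```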

3. Flat Files Become Available for DataStage

• CDC for DataStage server hardens the files and deposits them in the flat file location.

• While CDC is actively mirroring to a file, the file is not accessible to DataStage. The process of hardening involves renaming the file, replacing the ‘@’ with a ‘D’, thus making it available to DataStage.

• To configure:

• Define the Batch Size Threshold settings to determine how often CDC hardens the flat files that are made available to DataStage

Set Subscription DataStage Properties

• Right-click on subscription to set properties

• The file is always hardened at a transaction boundary, once either of the following thresholds is passed:

• Time in seconds after which the flat file is closed

• Maximum number of rows per flat file

• The flat file is closed and the next one is created/opened when either value is reached

• Closed flat files can be picked up by DataStage for processing, as they contain only completed transactions
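
The hardening rule described above can be summarised as a small predicate. This Python sketch is only a reading of the slide, not CDC's actual implementation: the function and parameter names are hypothetical, and whether CDC compares with "greater than" or "greater than or equal" is an assumption.

```python
def should_harden(
    rows_in_file: int,
    seconds_open: float,
    at_transaction_boundary: bool,
    max_rows: int,
    max_seconds: float,
) -> bool:
    """Return True when the open flat file may be closed and renamed.

    A file is only hardened at a transaction boundary, and only once
    at least one of the two batch size thresholds has been reached.
    """
    threshold_passed = rows_in_file >= max_rows or seconds_open >= max_seconds
    return at_transaction_boundary and threshold_passed


# Row threshold reached, but a transaction is still in flight: keep writing.
print(should_harden(1000, 5.0, False, max_rows=1000, max_seconds=60.0))
# Same state at a commit: the file can be hardened.
print(should_harden(1000, 5.0, True, max_rows=1000, max_seconds=60.0))
```

One consequence of this rule is that a file can exceed the configured row count when a large transaction straddles the threshold.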

4. Flat Files Read by DataStage Job

• InfoSphere DataStage sequential file reader retrieves the flat files as part of an InfoSphere DataStage job and transforms them

• The job has three parameters defined in the Management Console where the *.dsx file is created:

• SPFolderPath – the full path name for the folder that DataStage searches for the source flat files created by CDC

• SPFileNamePattern – the file name pattern used to identify the source flat files

• SPEndFileNamePattern – the file name pattern used to identify the file that is created when a subscription stops mirroring.
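
The role of the three parameters can be sketched outside DataStage. The Python example below scans a folder for hardened data files and checks for the end-of-mirroring file. It assumes shell-style wildcard patterns, which may not match the pattern syntax the generated job actually uses; the file names and pattern values are invented, and `scan_landing_folder` is a hypothetical helper.

```python
import fnmatch
import os
import tempfile
from typing import List, Tuple


def scan_landing_folder(
    sp_folder_path: str,
    sp_file_name_pattern: str,
    sp_end_file_name_pattern: str,
) -> Tuple[List[str], bool]:
    """Return (sorted matching data files, whether an end file is present)."""
    names = sorted(os.listdir(sp_folder_path))
    data_files = [n for n in names if fnmatch.fnmatchcase(n, sp_file_name_pattern)]
    stopped = any(fnmatch.fnmatchcase(n, sp_end_file_name_pattern) for n in names)
    return data_files, stopped


# Demonstration against a throwaway directory with invented file names.
with tempfile.TemporaryDirectory() as folder:
    for name in (
        "CUSTOMER.D2010327.214324",  # hardened, ready to read
        "CUSTOMER.@2010327.214500",  # still open, must be skipped
        "CUSTOMER.STOPPED",          # subscription has stopped
    ):
        open(os.path.join(folder, name), "w").close()
    files, stopped = scan_landing_folder(folder, "CUSTOMER.D*", "CUSTOMER.STOPPED")

print(files, stopped)
```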

5. Flat Files are Deposited to New Location

• InfoSphere DataStage sequential file reader deposits the transformed flat files in the new flat file location

• To configure:

• Create the DataStage definition file (*.dsx) from Management Console or in DataStage Designer

• Import definition file into DataStage and customize any additional steps/stages where necessary

Connecting CDC for DataStage with DataStage

• DataStage uses job definitions to describe the sequence of steps, or stages, required to transform data

• DataStage jobs are normally designed and edited in InfoSphere DataStage Designer

• When using CDC for DataStage you have the option of generating a job definition within CDC without creating it in DataStage Designer

Generating an InfoSphere DataStage Definition File

• DataStage definition import file (.dsx) can be generated automatically

• Right-click on subscription and select Generate InfoSphere DataStage Job Definition

• Place .dsx file at a location where it can be selected from DataStage (or copy it to the DS server)

Import .dsx file into DataStage (1)

• DataStage flat file processing job will be generated automatically

• The DS job is already tailored to pick up the flat files from the specified directory

Import .dsx file into DataStage (2)

Best Practices for Flat Files

Flat Files are Best Suited for…

• Fewer than a few hundred tables

• Extra memory will need to be allocated with larger numbers of tables

• Very high data volume which requires parallel loading

• Replacement for existing ETL delta extracts

• Data warehouses which benefit from bulk load of changed data

• Installation on 64 bit systems

Considerations and Limitations

• The Flat File integration option is not suitable when character columns contain binary data. The UTF-8 files may contain code points that resolve to special characters, such as quotes, line feeds or carriage returns, that cannot be processed

• Tables are individually replicated, which can break transactional table dependencies

• Additional processing is required in DataStage to maintain referential integrity between dependent tables

• Disk staging space

• Managing many files

Initial Synchronization

• DataStage extracts data from source database using standard ETL functions

• An alternative is to use CDC to perform the initial Refresh and then transition to mirroring mode. This method involves first creating flat files for the refresh, then loading them using DataStage.

Recommended Flat file Storage Option

• Direct attached disk storage is a typical option used for the storage of CDC flat files.

• Shared Storage Area Network (SAN) is another recommended option to stage files.

• This allows running CDC DS on a server separated from the DataStage grid, ensuring CDC has dedicated CPU/Disk capacity.

• The DataStage grid nodes can then read the files on the shared SAN, allowing for high performance and recoverability.

• Network File System (NFS) is not recommended for high volume environments.

• CDC is not resilient to file system errors that may occur, and may suffer from network latency when writing many small changes to the flat files.

Clean-up of Flat files Generated by CDC

• By default, the .dsx file generated by CDC will define that flat files are removed once CDC has deposited the files into the DataStage job.

• If additional sequencing of the files is required (e.g. multiple tables containing foreign key relationships), this logic requires customization.

• A DataStage expert can modify the .dsx file generated by CDC to remove the cleanup logic and make adjustments as appropriate.

Distinguishing Transaction/Record Ordering

• The timestamp field provides second to microsecond accuracy. It cannot be used alone to uniquely order records if multiple records are changed at the same time

• You can use the system parameter ds_output_timestamp_format to format the timestamp with milliseconds in the flat files. Note: some databases, like Oracle, cannot produce millisecond accuracy. Changing this parameter cannot improve upon the accuracy that the database supports

• For sequencing within a single table:

• Use a combination of the timestamp, flat file number and line number to uniquely identify changes in commit order
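
The single-table sequencing idea can be sketched as a composite sort key. In the Python example below, `file_number` is assumed to be the position of the flat file in creation order (for instance derived from the date and time in its name) and `line_number` the row's position within that file; both are hypothetical fields that the reader would have to attach while loading, and `LocatedChange` and `in_commit_order` are illustrative names. Timestamps in the "yyyy-MM-dd HH:mm:ss" layout sort correctly as plain strings.

```python
from typing import Iterable, List, NamedTuple


class LocatedChange(NamedTuple):  # hypothetical record with its file position
    dm_timestamp: str
    file_number: int
    line_number: int
    payload: str


def in_commit_order(changes: Iterable[LocatedChange]) -> List[LocatedChange]:
    """Order changes by timestamp, then flat file number, then line number."""
    return sorted(
        changes,
        key=lambda c: (c.dm_timestamp, c.file_number, c.line_number),
    )


# Three changes sharing one timestamp: only file and line position separate them.
changes = [
    LocatedChange("2010-11-23 21:46:15", 2, 1, "third"),
    LocatedChange("2010-11-23 21:46:15", 1, 2, "second"),
    LocatedChange("2010-11-23 21:46:15", 1, 1, "first"),
]
ordered = in_commit_order(changes)
print([c.payload for c in ordered])
```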

• If you need to sequence across all tables in a subscription, you will additionally be required to use a derived column on the source to generate a sequence number

Recovery

• CDC maintains the source database log position in a ‘bookmark’ which is used for restarting replication and/or recovery from failure

• Flat files – CDC writes the bookmark to internal CDC metadata when hardening a flat file which has finished writing

• If the network is lost or a system failure occurs, the flat file option provides recoverability and resiliency; CDC will start from the last flat file that was not yet hardened

• Both options operate independently from DataStage, which periodically picks up the changes and processes the data

• CDC only manages recovery up to the CDC staging mechanism