Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005 Roderick Lee, 2010

Upload: rolee23

Post on 16-Jan-2015


DESCRIPTION

Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005: a research presentation for the SetFocus Business Intelligence Honors program

TRANSCRIPT

Page 1: CDC

Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005

Roderick Lee, 2010

Page 2: CDC

Employee Rates Data Flow

The process must execute a Lookup against the target table for each incoming record to distinguish inserts from updates.

Also, without separate change-tracking data, every run reads the whole source table, so the count of incoming records equals the size of the source table.

Sample Multi-Purpose Data Flow for both Inserts and Updates
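The cost of this lookup-based approach can be sketched outside SSIS. The Python below is a minimal illustration, not the package's actual logic: the function name `lookup_load` and the in-memory dictionary standing in for the target table are assumptions made for the example.

```python
def lookup_load(source_rows, target):
    """Classify every source row with a lookup against the target.

    source_rows: iterable of (EmployeePK, HourlyRate) tuples (full source table).
    target: dict mapping EmployeePK -> HourlyRate (stand-in for the target table).
    """
    lookups = inserts = updates = 0
    for employee_pk, hourly_rate in source_rows:
        lookups += 1                       # one lookup per incoming record
        current = target.get(employee_pk)
        if current is None:                # not found in target: insert
            target[employee_pk] = hourly_rate
            inserts += 1
        elif current != hourly_rate:       # found with a different rate: update
            target[employee_pk] = hourly_rate
            updates += 1
    return {"lookups": lookups, "inserts": inserts, "updates": updates}


# Hypothetical data: three source rows, only two of which actually changed.
source = [(1, 20.00), (2, 25.50), (3, 31.00)]
target = {1: 20.00, 2: 24.00}
stats = lookup_load(source, target)
```

Even though only two rows changed, the number of lookups equals the number of source rows, which is the inefficiency the rest of the presentation addresses.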

Page 3: CDC

Change Data Capture

image from Microsoft Books Online, 2008

Change Data Capture (CDC) is an automated operation that records transactional activity in the source table (inserts, updates, and deletes). This streamlines the ETL procedure because there is no need to compare all the data in the target table to identify changes. Also, it increases efficiency by limiting the source pool to already identified changes.

SQL Server 2008 has full CDC support and implements the capture process by writing transaction log activity into a set of specialized CDC tables. This is a new feature which did not exist in SQL Server 2005.

Even without automated transaction log tracking, there are other ways to build a capture process. This demonstration uses triggers to load the changes into a CDC change table, which is similar in design to the SQL Server 2008 version.
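A trigger-based capture process of this kind can be sketched as follows. This is an illustration only: it uses SQLite through Python's `sqlite3` module rather than SQL Server 2005, the trigger names and the simplified `operation` column name are assumptions, and the original triggers are not shown in the slides. The operation codes follow the SQL Server 2008 convention (1 = delete, 2 = insert, 3 = update before image, 4 = update after image).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmployeeRatesNew (
    EmployeePK INTEGER PRIMARY KEY,
    HourlyRate REAL
);

-- Simplified change table; column names are stand-ins for the slide's design.
CREATE TABLE EmployeeRatesNew_CDC (
    operation        INTEGER,
    EmployeePK       INTEGER,
    HourlyRate       REAL,
    insert_date      TEXT DEFAULT CURRENT_TIMESTAMP,
    CDC_process_date TEXT
);

CREATE TRIGGER trg_rates_insert AFTER INSERT ON EmployeeRatesNew
BEGIN
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (2, NEW.EmployeePK, NEW.HourlyRate);
END;

-- An update writes two rows: the before image (3) and the after image (4).
CREATE TRIGGER trg_rates_update AFTER UPDATE ON EmployeeRatesNew
BEGIN
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (3, OLD.EmployeePK, OLD.HourlyRate);
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (4, NEW.EmployeePK, NEW.HourlyRate);
END;

CREATE TRIGGER trg_rates_delete AFTER DELETE ON EmployeeRatesNew
BEGIN
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (1, OLD.EmployeePK, OLD.HourlyRate);
END;
""")

# Exercise the triggers with one insert, one update, and one delete.
conn.execute("INSERT INTO EmployeeRatesNew VALUES (1, 20.0)")
conn.execute("UPDATE EmployeeRatesNew SET HourlyRate = 22.5 WHERE EmployeePK = 1")
conn.execute("DELETE FROM EmployeeRatesNew WHERE EmployeePK = 1")

changes = conn.execute(
    "SELECT operation, EmployeePK, HourlyRate "
    "FROM EmployeeRatesNew_CDC ORDER BY rowid"
).fetchall()
```

Three statements against the source table produce four change rows, because the update is recorded twice.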

Page 4: CDC

Tables

EmployeeRates (Original Target Table)
  PK  EmployeeRatePK    int identity
      EmployeePK        int
      HourlyRate        decimal(18,2)
      EffectiveDate     datetime

EmployeeRates2 (Adapted for SCD Type 2)
  PK  EmployeeRatePK    int identity
      EmployeePK        int
      HourlyRate        decimal(18,2)
      StartDate         datetime
      EndDate           datetime

EmployeeRatesNew_CDC (CDC Table)
      CDC_$start_lsn    int
      CDC_$end_lsn      int
      CDC_$seqval       smallint
      CDC_$operation    tinyint
      CDC_$update_mask  bit
      EmployeePK        int
      HourlyRate        decimal(18,2)
      insert_date       datetime
      CDC_process_date  datetime

EmployeeRatesNew (Source Table)
  PK  EmployeePK        int
      HourlyRate        decimal(18,2)

The five leading CDC columns mirror the SQL Server 2008 change table architecture.

• “lsn” = log sequence number
• The update mask column is a bit mask datatype, one bit per original source column
• Insert and process dates track CDC progress because there are no actual log sequence numbers
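The idea of an update mask with one bit per source column can be illustrated with a short Python sketch. The bit ordering (bit 0 for the first column) and the helper names `update_mask` and `changed_columns` are assumptions for the example; the slides do not show how the demonstration populates the column.

```python
# Source columns in table order; bit 0 corresponds to the first column (assumed).
SOURCE_COLUMNS = ["EmployeePK", "HourlyRate"]


def update_mask(old_row, new_row):
    """Return an integer with one bit set for each column whose value changed."""
    mask = 0
    for bit, column in enumerate(SOURCE_COLUMNS):
        if old_row[column] != new_row[column]:
            mask |= 1 << bit
    return mask


def changed_columns(mask):
    """Decode a mask back into the list of changed column names."""
    return [c for bit, c in enumerate(SOURCE_COLUMNS) if mask & (1 << bit)]


before = {"EmployeePK": 1, "HourlyRate": 20.0}
after = {"EmployeePK": 1, "HourlyRate": 22.5}
mask = update_mask(before, after)
```

Here only `HourlyRate` changed, so only the second bit is set, and `changed_columns(mask)` recovers the column name.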

Page 5: CDC

CDC Test Inserts and Updates

Result set in the CDC table tracking the changes. Note that each update creates two records (in the SQL Server 2008 design, the before and after images of the row).

Test script with inserts, updates, and deletes

Page 6: CDC

SCD Data Flow

CDC for Slowly Changing Dimension

The SCD transform determines insert or update without the need for a Lookup transform. The conditional split is based on the CDC_$operation column.

Note that the source table for this data flow is the CDC table.
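Routing rows by operation code and applying them as Type 2 changes might look like the sketch below. It is written in Python rather than as SSIS components, and several details are assumptions: the operation codes follow the SQL Server 2008 convention, before-image rows are skipped, a delete is treated as closing the current row, and the function name `apply_changes` is hypothetical.

```python
from datetime import datetime

# Operation codes as used by SQL Server 2008 change tables.
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4


def apply_changes(cdc_rows, dimension, now):
    """Apply (operation, EmployeePK, HourlyRate) rows to a Type 2 dimension.

    dimension: list of dicts with EmployeePK, HourlyRate, StartDate, EndDate.
    """
    for operation, employee_pk, hourly_rate in cdc_rows:
        if operation == UPDATE_BEFORE:
            continue  # the after image carries the new value

        if operation in (UPDATE_AFTER, DELETE):
            # Close the current version of this employee's rate.
            for row in dimension:
                if row["EmployeePK"] == employee_pk and row["EndDate"] is None:
                    row["EndDate"] = now

        if operation in (INSERT, UPDATE_AFTER):
            # Open a new current version.
            dimension.append({
                "EmployeePK": employee_pk,
                "HourlyRate": hourly_rate,
                "StartDate": now,
                "EndDate": None,
            })
    return dimension


load_time = datetime(2010, 1, 1)
cdc_rows = [(INSERT, 1, 20.0), (UPDATE_BEFORE, 1, 20.0), (UPDATE_AFTER, 1, 22.5)]
dimension = apply_changes(cdc_rows, [], load_time)
```

No lookup against the target is needed to decide between insert and update, because the operation code already carries that information.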

Page 7: CDC

Near Real-Time Changes

Reduce Source-Target Latency

Running the SSIS package as a recurring background job can reduce the latency interval to the execution time of the complete CDC process.

For this demonstration, there is a single data flow, so a For Loop container can serve a similar purpose.

The data flow executes multiple times within the loop and captures any changes to the CDC table.
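The looping behaviour can be approximated in Python as shown below. This is a sketch under assumptions, not the package itself: it uses SQLite, treats a NULL `CDC_process_date` as "not yet processed", and the function name `process_pending` and the simplified `operation` column are hypothetical.

```python
import sqlite3


def process_pending(conn):
    """One pass of the loop: pick up unprocessed CDC rows and stamp them."""
    rows = conn.execute(
        "SELECT rowid, operation, EmployeePK, HourlyRate "
        "FROM EmployeeRatesNew_CDC "
        "WHERE CDC_process_date IS NULL ORDER BY rowid"
    ).fetchall()
    for rowid, *_change in rows:
        # The real data flow would load the change into the target table here.
        conn.execute(
            "UPDATE EmployeeRatesNew_CDC "
            "SET CDC_process_date = CURRENT_TIMESTAMP WHERE rowid = ?",
            (rowid,),
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EmployeeRatesNew_CDC ("
    "operation INTEGER, EmployeePK INTEGER, HourlyRate REAL, "
    "CDC_process_date TEXT)"
)

processed_per_pass = []
for iteration in range(3):          # stands in for the For Loop container
    if iteration < 2:               # simulate a new change arriving before passes 1 and 2
        conn.execute(
            "INSERT INTO EmployeeRatesNew_CDC VALUES (2, ?, 20.0, NULL)",
            (iteration + 1,),
        )
    processed_per_pass.append(process_pending(conn))
```

Each pass handles only the rows that arrived since the previous pass, so a pass with no new changes does no work.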

Page 8: CDC

Final Results

A second set of inserts and updates and the corresponding changes to the CDC and target tables, mere seconds later.