Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005: a research presentation for the SetFocus Business Intelligence Honors program
Implementing Change Data Capture
for a Slowly Changing Dimension
in SSIS 2005
Roderick Lee, 2010
Employee Rates Data Flow
The process must execute a Lookup on the target table for each incoming record to distinguish inserts from updates.
Also, without separate tracking data, every run reads the entire source table, so the count of incoming records equals the size of the source table.
Sample Multi-Purpose Data Flow for both Inserts and Updates
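The Lookup-based approach described on this slide can be sketched in a few lines of Python. This is only an illustration of the logic, not the SSIS package itself; the function name classify_by_lookup and the sample rows are hypothetical.

```python
def classify_by_lookup(source_rows, target_keys):
    """Split source rows into inserts and updates by checking each key against the target."""
    inserts, updates = [], []
    for row in source_rows:  # every source row is examined, changed or not
        if row["EmployeePK"] in target_keys:
            updates.append(row)
        else:
            inserts.append(row)
    return inserts, updates

# Hypothetical sample data: the whole source table is the input on every run.
source_rows = [
    {"EmployeePK": 1, "HourlyRate": 20.00},
    {"EmployeePK": 2, "HourlyRate": 25.00},
    {"EmployeePK": 3, "HourlyRate": 30.00},
]
target_keys = {1, 2}  # keys already present in the target table

inserts, updates = classify_by_lookup(source_rows, target_keys)
```

Note that rows 1 and 2 are routed to the update path even if their rates have not changed, which is the inefficiency CDC is meant to remove.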
Change Data Capture
image from Microsoft Books Online, 2008
Change Data Capture (CDC) is an automated operation that records transactional activity in the source table (inserts, updates, and deletes). This streamlines the ETL procedure because there is no need to compare all the data in the target table to identify changes. Also, it increases efficiency by limiting the source pool to already identified changes.
SQL Server 2008 has full CDC support and implements the capture process by writing transaction log activity into a set of specialized CDC tables. This is a new feature which did not exist in SQL Server 2005.
Even without automated transaction log tracking, there are other ways to build a capture process. This demonstration uses triggers to load the changes into a CDC change table that is similar in design to the SQL Server 2008 version.
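The presentation does not show its trigger code. The following is a minimal sketch of trigger-based capture, using Python's sqlite3 module and SQLite triggers as a stand-in for SQL Server 2005. The trigger names, the simplified change table (a plain operation column in place of CDC_$operation), and the operation codes (1 = delete, 2 = insert, 3 = update before image, 4 = update after image, following the SQL Server 2008 convention) are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmployeeRatesNew (
    EmployeePK INTEGER PRIMARY KEY,
    HourlyRate REAL
);
-- Simplified change table; 'operation' plays the role of CDC_$operation.
CREATE TABLE EmployeeRatesNew_CDC (
    operation   INTEGER,  -- 1 = delete, 2 = insert, 3 = before update, 4 = after update
    EmployeePK  INTEGER,
    HourlyRate  REAL,
    insert_date TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER trg_rates_insert AFTER INSERT ON EmployeeRatesNew
BEGIN
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (2, NEW.EmployeePK, NEW.HourlyRate);
END;
CREATE TRIGGER trg_rates_update AFTER UPDATE ON EmployeeRatesNew
BEGIN
    -- An update is recorded twice: the old values, then the new values.
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (3, OLD.EmployeePK, OLD.HourlyRate);
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (4, NEW.EmployeePK, NEW.HourlyRate);
END;
CREATE TRIGGER trg_rates_delete AFTER DELETE ON EmployeeRatesNew
BEGIN
    INSERT INTO EmployeeRatesNew_CDC (operation, EmployeePK, HourlyRate)
    VALUES (1, OLD.EmployeePK, OLD.HourlyRate);
END;
""")

# Exercise the triggers with one insert, one update, and one delete.
conn.execute("INSERT INTO EmployeeRatesNew (EmployeePK, HourlyRate) VALUES (1, 20.0)")
conn.execute("UPDATE EmployeeRatesNew SET HourlyRate = 22.5 WHERE EmployeePK = 1")
conn.execute("DELETE FROM EmployeeRatesNew WHERE EmployeePK = 1")

changes = conn.execute(
    "SELECT operation, EmployeePK, HourlyRate FROM EmployeeRatesNew_CDC ORDER BY rowid"
).fetchall()
```

In SQL Server the same idea would use the inserted and deleted pseudo-tables inside an AFTER trigger rather than SQLite's NEW and OLD row references.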
Tables

EmployeeRates (Original Target Table)
    EmployeeRatePK    int identity    PK
    EmployeePK        int
    HourlyRate        decimal(18,2)
    EffectiveDate     datetime

EmployeeRates2 (Adapted for SCD Type 2)
    EmployeeRatePK    int identity    PK
    EmployeePK        int
    HourlyRate        decimal(18,2)
    StartDate         datetime
    EndDate           datetime

EmployeeRatesNew_CDC (CDC Table)
    CDC_$start_lsn      int
    CDC_$end_lsn        int
    CDC_$seqval         smallint
    CDC_$operation      tinyint
    CDC_$update_mask    bit
    EmployeePK          int
    HourlyRate          decimal(18,2)
    insert_date         datetime
    CDC_process_date    datetime

EmployeeRatesNew (Source Table)
    EmployeePK    int    PK
    HourlyRate    decimal(18,2)
The first five CDC columns mirror the SQL Server 2008 change table architecture.
• “lsn” = log sequence number
• The update mask column is a bit mask, one bit per original source column
• Insert and process dates track CDC progress because there are no actual log sequence numbers
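To illustrate the "one bit per source column" idea behind the update mask, here is a small Python sketch. The function names and the bit ordering (first source column in the lowest bit) are assumptions made for illustration; the presentation does not show how its mask is populated.

```python
# Source columns in table order; position 0 maps to the lowest bit (an assumption).
SOURCE_COLUMNS = ["EmployeePK", "HourlyRate"]

def build_update_mask(old_row, new_row):
    """Set one bit for every source column whose value changed."""
    mask = 0
    for position, column in enumerate(SOURCE_COLUMNS):
        if old_row[column] != new_row[column]:
            mask |= 1 << position
    return mask

def changed_columns(mask):
    """Decode a mask back into the list of changed column names."""
    return [
        column
        for position, column in enumerate(SOURCE_COLUMNS)
        if mask & (1 << position)
    ]

before = {"EmployeePK": 1, "HourlyRate": 20.0}
after = {"EmployeePK": 1, "HourlyRate": 22.5}
mask = build_update_mask(before, after)
```

A consumer can then test a single bit to decide whether a particular column needs to be propagated to the target.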
CDC Test Inserts and Updates
Result set in the CDC table tracking the changes. Note that each update creates two records.
Test script with inserts, updates, and deletes
SCD Data Flow
CDC for Slowly Changing Dimension
The SCD transform determines insert or update without the need for a Lookup transform. The conditional split is based on the CDC_$operation column.
Note that the source table for this data flow is the CDC table.
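The routing on the operation column can be sketched as follows. This is a plain-Python approximation of the split plus a Type 2 apply step, not the SSIS components themselves. The operation codes (2 = insert, 4 = update after image), the function name apply_changes, and the choice to skip before images and deletes are assumptions of this sketch.

```python
from datetime import datetime

INSERT, UPDATE_AFTER = 2, 4  # assumed codes, following the SQL Server 2008 convention

def apply_changes(cdc_rows, target, now):
    """Apply CDC rows to a Type 2 target: expire the current version, then add a new one."""
    for row in cdc_rows:
        operation = row["operation"]
        if operation == UPDATE_AFTER:
            # Close the currently open version for this employee.
            for version in target:
                if version["EmployeePK"] == row["EmployeePK"] and version["EndDate"] is None:
                    version["EndDate"] = now
        if operation in (INSERT, UPDATE_AFTER):
            target.append({
                "EmployeePK": row["EmployeePK"],
                "HourlyRate": row["HourlyRate"],
                "StartDate": now,
                "EndDate": None,  # open-ended: this is the current version
            })
        # Other operations (before images, deletes) are skipped in this sketch.
    return target

now = datetime(2010, 1, 1, 12, 0, 0)  # fixed timestamp for a repeatable example
cdc_rows = [
    {"operation": 2, "EmployeePK": 1, "HourlyRate": 20.0},
    {"operation": 3, "EmployeePK": 1, "HourlyRate": 20.0},
    {"operation": 4, "EmployeePK": 1, "HourlyRate": 22.5},
]
target = apply_changes(cdc_rows, [], now)
```

Because the operation code already says whether a row is an insert or an update, no key lookup is needed to classify the incoming rows.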
Near Real-Time Changes
Reduce Source-Target Latency
Running the SSIS package as a recurring background job can reduce the latency interval to the execution time of the complete CDC process.
For this demonstration, there is a single data flow, so a For Loop container can serve a similar purpose.
The data flow executes multiple times within the loop and captures any changes to the CDC table.
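The looping behavior can be approximated in Python as below. This is a sketch under assumptions: it uses CDC_process_date being empty as the marker for unprocessed rows, which is one plausible use of that column, and the names process_pending and run_loop are hypothetical.

```python
from datetime import datetime

def process_pending(cdc_rows, handle, now):
    """Handle every change row that has not yet been stamped with a process date."""
    pending = [row for row in cdc_rows if row["CDC_process_date"] is None]
    for row in pending:
        handle(row)
        row["CDC_process_date"] = now  # mark as processed so later passes skip it
    return len(pending)

def run_loop(cdc_rows, handle, iterations, clock=datetime.now):
    """Stand-in for the For Loop container: run the data flow a fixed number of times."""
    total = 0
    for _ in range(iterations):
        total += process_pending(cdc_rows, handle, clock())
    return total

cdc_rows = [
    {"EmployeePK": 1, "HourlyRate": 20.0, "CDC_process_date": None},
    {"EmployeePK": 2, "HourlyRate": 25.0, "CDC_process_date": None},
]
seen = []
first_pass = run_loop(cdc_rows, seen.append, iterations=3)

# A new change arrives later; the next run picks up only that row.
cdc_rows.append({"EmployeePK": 3, "HourlyRate": 30.0, "CDC_process_date": None})
second_pass = run_loop(cdc_rows, seen.append, iterations=3)
```

A real loop would typically pause between iterations; the delay is left out here to keep the example short.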
Final Results
A second set of inserts and updates and the corresponding changes to the CDC and target tables, mere seconds later.