All About Surrogate Key



Surrogate Key in Data Warehouse: What, When, Why and Why Not

Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each and every record in a dimension table in a data warehouse. It joins the fact and dimension tables and is necessary to handle changes in dimension table attributes.

What Is a Surrogate Key

A Surrogate Key (SK) is a sequentially generated, meaningless, unique number attached to each and every record in a table in a Data Warehouse (DW).

It is UNIQUE since it is a sequentially generated integer for each record being inserted in the table.

It is MEANINGLESS since it does not carry any business meaning regarding the record it is attached to in any table.

It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with one and going up to the highest number that is needed.
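As a minimal illustration (hypothetical table and column names), a dimension table typically carries both the SK and the NK:

CREATE TABLE customer_dim (
    customer_key  NUMBER(10)    NOT NULL,  -- surrogate key: sequential, meaningless integer
    customer_id   VARCHAR2(30)  NOT NULL,  -- natural key carried over from the source system
    customer_name VARCHAR2(100),
    CONSTRAINT customer_dim_pk PRIMARY KEY (customer_key)
);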

    Surrogate Key Pipeline and Fact Table

During the FACT table load, the different dimensional attributes are looked up in the corresponding dimensions and the SKs are fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally, the FACT table in the DW contains the factual data along with the corresponding SKs from the dimension tables.


    The below diagram shows how the FACT table is loaded from the source.
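Since the original diagram is not reproduced here, the following SQL sketch shows the same idea (hypothetical table and column names; current_flag is an assumed indicator of the most recent dimension version):

-- During the fact load, each natural key is looked up in its dimension and the
-- surrogate key of the current dimension record is written into the fact table.
INSERT INTO sales_fact (customer_key, product_key, sales_amount)
SELECT c.customer_key,
       p.product_key,
       s.sales_amount
FROM   stg_sales s
JOIN   customer_dim c ON c.customer_id = s.customer_id AND c.current_flag = 'Y'
JOIN   product_dim  p ON p.product_id  = s.product_id  AND p.current_flag = 'Y';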

Why Should We Use a Surrogate Key

Basically, it is an artificial key that is used as a substitute for a Natural Key (NK). We should have an NK defined in our tables as per the business requirement, and that NK might be able to uniquely identify any record. An SK, however, is just an integer attached to a record for the purpose of joining different tables in a Star or Snowflake schema based DW. An SK is much needed when we have a very long NK or when the datatype of the NK is not suitable for indexing.

    The below image shows a typical Star Schema, joining different Dimensions with the Fact using

    SKs.


Ralph Kimball emphasizes the abstraction of the NK. As per him, Surrogate Keys should NOT be:

Smart, where you can tell something about the record just by looking at the key.

Composed of natural keys glued together.

Implemented as multiple parallel joins between the dimension table and the fact table; so-called double or triple barreled joins.

As per Thomas Kejser, a good key is a column that has the following properties:

It is forced to be unique.
It is small.
It is an integer.
Once assigned to a row, it never changes.
Even if deleted, it will never be re-used to refer to a new row.
It is a single column.
It is stupid.
It is not intended to be remembered by users.

If the above-mentioned features are taken into account, an SK is a great candidate for a Good Key in a DW.

Apart from these, a few more reasons for choosing the SK approach are:


If we replace the NK with a single integer, it should save a substantial amount of storage space. The SKs of the different Dimensions are stored as Foreign Keys (FK) in the Fact tables to maintain Referential Integrity (RI), and storing concise SKs instead of big or huge NKs needs less space. The UNIQUE index built on the SK will also take less space than a UNIQUE index built on the NK, which may be alphanumeric (see the sketch after this list).

Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns works faster. So, it provides an extra edge in ETL performance by speeding up data retrieval and lookups.

An advantage of a four-byte integer key is that it can represent more than 2 billion different values, which is enough for any dimension; the SK will not run out of values, not even for a Big or Monster Dimension.

An SK is usually independent of the data contained in the record; we cannot understand anything about the data in a record simply by seeing only the SK. Hence it provides Data Abstraction.
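To make the referential-integrity point concrete, here is a minimal sketch of the sales_fact table from the earlier load example, storing the dimensions' SKs as foreign keys (hypothetical names; product_dim is assumed to have product_key as its primary key):

CREATE TABLE sales_fact (
    customer_key  NUMBER(10)   NOT NULL REFERENCES customer_dim (customer_key),  -- SK of the customer dimension
    product_key   NUMBER(10)   NOT NULL REFERENCES product_dim (product_key),    -- SK of the product dimension
    sales_amount  NUMBER(12,2)
);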

So, apart from abstracting the critical business data carried in the NK, we also get the advantage of reduced storage space when we implement SKs in our DW. It has become a Standard Practice to associate an SK with a table in a DW irrespective of whether it is a Dimension, Fact, Bridge or Aggregate table.

Why Shouldn't We Use a Surrogate Key

There are a number of disadvantages as well while working with SKs. Let's see them one by one:

The values of SKs have no relationship with the real-world meaning of the data held in a row. Therefore, overuse of SKs leads to the problem of disassociation.


The generation and attachment of SKs creates an unnecessary ETL burden. Sometimes it may be found that the actual piece of code is short and simple, but generating the SK and carrying it forward to the target adds extra overhead to the code.

During Horizontal Data Integration (DI), where multiple source systems load data into a single Dimension, we have to maintain a single SK Generating Area to enforce the uniqueness of the SK. This may come as an extra overhead on the ETL.

Even query optimization becomes difficult: since the SK takes the place of the PK, the unique index is applied on that column, and any query based on the NK leads to a Full Table Scan (FTS) because it cannot take advantage of the unique index on the SK (see the sketch after this list).

Replication of data from one environment to another, i.e. Data Migration, becomes difficult. Since the SKs from the different Dimension tables are used as FKs in the Fact table and SKs are DW specific, any mismatch in the SK for a particular Dimension would result in no data or erroneous data when we join them in a Star Schema.

If duplicate records come from the source, there is a potential risk of duplicates being loaded into the target, since the Unique Constraint is defined on the SK and not on the NK.

The crux of the matter is that an SK should not be implemented just in the name of standardizing your code. An SK is required when we cannot use an NK to uniquely identify a record, or when using an SK seems more suitable because the NK is not a good fit for a PK.


Surrogate Key Generation Approaches Using Informatica PowerCenter

A Surrogate Key is a sequentially generated unique number attached to each and every record in a Dimension table in a Data Warehouse. We discussed the Surrogate Key in detail in the previous article. Here we will concentrate on different approaches to generating Surrogate Keys for different types of ETL processes.

    Surrogate Key for Dimensions Loading in Parallel

When you have a single dimension table loading in parallel from different application data sources, special care should be taken to make sure that no keys are duplicated. Let's see the different design options here.

    1. Using Sequence Generator Transformation

This is the simplest and most preferred way to generate a Surrogate Key (SK). We create a reusable Sequence Generator transformation in the mapping and map the NEXTVAL port to the SK field in the target table in the INSERT flow of the mapping. The start value is usually kept at 1 and incremented by 1.

Below shown is a reusable Sequence Generator transformation.


The NEXTVAL port from the Sequence Generator can be mapped to the surrogate key in the target table. Below shown is the sequence generator transformation.

Note: Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings which load the same dimension table.

    2. Using Database Sequence

We can create a SEQUENCE in the database and use it to generate the SKs for any table. It can be invoked via a SQL Transformation or a Stored Procedure Transformation.

    First we create a SEQUENCE using the following command.

    CREATE SEQUENCE DW.Customer_SK

    MINVALUE 1

    MAXVALUE 99999999

    START WITH 1

    INCREMENT BY 1;

    Using SQL Transformation

You can create a reusable SQL Transformation as shown below. It takes the name of the database sequence and the schema name as input and returns SK numbers.


The schema name (DW) and the sequence name (Customer_SK) can be passed in as input values to the transformation and the output can be mapped to the target SK column. Below shown is the SQL transformation image.
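Behind the scenes, such a transformation effectively issues a query like the following against the database, a sketch using the schema and sequence created above:

SELECT DW.Customer_SK.NEXTVAL FROM DUAL;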

    Using Stored Procedure Transformation

We use the SEQUENCE DW.Customer_SK to generate the SKs in an Oracle function, which is in turn called via a Stored Procedure Transformation.


    Create a database function as below. Here we are creating an Oracle function.

    CREATE OR REPLACE FUNCTION DW.Customer_SK_Func

    RETURN NUMBER

    IS

    Out_SK NUMBER;

    BEGIN

    SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;

    RETURN Out_SK;

    EXCEPTION

    WHEN OTHERS THEN

raise_application_error(-20001, 'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);

    END;
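As an optional sanity check (not part of the mapping itself), the function can be exercised directly from any SQL client:

-- Each call should return the next value from DW.Customer_SK.
SELECT DW.Customer_SK_Func FROM DUAL;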

You can import the database function as a Stored Procedure transformation as shown in the image below.

Now, just before the target instance in the Insert flow, we add an Expression transformation. There we add an output port with the following formula. This output port, GET_SK, can be connected to the target surrogate key column.

GET_SK = :SP.CUSTOMER_SK_FUNC()


Note: The database function can be parameterized, and the stored procedure can also be made reusable, to make this approach more effective.

Surrogate Key for Non-Parallel Loading Dimensions

If the dimension table is not loaded in parallel from different application data sources, we have a couple more options for generating SKs. Let's see the different design options here.

    Using Dynamic LookUP

When we implement a Dynamic LookUP in any mapping, we may not even need to use the Sequence Generator for generating the SK values.

For a Dynamic LookUP on the Target, we have the option of associating any LookUP port with an input port, an output port, or a Sequence-ID. When we associate a Sequence-ID, the Integration Service

generates a unique integer value for each row inserted into the lookup cache; this is applicable for ports with the Bigint, Integer or Small Integer data type. Since an SK is usually of Integer type, we can exploit this advantage.

    The Integration Service uses the following process to generate Sequence IDs.

When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.

When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.

When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.

    Above shown is a dynamic lookup configuration to generate SK for CUST_SK.

    The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any

    records which are already present in the Target, it gets the SK value from the Target


Dynamic LookUP cache, based on the Associated Ports matching. So, if we take this port and connect it to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the Dynamic LookUP.

The disadvantage of this technique lies in the fact that we don't have any separate SK Generating Area and the source of the SK is totally embedded in the code.

    Using Expression Transformation

Suppose we are populating CUSTOMER_DIM. In the Mapping, first create an Unconnected Lookup on the dimension table, say LKP_CUSTOMER_DIM. The purpose is to get the maximum SK value in the dimension table. Say the SK column is CUSTOMER_KEY and the NK column is CUSTOMER_ID.

Select CUSTOMER_KEY as the Return Port and the Lookup Condition as

CUSTOMER_ID = IN_CUSTOMER_ID

Use the SQL Override as below:

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY, '1' AS CUSTOMER_ID FROM CUSTOMER_DIM

Next, in the mapping after the Source Qualifier (SQ), use an Expression transformation. Here we will actually generate the SKs for the Dimension based on the previous value generated. We create the following ports in the EXP to compute the SK value:

VAR_COUNTER = IIF(ISNULL(VAR_INC), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1)

VAR_INC = VAR_COUNTER


    OUT_COUNTER = VAR_COUNTER

When the mapping starts, for the first row we look up the Dimension table to fetch the maximum available SK in the table. Then we keep incrementing the SK value stored in the variable port by 1 for each incoming row. Here OUT_COUNTER gives the SKs to be populated in CUSTOMER_KEY.
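The ports above are Informatica expression logic, but the same idea can be sketched in set-based SQL; this is not the article's approach, just an illustration using a hypothetical staging table stg_customer: seed from the current MAX(SK) and number the incoming rows consecutively.

-- Sketch: start from the current maximum surrogate key and hand out consecutive numbers to new rows.
INSERT INTO customer_dim (customer_key, customer_id, customer_name)
SELECT (SELECT NVL(MAX(customer_key), 0) FROM customer_dim)
         + ROW_NUMBER() OVER (ORDER BY s.customer_id) AS customer_key,
       s.customer_id,
       s.customer_name
FROM   stg_customer s;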

    Using Mapping & Workflow Variable

Here again we will use an Expression transformation to compute the next SK, but we will get the MAX available SK in a different way.

Suppose we have a session s_New_Customer, which loads the Customer Dimension table. Before that session in the Workflow, we add a dummy session, s_Dummy. In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK, which will be set to the value of MAX(SK) in the Customer Dimension table:

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY FROM CUSTOMER_DIM

We will have CUSTOMER_DIM as our source table, and the target can be a simple flat file which will not be used anywhere. We pull this MAX(SK) from the SQ and then, in an EXP, we assign this value to the mapping variable using the SETVARIABLE function. So, we will have the following ports in the EXP:

INP_CUSTOMER_KEY = INP_CUSTOMER_KEY - the MAX of SK coming from the Customer Dimension table.


OUT_MAX_SK = SETVARIABLE ($$MAX_CUST_SK, INP_CUSTOMER_KEY) - Output Port

    This output port will be connected to the flat file port, but the value we assigned to the variable will

    persist in the repository.

In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the variable value from one session to the other?

Here the use of a Workflow Variable comes into the picture. We define a WF variable, $$MAX_SK, and in the Post-session on success variable assignment section of s_Dummy, we assign the value of $$MAX_CUST_SK to $$MAX_SK. Now the variable $$MAX_SK contains the maximum available SK value from the CUSTOMER_DIM table. Next we define another mapping variable in the session s_New_Customer, $$START_VALUE, and this is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.

So, the sequence is:

Post-session on success variable assignment of the first session: $$MAX_SK = $$MAX_CUST_SK

Pre-session variable assignment of the second session: $$START_VALUE = $$MAX_SK

Now, in the actual mapping, we add an EXP with the following ports to compute the SKs one by one for each record being loaded into the target:

VAR_COUNTER = IIF (ISNULL (VAR_INC), $$START_VALUE + 1, VAR_INC + 1)

VAR_INC = VAR_COUNTER

OUT_COUNTER = VAR_COUNTER

OUT_COUNTER will be connected to the SK port of the target.