Data Warehouse Project Implementation



This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor.

This site is divided into six main areas.

    - Tools: The selection of business intelligence tools and the selection of the data warehousing team.

    Tools covered are:

    Database, Hardware

    ETL (Extraction, Transformation, and Loading)

    OLAP

    Reporting

    Metadata

- Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.

- Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing.

- Concepts: This section discusses several concepts particular to the data warehousing field. Topics include:

    Dimensional Data Model

    Star Schema

    Snowflake Schema

    Slowly Changing Dimension

    Conceptual Data Model

    Logical Data Model

    Physical Data Model

    Conceptual, Logical, and Physical Data Model

    Data Integrity

    What is OLAP


    MOLAP, ROLAP, and HOLAP

    Bill Inmon vs. Ralph Kimball

- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry.

    - Glossary: A glossary of common data warehousing terms.

    ********************

As the old Chinese saying goes, "To accomplish a goal, make sure the proper tools are selected." This is especially true when the goal is to achieve business intelligence. Given the complexity of the data warehousing system and the cross-departmental implications of the project, it is easy to see why the proper selection of business intelligence software and personnel is very important. This section will talk about such selections. They are grouped into the following:

    General Considerations

    Database/Hardware

    ETL Tools

    OLAP Tools

    Reporting Tools

    Metadata Tools

    Data Warehouse Team Personnel

Please note that this site is vendor neutral. Some business intelligence vendor names will be mentioned, but this should not be considered an endorsement from this site.

    Buy vs. Build

The only choices here are what type of hardware and database to purchase, as there is basically no way that one can build hardware/database systems from scratch.

    Database/Hardware Selections

In making a selection for the database/hardware platform, there are several items that need to be carefully considered:

Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it's mature, and base any testing numbers from there.
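As a purely hypothetical sizing illustration: a warehouse loading 1 million fact rows per day, at roughly 200 bytes per row and kept for three years, works out to about 1,000,000 x 200 bytes x 1,095 days, or roughly 220 GB of raw fact data, before indexes and aggregate tables are added.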


Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can perform a part of the task, all at the same time. When I first started working with massively parallel computers in 1993, I had thought that it would be the best way for any large computations to be done within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had originally thought.

RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that are hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent.

True Case: One of the projects I worked on paired a major RDBMS with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about bugs not being fixed because the support level for the particular type of hardware the client had chosen was Level 3, which basically meant that no one in the RDBMS support organization would fix any bug particular to that hardware platform.

    Popular Relational Databases

    Oracle

    Microsoft SQL Server

    IBM DB2

    Teradata

    Sybase

    MySQL

    Popular OS Platforms

    Linux

    FreeBSD

Microsoft Windows

    Buy vs. Build

When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:

Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.

Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.

Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.

    ETL Tool Functionalities

While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it is not a must. When you evaluate ETL tools, it pays to look for the following characteristics:

Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.

Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.

Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.

    Popular Tools

    IBM WebSphere Information Integration (Ascential DataStage)

    Ab Initio

    Informatica

    Talend

    Buy vs. Build

OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch.

    OLAP Tool Functionalities

Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the results quickly because all data is already pre-aggregated within the cube.

2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
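To make the "smart SQL generator" idea concrete, here is a rough sketch of the kind of SQL a ROLAP engine might emit for a "total sales by region by year" request; the star schema table and column names are invented for illustration:

-- Hypothetical SQL a ROLAP engine might generate against a star schema
-- for a "total sales by region by year" report.
SELECT d_region.region_name,
       d_date.year,
       SUM(f_sales.sales_amount) AS total_sales
FROM   f_sales
JOIN   d_region ON f_sales.region_key = d_region.region_key
JOIN   d_date   ON f_sales.date_key   = d_date.date_key
GROUP BY d_region.region_name, d_date.year;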

Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detailed level of information, levels that the traditional cubes do not reach for performance and size reasons.

    So what are the criteria for evaluating OLAP vendors? Here they are:

Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help load the data into the cubes as quickly as possible.

Performance: In addition to leveraging parallelism, the tool itself should be quick, both in terms of loading the data into the cube and reading the data from the cube.

Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. This is because in many cases, especially for ROLAP implementations, OLAP tools often can be used as a reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.

Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.

Metadata support: Because an OLAP tool aggregates the data into the cube and sometimes serves as the front-end tool, it is essential that it works with the metadata strategy/tool you have selected.

    Popular Tools

    Business Objects

    Cognos

    Hyperion


    Microsoft Analysis Services

    MicroStrategy

    Pentaho

    Palo OLAP Server

The OLAP Tool Market Share page shows the market share of the vendors listed above.

    Buy vs. Build

There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following:

Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but also because they already have report management systems to make maintenance and support functions easier.

Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.

Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.

    Reporting Tool Functionalities

Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.

Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:

    Data source connection capabilities

In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, while connecting to the other type of data source becomes a difficult exercise in programming.


    Scheduling and distribution capabilities

In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say the sales numbers), and that's how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features.

Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.

Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.

    Customization

Every one of us has experienced the frustration of spending an inordinate amount of time tinkering with some office productivity tool just to make the report/presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts' jobs much easier, and the time savings are tremendous.

    Export capabilities

The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver.

    Integration with the Microsoft Office environment

Most people are used to working with Microsoft Office products, especially Excel, for manipulating data. Previously, people would export the reports into Excel, and then perform additional formatting / calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience for the users.


    Popular Tools

    Business Objects (Crystal Reports)

    Cognos

    Actuate

    Buy vs. Build

Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because doing so requires resources that are intimately familiar with the operational, technical, and business aspects of the data warehouse system, and such resources are difficult to come by. Even when such resources are available, there are often other tasks that can provide more value to the organization than building a metadata tool from scratch.

In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays an extremely important role in a successful data warehousing implementation, this does not always mean that a tool is needed to keep all the "data about data." It is possible to, say, keep such information in the repository of other tools used, in text documentation, or even in a presentation or a spreadsheet.

Having said the above, though, it is the author's belief that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, a plan for how metadata in the data warehousing system will be stored.

    Metadata Tool Functionalities

This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to call this a selection of the metadata strategy. Traditionally, people have put the data modeling information into a tool such as ERwin or Oracle Designer, but it is difficult to extract information out of such data modeling tools. For example, one of the goals for your metadata selection is to provide information to the end users. Clearly this is a difficult task with a data modeling tool.

So typically what is likely to happen is that additional efforts are spent to create a layer of metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at mean, it is clearly inefficient, because all that information already resides somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP tool, or the reporting tool.

There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen to the data warehousing industry. As of right now, though, the author has not really seen that many tools leveraging this standard, so clearly it has not quite caught on yet.


So what does this mean for your metadata efforts? In the absence of everything else, I would recommend that whatever tool you choose for your metadata support supports XML, and that whatever other tool needs to leverage the metadata also supports XML. Then it is a matter of defining your DTD across your data warehousing system. At the same time, there is no need to worry about criteria that are typically important for the other tools, such as performance and support for parallelism, because the size of the metadata is typically small relative to the size of the data warehouse.

    Data Warehouse Team Personnel Selection

There are two areas of discussion: First is whether to use external consultants or hire permanent employees. The second is what type of personnel is recommended for a data warehousing project.

    The pros of hiring external consultants are:

1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.

    The pros of hiring permanent employees are:

1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.

2. They are less likely to leave. With consultants, whether they are on contract, via a Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.

    The following roles are typical for a data warehouse project:

Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project.

DBA: This role is responsible for keeping the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.

Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations.

ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse.

Front End Developer: This person is responsible for developing the front-end, whether it be client-server or over the web.


    OLAP Developer: This role is responsible for the development of OLAP cubes.

Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system.

Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and modeling it into a schema that is suitable for OLAP analysis.

QA Group: This role is responsible for ensuring the correctness of the data in the data warehouse. This role is more important than it appears, because bad data quality turns away users more than any other reason, and is often the start of the downfall for the data warehousing project.

The above list describes roles, and one person does not necessarily correspond to only one role. In fact, it is very common in a data warehousing team for a person to take on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.

    Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle:

    Requirement Gathering

    Physical Environment Setup

    Data Modeling

    ETL

    OLAP Cube Design

    Front End Development

    Report Development

    Performance Tuning

    Query Optimization

    Quality Assurance

    Rolling out to Production

    Production Maintenance

    Incremental Enhancements


Each page listed above represents a typical data warehouse design phase, and has several sections:

Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.

Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.

Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.

Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them not so obvious. All of them are real.

The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.

    Requirement Gathering

    Task Description

The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and, most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

    Time Requirement

    2 - 8 weeks.


    Deliverables

    A list of reports / cubes to be delivered to the end users by the end of this current phase.

An updated project plan that clearly identifies resource loads and milestone delivery dates.

    Possible Pitfalls

This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or fails to start in the direction originally defined.

When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.

    Physical Environment Setup

    Task Description

Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production.

It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.

It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers.

    Having different environments is very important for the following reasons:

    All changes can be tested and QA'd first without affecting the production environment.

    Development and QA can occur during the time users are accessing the data warehouse.

When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.

    Time Requirement

    Getting the servers and databases ready should take less than 1 week.


    Deliverables

Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software.

    Possible Pitfalls

To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments. Environment separation is achieved by either a directory structure or setting up distinct instances of the database. This is problematic for the following reasons:

1. Sometimes the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this.

2. There may be interference when having different database environments on a single box. For example, having multiple long queries running on the development database could affect the performance on the production database.

    Data Modeling

    Task Description

This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance.

In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.
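As a minimal sketch of what the resulting physical model can look like (the table and column names below are invented for illustration, not taken from any particular project):

-- Hypothetical physical model for a minimal sales star schema:
-- one fact table surrounded by two dimension tables.
CREATE TABLE d_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month         INTEGER,
    year          INTEGER
);

CREATE TABLE d_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category      VARCHAR(50)
);

CREATE TABLE f_sales (
    date_key      INTEGER REFERENCES d_date (date_key),
    product_key   INTEGER REFERENCES d_product (product_key),
    sales_amount  DECIMAL(12,2),
    units_sold    INTEGER
);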

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it will become a much tougher and more complex process.

    Time Requirement

    2 - 6 weeks.

    Deliverables

    Identification of data sources.

    Logical data model.

    Physical data model.


    Possible Pitfalls

It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.

    ETL

    Task Description

The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.
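As a minimal sketch of what a hand-built ETL step can look like in plain SQL (the staging and warehouse table names here are hypothetical), a load routine might cleanse and transform staged source rows before inserting them into the fact table:

-- Hypothetical ETL step: transform staged source rows and load them
-- into the warehouse fact table, applying simple cleansing rules.
INSERT INTO f_sales (date_key, product_key, sales_amount, units_sold)
SELECT d.date_key,
       p.product_key,
       s.amount,
       s.units
FROM   stg_sales s
JOIN   d_date    d ON d.calendar_date = s.sale_date
JOIN   d_product p ON p.product_name  = TRIM(s.product_name)
WHERE  s.amount IS NOT NULL   -- drop rows with missing amounts
  AND  s.units  > 0;          -- drop obviously bad unit counts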

    Time Requirement

    1 - 6 weeks.

    Deliverables

    Data Mapping Document

    ETL Script / ETL Package in the ETL tool

    Possible Pitfalls

There is a tendency to give this particular phase too little development time. This can prove suicidal to the project, because end users will usually tolerate less formatting, longer time to run reports, less functionality (slicing and dicing), or fewer delivered reports; the one thing that they will not tolerate is wrong information.

A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.

    OLAP Cube Design

    Task Description

Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.

    Time Requirement

    1 - 2 weeks.

    Deliverables

    Documentation specifying the OLAP cube dimensions and measures.

    Actual OLAP cube / report.

    Possible Pitfalls

Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

    Front End Development

    Task Description

Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports be delivered over the web, so that the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department doing program installations on end users' desktops just so that they can view reports. So, whatever strategy one pursues, treat the ability to deliver over the web as a must.

The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-level products such as Actuate. In addition, many OLAP vendors offer a front-end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially the possible changes to the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft Windows 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?

Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?

    Time Requirement


    1 - 4 weeks.

    Deliverables

    Front End Deployment Documentation

    Possible Pitfalls

Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

    Report Development

    Task Description

Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing the report.

User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take these factors into consideration so that users can get the information they need in the shortest amount of time possible.

Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a Flash file. Such a Flash file essentially acts as a mini-cube, and would allow end users to slice and dice the data on the report without having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has what access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director.
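One common way to enforce such row-level restrictions in the database itself, rather than in each report definition, is a view keyed to the logged-in user; the table and column names here are hypothetical:

-- Hypothetical row-level security: each user sees only the fact rows
-- for the districts mapped to his or her database login.
CREATE VIEW v_sales_restricted AS
SELECT f.*
FROM   f_sales f
JOIN   user_district_map m
       ON m.district_key = f.district_key
WHERE  m.login_name = CURRENT_USER;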

Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report into the front end. There is no need to wait for a major production push before making new reports available.


2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.

    Time Requirement

    1 - 2 weeks.

    Deliverables

    Report Specification Documentation.

    Reports set up in the front end / reports delivered to user's preferred channel.

    Possible Pitfalls

Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

    Performance Tuning

    Task Description

There are three major areas where a data warehousing system can use a little performance tuning:

ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system that has its ETL process finishing right on time is going to have a lot of problems, simply because often the jobs do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.

Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.

Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.

    Time Requirement


    3 - 5 days.

    Deliverables

    Performance tuning document - Goal and Result

    Possible Pitfalls

Make sure the development environment mimics the production environment as much as possible - performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.

    Query Performance

For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.

    First, we offer some guiding principles for query optimization:

    1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer, and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use the "EXPLAIN [SQL Query]" keyword to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
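For example, against a hypothetical fact table, the two commands mentioned above look like this (in Oracle, a second statement is needed to display the populated plan):

-- MySQL: show the query plan.
EXPLAIN SELECT * FROM f_sales WHERE date_key = 20190729;

-- Oracle: populate the plan table, then display it.
EXPLAIN PLAN FOR SELECT * FROM f_sales WHERE date_key = 20190729;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);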

    2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to expend to process and store this data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
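A minimal illustration against a hypothetical table:

-- Wasteful: pulls every column even though only one is needed.
SELECT * FROM f_sales WHERE date_key = 20190729;

-- Better: retrieve only the column the report actually uses.
SELECT sales_amount FROM f_sales WHERE date_key = 20190729;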

    3. Store intermediate results

Sometimes the logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. For those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows.

The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.
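As a sketch of this technique (hypothetical names; the exact temporary-table syntax varies by database):

-- Stage the intermediate result in a temporary table instead of
-- repeating a large subquery inline.
CREATE TEMPORARY TABLE tmp_daily_sales AS
SELECT date_key, SUM(sales_amount) AS daily_total
FROM   f_sales
GROUP BY date_key;

-- Optionally index the temporary table to speed up the final join.
CREATE INDEX idx_tmp_daily_sales ON tmp_daily_sales (date_key);

-- The original complex statement becomes a simple join.
SELECT d.year, SUM(t.daily_total) AS yearly_total
FROM   tmp_daily_sales t
JOIN   d_date d ON d.date_key = t.date_key
GROUP BY d.year;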


Below are several specific query optimization strategies; a short SQL sketch illustrating two of them follows the list.

    Use Index

Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed separately.

Aggregate Table

Pre-populate tables at higher levels so that less data needs to be parsed.

    Vertical Partitioning

Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.

    Horizontal Partitioning

Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.

    Denormalization

The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.

    Server Tuning

Each server has its own parameters, and often tuning server parameters so that the server can fully take advantage of the hardware resources can significantly speed up query performance.
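Here is a brief sketch of two of the strategies above, an aggregate table and an index, using the same hypothetical star schema names as earlier:

-- Aggregate table: pre-populate monthly totals so month-level reports
-- do not have to scan the daily-grain fact table.
CREATE TABLE agg_sales_monthly AS
SELECT d.year, d.month, f.product_key,
       SUM(f.sales_amount) AS sales_amount
FROM   f_sales f
JOIN   d_date d ON d.date_key = f.date_key
GROUP BY d.year, d.month, f.product_key;

-- Index: support the most common filter on the fact table.
CREATE INDEX idx_f_sales_date ON f_sales (date_key);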

    Quality Assurance

    Task Description

Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

    Time Requirement

    1 - 4 weeks.

    Deliverables

    QA Test Plan


    QA verification that the data warehousing system is ready to go to production

    Possible Pitfalls

As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.

    Rollout To Production

    Task Description

Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going to production sometimes as easy as sending out a URL via email.

    Time Requirement

    1 - 3 days.

    Deliverables

    Delivery of the data warehousing system to the end users.

    Possible Pitfalls

Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

    Production Maintenance

    Task Description

Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.

    Time Requirement

    Ongoing.


    Deliverables

    Consistent availability of the data warehousing system to the end users.

    Possible Pitfalls

    Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.

    Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.

    Incremental Enhancements

    Task Description

    Once the data warehousing system goes live, there is often a need for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10. A sketch of such a change is shown below.
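
    Assuming hypothetical dim_sales_region and dim_store lookup tables, the change itself might look like the following (which, as the pitfall below stresses, should still travel through development and QA first):

        -- Add two of the six new regions (IDs and names are made up)
        INSERT INTO dim_sales_region (region_id, region_name) VALUES (5, 'Northwest');
        INSERT INTO dim_sales_region (region_id, region_name) VALUES (6, 'Southwest');

        -- Reassign the affected stores to their new regions
        UPDATE dim_store SET region_id = 5 WHERE state IN ('WA', 'OR');
        UPDATE dim_store SET region_id = 6 WHERE state IN ('AZ', 'NM');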

    Deliverables

    Change management documentation

    Actual change to the data warehousing system

    Possible Pitfalls

    Because a lot of the time the changes are simple to make, it is very tempting to just go ahead and make the change directly in production. This is a definite no-no: many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> production be followed, regardless of how simple the change may seem.

    Observations

    This section lists the trends I have seen based on my experience in the data warehousing field:

    Quick implementation time

    Lack of collaboration with data mining efforts

    Industry consolidation

    How to measure success

    Recipes for data warehousing project failure

    Quick Implementation Time

    If you add up the total time required to complete the tasks from Requirement Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to complete each phase of the data warehousing effort. The 9 weeks may sound too quick, but I have been personally involved in a turnkey data warehousing implementation that took 40 business days, so that is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of the data warehousing implementation.

    Why is this important? The main reason is that in today's business world, the business environment changes quickly, which means that what is important now may not be important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue at a rapid pace. Therefore, a time-consuming data warehousing effort will very likely become obsolete by the time it is in production. It is best to finish a project quickly. The focus on quick delivery time does mean, however, that the scope of each phase of the data warehousing project will necessarily be limited. In this case, the 80-20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs. The rest can come later.

    Lack Of Collaboration With Data Mining Efforts

    Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that, now that information from all over the enterprise is conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise impossible to find. Unfortunately, this has not quite happened, for the following reasons:

    1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are more likely to have isolated data marts. At the data mart level, it is difficult to come up with relationships that cannot be answered by a good OLAP tool.

    2. The ROI for data mining companies is inherently lower because, by definition, data mining will only be performed by a few users (generally no more than 5) in the entire enterprise. As a result, it is hard to charge a lot of money due to the low number of users. In addition, developing data mining algorithms is an inherently complex process and requires a lot of up-front investment. Finally, it is difficult for the vendor to put a value proposition in front of the client because quantifying the returns on a data mining project is next to impossible.

    This is not to say, however, that data mining is not being utilized by enterprises. In fact, many enterprises have made excellent discoveries using data mining techniques. What I am saying, though, is that data mining is typically not associated with a data warehousing initiative. It seems like successful data mining projects are usually stand-alone projects.

    Industry Consolidation

    In the last several years, we have seen rapid industry consolidation, as the weaker competitors are gobbled up by stronger players. The most significant transactions are below (note that the dollar amount quoted is the value of the deal when initially announced):

    IBM purchased Cognos for $5 billion in 2007.

    SAP purchased Business Objects for $6.8 billion in 2007.


    Oracle purchased Hyperion for $3.3 billion in 2007.

    Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million in 2006.

    Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million in 2006.

    IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in 2005.

    Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in 2003.

    Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.

    GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.

    For the majority of the deals, the purchase represents an effort by the buyer to expand into other areas of data warehousing (Hyperion's purchase of Brio also falls into this category because, even though both are OLAP vendors, their product lines do not overlap). This clearly shows vendors' strong push to be the one-stop shop, from reporting and OLAP to ETL.

    There are two levels of one-stop shop. The first level is at the corporate level. In this case, the vendor is essentially still selling two entirely separate products, but instead of dealing with two sets of sales and technology support groups, the customers only interact with one such group. The second level is at the product level. In this case, different products are integrated; in data warehousing, this essentially means that they share the same metadata layer. This is actually a rather difficult task, and therefore not commonly accomplished. When there is metadata integration, the customers not only get the benefit of dealing with one vendor instead of two (or more), but they will also be using a single product rather than multiple products. This is where the real value of industry consolidation is shown.

    How To Measure Success

    Given the significant amount of resources usually invested in a data warehousing project, a very important question is how success can be measured. This is a question that many project managers do not think about, and for good reason: many project managers are brought in to build the data warehousing system and then turn it over to in-house staff for ongoing maintenance. The job of the project manager is to build the system, not to justify its existence.

    Just because this is often not done does not mean it is not important. Just as a data warehousing system aims to measure the pulse of the company, the success of the data warehousing system itself needs to be measured. Without some type of measure of the return on investment (ROI), how does the company know whether it made the right choice? Or whether it should continue with the data warehousing investment?

    There are a number of papers out there that provide formulas for calculating the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe the success of the data warehousing system can simply be measured by looking at one criterion:

    How often the system is being used.

    If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that could be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much the users are accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system serves its purpose successfully.

    Business Intelligence

    Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business intelligence?

    Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.

    This section includes the following:

    Business intelligence tools: Tools commonly used for business intelligence.

    Business intelligence uses: Different forms of business intelligence.

    Business intelligence news: News in the business intelligence area.

    Business Intelligence > Tools

    The most common tools used for business intelligence are as follows. They are listed in the following order: increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.

    Excel

    Take a guess: what's the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:


    1. It's relatively cheap.

    2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers.

    3. It has most of the functionalities users need to display data.

    In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built.

    Excel is best used for business operations reporting and goals tracking.

    Reporting tool

    In this discussion, I am including both custom-built reporting tools and commercial reporting tools. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.

    Business operations reporting and dashboards are the most common applications for a reporting tool.

    OLAP tool

    OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.

    OLAP tools are used for multidimensional analysis.

    Data mining tool

    Data mining tools are usually used only by very specialized users; in an organization, even a large one, there are usually only a handful of users using data mining tools.

    Data mining tools are used for finding correlation among different factors.

    Business Intelligence Uses

    Business intelligence usage can be grouped into the following categories:

    1. Business operations reporting

    The most common form of business intelligence is business operations reporting. This includes the actuals and how the actuals stack up against the goals. This type of business intelligence often manifests itself in the standard weekly or monthly reports that need to be produced.

    2. Forecasting

    Many of you have no doubt run into the need for forecasting, and all of you would agree that forecasting is both a science and an art. It is an art because one can never be sure what the future holds. What if competitors decide to spend a large amount of money on advertising? What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science because one can extrapolate from historical data, so it's not a total guess.

    3. Dashboard

    The primary purpose of a dashboard is to convey the information at a glance. For this audience, there is little, if any, need for drilling down on the data. At the same time, presentation and ease of use are very important for a dashboard to be useful.

    4. Multidimensional analysis

    Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the numbers at a more granular level. This requires a solid data warehousing / data mart backend, as well as business-savvy analysts to get to the necessary data.

    5. Finding correlation among different factors

    This is diving very deep into business intelligence. Typical questions include "How do different factors correlate to one another?" and "Are there significant time trends that can be leveraged/anticipated?"

    Data Warehousing > Concepts

    Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.

    Dimensional Data Model: The dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique and the two common schema types, star schema and snowflake schema.

    Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem and describes the three ways of handling it, with examples.

    Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of data model.

    Logical Data Model: What is a logical data model, its features, and an example of this type of data model.

    Physical Data Model: What is a physical data model, its features, and an example of this type of data model.

    Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and contrasts the three different types of data models.

    Data Integrity: What is data integrity and how it is enforced in data warehousing.

    What is OLAP: Definition of OLAP.


    MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they differ from one another, and the advantages and disadvantages of each.

    Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have different views of the relationship between the data warehouse and the data mart.

    Dimensional Data Modeling

    The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form commonly used for transactional (OLTP) systems. As you can imagine, the same data would be stored differently in a dimensional model than in a 3rd normal form model.

    To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

    Dimension: A category of information. For example, the time dimension.

    Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

    Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year --> Quarter --> Month --> Day.

    Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.

    Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

    A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. A minimal sketch of such a model appears below. In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

    Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes when there is a business case to analyze the information at that particular level.


    Fact Table Granularity

    Granularity

    The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

    Determine which dimensions will be included.

    Determine where along the hierarchy of each dimension the information will be kept.

    The determining factors usually go back to the requirements.

    Which Dimensions To Include

    Determining which dimensions to include is usually a straightforward process, because business processes will often clearly dictate what the relevant dimensions are.

    For example, in the off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means complete for all off-line retailers. A supermarket with a rewards card program, where customers provide some personal information in exchange for lower prices on certain items at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.

    What Level Within Each Dimension To Include

    Determining at which level of the hierarchy information is stored along each dimension is a bit trickier. This is where user requirements (both stated and possibly future) play a major role.

    In the above example, will the supermarket want to do analysis at the hourly level (i.e., looking at how certain products sell during different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed analysis and data storage.

    Note that sometimes the users will not specify certain requirements, but based on industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need for additional detail. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible, and hence futile, exercise, and the data warehousing team needs to fight the urge of the "dump the lowest level of detail into the data warehouse" symptom and only include what is practically needed. Sometimes this can be more of an art than a science, and prior experience will become invaluable here.

    Fact Table Types


    Types of Facts

    There are three types of facts:

    Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

    Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

    Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

    Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

    Date

    Store

    Product

    Sales_Amount

    The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

    Say we are a bank with the following fact table:

    Date

    Account

    Current_Balance

    Profit_Margin

    The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.


    Types of Fact Tables

    Based on the above classifications, there are two types of fact tables:

    Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

    Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
