TERM PAPER
ON
NEW TRENDS IN DATA WAREHOUSING
PRESENTED
BY
KOYA TEMITOPE ABAYOMI
(ACU/597)
IN PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR
TOPICS IN I.C.T
(ICT 4107)
April, 2010
TABLE OF CONTENTS
TITLE
TABLE OF CONTENTS
ABSTRACT
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: ARCHITECTURE
CHAPTER THREE: TRENDS IN DATA WAREHOUSING
REFERENCES
ABSTRACT
“Although data warehousing has greatly matured as a technology discipline over the past ten
years, enterprises that undertake data warehousing initiatives continue to face fresh challenges
that evolve with the changing business and technology environment. The data warehouse is
being called on to support new initiatives, such as customer relationship management and supply
chain management, and has also been directly impacted by the rise of e-business. Data
warehousing vendors have developed new and more sophisticated technologies and have
acquired and merged with other vendors. The number of homegrown and packaged software
implementations throughout the average enterprise has grown rapidly, creating more data sources
and information delivery options. With all of the activity surrounding data warehousing, it is
hard to sort out which issues and trends are most pressing for enterprises. To that end, this term
paper presents insights into the latest trends in data warehousing.” [8]
CHAPTER ONE: INTRODUCTION
[10]According to W. H. Inmon, a data warehouse is “a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision-making process”.
According to Ralph Kimball, it is “a copy of transaction data, specifically structured for query
and analysis”. In other words, a data warehouse is a copy of transaction data specifically
structured for querying, analysis, reporting, and more rigorous data mining. Note that the data
warehouse contains a copy of the transactions, which is not updated or changed later by the
transaction system. Also note that this data is specially structured, and may have been
transformed when it was copied into the data warehouse.
[1]A data warehouse is a repository of an organization’s electronically stored data. This definition
of data warehousing focuses on data storage. However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered essential
components of a data warehousing system. Thus, an expanded definition of data warehousing
includes business intelligence tools, tools to extract, transform and load data into the repository,
and tools to manage and retrieve metadata. Data warehousing arises from an organization’s need
for reliable, consolidated and integrated analysis of its data at different levels of aggregation.
The practical reality of most organizations is that their data infrastructure is made up of a
collection of heterogeneous systems.
HISTORY
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry
Devlin and Paul Murphy developed the “business data warehouse”. In essence, the data
warehousing concept was intended to provide an architectural model for the flow of data from
operational systems to decision support environments. The concept attempted to address the
various problems associated with this flow, mainly the high costs associated with it. In the
absence of data warehousing architecture, an enormous amount of redundancy was required to
support multiple decision support environments. In larger corporations it was typical for multiple
decision support environments to operate independently. Each environment served different
users but often required much of the same stored data. The process of gathering, cleaning and
integrating data from various sources, usually from long-term existing operational systems
(usually referred to as legacy system) was typically in part replicated for each environment.
Moreover, the operational systems were frequently reexamined as new decision support
requirements emerged. Often new requirements necessitated gathering, cleaning and integrating
new data from ‘data marts’ that were tailored for ready access by users. (A data mart is a subset
of an organizational data store, usually oriented to a specific purpose or major data subject, that
may be distributed to support business needs. Data marts are analytical data stores designed to
focus on specific business functions for a specific community within an organization. They are
often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse
design methodology the warehouse is created from the union of organizational data marts.) Key
developments in the early years of data warehousing were:
• 1960s – General Mills and Dartmouth College, in a joint research project, developed the
terms ‘dimensions’ and ‘facts’.
• 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
• 1983 – Teradata introduces a database management system specifically designed for
decision support.
• 1988 – Barry Devlin and Paul Murphy published the article “An architecture for a
business and information system” in the IBM Systems Journal, where they introduced the
term “business data warehouse”.
• 1990 – Red Brick Systems introduced Red Brick Warehouse, a database management
system specifically for data warehousing.
• 1991 – Prism Solutions introduced Prism Warehouse Manager, software for developing a
data warehouse.
• 1991 – Bill Inmon published the book “Building the Data Warehouse”.
• 1995 – The Data Warehousing Institute, a for-profit organization that promotes data
warehousing, is founded.
• 1996 – Ralph Kimball published the book “The Data Warehouse Toolkit”.
• 1997 – Oracle 8, with support for star queries, is released.
• 1998 – Microsoft releases Microsoft Analysis Services (then OLAP Services), heavily
utilizing data warehouse schemas.
CHAPTER TWO: ARCHITECTURE
[6]Architecture in the context of an organization’s data warehousing efforts is a conceptualization
of how the data warehouse is built. There is no right or wrong architecture but rather there are
multiple architectures that exist to support various environments and situations. The worthiness
of the architecture can be judged from how the conceptualization aids in the building,
maintenance, and usage of the data warehouse. One possible simple conceptualization of data
warehouse architecture consists of the following interconnected layers:
Operational database layer
The source data for the data warehouse – an organization’s Enterprise Resource Planning
systems fall into this layer.
Data access layer
The interface between the operational and informational access layer – Tools to extract,
transform, load data into the warehouse fall into this layer.
Metadata layer
The data directory – This is usually more detailed than an operational system data directory.
There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can
be accessed by a particular reporting and analysis tool.
Informational access layer
The data accessed for reporting & analyzing and the tools for reporting & analyzing data –
Business intelligence tools fall into this layer. The Inmon–Kimball differences about design
methodology have to do with this layer.
THE MAJOR STEPS IN DEVELOPING A DATA WAREHOUSE [7]
Identify the data source
The very first step, before you start to develop the data warehouse, is to identify the data
sources. You need to figure out what data are required to be put into your data warehouse. For a
library data warehouse, there are two types of data sources that need to be considered: internal
and external data sources. The internal data source is the data that already exists in the library
system. The external data source is the data that does not exist within the library system
(Nicholson, 2003).
Build a customized ETL tool
Each data warehouse has different requirements. Therefore, a customized ETL tool is a better
solution for fulfilling those requirements. For the library data warehouse, we write our own
extract program, deal with inconsistency issues using our own transformation method, and
finally load the data into the data warehouse database.
Extraction
This can be the most time-consuming part, where you need to pull the data from various data
sources and store it in the staging database. Much time and effort are needed to write a custom
program that transfers the data from the sources into the staging database. As a result, during
extraction, we need to determine which database system will be used for the staging area and
also figure out which data are needed before grabbing them. The decline in the cost of hardware
and storage has eased earlier concerns about data duplication and about lack of storage for
excessive or unnecessary data. However, there is still no reason to store unnecessary data that
has been identified as not useful in the decision-making process. Therefore, it is necessary to
extract only the relevant data before bringing it into the data warehouse (Mallach, 2000).
Transformation
After extracting the data from various data sources, transformation is needed to ensure data
consistency. In order to load the data into the data warehouse properly, you need to figure out a
way of mapping the external data source fields to the data warehouse fields. Transformation can
be performed during data extraction or while loading the data into the data warehouse. This
integration can become a complex issue as the number of data sources grows.
Loading
Once the extraction, transformation and cleansing have been done, the data are loaded into the
data warehouse. The loading of data can be categorised into two types: the loading of the data
currently contained in the operational database, and the loading of updates to the data warehouse
from changes that have occurred in the operational database. To guarantee the freshness of the
data, the data warehouse needs to be refreshed periodically. Many issues need to be considered,
especially when loading updates into the data warehouse. While updating the data warehouse,
we need to ensure that no data are lost and that the overhead of scanning existing files is kept to
a minimum.
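The extract, transform and load steps described above can be sketched as a minimal pipeline. This is only an illustration under stated assumptions: the source rows, field names (isbn, copies, book_id, copy_count) and the in-memory "warehouse" list are all invented for the example.

```python
# Minimal ETL sketch: extract rows from a hypothetical library source,
# transform them to the warehouse's field names, then load them.

source_rows = [  # stands in for rows pulled from the operational database
    {"isbn": "978-0", "title": "Data Warehousing", "copies": "3"},
    {"isbn": "978-1", "title": "OLAP Basics", "copies": "5"},
]

def extract(rows):
    """Keep only the relevant fields before staging (Mallach, 2000)."""
    return [{"isbn": r["isbn"], "copies": r["copies"]} for r in rows]

def transform(rows):
    """Map source fields to warehouse fields and fix data types."""
    return [{"book_id": r["isbn"], "copy_count": int(r["copies"])} for r in rows]

def load(rows, warehouse):
    """Append the transformed rows to the warehouse store."""
    warehouse.extend(rows)

warehouse = []  # stands in for the data warehouse database
staged = extract(source_rows)
load(transform(staged), warehouse)
print(warehouse[0]["copy_count"])  # -> 3
```

In a real system each stage would read from and write to databases rather than lists, but the shape of the pipeline is the same.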
AN OVERVIEW OF DATA WAREHOUSE INFRASTRUCTURE [14]
COMPONENTS OF A DATA WAREHOUSE [13]
Overall Architecture
The data warehouse architecture is based on a relational database management system server that
functions as the central repository for informational data. Operational data and processing is
completely separated from data warehouse processing. This central information repository is
surrounded by a number of key components designed to make the entire environment functional,
manageable and accessible by both the operational systems that source data into the warehouse
and by end-user query and analysis tools.
Typically, the source data for the warehouse comes from the operational applications. As the
data enters the warehouse, it is cleaned up and transformed into an integrated structure and
format. The transformation process may involve conversion, summarization, filtering and
condensation of data. Because the data contains a historical component, the warehouse must be
capable of holding and managing large volumes of data as well as different data structures for the
same database over time. The seven major components of data warehousing are:
Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing environment.
This database is almost always implemented on the relational database management system
(RDBMS) technology. However, this kind of implementation is often constrained by the fact that
traditional RDBMS products are optimized for transactional database processing. Certain data
warehouse attributes, such as very large database size, ad hoc query processing and the need for
flexible user view creation including aggregates, multi-table joins and drill-downs, have become
drivers for different technological approaches to the data warehouse database. These approaches
include:
• Parallel relational database designs for scalability that include shared-memory, shared-
disk, or shared-nothing models implemented on various multiprocessor configurations
(symmetric multiprocessors or SMP, massively parallel processors or MPP, and/or
clusters of uni- or multiprocessors).
• An innovative approach to speed up a traditional RDBMS by using new index structures
to bypass relational table scans.
• Multidimensional databases (MDDBs) that are based on proprietary database technology;
conversely, a dimensional data model can be implemented using a familiar RDBMS.
Multi-dimensional databases are designed to overcome any limitations placed on the
warehouse by the nature of the relational data model. MDDBs enable on-line analytical
processing (OLAP) tools that architecturally belong to a group of data warehousing
components jointly categorized as the data query, reporting, analysis and mining tools.
Sourcing, Acquisition, Cleanup and Transformation Tools
A significant portion of the implementation effort is spent extracting data from operational
systems and putting it in a format suitable for informational applications that run off the data
warehouse. The data sourcing, cleanup, transformation and migration tools perform all of the
conversions, summarizations, key changes, structural changes and condensations needed to
transform disparate data into information that can be used by the decision support tool. They
produce the programs and control statements, including the COBOL programs, MVS job-control
language (JCL), UNIX scripts, and SQL data definition language (DDL) needed to move data
into the data warehouse from multiple operational systems. These tools also maintain the meta
data. The functionality includes:
• Removing unwanted data from operational databases
• Converting to common data names and definitions
• Establishing defaults for missing data
• Accommodating source data definition changes
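The first three functions in the list above can be illustrated with a small cleanup pass. The column names, the rename map and the default values here are hypothetical, invented purely for the sketch:

```python
# Hypothetical cleanup pass over extracted records: remove unwanted
# fields, convert to common data names, and establish defaults for
# missing data.

COMMON_NAMES = {"cust_nm": "customer_name", "cust_no": "customer_id"}
DEFAULTS = {"country": "US"}
UNWANTED = {"internal_flag"}

def clean(record):
    out = {}
    for field, value in record.items():
        if field in UNWANTED:
            continue                                  # remove unwanted data
        out[COMMON_NAMES.get(field, field)] = value   # common names/definitions
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)                # defaults for missing data
    return out

row = {"cust_nm": "Ada", "cust_no": 7, "internal_flag": True}
print(clean(row))  # -> {'customer_name': 'Ada', 'customer_id': 7, 'country': 'US'}
```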
The data sourcing, cleanup, extract, transformation and migration tools have to deal with some
significant issues including:
• Database heterogeneity. DBMSs are very different in data models, data access language,
data navigation, operations, concurrency, integrity, recovery etc.
• Data heterogeneity. This is the difference in the way data is defined and used in different
models - homonyms, synonyms, unit compatibility (U.S. vs metric), different attributes
for the same entity and different ways of modeling the same fact.
These tools can save a considerable amount of time and effort. However, significant
shortcomings do exist. For example, many available tools are generally useful for simpler data
extracts. Frequently, customized extract routines need to be developed for the more complicated
data extraction procedures.
Meta data
Meta data is data about data that describes the data warehouse. It is used for building,
maintaining, managing and using the data warehouse. Meta data can be classified into:
• Technical meta data, which contains information about warehouse data for use by
warehouse designers and administrators when carrying out warehouse development and
management tasks.
• Business meta data, which contains information that gives users an easy-to-understand
perspective of the information stored in the data warehouse.
Equally important, meta data provides interactive access to users to help understand content and
find data. One of the issues dealing with meta data relates to the fact that many data extraction
tool capabilities to gather meta data remain fairly immature. Therefore, there is often the need to
create a meta data interface for users, which may involve some duplication of effort.
Meta data management is provided via a meta data repository and accompanying software. Meta
data repository management software, which typically runs on a workstation, can be used to map
the source data to the target database; generate code for data transformations; integrate and
transform the data; and control moving data to the warehouse.
As users' interactions with the data warehouse increase, their approaches to reviewing the results
of their requests for information can be expected to evolve from relatively simple manual
analysis for trends and exceptions to agent-driven initiation of the analysis based on user-defined
thresholds. The definition of these thresholds, configuration parameters for the software agents
using them, and the information directory indicating where the appropriate sources for the
information can be found are all stored in the meta data repository as well.
Access Tools
The principal purpose of data warehousing is to provide information to business users for
strategic decision-making. These users interact with the data warehouse using front-end tools.
Many of these tools require an information specialist, although many end users develop expertise
in the tools. Tools fall into four main categories: query and reporting tools, application
development tools, online analytical processing tools, and data mining tools.
Query and Reporting tools can be divided into two groups: reporting tools and managed query
tools. Reporting tools can be further divided into production reporting tools and report writers.
Production reporting tools let companies generate regular operational reports or support high-
volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand,
are inexpensive desktop tools designed for end-users.
Managed query tools shield end users from the complexities of SQL and database structures by
inserting a metalayer between users and the database. These tools are designed for easy-to-use,
point-and-click operations that either accept SQL or generate SQL database queries.
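The metalayer idea can be sketched as a tiny query builder that turns point-and-click selections of business terms into SQL. The business terms, table names and join shown are assumptions made up for the example, not the API of any real tool:

```python
# Sketch of a managed query tool's metalayer: business terms chosen in
# a point-and-click interface map to physical columns, and the tool
# generates the SQL so the end user never writes it.

METALAYER = {
    "Customer": "dim_customer.customer_name",
    "Revenue": "fact_sales.revenue",
}

def build_query(selected_terms,
                table="fact_sales JOIN dim_customer USING (customer_id)"):
    """Generate a SQL query from the user's selected business terms."""
    cols = ", ".join(METALAYER[t] for t in selected_terms)
    return f"SELECT {cols} FROM {table}"

sql = build_query(["Customer", "Revenue"])
print(sql)
# -> SELECT dim_customer.customer_name, fact_sales.revenue FROM fact_sales JOIN dim_customer USING (customer_id)
```

Real managed query tools also handle filters, joins across many tables and security, but the principle — a mapping layer between business vocabulary and database structures — is the same.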
Often, the analytical needs of the data warehouse user community exceed the built-in capabilities
of query and reporting tools. In these cases, organizations will often rely on the tried-and-true
approach of in-house application development using graphical development environments such
as PowerBuilder, Visual Basic and Forte. These application development platforms integrate
well with popular OLAP tools and access all major database systems including Oracle, Sybase,
and Informix.
OLAP tools are based on the concepts of dimensional data models and corresponding databases,
and allow users to analyze the data using elaborate, multidimensional views. Typical business
applications include product performance and profitability, effectiveness of a sales program or
marketing campaign, sales forecasting and capacity planning. These tools assume that the data is
organized in a multidimensional model.
A critical success factor for any business today is the ability to use information effectively. Data
mining is the process of discovering meaningful new correlations, patterns and trends by digging
into large amounts of data stored in the warehouse using artificial intelligence, statistical and
mathematical techniques.
Data Marts
The concept of a data mart is causing a lot of excitement and attracting much attention in the
data warehouse industry. Mostly, data marts are presented as an alternative to a data warehouse
takes significantly less time and money to build. However, the term data mart means different
things to different people. A rigorous definition of this term is a data store that is subsidiary to a
data warehouse of integrated data. The data mart is directed at a partition of data (often called a
subject area) that is created for the use of a dedicated group of users. A data mart might, in fact,
be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed
on the data warehouse rather than a physically separate store of data. In most instances, however,
the data mart is a physically separate store of data and is resident on separate database server,
often a local area network serving a dedicated user group. Sometimes the data mart simply
comprises relational OLAP technology which creates a highly denormalized dimensional model
(e.g., star schema) implemented on a relational database. The resulting hypercubes of data are
used for analysis by groups of users with a common interest in a limited portion of the database.
These types of data marts, called dependent data marts because their data is sourced from the
data warehouse, have a high value because no matter how they are deployed and how many
different enabling technologies are used, different users are all accessing the information views
derived from the single integrated version of the data.
Unfortunately, the misleading statements about the simplicity and low cost of data marts
sometimes result in organizations or vendors incorrectly positioning them as an alternative to the
data warehouse. This viewpoint defines independent data marts that in fact, represent fragmented
point solutions to a range of business problems in the enterprise. This type of implementation
should be rarely deployed in the context of an overall technology or applications architecture.
Indeed, it is missing the ingredient that is at the heart of the data warehousing concept -- that of
data integration. Each independent data mart makes its own assumptions about how to
consolidate the data, and the data across several data marts may not be consistent.
Moreover, the concept of an independent data mart is dangerous -- as soon as the first data mart
is created, other organizations, groups, and subject areas within the enterprise embark on the task
of building their own data marts. As a result, you create an environment where multiple
operational systems feed multiple non-integrated data marts that are often overlapping in data
content, job scheduling, connectivity and management. In other words, you have transformed a
complex many-to-one problem of building a data warehouse from operational and external data
sources to a many-to-many sourcing and management nightmare.
Data Warehouse Administration and Management
Data warehouses tend to be as much as four times as large as related operational databases,
reaching terabytes in size depending on how much history needs to be saved. They are not
synchronized in real time to the associated operational data but are updated as often as once a
day if the application requires it.
In addition, almost all data warehouse products include gateways to transparently access multiple
enterprise data sources without having to rewrite applications to interpret and utilize the data.
Furthermore, in a heterogeneous data warehouse environment, the various databases reside on
disparate systems, thus requiring inter-networking tools. The need to manage this environment is
obvious.
Managing data warehouses includes security and priority management; monitoring updates from
the multiple sources; data quality checks; managing and updating meta data; auditing and
reporting data warehouse usage and status; purging data; replicating, subsetting and distributing
data; backup and recovery and data warehouse storage management.
Information Delivery System
The information delivery component is used to enable the process of subscribing for data
warehouse information and having it delivered to one or more destinations according to some
user-specified scheduling algorithm. In other words, the information delivery system distributes
warehouse-stored data and other information objects to other data warehouses and end-user
products such as spreadsheets and local databases. Delivery of information may be based on time
of day or on the completion of an external event. The rationale for the delivery systems
component is based on the fact that once the data warehouse is installed and operational, its users
don't have to be aware of its location and maintenance. All they need is the report or an
analytical view of data at a specific point in time. With the proliferation of the Internet and the
World Wide Web such a delivery system may leverage the convenience of the Internet by
delivering warehouse-enabled information to thousands of end-users via the ubiquitous world
wide network.
In fact, the Web is changing the data warehousing landscape since at the very high level the
goals of both the Web and data warehousing are the same: easy access to information. The value
of data warehousing is maximized when the right information gets into the hands of those
individuals who need it, where they need it and when they need it most. However, many corporations
have struggled with complex client/server systems to give end users the access they need. The
issues become even more difficult to resolve when the users are physically remote from the data
warehouse location. The Web removes a lot of these issues by giving users universal and
relatively inexpensive access to data. Couple this access with the ability to deliver required
information on demand and the result is a web-enabled information delivery system that allows
users dispersed across continents to perform a sophisticated business-critical analysis and to
engage in collective decision-making.
NORMALIZED VERSUS DIMENSIONAL APPROACH FOR STORAGE OF DATA
[16]There are two leading approaches to storing data in a data warehouse – the dimensional
approach and normalized approach.
In the dimensional approach, transaction data are partitioned into either “facts”, which are
generally numeric transaction data, or “dimensions”, which are the reference information that
gives context to the facts. For example, a sales transaction can be broken up into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, customer name, product number, order ship-to & bill-to location, and salesperson
responsible for receiving the order. A key advantage of a dimensional approach is that the data
warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data
warehouse tends to operate very quickly. The main disadvantages of dimensional approach are:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse
with data from different operational systems is complicated.
2. It is difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree,
database normalization rules. Tables are grouped together by subject areas that reflect general
data categories (e.g. data on customers, products, finance etc.). The main advantage of this
approach is that it is straightforward to add information into the database. A disadvantage of this
approach is that because of the number of tables involved it can be difficult for users both to:
1. Join data from different sources into meaningful information.
2. Access the information without a precise understanding of the sources of data and of the
data structure of the data warehouse.
These approaches are not mutually exclusive, and there are other approaches. The dimensional
approach, for example, can involve normalizing data to a degree.
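The sales example above can be made concrete with a minimal dimensional (star-schema) sketch: a fact table holds the numeric measures, and dimension tables give them context. All table and column names here are invented for illustration:

```python
# Minimal star-schema sketch: a fact table of numeric measures keyed
# to dimension tables that supply context. All names are hypothetical.

dim_product = {1: {"name": "Widget"}, 2: {"name": "Gadget"}}
dim_date = {20100401: {"month": "April", "year": 2010}}

fact_sales = [  # facts: numeric transaction data plus dimension keys
    {"product_id": 1, "date_id": 20100401, "qty": 3, "amount": 30.0},
    {"product_id": 2, "date_id": 20100401, "qty": 1, "amount": 99.0},
]

# A typical analytical query: revenue per product, resolved by
# following the fact rows' keys into the product dimension.
revenue = {}
for row in fact_sales:
    name = dim_product[row["product_id"]]["name"]
    revenue[name] = revenue.get(name, 0.0) + row["amount"]

print(revenue)  # -> {'Widget': 30.0, 'Gadget': 99.0}
```

The same query against a fully normalized schema would typically require joining many more tables, which is the usability trade-off the section describes.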
CONFORMING INFORMATION
Another important decision in designing a data warehouse is which data to conform and how to
conform the data. For example, one operational system feeding data into the data warehouse may
use “M” and “F” to denote the sex of an employee, while another operational system may use
“Male” and “Female”. Though this is a simple example, much of the work in implementing a
data warehouse is devoted to making data with similar meanings consistent when they are stored
in the data warehouse. Typically, extract, transform and load tools are used in this work. Master
data management has the aim of conforming data that could be considered “dimensions”.
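The sex-code example above can be sketched as a conforming lookup applied during the transform step; the mapping table and record fields are the only assumptions:

```python
# Conform employee sex codes from two source systems to one
# warehouse-standard coding during transformation.

CONFORM_SEX = {"M": "M", "F": "F", "Male": "M", "Female": "F"}

def conform(record):
    """Return a copy of the record with the sex code conformed."""
    record = dict(record)
    record["sex"] = CONFORM_SEX[record["sex"]]
    return record

system_a = {"employee": "Ada", "sex": "F"}         # system using "M"/"F"
system_b = {"employee": "Grace", "sex": "Female"}  # system using "Male"/"Female"

print(conform(system_a)["sex"], conform(system_b)["sex"])  # -> F F
```

Real conforming work covers many such attributes at once, usually driven by mapping tables maintained in the metadata repository rather than hard-coded dictionaries.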
TOP-DOWN VERSUS BOTTOM-UP DESIGN METHODOLOGIES [11]
Bottom-up Design
Ralph Kimball, a well-known author in data warehousing, is a proponent of an approach to data
warehouse design frequently considered as bottom-up. In this approach, data marts are first
created to provide reporting and analytical capabilities for specific business processes. Data
marts contain atomic data and, if necessary, summarized data; these data marts can eventually be
unioned together to create a comprehensive data warehouse. The combination of data marts is
managed through the implementation of what Kimball calls “a data warehouse bus architecture”.
Business value can be returned as quickly as the first data marts can be created. Maintaining tight
management over the data warehouse bus architecture is fundamental to maintaining the integrity
of the data warehouse. The most important management task is making sure dimensions among
data marts are consistent. In Kimball’s words, this means that the dimensions “conform”.
Top-Down Design
Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data
warehouse as a centralized repository for the entire enterprise. Inmon is one of the leading
proponents of the top-down approach to data warehouse design in which the data warehouse is
designed using a normalized enterprise data model. “Atomic” data, which is data at the lowest
level of detail, are stored in the data warehouse. Dimensional data marts containing data needed
for specific business processes or specific departments are created from the data warehouse. In
the Inmon vision, the data warehouse is at the center of the “Corporate Information Factory”
(CIF) which provides a logical framework for delivering business intelligence and business
management capabilities. Inmon states that the data warehouse is:
• Subject-oriented: the data in the data warehouse is organized so that all the data elements
relating to the same real-world event or object are linked together.
• Non-volatile: data in the data warehouse is never overwritten or deleted; once written, the
data is static, read-only and retained for future reporting.
• Integrated: the data warehouse contains data from most or all of an organization’s
operational systems, and this data is made consistent.
The top-down design methodology generates highly consistent dimensional views of data across
data marts since all data marts are loaded from the centralized repository. Top-down design has
also proven to be robust against business changes. Generating new dimensional data marts
against the data stored in the data warehouse is a relatively simple task. The main disadvantage
of the top-down methodology is that it represents a very large project with a very broad scope.
The up-front cost for implementing a data warehouse using the top-down methodology is
significant and the duration of time from the start of project to the point that end users
experience initial benefits can be substantial. In addition, the top-down methodology can be
inflexible and unresponsive to changing departmental needs during the implementation phases.
Hybrid Design
Over time, it has become apparent to proponents of bottom-up and top-down data warehouse
design that both methodologies have benefits and risks. Hybrid methodologies have evolved to
take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data
consistency of top-down design.
DATA WAREHOUSE VERSUS OPERATIONAL SYSTEMS
Operational systems are optimized for preservation of data integrity and speed of recording of
business transactions through use of database normalization and an entity-relationship model.
Operational system designers generally follow the Codd rules of database normalization in
order to ensure data integrity. Codd defined five increasingly stringent rules of normalization;
fully normalized database designs (that is, those satisfying all five Codd rules) often result in
information from a business transaction being stored in dozens to hundreds of tables. Relational
databases are efficient at managing the relationships between the tables; the databases have very
fast insert/update performance because only a small amount of data in those tables is affected
each time a transaction is processed. Finally, in order to improve performance, older data(s) are
usually periodically purged from operational systems. Data warehouses are optimized for speed
of data analysis. Frequently, data in data warehouses are de-normalized via a dimension based
model; also, to speed data retrieval, data warehouse data are often are often stored multiple times
in their most granular form and in summarized forms called aggregates. Data warehouse data are
gathered from the operational systems and held in the data warehouse even after the data has
been purged from the operational systems.
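The contrast above can be sketched concretely. The following is a minimal illustration, with hypothetical table names and made-up sample data, using SQLite in place of a real warehouse DBMS: a denormalized, dimension-based fact table plus a pre-computed summary table ("aggregate").

```python
import sqlite3

# A minimal sketch (hypothetical tables, made-up data): denormalized
# warehouse storage with a pre-computed aggregate alongside granular rows.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized fact table: each row repeats dimension attributes
# (region, product), so analysis queries need no multi-table joins.
cur.execute("""CREATE TABLE sales_fact (
    sale_date TEXT, region TEXT, product TEXT, amount REAL)""")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [("2009-01-05", "East", "Widget", 100.0),
                 ("2009-01-06", "East", "Widget", 150.0),
                 ("2009-01-06", "West", "Gadget", 200.0)])

# A summarized form ("aggregate") stored in addition to the granular
# rows, so common reports read a tiny table instead of rescanning facts.
cur.execute("""CREATE TABLE sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM sales_fact GROUP BY region""")

print(cur.execute("SELECT region, total FROM sales_by_region "
                  "ORDER BY region").fetchall())
# -> [('East', 250.0), ('West', 200.0)]
```

In a normalized operational schema the same query would join the fact rows to separate region and product tables; the warehouse trades storage space for read speed.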
EVOLUTION IN ORGANIZATION USE
[3] Organizations generally start off with relatively simple use of data warehousing. Over time
more sophisticated use of data warehousing evolves; the following general stages of use of the
data warehouse can be distinguished.
Offline operational database
Data warehouses at this initial stage are developed by simply copying the data off an operational
system to another server, where the processing load of reporting against the copied data does not
impact the operational system’s performance.
Offline data warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis
and the data warehouse data is stored in a data structure designed to facilitate reporting.
Real time data warehouse
Data warehouses at this stage are updated every time an operational system performs a
transaction (e.g. an order, a delivery, or a booking).
Integrated data warehouse
Data warehouses at this stage are updated every time an operational system performs a
transaction. The data warehouses then generate transactions that are passed back into the
operational systems.
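The "offline data warehouse" stage above can be sketched as a periodic extract-transform-load (ETL) refresh. All names here (the `orders` and `orders_dw` tables, the `refresh` function) are hypothetical, and two in-memory SQLite databases stand in for the operational system and the warehouse:

```python
import sqlite3

# Hypothetical sketch of the "offline data warehouse" stage: data is
# copied from an operational table into a reporting-friendly warehouse
# table on a regular schedule, rather than on every transaction.
op = sqlite3.connect(":memory:")   # stands in for the operational system
op.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
op.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "acme", 50.0), (2, "ACME ", 75.0)])

wh = sqlite3.connect(":memory:")   # stands in for the data warehouse
wh.execute("CREATE TABLE orders_dw (id INTEGER, customer TEXT, total REAL)")

def refresh():
    # Extract from the operational source, transform (here: normalize the
    # customer name so reporting is consistent), and load into the warehouse.
    rows = op.execute("SELECT id, customer, total FROM orders").fetchall()
    cleaned = [(i, c.strip().upper(), t) for i, c, t in rows]
    wh.execute("DELETE FROM orders_dw")   # full refresh, for simplicity
    wh.executemany("INSERT INTO orders_dw VALUES (?, ?, ?)", cleaned)

refresh()   # in practice this would run on a schedule, e.g. nightly
```

A real-time warehouse would replace the scheduled `refresh()` call with per-transaction propagation, and an integrated warehouse would additionally feed transactions back to the operational systems.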
BENEFITS
[17] Some of the benefits that a data warehouse provides are as follows:
• A data warehouse provides a common data model for all data of interest, regardless of the
data’s source. This makes it easier to report on and analyze information than it would be if
multiple data models were used to retrieve information such as sales, invoices, order
receipts, and general ledger charges.
• Prior to loading data into the data warehouse, inconsistencies are identified and resolved;
this greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse users so that
even if the source system data is purged over time, the information in the warehouse can
be stored safely for extended periods of time.
• Because they are separate from operational systems, data warehouses provide retrieval of
data without slowing down operational systems.
• Data warehouses can work in conjunction with and hence, enhance the value of
operational business applications, notably customer relationship management (CRM)
systems.
• Data warehouses facilitate decision support system applications such as trend reports (e.g.
the items with the most sales in a particular area within the last two years), exception
reports, and reports that show actual performance versus goals.
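The trend report mentioned in the last bullet can be sketched as a simple aggregate query. The table name and sample data below are hypothetical, with SQLite standing in for the warehouse:

```python
import sqlite3

# A minimal sketch (hypothetical data) of a decision-support trend report:
# the items with the most sales in a particular area over a period.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INTEGER, area TEXT, "
            "item TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(2008, "East", "Widget", 30), (2009, "East", "Widget", 45),
     (2009, "East", "Gadget", 20), (2009, "West", "Widget", 99)])

# Top-selling items in the East area over the last two years.
report = con.execute("""
    SELECT item, SUM(qty) AS total
    FROM sales
    WHERE area = 'East' AND year >= 2008
    GROUP BY item
    ORDER BY total DESC""").fetchall()
print(report)   # -> [('Widget', 75), ('Gadget', 20)]
```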
DISADVANTAGES
Some disadvantages of using data warehouse are:
• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed and loaded into the warehouse, there is an
element of latency in data warehouse data.
• Over their life, data warehouses can have high costs.
• Data warehouses can get outdated relatively quickly; there is a cost of delivering
suboptimal information to the organization.
• There is often a fine line between data warehouses and operational systems. Duplicate,
expensive functionality may be developed. Or, functionality may be developed in the
data warehouse that, in retrospect, should have been developed in the operational
systems and vice versa.
SAMPLE APPLICATIONS
Some applications data warehousing can be used for are:
• Credit card churn analysis
• Insurance fraud analysis
• Call record analysis
• Logistics management amongst others
DATA WAREHOUSE APPLIANCE
[9] A data warehouse appliance consists of an integrated set of servers, storage, operating systems,
DBMS and software specifically pre-installed and pre-optimized for data warehousing. Data
warehouse appliances provide solutions for the mid-to-large volume data warehousing market,
offering low-cost performance, most commonly on data volumes in the terabyte to petabyte
range, and are designed to facilitate reporting and analysis. Examples of data
warehouse appliances are:
1. Teradata
2. Kognitio
3. Kickfire
4. Vertica
5. EXASOL
6. Paraccel amongst others.
BENEFITS OF DATA WAREHOUSE APPLIANCES
• Reduction in costs
• Parallel performance
• Reduced administration
• Built-in high availability
• Scalability
• Rapid time-to-value
APPLICATION USES
Data warehousing appliances provide solutions for many analytic application uses, including:
• Enterprise data warehousing
• Super-sized sandboxes that isolate power users with resource-intensive queries
• Pilot projects or projects requiring rapid prototyping and rapid time-to-value
• Off-loading projects from the enterprise data warehouse, such as large analytical query
projects that affect the overall workload of the enterprise data warehouse
• Applications with specific performance or loading requirements
• Data marts that have outgrown their present environment
• Turnkey data warehouses or data marts
• Solutions for applications with high data-growth and high-performance requirements
• Applications requiring data warehouse encryption
TRENDS IN DATA WAREHOUSING APPLIANCES
The data warehousing appliance market has begun to shift in several areas as it evolves:
• Vendors have started moving toward using commodity technologies rather than
proprietary assembly of commodity components.
• Implemented applications show usage expansion from tactical and data-mart solutions to
strategic and enterprise data-warehouse use.
• Mainstream vendor participation has become apparent as of 2009.
• With a lower total cost of ownership, reduced maintenance, and high performance for
business analytics on growing data volumes, most analysts believe that data
warehousing appliances will gain market share, though Teradata maintains its
leadership position.
CHAPTER THREE: TRENDS IN DATA WAREHOUSING
[2] According to Bill Inmon, in an interview granted to Sun Systems, there are several important
trends in data warehousing today, which include the following:
• The first is the growth of data in the data warehouse environment.
• The awareness that unstructured data belongs in the data warehouse. There is a wealth of
information in unstructured data that simply never makes it into the data warehouse, but it
should.
• The fact that metadata needs to be tightly integrated within the data warehouse. Metadata
has always been loosely associated with the data warehouse but never tightly integrated,
and that is a mistake.
• Some people have a need for what is termed a “global warehouse”: one that is
constructed from a collection of local data warehouses. Large multinational
corporations in particular have a need for a global data warehouse.
MAJOR DATA WAREHOUSING EVENTS OF 2008 [4]
Everyone Had An Appliance Story: With the acceptance of data warehouse appliances as part
of an overall data warehousing architecture, major vendors (in addition to start-ups) have jumped
on the appliance bandwagon. Column-oriented database vendors such as ParAccel and Vertica
have partnered with hardware vendors to support and market data warehouse appliances, and
Oracle has partnered with HP to produce the HP Oracle Database Machine. Even Teradata, a
company that for many years dismissed data warehouse appliances as niche technology that
could only lead to multiple versions of truth, has acknowledged the appliance concept with
several platforms of its own and is now bragging that it was the original data warehouse
appliance vendor.
Industry Consolidations Continued: Acquisitions in 2008 included data warehouse appliance
specialist DatAllegro by Microsoft, Identity Systems by Informatica, specialty analytics vendor
NuTech by Netezza, IDeaS Revenue Optimization and natural language processing specialist
Teragram by SAS, and open source database vendor MySQL AB by Sun Microsystems.
Furthermore, although announced in 2007, IBM’s acquisition of Cognos and SAP’s acquisition
of Business Objects were both completed in January 2008.
The Recessionary Environment Encouraged Further BI Deployments: As companies sought
ways to maintain profitability in the face of a deteriorating economy, they recognized the value
of business intelligence for discovering new revenue opportunities, identifying areas of potential
cost reductions, and reducing fraud. While IT expenditures were closely watched and, in many
cases, reduced, many BI projects actually had their priorities increased.
Open Source Grew: Supporting cost reduction initiatives, open source business intelligence,
database, and data integration technology showed substantial uptake in 2008, to the point
where open source offerings, especially products that have formal (albeit extra-cost) support, are
now making inroads into accounts that previously would not consider them. In many cases, free
open source technology was initially used for prototype deployments with organizations
upgrading to commercial versions with formal support when the prototypes were placed in
production.
Major Trends for 2009
Trend #1: Further Industry Consolidation
Acquisitions will remain a fact of life in the data warehousing industry. Since the economy is
now officially in a recession, some vendors will be open to being acquired if only to ensure their
survival. Other, more established, vendors may simply succumb to offers they, or their
stockholders, simply can’t refuse.
If I had to pick two likely targets, my guess would be Informatica, perhaps by HP in order to
augment its data warehousing technology portfolio, and SPSS, perhaps by SAP as Business
Objects sells (as an OEM) SPSS predictive analytics technology for its BusinessObjects XI
platform. Although neither of these two companies is experiencing major financial problems,
their technologies would make attractive additions to the technology portfolio of potential
acquirers.
Trend #2: Cloud Computing will Come Down to Earth
Continued pressure to reduce expenses will serve as a catalyst for organizations both large and
small to utilize cloud computing as an alternative to obtaining and funding in-house resources.
Although small companies may use this as their primary computing platform, large companies
may use cloud computing for incremental, perhaps one-time, projects. BI vendors not already
offering on-demand software will establish a cloud presence to better compete in the small-to-
midsize business (SMB) market.
Trend #3: Open Source Growth will Accelerate
Economic pressure will accelerate the growth of open source technology as well, especially as
open source has now established itself in production deployments. Because many vendors are
utilizing open source technology in their applications in order to reduce costs or, as in the case of
several data warehouse appliance vendors, partnering with open source business intelligence and
data integration vendors to offer a more complete solution, the growth will be seen in both
standalone and embedded environments.
Trend #4: The IT World Will Become Greener
The peak in energy costs earlier this year provided a strong incentive for organizations to
consider becoming “greener” for cost savings as well as more altruistic environmental reasons.
Organizations will look to minimize the energy costs associated with their hardware and consider
both direct power consumption as well as associated costs such as air conditioning in their
technology evaluations. This will further drive virtualization efforts to maximize the utilization
of existing hardware.
Trend #5: Major Emphasis on Solutions Rather than Tools and Technology
The need to quickly address business concerns as well as compliance requirements will drive
organizations to seek customizable analytic applications rather than to build them from scratch
with Business Intelligence tools. Business Intelligence vendors will respond to this demand with
additional vertical and functional analytic applications, some of which may be obtained through
the acquisition of their current partners. Furthermore, vendors such as Oracle and SAP will
continue to enhance the analytic functionality of their operational enterprise applications.
CONCLUSION
Judging from the trends of 2008 and the predictions for 2009, appliances will be on the decline, and IT
companies will try to invest in other markets in order to stay afloat; where a
company cannot keep up, industry consolidation follows. Data warehousing trends seem
likely to remain the same, but to expand in order to accommodate the rapid growth of data being stored
in the data warehouse by enterprises, especially in the area of open source technology, which will
see data warehouse appliance vendors partnering with open source business intelligence and
data integration vendors to offer more complete solutions.
Even against this backdrop of anxiety and budget constraints caused by the economic
recession, business intelligence, data warehousing and data integration will continue to grow.
Some will argue that because business intelligence and data warehousing provide such business
value they will be exempted from budget cuts, but that will not be the case.
REFERENCES:
• www.wikipedia.org/datawarehouse/ [1]
• www.sun.com/solutions/documents/interviews/bidw_QandABillInmon_gg.xml [2]
• http://www.kmworld.com/Articles/Editorial/Feature/Data-warehousing-from-end-to-end-9101.aspx [3]
• http://tdwi.org/articles/2008/12/17/major-data-warehousing-events-of-2008-and-predictions-for-2009.aspx [4]
• Han, J. and M. Kamber. Data Mining: Concepts and Techniques. (2001) [5]
• http://www.dwinfocenter.org/ [6]
• Special Issue of the International Journal of the Computer, the Internet and Management, Vol. 15 No. SP4, November (2007) [7]
• http://www.information-management.com/infodirect/20011026/4191-1.html [8]
• http://www.carolla.com/wp-dw.htm [9]
• http://www.hinduwebsite.com/webresources/data_warehousing.asp [10]
• http://stanford.edu/dept/itss/docs/oracle/10g/server.101/b10743/bus_intl.htm#i30690 [11]
• http://www.data-miners.com/companion/Chapter15.ppt [12]
• http://www.tdan.com/view-articles/4213 [13]
• http://www.dwreview.com/DW_Overview.html [14]
• http://dataminingwarehousing.blogspot.com/ [16]
• http://www.kenorrinst.com/dwpaper.html [17]