TERM PAPER
ON
NEW TRENDS IN DATA WAREHOUSING
PRESENTED
BY
KOYA TEMITOPE ABAYOMI
(ACU/597)
IN PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR
TOPICS IN I.C.T
(ICT 4107)
April, 2010
TABLE OF CONTENTS
TITLE
TABLE OF CONTENTS
ABSTRACT
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: ARCHITECTURE
CHAPTER THREE: TRENDS IN DATA WAREHOUSING
REFERENCES
ABSTRACT
“Although data warehousing has greatly matured as a technology discipline over the past ten
years, enterprises that undertake data warehousing initiatives continue to face fresh challenges
that evolve with the changing business and technology environment. The data warehouse is
being called on to support new initiatives, such as customer relationship management and supply
chain management, and has also been directly impacted by the rise of e-business. Data
warehousing vendors have developed new and more sophisticated technologies and have
acquired and merged with other vendors. The number of homegrown and packaged software
implementations throughout the average enterprise has grown rapidly, creating more data sources
and information delivery options. With all of the activity surrounding data warehousing, it is
hard to sort out which issues and trends are most pressing for enterprises. To that end, this term
paper presents insights into the latest trends in data warehousing.” [8]
CHAPTER ONE: INTRODUCTION
[10]According to W. H. Inmon, a data warehouse is “a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision-making process”.
According to Ralph Kimball, it is “a copy of transaction data, specifically structured for query
and analysis”. In other words, a data warehouse is a copy of transaction data specifically
structured for querying, analysis, reporting, and more rigorous data mining. Note that the data
warehouse contains a copy of the transactions, which is not updated or changed later by the
transaction system. Also note that this data is specially structured, and may have been
transformed when it was copied into the data warehouse.
[1]A data warehouse is a repository of an organization’s electronically stored data. This definition
of data warehousing focuses on data storage. However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered essential
components of a data warehousing system. Thus, an expanded definition of data warehousing
includes business intelligence tools, tools to extract, transform and load data into the repository,
and tools to manage and retrieve metadata. Data warehousing arises from an organization’s need
for reliable, consolidated and integrated analysis of its data at different levels of aggregation.
The practical reality of most organizations is that their data infrastructure is made up of a
collection of heterogeneous systems.
HISTORY
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry
Devlin and Paul Murphy developed the “business data warehouse”. In essence, the data
warehousing concept was intended to provide an architectural model for the flow of data from
operational systems to decision support environments. The concept attempted to address the
various problems associated with this flow, mainly the high costs associated with it. In the
absence of data warehousing architecture, an enormous amount of redundancy was required to
support multiple decision support environments. In larger corporations it was typical for multiple
decision support environments to operate independently. Each environment served different
users but often required much of the same stored data. The process of gathering, cleaning and
integrating data from various sources, usually from long-term existing operational systems
(usually referred to as legacy system) was typically in part replicated for each environment.
Moreover, the operational systems were frequently reexamined as new decision support
requirements emerged. Often new requirements necessitated gathering, cleaning and integrating
new data from ‘data marts’ that were tailored for ready access by users. (A data mart is a subset
of an organizational data store, usually oriented to a specific purpose or major data subject, that
may be distributed to support business needs. Data marts are analytical data stores designed to
focus on specific business functions for a specific community within an organization. They are
often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse
design methodology the warehouse is created from the union of organizational data marts.) Key
developments in the early years of data warehousing were:
• 1960s – General Mills and Dartmouth College, in a joint research project, developed the
terms ‘dimensions’ and ‘facts’.
• 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
• 1983 – Teradata introduces a database management system specifically designed for
decision support.
• 1988 – Barry Devlin and Paul Murphy published the article “An architecture for a
business and information system” in the IBM Systems Journal, where they introduced the
term “business data warehouse”.
• 1990 – Red Brick Systems introduced Red Brick Warehouse, a database management
system specifically for data warehousing.
• 1991 – Prism Solutions introduced Prism Warehouse Manager, software for developing a
data warehouse.
• 1991 – Bill Inmon published the book “Building the Data Warehouse”.
• 1995 – The Data Warehousing Institute, a for-profit organization that promotes data
warehousing, is founded.
• 1996 – Ralph Kimball published the book “The Data Warehouse Toolkit”.
• 1997 – Oracle 8, with support for star queries, is released.
• 1998 – Microsoft releases Microsoft Analysis Services (then OLAP Services), heavily
utilizing data warehouse schemas.
CHAPTER TWO: ARCHITECTURE
[6]Architecture in the context of an organization’s data warehousing efforts is a conceptualization
of how the data warehouse is built. There is no right or wrong architecture but rather there are
multiple architectures that exist to support various environments and situations. The worthiness
of the architecture can be judged from how the conceptualization aids in the building,
maintenance, and usage of the data warehouse. One possible simple conceptualization of data
warehouse architecture consists of the following interconnected layers:
Operational database layer
The source data for the data warehouse – an organization’s Enterprise Resource Planning
systems fall into this layer.
Data access layer
The interface between the operational and informational access layer – Tools to extract,
transform, load data into the warehouse fall into this layer.
Metadata layer
The data directory – This is usually more detailed than an operational system data directory.
There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can
be accessed by a particular reporting and analysis tool.
Informational access layer
The data accessed for reporting & analyzing and the tools for reporting & analyzing data –
Business intelligence tools fall into this layer. The Inmon–Kimball differences about design
methodology have to do with this layer.
THE MAJOR STEPS IN DEVELOPING A DATA WAREHOUSE [7]
Identify the data source
The very first step, before you start to develop the data warehouse, is to identify the data
sources. You need to figure out what data are required to be put into your data warehouse. For a
library data warehouse, there are two types of data sources that need to be considered: internal
and external data sources. The internal data source is the data that already exists in the library
system. The external data source is the data that does not exist within the library system
(Nicholson, 2003).
Build a customized ETL tool
Each data warehouse has different requirements. Therefore, a customized ETL tool is a better
solution for fulfilling those requirements. For the library data warehouse, we write our own
extract program, deal with inconsistency issues using our own transformation method, and
finally load the data into the data warehouse database.
Extraction
This can be the most time-consuming part, where you need to pull the data from various data
sources and store it in the staging database. Much time and effort are needed to write a custom
program that transfers the data from the sources into the staging database. As a result, during
extraction, we need to determine which database system will be used for the staging area and
also figure out which data are needed before grabbing them. The decline in the cost of hardware
and storage has eased earlier concerns about data duplication and about lack of storage for
excessive or unnecessary data. However, there is still no reason to store unnecessary data that
has been identified as not useful in the decision-making process. Therefore, it is necessary to
extract only the relevant data before bringing it into the data warehouse (Mallach, 2000).
Transformation
After extracting the data from various data sources, transformation is needed to ensure data
consistency. In order to load the data into the data warehouse properly, you need to figure out a
way of mapping the external data source fields to the data warehouse fields. Transformation can
be performed during data extraction or while loading the data into the data warehouse. This
integration can become a complex issue as the number of data sources grows.
Loading
Once the extraction, transformation and cleansing have been done, the data are loaded into the
data warehouse. The loading of data can be categorised into two types: the loading of the data
currently contained in the operational database, and the loading of updates to the data warehouse
from changes that have occurred in the operational database. To guarantee the freshness of the
data, the data warehouse needs to be refreshed periodically. Many issues need to be considered,
especially when loading updates into the data warehouse. While updating the data warehouse,
we need to ensure that no data are lost and that the overhead of scanning existing files is kept to
a minimum.
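The extract, transform and load steps described above can be sketched as a minimal pipeline. This is only an illustration under stated assumptions: the source rows, field names (isbn, copies, book_id, copy_count) and the in-memory "warehouse" list are all invented for the example.

```python
# Minimal ETL sketch: extract rows from a hypothetical library source,
# transform them to the warehouse's field names, then load them.

source_rows = [  # stands in for rows pulled from the operational database
    {"isbn": "978-0", "title": "Data Warehousing", "copies": "3"},
    {"isbn": "978-1", "title": "OLAP Basics", "copies": "5"},
]

def extract(rows):
    """Keep only the relevant fields before staging (Mallach, 2000)."""
    return [{"isbn": r["isbn"], "copies": r["copies"]} for r in rows]

def transform(rows):
    """Map source fields to warehouse fields and fix data types."""
    return [{"book_id": r["isbn"], "copy_count": int(r["copies"])} for r in rows]

def load(rows, warehouse):
    """Append the transformed rows to the warehouse store."""
    warehouse.extend(rows)

warehouse = []  # stands in for the data warehouse database
staged = extract(source_rows)
load(transform(staged), warehouse)
print(warehouse[0]["copy_count"])  # -> 3
```

In a real system each stage would read from and write to databases rather than lists, but the shape of the pipeline is the same.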
AN OVERVIEW OF DATA WAREHOUSE INFRASTRUCTURE [14]
COMPONENTS OF A DATA WAREHOUSE [13]
Overall Architecture
The data warehouse architecture is based on a relational database management system server that
functions as the central repository for informational data. Operational data and processing is
completely separated from data warehouse processing. This central information repository is
surrounded by a number of key components designed to make the entire environment functional,
manageable and accessible by both the operational systems that source data into the warehouse
and by end-user query and analysis tools.
Typically, the source data for the warehouse comes from the operational applications. As the
data enters the warehouse, it is cleaned up and transformed into an integrated structure and
format. The transformation process may involve conversion, summarization, filtering and
condensation of data. Because the data contains a historical component, the warehouse must be
capable of holding and managing large volumes of data as well as different data structures for the
same database over time. The seven major components of data warehousing are:
Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing environment.
This database is almost always implemented on the relational database management system
(RDBMS) technology. However, this kind of implementation is often constrained by the fact that
traditional RDBMS products are optimized for transactional database processing. Certain data
warehouse attributes, such as very large database size, ad hoc query processing and the need for
flexible user view creation including aggregates, multi-table joins and drill-downs, have become
drivers for different technological approaches to the data warehouse database. These approaches
include:
• Parallel relational database designs for scalability that include shared-memory, shared-
disk, or shared-nothing models implemented on various multiprocessor configurations
(symmetric multiprocessors or SMP, massively parallel processors or MPP, and/or
clusters of uni- or multiprocessors).
• An innovative approach to speed up a traditional RDBMS by using new index structures
to bypass relational table scans.
• Multidimensional databases (MDDBs) that are based on proprietary database technology;
conversely, a dimensional data model can be implemented using a familiar RDBMS.
Multi-dimensional databases are designed to overcome any limitations placed on the
warehouse by the nature of the relational data model. MDDBs enable on-line analytical
processing (OLAP) tools that architecturally belong to a group of data warehousing
components jointly categorized as the data query, reporting, analysis and mining tools.
Sourcing, Acquisition, Cleanup and Transformation Tools
A significant portion of the implementation effort is spent extracting data from operational
systems and putting it in a format suitable for informational applications that run off the data
warehouse. The data sourcing, cleanup, transformation and migration tools perform all of the
conversions, summarizations, key changes, structural changes and condensations needed to
transform disparate data into information that can be used by the decision support tool. They
produce the programs and control statements, including the COBOL programs, MVS job-control
language (JCL), UNIX scripts, and SQL data definition language (DDL) needed to move data
into the data warehouse from multiple operational systems. These tools also maintain the meta
data. The functionality includes:
• Removing unwanted data from operational databases
• Converting to common data names and definitions
• Establishing defaults for missing data
• Accommodating source data definition changes
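The first three functions in the list above can be illustrated with a small cleanup pass. The column names, the rename map and the default values here are hypothetical, invented purely for the sketch:

```python
# Hypothetical cleanup pass over extracted records: remove unwanted
# fields, convert to common data names, and establish defaults for
# missing data.

COMMON_NAMES = {"cust_nm": "customer_name", "cust_no": "customer_id"}
DEFAULTS = {"country": "US"}
UNWANTED = {"internal_flag"}

def clean(record):
    out = {}
    for field, value in record.items():
        if field in UNWANTED:
            continue                                  # remove unwanted data
        out[COMMON_NAMES.get(field, field)] = value   # common names/definitions
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)                # defaults for missing data
    return out

row = {"cust_nm": "Ada", "cust_no": 7, "internal_flag": True}
print(clean(row))  # -> {'customer_name': 'Ada', 'customer_id': 7, 'country': 'US'}
```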
The data sourcing, cleanup, extract, transformation and migration tools have to deal with some
significant issues including:
• Database heterogeneity. DBMSs are very different in data models, data access language,
data navigation, operations, concurrency, integrity, recovery etc.
• Data heterogeneity. This is the difference in the way data is defined and used in different
models - homonyms, synonyms, unit compatibility (U.S. vs metric), different attributes
for the same entity and different ways of modeling the same fact.
These tools can save a considerable amount of time and effort. However, significant
shortcomings do exist. For example, many available tools are generally useful for simpler data
extracts. Frequently, customized extract routines need to be developed for the more complicated
data extraction procedures.
Meta data
Meta data is data about data that describes the data warehouse. It is used for building,
maintaining, managing and using the data warehouse. Meta data can be classified into:
• Technical meta data, which contains information about warehouse data for use by
warehouse designers and administrators when carrying out warehouse development and
management tasks.
• Business meta data, which contains information that gives users an easy-to-understand
perspective of the information stored in the data warehouse.
Equally important, meta data provides interactive access to users to help understand content and
find data. One of the issues dealing with meta data relates to the fact that many data extraction
tool capabilities to gather meta data remain fairly immature. Therefore, there is often the need to
create a meta data interface for users, which may involve some duplication of effort.
Meta data management is provided via a meta data repository and accompanying software. Meta
data repository management software, which typically runs on a workstation, can be used to map
the source data to the target database; generate code for data transformations; integrate and
transform the data; and control moving data to the warehouse.
As users' interactions with the data warehouse increase, their approaches to reviewing the results
of their requests for information can be expected to evolve from relatively simple manual
analysis for trends and exceptions to agent-driven initiation of the analysis based on user-defined
thresholds. The definition of these thresholds, configuration parameters for the software agents
using them, and the information directory indicating where the appropriate sources for the
information can be found are all stored in the meta data repository as well.
Access Tools
The principal purpose of data warehousing is to provide information to business users for
strategic decision-making. These users interact with the data warehouse using front-end tools.
Many of these tools require an information specialist, although many end users develop expertise
in the tools. Tools fall into four main categories: query and reporting tools, application
development tools, online analytical processing tools, and data mining tools.
Query and Reporting tools can be divided into two groups: reporting tools and managed query
tools. Reporting tools can be further divided into production reporting tools and report writers.
Production reporting tools let companies generate regular operational reports or support high-
volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand,
are inexpensive desktop tools designed for end-users.
Managed query tools shield end users from the complexities of SQL and database structures by
inserting a metalayer between users and the database. These tools are designed for easy-to-use,
point-and-click operations that either accept SQL or generate SQL database queries.
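The metalayer idea can be sketched as a tiny query builder that turns point-and-click selections of business terms into SQL. The business terms, table names and join shown are assumptions made up for the example, not the API of any real tool:

```python
# Sketch of a managed query tool's metalayer: business terms chosen in
# a point-and-click interface map to physical columns, and the tool
# generates the SQL so the end user never writes it.

METALAYER = {
    "Customer": "dim_customer.customer_name",
    "Revenue": "fact_sales.revenue",
}

def build_query(selected_terms,
                table="fact_sales JOIN dim_customer USING (customer_id)"):
    """Generate a SQL query from the user's selected business terms."""
    cols = ", ".join(METALAYER[t] for t in selected_terms)
    return f"SELECT {cols} FROM {table}"

sql = build_query(["Customer", "Revenue"])
print(sql)
# -> SELECT dim_customer.customer_name, fact_sales.revenue FROM fact_sales JOIN dim_customer USING (customer_id)
```

Real managed query tools also handle filters, joins across many tables and security, but the principle — a mapping layer between business vocabulary and database structures — is the same.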
Often, the analytical needs of the data warehouse user community exceed the built-in capabilities
of query and reporting tools. In these cases, organizations will often rely on the tried-and-true
approach of in-house application development using graphical development environments such
as PowerBuilder, Visual Basic and Forte. These application development platforms integrate
well with popular OLAP tools and access all major database systems including Oracle, Sybase,
and Informix.
OLAP tools are based on the concepts of dimensional data models and corresponding databases,
and allow users to analyze the data using elaborate, multidimensional views. Typical business
applications include product performance and profitability, effectiveness of a sales program or
marketing campaign, sales forecasting and capacity planning. These tools assume that the data is
organized in a multidimensional model.
A critical success factor for any business today is the ability to use information effectively. Data
mining is the process of discovering meaningful new correlations, patterns and trends by digging
into large amounts of data stored in the warehouse using artificial intelligence, statistical and
mathematical techniques.
Data Marts
The concept of a data mart is causing a lot of excitement and attracting much attention in the
data warehouse industry. Mostly, data marts are presented as an alternative to a data warehouse
takes significantly less time and money to build. However, the term data mart means different
things to different people. A rigorous definition of this term is a data store that is subsidiary to a
data warehouse of integrated data. The data mart is directed at a partition of data (often called a
subject area) that is created for the use of a dedicated group of users. A data mart might, in fact,
be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed
on the data warehouse rather than a physically separate store of data. In most instances, however,
the data mart is a physically separate store of data and is resident on separate database server,
often a local area network serving a dedicated user group. Sometimes the data mart simply
comprises relational OLAP technology which creates a highly denormalized dimensional model
(e.g., star schema) implemented on a relational database. The resulting hypercubes of data are
used for analysis by groups of users with a common interest in a limited portion of the database.
These types of data marts, called dependent data marts because their data is sourced from the
data warehouse, have a high value because no matter how they are deployed and how many
different enabling technologies are used, different users are all accessing the information views
derived from the single integrated version of the data.
Unfortunately, the misleading statements about the simplicity and low cost of data marts
sometimes result in organizations or vendors incorrectly positioning them as an alternative to the
data warehouse. This viewpoint defines independent data marts that in fact, represent fragmented
point solutions to a range of business problems in the enterprise. This type of implementation
should be rarely deployed in the context of an overall technology or applications architecture.
Indeed, it is missing the ingredient that is at the heart of the data warehousing concept -- that of
data integration. Each independent data mart makes its own assumptions about how to
consolidate the data, and the data across several data marts may not be consistent.
Moreover, the concept of an independent data mart is dangerous -- as soon as the first data mart
is created, other organizations, groups, and subject areas within the enterprise embark on the task
of building their own data marts. As a result, you create an environment where multiple
operational systems feed multiple non-integrated data marts that are often overlapping in data
content, job scheduling, connectivity and management. In other words, you have transformed a
complex many-to-one problem of building a data warehouse from operational and external data
sources to a many-to-many sourcing and management nightmare.
Data Warehouse Administration and Management
Data warehouses tend to be as much as four times as large as related operational databases,
reaching terabytes in size depending on how much history needs to be saved. They are not
synchronized in real time to the associated operational data but are updated as often as once a
day if the application requires it.
In addition, almost all data warehouse products include gateways to transparently access multiple
enterprise data sources without having to rewrite applications to interpret and utilize the data.
Furthermore, in a heterogeneous data warehouse environment, the various databases reside on
disparate systems, thus requiring inter-networking tools. The need to manage this environment is
obvious.
Managing data warehouses includes security and priority management; monitoring updates from
the multiple sources; data quality checks; managing and updating meta data; auditing and
reporting data warehouse usage and status; purging data; replicating, subsetting and distributing
data; backup and recovery and data warehouse storage management.
Information Delivery System
The information delivery component is used to enable the process of subscribing for data
warehouse information and having it delivered to one or more destinations according to some
user-specified scheduling algorithm. In other words, the information delivery system distributes
warehouse-stored data and other information objects to other data warehouses and end-user
products such as spreadsheets and local databases. Delivery of information may be based on time
of day or on the completion of an external event. The rationale for the delivery systems
component is based on the fact that once the data warehouse is installed and operational, its users
don't have to be aware of its location and maintenance. All they need is the report or an
analytical view of data at a specific point in time. With the proliferation of the Internet and the
World Wide Web such a delivery system may leverage the convenience of the Internet by
delivering warehouse-enabled information to thousands of end-users via the ubiquitous world
wide network.
In fact, the Web is changing the data warehousing landscape since at the very high level the
goals of both the Web and data warehousing are the same: easy access to information. The value
of data warehousing is maximized when the right information gets into the hands of those
individuals who need it, where they need it and when they need it most. However, many corporations
have struggled with complex client/server systems to give end users the access they need. The
issues become even more difficult to resolve when the users are physically remote from the data
warehouse location. The Web removes a lot of these issues by giving users universal and
relatively inexpensive access to data. Couple this access with the ability to deliver required
information on demand and the result is a web-enabled information delivery system that allows
users dispersed across continents to perform a sophisticated business-critical analysis and to
engage in collective decision-making.
NORMALIZED VERSUS DIMENSIONAL APPROACH FOR STORAGE OF DATA
[16]There are two leading approaches to storing data in a data warehouse – the dimensional
approach and normalized approach.
In the dimensional approach, transaction data are partitioned into either “facts”, which are
generally numeric transaction data, or “dimensions”, which are the reference information that
gives context to the facts. For example, a sales transaction can be broken up into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, customer name, product number, order ship-to & bill-to location, and salesperson
responsible for receiving the order. A key advantage of a dimensional approach is that the data
warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data
warehouse tends to operate very quickly. The main disadvantages of dimensional approach are:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse
with data from different operational systems is complicated.
2. It is difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree,
database normalization rules. Tables are grouped together by subject areas that reflect general
data categories (e.g. data on customers, products, finance etc.). The main advantage of this
approach is that it is straightforward to add information into the database. A disadvantage of this
approach is that because of the number of tables involved it can be difficult for users both to:
1. Join data from different sources into meaningful information.
2. Access the information without a precise understanding of the sources of data and of the
data structure of the data warehouse.
These approaches are not mutually exclusive, and there are other approaches. The dimensional
approach, for example, can involve normalizing data to a degree.
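The sales example above can be made concrete with a minimal dimensional (star-schema) sketch: a fact table holds the numeric measures, and dimension tables give them context. All table and column names here are invented for illustration:

```python
# Minimal star-schema sketch: a fact table of numeric measures keyed
# to dimension tables that supply context. All names are hypothetical.

dim_product = {1: {"name": "Widget"}, 2: {"name": "Gadget"}}
dim_date = {20100401: {"month": "April", "year": 2010}}

fact_sales = [  # facts: numeric transaction data plus dimension keys
    {"product_id": 1, "date_id": 20100401, "qty": 3, "amount": 30.0},
    {"product_id": 2, "date_id": 20100401, "qty": 1, "amount": 99.0},
]

# A typical analytical query: revenue per product, resolved by
# following the fact rows' keys into the product dimension.
revenue = {}
for row in fact_sales:
    name = dim_product[row["product_id"]]["name"]
    revenue[name] = revenue.get(name, 0.0) + row["amount"]

print(revenue)  # -> {'Widget': 30.0, 'Gadget': 99.0}
```

The same query against a fully normalized schema would typically require joining many more tables, which is the usability trade-off the section describes.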
CONFORMING INFORMATION
Another important decision in designing a data warehouse is which data to conform and how to
conform the data. For example, one operational system feeding data into the data warehouse may
use “M” and “F” to denote the sex of an employee, while another operational system may use
“Male” and “Female”. Though this is a simple example, much of the work in implementing a
data warehouse is devoted to making data with similar meanings consistent when they are stored
in the data warehouse. Typically, extract, transform and load tools are used in this work. Master
data management has the aim of conforming data that could be considered “dimensions”.
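The sex-code example above can be sketched as a conforming lookup applied during the transform step; the mapping table and record fields are the only assumptions:

```python
# Conform employee sex codes from two source systems to one
# warehouse-standard coding during transformation.

CONFORM_SEX = {"M": "M", "F": "F", "Male": "M", "Female": "F"}

def conform(record):
    """Return a copy of the record with the sex code conformed."""
    record = dict(record)
    record["sex"] = CONFORM_SEX[record["sex"]]
    return record

system_a = {"employee": "Ada", "sex": "F"}         # system using "M"/"F"
system_b = {"employee": "Grace", "sex": "Female"}  # system using "Male"/"Female"

print(conform(system_a)["sex"], conform(system_b)["sex"])  # -> F F
```

Real conforming work covers many such attributes at once, usually driven by mapping tables maintained in the metadata repository rather than hard-coded dictionaries.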
TOP-DOWN VERSUS BOTTOM-UP DESIGN METHODOLOGIES [11]
Bottom-up Design
Ralph Kimball, a well-known author in data warehousing, is a proponent of an approach to data
warehouse design frequently considered as bottom-up. In this approach, data marts are first
created to provide reporting and analytical capabilities for specific business processes. Data
marts contain atomic data and, if necessary, summarized data; these data marts can eventually be
unioned together to create a comprehensive data warehouse. The combination of data marts is
managed through the implementation of what Kimball calls “a data warehouse bus architecture”.
Business value can be returned as quickly as the first data marts can be created. Maintaining tight
management over the data warehouse bus architecture is fundamental to maintaining the integrity
of the data warehouse. The most important management task is making sure dimensions among
data marts are consistent. In Kimball’s words, this means that the dimensions “conform”.
Top-Down Design
Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data
warehouse as a centralized repository for the entire enterprise. Inmon is one of the leading
proponents of the top-down approach to data warehouse design in which the data warehouse is
designed using a normalized enterprise data model. “Atomic” data, which is data at the lowest
level of detail, are stored in the data warehouse. Dimensional data marts containing data needed
for specific business processes or specific departments are created from the data warehouse. In
the Inmon vision, the data warehouse is at the center of the “Corporate Information Factory”
(CIF) which provides a logical framework for delivering business intelligence and business
management capabilities. Inmon states that the data warehouse is:
• Subject-oriented: the data in the data warehouse is organized so that all the data elements
relating to the same real-world event or object are linked together.
• Non-volatile: data in the data warehouse is never overwritten or deleted; once written, the
data is static, read-only and retained for future reporting.
• Integrated: the data warehouse contains data from most or all of an organization’s
operational systems, and this data is made consistent.
The top-down design methodology generates highly consistent dimensional views of data across
data marts since all data marts are loaded from the centralized repository. Top-down design has
also proven to be robust against business changes. Generating new dimensional data marts
against the data stored in the data warehouse is a relatively simple task. The main disadvantage
of the top-down methodology is that it represents a very large project with a very broad scope.
The up-front cost for implementing a data warehouse using the top-down methodology is
significant and the duration of time from the start of project to the point that end users
experience initial benefits can be substantial. In addition, the top-down methodology can be
inflexible and unresponsive to changing departmental needs during the implementation phases.
Hybrid Design
Over time, it has become apparent to proponents of bottom-up and top-down data warehouse
design that both methodologies have benefits and risks. Hybrid methodologies have evolved to
take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data
consistency of top-down design.
DATA WAREHOUSE VERSUS OPERATIONAL SYSTEMS
Operational systems are optimized for preservation of data integrity and speed of recording of
business transactions through use of database normalization and an entity-relationship model.
Operational system designers generally follow the Codd rules of database normalization in
order to ensure data integrity. Codd defined five increasingly stringent rules of normalization;
fully normalized database designs (that is, those satisfying all five Codd rules) often result in
information from a business transaction being stored in dozens to hundreds of tables. Relational
databases are efficient at managing the relationships between the tables; the databases have very
fast insert/update performance because only a small amount of data in those tables is affected
each time a transaction is processed. Finally, in order to improve performance, older data(s) are
usually periodically purged from operational systems. Data warehouses are optimized for speed
of data analysis. Frequently, data in data warehouses are de-normalized via a dimension based
model; also, to speed data retrieval, data warehouse data are often are often stored multiple times
in their most granular form and in summarized forms called aggregates. Data warehouse data are
gathered from the operational systems and held in the data warehouse even after the data has
been purged from the operational systems.
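The contrast above can be sketched concretely. The following is a minimal illustration, with hypothetical table names and made-up sample data, using SQLite in place of a real warehouse DBMS: a denormalized, dimension-based fact table plus a pre-computed summary table ("aggregate").

```python
import sqlite3

# A minimal sketch (hypothetical tables, made-up data): denormalized
# warehouse storage with a pre-computed aggregate alongside granular rows.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized fact table: each row repeats dimension attributes
# (region, product), so analysis queries need no multi-table joins.
cur.execute("""CREATE TABLE sales_fact (
    sale_date TEXT, region TEXT, product TEXT, amount REAL)""")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [("2009-01-05", "East", "Widget", 100.0),
                 ("2009-01-06", "East", "Widget", 150.0),
                 ("2009-01-06", "West", "Gadget", 200.0)])

# A summarized form ("aggregate") stored in addition to the granular
# rows, so common reports read a tiny table instead of rescanning facts.
cur.execute("""CREATE TABLE sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM sales_fact GROUP BY region""")

print(cur.execute("SELECT region, total FROM sales_by_region "
                  "ORDER BY region").fetchall())
# -> [('East', 250.0), ('West', 200.0)]
```

In a normalized operational schema the same query would join the fact rows to separate region and product tables; the warehouse trades storage space for read speed.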
EVOLUTION IN ORGANIZATION USE
[3] Organizations generally start off with relatively simple use of data warehousing. Over time
more sophisticated use of data warehousing evolves; the following general stages of use of the
data warehouse can be distinguished.
Offline operational database
Data warehouses at this initial stage are developed by simply copying the data off an operational
system to another server, where the processing load of reporting against the copied data does not
impact the operational system’s performance.
Offline data warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis
and the data warehouse data is stored in a data structure designed to facilitate reporting.
Real time data warehouse
Data warehouses at this stage are updated every time an operational system performs a
transaction (e.g. an order, a delivery, or a booking).
Integrated data warehouse
Data warehouses at this stage are updated every time an operational system performs a
transaction. The data warehouses then generate transactions that are passed back into the
operational systems.
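The "offline data warehouse" stage above can be sketched as a periodic extract-transform-load (ETL) refresh. All names here (the `orders` and `orders_dw` tables, the `refresh` function) are hypothetical, and two in-memory SQLite databases stand in for the operational system and the warehouse:

```python
import sqlite3

# Hypothetical sketch of the "offline data warehouse" stage: data is
# copied from an operational table into a reporting-friendly warehouse
# table on a regular schedule, rather than on every transaction.
op = sqlite3.connect(":memory:")   # stands in for the operational system
op.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
op.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "acme", 50.0), (2, "ACME ", 75.0)])

wh = sqlite3.connect(":memory:")   # stands in for the data warehouse
wh.execute("CREATE TABLE orders_dw (id INTEGER, customer TEXT, total REAL)")

def refresh():
    # Extract from the operational source, transform (here: normalize the
    # customer name so reporting is consistent), and load into the warehouse.
    rows = op.execute("SELECT id, customer, total FROM orders").fetchall()
    cleaned = [(i, c.strip().upper(), t) for i, c, t in rows]
    wh.execute("DELETE FROM orders_dw")   # full refresh, for simplicity
    wh.executemany("INSERT INTO orders_dw VALUES (?, ?, ?)", cleaned)

refresh()   # in practice this would run on a schedule, e.g. nightly
```

A real-time warehouse would replace the scheduled `refresh()` call with per-transaction propagation, and an integrated warehouse would additionally feed transactions back to the operational systems.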
BENEFITS
[17] Some of the benefits that a data warehouse provides are as follows:
• A data warehouse provides a common data model for all data of interest, regardless of the
data’s source. This makes it easier to report on and analyze information than it would be if
multiple data models were used to retrieve information such as sales, invoices, order
receipts, and general ledger charges.
• Prior to loading data into the data warehouse, inconsistencies are identified and resolved;
this greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse users so that
even if the source system data is purged over time, the information in the warehouse can
be stored safely for extended periods of time.
• Because they are separate from operational systems, data warehouses provide retrieval of
data without slowing down operational systems.
• Data warehouses can work in conjunction with and hence, enhance the value of
operational business applications, notably customer relationship management (CRM)
systems.
• Data warehouses facilitate decision support system applications such as trend reports (e.g.
the items with the most sales in a particular area within the last two years), exception
reports, and reports that show actual performance versus goals.
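The trend report mentioned in the last bullet can be sketched as a simple aggregate query. The table name and sample data below are hypothetical, with SQLite standing in for the warehouse:

```python
import sqlite3

# A minimal sketch (hypothetical data) of a decision-support trend report:
# the items with the most sales in a particular area over a period.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INTEGER, area TEXT, "
            "item TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(2008, "East", "Widget", 30), (2009, "East", "Widget", 45),
     (2009, "East", "Gadget", 20), (2009, "West", "Widget", 99)])

# Top-selling items in the East area over the last two years.
report = con.execute("""
    SELECT item, SUM(qty) AS total
    FROM sales
    WHERE area = 'East' AND year >= 2008
    GROUP BY item
    ORDER BY total DESC""").fetchall()
print(report)   # -> [('Widget', 75), ('Gadget', 20)]
```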
DISADVANTAGES
Some disadvantages of using data warehouse are:
• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed and loaded into the warehouse, there is an
element of latency in data warehouse data.
• Over their life, data warehouses can have high costs.
• Data warehouses can get outdated relatively quickly; there is a cost of delivering
suboptimal information to the organization.
• There is often a fine line between data warehouses and operational systems. Duplicate,
expensive functionality may be developed. Or, functionality may be developed in the
data warehouse that, in retrospect, should have been developed in the operational
systems and vice versa.
SAMPLE APPLICATIONS
Some applications data warehousing can be used for are:
• Credit card churn analysis
• Insurance fraud analysis
• Call record analysis
• Logistics management amongst others
DATA WAREHOUSE APPLIANCE
[9] A data warehouse appliance consists of an integrated set of servers, storage, operating systems,
DBMS and software specifically pre-installed and pre-optimized for data warehousing. Data
warehouse appliances provide solutions for the mid-to-large volume data warehousing market,
offering low-cost performance, most commonly on data volumes in the terabyte to petabyte
range, and are designed to facilitate reporting and analysis. Examples of data
warehouse appliances are:
1. Teradata
2. Kognitio
3. Kickfire
4. Vertica
5. EXASOL
6. Paraccel amongst others.
BENEFITS OF DATA WAREHOUSE APPLIANCES
• Reduction in costs
• Parallel performance
• Reduced administration
• Built-in high availability
• Scalability
• Rapid time-to-value
APPLICATION USES
Data warehousing appliances provide solutions for many analytic application uses, including:
• Enterprise data warehousing
• Super-sized sandboxes that isolate power users with resource-intensive queries
• Pilot projects or projects requiring rapid prototyping and rapid time-to-value
• Off-loading projects from the enterprise data warehouse, such as large analytical query
projects that affect the overall workload of the enterprise data warehouse
• Applications with specific performance or loading requirements
• Data marts that have outgrown their present environment
• Turnkey data warehouses or data marts
• Solutions for applications with high data-growth and high-performance requirements
• Applications requiring data warehouse encryption
TRENDS IN DATA WAREHOUSING APPLIANCES
The data warehousing appliance market has begun to shift in several areas as it evolves:
• Vendors have started moving toward using commodity technologies rather than
proprietary assembly of commodity components.
• Implemented applications show usage expansion from tactical and data-mart solutions to
strategic and enterprise data-warehouse use.
• Mainstream vendor participation has become apparent as of 2009.
• With a lower total cost of ownership, reduced maintenance, and high performance for
business analytics on growing data volumes, most analysts believe that data
warehousing appliances will gain market share, though Teradata maintains its
leadership position.
CHAPTER THREE: TRENDS IN DATA WAREHOUSING
[2] According to Bill Inmon, in an interview granted to Sun Systems, there are several important
trends in data warehousing today, which include the following:
• The first is the growth of data in the data warehouse environment.
• The awareness that unstructured data belongs in the data warehouse. There is a wealth of
information in unstructured data that simply never makes it into the data warehouse, but it
should.
• The fact that metadata needs to be tightly integrated within the data warehouse. Metadata
has always been loosely associated with the data warehouse but never tightly integrated,
and that is a mistake.
• Some people have a need for what is termed a “global warehouse”: one that is
constructed from a collection of local data warehouses. Large multinational
corporations in particular have a need for a global data warehouse.
MAJOR DATA WAREHOUSING EVENTS OF 2008 [4]
Everyone Had An Appliance Story: With the acceptance of data warehouse appliances as part
of an overall data warehousing architecture, major vendors (in addition to start-ups) have jumped
on the appliance bandwagon. Column-oriented database vendors such as ParAccel and Vertica
have partnered with hardware vendors to support and market data warehouse appliances, and
Oracle has partnered with HP to produce the HP Oracle Database Machine. Even Teradata, a
company that for many years dismissed data warehouse appliances as niche technology that
could only lead to multiple versions of truth, has acknowledged the appliance concept with
several platforms of its own and is now bragging that it was the original data warehouse
appliance vendor.
Industry Consolidations Continued: Acquisitions in 2008 included data warehouse appliance
specialist DatAllegro by Microsoft, Identity Systems by Informatica, specialty analytics vendor
NuTech by Netezza, IDeaS Revenue Optimization and natural language processing specialist
Teragram by SAS, and open source database vendor MySQL AB by Sun Microsystems.
Furthermore, although announced in 2007, IBM’s acquisition of Cognos and SAP’s acquisition
of Business Objects were both completed in January 2008.
The Recessionary Environment Encouraged Further BI Deployments: As companies sought
ways to maintain profitability in the face of a deteriorating economy, they recognized the value
of business intelligence for discovering new revenue opportunities, identifying areas of potential
cost reductions, and reducing fraud. While IT expenditures were closely watched and, in many
cases, reduced, many BI projects actually had their priorities increased.
Open Source Grew: Supporting cost reduction initiatives, open source business intelligence,
database, and data integration technology showed substantial uptake in 2008, to the point
where open source offerings, especially products that have formal (albeit extra-cost) support, are
now making inroads into accounts that previously would not consider them. In many cases, free
open source technology was initially used for prototype deployments with organizations
upgrading to commercial versions with formal support when the prototypes were placed in
production.
Major Trends for 2009
Trend #1: Further Industry Consolidation
Acquisitions will remain a fact of life in the data warehousing industry. Since the economy is
now officially in a recession, some vendors will be open to being acquired if only to ensure their
survival. Other, more established, vendors may simply succumb to offers they, or their
stockholders, simply can’t refuse.
If I had to pick two likely targets, my guess would be Informatica, perhaps by HP in order to
augment its data warehousing technology portfolio, and SPSS, perhaps by SAP as Business
Objects sells (as an OEM) SPSS predictive analytics technology for its BusinessObjects XI
platform. Although neither of these two companies is experiencing major financial problems,
their technologies would make attractive additions to the technology portfolio of potential
acquirers.
Trend #2: Cloud Computing will Come Down to Earth
Continued pressure to reduce expenses will serve as a catalyst for organizations both large and
small to utilize cloud computing as an alternative to obtaining and funding in-house resources.
Although small companies may use this as their primary computing platform, large companies
may use cloud computing for incremental, perhaps one-time, projects. BI vendors not already
offering on-demand software will establish a cloud presence to better compete in the small-to-
midsize business (SMB) market.
Trend #3: Open Source Growth will Accelerate
Economic pressure will accelerate the growth of open source technology as well, especially as
open source has now established itself in production deployments. Because many vendors are
utilizing open source technology in their applications in order to reduce costs or, as in the case of
several data warehouse appliance vendors, partnering with open source business intelligence and
data integration vendors to offer a more complete solution, the growth will be seen in both
standalone and embedded environments.
Trend #4: The IT World Will Become Greener
The peak in energy costs earlier this year provided a strong incentive for organizations to
consider becoming “greener” for cost savings as well as more altruistic environmental reasons.
Organizations will look to minimize the energy costs associated with their hardware and consider
both direct power consumption as well as associated costs such as air conditioning in their
technology evaluations. This will further drive virtualization efforts to maximize the utilization
of existing hardware.
Trend #5: Major Emphasis on Solutions Rather than Tools and Technology
The need to quickly address business concerns as well as compliance requirements will drive
organizations to seek customizable analytic applications rather than to build them from scratch
with Business Intelligence tools. Business Intelligence vendors will respond to this demand with
additional vertical and functional analytic applications, some of which may be obtained through
the acquisition of their current partners. Furthermore, vendors such as Oracle and SAP will
continue to enhance the analytic functionality of their operational enterprise applications.
CONCLUSION
Judging from the trends of 2008 and the predictions for 2009, appliances will be on the decline, and IT
companies will try to invest in other markets in order to stay afloat; where a
company cannot keep up, industry consolidation follows. Data warehousing trends seem
likely to remain the same, but to expand in order to accommodate the rapid growth of data being stored
in the data warehouse by enterprises, especially in the area of open source technology, which will
see data warehouse appliance vendors partnering with open source business intelligence and
data integration vendors to offer more complete solutions.
Even against this backdrop of anxiety and budget constraints caused by the economic
recession, business intelligence, data warehousing and data integration will continue to grow.
Some will argue that because business intelligence and data warehousing provide such business
value they will be exempted from budget cuts, but that will not be the case.
REFERENCES:
• www.wikipedia.org/datawarehouse/ [1]
• www.sun.com/solutions/documents/interviews/bidw_QandABillInmon_gg.xml [2]
• http://www.kmworld.com/Articles/Editorial/Feature/Data-warehousing-from-end-to-end-9101.aspx [3]
• http://tdwi.org/articles/2008/12/17/major-data-warehousing-events-of-2008-and-predictions-for-2009.aspx [4]
• Han, J. and M. Kamber. Data Mining: Concepts and Techniques. (2001) [5]
• http://www.dwinfocenter.org/ [6]
• Special Issue of the International Journal of the Computer, the Internet and Management, Vol. 15 No. SP4, November (2007) [7]
• http://www.information-management.com/infodirect/20011026/4191-1.html [8]
• http://www.carolla.com/wp-dw.htm [9]
• http://www.hinduwebsite.com/webresources/data_warehousing.asp [10]
• http://stanford.edu/dept/itss/docs/oracle/10g/server.101/b10743/bus_intl.htm#i30690 [11]
• http://www.data-miners.com/companion/Chapter15.ppt [12]
• http://www.tdan.com/view-articles/4213 [13]
• http://www.dwreview.com/DW_Overview.html [14]
• http://dataminingwarehousing.blogspot.com/ [16]
• http://www.kenorrinst.com/dwpaper.html [17]