intro to data mining
TRANSCRIPT
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 1/27
Introduction toSoftware Engineering
NOTES
Self-Instructional Material 3
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 2/27
Introduction toSoftware Engineering
NOTES
UNIT 1 INTRODUCTION TO
DATAMINING &
WAREHOUSING (FULL TEXT)
Structure
1.0 Introduction
During the Early Nineties Industries realised that they were not getting
the promised returns on their investment in IT infrastructure. A major
focus of the industrial leaders was to utilise IT as a strategic tool to
maximise profits. They expected IT to leverage their decision making
capability and not merely in terms of obtaining MIS reports which were
primarily routine in nature and did not help them in sifting through
voluminous data and identifying hidden, camouflaged or implied
information . The emphasis therefore shifted from generating MIS reports
to what was termed as KDD( Knowledge Discovery in Databases). KDD by itself involved a large number of components out of which the most
desirable was extraction of useful information or patterns from massive
corporate data and was termed as “Data Mining”. Following text is a
systematic presentation of Data Mining techniques and Data Warehouse
framework which is the repository of the vast, integrated, time variant,
historical & subject oriented data, to be operated upon.
1.1 Unit Objectives
• Explaining the evolution of Data Mining & Warehousing.
• Understanding the components of Data Mining & Warehousing system.
• To study the complementary relationship of Data Mining & Warehousing.
• To learn the steps involved in Implementation of Data Mining &
Warehousing architecture.
• Identifying problems associated with Data Mining & Data Warehousing
Framework.
• To know the role of Data Mining & Warehousing in strategic decision
making & giving a competitive advantage to a business activity.
1.2 Emergence of Data Mining & Warehousing
1.2.1 Business needs drive technology
Data Mining & Warehousing have now become familiar words for not only the
computer professionals but most of the decision makers A rapid growth has taken
place in developing a technology surrounding them, with most of the leading
companies of the world creating products and services to exploit its potential. ItSelf-Instructional Material 4
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 3/27
Introduction toSoftware Engineering
NOTES
therefore brings out an underlying fact before us, that is, any technology results as an
outcome of Business needs. In this case, it was an inescapable and urgent
requirement of the Business community to have a powerful, online, decision making
tool to support and substantiate its own intuitive, thought processes. It has thus
resulted in development of Data Mining & Warehousing technology.
1.2.2 IT solutions for Strategic Decision making
Information Technology has emerged as a powerful business driver and an essential
component for the companies, to give them a competitive advantage in the
challenging market scenario. Each and every aspect of IT is being closely examinedand integrated into the business activity. But by far the very existence of the industry
depends on the strategic decisions taken by its top echelon, because these are the
once which are being translated by its middle level and operational staff into actions.
Data Mining & Warehousing IT solutions have taken Strategic Decision making
beyond the realm of conventional MIS (Management Information System) and OIS
(Operational Information System) boundaries.
1.2.3 Focus on the End Users
Data Mining & Warehousing , as we have seen has been developed for the Managers
& top level executives to assist them in reaching decisions based not only on facts
and figures seen superficially, but drawing inferences from hidden and widely
dispersed uncorrelated data. The focus of the system is on End Users and meeting
their requirements by offering simple interfaces. It is essential to keep the Front End
tools simple, less complicated and user friendly. The experts have to create a system
keeping in view the technical limitations of the End user, their lack of understanding
of system capabilities in the initial stages and support them to achieve the desired
result.
1.3 Evolution of Data mining & Warehousing
Data mining & Warehousing has grown to its full potential today in a number
of clearly demarcated stages. Prior to 1970s we had the flat files & databases.
The emphasis then was on Data Collection & Database Creation. A manager
was able to use them only for simplifying his/her day to day task.
These were further developed into DBMS from 1970s to Early1980s, offering
advantages like concurrent, shared or distributed access, ensuring the
Self-Instructional Material 5
Information Systems
Operational
Information System
Management
Information System
ERP, CRM, SCM
etc
DSS, EIS,Expert
Systems etc
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 4/27
Introduction toSoftware Engineering
NOTES
consistency and providing information security, providing us the backbone for
the Reports and supporting the conventional MIS and OIS features.
Later on the DBMS followed three growth paths. These were:
(a)Advanced database Systems – Developed from1980s till present.
They included Object–relational, Spatial, Temporal, Biological Database
systems.
(b) Data Warehousing & Data mining – Evolved from 1980s till present as Knowledge Discovery & Data Mining, Data warehouse &
OLAP technologies. Their full potential was not realised due to a large
number of limitations of the then Hardware, Networking and software
constraints.
(c) Web Based Database Systems(1990s till present) were based on
phenomenal reach of Internet, XML Based systems, Web technology
and Web mining.
1980 till today
1960-1970 1970-1980 2000 onwards
The latest trend in the decade is Integrated information systems based on
above three paving way for revolutionary development in Decision making in
the corporate world.
Self-Instructional Material 6
Flat Files,
Database
DBMS
Advanced
Databases
Data
Mining
Web
Database
Integrated
Database
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 5/27
Introduction toSoftware Engineering
NOTES
Self-Instructional Material 7
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 6/27
Introduction toSoftware Engineering
NOTES
1.4 INTRODUCTION TO DATA
MINING
1.4.1 What is Data mining
Data Mining means locating, identifying and finding unforeseen information
from a large data base. The information is one which is interesting to the end
user. It can also be understood as data analysis based on searching or learning
dependent on deduction.
1.4.2 What is Interestingness of Pattern?
A data pattern discovered through a data base search is considered
interesting, if it is easily understood, is valid on new or test data with some
degree of uncertainty, potentially useful and is novel. Interesting patterns are
identified by objective parameters which are combined with the subjective
requirements to reflect the needs and interests of a particular user.
1.4.3 Data mining & Knowledge Discovery;
Difference between Data Mining & Knowledge Discovery in Data bases
How are they different? Data Mining is devoted specifically to the processes
involved in extraction of useful information by applying specific techniques
based on certain knowledge domains. These are say, based on statistics,
Artificial Intelligence, and so on. While Knowledge discovery is a wide term
and is the entire range of activities right from deciding Business objectives,
Capturing desired data, preparing ,processing, arranging it, applying predefined
techniques and then presenting them in an understandable form to the user, To
say specifically Knowledge discovery can be sub-divided into Four specific
steps which are performed repetitively till the desired result is reached, one of them is Data Mining.
1. Data Processing comprising of Data Selection, Data Cleaning,
Data Integration
2. Data Transformation & organising in a form ready for fast access
3. Data Mining( DM Engine) and other techniques like OLAP/
OLTP for searching and extraction.
4 Knowledge presentation methods through Graphical User
Interface ( GUI).
5. Analysing the Result and Assimilating it in a knowledge domain
Following diagram refers:
Self-Instructional Material 8
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 7/27
Introduction toSoftware Engineering
NOTES
We can thus consider Data mining as a subset of Knowledge Discovery.
1.4.4 Nature of Data to be mined – Operational & Analytical
Self-Instructional Material 9
Data Processing
Data Transformation
Data Mining Engine
Knowledge Presentation
Through GUI
Result Analysis
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 8/27
Introduction toSoftware Engineering
NOTES
Data Mining is an essential step towards the creation of Information systems. These
are Operational Information Systems like Enterprise Resource Planning ( ERP) or as
Management Information Systems including Decision support Systems The DSS
systems assist managers in taking decisions based on available unstructured data and
validate their intuitive judgements. OIS & DSS each has its own requirement of Data
Structures and Databases.
The Data in turn is categorised as Operational data which is dynamic in nature and
meets short term goals. Analytical data has a longer time span and supports intuitive
decisions. Operational Database supports Transaction processing through On LineTransaction Processing (OLTP) Queries. Analytical Database meets On Line
Analytical Processing (OLAP) requirements of Decision Support Systems (DSS).
Differences in DB requirement differences for OLTP & DSS
Characteristic DB for OLTP DB for OLAP Needs
1.Nature of content Dynamic Static
2. Time span Current Historical
3. Time measured Implicit, Implied Explicit & mentioned
4. Level of Detail Primitive/ Detailed Detailed & Derived/Granularity
5. Update cycle Real time Periodic, planned
6. Tasks Known Pattern, Repetitive Unpredictable
7. Response Time bound Flexible
1.4.5 OLTP & OLAP
Having seen the Database requirement of OIS & DSS let us differentiatethe Query systems associated with each. These are OLTP & OLAP. OLTP
fulfills the requirements of OIS well, as the Queries are simple in nature .
OLAP, on the other hand, addresses the needs of defining more complex
queries and requires novel Databases in the form of Multi Dimensional &
Multi Relational Databases ( MDDB & MRDB respectively) to provide
the back end.
Features of both OLAP & OLTP are compared below
Feature OLTP OLAP
1. Meant for OIS MIS/DSS
2. Purpose Supports Transaction For Analysis
3. End User Operations Level, DB Specialists Knowledge worker
4. Function Daily operations Long term needs
Self-Instructional Material 10
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 9/27
Introduction toSoftware Engineering
NOTES
5. DB Design ER based, Star/snowflake schemas
Application oriented Subject oriented
6. Data Current, up-to-date Historical,
7. Summarization Primitive, Highly detailed Aggregated
8. View Relational Multidimensional
Multi-Relational
9. Work unit Short, simple transaction Complex query
10. Access Mode Both Read/Write Mostly read
11. Based on Data inputs Derived Information
12. Operations Operation on primary key Multiple scans
13. Number of records accessed Few Many
14. Number of Users Large Number Selected
15. DB Size In MB /GB In over 100GB to TB
16. Priority High performance & availability High flexibility,
End user autonomy
17. Measure Transactions throughput Specific Query
Comparison Database & DWH
DB(Transaction Processing) DWH(DSS)
1. Systematic data stored in a prescribed format Collection of
unstructured data
2. DB contains operational data Non-operational
data.
Self-Instructional Material 11
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 10/27
Introduction toSoftware Engineering
NOTES
3. Uses structured language for searching DM tools for
extracting pattern.
4. Keeps normalized data Does not store
normalized data.
Dr E.F. Codd ‘s guidelines for OLAP
OLAP is an essential ingredient of Data Mining. it is therefore essential to
understand the relevance of Dr E.F. Codd’s ( a well known authority on
RDBMS) guidelines. An interpretation of each of them is given below,
relating them to the issues involved:
1. Multidimensional Conceptual view- Business problems are complex
and can be solved only through a Multi dimensional concept as
Normal Queries cannot address them effectively. As such Multidimensional schemas are essential to create relevant databases.
2. Transparency-An end user must be presented a cohesive ,
unambiguous version of data and must not be exposed to complexity
and diversity of data sources.
3. Accessibility- Essential data must be identified and accessed.
4. Consistent reporting performance- Reporting must remain dependable
and reliable even with increase of database size.
5. Client/server architecture- DM and Warehousing systems are created
to meet growing business needs and financial constraints.
6. Generic dimensionality.-Every data dimension has the same
importance.
7. Dynamic sparse matrix handling- Provide capability to keep the
database size within limits by adopting suitable methods of handling
sparse matrices.
8. Multiuser support- The system must permit a large number of users to
be permitted access at the same time.
9. Unrestricted cross-dimensional operations- Multi dimensional schemas
must be well understood, designed and permit cross references.
10. Intuitive data manipulation- OLAP is created for Decision makers to
make intuitive decisions. They are not computer experts and must be
Self-Instructional Material 12
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 11/27
Introduction toSoftware Engineering
NOTES
provided with a user friendly uncomplicated access to generate
Queries.
11. Flexible reporting-The system must be capable of providing reports
desired by the end user.
12. Unlimited dimensions and aggregation levels The system must remain
flexible/ expandable for adding extra dimensions and permit additionalaggregations.
1.4.6 Data Mining a multidisciplinary area
Data mining is a confluence or combination of multiple disciplines. Some of
these are:
1. Information science
2. Database technology
3. Statistics
4. Machine Learning
5. Visualization
6. Other Disciplines
Self-Instructional Material 13
InformationScience
Statistics
Database
Technology
Machine
Learning
Visualisation Other Sciences
Data
Mining
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 12/27
Introduction toSoftware Engineering
NOTES
Successful Development of Data Mining System would thus require
joint efforts from experts of different domains.
1.4.7 Classification of Data mining systems
Data mining development of special algorithms to answer queries of various users.
The procedure is to evolve a number of models and to match one of them to data
stored in the database. Three steps involved in this process are: creating a model,
Find out the criteria to give Preference of a model over others and identify the search
technique.
Data Mining models being mathematical in nature are classified as Predictive and
Descriptive.
(a) A Predictive model spells in advance, the values a data may assume
,based on known results from other data stored in the database.A predictive model performs data mining tasks of Classification, Time series
Analysis, regression and Prediction.
(b) A descriptive model based on identification & relationships in data. The
descriptive model aims to discover rather than predict the properties of
data.
A descriptive model performs data mining tasks comprising of Clustering ,
Summarisation, association Rules and Sequence Discovery.
1.4.8 Data Mining TasksThe Basic tasks under Predictive and Descriptive models are:
Predictive Model
(a) Classification -Data is mapped into predefined groups or classes. Also
termed as supervised learning as classes are established prior to
examination of data.
(b) Regression- Mapping of data item into known type of functions. These
may be linear, logistic functions etc.
Self-Instructional Material 14
Data Mining
Models
Predictive Model Descriptive Model
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 13/27
Introduction toSoftware Engineering
NOTES
(c) Time Series Analysis- Value of an attribute are examined at evenly
spaced times, as it varies with time.
(d) Prediction- It means fore telling future data states based on past and
current data.
Descriptive Model
(a) Clustering- It is referred as unsupervised learning or
segmentation/partitioning. In clustering groups are not pre-defined.
(b) Summarisation- Data is mapped into subsets with simple descriptions .
Also termed as Characterisation or generalisation.
(c) Sequence Discovery- Sequential analysis or sequence discovery utilised
to find out sequential patterns in data. Similar to association but
relationship is based on time.
(d) Association Rules- A model which identifies specific types of data
associations.
1.4.9 Data Mining Primitives
A data mining task is expressed in the form of a DMQL statement and requires
certain primitives to be stated. These are:
(a)Task-relevant data- It mentions the part of the database to be examined.
(b)Nature of Knowledge to be mined- It defines the tasks or functions to be
performed on the data. Examples are Characterisation, Association, Clustering.
(c)Background knowledge- It means here the concept hierarchy as they indicate the
level of abstraction at which data is to be mined.
(d)Interestingness measures- These are defined for the task or function to be
performed. Example ,For Association rule the Support and Confidence Factors are
measured corresponding to threshold levels specified by the users as a measure of
interestingness.
(e)Presentation & Visualisation of discovered patterns- They refer to the ways in
which the result obtained can be displayed for the convenience of the user.
1.4.10 Data Mining Query Language (DMQL)
Data mining systems are required to support ad hoc and interactive requirements
of knowledge discovery from Relational Database and Multiple levels of
abstraction. Data mining languages are designed to meet this requirement. They
help us in formulating a query to define a data mining task primitives.
The primitives require:
(a) Set of task –relevant data to be mined.
(b) Nature of knowledge to be mined.
(c) Background knowledge required for the discovery.
(d) Measures of Interestingness
(e) Visualisation representation.
DMQL follows a SQL like syntax which is amenable for linking with
Relational Query languages and simplifies a users task of knowledge extraction
easier.
1.4.11 Integration of Data mining systems
A diverse number of Data mining tools may be available in an
organisation. It is essential to identify and categorise them to understand which
Self-Instructional Material 15
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 14/27
Introduction toSoftware Engineering
NOTES
model and tasks they support. Also find out if any data mining tools are being
developed in-house. Display them on the GUI of the Client Desk top to select
the right data mining tool for the problem in hand.
1.4.12 Major issues of Data mining
These are mentioned below:
(a) Human Interaction.
(b) Over-fitting
(c) Outliers.
(d) Interpretation of Results
(e) Visualisation of Results.
(f) Large Datasets
(g) High Dimensionality.
(h) Multimedia data.
(i) Missing Data.
(j) Irrelevant Data.
(k) Noisy Data.
(l) Changing data.
(m) Integration.
(n) Application
1.5 Introduction to Data Warehousing( DWH)
1.5.1 What is Data Warehouse
W.H. Inmon , well known as Father of the data warehouse concept defines “ A
Data warehouse is a subject oriented, integrated, non-volatile and time-variant
collection of data in support of management’s decisions”.
Where subject oriented means database is organised in a data warehouse on a
subject wise manner even at the expense of redundancy. Thus every manager
would have access to desired information in the shortest possible time not withstanding the extra space occupied by it.
Integrated implies, related database tables created in the form of Fact &
Dimension tables can be linked to each other and are not stored as stand alone
data resources.
Non destructible means storage of data on a permanent, non-volatile basis. It can
only be purged or removed only as an exception as an organisational need.
Time variant requires all data to be entered in the data warehouse to be time stamped
or associated with its time of entry. The time element introduced may not be the
actual time when data entered the operational system.
Data Warehouse(DWH) Block Diagram
Self-Instructional Material 16
External
Data
Admin &
Mgt. tools
Information
Delivery
S stem
Data
mining
Tools
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 15/27
Introduction toSoftware Engineering
NOTES
DATA MARTS
1.5.2 Data Warehouse Building Blocks
Data Pre-processing Tools
It means sourcing, acquisition, cleaning and transformation of data prior to its
entry into a Data warehouse Data repository. The data is received from legacy
systems, Web or other external sources. The data and the database from where itis received would itself be heterogenous and it requires:
(a) Removal of unwanted data
(b) Converting to common data & definition names
(c) Summarising the data
(d) Completing missing data
Operational Data Store
The data is transformed and loaded into the operational data store(ODS)in real
time frame. From the ODS it is loaded into the Data warehouse after extraction ,cleaning operations at regular intervals but not as and when received from
external sources. As such a time of entry is attached with it. The data thus
available is loaded under the control of Metadata.
Metadata
The metadata is data about data and keeps information as
(a) Technical metadata- containing, Sources of data, data structure,
transformation description, rules specified during data processing,
access authorisations & back up history.
(b) Business Metadata- Contains information about Subject areas,
information object types, Internet home pages, Information delivery system
details- that is when to despatch information and to whom, Data warehouse
operational information & Ownership details.
Data Warehouse Database
It is the central database consisting of Data Warehouse RDBMS,A large
Repository and supporting databases like Multi Relational Database, Multi
Dimensional Database & Data marts.
Data Mart
Data mart is another important component of the Data warehouse and is a data
store that is subsidiary to a data warehouse. It is created to meet specific
information needs of different functional area managers. Data marts are a partSelf-Instructional Material 17
Transforma
-tion Tools
OperationalData Store
Metadata
Data warehouse
DBMS & Data
Repository
MRDB
MDDB
OLAP
Tools
Report,
QueryTools
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 16/27
Introduction toSoftware Engineering
NOTES
of the data warehouse database and cannot be taken as an alternative for a data
warehouse.
Management & Administration Tools
They are provided to :
1. Managing & updating of metadata
2. Backup & recovery.
3. Removal of unwanted data
4. Security & assigning priorities5. Quality checks.
6. Distribution of data.
Access Tools
They are categorised as:
1.Query & reporting Tools
2.Application Tools- to meet specific user requirements.
3. Data Mining tools- To discover knowledge, Data visualisation and
Correcting data when the input data is incomplete.
4.OLAP Tools- These are associated with multidimensional databases to
provide elaborate, complex views for analysis
Information Delivery System
It provides an external interface to provide Data Warehouse reports
information objects to external users as per a specified schedule.
1.5.3 Granularity of Data
It means the level of detail or summarisation at which data is stored in a data
warehouse. Larger the granularity less will the detail be available for those data
item. Vice versa also holds good. A data warehouse manager is required to
identify the granularity of data for any organisation so that reports of the
requisite detail are available.
An example is maintenance of the details of each and every call made by a
mobile user by Telecom Operators to provide a high level of details (Low
level of granularity) to meet legal requirements at a later stage.Granular Data offers the advantage of reusability of data by other users and also
help in optimising the storage space.
1.5.4 Multidimensional Data Models & Schemas
Data warehouses and OLAP tools are based on what is known as a
Multidimensional model. Data is visualised as a Data cube in such model
identified by Fact & Dimension tables.
Facts are the numerical measures of a central theme. For example a Student.
The measures may be Marks_obtained, Division_Scored.
Dimensions are the entities with respect to which the organisation keeps its
records. For example Teacher, Subject, class, college, university etc.
Concept Hierarchies
It is a method of defining a sequence of identifying levels for each
entity.Example is a City, District, State and a country.
Schemas
While Entity- Relationship model was found adequate in the design of
Relational Databases, a data warehouse requires a subject oriented schema for better
analysis and handling more complex queries.
Three schemas are therefore created to meet the data warehouse requirements. These
are:
Self-Instructional Material 18
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 17/27
Introduction toSoftware Engineering
NOTES
Star schema
Most common model. In which a large central Fact table containing maximum data
without duplication or redundancy is stored. A large number of Dimension tables are
referred by it . Each of these handles a dimension. Refer to example given:
Snowflake schema
It is an extension of Star schema. The dimension tables are further extended to extra
tables.Diagram below gives an Example:
Constellation SchemSelf-Instructional Material 19
Dimensio
n Table
Key D
Fact table
Contains
Keys say
A,B,C, D &measures
Dimension
Table
Key B
Dimension
Table
Key A
Dimension
Table
Key C
Dimension
Table
Key D
Dimensio
n Table
Key D
Fact table
Contains
Keys say
A,B,C, D &
measures
Dimension
Table
Key B
Dimension
Table
Key A
Dimension
Table
Key C
Key E
Dimension
Table
Key D
Key F
Dimension
Table
Key F
Dimension
Table
Key E
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 18/27
Introduction toSoftware Engineering
NOTES
It has Multiple fact tables to meet the requirements of more advanced
applications.The fact tables are permitted to share Dimension tables. Example given
below refers:
1.5.5 Data Warehouse design
A Data Warehouse design consists of :
1. Choosing a business process to model. For example Orders, Invoices,
Shipments etc.
2. Choose a DWH for a large organisation while select a Datamart for
departmental implementation.
3. Choose the grain of the business- the fundamental, atomic level of data
to be represented in the Fact table.
4. Choose the Dimensions to be applied to each Fact table.
5. Choose the measures that will populate each fact table record e.g.
Units _sold, Rs_sold.
Self-Instructional Material 20
Dimensio
n Table
Key D
Dimensiontable
Keys A,B,C
\Fact Table 2
Key B,DFact table 1
Key A,C
Dimension
Table
Key C
Key E
Dimension
Table
Key D
Dimension
Table
Key E
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 19/27
Introduction toSoftware Engineering
NOTES
Based on these four principles a nine step method is evolved as under:
1.Choosing the subject matter.
2.Deciding what the Fact table represents
3. Identifying and conforming the dimensions.
4. Choosing the Facts.
5.Storing pre-calculations in the fact table.
6.Rounding out the dimension tables.
7. Choosing the duration of the data bases.
8.The need to track, slowly changing dimensions.
9.Deciding the Query priorities.
1.5.6 Data Warehouse Architecture
Data Warehouse Architecture-
Data Warehouse architecture is based on a RDBMS system server. It has a
massive central repository for storage of data, subsidiary Databases and front
end tools
The architecture consisting of:
1.Bottom Tier- A RDBMS & a DWH Server
2.Middle Tier-OLAP Server
3.Top Tier-Front End Tools
DATA SERVERS APPLICATION SERVERS CLIENTS
Virtual Warehouse
Another commonly used terms is a Virtual server. is a set of views over
operational databases. For efficient query processing only some of the
possible summary views are materialised. It is easy to build but requires
excess capacity on operational database servers.
Developing a Data Warehouse
It consists of:1. Defining a High level corporate data model.
2. Develop an Enterprise Data Warehouse and continue refining it to meet
user requirements.
3. In parallel Develop data marts and refine these models.
1.5.7 ROLAP,MOLAP & HOLAP
These tools utilise specialised data structures to organise, navigate and
analyse data, typically in a aggregated form. They require a tight coupling
with the application and the presentation layer.
Self-Instructional Material 21
GUIData
Warehouse
Database,
Metadata,
Data Logic
MDDB,
MRDB,
Metadata
GUI,
Presentation
Logic,Query
Specificatio
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 21/27
Introduction toSoftware Engineering
NOTES
3. HOLAP use the best features of both i.e. flexibility of ROLAP RDBMS
and the optimised multidimensional structure of MOLAP. Users are provided
ability to perform limited analysis capability either against RDBMS products
or by introducing an intermediate MOLAP server. A user can send a query to
select data from the DBMS which then delivers the requested data to the
desktop where it is placed in a data cube. The desired information is
maintained locally and need not be created each time a query is given.
4. Salient differences to be noted are:
(a) In MOLAP there is no Query directly given by the user to the
DWH Server. The desired Multidimensional data is positioned in
the MOLAP server after a SQL sent by MOLAP server is sent to
the DWH Server.
(b) ROLAP server does not store the intermediate result in a cube but
a Relational table. The user gets his query serviced by the ROLAP
server.
(c) In HOLAP, SQL is sent by user to the DWH server then either theResult is received by it directly or an intermediate MOLAP server
data cube is created and accessed by the user.
(d) ROLAP server does not store the intermediate result in a cube but
a Relational table. The user gets his query serviced by the ROLAP
server.
(e) In HOLAP, SQL is sent by user to the DWH server then either the
Result is received by it directly or through an intermediate
MOLAP server
1.6 Developing a Data Mining & Warehousing framework
1.6.1 Evolving System.
Data Mining & warehouse are not only massive but extremely critical systems
for the entire organisation .They are not to be built in one stroke but to be
gradually evolved as they require a substantial investment from the organisation
in the form of human and financial resources. It is therefore common practice
identify one subject area and build a system for it and then gradually move to
other functional areas. Several measures to curtail the investment are introduced.
One of them is only implementing a Datamart to begin with and then create the
data warehouse repository. It is also essential to select systems which are
upward scalable, keeping the Evolution factor in mind.
1.6.2 SDLC & CLDS
Systems Development Life Cycle (SDLC) is a requirements driven life
Cycle and supports the operational environment . A well known model is
the Waterfall Development .
Self-Instructional Material 23
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 22/27
Introduction toSoftware Engineering
NOTES
The CLDS ( Reverse way of saying SDLC) is the methodology followed for
developing a data mining application. It is a reverse way as the end user being a
manager and not a technocrat does not at an outset realise the potential of the
system and its decision support capability. The user expects the Technical experts to
present the available data, identify suitable algorithms and test the results.
Self-Instructional Material 24
Coding
Feasibility
Analysis
Design
Testing
Integration
Implementation
Implement
Warehouse
Integrate data
Test for
information bias
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 23/27
Introduction toSoftware Engineering
NOTES
1.6.3 Selection of Hardware of Data Warehouse
Decision to select hardware is based on:
(a) Existing & Future Business growth plan of the organisation
(b) Data expected to be stored.
(c) Performance expected from system.
(d) Identifying suitable SMP or MPP system
(e) Network Bandwidth for external connectivity
(f) Nature of legacy systems and their interfaces.
(g) Nature of Information Delivery System
(h) Data & Application Servers
(i) Client Desk top system with adequate storage & processing power
1.6.4 Selection or Development of Data Mining Tools
(a) Identify the User requirement.
(b) Categorise the model applicable and the associated tasks.
(c) Make or Buy suitable tools
(d) Find out the available data warehouse capabilities and develop an
application based on these.
1.7 Data Mining & Warehousing Challenges
1.7.1 Major issues/challenges to Data mining
These are:
1. Mining different kinds of knowledge in databases.
2. Interactive mining of knowledge at multiple levels of abstraction.
3. Incorporation of background knowledge.
4. Data mining Query languages and ad hoc data mining.
5. Presentation and visualisation of data mining results.
6. Handling noisy or incomplete data.Self-Instructional Material 25
Develope program
for data
Design DSS
System
Analyze result
Understand
requirements
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 24/27
Introduction toSoftware Engineering
NOTES
1.7.2 Data Warehouse- Open issues & research problems
Data Warehouse being an evolving system has many issues which are
open for further study & research. These are in the areas of:
1. Security- The Data warehouse contains full details of organisational
information, some of which is highly confidential. Data warehouse
provides access to many employees to access its contents. Special
efforts are t required to maintain safety, security and confidentiality of
this valuable and sensitive information residing in it.
2. Performance- Very large data bases requires their storage on multi-
processors and not conventional computers. These fall under the
category of Massively Parallel Processors and Symmetric
Multiprocessors. Both are upgradable and permit Very large database
to be partitioned and accessed at fast speeds. Features like Shared
memory protection and dynamic load balancing are in-built in them.
Designing scalable systems is an important Factor.
3. Functionality-Modification of accessing methods is a major challengeto Data warehouse managers. Developing novel Data mining
algorithms & OLAP for meeting User’s latest requirements are
essential features of extracting interesting data from warehouse.
4. Presentation- Data visualisation techniques have to be constantly
improved for the end users to utilise full capability of the system, draw
inferences and analyse the results.
1.8 Impact of Data Mining & Warehousing
1. Collecting, cleaning, transforming data from heterogeneous sources
2. Integrating data from dissimilar sources. Examples of Heterogeneous
& Dissimilar sources are legacy systems, World wide web,
organisations ERPs.
3. Storing data in a systematic manner but easily accessible manner.
Examples are Creating appropriate Fact & relation tables, Organising
data marts and Partitioning of database.
4. Organising data to facilitate –Manipulation, processing of Very largedata. A good Metadata design is a prime factor in attaining this
objective.
5 Simplify Exploration of massive data. Creation of Data mart are
essential for meeting this goal.
6 Arrange Data for interpretation & pattern recognition.Efficient Data
Mining Algorithms are necessary to achieve it.Self-Instructional Material 26
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 25/27
Introduction toSoftware Engineering
NOTES
7 Presenting data in a readable manner for the managers by designing
proper GUI.
8 Business Decision making based on Data mining & other access tools
1.9 Summary- Overview of Data Mining & Warehousing
1. Data Mining & Data warehousing principles were identified during early 1990s
but were implemented only after a decade due to non availability of:
(a) Suitable hardware at an affordable cost, to support parallel processing, providefast speeds and for storage of massive database.
(b) Operating systems to support parallel architectures.
(c) DBMS to manage very large database.
(d) Network bandwidth for interconnectivity.
(e) Suitable data mining algorithms.
(f) Visualisation & presentation tools.
2.The Data warehouse provides a platform for capturing, refining, integrating and
transforming data received from diverse sources. It is then stored in subject wise
Form in a central repository accessed through a metadata mechanism.3. The data is stored in a relational form by creating Fact & Dimension tables
connected through specialised schemas based on concept hierarchies. These are Star,
Snowflake& Fact Constellation schema.
4. The data is further organised as MRDB & MDDB which may also be considered
as part of Data marts created for different users to meet their specific requirements.
5. For producing useful reports Data mining, OLAP & Query tools and Application
tools are utilised.
6. Any result requires an effective presentation. GUI and Visualisation tools are made
available for the managers for them to assimilate and analyse the results for speedy
decision making.
7.Data mining & warehousing is constantly evolving and has to adopt a framework which is flexible and absorb the organisational and technological changes
1.10 Answer to ‘Check your Progress’
Question 1: In what way Data warehouse and Data Mining complement each other.
Question 2: Data Warehouse & Data Mining principles were formulated in early1990s but implemented only after a decade. Why?
Questions3: Differentiate between Operational Data Base, Operational Data
Store & Data Warehouse Repository.
Question 4: What is the function of “Information Delivery System”?
Self-Instructional Material 27
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 26/27
Introduction toSoftware Engineering
NOTES
Question 5 Differences between Data Mining, OLAP & Query Tools?
1.11 Exercises and Questions
1.11.1 Short-Answer Type Questions
Q1: Explain how the evolution of Database technology lead to Data
Mining?
Q2: Describe the steps involved in Data mining when viewed as a process
of knowledge discovery?
Q3: What are Data Mining models and the tasks associated with them?
Q4: What are the Data Mining issues. In what way do they affect
implementation of a Data mining system?
Q5: What is meant by interestingness of data as related to data mining?
1.11.2 Long-Answer type Questions
Q1:Define the data mining functionalities.
Q2:What is the difference between :
(a) Discrimination and Classification
(b) Characterization & Clustering
(c) Between Classification & Prediction?
(d) ROLAP & MOLAP
(e) MDDB & MRDB
(f) Data Warehouse & Data Mart.
Q3. Describe three challenges to data mining regarding Data mining
methodology and user interaction issues.
Q4: Describe two challenges to data mining regarding performance
issues:
Tip: Performance issues are:
(a) Efficiency & scalability of data mining algorithm.
(b) Parallel, Distributed and incremental mining algorithm.Q5: Describe issues related to diversity of data base types.
Tip (a) Handling of relational and complex types of data.
(b) Mining information from heterogeneous databases and
global information systems
1.12 Further Reading
Self-Instructional Material 28
8/8/2019 Intro to Data Mining
http://slidepdf.com/reader/full/intro-to-data-mining 27/27
Introduction toSoftware Engineering
NOTES
1. Data Mining Concepts & Techniques- Jiawei Han and Michelene
Kamber, Second Edition, Elsevier,2006
2. Data Warehousing, Data mining & OLAP – Alex Berson, Stephen J.
Smith, Tata McGraw Hill,2004
3. Building the Data warehouse – W.H. Inmon, Third Edition,
Wiley,2009
Self-Instructional Material 29