data mining and data warehousing

M.Sc. Computer Science Data Mining

The secret of success is to know something nobody else knows - Aristotle Onassis

DATA MINING

Introduction

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Major issues in data mining

April 10, 2023. Module I: Data Mining and Warehousing

Introduction

Data is growing at a phenomenal rate

Users expect more sophisticated information

How?

© Prentice Hall

Uncover hidden information: data mining

Evolution of Database Technology

1960s: Data collection, database creation, data management (primitive file processing)

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases


What Is Data Mining?

Data mining (knowledge discovery in databases): the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Data mining refers to the discovery of new information, in terms of patterns or rules, from vast amounts of data.


Why Data Mining? From a Managerial Perspective

Strategic Decision Making

Wealth Generation

Analyzing trends

Security


Database Processing vs. Data Mining Processing

Database processing:
• Query: well defined; SQL
• Data: operational data
• Output: precise; a subset of the database

Data mining processing:
• Query: poorly defined; no precise query language
• Data: not operational data
• Output: not a subset of the database

Query Examples

Database:
• Find all customers who live in Boa Vista
• Find all customers who use Mastercard
• Find all customers who missed one payment

Data Mining:
• Find all customers who are likely to miss one payment (Classification)
• List all items that are frequently purchased with bicycles (Association rules)
• Find any "unusual" customers or behavior (e.g., phone calls) (Outlier detection, anomaly discovery)


Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.

Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.


Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.


[Figure: the KDD process. Labels: databases, data cleaning, data integration, data warehouse, selection and transformation, task-relevant data, data mining, pattern evaluation.]

Steps of a KDD Process

Data Cleaning: Remove noise and inconsistent data.

Data Integration: Combine multiple data sources.

Data Selection: Retrieve the data relevant to the analysis task from the database.

Data Transformation: Convert the data to a common format, or consolidate it into forms appropriate for mining, by performing aggregation or summary operations.

Data Mining: Apply algorithms to extract patterns from the data.

Pattern Evaluation: Identify the patterns that represent knowledge, based on interestingness measures.

Knowledge Presentation: Present the mined knowledge to end users in an understandable form, for example through visualization.



Architecture of a Typical Data Mining System

[Figure: architecture of a typical data mining system. Components: databases, data warehouse, and WWW (with data cleaning, integration, and selection); database or data warehouse server; data mining engine; pattern evaluation; graphical user interface; knowledge base.]

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.


Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.


Relational Databases

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.


Data Mining: On What Kind of Data?

Data warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

[Figure: data sources in Chicago, New York, and Toronto are cleaned, integrated, transformed, loaded, and refreshed into the data warehouse, which clients access through query and analysis tools.]

Multidimensional Data

Sales volume as a function of product, month, and region.

[Figure: a data cube with dimensions address (cities), time (quarters), and item (types).]

A Sample Data Cube

[Figure: sample data cube. Time (quarters): 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Item: TV, comp, phone, sum. Address (cities): Chicago, New York, Toronto, sum. One marked cell holds the total annual sales of TV in Chicago for the past four quarters.]

Total sales: 182

Product   Sales
Pen       120
Honey     12
Pencil    50

Store   Sales
1       102
2       80

Store   Product   Sales
1       Pen       90
1       Honey     12
2       Pencil    50
2       Pen       30
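The per-product, per-store, and grand-total figures are all aggregations of the same (store, product) data. The following is a minimal Python sketch of that roll-up; the names `base` and `roll_up` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# Base data from the slide: (store, product) -> sales
base = {
    (1, "Pen"): 90,
    (1, "Honey"): 12,
    (2, "Pencil"): 50,
    (2, "Pen"): 30,
}

def roll_up(cells, keep):
    """Sum sales over every dimension whose index is not listed in `keep`."""
    totals = defaultdict(int)
    for key, sales in cells.items():
        totals[tuple(key[i] for i in keep)] += sales
    return dict(totals)

by_store = roll_up(base, keep=[0])    # {(1,): 102, (2,): 80}
by_product = roll_up(base, keep=[1])  # {('Pen',): 120, ('Honey',): 12, ('Pencil',): 50}
grand_total = roll_up(base, keep=[])  # {(): 182}
```

Each result matches one of the smaller tables, which is the sense in which a data cube stores the same facts at several levels of summarization.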

Transactional databases

Object-oriented and object-relational databases

Spatial databases: contain space related information

Time-series data and temporal data: Time related attributes

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW


Trans_ID List of Items

T100 11,13,15,16

T200 12,13,18

Data Mining Functionalities (1)

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

A data mining task can be predictive or descriptive.

Concept/Class description: Characterization and discrimination

Data can be associated with classes or concepts.

The description of a class in summarized, concise, and yet precise terms is called class/concept description.

These descriptions can be derived via
• Data characterization
• Data discrimination
• Both characterization and discrimination

Data characterization is a summarization of the general characteristics or features of a target class of data.


Data corresponding to the user-specified class are typically collected by a database query. For example, a DM system should be able to produce a description summarizing the characteristics of customers who spend more than $1,000 a year.

Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.

For example, a DM system should be able to compare two groups of customers: those who shop for computer products regularly (more than two times a month) and those who rarely shop for such products (i.e., fewer than three times a year).


Data Mining Functionalities (2)

Mining Frequent Patterns, Associations, and Correlations

A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread.

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

X => Y

E.g., buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

Confidence "is a measure of how often the consequent is true when the antecedent is true." Here, if a customer buys a computer, there is a 50% chance that the customer will buy software as well.

"Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule." Here, 1% support means that 1% of all transactions under analysis showed computer and software purchased together.

A rule can involve more than one predicate or attribute.

Association rules that contain a single predicate are referred to as single-dimensional association rules. A rule with several predicates, such as the following, is a multidimensional association rule.

age(X, "20…29") ^ income(X, "20K...29K") => buys(X, "computer") [support = 2%, confidence = 60%]
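Support and confidence follow directly from counting transactions. The sketch below computes both for a rule antecedent => consequent; the function name and the toy transactions are hypothetical, invented only to illustrate the two definitions.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain every item of the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions that contain both the antecedent and the consequent.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]
support, confidence = support_and_confidence(
    transactions, {"computer"}, {"software"})
# support = 2/4 = 0.5, confidence = 2/3
```

Support is measured against all transactions, while confidence is measured only against those that satisfy the antecedent, which is why the two values differ.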



Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

Given a set of items that belong to several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.

The derived model can be represented using
• IF-THEN rules
• Decision trees
• Neural networks, etc.


Data Mining Functionalities (5) - Classification

Classification Process: Model Construction

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

A classification algorithm derives the classifier (model) from the training data:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process: Use the Model in Prediction

Testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
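The prediction step can be sketched in Python by applying the slide's IF-THEN rule to the testing data and then to the unseen tuple. The function name `predict_tenured` is an illustrative choice, and the rank string is assumed to be spelled as in the tables.

```python
def predict_tenured(rank, years):
    """The rule from the slide: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == actual
              for _, rank, years, actual in testing_data)
accuracy = correct / len(testing_data)  # 3 of 4: the rule gets Merlisa wrong

jeff = predict_tenured("Professor", 4)  # "yes"
```

The testing data estimates how well the model generalizes before it is trusted on unseen tuples such as Jeff's.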

age(X, "youth") AND income(X, "high") => class(X, "A")

age(X, "youth") AND income(X, "low") => class(X, "B")

age(X, "middle_aged") => class(X, "C")

age(X, "senior") => class(X, "C")




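[Figure: decision tree. The root tests age?: youth leads to a test on income? (high gives class A, low gives class B); middle_aged or senior gives class C.]

[Figure: neural network with inputs age and income, nodes f1 to f8, and outputs Class A, Class B, and Class C.]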


Data Mining Functionalities (8) - Prediction

Prediction is used to predict missing or unavailable numeric data values rather than class labels.

Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process.


Data Mining Functionalities (9) - Cluster analysis

Similar to classification, but the class label is unknown and it is up to the clustering algorithm to discover acceptable classes.

“Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. ”

Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.

The categories are unspecified; this is referred to as 'unsupervised learning'.



Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.

Intra-class similarity means similarity between objects in the same class.

Inter-class similarity means similarity between objects of different classes.

Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
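As one possible illustration of the clustering principle, the sketch below runs a plain k-means loop on one-dimensional data. K-means is a technique chosen here for illustration (the slides do not name a specific algorithm), and the client ages and starting centers are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical client ages, invented for illustration.
ages = [22, 25, 27, 41, 45, 48, 66, 70]
centers, clusters = k_means_1d(ages, centers=[20.0, 40.0, 60.0])
# clusters == [[22, 25, 27], [41, 45, 48], [66, 70]]
```

No class labels are supplied: the three age groups emerge from the data alone, with members of a group close to each other and far from the other groups.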



Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data.

It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis.

Trend and evolution analysis

Describes and models regularities or trends for objects whose behavior changes over time.


Classification of Data Mining systems: Confluence of Multiple Disciplines



Data Mining: Classification Schemes

Different views, different classifications:

Kinds of databases to be mined: relational, transactional, spatial, etc.

Kinds of knowledge to be discovered: based on DM functionalities such as characterization, discrimination, association, classification, etc.

Kinds of techniques utilized: DM systems can be categorized according to the underlying DM technique employed. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

Kinds of applications adapted: DM systems can also be classified according to the applications they are adapted to: retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.



DATA MINING TASK PRIMITIVES

A data mining task can be specified in the form of a data mining query. A data mining query is defined in terms of the following data mining task primitives.

Task-relevant data: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest.

Kind of knowledge: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

Background knowledge: Knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge; they allow data to be mined at multiple levels of abstraction.


[Figure: concept hierarchy for location. all: India, Canada. India: Tamil Nadu, Kerala. Canada: Ontario, Columbia. Kerala: EKM, TVM. Tamil Nadu: Coimb, Chennai, …]

Interestingness measures and thresholds: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.

Representation for visualizing: This refers to the form in which discovered patterns are to be displayed, such as rules, tables, charts, graphs, decision trees, and cubes.


INTEGRATION OF A DATA MINING SYSTEM WITH A DATABASE OR DATA WAREHOUSE SYSTEM

No coupling: The DM system will not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.

No coupling has several drawbacks. First, a DB system provides a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data. Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.

Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems. It is feasible to realize efficient, scalable implementations using such systems.

Third, most data have been or will be stored in DB/DW systems. Without any coupling to such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling is a poor design.

Loose coupling: The DM system will use some facilities of a DB/DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.



Loose coupling gains some of the flexibility, efficiency, and other features provided by such systems.

However, loosely coupled mining systems are main-memory based. Because mining does not make use of the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.

Semi-tight coupling: Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.

Some frequently used intermediate mining results can also be precomputed and stored in the DB/DW system. This will enhance the performance of a DM system.


Tight coupling: The DM system is smoothly integrated into the DB/DW system.

The data mining subsystem is treated as one functional component of an information system.

Data mining queries and functions are optimized based on query analysis, data structures, indexing schemes and query processing methods of a DB or DW system.

The tight coupling provides a uniform information processing environment.


Major Issues in Data Mining (1)

Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem

Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)

Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information systems (WWW)

Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
• Protection of data security, integrity, and privacy


DATA WAREHOUSE

The main repository of an organization's historical data.

It contains the raw material for management's decision support system.

The term Data Warehouse was coined by Bill Inmon in 1990:

"A DW is a subject oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."


Subject oriented: A DW is organized around major subjects, such as customer, supplier, product, sales etc.

Rather than focusing on day-to-day operations, a DW concentrates on the modeling and analysis of data for decision makers.

It provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.


[Figure: subject areas such as sales, products, and customers.]

Integrated: A DW is usually constructed by integrating multiple heterogeneous sources such as relational databases, flat files, etc.

Data cleaning and data integration techniques are applied


[Figure: savings account and loans account data integrated under the subject "account".]

Time-variant:

The time horizon for the data warehouse is significantly longer than that of operational systems

Operational database: current value data

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)


Nonvolatile:

A physically separate store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.

It does not require transaction processing, recovery, and concurrency control mechanisms.

It requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query-driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
• Complex information filtering

Data warehouse: update-driven, high performance
• Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis


Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making

Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries


A multi-dimensional data model

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
• Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
• A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

The data cube can be n-dimensional.

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
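The lattice can be enumerated directly: each cuboid corresponds to one subset of the dimensions. The sketch below is a minimal illustration that ignores concept hierarchies within a dimension; the function name `cuboids` and the three example dimensions are assumptions made for the example.

```python
from itertools import combinations

def cuboids(dimensions):
    """List every cuboid (subset of dimensions), from the apex (0-D) to the base (n-D)."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

lattice = cuboids(["time", "item", "location"])
# 2**3 = 8 cuboids: () is the apex cuboid,
# ('time', 'item', 'location') is the base cuboid.
```

With n dimensions and no hierarchies there are 2^n cuboids, which is why precomputing every cuboid becomes expensive as dimensions are added.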


Cube: A Lattice of Cuboids


Conceptual Modeling of Data Warehouses

The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema: A fact table in the middle connected to a set of dimension tables.

Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Fact constellation: Multiple fact tables share dimension tables. This can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.


Example of Star Schema


Example of Snowflake Schema


Example of Fact Constellation


A Data Mining Query Language, DMQL: Language Primitives

Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)


Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension location as (location_key, street, city(city_key, province_or_state, country))


Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

Measures of Data Cube

A data cube measure is a numerical function that can be evaluated at each point in the data cube space.

A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.

Measures of Data Cube: Three Categories

Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.

E.g., count(), sum(), min(), max()

Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

E.g., avg(), min_N(), standard_deviation()

Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.

E.g., median(), mode(), rank()
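The three categories of measures can be contrasted on a small partitioned data set. This is a minimal sketch: the two partitions are hypothetical values chosen so that the holistic case visibly differs.

```python
import statistics

# Hypothetical sales values split across two partitions.
part_a = [1, 2, 30]
part_b = [4, 5, 6]
all_data = part_a + part_b

# Distributive: combining the per-partition sums gives the overall sum.
total = sum([sum(part_a), sum(part_b)])  # 48

# Algebraic: avg() is computed from two distributive results, sum() and count().
average = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))  # 8.0

# Holistic: the median of the per-partition medians is generally not the
# overall median, so the full data has to be examined.
median_of_medians = statistics.median(
    [statistics.median(part_a), statistics.median(part_b)])  # 3.5
true_median = statistics.median(all_data)  # 4.5
```

Distributive and algebraic measures can therefore be computed from lower-level cuboids, while holistic measures such as the median cannot be derived from a bounded number of subaggregates.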


Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

Consider the dimension location, with the cities Vancouver, Toronto, New York, and Chicago. Each city can be mapped to the province or state to which it belongs. The province or state can in turn be mapped to a country.
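The city-to-country mapping can be sketched as two lookup tables, one per step of the hierarchy. The dictionary names and the per-city sales figures are hypothetical; only the four cities come from the slide.

```python
# Concept hierarchy for the location dimension:
# city -> province or state -> country.
city_to_province_or_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
province_or_state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize(city):
    """Climb the hierarchy from a city to its country."""
    return province_or_state_to_country[city_to_province_or_state[city]]

# Hypothetical per-city sales, invented for illustration.
sales_by_city = {"Vancouver": 10, "Toronto": 20, "New York": 30, "Chicago": 40}
sales_by_country = {}
for city, amount in sales_by_city.items():
    country = generalize(city)
    sales_by_country[country] = sales_by_country.get(country, 0) + amount
# sales_by_country == {"Canada": 30, "USA": 70}
```

Summing along the mapping in this way is what allows data to be viewed at the city, province or state, or country level.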

A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the location dimension, with levels all, region, country, city, and office. Example values: regions Europe and North_America; countries Germany, Spain, Canada, and Mexico; cities Frankfurt, Vancouver, and Toronto; offices L. Chan and M. Wind.]



Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes

Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
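Several of these operations can be sketched on a cube stored as a Python dictionary keyed by (quarter, city, item). The cell values and function names are hypothetical, and real OLAP servers implement these operations very differently; the sketch only shows what each operation does to the cells.

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, city, item) -> sales
cube = {
    ("Q1", "Chicago", "TV"): 5,
    ("Q1", "Toronto", "TV"): 3,
    ("Q2", "Chicago", "TV"): 7,
    ("Q2", "Chicago", "phone"): 2,
}

def slice_(cells, axis, value):
    """Slice: select on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Dice: select on two or more dimensions; `allowed` maps axis -> set of values."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def roll_up(cells, drop_axis):
    """Roll up by dimension reduction: sum over the dropped dimension."""
    out = defaultdict(int)
    for k, v in cells.items():
        out[k[:drop_axis] + k[drop_axis + 1:]] += v
    return dict(out)

def pivot(cells, order):
    """Pivot: reorder the axes without changing any value."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

q1 = slice_(cube, axis=0, value="Q1")
chicago_tv = dice(cube, {1: {"Chicago"}, 2: {"TV"}})
by_quarter_item = roll_up(cube, drop_axis=1)
city_first = pivot(cube, order=(1, 0, 2))
```

Drill down is the reverse of `roll_up` and needs the more detailed cells to be available, which is why it is not shown as a function of the summarized data.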



Design of Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse:

Top-down view
• allows selection of the relevant information necessary for the data warehouse

Data source view
• exposes the information being captured, stored, and managed by operational systems

Data warehouse view
• consists of fact tables and dimension tables

Business query view
• sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both
• Top-down: Starts with overall design and planning (mature and well known)
• Bottom-up: Starts with experiments and prototypes (rapid)

From the software engineering point of view
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems, with short intervals between successive releases

Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measure that will populate each fact table record


April 10, 2023. Data Mining: Concepts and Techniques

Data Warehouse: A Three-Tier DW Architecture

[Figure: three-tier data warehouse architecture. Bottom tier: data warehouse server, with the data warehouse, data marts, and metadata, fed from operational DBs and other sources through extract, transform, load, and refresh steps and a monitor and integrator. Middle tier: OLAP server. Top tier: front-end tools for analysis, query, reports, and data mining.]

Three Data Warehouse Models

Enterprise warehouse
• collects all of the information about subjects spanning the entire organization

Data mart
• a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
• targeted to meet the needs of small groups within the organization
• independent vs. dependent (directly from warehouse) data mart
• Dependent data mart: a subset that is created directly from a data warehouse
• Independent data mart: a small data warehouse designed for a strategic business unit or a department

Virtual warehouse
• a set of views over operational databases
• only some of the possible summary views may be materialized

Data Warehouse Back-End Tools and Utilities

Data extraction: get data from multiple, heterogeneous, and external sources

Data cleaning: detect errors in the data and rectify them when possible

Data transformation: convert data from legacy or host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh: propagate the updates from the data sources to the warehouse


Data Warehouse Development: A Recommended Approach

The recommended approach is to implement the warehouse in an incremental and evolutionary manner.

First, a high-level corporate data model is defined within a reasonably short period that provides a corporate-wide, consistent, integrated view of data among different subjects.

Second, independent data marts can be implemented in parallel with the enterprise warehouse, based on the same corporate data model set.

Third, distributed data marts can be constructed to integrate different data marts.

[Figure: recommended development approach. Labels: define a high-level corporate data model; model refinement; data marts; enterprise data warehouse; distributed data marts; multi-tier data warehouse.]

Metadata Repository

Metadata is the data defining warehouse objects. It stores:

Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents

Operational metadata
• data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization

Data related to system performance
• warehouse schema, view, etc.

Business data
• business terms and definitions, ownership of data, charging policies


OLAP Server

An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures.

The available types of OLAP server are:

MOLAP server

ROLAP server

HOLAP server


OLAP Server Architectures

Relational OLAP (ROLAP)

These are intermediate servers that stand in between a relational back-end server and client front-end tools

They use a relational or extended-relational DBMS to store and manage warehouse data

Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

Greater scalability than MOLAP


Relational OLAP: 3-Tier DSS

Data Warehouse (Database Layer): Store atomic data in an industry-standard RDBMS.

ROLAP Engine (Application Logic Layer): Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.

Decision Support Client (Presentation Layer): Obtain multidimensional reports.


Multidimensional OLAP (MOLAP)

These servers support multidimensional views of data.

They use an array-based multidimensional storage engine

Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server 2000)

Combines ROLAP and MOLAP technology

Allows large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store



MOLAP: 2-Tier DSS

MDDB Engine (Database Layer and Application Logic Layer): Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, and obtain OLAP functionality via proprietary algorithms running against this data.

Decision Support Client (Presentation Layer): Obtain multidimensional reports.


Data warehouses contain huge volumes of data.

OLAP engines demand that decision support queries be answered on the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques.

A data cube can be viewed as a lattice of cuboids

One approach to cube computation is to use the compute cube operator

The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.

This can require excessive storage space, especially for a large number of dimensions.


DW Implementation-Efficient Data Cube Computation


Data cube can be viewed as a lattice of cuboids

The bottom-most cuboid is the base cuboid

The top-most cuboid (apex) contains only one cell

What is the total number of cuboids, or group-bys, that can be computed for a data cube containing 3 attributes: city, item, year?

2^3 = 8: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}

The apex cuboid contains the total sum of all sales

The base cuboid returns the total sales for any combination of the three dimensions

The base cuboid is the least generalized of the cuboids

The apex cuboid is the most generalized of the cuboids


An SQL query containing no group-by, such as ‘compute the sum of total sales’, is a zero-dimensional operation

An SQL query containing one group-by, such as ‘compute the sum of total sales, group by city’, is a one-dimensional operation

Therefore, the cube operator is the n-dimensional generalization of the group-by operator
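To make the "n-dimensional generalization of group-by" concrete, here is a minimal Python sketch that runs one group-by for every subset of the dimensions. The function name `compute_cube` and the sample sales rows are assumptions for illustration; this is a naive computation, not an optimized cube algorithm.

```python
from itertools import combinations

def compute_cube(rows, dimensions, measure):
    """Return {group_by_dims: {cell_key: sum}} for every subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = {}
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] = cuboid.get(key, 0) + row[measure]
            cube[dims] = cuboid
    return cube

# Hypothetical sample data
sales = [
    {"city": "Vancouver", "item": "TV", "year": 2003, "sales": 100},
    {"city": "Vancouver", "item": "PC", "year": 2004, "sales": 250},
    {"city": "Toronto", "item": "TV", "year": 2004, "sales": 150},
]
cube = compute_cube(sales, ("city", "item", "year"), "sales")
print(len(cube))        # 8 cuboids for 3 dimensions
print(cube[()])         # apex cuboid: {(): 500}
print(cube[("city",)])  # {('Vancouver',): 350, ('Toronto',): 150}
```

The empty tuple `()` plays the role of the apex cuboid, and the full tuple `("city", "item", "year")` is the base cuboid.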


Lattice of cuboids (from apex to base):

()

(city) (item) (year)

(city, item) (city, year) (item, year)

(city, item, year)


Cube Operation

Cube definition and computation in DMQL:

define cube sales[item, city, year]: sum(sales_in_dollars)

compute cube sales

For a cube with n dimensions, there are 2^n cuboids in total, including the base cuboid.

The cube computation operator was first introduced by Gray et al.

OLAP may need to access different cuboids for different queries, so it is useful to pre-compute some or all of the cuboids in advance.

Pre-computation leads to fast response time and avoids some redundant computation.

A major challenge related to this pre-computation, however, is that the required storage space may explode if all of the cuboids in a data cube are pre-computed, especially when the cube has several dimensions associated with multiple level hierarchies.


The storage requirements are even more excessive when many dimensions have associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of dimensionality.

If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen above, is 2^n. However, in practice, many dimensions do have hierarchies, for example:

day < week < month < quarter < year

In that case, the total number of cuboids that can be generated is

$T = \prod_{i=1}^{n} (L_i + 1)$

where L_i is the number of levels associated with dimension i.

1 is added to L_i to include the virtual top level, all.
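As a quick numeric check of the cuboid count, the following sketch evaluates the product of (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i. The function name `total_cuboids` and the example level counts are assumptions chosen for illustration.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1); the +1 accounts for the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with no hierarchy (one level each) gives 2^3 cuboids.
print(total_cuboids([1, 1, 1]))  # 8

# Ten dimensions with four levels each gives 5^10 cuboids.
print(total_cuboids([4] * 10))   # 9765625
```

Even a modest number of levels per dimension pushes the count into the millions, which is why full pre-computation is rarely practical.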

Partial Materialization: Selected Computation of Cuboids

There are three options for materializing a data cube:

No materialization: pre-compute only the base cuboid and none of the remaining non-base cuboids

Full materialization: pre-compute all of the cuboids

Partial materialization: selectively compute a proper subset of the whole set of possible cuboids. This requires the system to:

(1) identify the subset of cuboids to materialize,

• based on size, sharing, access frequency, etc.

(2) exploit the materialized cuboids during query processing, and

(3) efficiently update the materialized cuboids during load and refresh.



Iceberg Cube

Compute only the cuboid cells whose count or other aggregate satisfies a condition such as

HAVING COUNT(*) >= minsup

Only the “interesting” cells are calculated: those with data above a certain threshold
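The iceberg condition can be illustrated with a small Python sketch that keeps only the cells whose count reaches `minsup`. Note the assumption: this version counts every cell and then filters, which shows the result but not the efficiency. Practical iceberg-cube algorithms prune low-count cells during computation. The function name and sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dimensions, minsup):
    """Return, per group-by, only the cells with COUNT(*) >= minsup."""
    result = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            counts = Counter(tuple(row[d] for d in dims) for row in rows)
            kept = {cell: c for cell, c in counts.items() if c >= minsup}
            if kept:
                result[dims] = kept
    return result

# Hypothetical sample data
rows = [
    {"city": "V", "item": "TV"},
    {"city": "V", "item": "TV"},
    {"city": "V", "item": "PC"},
    {"city": "T", "item": "TV"},
]
iceberg = iceberg_cube(rows, ("city", "item"), minsup=2)
print(iceberg[("city",)])         # {('V',): 3}
print(iceberg[("city", "item")])  # {('V', 'TV'): 2}
```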



Indexing OLAP Data: Bitmap Index

Bitmap indexing is a popular method in OLAP because it allows quick searching in data cubes

The bitmap index is an alternative representation of the record_id (RID) list

For a given attribute, there is a distinct bit vector Bv for each value v in the attribute's domain

If the domain of a given attribute contains n values, then n bits are needed for each entry in the bitmap index.

If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index, and all other bits for that row are set to 0.



Base table

RID  Item  City
R1   H     V
R2   C     V
R3   P     V
R4   S     V
R5   H     T
R6   C     T
R7   P     T
R8   S     T

Index on Item

RID  H  C  P  S
R1   1  0  0  0
R2   0  1  0  0
R3   0  0  1  0
R4   0  0  0  1
R5   1  0  0  0
R6   0  1  0  0
R7   0  0  1  0
R8   0  0  0  1

Index on City

RID  V  T
R1   1  0
R2   1  0
R3   1  0
R4   1  0
R5   0  1
R6   0  1
R7   0  1
R8   0  1
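The bitmap indices for this base table can be rebuilt with a short Python sketch. It also shows why bitmaps are fast for selections: a condition on two attributes becomes a bitwise AND. The function name `build_bitmap_index` and the dictionary layout are assumptions for illustration. Real systems store the bit vectors in packed, often compressed, form.

```python
def build_bitmap_index(table, attribute):
    """Map each distinct value of the attribute to a 0/1 bit list, one bit per row."""
    values = sorted({row[attribute] for row in table})
    return {v: [1 if row[attribute] == v else 0 for row in table] for v in values}

base_table = [
    {"rid": "R1", "item": "H", "city": "V"},
    {"rid": "R2", "item": "C", "city": "V"},
    {"rid": "R3", "item": "P", "city": "V"},
    {"rid": "R4", "item": "S", "city": "V"},
    {"rid": "R5", "item": "H", "city": "T"},
    {"rid": "R6", "item": "C", "city": "T"},
    {"rid": "R7", "item": "P", "city": "T"},
    {"rid": "R8", "item": "S", "city": "T"},
]
item_index = build_bitmap_index(base_table, "item")
city_index = build_bitmap_index(base_table, "city")

# Select rows where item = H AND city = T using a bitwise AND of two bit vectors.
hits = [a & b for a, b in zip(item_index["H"], city_index["T"])]
matching = [row["rid"] for row, bit in zip(base_table, hits) if bit]
print(matching)  # ['R5']
```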



Indexing OLAP Data: Join Indices

Join index: JI(R-id, S-id) where R(R-id, …) joins S(S-id, …)

In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.

E.g., fact table Sales with two dimensions, city and product:

• A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
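A join index on city can be sketched as a mapping from each distinct city to the R-IDs of the matching fact rows. The fact-table layout, the sample rows, and the function name `build_join_index` are assumptions made for this example.

```python
from collections import defaultdict

def build_join_index(fact_rows, dimension_key):
    """Map each distinct dimension value to the list of fact-table R-IDs."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[row[dimension_key]].append(rid)
    return dict(index)

# Hypothetical Sales fact table keyed by R-ID
sales = {
    "R1": {"city": "Vancouver", "product": "TV", "amount": 100},
    "R2": {"city": "Toronto", "product": "PC", "amount": 250},
    "R3": {"city": "Vancouver", "product": "PC", "amount": 75},
}
city_index = build_join_index(sales, "city")
print(city_index["Vancouver"])  # ['R1', 'R3']

# The index lets a query reach the relevant fact rows without scanning the table.
total = sum(sales[rid]["amount"] for rid in city_index["Vancouver"])
print(total)  # 175
```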



Efficient Processing of OLAP Queries

The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing proceeds as follows:

Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

Determine which materialized cuboid(s) should be selected for the OLAP operation

For example, suppose a data cube has the dimensions {time, item, location}, and the dimension hierarchies used are “day < month < quarter < year” for time, “item_name < brand < type” for item, and “street < city < province_or_state < country” for location.

Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and suppose there are 4 materialized cuboids available:


1) {year, item_name, city}

2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

Which should be selected to process the query?
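One way to reason about this question is to check, for each cuboid, whether every level the query needs is available at the same or a finer granularity. The Python sketch below encodes that check under stated assumptions: the `HIERARCHIES` table mirrors the three hierarchies above, a cuboid already restricted to year = 2004 is treated as satisfying the condition, and the function names are hypothetical.

```python
# Levels are listed from finest to coarsest within each dimension.
HIERARCHIES = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def locate(level):
    """Return (dimension, depth) for a level name; depth 0 is the finest."""
    for dim, levels in HIERARCHIES.items():
        if level in levels:
            return dim, levels.index(level)
    raise ValueError(f"unknown level: {level}")

def can_answer(cuboid_levels, prefiltered, query_levels, condition_level):
    """True if the cuboid is at the same or a finer level than the query needs."""
    available = dict(locate(level) for level in cuboid_levels)
    needed = [locate(level) for level in query_levels]
    if not prefiltered:
        # The cuboid must still carry the level used in the selection condition.
        needed.append(locate(condition_level))
    return all(dim in available and available[dim] <= depth for dim, depth in needed)

cuboids = {
    1: (["year", "item_name", "city"], False),
    2: (["year", "brand", "country"], False),
    3: (["year", "brand", "province_or_state"], False),
    4: (["item_name", "province_or_state"], True),  # stored where year = 2004
}
usable = [n for n, (levels, pre) in cuboids.items()
          if can_answer(levels, pre, ["brand", "province_or_state"], "year")]
print(usable)  # [1, 3, 4]
```

Cuboid 2 is ruled out because country is coarser than province_or_state, so the required detail cannot be recovered from it. Among the usable cuboids, finer ones generally need more aggregation work at query time, so the final choice also depends on cuboid sizes and available indices.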

Explore indexing structures and compressed


From DW to DM

DW Usage

Data warehouses and data marts are used in a wide range of applications.

Business executives in almost every industry use the data stored in data warehouses and data marts to perform data analysis and make strategic decisions.

Initially, the data warehouse is mainly used for generating reports and answering predefined queries.

Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts.

Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations.

Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools.

Tools for data warehousing can be categorized into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools.


Data Warehouse Usage

Three kinds of data warehouse applications:

Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs

Analytical processing: multidimensional analysis of data warehouse data. It supports basic OLAP operations such as slice-and-dice, drilling, and pivoting

Data mining: knowledge discovery from hidden patterns. It supports finding associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.



From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

OLAM integrates OLAP with data mining to mine knowledge in multidimensional databases

Why online analytical mining?

High quality of data in data warehouses

• DW contains integrated, consistent, cleaned data

Available information processing structure surrounding data warehouses

• ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools

OLAP-based exploratory data analysis

• mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions

• integration and swapping of multiple mining functions, algorithms, and tasks

An OLAM Architecture

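[Figure: OLAM architecture in four layers. Layer 1, Data Repository: databases and the data warehouse, populated through data cleaning, data integration, and filtering. Layer 2, MDDB: the multidimensional database and metadata, reached through the Database API with filtering and integration. Layer 3, OLAP/OLAM: the OLAM engine and OLAP engine, reached through the Data Cube API. Layer 4, User Interface: the Graphical User Interface API, which accepts mining queries and returns mining results.]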

