Advance Concept in Data Bases Unit-5 by Arun Pratap Singh

Posted on 03-Jun-2018

  • 8/12/2019 Advance Concept in Data Bases Unit-5 by Arun Pratap Singh


PREPARED BY ARUN PRATAP SINGH, M.Tech 2nd SEMESTER


    DESIGN OF DATA WAREHOUSE :

The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that a data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data. This data helps in supporting the decision-making process of analysts in an organization.

The operational database undergoes per-day transactions, which cause frequent changes to the data on a daily basis. But if in future a business executive wants to analyse previous feedback on any data, such as product, supplier, or consumer data, the analyst will have no data available to analyse, because the previous data has been updated by those transactions.

Data warehouses provide us generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, data warehouses also provide us Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

    Understanding Data Warehouse

The data warehouse is a database which is kept separate from the organization's operational database.

There is no frequent updating done in a data warehouse.

A data warehouse possesses consolidated historical data, which helps the organization analyse its business.

A data warehouse helps executives to organize, understand and use their data to take strategic decisions.

Data warehouse systems help in the integration of a diversity of application systems.

A data warehouse system allows analysis of consolidated historical data.

Definition : A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process.

Why Data Warehouses Are Kept Separate from Operational Databases

The following are the reasons why data warehouses are kept separate from operational databases:

The operational database is constructed for well-known tasks and workloads such as searching particular records, indexing etc., but data warehouse queries are often complex and present a general form of data.

    UNIT : V


Operational databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

Operational database queries allow read and modify operations, while an OLAP query needs only read-only access to stored data.

Operational databases maintain current data; on the other hand, a data warehouse maintains historical data.

    Data Warehouse Features

The key features of a data warehouse, namely Subject Oriented, Integrated, Nonvolatile and Time-Variant, are discussed below:

Subject Oriented - The data warehouse is subject oriented because it provides us information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery or concurrency control, because it is physically stored separately from the operational database.

    Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyse and use their data for decision making. A data warehouse serves as a central part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services

Banking services

Consumer goods

Retail sectors

Controlled manufacturing

    Data Warehouse Types

Information processing, analytical processing and data mining are the three types of data warehouse applications, discussed below:


Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up (roll-up), and pivoting.

Data mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
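The basic OLAP operations mentioned above can be sketched with an ordinary dataframe library. The sketch below uses pandas on a small made-up sales table; the column names and figures are illustrative, not taken from the text:

```python
import pandas as pd

# Hypothetical sales records: three dimensions (quarter, item, location) and one measure.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["keyboard", "mouse", "keyboard", "mouse", "keyboard", "mouse"],
    "location": ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "units":    [120, 200, 150, 180, 90, 110],
})

# Slice: fix one dimension to a single value (location = Delhi).
delhi = sales[sales["location"] == "Delhi"]

# Roll-up (drill-up): aggregate away a dimension, here summarising over item.
by_quarter = delhi.groupby("quarter")["units"].sum()

# Drill-down: move to finer detail by adding a dimension back.
by_quarter_item = delhi.groupby(["quarter", "item"])["units"].sum()

# Pivot: rotate the axes to compare items across quarters side by side.
pivoted = delhi.pivot_table(index="item", columns="quarter",
                            values="units", aggfunc="sum")
```

Each operation changes only how the same facts are grouped and displayed, which is why OLAP tools can offer them interactively.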

SN - Data Warehouse (OLAP) vs Operational Database (OLTP)

1. OLAP involves historical processing of information; OLTP involves day-to-day processing.

2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs, or database professionals.

3. OLAP is used to analyse the business; OLTP is used to run the business.

4. OLAP focuses on information out; OLTP focuses on data in.

5. OLAP is based on the Star Schema, Snowflake Schema and Fact Constellation Schema; OLTP is based on the Entity Relationship Model.

6. OLAP is subject oriented; OLTP is application oriented.

7. OLAP contains historical data; OLTP contains current data.

8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.

9. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.

10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.

11. The number of records accessed by OLAP is in the millions; the number of records accessed by OLTP is in the tens.

12. OLAP database size ranges from 100 GB to TB; OLTP database size ranges from 100 MB to GB.

13. OLAP is highly flexible; OLTP provides high performance.


    What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration and data consolidation.

    Using Data Warehouse Information

There are decision-support technologies available which help utilize the data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather the data, analyse it and take decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

Tuning production strategies - Product strategies can be well tuned by repositioning products and managing product portfolios by comparing sales quarterly or yearly.

Customer analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles etc.

Operations analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.

In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting, such as annual and quarterly comparisons.

The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc.). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.


Data warehouses support business decisions by collecting, consolidating, and organizing data for reporting and analysis with tools such as online analytical processing (OLAP) and data mining.

Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing (OLTP) system database.

    Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data warehouse differs from that of an OLTP database, the design characteristics of a relational database that supports a data warehouse differ from the design characteristics of an OLTP database.

    A Data Warehouse Supports OLTP-

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database.

Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database.

If data is simply archived for preservation, it is not available or organized for use by analysts and decision makers. If data is allowed to accumulate in the OLTP database so it can be used for analysis, the


OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.

A data warehouse offloads the historical data from the OLTP database, allowing the OLTP system to operate at peak transaction efficiency. High-volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP system, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.

    OLAP is a Data Warehouse Tool-

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. For example, a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less, regardless of how many hundreds of millions of rows of data are stored in the data warehouse database.

Data warehouse database vs OLTP database:

Designed for analysis of business measures by categories and attributes, vs designed for real-time business operations.

Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table, vs optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.

Loaded with consistent, valid data and requiring no real-time validation, vs optimized for validation of incoming data during transactions, using validation data tables.

Supports few concurrent users relative to OLTP, vs supports thousands of concurrent users.


OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high-volume update transactions. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries.

In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries.


    Data Warehouse Tools and Utilities Functions

    The following are the functions of Data Warehouse tools and Utilities:

Data Extraction - Data extraction involves gathering the data from multiple heterogeneous sources.

    Data Cleaning- Data Cleaning involves finding and correcting the errors in data.


Data Transformation - Data transformation involves converting data from a legacy format to the warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

Refreshing - Refreshing involves updating from the data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
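A minimal, illustrative pipeline covering these functions might look as follows in Python with pandas; the source tables, column names and cleaning rules are invented for the sketch:

```python
import pandas as pd

# Extraction: gather rows from two hypothetical heterogeneous sources.
source_a = pd.DataFrame({"cust": ["alice", "BOB"], "amount": ["100", "250"]})
source_b = pd.DataFrame({"cust": ["carol", None], "amount": ["75", "40"]})
extracted = pd.concat([source_a, source_b], ignore_index=True)

# Cleaning: drop records with a missing key and normalise casing.
cleaned = extracted.dropna(subset=["cust"]).copy()
cleaned["cust"] = cleaned["cust"].str.lower()

# Transformation: convert legacy string amounts to the warehouse's numeric format.
cleaned["amount"] = cleaned["amount"].astype(float)

# Loading: sort, summarise, and build an index as the final load step.
warehouse = cleaned.sort_values("cust").set_index("cust")
total = warehouse["amount"].sum()
```

A real ETL tool would add logging, incremental refresh and error handling around these same steps.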

    Data Warehouse :

A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process. Let's explore this definition of the data warehouse.

Subject Oriented - The data warehouse is subject oriented because it provides us information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Metadata - Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

In terms of the data warehouse, we can define metadata as follows:

Metadata is a road map to the data warehouse.

Metadata in a data warehouse defines the warehouse objects.

Metadata acts as a directory. This directory helps the decision support system locate the contents of the data warehouse.

Metadata Repository :

The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

    Business Metadata- This metadata has the data ownership information, business definition andchanging policies.

Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it migrated and the transformations applied to it.


Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh and purging rules.

    The algorithms for summarization- This includes dimension algorithms, data on granularity,aggregation, summarizing etc.
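One way to picture the repository is as a simple nested structure. The categories below follow the list above, but every field name and value is illustrative, not a standard schema:

```python
# A toy metadata repository as a nested dictionary.
metadata_repository = {
    "business": {
        "owner": "sales department",
        "definition": "quarterly revenue by product line",
        "change_policy": "reviewed each quarter",
    },
    "operational": {
        "currency": "active",            # active, archived, or purged
        "lineage": ["crm_db", "cleaned", "loaded 2014-03-01"],
    },
    "mapping": {
        "source": "crm_db.orders",
        "transformation_rules": ["trim names", "amounts to float"],
        "refresh": "nightly",
    },
    "summarization": {
        "granularity": "per quarter",
        "aggregation": "sum of order amounts",
    },
}

def locate(category, field):
    """Directory-style lookup, as a decision-support tool might use it."""
    return metadata_repository[category][field]
```

The point of the sketch is the directory role: a tool asks the repository where data came from and how it was summarized, rather than inspecting the warehouse tables directly.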

    Data cube :

A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

    Illustration of Data cube

Suppose a company wants to keep track of sales records, with the help of a sales data warehouse, with respect to time, item, branch and location. These dimensions allow it to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension. This table is known as a dimension table. The dimension table further describes the dimensions. For example, the "item" dimension table may have attributes such as item_name, item_type and item_brand.

The following table represents a 2-D view of sales data for a company with respect to the time, item and location dimensions.

But here in this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of item sold. If we want to view the sales data with one more dimension, say the location dimension, then the 3-D view of the sales data with respect to time, item, and location is shown in the table below:


The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:
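Assuming a small synthetic fact table, the 2-D view and the 3-D cube described above can be sketched with pandas (all item names and figures are invented):

```python
import pandas as pd

# Hypothetical fact records with three dimensions: time, item, location.
facts = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["phone", "modem", "phone", "modem", "phone", "modem"],
    "location":   ["New Delhi", "New Delhi", "New Delhi",
                   "New Delhi", "Kolkata", "Kolkata"],
    "units_sold": [25, 10, 30, 12, 18, 9],
})

# 2-D view: fix location = New Delhi, tabulate time against item.
view_2d = facts[facts["location"] == "New Delhi"].pivot_table(
    index="time", columns="item", values="units_sold", aggfunc="sum")

# 3-D cube: keep all three dimensions; each cell is one
# (time, item, location) total, i.e. one cell of the cube.
cube_3d = facts.groupby(["time", "item", "location"])["units_sold"].sum()
```

Each extra dimension multiplies the number of cells, which is why real cubes are stored and preaggregated by dedicated OLAP servers rather than as flat tables.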

    DATA MART :

A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group within an organization. In other words, we can say that a data mart contains only data which is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.
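As a toy sketch, carving a marketing data mart out of an organization-wide table is just a projection onto the subjects that group needs; the columns below are invented for illustration:

```python
import pandas as pd

# An organization-wide warehouse table (columns are illustrative).
warehouse = pd.DataFrame({
    "item":     ["pen", "ink", "pen"],
    "customer": ["c1", "c2", "c1"],
    "sales":    [10.0, 5.0, 7.5],
    "hr_grade": ["A", "B", "A"],     # unrelated to marketing
    "payroll":  [100, 200, 100],     # unrelated to marketing
})

# The marketing data mart keeps only the subject areas marketing needs.
marketing_mart = warehouse[["item", "customer", "sales"]].copy()
```

The `.copy()` reflects the usual deployment: the mart is a separate physical extract the department can reshape without touching the warehouse.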


    Points to remember about data marts:

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.

    Graphical Representation of data mart.

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.[1] This enables each department to use, manipulate and develop their data any way they see fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.


The reason organizations build data warehouses and data marts is that the information in the database is not organized in a way that makes it easy for organizations to find what they need. Also, complicated queries might take a long time to answer what people want to know, since the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated; however, data warehouses or marts are read-only. Data warehouses are designed to access large groups of related records.

Data marts improve end-user response time by allowing users to have access to the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.

A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span across many or all of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.).

    Reasons for creating a data mart :

Easy access to frequently needed data

Creates a collective view for a group of users

Improves end-user response time

Ease of creation

Lower cost than implementing a full data warehouse

Potential users are more clearly defined than in a full data warehouse

Contains only business-essential data and is less cluttered


    DEPENDENT DATA MART :

According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:

A need for a special data model or schema: e.g., to restructure for OLAP.

Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.

Security: to separate an authorized data subset selectively.

Expediency: to bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse.

Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse.

Politics: a coping strategy for IT (information technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.

Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and an inability to leverage enterprise sources of data.

The alternative school of data warehousing is that of Ralph Kimball. In his view, a data warehouse is nothing more than the union of all the data marts. This view helps to reduce costs and provides fast development, but can create an inconsistent data warehouse, especially in large organizations. Therefore, Kimball's approach is more suitable for small-to-medium corporations.


    Virtual Warehouse :

A view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but building it requires excess capacity on the operational database servers.

    PROCESS FLOW IN DATA WAREHOUSE :

There are four major processes that build a data warehouse:

Extracting and loading the data.

Cleaning and transforming the data.

Backing up and archiving the data.

Managing queries and directing them to the appropriate data sources.

    Extract and Load Process

Data extraction takes data from the source systems.

Data loading takes the extracted data and loads it into the data warehouse.

Note: Before loading data into the data warehouse, the information extracted from external sources must be reconstructed.

Points to remember about the extract and load process:

    Controlling the process

    When to Initiate Extract


    Loading the Data

    CONTROLLING THE PROCESS

Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.

    WHEN TO INITIATE EXTRACT

Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
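The cutoff idea can be sketched in a few lines of Python; the timestamps and record layout are invented for the example:

```python
from datetime import datetime

# Hypothetical rows from two source systems, each stamped with a load time.
customers = [
    {"id": 1, "loaded_at": datetime(2014, 3, 4, 19, 0)},   # Tuesday evening
    {"id": 2, "loaded_at": datetime(2014, 3, 5, 19, 30)},  # Wednesday evening
]
subscriptions = [
    {"cust_id": 1, "loaded_at": datetime(2014, 3, 4, 18, 0)},
]

# A single cutoff applied to every source keeps the extract consistent:
# no customer appears without the subscription events known at that instant.
cutoff = datetime(2014, 3, 4, 20, 0)   # Tuesday 8 pm, for both sources

extract_customers = [c for c in customers if c["loaded_at"] <= cutoff]
extract_subs = [s for s in subscriptions if s["loaded_at"] <= cutoff]
```

Because customer 2 arrived after the cutoff, it is excluded rather than extracted without its matching subscription data.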

    LOADING THE DATA

After extracting the data, it is loaded into a temporary data store. Here in the temporary data store it is cleaned up and made consistent.

    Note: Consistency checks are executed only when all data sources have been loaded intotemporary data store.

    Clean and Transform Process

Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved in cleaning and transforming:

    Clean and Transform the loaded data into a structure.

    Partition the data.

    Aggregation

    CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE

This will speed up the queries. It can be done in the following ways:

    Make sure data is consistent within itself.

    Make sure data is consistent with other data within the same data source.

    Make sure data is consistent with data in other source systems.

    Make sure data is consistent with data already in the warehouse.
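A toy version of these checks over a temporary data store might look like this; the rules, field names and rejected rows are all illustrative:

```python
# Staged rows in the temporary data store (a list of dicts).
staged = [
    {"order_id": 1, "cust": "alice", "amount": 40.0},   # passes all checks
    {"order_id": 2, "cust": "bob",   "amount": -5.0},   # inconsistent within itself
    {"order_id": 3, "cust": "eve",   "amount": 10.0},   # unknown in other sources
    {"order_id": 4, "cust": "carol", "amount": 20.0},   # already in the warehouse
]
known_customers = {"alice", "bob", "carol"}   # from another source system
warehouse_order_ids = {4}                     # already loaded earlier

def is_consistent(row):
    return (row["amount"] >= 0                              # within itself
            and row["cust"] in known_customers              # across source systems
            and row["order_id"] not in warehouse_order_ids) # against the warehouse

accepted = [r for r in staged if is_consistent(r)]
```

Only rows that pass every check move on from the temporary data store into the warehouse; the rest are held back for correction.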


Transforming involves converting the source data into a structure. Structuring the data results in increased query performance and decreased operational cost. Information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.

    PARTITION THE DATA

Partitioning optimizes hardware performance and simplifies the management of the data warehouse. In this step we partition each fact table into multiple separate partitions.

    AGGREGATION

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyse a subset or an aggregation of the detailed data.
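Both steps can be sketched on a toy fact table with pandas; the partition boundaries (by month) and the aggregate chosen are illustrative:

```python
import pandas as pd

# A small fact table with a time dimension (synthetic data).
fact_sales = pd.DataFrame({
    "month":  ["2014-01", "2014-01", "2014-02", "2014-02"],
    "store":  ["s1", "s2", "s1", "s2"],
    "amount": [100.0, 150.0, 120.0, 80.0],
})

# Partitioning: split the fact table into separate per-month partitions,
# as a warehouse might split it across files or tablespaces.
partitions = {month: grp for month, grp in fact_sales.groupby("month")}

# Aggregation: precompute a monthly summary so common queries avoid
# scanning the detailed rows every time.
monthly_totals = fact_sales.groupby("month")["amount"].sum()
```

A query for one month then touches only that month's partition, and a "total per month" query reads the small precomputed table instead of the full fact table.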

    Backup and Archive the data

In order to recover the data in the event of data loss, software failure or hardware failure, it is necessary to back it up on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. In that case we require some data to be restored from the archive.

    Query Management Process

    This process performs the following functions:

    It manages the queries.

    It speeds up query execution.

    It directs queries to the most effective data sources.

    It ensures that all system resources are used in the most effective way.

    It monitors actual query profiles.

    Information gathered by this process is used by the warehouse management process to determine which aggregations to generate.

    This process does not generally operate during the regular load of information into the data warehouse.


    THREE-TIER DATA WAREHOUSE ARCHITECTURE :

    Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.

    Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These tools and utilities perform the Extract, Clean, Load, and Refresh functions.

    Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:

    o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.

    o By the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.

    Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools.

    The following diagram explains the three-tier architecture of a data warehouse:


    OLAP :

    Introduction

    An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to gain insight into information through fast, consistent, interactive access to it. In this chapter we will discuss the types of OLAP servers, OLAP operations, and the difference between OLAP and OLTP.

    Types of OLAP Servers

    We have four types of OLAP servers that are listed below.

    Relational OLAP (ROLAP)

    Multidimensional OLAP (MOLAP)

    Hybrid OLAP (HOLAP)

    Specialized SQL Servers

    Relational OLAP (ROLAP)

    Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

    ROLAP includes the following:

    implementation of aggregation navigation logic.

    optimization for each DBMS back end.

    additional tools and services.

    Multidimensional OLAP (MOLAP)

    Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use a two-level data storage representation to handle dense and sparse data sets.
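    The sparsity problem can be illustrated with a quick back-of-the-envelope sketch (the dimension sizes are made up): a dense array must allocate every cell combination, while a sparse map stores only the populated cells:

```python
# Sketch of why sparse multidimensional data wastes dense storage: a dense
# array allocates every (item, city, quarter) cell, while a sparse map keeps
# only the cells that actually hold a value. Sizes are illustrative.

n_items, n_cities, n_quarters = 1000, 100, 4
dense_cells = n_items * n_cities * n_quarters   # every combination allocated

# Sparse representation: only cells with actual sales are stored.
sparse_cube = {
    (0, 0, 0): 120,     # (item, city, quarter) -> units sold
    (0, 1, 2): 80,
    (999, 99, 3): 5,
}

occupancy = len(sparse_cube) / dense_cells      # fraction of cells populated
```

    With occupancy this low, a dense layout pays for storage it never uses, which is why MOLAP engines treat dense and sparse regions differently.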

    Hybrid OLAP (HOLAP)

    The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server can store large volumes of detail data; the aggregations are stored separately in the MOLAP store.


    Specialized SQL Servers

    Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

    OLAP Operations

    Since the OLAP server is based on the multidimensional view of data, we will discuss OLAP operations on multidimensional data.

    Here is the list of OLAP operations:

    Roll-up

    Drill-down

    Slice and dice

    Pivot (rotate)

    ROLL-UP

    This operation performs aggregation on a data cube in either of the following ways:

    By climbing up a concept hierarchy for a dimension

    By dimension reduction.

    Consider the following diagram showing the roll-up operation.


    The roll-up operation is performed by climbing up a concept hierarchy for the dimension location.

    Initially the concept hierarchy was "street < city < province < country".

    On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

    The data is grouped into countries rather than cities.

    When the roll-up operation is performed, one or more dimensions are removed from the data cube.
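    As a minimal sketch (with illustrative figures), rolling up location from city to country means re-aggregating the measure under the city-to-country concept hierarchy:

```python
# Sketch of roll-up: climbing the location hierarchy from city to country.
# The city -> country mapping and the sales figures are illustrative.
from collections import defaultdict

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "New York": "USA", "Chicago": "USA"}

sales_by_city = {"Vancouver": 100, "Toronto": 200, "New York": 300, "Chicago": 400}

sales_by_country = defaultdict(int)
for city, units in sales_by_city.items():
    sales_by_country[city_to_country[city]] += units
```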

    DRILL-DOWN

    The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:

    By stepping down a concept hierarchy for a dimension.


    By introducing a new dimension.

    Consider the following diagram showing the drill-down operation:

    The drill-down operation is performed by stepping down a concept hierarchy for the dimension time.

    Initially the concept hierarchy was "day < month < quarter < year".

    On drilling down, the time dimension descends from the level of quarter to the level of month.

    When the drill-down operation is performed, one or more dimensions are added to the data cube.

    It navigates the data from less detailed data to highly detailed data.
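    A minimal sketch of drill-down (with illustrative data): because the cube keeps detail at the month level, a quarter-level figure can be opened up into the months that compose it:

```python
# Sketch of drill-down: the cube keeps detail at month level, so a quarterly
# view can be descended back to the months that make it up.
from collections import defaultdict

month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
sales_by_month = {"Jan": 10, "Feb": 20, "Mar": 30, "Apr": 40}

# Roll-up view at quarter level:
sales_by_quarter = defaultdict(int)
for month, units in sales_by_month.items():
    sales_by_quarter[month_to_quarter[month]] += units

# Drill-down on a quarter: descend from quarter back to the month level.
def drill_down(quarter):
    return {m: u for m, u in sales_by_month.items()
            if month_to_quarter[m] == quarter}

q1_detail = drill_down("Q1")
```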


    SLICE

    The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram showing the slice operation.

    The slice operation is performed for the dimension time using the criterion time = "Q1".

    It will form a new sub cube by selecting one or more dimensions.
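    A minimal sketch (with illustrative data): slicing on time = "Q1" keeps only the cells with that time value and drops the time coordinate, leaving a sub-cube over the remaining dimensions:

```python
# Sketch of slice: fixing the time dimension at "Q1" yields a sub-cube
# over the remaining (location, item) dimensions. Cell values are illustrative.
cube = {
    ("Q1", "Toronto",   "Mobile"): 100,
    ("Q1", "Vancouver", "Modem"): 50,
    ("Q2", "Toronto",   "Mobile"): 120,
}

def slice_cube(cube, time_value):
    # Keep only cells where time == time_value; drop the time coordinate.
    return {(loc, item): v
            for (t, loc, item), v in cube.items() if t == time_value}

q1_slice = slice_cube(cube, "Q1")
```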

    DICE

    The dice operation selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram showing the dice operation:


    The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

    (location = "Toronto" or "Vancouver")

    (time = "Q1" or "Q2")

    (item = "Mobile" or "Modem")
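    These criteria can be sketched directly (the cell values are illustrative): a cell survives the dice only if all three dimension predicates hold:

```python
# Sketch of dice: keep a cell only when every dimension predicate holds.
cube = {
    ("Q1", "Toronto",   "Mobile"): 100,
    ("Q2", "Vancouver", "Modem"): 50,
    ("Q3", "Toronto",   "Mobile"): 70,   # fails the time predicate
    ("Q1", "New York",  "Mobile"): 90,   # fails the location predicate
}

def dice(cube):
    return {
        key: v for key, v in cube.items()
        if key[0] in ("Q1", "Q2")
        and key[1] in ("Toronto", "Vancouver")
        and key[2] in ("Mobile", "Modem")
    }

sub_cube = dice(cube)
```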

    PIVOT

    The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.


    Here the item and location axes of the 2-D slice are rotated.
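    A minimal sketch of pivot (with illustrative data): rotating a 2-D slice swaps the row and column axes while leaving the cell values unchanged:

```python
# Sketch of pivot: rotate a 2-D slice so rows (item) and columns (location)
# swap axes; the cell values themselves do not change.
table = {
    "Mobile": {"Toronto": 100, "Vancouver": 50},
    "Modem":  {"Toronto": 30,  "Vancouver": 20},
}

def pivot(t):
    rotated = {}
    for row_key, cols in t.items():
        for col_key, v in cols.items():
            rotated.setdefault(col_key, {})[row_key] = v
    return rotated

rotated = pivot(table)
```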

    OLAP vs OLTP

    Differences between a data warehouse (OLAP) and an operational database (OLTP):

    1. OLAP involves historical processing of information; OLTP involves day-to-day processing.

    2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs, or database professionals.

    3. OLAP is used to analyse the business; OLTP is used to run the business.

    4. OLAP focuses on information out; OLTP focuses on data in.

    5. OLAP is based on the Star, Snowflake and Fact Constellation schemas; OLTP is based on the Entity-Relationship model.

    6. OLAP is subject oriented; OLTP is application oriented.

    7. OLAP contains historical data; OLTP contains current data.

    8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.

    9. OLAP provides a summarized, multidimensional view of data; OLTP provides a detailed, flat relational view.

    10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.

    11. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.

    12. The OLAP database size ranges from 100 GB to TB; the OLTP database size ranges from 100 MB to GB.

    13. OLAP systems are highly flexible; OLTP systems provide high performance.

    CONCEPTUAL MODELING OF DATA WAREHOUSES :

    Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. Dimensional modeling has two basic concepts.

    Facts:

    A fact is a collection of related data items, consisting of measures.

    A fact is a focus of interest for the decision making process.

    Measures are continuously valued attributes that describe facts.


    A fact is a business measure.

    Dimension:

    The parameter over which we want to perform analysis of facts.

    The parameter that gives meaning to a measure; for example, "number of customers" is a fact, and we may perform analysis of it over the time dimension.

    Dimensional modeling has also emerged as the only coherent architecture for building distributed data warehouse systems. More complex questions for the warehouse may involve three or more dimensions.

    This is where the multidimensional database plays a significant role in analysis. Dimensions are categories by which summarized data can be viewed. Cubes are data processing units composed of fact tables and dimensions from the data warehouse.

    Multi-Dimensional Modeling

    Multidimensional database technology has come a long way since its inception more than 30 years ago. It has recently begun to reach the mass market, with major vendors now delivering multidimensional engines along with their relational database offerings, often at no extra cost. Multidimensional technology has also made significant gains in scalability and maturity.

    The multidimensional data model emerged for use when the objective is to analyze rather than to perform on-line transactions.

    The multidimensional model is based on three key concepts:

    Modeling business rules

    Cube and measures

    Dimensions

    Multidimensional database technology is a key factor in the interactive analysis of large amounts of data for decision-making purposes. The multidimensional data model is introduced based on relational elements. Dimensions are modeled as dimension relations.

    Query languages for these systems are similar to structured query language, but they cannot treat all dimensions and measures symmetrically. The definition of a multidimensional schema describes multiple levels along a dimension, and there is at least one key attribute in each level that is included in the keys of the star schema in relational systems. Multidimensional databases enable end-users to model data in a multidimensional environment. This is a real product strength, as it provides the fastest, most flexible method to process multidimensional requests.


    The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. When realized in a database, the schema for a dimensional model contains a central fact table and multiple dimension tables. A dimensional model may produce a star schema or a snowflake schema.

    The schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Like a database, the data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses the star, snowflake, and fact constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.

    STAR SCHEMA :

    In star schema each dimension is represented with only one dimension table.

    This dimension table contains the set of attributes.

    In the following diagram we have shown the sales data of a company with respect to four dimensions, namely time, item, branch, and location.


    There is a fact table at the centre. This fact table contains the keys to each of four dimensions.

    The fact table also contains the measures, namely dollars sold and units sold.

    Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia; the entries for such cities cause data redundancy along the attributes province_or_state and country.

    What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.
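    The star layout can be sketched with plain data structures (the keys and values are illustrative; in practice these would be relational tables): a central fact table holds foreign keys plus measures, and each dimension is one denormalized lookup table:

```python
# Sketch of a star schema in plain Python: a central fact table of foreign
# keys and measures, plus one denormalized table per dimension.
# Keys and attribute values are illustrative.

location_dim = {
    1: {"street": "1 Main St", "city": "Vancouver",
        "province_or_state": "British Columbia", "country": "Canada"},
    2: {"street": "9 King St", "city": "Victoria",
        "province_or_state": "British Columbia", "country": "Canada"},
}
time_dim = {10: {"quarter": "Q1"}, 11: {"quarter": "Q2"}}

# Fact table: foreign keys into the dimensions plus the measures.
fact_sales = [
    {"time_key": 10, "location_key": 1, "dollars_sold": 1000.0, "units_sold": 5},
    {"time_key": 10, "location_key": 2, "dollars_sold": 400.0,  "units_sold": 2},
]

# A star join: resolve each foreign key against its dimension table.
report = [
    (time_dim[f["time_key"]]["quarter"],
     location_dim[f["location_key"]]["city"],
     f["dollars_sold"])
    for f in fact_sales
]
```

    Note how the redundancy mentioned above is visible: both location rows repeat "British Columbia" and "Canada".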

    Fact Tables

    A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain fact data at the detail or aggregated level.

    Dimension Tables

    A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values. They are normally descriptive, textual values. Dimension tables are generally smaller in size than fact tables.

    Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.

    The main characteristics of a star schema:

    -> easy-to-understand schema

    -> small number of tables to join

    -> de-normalization; the redundant data can make the tables large.

    SNOWFLAKE SCHEMA :

    In Snowflake schema some dimension tables are normalized.

    The normalization split up the data into additional tables.

    Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

    Therefore the item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key.


    The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
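    A minimal sketch of this normalization (with illustrative keys): supplier attributes live in their own table, so resolving an item's supplier type takes one extra lookup step compared to a star schema:

```python
# Sketch of the snowflake normalization: the item dimension keeps only a
# supplier_key, and supplier attributes move to their own table.
# Keys and values are illustrative.

supplier_dim = {
    "S1": {"supplier_type": "wholesale"},
    "S2": {"supplier_type": "retail"},
}
item_dim = {
    100: {"item_name": "Mobile", "type": "electronics",
          "brand": "Acme", "supplier_key": "S1"},
    101: {"item_name": "Modem",  "type": "electronics",
          "brand": "Acme", "supplier_key": "S2"},
}

# Resolving an item's supplier type requires one extra join step.
def supplier_type(item_key):
    return supplier_dim[item_dim[item_key]["supplier_key"]]["supplier_type"]

t = supplier_type(100)
```

    The trade-off is less redundancy in the item table against an extra join at query time.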


    The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.

    The shipping fact table also contains two measures, namely dollars sold and units sold.

    It is also possible for dimension tables to be shared between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

    DATA MINING

    Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data.

    Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions.

    Introduction

    There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. Analysing this huge amount of data and extracting useful information from it is necessary.

    The extraction of information is not the only process we need to perform; it also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation, and Data Presentation. Once all these processes are over, we are in a position to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

    What is Data Mining

    Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. This information can be used for any of the following applications:

    Market Analysis

    Fraud Detection

    Customer Retention

    Production Control


    Science Exploration

    Need of Data Mining

    Here are the reasons listed below:

    In the field of information technology we have a huge amount of data available that needs to be turned into useful information.

    This information can further be used for various applications such as market analysis, fraud detection, customer retention, production control, science exploration, etc.

    Data Mining Applications

    Here is the list of applications of Data Mining:

    Market Analysis and Management

    Corporate Analysis & Risk Management

    Fraud Detection

    Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two kinds of functions involved in data mining, listed below:

    Descriptive

    Classification and Prediction

    Classification Criteria:

    Classification according to kind of databases mined

    Classification according to kind of knowledge mined

    Classification according to kinds of techniques utilized

    Classification according to applications adapted

    CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED

    We can classify the data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.


    CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED

    We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:

    Characterization

    Discrimination

    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis

    CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED

    We can classify the data mining system according to the kind of techniques used. These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

    CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED

    We can classify the data mining system according to the applications adapted. These applications are as follows:

    Finance

    Telecommunications

    DNA

    Stock Markets

    E-mail

    DATA MINING FUNCTIONALITIES :

    Characterization

    Discrimination


    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis


    DATA MINING SYSTEM CATEGORIZATION AND ITS ISSUES :

    Introduction

    There is a large variety of data mining systems available. A data mining system may integrate techniques from the following:

    Spatial Data Analysis


    Information Retrieval

    Pattern Recognition

    Image Analysis

    Signal Processing

    Computer Graphics

    Web Technology

    Business

    Bioinformatics

    Data Mining System Classification

    The data mining system can be classified according to the following criteria:

    Database Technology

    Statistics

    Machine Learning

    Information Science

    Visualization

    Other Disciplines


    Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

    Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

    Classification according to kind of databases mined

    Classification according to kind of knowledge mined

    Classification according to kinds of techniques utilized

    Classification according to applications adapted

  • 8/12/2019 Advance Concept in Data Bases Unit-5 by Arun Pratap Singh

    43/82

    PREPARED BY ARUN PRATAP SINGH 42

    42

    CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED :

    We can classify the data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

    A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

    CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED :

    We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:

    Characterization

    Discrimination

    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis

    Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.


    Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

    CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED :

    We can classify the data mining system according to the kind of techniques used. These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

    Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

    CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED :

    We can classify the data mining system according to the applications adapted. These applications are as follows:

    Finance

    Telecommunications

    DNA

    Stock Markets

    E-mail

    ISSUES IN DATA MINING :

    Introduction

    Data mining is not that easy. The algorithms used are very complex, and the data is not available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding:

    Mining Methodology and User Interaction

    Performance Issues

    Diverse Data Types Issues


    The following diagram describes the major issues.

    Mining Methodology and User Interaction Issues

    It refers to the following kinds of issues:

    Mining different kinds of knowledge in databases - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

    Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.

    Incorporation of background knowledge - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

    Data mining query languages and ad hoc data mining - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

    Presentation and visualization of data mining results - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

    Handling noisy or incomplete data - Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.

    Pattern evaluation - This refers to the interestingness of discovered patterns. Patterns should not be considered interesting if they merely represent common knowledge or lack novelty.


    Performance Issues

    It refers to the following issues:

    Efficiency and scalability of data mining algorithms - In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

    Parallel, distributed, and incremental mining algorithms - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update databases without having to mine the data again from scratch.

    Diverse Data Types Issues

    Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

    Mining information from heterogeneous databases and global information systems - The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining knowledge from them adds challenges to data mining.

    OTHER ISSUES IN DATA MINING :

    Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

    Security and social issues: Security is an important issue with any data collection that is shared and/or intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, understanding user behavior, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

    User interface issues: The knowledge discovered by data mining tools is useful only as long as it is interesting and, above all, understandable by the user. Good data visualization eases the interpretation of data mining results and helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data; however, much research remains to be done before we have good visualization tools for large datasets that can display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real estate", information rendering, and interaction. Interactivity with the data and the data mining results is crucial, since it provides a means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

    Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending on the data at hand. Moreover, different approaches may suit and solve users' needs differently.

    Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions and invalid or incomplete information, which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data and incomplete information.

    More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends on the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.

    Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining deals with today; terabyte sizes are common. This raises the issues of scalability and efficiency of data mining methods when processing considerably large data. Algorithms with exponential, or even medium-order polynomial, complexity are of no practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset, though concerns such as completeness and the choice of samples may arise. Other topics under the performance issue are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, and for updating data mining results when new data becomes available without having to re-analyze the complete dataset.

    Data source issues: There are many issues related to data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle, and we are still collecting it at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We store different types of data in a variety of repositories, and it is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types; a versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at both the structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

    DATA PROCESSING :

    What is the need for Data Processing?

    To get the required information from huge, incomplete, noisy, and inconsistent sets of data, it is necessary to use data processing.

    Steps in Data Processing:

    Data Cleaning

    Data Integration

    Data Transformation

    Data reduction

    Data Summarization

    What is Data Cleaning?

    Data cleaning is a procedure to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

    What is Data Integration?

    Combining multiple databases, data cubes, or files into a coherent store is called data integration.

    What is Data Transformation?

    Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.


    What is Data Reduction?

    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.

    What is Data Summarization?

    It is the process of representing the collected data in an accurate and compact way without losing information; it also involves deriving information from the collected data. For example, display the data as a graph and compute the mean, median, mode, etc.
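    The summary measures mentioned above can be computed with Python's standard statistics module; the sales figures below are invented purely for illustration.

```python
import statistics

# Hypothetical daily sales figures (illustrative only).
sales = [120, 135, 135, 150, 160, 135, 180]

print(statistics.mean(sales))    # arithmetic mean -> 145
print(statistics.median(sales))  # middle value of the sorted data -> 135
print(statistics.mode(sales))    # most frequent value -> 135
```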

    How to Clean Data?

    Handling Missing values

    Ignore the tuple

    Fill in the missing value manually

    Use a global constant to fill in the missing value

    Use the attribute mean to fill in the missing value

    Use the attribute mean for all samples belonging to the same class as the given tuple

    Use the most probable value to fill in the missing value.
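    One of the strategies above - using the attribute mean of all samples belonging to the same class - can be sketched in plain Python. The records, attribute names, and function name are invented for illustration.

```python
import statistics

# Hypothetical records; None marks a missing "income" value.
records = [
    {"class": "A", "income": 30000},
    {"class": "A", "income": None},
    {"class": "B", "income": 50000},
    {"class": "B", "income": 70000},
]

def fill_with_class_mean(rows, attr, cls_attr):
    # Collect the non-missing values of attr per class, then take their means.
    per_class = {}
    for row in rows:
        if row[attr] is not None:
            per_class.setdefault(row[cls_attr], []).append(row[attr])
    means = {c: statistics.mean(vals) for c, vals in per_class.items()}
    # Replace each missing value with the mean of its own class.
    for row in rows:
        if row[attr] is None:
            row[attr] = means[row[cls_attr]]
    return rows

fill_with_class_mean(records, "income", "class")
print(records[1]["income"])  # filled with the class "A" mean -> 30000
```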

    Handle Noisy Data

    Binning: Binning methods smooth a sorted data value by consulting its neighborhood.

    Regression: Data can be smoothed by fitting the data to a function, such as with regression.

    Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
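    As a rough sketch of the binning idea, here is one common variant: sort the values, split them into equal-depth bins, and replace every value by its bin's mean. The function name and price list are invented for illustration.

```python
def smooth_by_bin_means(values, depth):
    # Sort the data, then smooth each equal-depth bin to its mean.
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_ = ordered[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```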

    Data Integration :

    Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching; redundancy is another important issue.
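    A toy sketch of the schema-integration and object-matching step: two invented sources name the same customer key differently ("cust_id" vs. "customer_id"), and the loop below matches records across them. All names and values are assumptions for illustration only.

```python
# Source 1: a CRM-style lookup keyed by customer id.
crm = {1: {"cust_id": 1, "name": "Ann"},
       2: {"cust_id": 2, "name": "Bob"}}
# Source 2: billing rows that call the same key "customer_id".
billing = [{"customer_id": 1, "balance": 120.0},
           {"customer_id": 2, "balance": 0.0}]

integrated = []
for row in billing:
    key = row["customer_id"]       # object matching: customer_id maps to cust_id
    merged = dict(crm[key])        # copy the CRM record
    merged["balance"] = row["balance"]
    integrated.append(merged)

print(integrated[0])  # {'cust_id': 1, 'name': 'Ann', 'balance': 120.0}
```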

    Data Transformation

    Data transformation can be achieved in the following ways:

    Smoothing: which works to remove noise from the data


    Aggregation: where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute weekly and annual totals.

    Generalization of the data: where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.

    Normalization: where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

    Attribute construction: where new attributes are constructed and added from the given set of attributes to help the mining process.
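    The normalization step above, in its min-max form, can be sketched as follows; the income values and function name are invented for illustration.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Rescale each value linearly from [min, max] into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max(incomes))  # smallest maps to 0.0, largest to 1.0
```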

    Data Reduction techniques

    These are techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

    1) Data cube aggregation

    2) Attribute subset selection

    3) Dimensionality reduction

    4) Numerosity reduction

    5) Discretization and concept hierarchy generation


    DATA REDUCTION :

    What is Data Reduction?

    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Mining on the reduced data set is therefore more efficient, while the integrity of the original data is closely maintained. The following techniques can be applied:

    1) Data cube aggregation

    2) Attribute subset selection

    3) Dimensionality reduction

    4) Numerosity reduction

    5) Discretization and concept hierarchy generation
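    Numerosity reduction (technique 4 above) can be as simple as mining a random sample instead of the full data set. A minimal sketch with the standard library, using an invented stand-in population:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Stand-in for a large data set of one million tuples.
population = list(range(1_000_000))

# Simple random sampling without replacement: keep 0.1% of the tuples.
sample = random.sample(population, 1_000)
print(len(sample))  # 1000
```

    The sample can then be mined in place of the full data set, trading a small loss of precision for a large gain in speed.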


    DATA MINING STATISTICS :


    DATA MINING TECHNIQUES :

    Many different data mining, query model, processing model, and data collection techniques are available. Which one should you use to mine your data, and which can you use in combination with your existing software and infrastructure? Examine different data mining and analytics techniques and solutions, and learn how to build them using existing software and installations. Explore the different data mining tools that are available, and learn how to determine whether the size and complexity of your information might result in processing and storage complexities, and what to do about it.


    This overview provides a description of some of the most common data mining algorithms in use today. We have broken the discussion into two sections, each with a specific theme:

    Classical Techniques: Statistics, Neighborhoods and Clustering
    Next Generation Techniques: Trees, Networks and Rules

    I. Classical Techniques: Statistics, Neighborhoods and Clustering

    1.1. The Classics

    These two sections are broken up based on when each data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades; the next section covers techniques that have only been widely used since the early 1980s.

    This section should help the user understand the rough differences between the techniques, and provide at least enough information to be dangerous and well armed enough not to be baffled by the vendors of different data mining tools.

    The main techniques discussed here are the ones used 99.9% of the time on existing business problems. There are certainly many others, as well as proprietary techniques from particular vendors, but in general the industry is converging on those techniques that work consistently and are understandable and explainable.

    1.2. Statistics

    By strict definition, "statistics" or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. From the user's perspective, you will be faced with a conscious choice, when solving a "data mining" problem, as to whether to attack it with statistical methods or other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied.

    What is different between statistics and data mining?

    I flew the Boston to Newark shuttle recently and sat next to a professor from one of the Boston-area universities. He was going to discuss the drosophila (fruit fly) genetic makeup with a pharmaceutical company in New Jersey. He had compiled the world's largest database on the genetic makeup of the fruit fly and had made it available to other researchers on the internet through Java applications accessing a larger relational database.

    He explained to me that they were now not only storing the information on the flies but also doing "data mining", adding as an aside, "which seems to be very important these days, whatever that is". I mentioned that I had written a book on the subject, and he was interested in knowing what the difference was between "data mining" and statistics. There was no easy answer.

    The techniques used in data mining, when successful, are successful for precisely the same reasons that statistical techniques are successful (e.g. clean data, a well-defined target to predict, and good validation to avoid overfitting). For the most part, the techniques are used in the same places for the same types of problems (prediction, classification, discovery). In fact, some of the techniques classically defined as "data mining", such as CART and CHAID, arose from statisticians.

    So what is the difference? Why aren't we as excited about "statistics" as we are about data mining? There are several reasons. The first is that classical data mining techniques such as CART, neural networks and nearest neighbor tend to be more robust both to messier real-world data and to being used by less expert users. But that is not the only reason; the other is that the time is right. Because of the use of computers for closed-loop business data storage and generation, there now exist large quantities of data available to users. If there were no data, there would be no interest in mining it. Likewise, the fact that computer hardware has dramatically upped the ante by several orders of magnitude in storing and processing data makes some of the most powerful data mining techniques feasible today.

    1.3. Nearest Neighbor

    Clustering and the nearest neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that, to predict the prediction value of one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is nearest to the unclassified record.

    A simple example of clustering

    A simple example of clustering would be the clustering most people perform when they do the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes - because these groups have similar characteristics. And it turns out they have important attributes in common in the way they behave (and can be ruined) in the wash. To cluster your laundry, most of your decisions are relatively straightforward. There are of course difficult decisions to be made about which cluster your white shirt with red stripes goes into (since it is mostly white but has some color and is permanent press). When clustering is used in business, the clusters are often much more dynamic - even changing weekly to monthly - and many more of the decisions concerning which cluster a record falls into can be difficult.

    A simple example of nearest neighbor

    A simple example of the nearest neighbor prediction algorithm is to look at the people in your neighborhood (in this case those people who are in fact geographically near to you). You may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has an income greater than $100,000, chances are good that you too have a high income. Certainly the chances that you have a high income are greater when all of your neighbors have incomes over $100,000 than when all of your neighbors have incomes of $20,000. Within your neighborhood there may still be a wide variety of incomes possible among even your closest neighbors, but if you had to predict someone's income based only on knowing their neighbors, your best chance of being right would be to predict the incomes of the neighbors who live closest to the unknown person.


    The nearest neighbor prediction algorithm works in very much the same way, except that "nearness" in a database may consist of a variety of factors, not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. The better definition of "near" might in fact be the other people you graduated from college with rather than the people you live next to.

    Nearest neighbor techniques are among the easiest to use and understand because they work in a way similar to the way people think - by detecting closely matching examples. They also perform quite well in terms of automation, as many of the algorithms are robust with respect to dirty and missing data. Lastly, they are particularly adept at performing complex ROI calculations, because the predictions are made at a local level where business simulations can be performed in order to optimize ROI. As they enjoy levels of accuracy similar to other techniques, measures of accuracy such as lift are as good as those from any other.

    How to use Nearest Neighbor for Prediction

    One of the essential elements underlying the concept of clustering is that one particular object (whether cars, food or customers) can be closer to another object than some third object is. It is interesting that most people have an innate sense of ordering placed on a variety of different objects. Most people would agree that an apple is closer to an orange than it is to a tomato, and that a Toyota Corolla is closer to a Honda Civic than to a Porsche. This sense of ordering on many different objects helps us place them in time and space and make sense of the world. It is what allows us to build clusters - both in databases on computers and in our daily lives. This definition of nearness, which seems to be ubiquitous, also allows us to make predictions.

    The nearest neighbor prediction algorithm simply stated is:

    Objects that are near to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbors.
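    The algorithm stated above can be sketched as a minimal k-nearest-neighbor classifier. The training records and labels below are invented for illustration, and Euclidean distance stands in for whatever notion of "nearness" the application actually defines.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort the training records by Euclidean distance to the query,
    # keep the k nearest, and return the majority label among them.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.0, 8.5), "high"), ((8.5, 8.0), "high")]

print(knn_predict(train, (8.2, 8.7)))  # high: its 3 nearest records are all "high"
```

    Replacing the majority vote with an average of the neighbors' values turns the same sketch into a nearest-neighbor regressor, e.g. for the income prediction example above.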