Advance Concept in Data Bases Unit-5 by Arun Pratap Singh

Posted on 03-Jun-2018

  • 8/12/2019 Advance Concept in Data Bases Unit-5 by Arun Pratap Singh


PREPARED BY ARUN PRATAP SINGH, M.Tech 2nd SEMESTER


    DESIGN OF DATA WAREHOUSE :

The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that a data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data. This data helps in supporting the decision-making process of analysts in an organization.

The operational database undergoes per-day transactions, which cause frequent changes to the data on a daily basis. But if in future a business executive wants to analyse previous feedback on any data, such as product, supplier, or consumer data, the analyst will have no data available to analyse, because the previous data has been updated by those transactions.

Data warehouses provide us generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, data warehouses also provide us Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

    Understanding Data Warehouse

The data warehouse is a database which is kept separate from the organization's operational database.

There is no frequent updating done in a data warehouse.

A data warehouse possesses consolidated historical data, which helps the organization analyse its business.

A data warehouse helps executives to organize, understand and use their data to take strategic decisions.

Data warehouse systems help in the integration of a diversity of application systems.

A data warehouse system allows analysis of consolidated historical data.

Definition : A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process.

Why Data Warehouses Are Kept Separate from Operational Databases

The following are the reasons why data warehouses are kept separate from operational databases:

The operational database is constructed for well-known tasks and workloads such as searching particular records, indexing etc., but data warehouse queries are often complex and present a general form of data.

    UNIT : V


Operational databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

Operational database queries allow read and modify operations, while an OLAP query needs only read-only access to stored data.

Operational databases maintain current data; on the other hand, a data warehouse maintains historical data.

    Data Warehouse Features

The key features of a data warehouse, namely Subject Oriented, Integrated, Nonvolatile and Time-Variant, are discussed below:

Subject Oriented - The data warehouse is subject oriented because it provides us information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery or concurrency control, because it is physically stored separately from the operational database.

    Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyse and use their data for decision making. A data warehouse serves as a central part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services

Banking services

Consumer goods

Retail sectors

Controlled manufacturing

    Data Warehouse Types

Information processing, analytical processing and data mining are the three types of data warehouse applications, discussed below:


Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up (roll-up), and pivoting.

Data mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
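The basic OLAP operations mentioned above can be sketched with an ordinary dataframe library. The sketch below uses pandas on a small made-up sales table; the column names and figures are illustrative, not taken from the text:

```python
import pandas as pd

# Hypothetical sales records: three dimensions (quarter, item, location) and one measure.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["keyboard", "mouse", "keyboard", "mouse", "keyboard", "mouse"],
    "location": ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "units":    [120, 200, 150, 180, 90, 110],
})

# Slice: fix one dimension to a single value (location = Delhi).
delhi = sales[sales["location"] == "Delhi"]

# Roll-up (drill-up): aggregate away a dimension, here summarising over item.
by_quarter = delhi.groupby("quarter")["units"].sum()

# Drill-down: move to finer detail by adding a dimension back.
by_quarter_item = delhi.groupby(["quarter", "item"])["units"].sum()

# Pivot: rotate the axes to compare items across quarters side by side.
pivoted = delhi.pivot_table(index="item", columns="quarter",
                            values="units", aggfunc="sum")
```

Each operation changes only how the same facts are grouped and displayed, which is why OLAP tools can offer them interactively.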

SN - Data Warehouse (OLAP) vs Operational Database (OLTP)

1. OLAP involves historical processing of information; OLTP involves day-to-day processing.

2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs, or database professionals.

3. OLAP is used to analyse the business; OLTP is used to run the business.

4. OLAP focuses on information out; OLTP focuses on data in.

5. OLAP is based on the Star Schema, Snowflake Schema and Fact Constellation Schema; OLTP is based on the Entity Relationship Model.

6. OLAP is subject oriented; OLTP is application oriented.

7. OLAP contains historical data; OLTP contains current data.

8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.

9. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.

10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.

11. The number of records accessed by OLAP is in the millions; the number of records accessed by OLTP is in the tens.

12. OLAP database size ranges from 100 GB to TB; OLTP database size ranges from 100 MB to GB.

13. OLAP is highly flexible; OLTP provides high performance.


    What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration and data consolidation.

    Using Data Warehouse Information

There are decision-support technologies available which help utilize the data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather the data, analyse it and take decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

Tuning production strategies - Product strategies can be well tuned by repositioning products and managing product portfolios by comparing sales quarterly or yearly.

Customer analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles etc.

Operations analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.

In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting, such as annual and quarterly comparisons.

The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc.). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.


Data warehouses support business decisions by collecting, consolidating, and organizing data for reporting and analysis with tools such as online analytical processing (OLAP) and data mining.

Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing (OLTP) system database.

    Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data warehouse differs from that of an OLTP database, the design characteristics of a relational database that supports a data warehouse differ from the design characteristics of an OLTP database.

    A Data Warehouse Supports OLTP-

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database.

Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database.

If data is simply archived for preservation, it is not available or organized for use by analysts and decision makers. If data is allowed to accumulate in the OLTP database so it can be used for analysis, the


OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.

A data warehouse offloads the historical data from the OLTP database, allowing the OLTP system to operate at peak transaction efficiency. High-volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP system, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.

    OLAP is a Data Warehouse Tool-

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses.

A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. For example, a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less, regardless of how many hundreds of millions of rows of data are stored in the data warehouse database.

Data warehouse database vs OLTP database:

Designed for analysis of business measures by categories and attributes, vs designed for real-time business operations.

Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table, vs optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.

Loaded with consistent, valid data and requiring no real-time validation, vs optimized for validation of incoming data during transactions, using validation data tables.

Supports few concurrent users relative to OLTP, vs supports thousands of concurrent users.


OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high-volume update transactions. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries.

In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries.


    Data Warehouse Tools and Utilities Functions

    The following are the functions of Data Warehouse tools and Utilities:

Data Extraction - Data extraction involves gathering the data from multiple heterogeneous sources.

    Data Cleaning- Data Cleaning involves finding and correcting the errors in data.


Data Transformation - Data transformation involves converting data from a legacy format to the warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

Refreshing - Refreshing involves updating from the data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
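A minimal, illustrative pipeline covering these functions might look as follows in Python with pandas; the source tables, column names and cleaning rules are invented for the sketch:

```python
import pandas as pd

# Extraction: gather rows from two hypothetical heterogeneous sources.
source_a = pd.DataFrame({"cust": ["alice", "BOB"], "amount": ["100", "250"]})
source_b = pd.DataFrame({"cust": ["carol", None], "amount": ["75", "40"]})
extracted = pd.concat([source_a, source_b], ignore_index=True)

# Cleaning: drop records with a missing key and normalise casing.
cleaned = extracted.dropna(subset=["cust"]).copy()
cleaned["cust"] = cleaned["cust"].str.lower()

# Transformation: convert legacy string amounts to the warehouse's numeric format.
cleaned["amount"] = cleaned["amount"].astype(float)

# Loading: sort, summarise, and build an index as the final load step.
warehouse = cleaned.sort_values("cust").set_index("cust")
total = warehouse["amount"].sum()
```

A real ETL tool would add logging, incremental refresh and error handling around these same steps.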

    Data Warehouse :

A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process. Let's explore this definition of the data warehouse.

Subject Oriented - The data warehouse is subject oriented because it provides us information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Metadata - Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

In terms of the data warehouse, we can define metadata as follows:

Metadata is a road map to the data warehouse.

Metadata in a data warehouse defines the warehouse objects.

Metadata acts as a directory. This directory helps the decision support system locate the contents of the data warehouse.

Metadata Repository :

The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

    Business Metadata- This metadata has the data ownership information, business definition andchanging policies.

Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it migrated and the transformations applied to it.


Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh and purging rules.

    The algorithms for summarization- This includes dimension algorithms, data on granularity,aggregation, summarizing etc.
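One way to picture the repository is as a simple nested structure. The categories below follow the list above, but every field name and value is illustrative, not a standard schema:

```python
# A toy metadata repository as a nested dictionary.
metadata_repository = {
    "business": {
        "owner": "sales department",
        "definition": "quarterly revenue by product line",
        "change_policy": "reviewed each quarter",
    },
    "operational": {
        "currency": "active",            # active, archived, or purged
        "lineage": ["crm_db", "cleaned", "loaded 2014-03-01"],
    },
    "mapping": {
        "source": "crm_db.orders",
        "transformation_rules": ["trim names", "amounts to float"],
        "refresh": "nightly",
    },
    "summarization": {
        "granularity": "per quarter",
        "aggregation": "sum of order amounts",
    },
}

def locate(category, field):
    """Directory-style lookup, as a decision-support tool might use it."""
    return metadata_repository[category][field]
```

The point of the sketch is the directory role: a tool asks the repository where data came from and how it was summarized, rather than inspecting the warehouse tables directly.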

    Data cube :

A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

    Illustration of Data cube

Suppose a company wants to keep track of sales records, with the help of a sales data warehouse, with respect to time, item, branch and location. These dimensions allow it to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension. This table is known as a dimension table. The dimension table further describes the dimensions. For example, the "item" dimension table may have attributes such as item_name, item_type and item_brand.

The following table represents a 2-D view of sales data for a company with respect to the time, item and location dimensions.

But here in this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of item sold. If we want to view the sales data with one more dimension, say the location dimension, then the 3-D view of the sales data with respect to time, item, and location is shown in the table below:


The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:
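Assuming a small synthetic fact table, the 2-D view and the 3-D cube described above can be sketched with pandas (all item names and figures are invented):

```python
import pandas as pd

# Hypothetical fact records with three dimensions: time, item, location.
facts = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["phone", "modem", "phone", "modem", "phone", "modem"],
    "location":   ["New Delhi", "New Delhi", "New Delhi",
                   "New Delhi", "Kolkata", "Kolkata"],
    "units_sold": [25, 10, 30, 12, 18, 9],
})

# 2-D view: fix location = New Delhi, tabulate time against item.
view_2d = facts[facts["location"] == "New Delhi"].pivot_table(
    index="time", columns="item", values="units_sold", aggfunc="sum")

# 3-D cube: keep all three dimensions; each cell is one
# (time, item, location) total, i.e. one cell of the cube.
cube_3d = facts.groupby(["time", "item", "location"])["units_sold"].sum()
```

Each extra dimension multiplies the number of cells, which is why real cubes are stored and preaggregated by dedicated OLAP servers rather than as flat tables.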

    DATA MART :

A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group within an organization. In other words, we can say that a data mart contains only data which is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.
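As a toy sketch, carving a marketing data mart out of an organization-wide table is just a projection onto the subjects that group needs; the columns below are invented for illustration:

```python
import pandas as pd

# An organization-wide warehouse table (columns are illustrative).
warehouse = pd.DataFrame({
    "item":     ["pen", "ink", "pen"],
    "customer": ["c1", "c2", "c1"],
    "sales":    [10.0, 5.0, 7.5],
    "hr_grade": ["A", "B", "A"],     # unrelated to marketing
    "payroll":  [100, 200, 100],     # unrelated to marketing
})

# The marketing data mart keeps only the subject areas marketing needs.
marketing_mart = warehouse[["item", "customer", "sales"]].copy()
```

The `.copy()` reflects the usual deployment: the mart is a separate physical extract the department can reshape without touching the warehouse.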


    Points to remember about data marts:

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in the long run if its planning and design are not organization-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.

    Graphical Representation of data mart.

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.[1] This enables each department to use, manipulate and develop their data any way they see fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.


The reason organizations build data warehouses and data marts is that the information in the database is not organized in a way that makes it easy for organizations to find what they need. Also, complicated queries might take a long time to answer what people want to know, since the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated; however, data warehouses or marts are read-only. Data warehouses are designed to access large groups of related records.

Data marts improve end-user response time by allowing users to have access to the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.

A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span across many or all of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.).

    Reasons for creating a data mart :

Easy access to frequently needed data

Creates a collective view for a group of users

Improves end-user response time

Ease of creation

Lower cost than implementing a full data warehouse

Potential users are more clearly defined than in a full data warehouse

Contains only business-essential data and is less cluttered


    DEPENDENT DATA MART :

According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:

A need for a special data model or schema: e.g., to restructure for OLAP.

Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.

Security: to separate an authorized data subset selectively.

Expediency: to bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse.

Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse.

Politics: a coping strategy for IT (information technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.

Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and an inability to leverage enterprise sources of data.

The alternative school of data warehousing is that of Ralph Kimball. In his view, a data warehouse is nothing more than the union of all the data marts. This view helps to reduce costs and provides fast development, but can create an inconsistent data warehouse, especially in large organizations. Therefore, Kimball's approach is more suitable for small-to-medium corporations.


    Virtual Warehouse :

A view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but building it requires excess capacity on the operational database servers.

    PROCESS FLOW IN DATA WAREHOUSE :

There are four major processes that build a data warehouse:

Extracting and loading the data.

Cleaning and transforming the data.

Backing up and archiving the data.

Managing queries and directing them to the appropriate data sources.

    Extract and Load Process

Data extraction takes data from the source systems.

Data loading takes the extracted data and loads it into the data warehouse.

Note: Before loading data into the data warehouse, the information extracted from external sources must be reconstructed.

Points to remember about the extract and load process:

    Controlling the process

    When to Initiate Extract


    Loading the Data

    CONTROLLING THE PROCESS

Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.

    WHEN TO INITIATE EXTRACT

Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
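The cutoff idea can be sketched in a few lines of Python; the timestamps and record layout are invented for the example:

```python
from datetime import datetime

# Hypothetical rows from two source systems, each stamped with a load time.
customers = [
    {"id": 1, "loaded_at": datetime(2014, 3, 4, 19, 0)},   # Tuesday evening
    {"id": 2, "loaded_at": datetime(2014, 3, 5, 19, 30)},  # Wednesday evening
]
subscriptions = [
    {"cust_id": 1, "loaded_at": datetime(2014, 3, 4, 18, 0)},
]

# A single cutoff applied to every source keeps the extract consistent:
# no customer appears without the subscription events known at that instant.
cutoff = datetime(2014, 3, 4, 20, 0)   # Tuesday 8 pm, for both sources

extract_customers = [c for c in customers if c["loaded_at"] <= cutoff]
extract_subs = [s for s in subscriptions if s["loaded_at"] <= cutoff]
```

Because customer 2 arrived after the cutoff, it is excluded rather than extracted without its matching subscription data.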

    LOADING THE DATA

After extracting the data, it is loaded into a temporary data store. Here in the temporary data store it is cleaned up and made consistent.

    Note: Consistency checks are executed only when all data sources have been loaded intotemporary data store.

    Clean and Transform Process

Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved in cleaning and transforming:

    Clean and Transform the loaded data into a structure.

    Partition the data.

    Aggregation

    CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE

This will speed up the queries. It can be done in the following ways:

    Make sure data is consistent within itself.

    Make sure data is consistent with other data within the same data source.

    Make sure data is consistent with data in other source systems.

    Make sure data is consistent with data already in the warehouse.
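A toy version of these checks over a temporary data store might look like this; the rules, field names and rejected rows are all illustrative:

```python
# Staged rows in the temporary data store (a list of dicts).
staged = [
    {"order_id": 1, "cust": "alice", "amount": 40.0},   # passes all checks
    {"order_id": 2, "cust": "bob",   "amount": -5.0},   # inconsistent within itself
    {"order_id": 3, "cust": "eve",   "amount": 10.0},   # unknown in other sources
    {"order_id": 4, "cust": "carol", "amount": 20.0},   # already in the warehouse
]
known_customers = {"alice", "bob", "carol"}   # from another source system
warehouse_order_ids = {4}                     # already loaded earlier

def is_consistent(row):
    return (row["amount"] >= 0                              # within itself
            and row["cust"] in known_customers              # across source systems
            and row["order_id"] not in warehouse_order_ids) # against the warehouse

accepted = [r for r in staged if is_consistent(r)]
```

Only rows that pass every check move on from the temporary data store into the warehouse; the rest are held back for correction.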


Transforming involves converting the source data into a structure. Structuring the data results in increased query performance and decreased operational cost. Information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.

    PARTITION THE DATA

Partitioning optimizes hardware performance and simplifies the management of the data warehouse. In this step we partition each fact table into multiple separate partitions.

    AGGREGATION

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyse a subset or an aggregation of the detailed data.
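Both steps can be sketched on a toy fact table with pandas; the partition boundaries (by month) and the aggregate chosen are illustrative:

```python
import pandas as pd

# A small fact table with a time dimension (synthetic data).
fact_sales = pd.DataFrame({
    "month":  ["2014-01", "2014-01", "2014-02", "2014-02"],
    "store":  ["s1", "s2", "s1", "s2"],
    "amount": [100.0, 150.0, 120.0, 80.0],
})

# Partitioning: split the fact table into separate per-month partitions,
# as a warehouse might split it across files or tablespaces.
partitions = {month: grp for month, grp in fact_sales.groupby("month")}

# Aggregation: precompute a monthly summary so common queries avoid
# scanning the detailed rows every time.
monthly_totals = fact_sales.groupby("month")["amount"].sum()
```

A query for one month then touches only that month's partition, and a "total per month" query reads the small precomputed table instead of the full fact table.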

    Backup and Archive the data

In order to recover the data in the event of data loss, software failure or hardware failure, it is necessary to back it up on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. In that case we require some data to be restored from the archive.

    Query Management Process

    This process performs the following functions:

    It manages the queries.

    It speeds up query execution.

    It directs queries to the most effective data sources.

    It ensures that all system resources are used in the most effective way.

    It monitors actual query profiles.

    Information gathered by this process is used by the warehouse management process to determine which aggregations to generate.

    This process does not generally operate during the regular load of information into the data warehouse.


    THREE-TIER DATA WAREHOUSE ARCHITECTURE :

    Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.

    Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These tools and utilities perform the Extract, Clean, Load, and Refresh functions.

    Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:

    o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.

    o By the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.

    Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools.

    The following diagram explains the three-tier architecture of a data warehouse:


    OLAP :

    Introduction

    An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to gain insight into information through fast, consistent, interactive access to it. In this chapter we will discuss the types of OLAP servers, OLAP operations, and the difference between OLAP and OLTP.

    Types of OLAP Servers

    We have four types of OLAP servers that are listed below.

    Relational OLAP (ROLAP)

    Multidimensional OLAP (MOLAP)

    Hybrid OLAP (HOLAP)

    Specialized SQL Servers

    Relational OLAP (ROLAP)

    Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

    ROLAP includes the following:

    implementation of aggregation navigation logic.

    optimization for each DBMS back end.

    additional tools and services.

    Multidimensional OLAP (MOLAP)

    Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use a two-level data storage representation to handle dense and sparse data sets.
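    The sparsity problem can be illustrated with a quick back-of-the-envelope sketch (the dimension sizes are made up): a dense array must allocate every cell combination, while a sparse map stores only the populated cells:

```python
# Sketch of why sparse multidimensional data wastes dense storage: a dense
# array allocates every (item, city, quarter) cell, while a sparse map keeps
# only the cells that actually hold a value. Sizes are illustrative.

n_items, n_cities, n_quarters = 1000, 100, 4
dense_cells = n_items * n_cities * n_quarters   # every combination allocated

# Sparse representation: only cells with actual sales are stored.
sparse_cube = {
    (0, 0, 0): 120,     # (item, city, quarter) -> units sold
    (0, 1, 2): 80,
    (999, 99, 3): 5,
}

occupancy = len(sparse_cube) / dense_cells      # fraction of cells populated
```

    With occupancy this low, a dense layout pays for storage it never uses, which is why MOLAP engines treat dense and sparse regions differently.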

    Hybrid OLAP (HOLAP)

    The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server can store large volumes of detail data; the aggregations are stored separately in the MOLAP store.


    Specialized SQL Servers

    Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

    OLAP Operations

    Since the OLAP server is based on the multidimensional view of data, we will discuss OLAP operations on multidimensional data.

    Here is the list of OLAP operations:

    Roll-up

    Drill-down

    Slice and dice

    Pivot (rotate)

    ROLL-UP

    This operation performs aggregation on a data cube in either of the following ways:

    By climbing up a concept hierarchy for a dimension

    By dimension reduction.

    Consider the following diagram showing the roll-up operation.


    The roll-up operation is performed by climbing up a concept hierarchy for the dimension location.

    Initially the concept hierarchy was "street < city < province < country".

    On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

    The data is grouped into countries rather than cities.

    When the roll-up operation is performed, one or more dimensions are removed from the data cube.
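    As a minimal sketch (with illustrative figures), rolling up location from city to country means re-aggregating the measure under the city-to-country concept hierarchy:

```python
# Sketch of roll-up: climbing the location hierarchy from city to country.
# The city -> country mapping and the sales figures are illustrative.
from collections import defaultdict

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "New York": "USA", "Chicago": "USA"}

sales_by_city = {"Vancouver": 100, "Toronto": 200, "New York": 300, "Chicago": 400}

sales_by_country = defaultdict(int)
for city, units in sales_by_city.items():
    sales_by_country[city_to_country[city]] += units
```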

    DRILL-DOWN

    The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:

    By stepping down a concept hierarchy for a dimension.


    By introducing a new dimension.

    Consider the following diagram showing the drill-down operation:

    The drill-down operation is performed by stepping down a concept hierarchy for the dimension time.

    Initially the concept hierarchy was "day < month < quarter < year".

    On drilling down, the time dimension descends from the level of quarter to the level of month.

    When the drill-down operation is performed, one or more dimensions are added to the data cube.

    It navigates the data from less detailed data to highly detailed data.
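    A minimal sketch of drill-down (with illustrative data): because the cube keeps detail at the month level, a quarter-level figure can be opened up into the months that compose it:

```python
# Sketch of drill-down: the cube keeps detail at month level, so a quarterly
# view can be descended back to the months that make it up.
from collections import defaultdict

month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
sales_by_month = {"Jan": 10, "Feb": 20, "Mar": 30, "Apr": 40}

# Roll-up view at quarter level:
sales_by_quarter = defaultdict(int)
for month, units in sales_by_month.items():
    sales_by_quarter[month_to_quarter[month]] += units

# Drill-down on a quarter: descend from quarter back to the month level.
def drill_down(quarter):
    return {m: u for m, u in sales_by_month.items()
            if month_to_quarter[m] == quarter}

q1_detail = drill_down("Q1")
```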


    SLICE

    The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram showing the slice operation.

    The slice operation is performed for the dimension time using the criterion time = "Q1".

    It will form a new sub cube by selecting one or more dimensions.
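    A minimal sketch (with illustrative data): slicing on time = "Q1" keeps only the cells with that time value and drops the time coordinate, leaving a sub-cube over the remaining dimensions:

```python
# Sketch of slice: fixing the time dimension at "Q1" yields a sub-cube
# over the remaining (location, item) dimensions. Cell values are illustrative.
cube = {
    ("Q1", "Toronto",   "Mobile"): 100,
    ("Q1", "Vancouver", "Modem"): 50,
    ("Q2", "Toronto",   "Mobile"): 120,
}

def slice_cube(cube, time_value):
    # Keep only cells where time == time_value; drop the time coordinate.
    return {(loc, item): v
            for (t, loc, item), v in cube.items() if t == time_value}

q1_slice = slice_cube(cube, "Q1")
```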

    DICE

    The dice operation selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram showing the dice operation:


    The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

    (location = "Toronto" or "Vancouver")

    (time = "Q1" or "Q2")

    (item = "Mobile" or "Modem")
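    These criteria can be sketched directly (the cell values are illustrative): a cell survives the dice only if all three dimension predicates hold:

```python
# Sketch of dice: keep a cell only when every dimension predicate holds.
cube = {
    ("Q1", "Toronto",   "Mobile"): 100,
    ("Q2", "Vancouver", "Modem"): 50,
    ("Q3", "Toronto",   "Mobile"): 70,   # fails the time predicate
    ("Q1", "New York",  "Mobile"): 90,   # fails the location predicate
}

def dice(cube):
    return {
        key: v for key, v in cube.items()
        if key[0] in ("Q1", "Q2")
        and key[1] in ("Toronto", "Vancouver")
        and key[2] in ("Mobile", "Modem")
    }

sub_cube = dice(cube)
```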

    PIVOT

    The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.


    Here the item and location axes of the 2-D slice are rotated.
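    A minimal sketch of pivot (with illustrative data): rotating a 2-D slice swaps the row and column axes while leaving the cell values unchanged:

```python
# Sketch of pivot: rotate a 2-D slice so rows (item) and columns (location)
# swap axes; the cell values themselves do not change.
table = {
    "Mobile": {"Toronto": 100, "Vancouver": 50},
    "Modem":  {"Toronto": 30,  "Vancouver": 20},
}

def pivot(t):
    rotated = {}
    for row_key, cols in t.items():
        for col_key, v in cols.items():
            rotated.setdefault(col_key, {})[row_key] = v
    return rotated

rotated = pivot(table)
```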

    OLAP vs OLTP

    Differences between a data warehouse (OLAP) and an operational database (OLTP):

    1. OLAP involves historical processing of information; OLTP involves day-to-day processing.

    2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs, or database professionals.

    3. OLAP is used to analyse the business; OLTP is used to run the business.

    4. OLAP focuses on information out; OLTP focuses on data in.

    5. OLAP is based on the Star, Snowflake and Fact Constellation schemas; OLTP is based on the Entity-Relationship model.

    6. OLAP is subject oriented; OLTP is application oriented.

    7. OLAP contains historical data; OLTP contains current data.

    8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.

    9. OLAP provides a summarized, multidimensional view of data; OLTP provides a detailed, flat relational view.

    10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.

    11. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.

    12. The OLAP database size ranges from 100 GB to TB; the OLTP database size ranges from 100 MB to GB.

    13. OLAP systems are highly flexible; OLTP systems provide high performance.

    CONCEPTUAL MODELING OF DATA WAREHOUSES :

    Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. Dimensional modeling has two basic concepts.

    Facts:

    A fact is a collection of related data items, consisting of measures.

    A fact is a focus of interest for the decision making process.

    Measures are continuously valued attributes that describe facts.


    A fact is a business measure.

    Dimension:

    The parameter over which we want to perform analysis of facts.

    The parameter that gives meaning to a measure; for example, "number of customers" is a fact, and we may perform analysis of it over the time dimension.

    Dimensional modeling has also emerged as the only coherent architecture for building distributed data warehouse systems. More complex questions for the warehouse may involve three or more dimensions.

    This is where the multidimensional database plays a significant role in analysis. Dimensions are categories by which summarized data can be viewed. Cubes are data processing units composed of fact tables and dimensions from the data warehouse.

    Multi-Dimensional Modeling

    Multidimensional database technology has come a long way since its inception more than 30 years ago. It has recently begun to reach the mass market, with major vendors now delivering multidimensional engines along with their relational database offerings, often at no extra cost. Multidimensional technology has also made significant gains in scalability and maturity.

    The multidimensional data model emerged for use when the objective is to analyze rather than to perform on-line transactions.

    The multidimensional model is based on three key concepts:

    Modeling business rules

    Cube and measures

    Dimensions

    Multidimensional database technology is a key factor in the interactive analysis of large amounts of data for decision-making purposes. The multidimensional data model is introduced based on relational elements. Dimensions are modeled as dimension relations.

    Query languages for these systems are similar to structured query language, but they cannot treat all dimensions and measures symmetrically. The definition of a multidimensional schema describes multiple levels along a dimension, and there is at least one key attribute in each level that is included in the keys of the star schema in relational systems. Multidimensional databases enable end-users to model data in a multidimensional environment. This is a real product strength, as it provides the fastest, most flexible method to process multidimensional requests.


    The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. When realized in a database, the schema for a dimensional model contains a central fact table and multiple dimension tables. A dimensional model may produce a star schema or a snowflake schema.

    The schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Like a database, the data warehouse also requires a schema. A database uses the relational model, while a data warehouse uses the star, snowflake, and fact constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.

    STAR SCHEMA :

    In star schema each dimension is represented with only one dimension table.

    This dimension table contains the set of attributes.

    In the following diagram we have shown the sales data of a company with respect to four dimensions, namely time, item, branch, and location.


    There is a fact table at the centre. This fact table contains the keys to each of four dimensions.

    The fact table also contains the measures, namely dollars sold and units sold.

    Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia; the entries for such cities cause data redundancy along the attributes province_or_state and country.

    What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.
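    The star layout can be sketched with plain data structures (the keys and values are illustrative; in practice these would be relational tables): a central fact table holds foreign keys plus measures, and each dimension is one denormalized lookup table:

```python
# Sketch of a star schema in plain Python: a central fact table of foreign
# keys and measures, plus one denormalized table per dimension.
# Keys and attribute values are illustrative.

location_dim = {
    1: {"street": "1 Main St", "city": "Vancouver",
        "province_or_state": "British Columbia", "country": "Canada"},
    2: {"street": "9 King St", "city": "Victoria",
        "province_or_state": "British Columbia", "country": "Canada"},
}
time_dim = {10: {"quarter": "Q1"}, 11: {"quarter": "Q2"}}

# Fact table: foreign keys into the dimensions plus the measures.
fact_sales = [
    {"time_key": 10, "location_key": 1, "dollars_sold": 1000.0, "units_sold": 5},
    {"time_key": 10, "location_key": 2, "dollars_sold": 400.0,  "units_sold": 2},
]

# A star join: resolve each foreign key against its dimension table.
report = [
    (time_dim[f["time_key"]]["quarter"],
     location_dim[f["location_key"]]["city"],
     f["dollars_sold"])
    for f in fact_sales
]
```

    Note how the redundancy mentioned above is visible: both location rows repeat "British Columbia" and "Canada".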

    Fact Tables

    A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain fact data at the detail or aggregated level.

    Dimension Tables

    A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values. They are normally descriptive, textual values. Dimension tables are generally smaller in size than fact tables.

    Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.

    The main characteristics of a star schema:

    -> easy-to-understand schema

    -> small number of tables to join

    -> de-normalization; the redundant data can make the tables large.

    SNOWFLAKE SCHEMA :

    In Snowflake schema some dimension tables are normalized.

    The normalization split up the data into additional tables.

    Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

    Therefore the item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key.


    The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
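    A minimal sketch of this normalization (with illustrative keys): supplier attributes live in their own table, so resolving an item's supplier type takes one extra lookup step compared to a star schema:

```python
# Sketch of the snowflake normalization: the item dimension keeps only a
# supplier_key, and supplier attributes move to their own table.
# Keys and values are illustrative.

supplier_dim = {
    "S1": {"supplier_type": "wholesale"},
    "S2": {"supplier_type": "retail"},
}
item_dim = {
    100: {"item_name": "Mobile", "type": "electronics",
          "brand": "Acme", "supplier_key": "S1"},
    101: {"item_name": "Modem",  "type": "electronics",
          "brand": "Acme", "supplier_key": "S2"},
}

# Resolving an item's supplier type requires one extra join step.
def supplier_type(item_key):
    return supplier_dim[item_dim[item_key]["supplier_key"]]["supplier_type"]

t = supplier_type(100)
```

    The trade-off is less redundancy in the item table against an extra join at query time.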


    The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.

    The shipping fact table also contains two measures, namely dollars sold and units sold.

    It is also possible for dimension tables to be shared between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

    DATA MINING

    Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data.

    Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions.

    Introduction

    There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. Analysing this huge amount of data and extracting useful information from it is necessary.

    The extraction of information is not the only process we need to perform; it also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation, and Data Presentation. Once all these processes are over, we are in a position to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

    What is Data Mining

    Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. This information can be used for any of the following applications:

    Market Analysis

    Fraud Detection

    Customer Retention

    Production Control


    Science Exploration

    Need of Data Mining

    Here are the reasons listed below:

    In the field of information technology we have a huge amount of data available that needs to be turned into useful information.

    This information can further be used for various applications such as market analysis, fraud detection, customer retention, production control, science exploration, etc.

    Data Mining Applications

    Here is the list of applications of Data Mining:

    Market Analysis and Management

    Corporate Analysis & Risk Management

    Fraud Detection

    Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two kinds of functions involved in data mining, listed below:

    Descriptive

    Classification and Prediction

    Classification Criteria:

    Classification according to kind of databases mined

    Classification according to kind of knowledge mined

    Classification according to kinds of techniques utilized

    Classification according to applications adapted

    CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED

    We can classify the data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.


    CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED

    We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:

    Characterization

    Discrimination

    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis

    CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED

    We can classify the data mining system according to the kind of techniques used. These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

    CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED

    We can classify the data mining system according to the applications adapted. These applications are as follows:

    Finance

    Telecommunications

    DNA

    Stock Markets

    E-mail

    DATA MINING FUNCTIONALITIES :

    Characterization

    Discrimination


    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis


    DATA MINING SYSTEM CATEGORIZATION AND ITS ISSUES :

    Introduction

    There is a large variety of data mining systems available. A data mining system may integrate techniques from the following:

    Spatial Data Analysis


    Information Retrieval

    Pattern Recognition

    Image Analysis

    Signal Processing

    Computer Graphics

    Web Technology

    Business

    Bioinformatics

    Data Mining System Classification

    The data mining system can be classified according to the following criteria:

    Database Technology

    Statistics

    Machine Learning

    Information Science

    Visualization

    Other Disciplines


    Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

    Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

    Classification according to kind of databases mined

    Classification according to kind of knowledge mined

    Classification according to kinds of techniques utilized

    Classification according to applications adapted

  • 8/12/2019 Advance Concept in Data Bases Unit-5 by Arun Pratap Singh

    43/82

    PREPARED BY ARUN PRATAP SINGH 42

    42

    CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED :

    We can classify the data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

    A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

    CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED :

    We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:

    Characterization

    Discrimination

    Association and Correlation Analysis

    Classification

    Prediction

    Clustering

    Outlier Analysis

    Evolution Analysis

    Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.


    Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

    CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED :

    We can classify the data mining system according to the kind of techniques used. These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

    Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

    CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED :

    We can classify the data mining system according to the applications adapted. These applications are as follows:

    Finance

    Telecommunications

    DNA

    Stock Markets

    E-mail

    ISSUES IN DATA MINING :

    Introduction

    Data mining is not that easy. The algorithms used are very complex, and the data is not available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding:

    Mining Methodology and User Interaction

    Performance Issues

    Diverse Data Types Issues


    The following diagram describes the major issues.

    Mining Methodology and User Interaction Issues

    It refers to the following kinds of issues:

    Mining different kinds of knowledge in databases - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

    Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.

    Incorporation of background knowledge - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

    Data mining query languages and ad hoc data mining - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

    Presentation and visualization of data mining results - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

    Handling noisy or incomplete data - Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.

    Pattern evaluation - This refers to the interestingness of discovered patterns. Patterns should not be considered interesting if they merely represent common knowledge or lack novelty.


    Performance Issues

    It refers to the following issues:

    Efficiency and scalability of data mining algorithms - In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

    Parallel, distributed, and incremental mining algorithms - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update databases without having to mine the data again from scratch.

    Diverse Data Types Issues

    Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

    Mining information from heterogeneous databases and global information systems - The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining knowledge from them adds challenges to data mining.

    OTHER ISSUES IN DATA MINING :

    Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

    Security and social issues: Security is an important issue with any data collection that is shared and/or intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, understanding user behavior, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

    User interface issues: The knowledge discovered by data mining tools is useful only as long as it is interesting and, above all, understandable by the user. Good data visualization eases the interpretation of data mining results and helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data; however, much research remains to be done before we have good visualization tools for large datasets that can display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real estate", information rendering, and interaction. Interactivity with the data and the data mining results is crucial, since it provides a means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

    Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending on the data at hand. Moreover, different approaches may suit and solve users' needs differently.

    Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions and invalid or incomplete information, which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data and incomplete information.

    More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends on the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.

    Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining deals with today; terabyte sizes are common. This raises the issues of scalability and efficiency of data mining methods when processing considerably large data. Algorithms with exponential, or even medium-order polynomial, complexity are of no practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset, though concerns such as completeness and the choice of samples may arise. Other topics under the performance issue are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, and for updating data mining results when new data becomes available without having to re-analyze the complete dataset.

    Data source issues: There are many issues related to data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle, and we are still collecting it at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We store different types of data in a variety of repositories, and it is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types; a versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at both the structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

    DATA PROCESSING :

    What is the need for Data Processing?

    To get the required information from huge, incomplete, noisy, and inconsistent sets of data, it is necessary to use data processing.

    Steps in Data Processing:

    Data Cleaning

    Data Integration

    Data Transformation

    Data reduction

    Data Summarization

    What is Data Cleaning?

    Data cleaning is a procedure to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

    What is Data Integration?

    Combining multiple databases, data cubes, or files into a coherent store is called data integration.

    What is Data Transformation?

    Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.


    What is Data Reduction?

    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.

    What is Data Summarization?

    It is the process of representing the collected data in an accurate and compact way without losing information; it also involves deriving information from the collected data. For example, display the data as a graph and compute the mean, median, mode, etc.
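    The summary measures mentioned above can be computed with Python's standard statistics module; the sales figures below are invented purely for illustration.

```python
import statistics

# Hypothetical daily sales figures (illustrative only).
sales = [120, 135, 135, 150, 160, 135, 180]

print(statistics.mean(sales))    # arithmetic mean -> 145
print(statistics.median(sales))  # middle value of the sorted data -> 135
print(statistics.mode(sales))    # most frequent value -> 135
```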

    How to Clean Data?

    Handling Missing values

    Ignore the tuple

    Fill in the missing value manually

    Use a global constant to fill in the missing value

    Use the attribute mean to fill in the missing value

    Use the attribute mean for all samples belonging to the same class as the given tuple

    Use the most probable value to fill in the missing value.
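    One of the strategies above - using the attribute mean of all samples belonging to the same class - can be sketched in plain Python. The records, attribute names, and function name are invented for illustration.

```python
import statistics

# Hypothetical records; None marks a missing "income" value.
records = [
    {"class": "A", "income": 30000},
    {"class": "A", "income": None},
    {"class": "B", "income": 50000},
    {"class": "B", "income": 70000},
]

def fill_with_class_mean(rows, attr, cls_attr):
    # Collect the non-missing values of attr per class, then take their means.
    per_class = {}
    for row in rows:
        if row[attr] is not None:
            per_class.setdefault(row[cls_attr], []).append(row[attr])
    means = {c: statistics.mean(vals) for c, vals in per_class.items()}
    # Replace each missing value with the mean of its own class.
    for row in rows:
        if row[attr] is None:
            row[attr] = means[row[cls_attr]]
    return rows

fill_with_class_mean(records, "income", "class")
print(records[1]["income"])  # filled with the class "A" mean -> 30000
```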

    Handle Noisy Data

    Binning: Binning methods smooth a sorted data value by consulting its neighborhood.

    Regression: Data can be smoothed by fitting the data to a function, such as with regression.

    Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
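    As a rough sketch of the binning idea, here is one common variant: sort the values, split them into equal-depth bins, and replace every value by its bin's mean. The function name and price list are invented for illustration.

```python
def smooth_by_bin_means(values, depth):
    # Sort the data, then smooth each equal-depth bin to its mean.
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_ = ordered[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```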

    Data Integration :

    Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching; redundancy is another important issue.
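    A toy sketch of the schema-integration and object-matching step: two invented sources name the same customer key differently ("cust_id" vs. "customer_id"), and the loop below matches records across them. All names and values are assumptions for illustration only.

```python
# Source 1: a CRM-style lookup keyed by customer id.
crm = {1: {"cust_id": 1, "name": "Ann"},
       2: {"cust_id": 2, "name": "Bob"}}
# Source 2: billing rows that call the same key "customer_id".
billing = [{"customer_id": 1, "balance": 120.0},
           {"customer_id": 2, "balance": 0.0}]

integrated = []
for row in billing:
    key = row["customer_id"]       # object matching: customer_id maps to cust_id
    merged = dict(crm[key])        # copy the CRM record
    merged["balance"] = row["balance"]
    integrated.append(merged)

print(integrated[0])  # {'cust_id': 1, 'name': 'Ann', 'balance': 120.0}
```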

    Data Transformation

    Data transformation can be achieved in the following ways:

    Smoothing: which works to remove noise from the data


    Aggregation: where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute weekly and annual totals.

    Generalization of the data: where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.

    Normalization: where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

    Attribute construction: where new attributes are constructed and added from the given set of attributes to help the mining process.
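    The normalization step above, in its min-max form, can be sketched as follows; the income values and function name are invented for illustration.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Rescale each value linearly from [min, max] into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max(incomes))  # smallest maps to 0.0, largest to 1.0
```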

    Data Reduction techniques

    These are techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

    1) Data cube aggregation

    2) Attribute subset selection

    3) Dimensionality reduction

    4) Numerosity reduction

    5) Discretization and concept hierarchy generation


    DATA REDUCTION :

    What is Data Reduction?

    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Mining on the reduced data set is therefore more efficient, while the integrity of the original data is closely maintained. The following techniques can be applied:

    1) Data cube aggregation

    2) Attribute subset selection

    3) Dimensionality reduction

    4) Numerosity reduction

    5) Discretization and concept hierarchy generation
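    Numerosity reduction (technique 4 above) can be as simple as mining a random sample instead of the full data set. A minimal sketch with the standard library, using an invented stand-in population:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Stand-in for a large data set of one million tuples.
population = list(range(1_000_000))

# Simple random sampling without replacement: keep 0.1% of the tuples.
sample = random.sample(population, 1_000)
print(len(sample))  # 1000
```

    The sample can then be mined in place of the full data set, trading a small loss of precision for a large gain in speed.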


    DATA MINING STATISTICS :


    DATA MINING TECHNIQUES :

    Many different data mining, query model, processing model, and data collection techniques are available. Which one should you use to mine your data, and which can you use in combination with your existing software and infrastructure? Examine different data mining and analytics techniques and solutions, and learn how to build them using existing software and installations. Explore the different data mining tools that are available, and learn how to determine whether the size and complexity of your information might result in processing and storage complexities, and what to do about it.


    This overview provides a description of some of the most common data mining algorithms in use today. We have broken the discussion into two sections, each with a specific theme:

    Classical Techniques: Statistics, Neighborhoods and Clustering
    Next Generation Techniques: Trees, Networks and Rules

    I. Classical Techniques: Statistics, Neighborhoods and Clustering

    1.1. The Classics

    These two sections are broken up based on when each data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades; the next section covers techniques that have only been widely used since the early 1980s.

    This section should help the user understand the rough differences between the techniques, and provide at least enough information to be dangerous and well armed enough not to be baffled by the vendors of different data mining tools.

    The main techniques discussed here are the ones used 99.9% of the time on existing business problems. There are certainly many others, as well as proprietary techniques from particular vendors, but in general the industry is converging on those techniques that work consistently and are understandable and explainable.

    1.2. Statistics

    By strict definition, "statistics" or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. From the user's perspective, you will be faced with a conscious choice, when solving a "data mining" problem, as to whether to attack it with statistical methods or other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied.

    What is different between statistics and data mining?

    I flew the Boston to Newark shuttle recently and sat next to a professor from one of the Boston-area universities. He was going to discuss the drosophila (fruit fly) genetic makeup with a pharmaceutical company in New Jersey. He had compiled the world's largest database on the genetic makeup of the fruit fly and had made it available to other researchers on the internet through Java applications accessing a larger relational database.

    He explained to me that they were now not only storing the information on the flies but also doing "data mining", adding as an aside, "which seems to be very important these days, whatever that is". I mentioned that I had written a book on the subject, and he was interested in knowing what the difference was between "data mining" and statistics. There was no easy answer.

    The techniques used in data mining, when successful, are successful for precisely the same reasons that statistical techniques are successful (e.g. clean data, a well-defined target to predict, and good validation to avoid overfitting). For the most part, the techniques are used in the same places for the same types of problems (prediction, classification, discovery). In fact, some of the techniques classically defined as "data mining", such as CART and CHAID, arose from statisticians.

    So what is the difference? Why aren't we as excited about "statistics" as we are about data mining? There are several reasons. The first is that classical data mining techniques such as CART, neural networks and nearest neighbor tend to be more robust both to messier real-world data and to being used by less expert users. But that is not the only reason; the other is that the time is right. Because of the use of computers for closed-loop business data storage and generation, there now exist large quantities of data available to users. If there were no data, there would be no interest in mining it. Likewise, the fact that computer hardware has dramatically upped the ante by several orders of magnitude in storing and processing data makes some of the most powerful data mining techniques feasible today.

    1.3. Nearest Neighbor

    Clustering and the nearest neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that, to predict the prediction value of one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is nearest to the unclassified record.

    A simple example of clustering

    A simple example of clustering would be the clustering most people perform when they do the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes - because these groups have similar characteristics. And it turns out they have important attributes in common in the way they behave (and can be ruined) in the wash. To cluster your laundry, most of your decisions are relatively straightforward. There are of course difficult decisions to be made about which cluster your white shirt with red stripes goes into (since it is mostly white but has some color and is permanent press). When clustering is used in business, the clusters are often much more dynamic - even changing weekly to monthly - and many more of the decisions concerning which cluster a record falls into can be difficult.

    A simple example of nearest neighbor

    A simple example of the nearest neighbor prediction algorithm is to look at the people in your neighborhood (in this case those people who are in fact geographically near to you). You may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has an income greater than $100,000, chances are good that you too have a high income. Certainly the chances that you have a high income are greater when all of your neighbors have incomes over $100,000 than when all of your neighbors have incomes of $20,000. Within your neighborhood there may still be a wide variety of incomes possible among even your closest neighbors, but if you had to predict someone's income based only on knowing their neighbors, your best chance of being right would be to predict the incomes of the neighbors who live closest to the unknown person.


    The nearest neighbor prediction algorithm works in very much the same way, except that "nearness" in a database may consist of a variety of factors, not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. The better definition of "near" might in fact be the other people you graduated from college with rather than the people you live next to.

    Nearest neighbor techniques are among the easiest to use and understand because they work in a way similar to the way people think - by detecting closely matching examples. They also perform quite well in terms of automation, as many of the algorithms are robust with respect to dirty and missing data. Lastly, they are particularly adept at performing complex ROI calculations, because the predictions are made at a local level where business simulations can be performed in order to optimize ROI. As they enjoy levels of accuracy similar to other techniques, measures of accuracy such as lift are as good as those from any other.

    How to use Nearest Neighbor for Prediction

    One of the essential elements underlying the concept of clustering is that one particular object (whether cars, food or customers) can be closer to another object than some third object is. It is interesting that most people have an innate sense of ordering placed on a variety of different objects. Most people would agree that an apple is closer to an orange than it is to a tomato, and that a Toyota Corolla is closer to a Honda Civic than to a Porsche. This sense of ordering on many different objects helps us place them in time and space and make sense of the world. It is what allows us to build clusters - both in databases on computers and in our daily lives. This definition of nearness, which seems to be ubiquitous, also allows us to make predictions.

    The nearest neighbor prediction algorithm simply stated is:

    Objects that are near to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbors.
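    The algorithm stated above can be sketched as a minimal k-nearest-neighbor classifier. The training records and labels below are invented for illustration, and Euclidean distance stands in for whatever notion of "nearness" the application actually defines.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort the training records by Euclidean distance to the query,
    # keep the k nearest, and return the majority label among them.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.0, 8.5), "high"), ((8.5, 8.0), "high")]

print(knn_predict(train, (8.2, 8.7)))  # high: its 3 nearest records are all "high"
```

    Replacing the majority vote with an average of the neighbors' values turns the same sketch into a nearest-neighbor regressor, e.g. for the income prediction example above.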