unit-v data warehousing, data mining & olap

Upload: vishvajeet-singh

Post on 08-Aug-2018


  • 8/22/2019 UNIT-V Data Warehousing, Data Mining & OLAP


    UNIT-V: Data Warehousing

    Data Warehousing Components. Building a Data Warehouse. Mapping the Data Warehouse to a Multiprocessor Architecture. DBMS Schemas for Decision Support. Data Extraction, Cleanup & Transformation Tools. Metadata.

    Data Mining: Introduction to data mining

    Kapil Tomar, IT Deptt. AKGEC 1


    What is Data Warehousing

    Data Warehousing is an architectural

    construct of information systems that

    provides users with current and historical

    decision support information that is hard to

    access or present in traditional operational

    data stores.


    Data Warehouse definition

    A formal definition of the data warehouse is offered by W. H. Inmon:

    "A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions."


    Seven data warehouse components

    Data sourcing, cleanup, transformation, and migration tools

    Metadata repository

    Warehouse/database technology

    Data marts

    Data query, reporting, analysis, and mining tools

    Data warehouse administration and management

    Information delivery system


    Typically, the source data for the warehouse comes from the operational applications [an exception might be an operational data store (ODS)]. As the data enters the data warehouse, it is transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering, and condensation of data.

    Because data within the data warehouse contains a large historical component (sometimes covering 5 to 10 years), the data warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time.


    Sourcing, Acquisition, Cleanup, and Transformation Tools

    A significant portion of the data warehouse implementation effort is spent extracting data from operational systems and putting it in a format suitable for informational applications that will run off the data warehouse.

    These tools perform all of the conversions, summarizations, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by the decision support tool.


    The functionality includes

    Removing unwanted data from operational databases

    Converting to common data names and definitions

    Calculating summaries and derived data

    Establishing defaults for missing data

    Accommodating source data definition changes
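    The functions listed above can be illustrated with a minimal Python sketch. It is not taken from any particular tool: the column names, the `COMMON_NAMES` mapping, the default value, and the sample records are all hypothetical and exist only to show the idea.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific column names to common warehouse names.
# Two spellings per field stand in for a source whose definitions changed over time.
COMMON_NAMES = {"cust_no": "customer_id", "CustomerID": "customer_id",
                "amt": "amount", "sale_amount": "amount",
                "rgn": "region", "Region": "region"}
DEFAULTS = {"region": "UNKNOWN"}               # defaults for missing data
WANTED = {"customer_id", "amount", "region"}   # every other field is unwanted


def transform(record):
    """Rename columns, drop unwanted or empty fields, and fill in defaults."""
    renamed = {COMMON_NAMES.get(k, k): v for k, v in record.items()}
    cleaned = {k: v for k, v in renamed.items() if k in WANTED and v is not None}
    for field, default in DEFAULTS.items():
        cleaned.setdefault(field, default)
    return cleaned


def summarize(records):
    """Calculate a derived summary: total amount per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)


# Two hypothetical operational sources with different naming conventions.
source_a = [{"cust_no": 1, "amt": 10.0, "rgn": "EAST", "screen_pos": 7}]
source_b = [{"CustomerID": 2, "sale_amount": 5.5, "Region": None},
            {"CustomerID": 3, "sale_amount": 4.5, "Region": "EAST"}]

rows = [transform(r) for r in source_a + source_b]
summary = summarize(rows)
```

    Here `rows` holds records with common names only, the record with a missing region receives the default, and `summary` is the derived per-region total.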


    The data sourcing, cleanup, extract, transformation, and migration tools have to deal with some significant issues, as follows:

    Database heterogeneity. DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery, etc.

    Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit incompatibility, different attributes for the same entity, and different ways of modeling the same fact.


    Metadata

    Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing, and using the data warehouse.

    Metadata can be classified into

    Technical metadata

    Business metadata


    Technical metadata contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Technical metadata documents include

    Information about data sources

    Transformation descriptions, i.e., the mapping method from operational databases into the warehouse, and algorithms used to convert, enhance, or transform data

    Warehouse object and data structure definitions for data targets

    The rules used to perform data cleanup and data enhancement

    Data mapping operations when capturing data from source systems and applying it to the target warehouse database

    Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.


    Business metadata contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Business metadata documents information about

    Subject areas and information object types, including queries, reports, images, video, and/or audio clips

    Internet home pages

    Other information to support all data warehousing components. For example, the information related to the information delivery system (see Sec. 6.8) should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports, and analyses.

    Data warehouse operational information, e.g., data history (snapshots, versions), ownership, extract audit trail, usage data


    Metadata repository management software can be used to map the

    source data to the target database, generate code for data

    transformations, integrate and transform the data, and control moving

    data to the warehouse.

    One of the important functional components of the metadata repository is

    the information directory.


    From a technical requirements point of view, the information directory

    and the entire metadata repository

    Should be a gateway to the data warehouse environment, and thus

    should be accessible from any platform via transparent and seamless

    connections

    Should support an easy distribution and replication of its content for high

    performance and availability

    Should be searchable by business-oriented key words

    Should act as a launch platform for end-user data access and analysis

    tools


    Should support the sharing of information objects such as queries,

    reports, data collections, and subscriptions between users

    Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven, and conditional delivery (in conjunction with the information delivery system)

    Should support the distribution of the query results to one or more destinations in any of the user-specified formats (in conjunction with the information delivery system)

    Should support and provide interfaces to other applications such as e-mail, spreadsheet, and schedulers


    Access Tools

    The principal purpose of data warehousing is to provide information to business users for strategic decision making. These users interact with the data warehouse using front-end tools.

    For the purpose of this discussion let's divide these tools into five main groups:

    Data query and reporting tools

    Application development tools

    Executive information system (EIS) tools

    On-line analytical processing tools

    Data mining tools


    Data query and reporting tools

    This category can be further divided into two groups:

    1. reporting tools and

    2. managed query tools.

    1. Reporting tools can be divided into

    i. production reporting tools and

    ii. desktop report writers.


    Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks.

    Report writers, on the other hand, are inexpensive desktop tools designed for end users.

    2. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
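    To make the metalayer idea concrete, here is a small Python sketch of how business terms chosen by a user might be translated into SQL. It does not reproduce any real product: the `METALAYER` and `JOINS` tables, the table and column names, and `build_query` are all hypothetical.

```python
# Hypothetical metalayer: business terms mapped to (table, column) pairs.
METALAYER = {
    "Customer Name": ("customers", "cust_nm"),
    "Region": ("customers", "rgn_cd"),
    "Order Total": ("orders", "ord_tot_amt"),
}
# Hypothetical join rule the metalayer knows about, keyed by the tables involved.
JOINS = {frozenset({"customers", "orders"}): "customers.cust_id = orders.cust_id"}


def build_query(selected_terms):
    """Translate business terms picked by the user into a SQL string."""
    columns, tables = [], []
    for term in selected_terms:
        table, column = METALAYER[term]
        columns.append(f'{table}.{column} AS "{term}"')
        if table not in tables:
            tables.append(table)
    sql = f"SELECT {', '.join(columns)} FROM {', '.join(tables)}"
    join = JOINS.get(frozenset(tables))
    if join:
        sql += f" WHERE {join}"
    return sql


# The user clicks two business terms; the SQL is generated for them.
query = build_query(["Customer Name", "Order Total"])
```

    The user never sees the physical names such as `cust_nm`; only the business terms are exposed, which is the shielding effect described above.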


    Application development tools

    Tools used for in-house application development include:

    PowerBuilder from PowerSoft,

    Visual Basic from Microsoft,

    Forte from Forte Software, and

    Business Objects from Business Objects.


    On-line analytical processing tools

    On-line analytical processing (OLAP) tools are based on the concepts of multidimensional databases and allow a sophisticated user to analyze the data using elaborate, multidimensional views.

    Typical business applications for these tools include product performance and profitability, effectiveness of a sales program or a marketing campaign, sales forecasting, and capacity planning.
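    A multidimensional view can be sketched in a few lines of Python. This is only an illustration of the concept, not how an OLAP engine is built; the sample `SALES` facts and the `roll_up` and `slice_by` helpers are hypothetical names.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, units).
SALES = [
    ("widget", "EAST", "Q1", 10), ("widget", "WEST", "Q1", 4),
    ("widget", "EAST", "Q2", 7), ("gadget", "EAST", "Q1", 3),
    ("gadget", "WEST", "Q2", 9),
]
DIMENSIONS = {"product": 0, "region": 1, "quarter": 2}


def roll_up(facts, *dims):
    """Sum units over every dimension that is not listed in dims."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


def slice_by(facts, dim, value):
    """Keep only the facts where one dimension has a fixed value."""
    return [row for row in facts if row[DIMENSIONS[dim]] == value]


by_product = roll_up(SALES, "product")
east_by_quarter = roll_up(slice_by(SALES, "region", "EAST"), "quarter")
```

    `by_product` answers a product performance question, while `east_by_quarter` looks at the same facts from a different angle (one region, broken down by time).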


    Data Mining

    A critical success factor for any business today is its ability to use

    information effectively.

    Knowing this information, an organization can formulate effective

    business, marketing, and sales strategies; precisely target promotional

    activity; discover and penetrate new markets; and successfully compete

    in the marketplace from a position of informed strength.

    A relatively new and promising technology aimed at achieving this

    strategic advantage is known as data mining.

    A major attraction of data mining is its ability to build predictive rather than retrospective models.


    Most organizations engage in data mining to

    Visualize data

    Correct data

    Discover knowledge. The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise's database. Specifically, data mining can be used to perform:

    Segmentation (e.g., grouping customer records for custom-tailored marketing)

    Classification (assignment of input data to a predefined class, discovery and understanding of trends, text document classification)

    Association (discovery of cross-sales opportunities)

    Preferencing (determining the preference of the majority of customers)
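    As an illustration of the association task, the sketch below counts how often pairs of items appear together in purchase baskets. It is a deliberately simple pair-counting approach chosen for this example, not a specific algorithm named in the slides; the `BASKETS` data and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of items per transaction.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def frequent_pairs(baskets, min_support):
    """Keep only the pairs whose support reaches the threshold."""
    return {p: s for p, s in pair_support(baskets).items() if s >= min_support}


cross_sell = frequent_pairs(BASKETS, min_support=0.5)
```

    Pairs that survive the threshold are candidates for cross-selling, for example by placing or promoting the two items together.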


    Data Marts

    The term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users.

    A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than a physically separate store of data.

    In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network serving a dedicated user group.


    It is often a necessary and valid solution to a pressing business problem, thus achieving the goal of rapid delivery of enhanced decision support functionality to end users. The business drivers underlying such developments include

    Extremely urgent user requirements

    The absence of a budget for a full data warehouse strategy

    The absence of a sponsor for an enterprise-wide decision support strategy

    The decentralization of business units

    The attraction of easy-to-use tools and a mind-sized project


    In summary, data marts present two problems:

    (1) scalability in situations where an initial small data mart grows

    quickly in multiple dimensions and

    (2) data integration.

    Therefore, when designing data marts, the organizations should pay

    close attention to system scalability, data consistency, and

    manageability issues.

    The key to a successful data mart strategy is the development of an

    overall scalable data warehouse architecture; and the key step in

    that architecture is identifying and implementing the common

    dimensions.


    Data Warehouse Administration and Management

    Security and priority management

    Monitoring updates from multiple sources

    Data quality checks

    Managing and updating metadata

    Auditing and reporting data warehouse usage and status (for managing

    the response time and resource utilization, and providing chargeback

    information)

    Purging data

    Replicating, subsetting, and distributing data

    Backup and recovery

    Data warehouse storage management [e.g., capacity planning,

    hierarchical storage management (HSM), purging of aged data]


    Information Delivery System

    The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm.


    In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases.


    Delivery of information may be based on time of day, or on a completion

    of an external event.

    The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it, and when they need it the most.


    Building a Data Warehouse


    Nine-Step Method in the Design of a Data Warehouse

    1. Choosing the subject matter

    2. Deciding what a fact table represents

    3. Identifying and conforming the dimensions

    4. Choosing the facts

    5. Storing precalculations in the fact table

    6. Rounding out the dimension tables

    7. Choosing the duration of the database

    8. The need to track slowly changing dimensions

    9. Deciding the query priorities and the query modes


    Benefits of Data Warehousing

    Locating the right information

    Presentation of information (reports, graphs)

    Testing of hypothesis

    Discovery of information

    Sharing the analysis


    Tangible benefits

    Product inventory turnover is improved.

    Costs of product introduction are decreased with improved selection of target markets.

    More cost-effective decision making is enabled by separating (ad hoc) query processing from running against operational databases.

    Better business intelligence is enabled by increased quality and flexibility of market analysis available through multilevel data structures, which may range from detailed to highly summarized. For example, determining the effectiveness of marketing programs allows the elimination of weaker programs and enhancement of stronger ones.

    Enhanced asset and liability management means that a data warehouse can provide a "big picture" of enterprise-wide purchasing and inventory patterns, and can indicate otherwise unseen credit exposure and opportunities for cost savings.


    Intangible benefits

    Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data

    Reduced redundant processing, support, and software to support overlapping decision support applications

    Enhanced customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings

    Enabling business process reengineering: data warehousing can provide useful insights into the work processes themselves, resulting in breakthrough ideas for the reengineering of those processes


    Mapping the Data Warehouse to a Multiprocessor Architecture

    Organizations that have embarked on data warehousing development deal with ever-increasing amounts of data. Generally speaking, the size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. This search pursues two goals:

    Speed-up: the ability to execute the same request on the same amount of data in less time

    Scale-up: the ability to obtain the same performance on the same request as the database size increases


    An additional and important goal is to achieve linear speed-up and scale-up: doubling the number of processors cuts the response time in half (linear speed-up) or provides the same performance on twice as much data (linear scale-up).
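    The two measures can be written down as simple ratios of elapsed times. The Python sketch below is one way to express them; the function names and the 5 percent tolerance are assumptions made for illustration, and the timings in the usage lines are invented.

```python
def speed_up(time_one_processor, time_n_processors):
    """Ratio of elapsed times for the same request on the same amount of data."""
    return time_one_processor / time_n_processors


def is_linear_speed_up(time_one, time_n, n, tolerance=0.05):
    """True when n processors make the request roughly n times faster."""
    return abs(speed_up(time_one, time_n) - n) <= tolerance * n


def scale_up(time_small_system, time_large_system):
    """Ratio of elapsed times when data and processors grow by the same factor.

    A value of 1.0 means the larger system handles proportionally more data
    in the same time, i.e. linear scale-up.
    """
    return time_small_system / time_large_system


# Invented timings in seconds: 1 processor vs. 2 processors on the same data.
halved = is_linear_speed_up(120.0, 60.0, n=2)
not_halved = is_linear_speed_up(120.0, 80.0, n=2)
# Twice the data on twice the processors, finishing in the same time.
same_time = scale_up(120.0, 120.0)
```

    In practice, coordination overhead and skewed data usually keep measured values below the linear ideal.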

    These goals of linear performance and scalability can be satisfied by parallel hardware architectures, parallel operating systems, and parallel database management systems. Parallel hardware architectures are based on multiprocessor systems designed as a shared-memory model [symmetric multiprocessors (SMPs)], shared-disk model, or distributed-memory model [massively parallel processors (MPPs), and clusters of uniprocessors and/or SMPs].


    Types of parallelism

    Horizontal parallelism

    Vertical parallelism


    Horizontal parallelism means that the database is partitioned across multiple disks, and parallel processing occurs within a specific task (e.g., table scan) that is performed concurrently on different processors against different sets of data.

    Vertical parallelism occurs among different tasks: all component query operations (e.g., scan, join, sort) are executed in parallel in a pipelined fashion. In other words, an output from one task (e.g., scan) becomes an input into another task (e.g., join) as soon as records become available.

    A truly parallel DBMS should support both horizontal and vertical types of parallelism concurrently (see Fig. 8.1, case 4).
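    The structure of the two kinds of parallelism can be sketched in Python. This is only an analogy, not a DBMS implementation: the partitions, the `LABELS` lookup, and all function names are hypothetical. The thread pool shows one task applied to several partitions at once, and the generators show pipelined hand-off of rows between tasks (in this sketch the pipeline stages run in a single thread, so only the hand-off pattern is modeled).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table already split into partitions (one list per disk).
PARTITIONS = [
    [("a", 5), ("b", 12)],
    [("c", 20), ("d", 1)],
    [("e", 15)],
]


def scan_partition(rows, minimum):
    """One scan task: filter a single partition."""
    return [row for row in rows if row[1] >= minimum]


def horizontal_scan(partitions, minimum):
    """Horizontal parallelism: the same task runs once per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(lambda rows: scan_partition(rows, minimum), partitions)
        return [row for part in results for row in part]


def scan(rows, minimum):
    """Pipeline stage 1: yield each qualifying row as soon as it is found."""
    for row in rows:
        if row[1] >= minimum:
            yield row


def join(rows, lookup):
    """Pipeline stage 2: consume rows as they arrive and attach a label."""
    for key, value in rows:
        yield key, value, lookup.get(key, "?")


LABELS = {"b": "beta", "c": "gamma", "e": "epsilon"}
parallel_rows = horizontal_scan(PARTITIONS, minimum=10)
all_rows = [row for part in PARTITIONS for row in part]
pipelined = list(join(scan(all_rows, 10), LABELS))
```

    A DBMS that supports both types would, in effect, run a pipeline like `join(scan(...))` separately on each partition.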


    Data partitioning

    Hash partitioning

    Key range partitioning

    Schema partitioning

    User-defined partitioning


    Hash partitioning. A hash algorithm is used to calculate the partition number (hash value) based on the value of the partitioning key for each row.

    Key range partitioning. Rows are placed and located in the partitions according to the value of the partitioning key (all rows with the key value from A to K are in partition 1, L to T are in partition 2, etc.).

    Schema partitioning. This is an option not to partition a table across disks; instead, an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables that are more effectively used when replicated in each partition rather than spread across partitions.

    User-defined partitioning. This is a partitioning method that allows a table to be partitioned on the basis of a user-defined expression (e.g., use state codes to place rows in one of 50 partitions).
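    Three of these methods come down to a function from a row (or its key) to a partition number, which the following Python sketch illustrates. The choice of CRC32 as the hash, the range bounds, and the short `STATE_CODES` list are assumptions for the example only; a real DBMS uses its own hash function and catalog.

```python
import zlib


def hash_partition(key, partitions):
    """Hash partitioning: partition number derived from a hash of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % partitions


def key_range_partition(key, upper_bounds):
    """Key range partitioning: first range whose upper bound covers the key."""
    first_letter = key[0].upper()
    for number, bound in enumerate(upper_bounds, start=1):
        if first_letter <= bound:
            return number
    return len(upper_bounds) + 1


def user_defined_partition(row, expression):
    """User-defined partitioning: the caller supplies the placement expression."""
    return expression(row)


# Hypothetical, shortened list of state codes (a full list would have 50 entries).
STATE_CODES = ["AL", "AK", "AZ", "AR", "CA"]


def by_state(row):
    """Example user-defined expression: one partition per state code."""
    return STATE_CODES.index(row["state"]) + 1


adams = key_range_partition("Adams", ["K", "T"])    # A to K
miller = key_range_partition("Miller", ["K", "T"])  # L to T
young = key_range_partition("Young", ["K", "T"])    # beyond T
california = user_defined_partition({"state": "CA"}, by_state)
customer_42 = hash_partition(42, partitions=8)
```

    Schema partitioning is not shown because it places whole tables rather than rows, so there is no per-row function to compute.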


    Database Architectures for Parallel Processing

    Shared-memory architecture

    Shared-disk architecture

    Shared-nothing architecture


    Shared-memory architecture


    Shared-disk architecture


    Shared-nothing architecture


    Parallel DBMS Features

    Scope and techniques of parallel DBMS operations

    Optimizer implementation

    Application transparency

    The parallel environment

    DBMS management tools

    Price/performance


    DBMS Schemas for Decision Support

    Data warehousing projects were forced to choose between a data model and corresponding database schema that is intuitive for analysis but performs poorly, and a model and schema that performs better but is not well suited for analysis.

    The schema methodology that is gaining widespread acceptance for data warehousing is the star schema.


    Indeed, solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array oriented and multidimensional in nature. These business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes) and summarize them on the fly.


    DBMS Schemas for Decision Support

    Star schema

    Potential performance problems with star schemas


    Star Schema

    The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called the star schema.

    The basic premise of star schemas is that information can be classified into two groups: facts and dimensions.

    Facts are the core data elements being analyzed. For example, units of individual items sold are facts.

    Dimensions are attributes about the facts. For example, dimensions are the product types purchased and the date of purchase (see Fig. 9.1).


    In this example, the star schema analyzes facts (UNITS) through a set of dimensions (MARKETS, PRODUCTS, PERIOD).

    It is important to notice that, in the typical star schema, the fact table is much larger than any of its dimension tables.

    This point becomes an important consideration in the performance issues associated with star schemas.
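    A star schema of this shape can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden miniature: only the table names UNITS, MARKETS, PRODUCTS, and PERIOD come from the slides, while every column name and every data row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive attributes (columns are hypothetical).
CREATE TABLE markets  (market_id  INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE period   (period_id  INTEGER PRIMARY KEY, quarter TEXT);

-- Fact table: one row per market, product, and period combination.
CREATE TABLE units (
    market_id  INTEGER REFERENCES markets(market_id),
    product_id INTEGER REFERENCES products(product_id),
    period_id  INTEGER REFERENCES period(period_id),
    units_sold INTEGER,
    PRIMARY KEY (market_id, product_id, period_id)
);

INSERT INTO markets  VALUES (1, 'EAST'), (2, 'WEST');
INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO period   VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO units VALUES (1, 1, 1, 10), (2, 1, 1, 4), (1, 1, 2, 7),
                         (1, 2, 1, 3), (2, 2, 2, 9);
""")

# Analyze the facts through two of the dimensions: product type and period.
rows = conn.execute("""
    SELECT p.product_type, t.quarter, SUM(u.units_sold)
    FROM units u
    JOIN products p ON p.product_id = u.product_id
    JOIN period   t ON t.period_id  = u.period_id
    GROUP BY p.product_type, t.quarter
    ORDER BY p.product_type, t.quarter
""").fetchall()
```

    Note that the fact table's primary key is made up of all three dimension keys, which is the compound key whose cost is discussed under the performance problems of star schemas.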


    Potential performance problems with star schemas

    Indexing. Using indexes can enforce the uniqueness of the keys, but a multipart key presents several problems:

    It requires multiple metadata definitions (one for each key component) to define a single relationship (table); this adds to the design complexity and causes sluggish performance.

    Since the fact table must carry all key components as part of its primary key, addition or deletion of levels in the hierarchy will require physical modification of the affected table, which is a time-consuming process that limits flexibility.

    Carrying all the segments of the compound dimensional key in the fact table increases the size of the index, thus impacting both performance and scalability.


    Metadata

    Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its users. At a minimum, metadata contains:

    The location and description of warehouse system and data components (warehouse objects).

    Names, definition, structure, and content of the data warehouse and end-user views.

    Identification of authoritative data sources (systems of record).

    Integration and transformation rules used to populate the data warehouse; these include the mapping method from operational databases into the warehouse, and algorithms used to convert, enhance, or transform data.

    Integration and transformation rules used to deliver data to end-user analytical tools.

    Subscription information for the information delivery to the analysis

    subscribers.

    Data warehouse operational information, which includes a history of

    warehouse updates, refreshments, snapshots, versions, ownership

    authorizations, and extract audit trail.

    Metrics used to analyze warehouse usage and performance according to end-user usage patterns.

    Security authorizations, access control lists, etc.


    Metadata Repository

    Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. This software, which typically runs on a workstation, enables users to specify how the data should be transformed, such as data mapping, conversion, and summarization.

    Metadata is searched by users to find data definitions or subject areas. In other words, metadata provides decision support oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application.
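    How users might search metadata for data definitions or subject areas can be sketched with a small in-memory directory. This is not the interface of any real repository product: the `MetadataEntry` fields, the two sample entries, and the `search` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """One hypothetical directory entry describing a warehouse object."""
    name: str
    subject_area: str
    definition: str
    source_system: str
    transformation: str
    keywords: set = field(default_factory=set)


DIRECTORY = [
    MetadataEntry("units_sold", "Sales", "Units of an item sold per period",
                  "order_entry", "summed daily rows to monthly totals",
                  {"sales", "volume"}),
    MetadataEntry("credit_limit", "Finance", "Maximum credit per customer",
                  "billing", "converted to a common currency",
                  {"credit", "exposure"}),
]


def search(directory, term):
    """Find entries by business keyword, subject area, or definition text."""
    term = term.lower()
    return [e for e in directory
            if term in e.keywords
            or term == e.subject_area.lower()
            or term in e.definition.lower()]


hits = search(DIRECTORY, "sales")
```

    Each hit points back to a warehouse object together with its source and transformation, which is the kind of logical link between warehouse data and the decision support application described above.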

    Having such a metadata repository implemented as a part of the data warehouse framework provides the following benefits:

    It provides a comprehensive suite of tools for enterprise-wide metadata management.

    It reduces and eliminates information redundancy, inconsistency, and underutilization.

    It simplifies management and improves organization, control, and accounting of information assets.

    It increases identification, understanding, coordination, and utilization of enterprise-wide information assets.

    It provides effective data administration tools to better manage corporate information assets with a full-function data dictionary.

    It increases flexibility, control, and reliability of the application development process and accelerates internal application development.

    It leverages investment in legacy systems with the ability to inventory and

    utilize existing applications.

    It provides a universal relational model for heterogeneous RDBMSs to

    interact and share information.

    It enforces CASE development standards and eliminates redundancy with

    the ability to share and reuse metadata.


    Metadata Management

    A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed.

    The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata.

    It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations.

    Metadata describes the information in the warehouse from multiple viewpoints (input, sources, transformation, access, etc.), and can answer questions such as:


    What data exists in the data warehouse

    Where to find the data

    What the original sources of the data are

    How summarizations were created

    What transformations were used

    Who is responsible for correcting errors

    What queries can be used to access the data

    How business definitions have changed over time

    What underlying business assumptions have been made


    Thank You