
DATA WAREHOUSE CONCEPTS

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management's decision-making process. Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses that help with decision-making and with improving information resources.

A data warehouse is a structured repository of historic data that supports decision-making processes. It is developed in an evolutionary process by integrating data from non-integrated legacy systems. Data warehousing as a technological method aims to provide companies with technical support for their data management needs, an important aspect of each company's success. Data warehousing is a good investment and asset for the company, especially since it sustains the company's efficiency, productivity, profitability, and competitive performance. An organization collects data from different areas of the company, such as inventory, sales leads, and customer service; these data are then passed through the data management system that supports the company's policy-making.

The data warehouse is that portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information. The data warehouse has specific characteristics that include the following:

Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject. For example, the SRDB is not simply made accessible to end-users; it is structured and organized according to specific needs.

Integrated: A single source of information for understanding multiple areas of interest. The data warehouse provides one-stop shopping and contains information about a variety of subjects. Thus the OIRAP data warehouse has information on students, faculty and staff, instructional workload, and student outcomes.

Non-Volatile: Stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

Time-Variant: Contains a history of the subject as well as current information. Historical information is an important component of a data warehouse.

Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users.

Process-Oriented: It is important to view data warehousing as a process for delivery of information. The maintenance of a data warehouse is ongoing and iterative in nature.

Note: A data warehouse does not require transaction processing, recovery, or concurrency control because it is stored physically separate from the operational database.

OUR GOAL FOR A DATA WAREHOUSE?

Collect data: scrub it, integrate it, and make it accessible.
Provide information for our businesses.
Start managing knowledge so our business partners will gain wisdom.

UNDERSTANDING DATA WAREHOUSE

The data warehouse is a database kept separate from the organization's operational database. It is not updated frequently. A data warehouse holds consolidated historical data, which helps the organization analyse its business. It helps executives organize, understand, and use their data to take strategic decisions. Data warehouse systems help in integrating a diversity of application systems and allow analysis of consolidated historical data.

DATA WAREHOUSE APPLICATIONS

A data warehouse helps business executives organize, analyse, and use their data for decision making. It serves as the sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing

DATA WAREHOUSE TYPES

Information processing, Analytical processing and Data Mining are the three types of data warehouse applications that are discussed below:

Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.

Data Mining - Data Mining supports knowledge discovery by finding the hidden patterns and associations, constructing analytical models, performing classification and prediction. These mining results can be presented using the visualization tools.
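As a hedged illustration of the classification side of data mining, the short Python sketch below trains a decision tree on a small, made-up customer table; the column meanings, data, and campaign-response label are hypothetical, and scikit-learn is assumed to be available.

```python
# Hypothetical illustration of classification in data mining (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: [age, yearly_purchases]; label 1 = likely to respond to a campaign.
X_train = [[25, 2], [40, 15], [35, 9], [50, 20], [23, 1], [45, 12]]
y_train = [0, 1, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict for two new customers.
print(model.predict([[30, 3], [48, 18]]))
```

In a real setting the mining tool would work against warehouse tables rather than in-memory lists, but the pattern of fitting a model and then predicting is the same.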

DATA WAREHOUSE TOOLS AND UTILITIES FUNCTIONS

The following are the functions of Data Warehouse tools and Utilities:

Data Extraction - Data Extraction involves gathering the data from multiple heterogeneous sources.

Data Cleaning - Data Cleaning involves finding and correcting the errors in data.

Data Transformation - Data Transformation involves converting data from legacy format to warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.

Refreshing - Refreshing involves updating the warehouse with changes from the data sources.

Note: Data Cleaning and Data Transformation are important steps in improving the quality of data and data mining results.
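To make these functions concrete, here is a minimal Python/pandas sketch of one extract-clean-transform-load pass; the in-memory source data, column names, and the warehouse.db/daily_sales target are hypothetical, and pandas plus the standard-library sqlite3 module are assumed.

```python
# Minimal ETL sketch (assumes pandas; data, file and column names are hypothetical).
import sqlite3
import pandas as pd

# Extraction: gather data from a source (here an in-memory stand-in for a legacy feed).
sales = pd.DataFrame({"order_id": [1, 2, 2, 3],
                      "amount": ["10.5", "20.0", "20.0", None],
                      "order_date": ["2015-01-03", "2015-01-04", "2015-01-04", "2015-01-05"]})

# Cleaning: find and correct errors (drop duplicates, drop rows with missing amounts).
sales = sales.drop_duplicates().dropna(subset=["amount"])

# Transformation: convert legacy string formats to the warehouse formats.
sales["amount"] = sales["amount"].astype(float)
sales["order_date"] = pd.to_datetime(sales["order_date"])

# Loading: build a summarized table and write it into the warehouse database.
daily = (sales.assign(day=sales["order_date"].dt.strftime("%Y-%m-%d"))
              .groupby("day")["amount"].sum().reset_index())
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```

Refreshing would simply rerun such a pass against the changed source data on a schedule.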

DATA MINING DEFINITION

Data mining is the process of extracting previously unknown but significant information from large databases and using it to make crucial business decisions. Data mining transforms the data into information and tends to be bottom-up.

DATA MINING PROCESS

1. The data extraction process extracts useful subsets of data for mining.
2. Aggregation may be done if summary statistics are useful.
3. Initial searches should be carried out on aggregated data to develop a bird's eye view of the information (extracted information).
4. Focusing on the detailed data then provides a clearer view (assimilated information).
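A small pandas sketch of that bird's-eye-view-then-drill-down idea (steps 2-4) follows; the regions, products, and revenue figures are invented for illustration.

```python
# Hypothetical sketch of steps 2-4: aggregate first, then focus on the detail (assumes pandas).
import pandas as pd

detail = pd.DataFrame({"region": ["East", "East", "West", "West", "West"],
                       "product": ["A", "B", "A", "A", "B"],
                       "revenue": [120, 80, 300, 150, 60]})

# Bird's-eye view: summary statistics per region (extracted information).
overview = detail.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(overview)

# Drill into the detail for the region that stands out (assimilated information).
print(detail[detail["region"] == "West"].sort_values("revenue", ascending=False))
```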

OPERATIONAL VERSUS INFORMATIONAL SYSTEMS

Operational System | Informational System
1. Supports day-to-day decisions | Supports long-term, strategic decisions
2. Transaction driven | Analysis driven
3. Data constantly changes | Data rarely changes
4. Repetitive processing | Heuristic processing
5. Holds current data | Holds historical data
6. Stores detailed data | Stores summarized and detailed data
7. Application oriented | Subject oriented
8. Predictable pattern of usage | Unpredictable pattern of usage
9. Serves clerical, transactional community | Serves managerial community

METADATA

Metadata is data about data. It is used as:
A directory to locate the contents of the data warehouse.
A guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.
A guide to the algorithms used for summarization between the current data and the summarized data.

It also contains information about: the structure of the data, data extraction/transformation history, data usage statistics, data warehouse table sizes, column sizes, attribute hierarchies and dimensions, and performance metrics.

Operational versus Data Warehouse Systems

Feature | Operational | Data Warehouse
Data content | Current values | Archival data, summarized data, calculated data
Data organization | Application by application | Subject areas across enterprise
Nature of data | Dynamic | Static until refreshed
Data structure | Complex; format suitable for operational computation | Simple; suitable for business analysis
Access probability | High | Moderate to low
Data update | Updated on a field-by-field basis | Accessed and manipulated; no direct update
Usage | Highly structured, repetitive processing | Highly unstructured, analytical processing
Response time | Sub-second to 2-3 seconds | Seconds to minutes

DATA WAREHOUSE DESIGN APPROACHES

Data warehouse design is one of the key techniques in building the data warehouse. Choosing the right design can save the project time and cost. Two data warehouse design approaches are popular.

BOTTOM-UP DESIGN:

In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area such as Sales or Finance. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using the data warehouse bus architecture. In the bus architecture, a dimension is shared between facts in two or more data marts; such dimensions are called conformed dimensions. These conformed dimensions are integrated across data marts and then the data warehouse is built.
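A toy Python/pandas sketch of that bus idea follows: two hypothetical data marts (sales and finance) share a conformed date dimension, which is what lets their facts be combined into one warehouse-level view. All table names, keys, and figures are invented.

```python
# Conformed-dimension sketch for a bottom-up bus architecture (assumes pandas; data is hypothetical).
import pandas as pd

# The conformed date dimension, shared by every data mart on the bus.
dim_date = pd.DataFrame({"date_key": [20150101, 20150102],
                         "month": ["2015-01", "2015-01"]})

# Two independently built data marts that reuse the same date_key values.
sales_facts = pd.DataFrame({"date_key": [20150101, 20150102], "sales_amount": [500, 700]})
finance_facts = pd.DataFrame({"date_key": [20150101, 20150102], "expenses": [300, 200]})

# Because the dimension is conformed, the marts integrate cleanly into a warehouse-level view.
combined = (sales_facts.merge(finance_facts, on="date_key")
                        .merge(dim_date, on="date_key")
                        .groupby("month")[["sales_amount", "expenses"]].sum())
print(combined)
```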

ADVANTAGES OF BOTTOM-UP DESIGN ARE:

This model contains consistent data marts, and these data marts can be delivered quickly. As the data marts are created first, reports can be generated quickly. The data warehouse can be extended easily to accommodate new business units: it is just a matter of creating new data marts and then integrating them with the other data marts.

DISADVANTAGES OF BOTTOM-UP DESIGN ARE: The positions of the data warehouse and the data marts are reversed in the bottom-up design approach: the data marts are built first, and the warehouse is derived from them.

TOP-DOWN DESIGN:

In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.

ADVANTAGES OF TOP-DOWN DESIGN ARE: Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse. This approach is robust against business changes. Creating a new data mart from the data warehouse is very easy.

DISADVANTAGES OF TOP-DOWN DESIGN ARE: This methodology is inflexible to changing departmental needs during the implementation phase. It represents a very large project, and the cost of implementing it is significant.

DATA WAREHOUSE ARCHITECTURE

Three-Tier Data Warehouse Architecture
Generally, data warehouses adopt a three-tier architecture. The three tiers are:

Bottom Tier - The bottom tier is the data warehouse database server, a relational database system. Back-end tools and utilities feed data into this tier; they perform the extract, clean, load, and refresh functions.

Middle Tier - The middle tier holds the OLAP server, which can be implemented in either of the following ways: as Relational OLAP (ROLAP), an extended relational database management system that maps operations on multidimensional data to standard relational operations; or as Multidimensional OLAP (MOLAP), which directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools, and data mining tools.

The following diagram explains the three-tier architecture of the data warehouse:

DATA WAREHOUSE MODELS

From the perspective of data warehouse architecture we have the following data warehouse models: virtual warehouse, data mart, and enterprise warehouse.

VIRTUAL WAREHOUSE
A view over an operational data warehouse is known as a virtual warehouse. It is easy to build, but building it requires excess capacity on the operational database servers.

DATA MART
A data mart contains a subset of organisation-wide data; this subset is valuable to a specific group within the organisation. Note: in other words, a data mart contains only data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts: Windows-based or Unix/Linux-based servers are used to implement data marts, and they are implemented on low-cost servers. The implementation cycle of a data mart is measured in a short period of time, i.e. in weeks rather than months or years. The life cycle of a data mart may become complex in the long run if its planning and design are not organisation-wide. Data marts are small in size, are customized by department, and are flexible. The source of a data mart is a departmentally structured data warehouse.

ENTERPRISE WAREHOUSE
The enterprise warehouse collects all the information on all the subjects spanning the entire organization, providing enterprise-wide data integration. The data is integrated from operational systems and external information providers. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

LOAD MANAGER
This component performs the operations required for the extract and load process. The size and complexity of the load manager vary between specific solutions, from data warehouse to data warehouse.

LOAD MANAGER ARCHITECTURE
The load manager performs the following functions: extract the data from the source system; fast-load the extracted data into a temporary data store; and perform simple transformations into a structure similar to the one in the data warehouse.

EXTRACT DATA FROM SOURCE
The data is extracted from the operational databases or the external information providers. A gateway is an application program used to extract data; it is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server.

FAST LOAD
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology proves unsuitable here, since gateways tend not to perform well when large data volumes are involved.

SIMPLE TRANSFORMATIONS
While loading, it may be necessary to perform simple transformations; only after this is complete are we in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks: strip out all the columns that are not required within the warehouse, and convert all values to the required data types.
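As a hedged sketch of those two EPOS-style checks (stripping unneeded columns and converting values to the required data types), here is a short pandas example; the column names and sample rows are hypothetical.

```python
# Simple load-time transformations, sketched with pandas (column names are hypothetical).
import pandas as pd

epos = pd.DataFrame({"till_id": ["T1", "T2"],
                     "sale_amount": ["12.99", "3.50"],
                     "sale_time": ["2015-03-01 10:15", "2015-03-01 10:17"],
                     "cashier_note": ["void key stuck", ""]})   # not needed in the warehouse

# Strip out columns that are not required within the warehouse.
epos = epos.drop(columns=["cashier_note"])

# Convert values to the required data types before the complex checks run.
epos["sale_amount"] = epos["sale_amount"].astype(float)
epos["sale_time"] = pd.to_datetime(epos["sale_time"])
print(epos.dtypes)
```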

WAREHOUSE MANAGER
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of the warehouse manager vary between specific solutions.

WAREHOUSE MANAGER ARCHITECTURE
The warehouse manager includes the following: the controlling process; stored procedures or C with SQL; a backup/recovery tool; and SQL scripts.

OPERATIONS PERFORMED BY THE WAREHOUSE MANAGER
The warehouse manager analyses the data to perform consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; generates new aggregations and updates the existing aggregations; generates normalizations; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.
Note: The warehouse manager also analyses query profiles to determine whether indexes and aggregations are appropriate.

QUERY MANAGER
The query manager is responsible for directing queries to the suitable tables; by directing queries to the appropriate tables, the query request and response process is sped up. The query manager is also responsible for scheduling the execution of the queries posed by users.

QUERY MANAGER ARCHITECTURE
The query manager includes the following: query redirection via a C tool or the RDBMS; stored procedures; a query management tool; and query scheduling via a C tool or the RDBMS.

DETAILED INFORMATION
The following diagram shows the detailed information.

The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

In general, all data warehouse systems have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture:

Each component is discussed individually below.

Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, an Excel file, etc. can all act as a data source. Many different types of data can be a data source: operational data such as sales data, HR data, product data, inventory data, marketing data, and systems data; web server logs with user browsing data; internal market research data; and third-party data such as census data, demographics data, or survey data. All these data sources together form the Data Source Layer.

Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but unlikely any major data transformation.

Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes subsequent data processing and integration easier.

ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this layer.

Data Storage Layer
This is where the transformed and cleansed data sits. Based on scope and functionality, three types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer
This refers to the information that reaches the users. This can be a tabular or graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.

Metadata Layer
This is where information about the data stored in the data warehouse system is kept. A logical data model is an example of something that sits in the metadata layer. A metadata tool is often used to manage metadata.

System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

OTHER DEFINITIONS

Data Warehouse: A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise views of data.

Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.

Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.

Operational Data Store: A collection of data that addresses operational needs of various operational units. It is not a component of a data warehousing architecture, but a solution to operational needs.

OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.

Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or dimensions to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as drilling-down, drilling-across and slicing and dicing

Star Schema: A means of aggregating data based on a set of known dimensions. It stores data multidimensionally in a two-dimensional Relational Database Management System (RDBMS), such as Oracle.

Snowflake Schema: An extension of the star schema obtained by applying additional dimensions to the dimensions of a star schema in a relational environment (a small sketch of both schemas appears after these definitions).

Multidimensional Database: Also known as MDDB or MDDBS. A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems.

OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. Can incorporate data acquisition, data access, data manipulation, or any combination thereof.
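To make the star-versus-snowflake distinction concrete, the sketch below creates a tiny star schema in an in-memory SQLite database and then normalizes one dimension the way a snowflake schema would; the table and column names are hypothetical, and only the standard-library sqlite3 module is assumed.

```python
# Star schema vs. snowflake schema, sketched with the standard-library sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one fact table surrounded by denormalized dimension tables.
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT,
                          category_name TEXT);          -- denormalized into the dimension
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date,
                          product_key INTEGER REFERENCES dim_product,
                          sales_amount REAL);
""")

-- comment syntax note: the block below stays in SQL, so comments use --
cur.executescript("""
-- Snowflake schema: the category attribute is split into its own table,
-- and the product dimension references it instead of repeating the category name.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snowflaked (product_key INTEGER PRIMARY KEY,
                                     product_name TEXT,
                                     category_key INTEGER REFERENCES dim_category);
""")
conn.close()
```

The fact table stays the same in both variants; only the dimensions become more or less normalized.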

METADATA REPOSITORY

The metadata repository is an integral part of a data warehouse system. It contains the following metadata:

Business Metadata - This metadata includes data ownership information, business definitions, and changing policies.

Operational Metadata -This metadata includes currency of data and data lineage. Currency of data means whether data is active, archived or purged. Lineage of data means history of data migrated and transformation applied on it.

Data for mapping from operational environment to data warehouse -This metadata includes source databases and their contents, data extraction, data partition, cleaning, transformation rules, data refresh and purging rules.

The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing etc.

DATA MART

A data mart is a subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements involved within a particular business function or department. Data marts exist within a single organizational data warehouse repository.

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. Data marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.

Metadata is simply defined as data about data: the data used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In terms of the data warehouse, metadata is a road map to the data warehouse: it defines the warehouse objects and acts as a directory that helps the decision support system locate the contents of the data warehouse.
Note: In a data warehouse we create metadata for the data names and definitions of the warehouse. Along with this, additional metadata is created for time-stamping any extracted data and recording the source of the extracted data.

Categories of Metadata

The metadata can be broadly categorized into three categories:
Business Metadata - This metadata includes data ownership information, business definitions, and changing policies.
Technical Metadata - This metadata includes database system names, table and column names and sizes, data types, and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data as it was migrated and the transformations applied to it.

ROLE OF METADATA
Metadata has a very important role in a data warehouse. The role of metadata is different from that of the warehouse data, yet it is essential. The various roles of metadata are explained below. Metadata acts as a directory that helps the decision support system locate the contents of the data warehouse. Metadata helps the decision support system map data as it is transformed from the operational environment to the data warehouse environment. Metadata helps in summarization between the current detailed data and the highly summarized data, and between the lightly detailed data and the highly summarized data. Metadata is also used by query tools, reporting tools, extraction and cleansing tools, and transformation tools, and it plays an important role in loading functions.
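As a toy illustration of metadata acting as a directory and as a mapping guide, the sketch below models a few catalogue entries as plain Python dictionaries; the table names, source systems, and rules are all hypothetical.

```python
# A toy metadata catalogue: data about the warehouse data (all entries are hypothetical).
metadata_catalogue = {
    "fact_sales": {
        "subject_area": "Sales",
        "source_system": "EPOS feed",
        "mapping": "sale_amount = CAST(txn_value AS REAL)",   # operational -> warehouse rule
        "refresh": "nightly",
        "summarisation": "aggregated to daily totals in agg_daily_sales",
    },
    "dim_customer": {
        "subject_area": "Customer",
        "source_system": "CRM export",
        "refresh": "weekly",
    },
}

def locate(subject):
    """Directory role: find warehouse objects for a given subject area."""
    return [name for name, info in metadata_catalogue.items()
            if info.get("subject_area") == subject]

print(locate("Sales"))   # -> ['fact_sales']
```

Real metadata repositories are of course richer and tool-managed, but the directory and mapping roles look much like this lookup.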

DIAGRAM TO UNDERSTAND ROLE OF METADATA.

WHY CREATE A DATA MART

The following are the reasons to create data mart:

To partition data in order to impose access control strategies.
To speed up queries by reducing the volume of data to be scanned.
To segment data onto different hardware platforms.
To structure data in a form suitable for a user access tool.
Note: Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

Steps to determine whether a data mart fits the bill
The following steps need to be followed to make data marting cost-effective: identify the functional splits, identify user access tool requirements, and identify access control issues.

POINTS TO REMEMBER ABOUT DATA MARTS:

Windows-based or Unix/Linux-based servers are used to implement data marts; they are implemented on low-cost servers. The implementation cycle of a data mart is measured in a short period of time, i.e. in weeks rather than months or years. The life cycle of a data mart may become complex in the long run if its planning and design are not organisation-wide. Data marts are small in size, are customized by department, and are flexible. The source of a data mart is a departmentally structured data warehouse.

DATA WAREHOUSE V/S DATA MART

DATA WAREHOUSE:
Holds multiple subject areas.
Holds very detailed information.
Works to integrate all data sources.
Does not necessarily use a dimensional model, but feeds dimensional models.

DATA MART:
Often holds only one subject area, for example Finance or Sales.
May hold more summarized data (although many hold full detail).
Concentrates on integrating information from a given subject area or set of source systems.
Is built focused on a dimensional model using a star schema.

REASONS FOR CREATING A DATA MART
Easy access to frequently needed data.
Creates a collective view for a group of users.
Improves end-user response time.
Ease of creation.
Lower cost than implementing a full data warehouse.
Potential users are more clearly defined than in a full data warehouse.
Contains only business-essential data and is less cluttered.

DECISION SUPPORT SYSTEM (DSS)

Decision support systems are interactive, software-based systems intended to help managers in decision making by accessing large volumes of information generated from the various related information systems involved in organizational business processes, such as the office automation system and the transaction processing system. A DSS uses summary information, exceptions, patterns, and trends derived from analytical models. A decision support system helps in decision making but does not always give a decision itself. Decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions.

Programmed and Non-programmed Decisions
There are two types of decisions: programmed and non-programmed decisions.
Programmed decisions are basically automated processes and general routine work, where the decisions have been taken several times and follow some guidelines or rules. For example, selecting a reorder level for inventories is a programmed decision.
Non-programmed decisions occur in unusual and unaddressed situations, so each is a new decision, there are no rules to follow, and they are made based on the available information and on the manager's discretion, instinct, perception, and judgment. For example, investing in a new technology is a non-programmed decision.
Decision support systems generally involve non-programmed decisions. Therefore, there is no exact report, content, or format for these systems; reports are generated on the fly.

ATTRIBUTES OF A DSS
Adaptability and flexibility.
High level of interactivity.
Ease of use.
Efficiency and effectiveness.
Complete control by decision-makers.
Ease of development.
Extendibility.
Support for modelling and analysis.
Support for data access.
Standalone, integrated, and Web-based.

CHARACTERISTICS OF A DSS
Support for decision makers in semi-structured and unstructured problems.
Support for managers at various managerial levels, ranging from top executives to line managers.
Support for individuals and groups; less structured problems often require the involvement of several individuals from different departments and organization levels.
Support for interdependent or sequential decisions.
Support for intelligence, design, choice, and implementation.
Support for a variety of decision processes and styles.
DSSs are adaptive over time.

BENEFITS OF DSS
Improves the efficiency and speed of decision-making activities.
Increases the control, competitiveness, and capability for futuristic decision making of the organization.
Facilitates interpersonal communication.
Encourages learning or training.
Since it is mostly used in non-programmed decisions, it reveals new approaches and sets up new evidence for an unusual decision.
Helps automate managerial processes.

COMPONENTS OF A DSS
Following are the components of a decision support system:
Database Management System (DBMS): To solve a problem, the necessary data may come from an internal or external database. In an organization, internal data are generated by systems such as the TPS and MIS. External data come from a variety of sources such as newspapers, online data services, and databases (financial, marketing, human resources).
Model Management System: It stores and accesses the models that managers use to make decisions. Such models are used for designing a manufacturing facility, analyzing the financial health of an organization, forecasting demand for a product or service, etc.
Support Tools: Support tools such as online help, pull-down menus, user interfaces, graphical analysis, and error correction mechanisms facilitate the user's interaction with the system.

Classification of DSS
There are several ways to classify DSSs. Holsapple and Whinston classify DSSs as follows:
Text-Oriented DSS: It contains textually represented information that could have a bearing on decisions. It allows documents to be electronically created, revised, and viewed as needed.
Database-Oriented DSS: A database plays a major role here; it contains organized and highly structured data.
Spreadsheet-Oriented DSS: It contains information in spreadsheets that allows the user to create, view, and modify procedural knowledge and to instruct the system to execute self-contained instructions. The most popular tools are Excel and Lotus 1-2-3.
Solver-Oriented DSS: It is based on a solver, which is an algorithm or procedure written for performing certain calculations and a particular program type.
Rules-Oriented DSS: It follows certain procedures adopted as rules; an expert system is an example.
Compound DSS: It is built by using two or more of the five structures explained above.

TYPES OF DSS
Following are some typical DSSs:
Status Inquiry System: helps in taking operational-management-level or middle-level management decisions, for example daily schedules of jobs to machines or machines to operators.
Data Analysis System: needs comparative analysis and makes use of a formula or an algorithm, for example cash flow analysis, inventory analysis, etc.
Information Analysis System: in this system data is analyzed and an information report is generated, for example sales analysis, accounts receivable systems, market analysis, etc.
Accounting System: keeps track of accounting and finance related information, for example final accounts, accounts receivable, accounts payable, etc. that track the major aspects of the business.
Model Based System: simulation models or optimization models used for decision making; used infrequently, it creates general guidelines for operation or management.

EXECUTIVE SUPPORT SYSTEM (ESS)
Executive support systems are intended to be used by senior managers directly, to provide support for non-programmed decisions in strategic management. This information is often external, unstructured, and even uncertain; the exact scope and context of such information is often not known beforehand.

This information is intelligence based:
Market intelligence
Investment intelligence
Technology intelligence

Examples of Intelligent Information

Following are some examples of intelligent information, which is often the source of an ESS:
External databases
Technology reports such as patent records
Technical reports from consultants
Market reports
Confidential information about competitors
Speculative information such as market conditions
Government policies
Financial reports and information

ADVANTAGES OF ESS:
Easy for upper-level executives to use.
Ability to analyze trends.
Augmentation of managers' leadership capabilities.
Enhanced personal thinking and decision making.
Contribution to strategic control flexibility.
Enhanced organizational competitiveness in the marketplace.
Increased executive time horizons.
Better reporting system.
Improved mental model of the business executive.
Helps improve consensus building and communication.
Improved office automation.
Reduced time for finding information.
Better understanding.
Time management.
Increased communication capacity and quality.

DISADVANTAGES OF ESS
Functions are limited.
Hard to quantify benefits.
Executives may encounter information overload.
The system may become slow.
Difficult to keep data current.
May lead to less reliable and insecure data.
Excessive cost for a small company.

KNOWLEDGE MANAGEMENT SYSTEM (KMS)

All the systems we are discussing here come under the knowledge management category. A knowledge management system is not radically different from these information systems; it extends the already existing systems by assimilating more information. As we have seen, data is raw facts, information is processed and/or interpreted data, and knowledge is personalized information.

What is knowledge?
Personalized information.
A state of knowing and understanding.
An object to be stored and manipulated.
A process of applying expertise.
A condition of access to information.
The potential to influence action.

Sources of Knowledge of an Organization
Intranet
Data warehouses and knowledge repositories
Decision support tools
Groupware for supporting collaboration
Networks of knowledge workers
Internal expertise

DEFINITION OF KMS

Knowledge management comprises a range of practices used in an organization to identify, create, represent, and distribute insights and experience and to enable their adoption. Such insights and experience comprise knowledge, either embodied in individuals or embedded in organizational processes and practices.

PURPOSE OF A KMS
Improved performance
Competitive advantage
Innovation
Sharing of knowledge
Integration
Continuous improvement by: driving strategy, starting new lines of business, solving problems faster, developing professional skills, and recruiting and retaining talent.

ACTIVITIES IN KNOWLEDGE MANAGEMENT
Start with the business problem and the business value to be delivered first.
Identify what kind of strategy to pursue to deliver this value and address the KM problem.
Think about the system required from a people and process point of view.
Then think about what kind of technical infrastructure is required to support the people and processes.
Implement the system and processes with appropriate change management and iterative staged releases.

LEVEL OF KNOWLEDGE MANAGEMENT

DATA WAREHOUSING - SYSTEM PROCESSES

We have a fixed number of operations to be applied to operational databases, and we have well-defined techniques such as "use normalized data" and "keep tables small". These techniques are suitable for delivering a solution. But in the case of a decision support system we do not know what queries and operations will need to be executed in the future; therefore the techniques applied to operational databases are not suitable for data warehouses. In this chapter we'll focus on designing a data warehousing solution built on top of open-system technologies like UNIX and relational databases.

PROCESS FLOW IN DATA WAREHOUSE
There are four major processes that build a data warehouse:
Extract and load the data.
Clean and transform the data.
Back up and archive the data.
Manage queries and direct them to the appropriate data sources.

Extract and Load Process
Data extraction takes data from the source systems; data load takes the extracted data and loads it into the data warehouse.
Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.
Points to remember for the extract and load process: controlling the process, when to initiate the extract, and loading the data.

CONTROLLING THE PROCESS
Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

WHEN TO INITIATE EXTRACT
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user. For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.

LOADING THE DATA
After extraction, the data is loaded into a temporary data store, where it is cleaned up and made consistent.
Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process
Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transformation. The steps involved are: clean and transform the loaded data into a structure, and partition the data.

CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE
This will speed up the queries. It can be done in the following ways: make sure the data is consistent within itself, consistent with other data within the same data source, consistent with data in other source systems, and consistent with data already in the warehouse. Transforming involves converting the source data into a structure; structuring the data increases query performance and decreases operational cost. The information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.

PARTITION THE DATA
Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

AGGREGATION
Aggregation is required to speed up common queries. It relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.
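The sketch below illustrates both of the last two steps in pandas: partitioning a fact table by month and building a pre-computed aggregate that common queries can hit instead of the detail. The table, columns, and figures are invented, and pandas stands in for the warehouse engine purely for illustration.

```python
# Partitioning and pre-aggregation of a fact table, sketched with pandas (hypothetical data).
import pandas as pd

fact = pd.DataFrame({"sale_date": pd.to_datetime(["2015-01-10", "2015-01-20", "2015-02-05"]),
                     "store": ["S1", "S2", "S1"],
                     "amount": [100.0, 250.0, 80.0]})
fact["month"] = fact["sale_date"].dt.to_period("M")

# Partition the fact table: one separate partition per month.
partitions = {str(month): part for month, part in fact.groupby("month")}

# Aggregation: pre-compute the summary that the most common queries will ask for.
agg_monthly_store_sales = fact.groupby(["month", "store"])["amount"].sum().reset_index()
print(agg_monthly_store_sales)
```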

BACKUP AND ARCHIVE THE DATA
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to back up the data on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required. For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year; in that case some data must be restored from the archive.

QUERY MANAGEMENT PROCESS
This process performs the following functions: it manages the queries, speeds up query execution, and directs the queries to the most effective data sources. It should also ensure that all system sources are used in the most effective way. The process is also required to monitor actual query profiles; this information is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.

DATA WAREHOUSING - OLAP

INTRODUCTION
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, interactive access to it. In this chapter we will discuss the types of OLAP, the operations on OLAP, and the differences between OLAP and statistical databases and OLTP.

Feature | OLTP | OLAP
Purpose | Run day-to-day operations | Information retrieval and analysis
Structure | RDBMS | RDBMS
Data Model | Normalized | Multidimensional
Access | SQL | SQL plus data analysis extensions
Type of Data | Data that runs the business | Data to analyse the business
Condition of data | Changing, incomplete | Historical, descriptive

TYPES OF OLAP SERVERS
We have four types of OLAP servers, listed below:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers

RELATIONAL OLAP (ROLAP)
Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, Relational OLAP uses a relational or extended-relational DBMS. ROLAP includes the following: implementation of aggregation navigation logic, optimization for each DBMS back end, and additional tools and services.

MULTIDIMENSIONAL OLAP (MOLAP)
Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, storage utilization may be low if the data set is sparse; therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

HYBRID OLAP (HOLAP)
The hybrid OLAP technique is a combination of ROLAP and MOLAP. It has both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows large volumes of detail data to be stored, while the aggregations are stored separately in the MOLAP store.

SPECIALIZED SQL SERVERS
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP OPERATIONS
Since the OLAP server is based on a multidimensional view of data, we will discuss the OLAP operations on multidimensional data.

Here is the list of OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)

ROLL-UP
This operation performs aggregation on a data cube in either of the following ways: by climbing up a concept hierarchy for a dimension, or by dimension reduction. Consider the following diagram showing the roll-up operation.

The roll-up operation is performed by climbing up a concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country; the data is grouped into countries rather than cities. When the roll-up operation is performed, one or more dimensions are removed from the data cube.

DRILL-DOWN
The drill-down operation is the reverse of roll-up. It is performed in either of the following ways: by stepping down a concept hierarchy for a dimension, or by introducing a new dimension. Consider the following diagram showing the drill-down operation:

The drill-down operation is performed by stepping down a concept hierarchy for the dimension time. Initially the concept hierarchy was "day < month < quarter < year". On drilling down, the time dimension is descended from the level of quarter to the level of month. When the drill-down operation is performed, one or more dimensions are added to the data cube, and it navigates from less detailed data to highly detailed data.

SLICE
The slice operation performs a selection on one dimension of a given cube and gives us a new sub-cube. Consider the following diagram showing the slice operation.

The slice operation is performed for the dimension time using the criterion time = "Q1"; it forms a new sub-cube by selecting on that single dimension.

DICE
The dice operation performs a selection on two or more dimensions of a given cube and gives us a new sub-cube. Consider the following diagram showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "Mobile" or "Modem").

PIVOT
The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.

Here the item and location axes of the 2-D slice are rotated.
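A compact pandas sketch of the five operations on a toy sales cube follows; the dimension values and figures are invented, and pandas stands in for an OLAP server purely for illustration.

```python
# Roll-up, drill-down, slice, dice and pivot on a toy cube (assumes pandas; data is hypothetical).
import pandas as pd

cube = pd.DataFrame({"country": ["Canada", "Canada", "Canada", "Canada"],
                     "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
                     "quarter": ["Q1", "Q2", "Q1", "Q2"],
                     "item":    ["Mobile", "Modem", "Mobile", "Modem"],
                     "sales":   [605, 825, 400, 512]})

# Roll-up: climb the location hierarchy from city to country.
rollup = cube.groupby(["country", "quarter"])["sales"].sum()

# Drill-down is the reverse: descend to the finer city level again.
drilldown = cube.groupby(["country", "city", "quarter"])["sales"].sum()

# Slice: select on a single dimension (time = "Q1").
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["quarter"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the axes to get an alternative presentation of the same data.
pivot = cube.pivot_table(values="sales", index="item", columns="city", aggfunc="sum")
print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```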

DATA WAREHOUSING - FUTURE ASPECTS

Following are the future aspects of data warehousing.
As we have seen, the size of open databases has roughly doubled in magnitude over the last few years, and this change is of great significance. As the size of databases grows, the estimates of what constitutes a very large database continue to grow.
The hardware and software available today do not allow a large amount of data to be kept online. For example, a telco's call records require 10 TB of data to be kept online, and that is just the size of one month's records; if records of sales, marketing, customers, employees, etc. must also be kept, the size will exceed 100 TB.
Records contain not only textual information but also multimedia data. Multimedia data cannot be manipulated as easily as text data, and searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.
Apart from size, planning, building, and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system. With the growth of the internet, users increasingly require online access to data.

QUESTION AND ANSWER OF DATA WAREHOUSE

Q: Define data warehouse? A: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management's decision-making process.
Q: What does a subject-oriented data warehouse signify? A: Subject-oriented signifies that the data warehouse stores information around a particular subject such as product, customer, or sales.
Q: List any five applications of a data warehouse. A: Some applications include financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.
Q: What do OLAP and OLTP stand for? A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.
Q: What is the basic difference between a data warehouse and operational databases? A: A data warehouse contains the historical information that is made available for analysis of the business, whereas an operational database contains the current information that is required to run the business.
Q: List the schemas that a data warehouse system can implement. A: A data warehouse can implement a star schema, a snowflake schema, or a fact constellation schema.
Q: What is data warehousing? A: Data warehousing is the process of constructing and using a data warehouse.
Q: List the processes involved in data warehousing. A: Data warehousing involves data cleaning, data integration, and data consolidation.
Q: List the functions of data warehouse tools and utilities. A: The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading, and refreshing.
Q: What do you mean by data extraction? A: Data extraction means gathering data from multiple heterogeneous sources.
Q: Define metadata. A: Metadata is simply defined as data about data; in other words, metadata is the summarized data that leads us to the detailed data.
Q: What does the metadata repository contain? A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.
Q: How does a data cube help? A: A data cube helps us represent data in multiple dimensions; it is defined by dimensions and facts.
Q: Define dimension. A: Dimensions are the entities with respect to which an enterprise keeps its records.
Q: Explain data mart. A: A data mart contains a subset of organisation-wide data that is valuable to a specific group within the organisation; in other words, a data mart contains only data that is specific to a particular group.
Q: What is a virtual warehouse? A: The view over an operational data warehouse is known as a virtual warehouse.
Q: List the phases involved in the data warehouse delivery process. A: The stages are IT strategy, education, business case analysis, technical blueprint, build the version, history load, ad hoc query, requirement evolution, automation, and extending scope.
Q: Explain the load manager. A: This component performs the operations required for the extract and load process; its size and complexity vary between specific solutions, from data warehouse to data warehouse.
Q: Define the functions of the load manager. A: Extract the data from the source system, fast-load the extracted data into a temporary data store, and perform simple transformations into a structure similar to the one in the data warehouse.
Q: Explain the warehouse manager. A: The warehouse manager is responsible for the warehouse management process; it consists of third-party system software, C programs, and shell scripts, and its size and complexity vary between specific solutions.
Q: Define the functions of the warehouse manager. A: The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.
Q: What is summary information? A: Summary information is the area in the data warehouse where the predefined aggregations are kept.
Q: What is the query manager responsible for? A: The query manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP servers. A: There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP, and Specialized SQL Servers.
Q: Which is faster, Multidimensional OLAP or Relational OLAP? A: Multidimensional OLAP is faster than Relational OLAP.
Q: List the functions performed by OLAP. A: OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.
Q: How many dimensions are selected in the slice operation? A: Only one dimension is selected for the slice operation.
Q: How many dimensions are selected in the dice operation? A: For the dice operation, two or more dimensions of a given cube are selected.
Q: How many fact tables are there in a star schema? A: There is only one fact table in a star schema.
Q: What is normalization? A: Normalization splits the data up into additional tables.
Q: Of the star schema and the snowflake schema, in which are the dimension tables normalized? A: The snowflake schema uses the concept of normalization.
Q: What is the benefit of normalization? A: Normalization helps reduce data redundancy.
Q: Which language is used for defining schema definitions? A: Data Mining Query Language (DMQL) is used for schema definition.
Q: What language is DMQL based on? A: DMQL is based on Structured Query Language (SQL).
Q: What are the reasons for partitioning? A: Partitioning is done for various reasons, such as easy management, to assist backup and recovery, and to enhance performance.
Q: What kinds of costs are involved in data marting? A: Data marting involves hardware and software costs, network access costs, and time costs.

FACTOR ANALYSIS

WHY USE FACTOR ANALYSIS?
Factor analysis is a useful tool for investigating variable relationships for complex concepts such as socioeconomic status, dietary patterns, or psychological scales. It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors.

WHAT IS A FACTOR?
The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with an underlying latent variable that is not directly measured. For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there are as many factors as there are variables. Each factor captures a certain amount of the overall variance in the observed variables, and the factors are always listed in order of how much variation they explain. The eigenvalue is a measure of how much of the variance of the observed variables a factor explains; any factor with an eigenvalue greater than 1 explains more variance than a single observed variable. So if the factor for socioeconomic status had an eigenvalue of 2.3, it would explain as much variance as 2.3 of the three variables. This factor, which captures most of the variance in those three variables, could then be used in other analyses. The factors that explain the least amount of variance are generally discarded. Deciding how many factors are useful to retain will be the subject of another post.

WHAT ARE FACTOR LOADINGS?
The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Here is an example of the output of a simple factor analysis looking at indicators of wealth, with just six variables and two resulting factors.

Variables | Factor 1 | Factor 2
Income | 0.65 | 0.11
Education | 0.59 | 0.25
Occupation | 0.48 | 0.19
House value | 0.38 | 0.60
Number of public parks in neighbourhood | 0.13 | 0.57
Number of violent crimes per year in neighbourhood | 0.23 | 0.55

The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65. Since factor loadings can be interpreted like standardized regression coefficients, one could also say that the variable income has a correlation of 0.65 with Factor 1. This would be considered a strong association for a factor analysis in most research fields. Two other variables, education and occupation, are also associated with Factor 1. Based on the variables loading highly onto Factor 1, we could call it "individual socioeconomic status". House value, number of public parks, and number of violent crimes per year, however, have high factor loadings on the other factor, Factor 2. They seem to indicate the overall wealth within the neighbourhood, so we may want to call Factor 2 "neighbourhood socioeconomic status". Notice that the variable house value is also marginally important in Factor 1 (loading = 0.38). This makes sense, since the value of a person's house should be associated with his or her income.
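For readers who want to try this, the hedged Python sketch below runs a two-factor analysis on synthetic data with the six variable names from the table above; scikit-learn and NumPy are assumed, the data and mixing weights are invented, and real loadings would of course differ from the table.

```python
# Two-factor factor analysis on synthetic data (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.decomposition import FactorAnalysis

variables = ["income", "education", "occupation",
             "house_value", "public_parks", "violent_crimes"]

rng = np.random.default_rng(0)
individual = rng.normal(size=200)       # latent "individual SES"
neighbourhood = rng.normal(size=200)    # latent "neighbourhood SES"
noise = rng.normal(scale=0.5, size=(200, 6))

# Observed variables are noisy mixtures of the two latent factors.
X = np.column_stack([0.7 * individual, 0.6 * individual, 0.5 * individual,
                     0.4 * individual + 0.6 * neighbourhood,
                     0.6 * neighbourhood, 0.5 * neighbourhood]) + noise

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# fa.components_ holds the loadings: one row per factor, one column per variable.
for name, loading_f1, loading_f2 in zip(variables, fa.components_[0], fa.components_[1]):
    print(f"{name:15s} {loading_f1:6.2f} {loading_f2:6.2f}")
```

Interpreting the printed loadings follows the same logic as above: variables loading highly on the same factor are taken to reflect the same latent concept.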

FEATURES OF FACTOR ANALYSIS

Factor analysis is a data reduction tool. It removes redundancy or duplication from a set of correlated variables and represents the correlated variables with a smaller set of derived variables. Factors are formed that are relatively independent of one another. Two types of variables are involved:

LATENT VARIABLES: FACTORS

OBSERVED VARIABLES

Some Applications of Factor Analysis

1. Identification of underlying factors: clusters variables into homogeneous sets, creates new variables (i.e. factors), and allows us to gain insight into categories.

2. Screening of variables: identifies groupings that allow us to select one variable to represent many; useful in regression (recall collinearity).

3. Summary: Allows us to describe many variables using a few factors

4. Clustering of objects: Helps us to put objects (people) into categories depending on their factor scores