Data Mining and Data Warehousing


Upload: sunny-gandhi

Post on 15-Jul-2015


Overview of Data Mining:

Generally, data mining (sometimes called data or knowledge discovery) is the process of

analyzing data from different perspectives and summarizing it into useful information -

information that can be used to increase revenue, cut costs, or both.

Data mining, the extraction of hidden predictive information from large databases, is a

powerful new technology with great potential to help companies focus on the most important

information in their data warehouses. Data mining tools predict future trends and behaviors,

allowing businesses to make proactive, knowledge-driven decisions.

Data mining software is one of a number of analytical tools for analyzing data. It allows

users to analyze data from many different dimensions or angles, categorize it, and summarize

the relationships identified.

DEFINITION OF 'DATA MINING'

Data mining is a process used by companies to turn raw data into useful information. By using software to

look for patterns in large batches of data, businesses can learn more about their customers

and develop more effective marketing strategies as well as increase sales and decrease costs.

Data mining depends on effective data collection and warehousing as well as computer

processing.

Grocery stores are well-known users of data mining techniques. Many supermarkets offer

free loyalty cards to customers that give them access to reduced prices not available to

non-members. The cards make it easy for stores to track who is buying what, when they are

buying it, and at what price. The stores can then use this data, after analyzing it, for multiple

purposes, such as offering customers coupons that are targeted to their buying habits and

deciding when to put items on sale and when to sell them at full price.

Data Mining Engine

The data mining engine is essential to the data mining system. It consists of a set of

functional modules.

These modules perform the following tasks:

o Characterization

o Association and Correlation Analysis

o Classification

o Prediction

o Cluster analysis

o Outlier analysis

o Evolution analysis

Purpose and Uses of Data Mining

The purpose of data mining is to identify patterns in order to make predictions from

information contained in databases. It allows the user to be proactive in identifying and

predicting trends with that information.

Common uses of data mining in government include knowledge discovery, fraud detection,

analysis of research, decision support, and website personalization.

The most common federal government uses of data mining as identified by GAO include:

1) Improving service or performance

2) Detecting fraud, waste, and abuse

3) Analyzing scientific and research information

4) Managing human resources

5) Detecting criminal activities or patterns

6) Analyzing intelligence and detecting terrorist activities.

State government data mining efforts include programs to ensure that the proper beneficiaries

of state benefits programs receive the correct amount of benefits. Such uses can save states

substantial amounts of money that otherwise would be erroneously paid out in the form of

state benefits.

Moreover, in a recent report, GAO found that twenty-one states are using data mining

software to look for unusual patterns in claims, provider, and beneficiary information stored

in data warehouses in order to identify potential provider abuse.

Major Data Mining Tasks

The two high-level primary goals of data mining, in practice, are prediction and description.

1) Prediction involves using some variables or fields in the database to predict unknown or

future values of other variables of interest.

2) Description focuses on finding human-interpretable patterns describing the data.

The relative importance of prediction and description for particular data mining applications

can vary considerably. However, in the context of the knowledge discovery in databases (KDD) process,

description tends to be more important than prediction. This is in contrast to pattern

recognition and machine learning applications (such as speech recognition) where prediction

is often the primary goal.

The goals of prediction and description are achieved by using the following primary data

mining tasks:

1. Classification is learning a function that maps (classifies) a data item into one of several

predefined classes.

2. Regression is learning a function which maps a data item to a real-valued prediction variable.

3. Clustering is a common descriptive task where one seeks to identify a finite set of categories

or clusters to describe the data.

o A cluster is a group of similar objects. Cluster analysis refers to forming

groups of objects that are very similar to each other but highly different from the

objects in other clusters.

o Closely related to clustering is the task of probability density estimation which consists of

techniques for estimating, from data, the joint multivariate probability density function

of all of the variables/fields in the database.

4. Summarization involves methods for finding a compact description for a subset of data.

5. Dependency Modeling consists of finding a model which describes significant dependencies

between variables.

o Dependency models exist at two levels:

i. The structural level of the model specifies (often graphically) which variables are

locally dependent on each other, and

ii. The quantitative level of the model specifies the strengths of the dependencies using

some numerical scale.

6. Change and Deviation Detection focuses on discovering the most significant changes in the

data from previously measured or normative values.
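
As a rough illustration of task 6 (change and deviation detection), the sketch below flags new values that deviate strongly from previously measured ones. The function name `flag_deviations`, the threshold of three standard deviations, and the sample figures are assumptions made for this example only; they do not come from the text above.

```python
from statistics import mean, stdev

def flag_deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` baseline
    standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # Constant baseline: any different value counts as a deviation.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]

# Hypothetical previously measured values and newly observed values.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
observed = [101, 150, 99, 60]
deviations = flag_deviations(baseline, observed)
print(deviations)
```

With this made-up baseline (mean 100, sample standard deviation 2), the values 150 and 60 are reported as significant deviations, while 101 and 99 are not.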

Mining Methodology and User Interaction Issues

This category covers the following kinds of issues:

1. Mining different kinds of knowledge in databases

The needs of different users are not the same, and different users may be interested in

different kinds of knowledge. Therefore, data mining must cover a broad range of

knowledge discovery tasks.

2. Interactive mining of knowledge at multiple levels of abstraction

The data mining process needs to be interactive because interaction lets users focus the search

for patterns, providing and refining data mining requests based on returned results.

3. Incorporation of background knowledge

Background knowledge can be used to guide the discovery process and to express the discovered

patterns, not only in concise terms but also at multiple levels of abstraction.

4. Data mining query languages and ad hoc data mining

A data mining query language that allows the user to describe ad hoc mining tasks should be

integrated with a data warehouse query language and optimized for efficient and flexible data

mining.

5. Presentation and visualization of data mining results

Once patterns are discovered, they need to be expressed in high-level languages and visual

representations. These representations should be easily understandable by the users.

6. Handling noisy or incomplete data

Data cleaning methods are required that can handle noise and incomplete objects while

mining the data regularities. Without data cleaning methods, the accuracy of the

discovered patterns will be poor.

7. Pattern evaluation

This refers to the interestingness of the discovered patterns. Many discovered patterns are

uninteresting because they either represent common knowledge or lack novelty.

Performance Issues

This category covers the following issues:

1. Efficiency and scalability of data mining algorithms

To effectively extract information from the huge amounts of data in databases, data

mining algorithms must be efficient and scalable.

2. Parallel, distributed, and incremental mining algorithms

Factors such as the huge size of databases, the wide distribution of data, and the complexity of data

mining methods motivate the development of parallel and distributed data mining algorithms.

These algorithms divide the data into partitions, which are then processed in parallel. Then the

results from the partitions are merged. Incremental algorithms incorporate database updates

without having to mine the data again from scratch.
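
The partition, merge, and incremental-update idea can be sketched with simple item counting. This is a minimal illustration under assumed names (`count_items`, `merge_counts`) and made-up transactions; the partitions are processed sequentially here, whereas a real system would hand each partition to a separate worker.

```python
from collections import Counter

def count_items(transactions):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))  # set() so an item counts once per basket
    return counts

def merge_counts(partial_counts):
    """Merge the per-partition counts into one global result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical data split into two partitions.
partitions = [
    [["bread", "milk"], ["bread", "beer"]],
    [["milk", "beer"], ["bread", "milk", "beer"]],
]
global_counts = merge_counts(count_items(p) for p in partitions)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["bread"], ["milk", "eggs"]]
global_counts.update(count_items(new_batch))
print(global_counts)
```

Counting is a convenient example because partial results can be merged by simple addition; many mining algorithms need a more elaborate merge step.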

Diverse Data Types Issues

1. Handling of relational and complex types of data

The database may contain complex data objects, multimedia data objects, spatial data,

temporal data, etc. It is not possible for one system to mine all these kinds of data.

2. Mining information from heterogeneous databases and global information systems

The data is available at different data sources on a LAN or WAN. These data sources may be

structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds

challenges to data mining.

Classification and Prediction Issues

The major issue is preparing the data for classification and prediction. Preparing the data

involves the following activities:

1) Data Cleaning

Data cleaning involves removing noise and treating missing values. Noise is

removed by applying smoothing techniques, and the problem of missing values is solved by

replacing a missing value with the most commonly occurring value for that attribute.

2) Relevance Analysis

The database may also contain irrelevant attributes. Correlation analysis is used to determine

whether any two given attributes are related.

3) Data Transformation and Reduction

The data can be transformed by any of the following methods.

Normalization - The data is transformed using normalization. Normalization involves

scaling all values for a given attribute so that they fall within a small specified

range. Normalization is used when the learning step involves neural networks or

methods based on distance measurements.

Generalization - The data can also be transformed by generalizing it to a higher-level

concept. For this purpose we can use concept hierarchies.
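
The preparation steps above can be sketched in a few lines. The helper names (`impute_mode`, `pearson`, `min_max_normalize`) and the sample values are assumptions for illustration. The sketch uses min-max scaling as one common form of normalization and the Pearson coefficient as one common correlation measure; the text does not prescribe either.

```python
from collections import Counter

def impute_mode(values):
    """Data cleaning: replace missing values (None) with the most
    commonly occurring value of the attribute."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def pearson(xs, ys):
    """Relevance analysis: Pearson correlation between two attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (vx * vy) ** 0.5

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical attribute columns.
colors = impute_mode(["red", None, "blue", "red"])
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
incomes = min_max_normalize([20, 30, 40, 60])
print(colors, r, incomes)
```

Here the missing color becomes "red", the two perfectly linear columns have a correlation of about 1, and the incomes are rescaled so that 20 maps to 0 and 60 maps to 1.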

Data Mining Applications

Here is a list of areas where data mining is widely used:

Financial Data Analysis

Retail Industry

Telecommunication Industry

Biological Data Analysis

Other Scientific Applications

Intrusion Detection

1. FINANCIAL DATA ANALYSIS

Financial data in the banking and financial industry is generally reliable and of high quality,

which facilitates systematic data analysis and data mining.

Here are a few typical cases:

o Design and construction of data warehouses for multidimensional data analysis and data

mining.

o Loan payment prediction and customer credit policy analysis.

o Classification and clustering of customers for targeted marketing.

o Detection of money laundering and other financial crimes.

2. RETAIL INDUSTRY

Data mining has great application in the retail industry because the industry collects large amounts of data

on sales, customer purchasing history, goods transportation, consumption, and services.

It is natural that the quantity of data collected will continue to expand rapidly because of

the increasing ease, availability, and popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and trends,

which leads to improved quality of customer service and good customer retention and

satisfaction. Here is a list of examples of data mining in the retail industry:

o Design and Construction of data warehouses based on benefits of data mining.

o Multidimensional analysis of sales, customers, products, time and region.

o Analysis of effectiveness of sales campaigns.

o Customer Retention.

o Product recommendation and cross-referencing of items.

3. TELECOMMUNICATION INDUSTRY

Today the telecommunication industry is one of the fastest-emerging industries, providing

various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, and web

data transmission.

Due to the development of new computer and communication technologies, the

telecommunication industry is rapidly expanding. This is why data mining has

become very important for helping to understand the business.

Data mining in the telecommunication industry helps in identifying telecommunication

patterns, catching fraudulent activities, making better use of resources, and improving quality of

service.

Here is a list of examples in which data mining improves telecommunication services:

o Multidimensional Analysis of Telecommunication data.

o Fraudulent pattern analysis.

o Identification of unusual patterns.

o Multidimensional association and sequential patterns analysis.

o Mobile Telecommunication services.

o Use of visualization tools in telecommunication data analysis.

4. BIOLOGICAL DATA ANALYSIS

Nowadays there is vast growth in fields of biology such as genomics, proteomics,

functional genomics, and biomedical research. Biological data mining is a very important part

of bioinformatics.

The following are aspects in which data mining contributes to biological data analysis:

o Semantic integration of heterogeneous, distributed genomic and proteomic databases.

o Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide

sequences.

o Discovery of structural patterns and analysis of genetic networks and protein pathways.

o Association and path analysis.

o Visualization tools in genetic data analysis.

5. OTHER SCIENTIFIC APPLICATIONS

The applications discussed above tend to handle relatively small and homogeneous data sets

for which statistical techniques are appropriate. Huge amounts of data have been collected

from scientific domains such as geosciences and astronomy. Large numbers of data

sets are also being generated by fast numerical simulations in various fields such as

climate and ecosystem modeling, chemical engineering, and fluid dynamics.

The following are applications of data mining in scientific fields:

o Data Warehouses and data preprocessing.

o Graph-based mining.

o Visualization and domain specific knowledge.

6. INTRUSION DETECTION

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of

network resources. In this world of connectivity, security has become a major issue. The

increased usage of the internet and the availability of tools and tricks for intruding on and attacking

networks have prompted intrusion detection to become a critical component of network

administration.

Here is a list of areas in which data mining technology may be applied for intrusion

detection:

o Development of data mining algorithms for intrusion detection.

o Association and correlation analysis, and aggregation to help select and build discriminating

attributes.

o Analysis of stream data.

o Distributed data mining.

o Visualization and query tools.

Data Mining Process

Data Mining is an analytic process designed to explore data (usually large amounts of data -

typically business or market related - also known as "big data") in search of consistent

patterns and/or systematic relationships between variables, and then to validate the findings

by applying the detected patterns to new subsets of data.

The ultimate goal of data mining is prediction - and predictive data mining is the most

common type of data mining and one that has the most direct business applications.

The process of data mining consists of three stages:

1. The initial exploration.

2. Model building or pattern identification with validation/verification.

3. Deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration

This stage usually starts with data preparation which may involve cleaning data, data

transformations, selecting subsets of records and - in case of data sets with large numbers

of variables ("fields") - performing some preliminary feature selection operations to bring

the number of variables to a manageable range (depending on the statistical methods

which are being considered).

Then, depending on the nature of the analytic problem, this first stage of the process of

data mining may involve anywhere between a simple choice of straightforward predictors

for a regression model, to elaborate exploratory analyses using a wide variety of

graphical and statistical methods.

Stage 2: Model building and validation

This stage involves considering various models and choosing the best one based on their

predictive performance (i.e., explaining the variability in question and producing stable

results across samples). This may sound like a simple operation, but in fact, it sometimes

involves a very elaborate process. There are a variety of techniques developed to achieve

that goal - many of which are based on so-called "competitive evaluation of models," that

is, applying different models to the same data set and then comparing their performance

to choose the best.

These techniques - which are often considered the core of predictive data mining -

include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations),

and Meta-Learning.
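
The "competitive evaluation of models" idea can be sketched as follows: fit several candidate models on the same training data, score each on held-out data, and keep the best. The two candidates (a constant-mean model and a least-squares line), the mean squared error criterion, and the data are assumptions chosen to keep the example small. This is not an implementation of bagging, boosting, or stacking.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: least-squares straight line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation samples.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
valid_x, valid_y = [7, 8], [14.1, 15.9]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {
    name: mse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)  # lowest validation error wins
print(scores, best)
```

Scoring on data that was not used for fitting is what makes the comparison reflect stability across samples rather than fit to one sample.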

Stage 3: Deployment

This final stage involves using the model selected as best in the previous stage and applying

it to new data in order to generate predictions or estimates of the expected result or outcome.

The concept of Data Mining is becoming increasingly popular as a business information

management tool where it is expected to reveal knowledge structures that can guide decisions

in conditions of limited certainty or assurance.

In recent times, there has been increased interest in developing new analytic techniques

specifically designed to address the issues relevant to business data

mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles

of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it

shares with them both some components of its general approaches and specific techniques.

However, an important general difference in the focus and purpose between Data Mining and

the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented

towards applications than the basic nature of the underlying phenomena. In other words, Data

Mining is relatively less concerned with identifying the specific relations between the

involved variables. For example, uncovering the nature of the underlying functions or the

specific types of interactive, multivariate dependencies between variables is not the main

goal of Data Mining. Instead, the focus is on producing a solution that can generate useful

predictions. Therefore, Data Mining accepts among others a "black box" approach to data

exploration or knowledge discovery and uses not only the traditional Exploratory Data

Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate

valid predictions but are not capable of identifying the specific nature of the interrelations

between the variables on which the predictions are based.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business

information in a large database (for example, finding linked products in gigabytes of store

scanner data) and mining a mountain for a vein of valuable ore. Both processes require

either sifting through an immense amount of material, or intelligently probing it to find

exactly where the value resides. Given databases of sufficient size and quality, data mining

technology can generate new business opportunities by providing these capabilities:

1. Automated prediction

Automated prediction of trends and behaviors. Data mining automates the process of finding

predictive information in large databases. Questions that traditionally required extensive

hands-on analysis can now be answered directly from the data quickly.

A typical example of a predictive problem is targeted marketing. Data mining uses data on

past promotional mailings to identify the targets most likely to maximize return on

investment in future mailings. Other predictive problems include forecasting bankruptcy and

other forms of default, and identifying segments of a population likely to respond similarly to

given events.

2. Automated discovery

Automated discovery of previously unknown patterns. Data mining tools sweep through

databases and identify previously hidden patterns in one step.

An example of pattern discovery is the analysis of retail sales data to identify seemingly

unrelated products that are often purchased together. Other pattern discovery problems

include detecting fraudulent credit card transactions and identifying anomalous data that

could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and

hardware platforms, and can be implemented on new systems as existing platforms are

upgraded and new products developed. When data mining tools are implemented on high

performance parallel processing systems, they can analyze massive databases in minutes.

Faster processing means that users can automatically experiment with more models to

understand complex data. High speed makes it practical for users to analyze huge quantities

of data. Larger databases, in turn, yield improved predictions.

Techniques of Data Mining

1. Decision trees:

Tree-shaped structures that represent sets of decisions. These decisions generate rules for the

classification of a dataset. Specific decision tree methods include Classification and

Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

In the decision tree technique, the root of the decision tree is a simple question or condition that

has multiple answers. Each answer then leads to a further set of questions or conditions that help us

narrow down the data so that we can make the final decision based on it.

For example, consider a decision tree used to determine whether or not to play tennis.

Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it

is rainy, we should only play tennis if the wind is weak. And if it is sunny then we should

play tennis only if the humidity is normal.
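
The play-tennis rules described above can be written directly as nested conditions, which is what a learned decision tree amounts to once it is built. The function name `play_tennis` and the string labels for each attribute value are assumptions for this sketch; the rules are written by hand here, not learned from data.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the play-tennis decision tree from the root (outlook) down."""
    if outlook == "overcast":
        return True                    # overcast: always play
    if outlook == "rainy":
        return wind == "weak"          # rainy: play only in weak wind
    if outlook == "sunny":
        return humidity == "normal"    # sunny: play only in normal humidity
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("overcast", "high", "strong"))  # True
print(play_tennis("rainy", "normal", "strong"))   # False
print(play_tennis("sunny", "normal", "weak"))     # True
```

Methods such as CART or CHAID would derive the questions and their order from training data instead of having them hard-coded.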

2. Association

Association is one of the best-known data mining techniques. In association, a pattern is

discovered based on a relationship between items in the same transaction. That is the reason

why the association technique is also known as the relation technique. The association technique is

used in market basket analysis to identify a set of products that customers frequently

purchase together.

Retailers use the association technique to research customers’ buying habits. Based on

historical sales data, retailers might find out that customers often buy crisps when they buy

beer, and therefore they can put beer and crisps next to each other to save time for customers

and increase sales.
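
Market basket analysis usually quantifies such a relationship with two measures: support (how often the items appear together) and confidence (how often the second item appears when the first does). The sketch below computes both for the beer and crisps example; the function names and the five baskets are made up for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in transactions if items <= set(basket))

def support(transactions, itemset):
    """Fraction of all transactions containing the whole itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    with_antecedent = count(transactions, antecedent)
    if with_antecedent == 0:
        return 0.0
    both = count(transactions, set(antecedent) | set(consequent))
    return both / with_antecedent

# Hypothetical shopping baskets.
baskets = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer", "milk"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
]
sup = support(baskets, {"beer", "crisps"})        # 3 of 5 baskets
conf = confidence(baskets, {"beer"}, {"crisps"})  # 3 of the 4 beer baskets
print(sup, conf)
```

In this toy data, beer and crisps appear together in 60% of baskets, and 75% of beer buyers also buy crisps, which is the kind of evidence a retailer would use before placing the two products together.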

3. Classification

Classification is a classic data mining technique based on machine learning. Basically,

classification is used to classify each item in a set of data into one of a predefined set of classes

or groups. Classification methods make use of mathematical techniques such as decision

trees, linear programming, neural networks, and statistics.

In classification, we develop software that can learn how to classify data items into

groups.

For example, we can apply classification to a problem such as: “given all records of employees

who left the company, predict who will probably leave the company in a future period.” In

this case, we divide the records of employees into two groups named “leave” and “stay”.

We can then ask our data mining software to classify the employees into these groups.
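
A minimal sketch of the "leave" versus "stay" example, using a nearest-centroid classifier. The choice of classifier, the two features (years at the company and a satisfaction score), and all records are assumptions for illustration; the text does not say which classification method is used.

```python
def train_centroids(records, labels):
    """Learn one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in zip(records, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(center):
        return sum((c - f) ** 2 for c, f in zip(center, features))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Hypothetical history: (years at company, satisfaction score 0-10).
history = [(1, 2), (2, 3), (1, 4), (8, 8), (10, 7), (7, 9)]
outcome = ["leave", "leave", "leave", "stay", "stay", "stay"]
centroids = train_centroids(history, outcome)

print(classify(centroids, (2, 2)))  # leave
print(classify(centroids, (9, 8)))  # stay
```

The key point is that the classes are fixed in advance and the model learns from records whose class is already known.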

4. Clustering

Clustering is a data mining technique that automatically forms meaningful or useful clusters of objects

that have similar characteristics.

The clustering technique defines the classes and puts objects in each class, while in

classification techniques, objects are assigned to predefined classes. To make the concept

clearer, we can take book management in a library as an example. In a library, there is a wide

range of books on various topics. The challenge is how to keep those books in a way

that lets readers take several books on a particular topic without hassle. By using a clustering

technique, we can keep books that have some kind of similarity in one cluster or on one shelf

and label it with a meaningful name. Readers who want books on that topic then

only have to go to that shelf instead of searching the entire library.
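
To contrast with classification, the sketch below groups unlabeled values with k-means, one widely used clustering algorithm (the text does not name a specific one). It works on a single numeric feature to stay short; the function name `k_means_1d`, the starting centers, and the data are assumptions for illustration.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    `centers` holds the initial guesses, one per desired cluster.
    Returns the final centers and the last cluster assignment."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical numeric feature for six books, e.g. a topic score.
points = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(points, centers=[1, 12])
print(centers, clusters)
```

No labels are supplied: the two "shelves" emerge from the data, and naming them is left to a person afterwards.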

5. Prediction

Prediction, as its name implies, is a data mining technique that discovers

relationships among independent variables and relationships between dependent and

independent variables.

For instance, prediction analysis can be used in sales to predict future profit:

if we consider sales an independent variable, profit could be a dependent variable.

Then, based on historical sales and profit data, we can fit a regression curve that is

used for profit prediction.
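
For the simplest case, a straight-line fit, the regression of profit on sales can be sketched with ordinary least squares. The function name `fit_line` and the figures are made up; a real analysis might need a curve rather than a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: sales (independent), profit (dependent).
sales = [10, 20, 30, 40]
profit = [3, 5, 7, 9]

slope, intercept = fit_line(sales, profit)
predicted_profit = intercept + slope * 50  # forecast for sales of 50
print(slope, intercept, predicted_profit)
```

With these numbers the fitted line is approximately profit = 1 + 0.2 × sales, so sales of 50 give a predicted profit of about 11.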

6. Sequential Patterns

Sequential pattern analysis is a data mining technique that seeks to discover or identify

similar patterns, regular events, or trends in transaction data over a business period.

In sales, with historical transaction data, businesses can identify sets of items that customers

buy together at different times of the year. Businesses can then use this information to

recommend those items to customers with better deals, based on their purchasing frequency in the

past.
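
A very reduced form of sequential pattern analysis is to count how many customers bought one item and, at some later time, another. The sketch below does only that, for ordered pairs; the function name, the minimum count of 2, and the purchase histories are assumptions, and full sequential pattern mining algorithms handle longer sequences and time windows.

```python
from collections import Counter

def sequential_pair_counts(sequences):
    """For each ordered pair (first, later), count the number of
    sequences in which `first` is bought before `later`."""
    counts = Counter()
    for seq in sequences:
        pairs = set()  # count each pair at most once per customer
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    pairs.add((first, later))
        counts.update(pairs)
    return counts

# Hypothetical purchase histories, each in chronological order.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["laptop", "mouse"],
]
counts = sequential_pair_counts(histories)
frequent = sorted(pair for pair, n in counts.items() if n >= 2)
print(frequent)
```

In this toy data, "phone then charger" occurs for three customers and "case then charger" for two, so a charger offer could be timed after those purchases.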

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery based on the

following observations:

1. The web is too huge

The web is very large and rapidly growing. It therefore seems that the web is too huge

for data warehousing and data mining.

2. Complexity of Web pages

Web pages do not have a unifying structure. They are very complex compared to

traditional text documents. There is a huge number of documents in the digital library of the web,

and these documents are not arranged in any particular sorted order.

3. The web is a dynamic information source

The information on the web is rapidly updated. Data such as news, stock markets,

weather, sports, and shopping are regularly updated.

4. Diversity of user communities

The user community on the web is rapidly expanding. These users have different

backgrounds, interests, and usage purposes. There are more than 100 million workstations

connected to the Internet, and the number is still rapidly increasing.

5. Relevancy of Information

A particular person is generally interested in only a small portion of the

web, while the rest of the web contains information that is not relevant to

the user and may swamp the desired results.

Advantages of Data Mining

1. Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who

will respond to new marketing campaigns such as direct mail or online marketing

campaigns. Using the results, marketers can take an appropriate approach to selling profitable

products to targeted customers.

Data mining brings many benefits to retail companies in the same way as to marketing.

Through market basket analysis, a store can arrange its products so that

customers can conveniently buy frequently purchased products together. In addition, it

also helps retail companies offer discounts on particular products that will attract

more customers.

2. Finance / Banking

Data mining gives financial institutions information about loans and credit

reporting. By building a model from historical customer data, banks and financial

institutions can distinguish good loans from bad ones. In addition, data mining helps banks detect

fraudulent credit card transactions to protect credit card owners.

3. Manufacturing

By applying data mining to operational engineering data, manufacturers can detect faulty

equipment and determine optimal control parameters.

For example, semiconductor manufacturers face the challenge that even when the conditions of

manufacturing environments at different wafer production plants are similar, the quality of

the wafers is not the same, and some wafers have defects for unknown reasons. Data mining has

been applied to determine the ranges of control parameters that lead to the production of

golden wafers. Those optimal control parameters are then used to manufacture wafers of the

desired quality.

4. Governments

Data mining helps government agencies by digging through and analyzing records of financial

transactions to build patterns that can detect money laundering or criminal activities.

Disadvantages of Data Mining

1. Privacy Issues

Concerns about personal privacy have been increasing enormously in recent years,

especially as the internet booms with social networks, e-commerce, forums, and blogs.

Because of privacy issues, people are afraid that their personal information will be collected and

used in unethical ways that could cause them a lot of trouble. Businesses collect

information about their customers in many ways to understand their purchasing

behavior and trends.

However, businesses do not last forever; some day they may be acquired by others or go out of business. At

that time, the personal information they own may be sold to others or leaked.

2. Security issues

Security is a big issue. Businesses own information about their employees and customers,

including social security numbers, birthdays, payroll, and so on.

However, how well this information is protected is still in question. There have been

many cases in which hackers accessed and stole large amounts of customer data from big corporations such

as Ford Motor Credit Company and Sony. With so much personal and financial information

available, credit card theft and identity theft have become a big problem.

3. Misuse of information/inaccurate information

Information collected through data mining for ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

In addition, data mining techniques are not perfectly accurate. If inaccurate information is used for decision-making, it can have serious consequences.

Data Mining Example: Marketing

In marketing, particularly in advertising campaigns, data mining can often increase the response and purchase rate by a factor of two to three.

The following describes a typical data mining example:

A company wants to launch an advertising campaign for a product. Among its present customers, the company wants to mail product information to those with a high probability of purchasing the product. The company has data describing past customer behaviour and personal data about each of its customers. There are also customers who have already bought the product, e.g. in a trial period. The customers of the trial period are divided into two classes: those who have bought the product and those who have not. With this data a prediction model is created to predict the probability of purchasing the product. After that, the probability of purchasing the product is predicted for all other customers. Only those with a high probability are contacted. As a side effect, the company learns from this data mining analysis which customer attributes are the relevant drivers for buying a specific product (or at least for being very interested in it).

The example shows how Data Mining can help in marketing to predict the purchase

probability of customers for a specific product. This reduces cost, because sales activity can

be focused much better (lower cost for mailings and flyers or for cost intensive sales agents’

visits on the spot). The customers benefit at the same time because the average relevance of

the company’s offers increases (or the other way round: the “spam” quota of non-relevant

offers is reduced).
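The campaign example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer attributes (age group, loyalty card), the sample records and the 0.5 threshold are all hypothetical, and the "prediction model" is a deliberately simple table of purchase rates per customer segment rather than a real classifier.

```python
from collections import defaultdict

# Hypothetical trial-period customers: (age_group, has_loyalty_card, bought_product)
trial_customers = [
    ("18-30", True, True),
    ("18-30", True, True),
    ("18-30", True, False),
    ("18-30", False, False),
    ("31-50", True, True),
    ("31-50", False, False),
    ("31-50", False, False),
    ("31-50", False, True),
]

def fit_segment_rates(rows):
    """Estimate P(purchase | segment) as buyers / customers in each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [buyers, total]
    for age_group, has_card, bought in rows:
        segment = (age_group, has_card)
        counts[segment][0] += int(bought)
        counts[segment][1] += 1
    return {seg: buyers / total for seg, (buyers, total) in counts.items()}

def select_targets(customers, rates, threshold=0.5):
    """Return ids of customers whose segment purchase rate reaches the threshold."""
    targets = []
    for customer_id, age_group, has_card in customers:
        # A segment never seen in the trial is assumed to have probability 0.
        probability = rates.get((age_group, has_card), 0.0)
        if probability >= threshold:
            targets.append(customer_id)
    return targets

rates = fit_segment_rates(trial_customers)

# Remaining customers who were not part of the trial (hypothetical).
other_customers = [
    ("C1", "18-30", True),
    ("C2", "18-30", False),
    ("C3", "31-50", True),
    ("C4", "31-50", False),
]
mailing_list = select_targets(other_customers, rates)
```

Comparing the rates across segments also hints at the driver attributes mentioned above: in this made-up sample, card holders buy far more often than non-holders.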

A producer wants to know:

1. Which are our lowest/highest margin customers?

2. Who are my customers and what products are they buying?

3. Which customers are most likely to go to the competitors?

4. What impact will new products/services have on revenues and margins?

5. What product promotions have the biggest impact on revenues?

6. What is the most effective distribution channel?

What is a Data Warehouse?

“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.” - Barry Devlin

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” - W. H. Inmon

Data warehousing:

o The process of constructing and using data warehouses.

o Organized around major subjects, such as customer, product, sales.

o Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

o Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Characteristics of Data warehousing

1. Subject-Oriented

Organized around major subjects, such as customer, product, sales.

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

2. Integrated

Constructed by integrating multiple, heterogeneous data sources.

o Relational databases, flat files, on-line transaction records.

Data cleaning and data integration techniques are applied.

o Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.

E.g., hotel price: currency, tax, whether breakfast is covered, etc.

o When data is moved to the warehouse, it is converted.
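The hotel price example can be made concrete with a small conversion routine. This is a sketch under assumptions, not a prescribed method: the field names, the exchange rates and the chosen warehouse convention (US dollars, tax included, breakfast excluded) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; real values would come from a reference table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "GBP": 1.25}

@dataclass
class HotelPrice:
    hotel: str
    usd_per_night: float  # warehouse convention: tax included, breakfast excluded

def to_warehouse_format(record: dict) -> HotelPrice:
    """Convert one source record to the single convention used in the warehouse."""
    price = record["price"]
    if not record["tax_included"]:
        price *= 1 + record["tax_rate"]      # add tax when the source omits it
    if record["breakfast_included"]:
        price -= record["breakfast_price"]   # strip breakfast when the source bundles it
    price_usd = price * RATES_TO_USD[record["currency"]]
    return HotelPrice(hotel=record["hotel"], usd_per_night=round(price_usd, 2))

# Two sources describing comparable rooms in incompatible ways.
source_a = {"hotel": "Alpha", "price": 100.0, "currency": "USD",
            "tax_included": False, "tax_rate": 0.10,
            "breakfast_included": False, "breakfast_price": 0.0}
source_b = {"hotel": "Beta", "price": 120.0, "currency": "EUR",
            "tax_included": True, "tax_rate": 0.0,
            "breakfast_included": True, "breakfast_price": 20.0}

alpha = to_warehouse_format(source_a)
beta = to_warehouse_format(source_b)
```

Once both records follow the same convention, the two prices can be compared directly, which is the point of the integration step.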

3. Time Variant

The time horizon for the data warehouse is significantly longer than that of operational

systems.

o Operational database: current value data.

o Data warehouse data: provide information from a historical perspective (e.g., past 5-10

years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.

4. Non-Volatile

Once data is entered into the warehouse, it is not changed or updated.

A physically separate store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.

o Does not require transaction processing, recovery, and concurrency control mechanisms.

o Requires only two operations in data accessing: Initial loading of data and access of data.
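The two permitted operations, loading and access, can be illustrated with a small append-only structure. The class and method names below are hypothetical and the sketch ignores storage and indexing; it only shows that no update or delete operation is offered.

```python
class WarehouseTable:
    """Append-only table: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Initial or periodic bulk load; each row is stored as an immutable tuple."""
        self._rows.extend(tuple(row) for row in rows)

    def query(self, predicate=lambda row: True):
        """Read-only access: return the rows that satisfy the predicate."""
        return [row for row in self._rows if predicate(row)]

# Hypothetical monthly sales facts: (month, region, units_sold)
sales = WarehouseTable()
sales.load([("2014-11", "north", 120), ("2014-12", "north", 150)])

december = sales.query(lambda row: row[0] == "2014-12")
```

Because every row carries its month, a later correction would be loaded as a new row rather than written over an old one, which also preserves the time-variant history.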

Purpose of Data warehousing (why data warehousing)

As part of a company's business intelligence solution, a data warehouse is integral to the

gathering, processing and use of all the information a business receives daily. A strong

business intelligence plan, coupled with a robust data warehouse, will guarantee a business

has all the tools needed to make the right decisions for today and for the future.

The term "Business Intelligence" describes the process a business uses to gather all its raw

data from multiple sources and process it into practical information they will apply to

determine effectiveness of business processes, create policy, forecast trends, analyze the

market and much more. Data warehousing is an integral part of any effective business

intelligence endeavour. Data warehousing is more than just a database-like method of storing information. While a database simply holds data, a well-designed data warehousing system comprises three segments:

a. Staging: raw data is stored and manipulated by developers. The goal of developers in

this stage is to take raw information from widely disparate sources, standardize and

organize it, readying it for integration.

b. Integration: raw data is further categorized and stored logically according to the

needs of the end user, allowing easier access.

c. Access: data is presented to users in a coherent way that is easy to understand and

use. Clients employ computer applications to both access and analyze the information

the data warehousing system provides.

While many companies are on board with data warehousing and storage and use business

intelligence systems daily, others find the concept of a data warehouse and its benefits hard

to grasp.

Here is a look at some of the pros of employing a data warehousing solution:

1. Improved user access: a standard database can be read and manipulated by programs like

SQL Query Studio or the Oracle client, but there is considerable ramp up time for end users

to effectively use these apps to get what they need. Business intelligence and data warehouse end-user access tools are built specifically for the purposes data warehouses serve: analysis, benchmarking, prediction and more.

2. Better consistency of data: developers work with data warehousing systems after data has

been received so that all the information contained in the data warehouse is standardized.

Only uniform data can be used efficiently for successful comparisons. Other solutions simply

cannot match a data warehouse's level of consistency.

3. All-in-one:

o A data warehouse has the ability to receive data from many different sources, meaning

any system in a business can contribute its data.

o Let's face it: different business segments use different applications. Only a proper data

warehouse solution can receive data from all of them and give a business the "big

picture" view that is needed to analyze the business, make plans, track competitors and

more.

4. Advanced query processing: in most businesses, even the best database systems are bound to either a single server or a handful of servers in a cluster. A data warehouse is often a purpose-built solution designed for analytical workloads rather than a standard database server. As a result, it can typically process analytical queries faster and more effectively, leading to efficiency and increased productivity.

5. Retention of data history: end-user applications typically don't have the ability, not to

mention the space, to maintain much transaction history and keep track of multiple changes

to data. Data warehousing solutions have the ability to track all alterations to data, providing

a reliable history of all changes, additions and deletions. With a data warehouse, the integrity

of data is ensured.

6. Disaster recovery implications: a data warehouse system offers a great deal of security

when it comes to disaster recovery. Since data from disparate systems is all sent to a data

warehouse, that data warehouse essentially acts as another information backup source.

Considering the data warehouse will also be backed up, that's now four places where the

same information will be stored: the original source, its backup, the data warehouse and its

subsequent backup. This is unparalleled information security.

Data Warehouse Tools and Utilities Functions

The following are the functions of Data Warehouse tools and Utilities:

1. Data Extraction

Data Extraction involves gathering the data from multiple heterogeneous sources.

2. Data Cleaning

Data Cleaning involves finding and correcting the errors in data.

3. Data Transformation

Data Transformation involves converting data from legacy format to warehouse format.

4. Data Loading

Data Loading involves sorting, summarizing, consolidating, checking integrity and building

indices and partitions.

5. Refreshing

Refreshing involves updating from data sources to warehouse.
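The five functions above can be sketched end to end with Python's standard library. This is a minimal illustration under assumptions: the two source extracts, their column names and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical source extracts with different column names and date formats.
CRM_CSV = "customer,amount,date\nalice,100.50,2014-12-01\nbob,,2014-12-02\n"
SHOP_CSV = "cust_name;total;day\nCarol;80;03/12/2014\nalice;19.5;04/12/2014\n"

def extract(text, delimiter):
    """Extraction: read rows from one source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def clean(rows, amount_key):
    """Cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r[amount_key].strip()]

def transform_crm(row):
    """Transformation: the CRM source already uses ISO dates; normalise names."""
    return (row["customer"].lower(), float(row["amount"]), row["date"])

def transform_shop(row):
    """Transformation: convert DD/MM/YYYY to ISO format and normalise names."""
    day, month, year = row["day"].split("/")
    return (row["cust_name"].lower(), float(row["total"]), f"{year}-{month}-{day}")

def load(conn, rows):
    """Loading: insert rows and build an index on the customer column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer)")
    conn.commit()

conn = sqlite3.connect(":memory:")
crm_rows = [transform_crm(r) for r in clean(extract(CRM_CSV, ","), "amount")]
shop_rows = [transform_shop(r) for r in clean(extract(SHOP_CSV, ";"), "total")]
load(conn, crm_rows + shop_rows)

# Summarised view over the consolidated data.
totals = dict(conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
```

Refreshing would amount to running the same extract, clean, transform and load steps again on whatever new rows the sources have produced since the last run.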

DATA WAREHOUSING APPLICATIONS

In computing, a data warehouse is a system used for reporting and analysis.

These systems store historical as well as current data, which is used to produce trend reports for senior management, such as annual and quarterly comparisons.

It brings all the data together in a central location called the data warehouse. All the data stored in the warehouse is uploaded from the operational systems and passes through various operations on the way. The data warehouse environment consists of various source systems that supply the warehouse with data, and various data integration technologies are used to make the data ready for use.

Various architectures, tools and applications are involved in storing data in the warehouse. A data warehouse is often hosted on a mainframe server. The data is extracted and organized there to serve user queries. This gives the advantage of gathering information and data from diverse sources for easy access and analysis. The applications of data warehousing include data mining, web mining and decision support systems.

1. Data mining

Data mining is the analysis of data to find new relationships between various types of data. It is basically done by sorting and analysing the data to recognize patterns and relationships. Association patterns are found by relating one event to another. Sequence or path patterns are found where one event leads to the occurrence of another. The patterns are classified and the data is organized accordingly. Newly discovered patterns are then used for predictive analysis. Data mining techniques are also used in research areas.

2. Web mining

Web mining is becoming important in the field of customer relationship management. It is basically the integration of data and information gathered from the web by data mining methodologies. The information is gathered from all over the world. In customer relationship management it is used to observe customer behaviour and needs more accurately, which helps a business succeed in the market. Data mining techniques such as classification, association and clustering are used to evaluate the data.

3. Decision support system

A decision support system is an application of data warehousing that analyzes business data and presents the results in a way that makes business decisions easier for business users. It is considered an informational application. A basic use is to compare the sales figures of various weeks. Revenue figures can also be forecast based on product sales, and past experience and sales are taken into account to support the right decisions. The information is usually presented graphically. A decision support system may also include an artificial intelligence component for this purpose.
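The weekly comparison and the revenue forecast can be sketched as follows. The sales figures are hypothetical, and the three-week moving average is just one simple forecasting choice assumed for illustration; a real decision support system may use a different method.

```python
# Hypothetical weekly revenue for weeks 1 to 4.
weekly_sales = [1200.0, 1350.0, 1280.0, 1500.0]

def week_over_week_change(sales):
    """Percentage change of each week relative to the previous week."""
    return [
        round((current - previous) / previous * 100, 1)
        for previous, current in zip(sales, sales[1:])
    ]

def moving_average_forecast(sales, window=3):
    """Forecast the next week as the mean of the last `window` weeks."""
    recent = sales[-window:]
    return sum(recent) / len(recent)

changes = week_over_week_change(weekly_sales)      # one value per week transition
forecast = moving_average_forecast(weekly_sales)   # estimate for week 5
```

Figures like these are what a decision support system would typically render as a chart for the business user.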

From the points above, it is clear that data warehousing has many applications, which are used in almost every field.

Advantages and disadvantages of data warehouses

Data warehouses are the traditional solution for data integration, and for good reason, but it is becoming increasingly difficult to scale them and to copy data from multiple data sources in multiple organizations in multiple locations.

1. A Data Warehouse Delivers Enhanced Business Intelligence

Because a data warehouse provides data from various sources, managers and executives no longer need to make business decisions based on limited data or their gut. In addition, “data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales.”

2. A Data Warehouse Saves Time

Since business users can quickly access critical data from a number of sources (all in one

place) they can rapidly make informed decisions on key initiatives. They won’t waste

precious time retrieving data from multiple sources.

Not only that, but business executives can query the data themselves with little or no support from IT, saving more time and more money. That means the business users won’t have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do best: keep the business running.

3. A Data Warehouse Enhances Data Quality and Consistency

A data warehouse implementation includes the conversion of data from numerous source

systems into a common format.

Since the data from the various departments is standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data. And accurate data is the basis for strong business decisions.

4. A Data Warehouse Provides Historical Intelligence

A data warehouse stores large amounts of historical data so you can analyze different time

periods and trends in order to make future predictions. Such data typically cannot be stored in

a transactional database or used to generate reports from a transactional system.

5. A Data Warehouse Generates a High ROI

Finally, the pièce de résistance: return on investment. Companies that have implemented

data warehouses and complementary BI systems have generated more revenue and saved

more money than companies that haven’t invested in BI systems and data warehouses.

The Disadvantages of a Data Warehouse

1. Extra Reporting Work

Depending on the size of the organization, a data warehouse risks creating extra work for departments. Each type of data that's needed in the warehouse typically has to be generated

by the IT teams in each division of the business. This can be as simple as duplicating data

from an existing database, but at other times, it involves gathering data from customers or

employees that wasn't gathered before.

2. Cost/Benefit Ratio

A commonly cited disadvantage of data warehousing is the cost/benefit analysis. A data

warehouse is a big IT project, and like many big IT projects, it can suck a lot of IT man hours

and budgetary money to generate a tool that doesn't get used often enough to justify the

implementation expense.

This is completely sidestepping the issue of the expense of maintaining the data warehouse

and updating it as the business grows and adapts to the market.

3. Data Ownership Concerns

Data warehouses are often, but not always, Software as a Service implementations, or cloud

services applications. Your data security in this environment is only as good as your cloud

vendor. Even if implemented locally, there are concerns about data access throughout the

company. Make sure that the people doing the analysis are individuals that your organization

trusts, especially with customers' personal data. A data warehouse that leaks customer data is

a privacy and public relations nightmare.

4. Data Flexibility

Data warehouses tend to have static data sets with minimal ability to "drill down" to specific

solutions. The data is imported and filtered through a schema, and it is often days or weeks

old by the time it's actually used. In addition, data warehouses are usually subject to ad hoc

queries and are thus notoriously difficult to tune for processing speed and query speed. While

the queries are often ad hoc, the queries are limited by what data relations were set when the

aggregation was assembled.

Top 10 challenges in building data warehouse for large banks

1) Lack of strategic focus to build Enterprise Data Warehouse (EDW)

Building an EDW is a strategic initiative, since it requires a shift in culture and a longer timescale and, more importantly, is an expensive affair. Hence it should be one of the top agenda items of the CXOs, who need to closely monitor progress and provide executive support to break down any unwanted barriers.

2) Need of considerable Time, Effort & Cost

The typical time taken for a global bank to build an EDW varies from a couple of years to 5 years. It also requires substantial effort and, eventually, a huge amount of money. In addition, evidence of successful ROI from existing data warehouse implementations is very opaque.

3) Lack of cross divisional collaboration

Building EDW requires constructive collaboration from various teams like multiple business

divisions, source system teams, architecture & design teams, project teams and vendor

teams.

4) Technological complexity

Source data is mostly kept on multiple operating systems and in multiple database technologies. There are plenty of tools for data sourcing, data quality management, data integration, data warehousing, reporting and analytics.

Choosing the appropriate technology is not simple, and it is complicated by various emerging techniques such as data virtualization, self-service BI, in-database analytics, columnar databases, NoSQL databases, massively parallel processing and in-memory computing.

In addition, the traditional data warehouse needs to be integrated with big data technologies and the Internet of Things to gain business insights.

5) Ill-defined, changing business data requirements & Insensitivity of technical team in

understanding business requirements

Most of the time the business finds it difficult to define its data requirements, since they keep evolving as the use of data increases. However, the technical team wants finalised data requirements from the business before designing and building a data warehouse.

6) Lack of clarity on true source of data

Most large banks have a long legacy behind them and have grown over decades through mergers and acquisitions. They have a widespread footprint across geographies and customer segments. In the process, they have acquired many systems that are poorly integrated and poorly documented, and data is scattered across multiple systems. It is a nightmare for these banks to identify the true source of their data.

7) Lack of ability to manage data quality issues

Since data is an organisational asset, it needs to be acquired and maintained well.

Many front-office and customer-facing systems don’t capture quality data at its origination. There is no unified data capturing process across the organisation.

For example, the last name of a personal customer may not have been captured in a front-office system because it is not a mandatory field there, whereas it may be a mandatory field in another system.

Sometimes there is a lack of well-defined processes and technologies to curtail data quality issues.

8) Vested interest of vendors in promoting their own solution

Most of the top data warehousing vendors have their own suite of solutions and products covering the entire data warehousing ecosystem. These vendors tend to promote their own solution rather than advocating what is best suited for the customer.

9) Comfort of using divisional data marts

Reporting is an indispensable activity in banking. Many banks have built divisional data marts to fulfil their own divisional needs. Although divisional marts do not provide an enterprise-wide view, many business users are comfortable using a divisional data mart, on the assumption that “a known devil is better than an unknown angel”.

10) Under-utilisation of the data warehouse

Business users from various divisions need to use the data warehouse for reporting, business intelligence, data analytics and advanced analytics to unleash the full potential of the enterprise data asset. An under-utilised data warehouse will not grow and will not yield the desired return on investment (ROI).

Case Study

Data Warehousing Solution for One of Europe's Largest Financial Services Groups

o The client sought a business intelligence solution to consolidate the mortgage administration

processes, provide better sales cycle management, mortgage product performance analysis,

financial forecasting based on sales demands, fraud detection and general mortgage

operational reporting. Infosys delivered a highly scalable solution.

The Client

o The client is one of Europe's largest financial services groups in corporate and commercial

banking, retail banking, credit cards and general insurance. The company sells mortgages to

corporate and retail customers through various channels. These mortgage systems run on

different technology platforms and follow different business processes.

Business Need

o Consolidate the mortgage administration processes for all brands and BI for different brands.

o Satisfy better sales cycle management, mortgage product performance analysis, financial

forecasting based on sales demands, fraud detection and general mortgage operational

reporting.

The Challenges

o The biggest challenge was to provide a scalable architecture for consolidating a huge amount of data.

Our Solution

o Infosys designed and implemented a data warehouse solution to extract information from the

Mortgage Sales Application and administration systems of different brands and house them

in a single data warehouse database. This resulted in a highly scalable solution that met the

following requirements:

Transaction volume expected: 73 Million per year; annual growth rate of 110%

Size expected: 180 GB at the end of Year 1; annual growth rate of 45%
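To give a feel for what these growth rates imply, the short calculation below projects both figures over three years. It rests on assumptions the case study does not state: that 73 million is the first-year transaction volume, that growth compounds annually, and that "growth rate of 110%" means the volume increases by 110% each year (that is, multiplies by 2.1).

```python
def project(initial, annual_growth_rate, years):
    """Value in each of the first `years` years under compound annual growth."""
    return [initial * (1 + annual_growth_rate) ** year for year in range(years)]

# Assumed reading of the case-study figures (see the caveats above).
size_gb = project(180, 0.45, 3)                 # roughly 180, 261 and 378 GB
transactions_millions = project(73, 1.10, 3)    # roughly 73, 153 and 322 million
```

Under these assumptions the transaction volume more than quadruples in two years, which helps explain why scalability was the central design concern.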

Implementation Process

o Infosys followed an iterative phased approach to implement the solution that included the

following phases:

Business requirements analysis

Data warehouse dimensional modeling

Architecture design

ETL (Extract, Transform and Load) and business intelligence reporting development and

implementation

Benefits

o Highly scalable solution to meet the following requirements:

Transaction volume expected: 73 Million per year; annual growth rate of 110%

Size expected: 180 GB at the end of Year 1; annual growth rate of 45%

Integrating Data Mining System with a Database or Data Warehouse System

The data mining system needs to be integrated with a database or data warehouse system. If the data mining system is not integrated with any database or data warehouse system, there is no system for it to communicate with. This is known as the no-coupling scheme. In this scheme the main focus is on the data mining design and on developing efficient and effective algorithms for mining the available data sets.

Here is the list of Integration Schemes:

1) No Coupling

In this scheme the data mining system does not use any of the database or data warehouse functions. It fetches the data from a particular source and processes it using some data mining algorithms. The data mining result is stored in another file.

2) Loose Coupling

In this scheme the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.

3) Semi-tight Coupling

In this scheme the data mining system is linked to a database or data warehouse system and, in addition to that linking, efficient implementations of a few data mining primitives can be provided in the database or data warehouse system.

4) Tight coupling

In this coupling scheme the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
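The difference between the schemes can be illustrated with an in-memory SQLite database. This is a rough sketch under assumptions: the `purchases` and `mining_result` tables, the sample rows and the toy "frequent item" task are hypothetical, and SQLite merely stands in for a database or data warehouse system.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", "bread"), ("a", "milk"), ("b", "bread"), ("c", "bread"), ("c", "milk")],
)

def mine_frequent_items(items, min_count):
    """Toy mining step: items bought at least `min_count` times."""
    counts = Counter(items)
    return sorted(item for item, n in counts.items() if n >= min_count)

# Loose coupling: the database only supplies the data; mining runs in the
# application and the result is written back to a designated table.
items = [row[0] for row in conn.execute("SELECT item FROM purchases")]
loose_result = mine_frequent_items(items, min_count=3)
conn.execute("CREATE TABLE mining_result (item TEXT)")
conn.executemany("INSERT INTO mining_result VALUES (?)", [(i,) for i in loose_result])

# Semi-tight coupling: a primitive (counting per item) is pushed into the
# database, so the application receives already aggregated results.
semi_tight_result = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM purchases GROUP BY item HAVING COUNT(*) >= ? ORDER BY item",
        (3,),
    )
]

stored = [row[0] for row in conn.execute("SELECT item FROM mining_result")]
```

With no coupling, the same `mine_frequent_items` function would instead read its input from a flat file and write its output to another file, without touching the database at all.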

Bibliography

http://forum.jntuworld.com/showthread.php?3818-Data-Warehousing-and-Data-Mining-(DWDM)-Unit-wise-Notes-All-8-Units

http://www.thearling.com/text/dmwhite/dmwhite.htm

http://www.thearling.com/text/dmtechniques/dmtechniques.htm

http://www.infosys.com/consulting/information-management/case-studies/Pages/data-warehousing-solutions.aspx

http://www.watchwise.net/data-warehousing.htm

http://www.information-management.com/issues/19990101/232-1.html

DEPARTMENT OF BUSINESS AND INDUSTRIAL

MANAGEMENT

TERM ASSIGNMENT 2014-15

INFORMATION TECHNOLOGY FOR BUSINESS

FYMBA- SEM-I

SECTION-A

TOPIC: DATA MINING AND DATA WAREHOUSING

GROUP NUMBER: 9

BY,

16: CHAWLA DIVYA

23: GANDHI SANI

43: LADHNI ROMA

SUBMITTED ON -24TH DECEMBER, 2014

SUBMITTED TO – DR. JAYDEEP CHAUDRY