data mining,data mining process,presentation
TRANSCRIPT
7/28/2019 data mining,data mining process,presentation
DATA MINING
Data mining, also called knowledge discovery in databases, is, in computer
science, the process of discovering interesting and useful patterns and
relationships in large volumes of data. The field combines tools from statistics
and artificial intelligence (such as neural networks and machine learning) with
database management to analyse large digital collections, known as data sets.
Data mining is widely used in business (insurance, banking, retail), scientific
research (astronomy, medicine), and government security (detection of
criminals and terrorists).
The proliferation of numerous large, and sometimes connected, government and
private databases has led to regulations to ensure that individual records are
accurate and secure from unauthorized viewing or tampering. Most types of
data mining are targeted toward ascertaining general knowledge about a group
rather than knowledge about specific individuals (a supermarket is less
concerned about selling one more item to one person than about selling many
items to many people), though pattern analysis also may be used to discern
anomalous individual behaviour such as fraud or other criminal activity.
Origins and early applications
As computer storage capacities increased during the 1980s, many companies
began to store more transactional data. The resulting record collections, often
called data warehouses, were too large to be analysed with traditional statistical
approaches. Several computer science conferences and workshops were held to
consider how recent advances in the field of artificial intelligence (AI), such as
discoveries from expert systems, genetic algorithms, machine learning, and
neural networks, could be adapted for knowledge discovery (the preferred term
in the computer science community). The process led in 1995 to the First
International Conference on Knowledge Discovery and Data Mining, held in
Montreal, and the launch in 1997 of the journal Data Mining and Knowledge
Discovery. This was also the period when many early data-mining companies
were formed and products were introduced.
One of the earliest successful applications of data mining, perhaps second only
to marketing research, was credit-card-fraud detection. By studying a
consumer's purchasing behaviour, a typical pattern usually becomes apparent;
purchases made outside this pattern can then be flagged for later investigation or
to deny a transaction. However, the wide variety of normal behaviours makes
this challenging; no single distinction between normal and fraudulent behaviour
works for everyone or all the time. Every individual is likely to make some
purchases that differ from the types they have made before, so relying on what is
normal for a single individual is likely to give too many false alarms. One
approach to improving reliability is first to group individuals that have similar
purchasing patterns, since group models are less sensitive to minor anomalies.
For example, a "frequent business travellers" group will likely have a pattern
that includes unprecedented purchases in diverse locations, but members of this
group might be flagged for other transactions, such as catalogue purchases, that
do not fit that group's profile.
Modeling and data-mining approaches
The complete data-mining process involves multiple steps, from understanding
the goals of a project and what data are available to implementing process
changes based on the final analysis. The three key computational steps are the
model-learning process, model evaluation, and use of the model. This division
is clearest with classification of data. Model learning occurs when one
algorithm is applied to data about which the group (or class) attribute is known
in order to produce a classifier, or an algorithm learned from the data. The
classifier is then tested with an independent evaluation set that contains data
with known attributes. The extent to which the models classifications agree
with the known class for the target attribute can then be used to determine the
expected accuracy of the model. If the model is sufficiently accurate, it can be
used to classify data for which the target attribute is unknown.
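The three computational steps above can be sketched with a toy classifier. The following is a minimal illustration in pure Python; the data and the nearest-neighbour rule are invented for the example, not taken from the text:

```python
# Minimal sketch of the three computational steps (hypothetical data):
# 1. learn a model from labelled data, 2. evaluate it on an independent
# set, 3. use it to classify records whose class is unknown.

def nearest_neighbour(train, point):
    """Classify `point` by the label of the closest training record."""
    return min(train, key=lambda r: sum((a - b) ** 2
                                        for a, b in zip(r[0], point)))[1]

# Step 1: model learning -- here "learning" is just storing labelled records.
training_set = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
                ((5.0, 5.1), "high"), ((4.8, 5.3), "high")]

# Step 2: evaluation on an independent set with known classes.
evaluation_set = [((0.9, 1.1), "low"), ((5.2, 4.9), "high")]
correct = sum(nearest_neighbour(training_set, x) == y
              for x, y in evaluation_set)
accuracy = correct / len(evaluation_set)
print(accuracy)  # 1.0 on this toy data

# Step 3: use the model on data whose class is unknown.
print(nearest_neighbour(training_set, (4.9, 5.0)))  # "high"
```

In practice the evaluation set must be held out from training, exactly as the text describes, or the accuracy estimate is optimistic.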
Data-mining techniques
There are many types of data mining, typically divided by the kind of
information (attributes) known and the type of knowledge sought from the data-
mining model.
PREDICTIVE MODELING
Predictive modeling is used when the goal is to estimate the value of a particular
target attribute and there exist sample training data for which values of that
attribute are known. An example is classification, which takes a set of data
already divided into predefined groups and searches for patterns in the data that
differentiate those groups. These discovered patterns then can be used to
classify other data where the right group designation for the target attribute is
unknown (though other attributes may be known). For instance, a manufacturer
could develop a predictive model that distinguishes parts that fail under extreme
heat, extreme cold, or other conditions based on their manufacturing
environment, and this model may then be used to determine appropriate
applications for each part. Another technique employed in predictive modeling
is regression analysis, which can be used when the target attribute is a numeric
value and the goal is to predict that value for new data.
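Regression analysis as described above can be sketched with an ordinary least-squares fit on one attribute. The training data below are invented for illustration:

```python
# A minimal regression sketch: fit a least-squares line to training data
# where the numeric target (y) is known, then predict it for new data.

def fit_line(xs, ys):
    """Return slope and intercept minimising squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Training data with known target values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]
slope, intercept = fit_line(xs, ys)

# Predict the numeric target attribute for a new record.
print(round(slope * 5.0 + intercept, 2))  # 9.9
```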
-
7/28/2019 data mining,data mining process,presentation
4/13
DESCRIPTIVE MODELING
Descriptive modeling, or clustering, also divides data into groups. With
clustering, however, the proper groups are not known in advance; the patterns
discovered by analysing the data are used to determine the groups. For example,
an advertiser could analyse a general population in order to classify potential
customers into different clusters and then develop separate advertising
campaigns targeted to each group. Fraud detection also makes use of clustering
to identify groups of individuals with similar purchasing patterns.
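Clustering of the kind described above can be sketched with the classic k-means algorithm. This is a minimal pure-Python version with invented points and fixed starting centres, not a production implementation:

```python
# A minimal clustering sketch: k-means on 2-D points (hypothetical
# customer features). Groups are not known in advance; they emerge
# from the data.

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            i = min(range(len(centres)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centres[i])))
            groups[i].append(p)
        # Move each centre to the mean of its group.
        centres = [tuple(sum(c) / len(g) for c in zip(*g)) if g else c0
                   for g, c0 in zip(groups, centres)]
    return centres, groups

points = [(1, 1), (1.5, 2), (5, 8), (8, 8), (1, 0.5), (9, 11)]
centres, groups = kmeans(points, centres=[(1, 1), (5, 8)])
print(centres)  # two cluster centres, one per discovered group
```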
Pattern mining concentrates on identifying rules that describe specific patterns
within the data. Market-basket analysis, which identifies items that typically
occur together in purchase transactions, was one of the first applications of data
mining. For example, supermarkets used market-basket analysis to identify
items that were often purchased together; for instance, a store featuring a fish
sale would also stock up on tartar sauce. Although testing for such associations
has long been feasible and is often simple to see in small data sets, data mining
has enabled the discovery of less apparent associations in immense data sets. Of
most interest is the discovery of unexpected associations, which may open new
avenues for marketing or research. Another important use of pattern mining is
the discovery of sequential patterns; for example, sequences of errors or
warnings that precede an equipment failure may be used to schedule
preventative maintenance or may provide insight into a design flaw.
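Market-basket analysis as described above can be sketched by counting item pairs across transactions. The items and the support threshold below are illustrative:

```python
# A minimal market-basket sketch: count how often pairs of items occur
# together across purchase transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"fish", "tartar sauce", "lemon"},
    {"fish", "tartar sauce"},
    {"bread", "butter"},
    {"fish", "lemon"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs seen in at least half of all transactions.
min_support = len(transactions) / 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

On small data such associations are easy to see by eye; the point of data mining is that the same counting scales to immense transaction sets where they are not.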
ANOMALY DETECTION
Anomaly detection can be viewed as the flip side of clustering: finding
data instances that are unusual and do not fit any established pattern. Fraud
detection is an example of anomaly detection. Although fraud detection may be
viewed as a problem for predictive modeling, the relative rarity of fraudulent
transactions and the speed with which criminals develop new types of fraud
mean that any predictive model is likely to be of low accuracy and to quickly
become out of date. Thus, anomaly detection instead concentrates on modeling
what is normal behaviour in order to identify unusual transactions. Anomaly
detection also is used with various monitoring systems, such as for intrusion
detection.
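The idea of modeling what is normal in order to flag the unusual can be sketched very simply: treat past transaction amounts as the model and flag values many standard deviations away. The amounts and threshold below are illustrative:

```python
# A minimal anomaly-detection sketch: model "normal" behaviour as the
# mean and standard deviation of past transaction amounts, then flag
# values far from that model.
from statistics import mean, stdev

history = [23.0, 19.5, 25.1, 22.4, 18.9, 24.3, 21.7, 20.8]
mu, sigma = mean(history), stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mu) / sigma > threshold

print(is_anomalous(22.0))   # a typical amount
print(is_anomalous(480.0))  # flagged for investigation
```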
Numerous other data-mining techniques have been developed, including pattern
discovery in time series data (e.g., stock prices), streaming data (e.g., sensor
networks), and relational learning (e.g., social networks).
The potential for invasion of privacy using data mining has been a concern for
many people. Commercial databases may contain detailed records of people's
medical history, purchase transactions, and telephone usage, among other
aspects of their lives. Civil libertarians consider some databases held by
businesses and governments to be an unwarranted intrusion and an invitation to
abuse. For example, the American Civil Liberties Union sued the U.S. National
Security Agency (NSA) alleging warrantless spying on American citizens
through the acquisition of call records from some American telecommunication
companies. The program, which began in 2001, was not discovered by the
public until 2006, when the information began to leak out. Often the risk is not
from data mining itself (which usually aims to produce general knowledge
rather than to learn information about specific issues) but from misuse or
inappropriate disclosure of information in these databases.
In the United States, many federal agencies are now required to produce annual
reports that specifically address the privacy implications of their data-mining
projects. The U.S. law requiring privacy reports from federal agencies defines
data mining quite restrictively as "analyses to discover or locate a predictive
pattern or anomaly indicative of terrorist or criminal activity on the part of any
individual or individuals." As various local, national, and international law-
enforcement agencies have begun to share or integrate their databases, the
potential for abuse or security breaches has forced governments to work with
industry on developing more secure computers and networks. In particular,
there has been research in techniques for privacy-preserving data mining that
operate on distorted, transformed, or encrypted data to decrease the risk of
disclosure of any individual's data.
Data mining is evolving, with one driver being competitions on challenge
problems. A commercial example of this was the $1 million Netflix Prize.
Netflix, an American company that offers movie rentals delivered by mail or
streamed over the Internet, began the contest in 2006 to see if anyone could
improve by 10 per cent its recommendation system, an algorithm for predicting
an individual's movie preferences based on previous rental data. The prize was
awarded on Sept. 21, 2009, to BellKor's Pragmatic Chaos, a team of seven
mathematicians, computer scientists, and engineers from the United States,
Canada, Austria, and Israel who had achieved the 10 per cent goal on June 26,
2009, and finalized their victory with an improved algorithm 30 days later. The
three-year open competition had spurred many clever data-mining innovations
from contestants. For example, the 2007 and 2008 Conferences on Knowledge
Discovery and Data Mining held workshops on the Netflix Prize, at which
research papers were presented on topics ranging from new collaborative
filtering techniques to faster matrix factorization (a key component of many
recommendation systems). Concerns over privacy of such data have also led to
advances in understanding privacy and anonymity.
Data mining is not a panacea, however, and results must be viewed with the
same care as with any statistical analysis. One of the strengths of data mining is
the ability to analyze quantities of data that would be impractical to analyze
manually, and the patterns found may be complex and difficult for humans to
understand; this complexity requires care in evaluating the patterns.
Nevertheless, statistical evaluation techniques can result in knowledge that is
free from human bias, and the large amount of data can reduce biases inherent
in smaller samples. Used properly, data mining provides valuable insights into
large data sets that otherwise would not be practical or possible to obtain.
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled.
As data mining can only uncover patterns actually present in the data, the target
data set must be large enough to contain these patterns while remaining concise
enough to be mined within an acceptable time limit. A common source for data
is a data mart or data warehouse. Pre-processing is essential to prepare such
multivariate data sets for mining. The target set is then cleaned: data cleaning
removes observations containing noise and those with missing data.
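The cleaning step can be sketched as follows; the field names and the validity rule used to define "noise" are hypothetical:

```python
# A minimal pre-processing sketch: drop records with missing fields and
# filter obviously noisy values before mining.

raw_records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing data -> removed
    {"age": 29, "income": 51000},
    {"age": -5, "income": 60000},     # impossible value (noise) -> removed
]

def clean(records):
    """Keep only complete records with a plausible age."""
    complete = [r for r in records if all(v is not None for v in r.values())]
    return [r for r in complete if 0 <= r["age"] <= 120]

target_set = clean(raw_records)
print(len(target_set))  # 2 usable records remain
```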
Data mining involves several common classes of tasks:
1. Anomaly detection (outlier/change/deviation detection): the identification of
unusual data records that might be interesting, or of data errors that require
further investigation.
2. Association rule learning (dependency modeling): searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this
information for marketing purposes. This is sometimes referred to as
market-basket analysis.
3. Clustering: the task of discovering groups and structures in the data that are
in some way or another "similar", without using known structures in the data.
4. Classification: the task of generalizing known structure to apply to new data.
For example, an e-mail program might attempt to classify an e-mail as
"legitimate" or as "spam".
5. Regression: attempts to find a function that models the data with the least
error.
6. Summarization: providing a more compact representation of the data set,
including visualization and report generation.
7. Sequential pattern mining: finds sets of data items that occur together
frequently in some sequences. Sequential pattern mining, which extracts
frequent subsequences from a sequence database, has attracted a great deal of
interest in recent data-mining research because it is the basis of many
applications, such as web user analysis, stock trend prediction, DNA sequence
analysis, finding language or linguistic patterns in natural-language texts, and
using the history of symptoms to predict certain kinds of disease.
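Sequential pattern mining can be sketched by counting the support of a candidate subsequence across event sequences. The event names below are invented, in the spirit of the warnings-before-failure example given earlier:

```python
# A minimal sequential-pattern sketch: count in how many sequences a
# candidate subsequence appears in order (gaps between events allowed).

def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    return all(event in it for event in pattern)

logs = [
    ["warn_A", "ok", "warn_B", "failure"],
    ["ok", "warn_A", "warn_B", "ok", "failure"],
    ["warn_B", "ok", "warn_A"],
]

pattern = ["warn_A", "warn_B", "failure"]
support = sum(contains(seq, pattern) for seq in logs)
print(support)  # the pattern occurs in 2 of the 3 sequences
```

Real algorithms (e.g. in the Apriori family) generate and prune candidate subsequences automatically; this sketch only shows the support-counting core.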
In Science and engineering
In recent years, data mining has been used widely in the areas of science and
engineering, such as bioinformatics, genetics, medicine, education and
electrical power engineering.
In the study of human genetics, sequence mining helps address the important
goal of understanding the mapping relationship between the inter-individual
variations in human DNA sequence and the variability in disease
susceptibility. In simple terms, it aims to find out how the changes in an
individual's DNA sequence affect the risks of developing common diseases
such as cancer, which is of great importance to improving methods of
diagnosing, preventing, and treating these diseases. The data mining method
that is used to perform this task is known as multifactor dimensionality
reduction.
In the area of electrical power engineering, data mining methods have been
widely used for condition monitoring of high voltage electrical equipment.
The purpose of condition monitoring is to obtain valuable information on,
for example, the status of the insulation (or other important safety-related
parameters). Data clustering techniques such as the self-organizing map
(SOM), have been applied to vibration monitoring and analysis of
transformer on-load tap-changers (OLTCs). Using vibration monitoring, it
can be observed that each tap change operation generates a signal that
contains information about the condition of the tap changer contacts and the
drive mechanisms. Obviously, different tap positions will generate different
signals. However, there was considerable variability amongst normal
condition signals for exactly the same tap position. SOM has been applied to
detect abnormal conditions and to hypothesize about the nature of the
abnormalities.
Data mining methods have also been applied to dissolved gas analysis
(DGA) in power transformers. DGA, as a diagnostics for power
transformers, has been available for many years. Methods such as SOM have
been applied to analyze generated data and to determine trends that are not
obvious to the standard DGA ratio methods (such as Duval Triangle).
Another example of data mining in science and engineering is found in
educational research, where data mining has been used to study the factors
leading students to choose to engage in behaviours which reduce their
learning, and to understand factors influencing university student retention. A
similar example of social application of data mining is its use in expertise
finding systems, whereby descriptors of human expertise are extracted,
normalized, and classified so as to facilitate the finding of experts,
particularly in scientific and technical fields. In this way, data mining can
facilitate institutional memory.
Other examples of applications of data-mining methods include the analysis of
biomedical data facilitated by domain ontologies, mining clinical trial data, and
traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre has,
since 1998, used data mining methods to routinely screen for reporting
patterns indicative of emerging drug safety issues in the WHO global
database of 4.6 million suspected adverse drug reaction incidents. Recently,
similar methodology has been developed to mine large collections of
electronic health records for temporal patterns associating drug prescriptions
to medical diagnoses.
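One common disproportionality measure used in this kind of screening is the proportional reporting ratio (PRR). The sketch below uses invented counts and is offered as a general illustration, not as a description of the Uppsala Monitoring Centre's exact method:

```python
# A sketch of disproportionality screening for adverse-drug-reaction
# surveillance: the proportional reporting ratio (PRR) over a 2x2
# contingency table of report counts (all counts invented).

def prr(a, b, c, d):
    """a: drug & event, b: drug & other events,
    c: other drugs & event, d: other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))

score = prr(a=40, b=960, c=200, d=98800)
print(round(score, 1))  # a PRR well above 1 suggests a reporting signal
```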
Subject-based data mining
"Subject-based data mining" is a data mining method involving the search
for associations between individuals in data. In the context of combating
terrorism, the National Research Council provides the following definition:
"Subject-based data mining uses an initiating individual or other datum that
is considered, based on other information, to be of high interest, and the goal
is to determine what other persons or financial transactions or movements,
etc., are related to that initiating datum."
Spatial data mining
Spatial data mining is the application of data mining methods to spatial data.
The end objective of spatial data mining is to find patterns in data with
respect to geography. So far, data mining and Geographic Information
Systems (GIS) have existed as two separate technologies, each with its own
methods, traditions, and approaches to visualization and data analysis.
Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data
occasioned by developments in IT, digital mapping, remote sensing, and the
global diffusion of GIS emphasizes the importance of developing data-
driven inductive approaches to geographical analysis and modeling.
Data mining offers great potential benefits for GIS-based applied decision-
making. Recently, the task of integrating these two technologies has become
of critical importance, especially as various public and private sector
organizations possessing huge databases with thematic and geographically
referenced data begin to realize the huge potential of the information
contained therein. Among those organizations are:
1. offices requiring analysis or dissemination of geo-referenced statistical data
2. public health services searching for explanations of disease clustering
3. environmental agencies assessing the impact of changing land-use patterns
on climate change
4. geo-marketing companies doing customer segmentation based on spatial
location.
Challenges in spatial data mining: Geospatial data repositories tend to be very
large. Moreover, existing GIS datasets are often splintered into feature and
attribute components that are conventionally archived in hybrid data
management systems. Algorithmic requirements differ substantially for
relational (attribute) data management and for topological (feature) data
management. Related to this is the range and diversity of geographic data
formats, which present unique challenges. The digital geographic data
revolution is creating new types of data formats beyond the traditional
"vector" and "raster" formats. Geographic data repositories increasingly
include ill-structured data, such as imagery and geo-referenced multi-media.
There are several critical research challenges in geographic knowledge
discovery and data mining. Miller and Han offer the following list of
emerging research topics in the field:
Developing and supporting geographic data warehouses (GDWs): Spatial
properties are often reduced to simple spatial attributes in mainstream data
warehouses. Creating an integrated GDW requires solving issues of spatial
and temporal data interoperability including differences in semantics,
referencing systems, geometry, accuracy, and position.
Better spatio-temporal representations in geographic knowledge discovery:
Current geographic knowledge discovery (GKD) methods generally use very
simple representations of geographic objects and spatial relationships.
Geographic data mining methods should recognize more complex
geographic objects (e.g., lines and polygons) and relationships (e.g., non-
Euclidean distances, direction, connectivity, and interaction through
attributed geographic space such as terrain). Furthermore, the time
dimension needs to be more fully integrated into these geographic
representations and relationships.
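One of the non-Euclidean relationships mentioned above, distance over the curved surface of the Earth, can be sketched directly with the great-circle (haversine) formula; the coordinates below are approximate and purely illustrative:

```python
# A sketch of a non-Euclidean geographic distance: great-circle
# (haversine) distance, which a GKD method might use in place of
# straight-line distance between points on a map.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(h))

# London to Montreal, roughly 5,200 km along the great circle.
print(round(haversine_km(51.5, -0.13, 45.5, -73.57)))
```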
Geographic knowledge discovery using diverse data types: GKD methods
should be developed that can handle diverse data types beyond the
traditional raster and vector models, including imagery and geo-referenced
multimedia, as well as dynamic data types (video streams, animation).
Knowledge grid
Knowledge discovery "On the Grid" generally refers to conducting
knowledge discovery in an open environment using grid computing
concepts, allowing users to integrate data from various online data sources,
as well as to make use of remote resources, for executing their data-mining tasks.
The earliest example was the Discovery Net, developed at Imperial College
London, which won the "Most Innovative Data-Intensive Application
Award" at the ACM SC02 (Supercomputing 2002) conference and
exhibition, based on a demonstration of a fully interactive distributed
knowledge discovery application for bioinformatics. Other examples include
work conducted by researchers at the University of Calabria, who developed a
Knowledge Grid architecture for distributed knowledge discovery based on
grid computing.