data mining,data mining process,presentation
TRANSCRIPT
7/28/2019 data mining,data mining process,presentation
DATA MINING
Data mining, also called knowledge discovery in databases, is, in computer
science, the process of discovering interesting and useful patterns and
relationships in large volumes of data. The field combines tools from statistics
and artificial intelligence (such as neural networks and machine learning) with
database management to analyse large digital collections, known as data sets.
Data mining is widely used in business (insurance, banking, retail), scientific
research (astronomy, medicine), and government security (detection of
criminals and terrorists).
The proliferation of numerous large, and sometimes connected, government and
private databases has led to regulations to ensure that individual records are
accurate and secure from unauthorized viewing or tampering. Most types of
data mining are targeted toward ascertaining general knowledge about a group
rather than knowledge about specific individuals (a supermarket is less
concerned about selling one more item to one person than about selling many
items to many people), though pattern analysis also may be used to discern
anomalous individual behaviour such as fraud or other criminal activity.
Origins and early applications
As computer storage capacities increased during the 1980s, many companies
began to store more transactional data. The resulting record collections, often
called data warehouses, were too large to be analysed with traditional statistical
approaches. Several computer science conferences and workshops were held to
consider how recent advances in the field of artificial intelligence (AI), such as
discoveries from expert systems, genetic algorithms, machine learning, and
neural networks, could be adapted for knowledge discovery (the preferred term
in the computer science community). The process led in 1995 to the First
International Conference on Knowledge Discovery and Data Mining, held in
Montreal, and the launch in 1997 of the journal Data Mining and Knowledge
Discovery. This was also the period when many early data-mining companies
were formed and products were introduced.
One of the earliest successful applications of data mining, perhaps second only
to marketing research, was credit-card-fraud detection. By studying a
consumer's purchasing behaviour, a typical pattern usually becomes apparent;
purchases made outside this pattern can then be flagged for later investigation or
to deny a transaction. However, the wide variety of normal behaviours makes
this challenging; no single distinction between normal and fraudulent behaviour
works for everyone or all the time. Every individual is likely to make some
purchases that differ from the types they have made before, so relying on what is
normal for a single individual is likely to give too many false alarms. One
approach to improving reliability is first to group individuals that have similar
purchasing patterns, since group models are less sensitive to minor anomalies.
For example, a "frequent business travellers" group will likely have a pattern
that includes unprecedented purchases in diverse locations, but members of this
group might be flagged for other transactions, such as catalogue purchases, that
do not fit that group's profile.
Modeling and data-mining approaches
The complete data-mining process involves multiple steps, from understanding
the goals of a project and what data are available to implementing process
changes based on the final analysis. The three key computational steps are the
model-learning process, model evaluation, and use of the model. This division
is clearest with classification of data. Model learning occurs when one
algorithm is applied to data about which the group (or class) attribute is known
in order to produce a classifier, or an algorithm learned from the data. The
classifier is then tested with an independent evaluation set that contains data
with known attributes. The extent to which the models classifications agree
with the known class for the target attribute can then be used to determine the
expected accuracy of the model. If the model is sufficiently accurate, it can be
used to classify data for which the target attribute is unknown.
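The three computational steps above can be sketched with a toy classifier. The following is a minimal illustration in pure Python; the data and the nearest-neighbour rule are invented for the example, not taken from the text:

```python
# Minimal sketch of the three computational steps (hypothetical data):
# 1. learn a model from labelled data, 2. evaluate it on an independent
# set, 3. use it to classify records whose class is unknown.

def nearest_neighbour(train, point):
    """Classify `point` by the label of the closest training record."""
    return min(train, key=lambda r: sum((a - b) ** 2
                                        for a, b in zip(r[0], point)))[1]

# Step 1: model learning -- here "learning" is just storing labelled records.
training_set = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
                ((5.0, 5.1), "high"), ((4.8, 5.3), "high")]

# Step 2: evaluation on an independent set with known classes.
evaluation_set = [((0.9, 1.1), "low"), ((5.2, 4.9), "high")]
correct = sum(nearest_neighbour(training_set, x) == y
              for x, y in evaluation_set)
accuracy = correct / len(evaluation_set)
print(accuracy)  # 1.0 on this toy data

# Step 3: use the model on data whose class is unknown.
print(nearest_neighbour(training_set, (4.9, 5.0)))  # "high"
```

In practice the evaluation set must be held out from training, exactly as the text describes, or the accuracy estimate is optimistic.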
Data-mining techniques
There are many types of data mining, typically divided by the kind of
information (attributes) known and the type of knowledge sought from the data-
mining model.
PREDICTIVE MODELING
Predictive modeling is used when the goal is to estimate the value of a particular
target attribute and there exist sample training data for which values of that
attribute are known. An example is classification, which takes a set of data
already divided into predefined groups and searches for patterns in the data that
differentiate those groups. These discovered patterns then can be used to
classify other data where the right group designation for the target attribute is
unknown (though other attributes may be known). For instance, a manufacturer
could develop a predictive model that distinguishes parts that fail under extreme
heat, extreme cold, or other conditions based on their manufacturing
environment, and this model may then be used to determine appropriate
applications for each part. Another technique employed in predictive modeling
is regression analysis, which can be used when the target attribute is a numeric
value and the goal is to predict that value for new data.
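Regression analysis as described above can be sketched with an ordinary least-squares fit on one attribute. The training data below are invented for illustration:

```python
# A minimal regression sketch: fit a least-squares line to training data
# where the numeric target (y) is known, then predict it for new data.

def fit_line(xs, ys):
    """Return slope and intercept minimising squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Training data with known target values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]
slope, intercept = fit_line(xs, ys)

# Predict the numeric target attribute for a new record.
print(round(slope * 5.0 + intercept, 2))  # 9.9
```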
-
7/28/2019 data mining,data mining process,presentation
4/13
DESCRIPTIVE MODELING
Descriptive modeling, or clustering, also divides data into groups. With
clustering, however, the proper groups are not known in advance; the patterns
discovered by analysing the data are used to determine the groups. For example,
an advertiser could analyse a general population in order to classify potential
customers into different clusters and then develop separate advertising
campaigns targeted to each group. Fraud detection also makes use of clustering
to identify groups of individuals with similar purchasing patterns.
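Clustering of the kind described above can be sketched with the classic k-means algorithm. This is a minimal pure-Python version with invented points and fixed starting centres, not a production implementation:

```python
# A minimal clustering sketch: k-means on 2-D points (hypothetical
# customer features). Groups are not known in advance; they emerge
# from the data.

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            i = min(range(len(centres)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centres[i])))
            groups[i].append(p)
        # Move each centre to the mean of its group.
        centres = [tuple(sum(c) / len(g) for c in zip(*g)) if g else c0
                   for g, c0 in zip(groups, centres)]
    return centres, groups

points = [(1, 1), (1.5, 2), (5, 8), (8, 8), (1, 0.5), (9, 11)]
centres, groups = kmeans(points, centres=[(1, 1), (5, 8)])
print(centres)  # two cluster centres, one per discovered group
```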
Pattern mining concentrates on identifying rules that describe specific patterns
within the data. Market-basket analysis, which identifies items that typically
occur together in purchase transactions, was one of the first applications of data
mining. For example, supermarkets used market-basket analysis to identify
items that were often purchased together; for instance, a store featuring a fish
sale would also stock up on tartar sauce. Although testing for such associations
has long been feasible and is often simple to see in small data sets, data mining
has enabled the discovery of less apparent associations in immense data sets. Of
most interest is the discovery of unexpected associations, which may open new
avenues for marketing or research. Another important use of pattern mining is
the discovery of sequential patterns; for example, sequences of errors or
warnings that precede an equipment failure may be used to schedule
preventative maintenance or may provide insight into a design flaw.
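Market-basket analysis as described above can be sketched by counting item pairs across transactions. The items and the support threshold below are illustrative:

```python
# A minimal market-basket sketch: count how often pairs of items occur
# together across purchase transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"fish", "tartar sauce", "lemon"},
    {"fish", "tartar sauce"},
    {"bread", "butter"},
    {"fish", "lemon"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs seen in at least half of all transactions.
min_support = len(transactions) / 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

On small data such associations are easy to see by eye; the point of data mining is that the same counting scales to immense transaction sets where they are not.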
ANOMALY DETECTION
Anomaly detection can be viewed as the flip side of clustering: finding
data instances that are unusual and do not fit any established pattern. Fraud
detection is an example of anomaly detection. Although fraud detection may be
viewed as a problem for predictive modeling, the relative rarity of fraudulent
transactions and the speed with which criminals develop new types of fraud
mean that any predictive model is likely to be of low accuracy and to quickly
become out of date. Thus, anomaly detection instead concentrates on modeling
what is normal behaviour in order to identify unusual transactions. Anomaly
detection also is used with various monitoring systems, such as for intrusion
detection.
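The idea of modeling what is normal in order to flag the unusual can be sketched very simply: treat past transaction amounts as the model and flag values many standard deviations away. The amounts and threshold below are illustrative:

```python
# A minimal anomaly-detection sketch: model "normal" behaviour as the
# mean and standard deviation of past transaction amounts, then flag
# values far from that model.
from statistics import mean, stdev

history = [23.0, 19.5, 25.1, 22.4, 18.9, 24.3, 21.7, 20.8]
mu, sigma = mean(history), stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mu) / sigma > threshold

print(is_anomalous(22.0))   # a typical amount
print(is_anomalous(480.0))  # flagged for investigation
```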
Numerous other data-mining techniques have been developed, including pattern
discovery in time series data (e.g., stock prices), streaming data (e.g., sensor
networks), and relational learning (e.g., social networks).
The potential for invasion of privacy using data mining has been a concern for
many people. Commercial databases may contain detailed records of people's
medical history, purchase transactions, and telephone usage, among other
aspects of their lives. Civil libertarians consider some databases held by
businesses and governments to be an unwarranted intrusion and an invitation to
abuse. For example, the American Civil Liberties Union sued the U.S. National
Security Agency (NSA) alleging warrantless spying on American citizens
through the acquisition of call records from some American telecommunication
companies. The program, which began in 2001, was not discovered by the
public until 2006, when the information began to leak out. Often the risk is not
from data mining itself (which usually aims to produce general knowledge
rather than to learn information about specific issues) but from misuse or
inappropriate disclosure of information in these databases.
In the United States, many federal agencies are now required to produce annual
reports that specifically address the privacy implications of their data-mining
projects. The U.S. law requiring privacy reports from federal agencies defines
data mining quite restrictively as "analyses to discover or locate a predictive
pattern or anomaly indicative of terrorist or criminal activity on the part of any
individual or individuals." As various local, national, and international law-
enforcement agencies have begun to share or integrate their databases, the
potential for abuse or security breaches has forced governments to work with
industry on developing more secure computers and networks. In particular,
there has been research in techniques for privacy-preserving data mining that
operate on distorted, transformed, or encrypted data to decrease the risk of
disclosure of any individual's data.
Data mining is evolving, with one driver being competitions on challenge
problems. A commercial example of this was the $1 million Netflix Prize.
Netflix, an American company that offers movie rentals delivered by mail or
streamed over the Internet, began the contest in 2006 to see if anyone could
improve by 10 per cent its recommendation system, an algorithm for predicting
an individual's movie preferences based on previous rental data. The prize was
awarded on Sept. 21, 2009, to BellKor's Pragmatic Chaos, a team of seven
mathematicians, computer scientists, and engineers from the United States,
Canada, Austria, and Israel who had achieved the 10 per cent goal on June 26,
2009, and finalized their victory with an improved algorithm 30 days later. The
three-year open competition had spurred many clever data-mining innovations
from contestants. For example, the 2007 and 2008 Conferences on Knowledge
Discovery and Data Mining held workshops on the Netflix Prize, at which
research papers were presented on topics ranging from new collaborative
filtering techniques to faster matrix factorization (a key component of many
recommendation systems). Concerns over privacy of such data have also led to
advances in understanding privacy and anonymity.
Data mining is not a panacea, however, and results must be viewed with the
same care as with any statistical analysis. One of the strengths of data mining is
the ability to analyze quantities of data that would be impractical to analyze
manually, and the patterns found may be complex and difficult for humans to
understand; this complexity requires care in evaluating the patterns.
Nevertheless, statistical evaluation techniques can result in knowledge that is
free from human bias, and the large amount of data can reduce biases inherent
in smaller samples. Used properly, data mining provides valuable insights into
large data sets that otherwise would not be practical or possible to obtain.
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled.
As data mining can only uncover patterns actually present in the data, the target
data set must be large enough to contain these patterns while remaining concise
enough to be mined within an acceptable time limit. A common source for data
is a data mart or data warehouse. Pre-processing is essential to prepare such
multivariate data sets for mining. The target set is then cleaned: data cleaning
removes observations containing noise and those with missing data.
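The cleaning step can be sketched as follows; the field names and the validity rule used to define "noise" are hypothetical:

```python
# A minimal pre-processing sketch: drop records with missing fields and
# filter obviously noisy values before mining.

raw_records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing data -> removed
    {"age": 29, "income": 51000},
    {"age": -5, "income": 60000},     # impossible value (noise) -> removed
]

def clean(records):
    """Keep only complete records with a plausible age."""
    complete = [r for r in records if all(v is not None for v in r.values())]
    return [r for r in complete if 0 <= r["age"] <= 120]

target_set = clean(raw_records)
print(len(target_set))  # 2 usable records remain
```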
Data mining involves several common classes of tasks:
1. Anomaly detection (outlier/change/deviation detection): the identification of
unusual data records that might be interesting, or of data errors that require
further investigation.
2. Association rule learning (dependency modeling): searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this
information for marketing purposes. This is sometimes referred to as
market-basket analysis.
3. Clustering: the task of discovering groups and structures in the data that are
in some way or another "similar", without using known structures in the data.
4. Classification: the task of generalizing known structure to apply to new data.
For example, an e-mail program might attempt to classify an e-mail as
"legitimate" or as "spam".
5. Regression: attempts to find a function that models the data with the least
error.
6. Summarization: providing a more compact representation of the data set,
including visualization and report generation.
7. Sequential pattern mining: finds sets of data items that occur together
frequently in some sequences. Sequential pattern mining, which extracts
frequent subsequences from a sequence database, has attracted a great deal of
interest in recent data-mining research because it is the basis of many
applications, such as web user analysis, stock trend prediction, DNA sequence
analysis, finding language or linguistic patterns in natural-language texts, and
using the history of symptoms to predict certain kinds of disease.
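Sequential pattern mining can be sketched by counting the support of a candidate subsequence across event sequences. The event names below are invented, in the spirit of the warnings-before-failure example given earlier:

```python
# A minimal sequential-pattern sketch: count in how many sequences a
# candidate subsequence appears in order (gaps between events allowed).

def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    return all(event in it for event in pattern)

logs = [
    ["warn_A", "ok", "warn_B", "failure"],
    ["ok", "warn_A", "warn_B", "ok", "failure"],
    ["warn_B", "ok", "warn_A"],
]

pattern = ["warn_A", "warn_B", "failure"]
support = sum(contains(seq, pattern) for seq in logs)
print(support)  # the pattern occurs in 2 of the 3 sequences
```

Real algorithms (e.g. in the Apriori family) generate and prune candidate subsequences automatically; this sketch only shows the support-counting core.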
In Science and engineering
In recent years, data mining has been used widely in the areas of science and
engineering, such as bioinformatics, genetics, medicine, education and
electrical power engineering.
In the study of human genetics, sequence mining helps address the important
goal of understanding the mapping relationship between the inter-individual
variations in human DNA sequence and the variability in disease
susceptibility. In simple terms, it aims to find out how the changes in an
individual's DNA sequence affect the risks of developing common diseases
such as cancer, which is of great importance to improving methods of
diagnosing, preventing, and treating these diseases. The data mining method
that is used to perform this task is known as multifactor dimensionality
reduction.
In the area of electrical power engineering, data mining methods have been
widely used for condition monitoring of high voltage electrical equipment.
The purpose of condition monitoring is to obtain valuable information on,
for example, the status of the insulation (or other important safety-related
parameters). Data clustering techniques such as the self-organizing map
(SOM), have been applied to vibration monitoring and analysis of
transformer on-load tap-changers (OLTCs). Using vibration monitoring, it
can be observed that each tap change operation generates a signal that
contains information about the condition of the tap changer contacts and the
drive mechanisms. Obviously, different tap positions will generate different
signals. However, there was considerable variability amongst normal
condition signals for exactly the same tap position. SOM has been applied to
detect abnormal conditions and to hypothesize about the nature of the
abnormalities.
Data mining methods have also been applied to dissolved gas analysis
(DGA) in power transformers. DGA, as a diagnostics for power
transformers, has been available for many years. Methods such as SOM have
been applied to analyze generated data and to determine trends that are not
obvious to the standard DGA ratio methods (such as Duval Triangle).
Another example of data mining in science and engineering is found in
educational research, where data mining has been used to study the factors
leading students to choose to engage in behaviours which reduce their
learning, and to understand factors influencing university student retention. A
similar example of social application of data mining is its use in expertise
finding systems, whereby descriptors of human expertise are extracted,
normalized, and classified so as to facilitate the finding of experts,
particularly in scientific and technical fields. In this way, data mining can
facilitate institutional memory.
Other examples of applications of data-mining methods include the analysis of
biomedical data facilitated by domain ontologies, mining clinical trial data, and
traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre has,
since 1998, used data mining methods to routinely screen for reporting
patterns indicative of emerging drug safety issues in the WHO global
database of 4.6 million suspected adverse drug reaction incidents. Recently,
similar methodology has been developed to mine large collections of
electronic health records for temporal patterns associating drug prescriptions
to medical diagnoses.
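One common disproportionality measure used in this kind of screening is the proportional reporting ratio (PRR). The sketch below uses invented counts and is offered as a general illustration, not as a description of the Uppsala Monitoring Centre's exact method:

```python
# A sketch of disproportionality screening for adverse-drug-reaction
# surveillance: the proportional reporting ratio (PRR) over a 2x2
# contingency table of report counts (all counts invented).

def prr(a, b, c, d):
    """a: drug & event, b: drug & other events,
    c: other drugs & event, d: other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))

score = prr(a=40, b=960, c=200, d=98800)
print(round(score, 1))  # a PRR well above 1 suggests a reporting signal
```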
Subject-based data mining
"Subject-based data mining" is a data mining method involving the search
for associations between individuals in data. In the context of combating
terrorism, the National Research Council provides the following definition:
"Subject-based data mining uses an initiating individual or other datum that
is considered, based on other information, to be of high interest, and the goal
is to determine what other persons or financial transactions or movements,
etc., are related to that initiating datum."
Spatial data mining
Spatial data mining is the application of data mining methods to spatial data.
The end objective of spatial data mining is to find patterns in data with
respect to geography. So far, data mining and Geographic Information
Systems (GIS) have existed as two separate technologies, each with its own
methods, traditions, and approaches to visualization and data analysis.
Particularly, most contemporary GIS have only very basic spatial analysis
functionality. The immense explosion in geographically referenced data
occasioned by developments in IT, digital mapping, remote sensing, and the
global diffusion of GIS emphasizes the importance of developing data-
driven inductive approaches to geographical analysis and modeling.
Data mining offers great potential benefits for GIS-based applied decision-
making. Recently, the task of integrating these two technologies has become
of critical importance, especially as various public and private sector
organizations possessing huge databases with thematic and geographically
referenced data begin to realize the huge potential of the information
contained therein. Among those organizations are:
1. offices requiring analysis or dissemination of geo-referenced statistical data
2. public health services searching for explanations of disease clustering
3. environmental agencies assessing the impact of changing land-use patterns
on climate change
4. geo-marketing companies doing customer segmentation based on spatial
location.
Challenges in spatial data mining: Geospatial data repositories tend to be very
large. Moreover, existing GIS datasets are often splintered into feature and
attribute components that are conventionally archived in hybrid data
management systems. Algorithmic requirements differ substantially for
relational (attribute) data management and for topological (feature) data
management. Related to this is the range and diversity of geographic data
formats, which present unique challenges. The digital geographic data
revolution is creating new types of data formats beyond the traditional
"vector" and "raster" formats. Geographic data repositories increasingly
include ill-structured data, such as imagery and geo-referenced multi-media.
There are several critical research challenges in geographic knowledge
discovery and data mining. Miller and Han offer the following list of
emerging research topics in the field:
Developing and supporting geographic data warehouses (GDWs): Spatial
properties are often reduced to simple spatial attributes in mainstream data
warehouses. Creating an integrated GDW requires solving issues of spatial
and temporal data interoperability including differences in semantics,
referencing systems, geometry, accuracy, and position.
Better spatio-temporal representations in geographic knowledge discovery:
Current geographic knowledge discovery (GKD) methods generally use very
simple representations of geographic objects and spatial relationships.
Geographic data mining methods should recognize more complex
geographic objects (e.g., lines and polygons) and relationships (e.g., non-
Euclidean distances, direction, connectivity, and interaction through
attributed geographic space such as terrain). Furthermore, the time
dimension needs to be more fully integrated into these geographic
representations and relationships.
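One of the non-Euclidean relationships mentioned above, distance over the curved surface of the Earth, can be sketched directly with the great-circle (haversine) formula; the coordinates below are approximate and purely illustrative:

```python
# A sketch of a non-Euclidean geographic distance: great-circle
# (haversine) distance, which a GKD method might use in place of
# straight-line distance between points on a map.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(h))

# London to Montreal, roughly 5,200 km along the great circle.
print(round(haversine_km(51.5, -0.13, 45.5, -73.57)))
```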
Geographic knowledge discovery using diverse data types: GKD methods
should be developed that can handle diverse data types beyond the
traditional raster and vector models, including imagery and geo-referenced
multimedia, as well as dynamic data types (video streams, animation).
Knowledge grid
Knowledge discovery "On the Grid" generally refers to conducting
knowledge discovery in an open environment using grid computing
concepts, allowing users to integrate data from various online data sources,
as well as to make use of remote resources, for executing their data-mining tasks.
The earliest example was the Discovery Net, developed at Imperial College
London, which won the "Most Innovative Data-Intensive Application
Award" at the ACM SC02 (Supercomputing 2002) conference and
exhibition, based on a demonstration of a fully interactive distributed
knowledge discovery application for bioinformatics. Other examples include
work conducted by researchers at the University of Calabria, who developed a
Knowledge Grid architecture for distributed knowledge discovery based on
grid computing.