Data mining, data mining process: presentation



    DATA MINING

    In computer science, data mining, also called knowledge discovery in

    databases, is the process of discovering interesting and useful patterns and

    relationships in large volumes of data. The field combines tools from statistics

    and artificial intelligence (such as neural networks and machine learning) with

    database management to analyse large digital collections, known as data sets.

    Data mining is widely used in business (insurance, banking, retail), scientific

    research (astronomy, medicine), and government security (detection of

    criminals and terrorists).

    The proliferation of numerous large, and sometimes connected, government and

    private databases has led to regulations to ensure that individual records are

    accurate and secure from unauthorized viewing or tampering. Most types of

    data mining are targeted toward ascertaining general knowledge about a group

    rather than knowledge about specific individuals; a supermarket is less

    concerned about selling one more item to one person than about selling many

    items to many people, though pattern analysis also may be used to discern

    anomalous individual behaviour such as fraud or other criminal activity.

    Origins and early applications

    As computer storage capacities increased during the 1980s, many companies

    began to store more transactional data. The resulting record collections, often

    called data warehouses, were too large to be analysed with traditional statistical

    approaches. Several computer science conferences and workshops were held to

    consider how recent advances in the field of artificial intelligence (AI), such as discoveries from expert systems, genetic algorithms, machine learning, and


    neural networks, could be adapted for knowledge discovery (the preferred term

    in the computer science community). The process led in 1995 to the First

    International Conference on Knowledge Discovery and Data Mining, held in

    Montreal, and the launch in 1997 of the journal Data Mining and Knowledge

    Discovery. This was also the period when many early data-mining companies

    were formed and products were introduced.

    One of the earliest successful applications of data mining, perhaps second only

    to marketing research, was credit-card fraud detection. By studying a consumer's purchasing behaviour, a typical pattern usually becomes apparent;

    purchases made outside this pattern can then be flagged for later investigation or

    to deny a transaction. However, the wide variety of normal behaviours makes

    this challenging; no single distinction between normal and fraudulent behaviour

    works for everyone or all the time. Every individual is likely to make some

    purchases that differ from the types he has made before, so relying on what is

    normal for a single individual is likely to give too many false alarms. One

    approach to improving reliability is first to group individuals that have similar

    purchasing patterns, since group models are less sensitive to minor anomalies.

    For example, a "frequent business travellers" group will likely have a pattern

    that includes unprecedented purchases in diverse locations, but members of this

    group might be flagged for other transactions, such as catalogue purchases, that

    do not fit that group's profile.

    Modeling and data-mining approaches

    The complete data-mining process involves multiple steps, from understanding

    the goals of a project and what data are available to implementing process changes based on the final analysis. The three key computational steps are the


    model-learning process, model evaluation, and use of the model. This division

    is clearest with classification of data. Model learning occurs when one

    algorithm is applied to data about which the group (or class) attribute is known

    in order to produce a classifier, or an algorithm learned from the data. The

    classifier is then tested with an independent evaluation set that contains data

    with known attributes. The extent to which the model's classifications agree

    with the known class for the target attribute can then be used to determine the

    expected accuracy of the model. If the model is sufficiently accurate, it can be

    used to classify data for which the target attribute is unknown.
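    As a minimal sketch of these three steps, the Python snippet below (using scikit-learn and a synthetic data set, both of which are illustrative assumptions rather than anything described above) learns a classifier from records with a known class attribute, estimates its expected accuracy on an independent evaluation set, and then applies it to records whose class is unknown.

    # Minimal sketch of the three computational steps: model learning,
    # model evaluation, and use of the model. Data set and library choice
    # (scikit-learn) are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic records with a known class attribute.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Hold out an independent evaluation set.
    X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 1: model learning -- produce a classifier from labelled data.
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

    # Step 2: model evaluation -- compare predictions with the known classes.
    print("expected accuracy:", accuracy_score(y_eval, clf.predict(X_eval)))

    # Step 3: use of the model -- classify records whose class is unknown.
    new_records = X_eval[:5]          # stand-in for unlabelled data
    print("predicted classes:", clf.predict(new_records))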

    Data-mining techniques

    There are many types of data mining, typically divided by the kind of

    information (attributes) known and the type of knowledge sought from the data-

    mining model.

    PREDICTIVE MODELING

    Predictive modeling is used when the goal is to estimate the value of a particular

    target attribute and there exist sample training data for which values of that

    attribute are known. An example is classification, which takes a set of data

    already divided into predefined groups and searches for patterns in the data that

    differentiate those groups. These discovered patterns then can be used to

    classify other data where the right group designation for the target attribute is

    unknown (though other attributes may be known). For instance, a manufacturer

    could develop a predictive model that distinguishes parts that fail under extreme

    heat, extreme cold, or other conditions based on their manufacturing

    environment, and this model may then be used to determine appropriate

    applications for each part. Another technique employed in predictive modeling

    is regression analysis, which can be used when the target attribute is a numeric

    value and the goal is to predict that value for new data.
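    A correspondingly small regression sketch follows; the synthetic attributes and the use of scikit-learn's LinearRegression are assumptions made only for illustration. The model is fitted to records whose numeric target is known and then used to predict that value for new data.

    # Minimal regression sketch: predict a numeric target attribute for new
    # data. Synthetic data and library choice are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(200, 2))                           # two known attributes
    y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=200)   # numeric target

    model = LinearRegression().fit(X, y)       # learn from records with known values
    print(model.predict([[40.0, 10.0]]))       # predict the target for a new record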


    DESCRIPTIVE MODELING

    Descriptive modeling, or clustering, also divides data into groups. With

    clustering, however, the proper groups are not known in advance; the patterns

    discovered by analysing the data are used to determine the groups. For example,

    an advertiser could analyse a general population in order to classify potential

    customers into different clusters and then develop separate advertising

    campaigns targeted to each group. Fraud detection also makes use of clustering

    to identify groups of individuals with similar purchasing patterns.
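    The following sketch illustrates clustering for customer segmentation with k-means; the synthetic customer features and the choice of three clusters are assumptions for illustration only.

    # Minimal clustering sketch: group customers without predefined classes.
    # Features and the number of clusters are illustrative assumptions.
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    # Hypothetical customer features (e.g. annual spend, visits per month).
    X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

    # The analyst does not know the groups in advance; k-means discovers them.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:10])        # cluster assignment for the first customers
    print(kmeans.cluster_centers_)    # profile of each discovered segment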

    Pattern mining concentrates on identifying rules that describe specific patterns

    within the data. Market-basket analysis, which identifies items that typically

    occur together in purchase transactions, was one of the first applications of data

    mining. For example, supermarkets used market-basket analysis to identify

    items that were often purchased together; for instance, a store featuring a fish

    sale would also stock up on tartar sauce. Although testing for such associations

    has long been feasible and is often simple to see in small data sets, data mining

    has enabled the discovery of less apparent associations in immense data sets. Of

    most interest is the discovery of unexpected associations, which may open new

    avenues for marketing or research. Another important use of pattern mining is

    the discovery of sequential patterns; for example, sequences of errors or

    warnings that precede an equipment failure may be used to schedule

    preventative maintenance or may provide insight into a design flaw.
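    A minimal market-basket sketch is shown below: it counts pairs of items that occur together in a handful of invented transactions and reports their support and confidence, the simplest form of the associations discussed above.

    # Minimal market-basket sketch over invented transactions.
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"fish", "tartar sauce", "lemon"},
        {"fish", "tartar sauce"},
        {"bread", "butter"},
        {"fish", "lemon"},
        {"bread", "butter", "jam"},
    ]

    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    n = len(transactions)
    for (a, b), count in pair_counts.most_common(3):
        support = count / n
        confidence = count / item_counts[a]          # P(b | a)
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")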

    ANOMALY DETECTION

    Anomaly detection can be viewed as the flip side of clustering, that is, finding

    data instances that are unusual and do not fit any established pattern. Fraud


    detection is an example of anomaly detection. Although fraud detection may be

    viewed as a problem for predictive modeling, the relative rarity of fraudulent

    transactions and the speed with which criminals develop new types of fraud

    mean that any predictive model is likely to be of low accuracy and to quickly

    become out of date. Thus, anomaly detection instead concentrates on modeling

    what is normal behaviour in order to identify unusual transactions. Anomaly

    detection also is used with various monitoring systems, such as for intrusion

    detection.
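    As a hedged illustration of modeling what is normal, the sketch below fits an isolation forest to synthetic "normal" transactions and flags new transactions that do not fit; the features, library choice, and contamination rate are all assumptions, not a system described above.

    # Minimal anomaly-detection sketch: model normal behaviour, flag outliers.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    # Hypothetical normal transactions: (amount, hour of day).
    normal = rng.normal(loc=[50.0, 12.0], scale=[10.0, 3.0], size=(500, 2))
    model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

    new_transactions = np.array([[55.0, 13.0],     # looks normal
                                 [900.0, 3.0]])    # unusual amount and time
    print(model.predict(new_transactions))         # 1 = normal, -1 = anomalous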

    Numerous other data-mining techniques have been developed, including pattern

    discovery in time series data (e.g., stock prices), streaming data (e.g., sensor

    networks), and relational learning (e.g., social networks).

    The potential for invasion of privacy using data mining has been a concern for

    many people. Commercial databases may contain detailed records of people's

    medical history, purchase transactions, and telephone usage, among other

    aspects of their lives. Civil libertarians consider some databases held by

    businesses and governments to be an unwarranted intrusion and an invitation to

    abuse. For example, the American Civil Liberties Union sued the U.S. National

    Security Agency (NSA) alleging warrantless spying on American citizens

    through the acquisition of call records from some American telecommunication

    companies. The program, which began in 2001, was not discovered by the

    public until 2006, when the information began to leak out. Often the risk is not

    from data mining itself (which usually aims to produce general knowledge

    rather than to learn information about specific issues) but from misuse or

    inappropriate disclosure of information in these databases.

    In the United States, many federal agencies are now required to produce annual

    reports that specifically address the privacy implications of their data-mining

    projects. The U.S. law requiring privacy reports from federal agencies defines


    data mining quite restrictively as "analyses to discover or locate a predictive

    pattern or anomaly indicative of terrorist or criminal activity on the part of any

    individual or individuals." As various local, national, and international law-

    enforcement agencies have begun to share or integrate their databases, the

    potential for abuse or security breaches has forced governments to work with

    industry on developing more secure computers and networks. In particular,

    there has been research in techniques for privacy-preserving data mining that

    operate on distorted, transformed, or encrypted data to decrease the risk of

    disclosure of any individual's data.
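    One simple member of the family of techniques that operate on distorted data is noise addition before release; the sketch below adds Laplace noise to an aggregate count before it is published. The records and the noise scale are invented for illustration, and this is a generic idea rather than a specific system mentioned above.

    # Minimal sketch of releasing a distorted aggregate instead of raw records.
    # The epsilon value and the data are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    ages = rng.integers(18, 90, size=10_000)            # hypothetical sensitive records

    true_count = int(np.sum(ages > 65))                 # exact answer (kept private)
    epsilon = 0.5                                       # smaller = more distortion
    noisy_count = true_count + rng.laplace(scale=1.0 / epsilon)

    print(round(noisy_count))   # analysts see only the distorted count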

    Data mining is evolving, with one driver being competitions on challenge

    problems. A commercial example of this was the $1 million Netflix Prize.

    Netflix, an American company that offers movie rentals delivered by mail or

    streamed over the Internet, began the contest in 2006 to see if anyone could

    improve by 10 per cent its recommendation system, an algorithm for predicting

    an individual's movie preferences based on previous rental data. The prize was

    awarded on Sept. 21, 2009, to BellKor's Pragmatic Chaos, a team of seven

    mathematicians, computer scientists, and engineers from the United States,

    Canada, Austria, and Israel, who had achieved the 10 per cent goal on June 26,

    2009, and finalized their victory with an improved algorithm 30 days later. The

    three-year open competition had spurred many clever data-mining innovations

    from contestants. For example, the 2007 and 2008 Conferences on Knowledge

    Discovery and Data Mining held workshops on the Netflix Prize, at which

    research papers were presented on topics ranging from new collaborative

    filtering techniques to faster matrix factorization (a key component of many

    recommendation systems). Concerns over privacy of such data have also led to

    advances in understanding privacy and anonymity.
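    Matrix factorization, mentioned above as a key component of many recommendation systems, can be sketched in a few lines of NumPy: user and item factor matrices are learned by gradient descent on the observed ratings and then used to fill in the blanks. The toy ratings and hyperparameters below are assumptions, and this is not the winning Netflix Prize algorithm.

    # Minimal matrix-factorization sketch for rating prediction.
    import numpy as np

    R = np.array([[5, 3, 0, 1],      # user x movie ratings; 0 = unrated
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)
    mask = R > 0
    k, lr, reg = 2, 0.01, 0.02

    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
    V = rng.normal(scale=0.1, size=(R.shape[1], k))   # movie factors

    for _ in range(5000):
        err = mask * (R - U @ V.T)                    # error on observed ratings only
        U += lr * (err @ V - reg * U)                 # gradient step on user factors
        V += lr * (err.T @ U - reg * V)               # gradient step on movie factors

    print(np.round(U @ V.T, 1))                       # predicted ratings, incl. blanks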

    Data mining is not a panacea, however, and results must be viewed with the

    same care as with any statistical analysis. One of the strengths of data mining is


    the ability to analyze quantities of data that would be impractical to analyze

    manually, and the patterns found may be complex and difficult for humans to

    understand; this complexity requires care in evaluating the patterns.

    Nevertheless, statistical evaluation techniques can result in knowledge that is

    free from human bias, and the large amount of data can reduce biases inherent

    in smaller samples. Used properly, data mining provides valuable insights into

    large data sets that otherwise would not be practical or possible to obtain.

    Pre-processing

    Before data mining algorithms can be used, a target data set must be assembled.

    As data mining can only uncover patterns actually present in the data, the target

    data set must be large enough to contain these patterns while remaining concise

    enough to be mined within an acceptable time limit. A common source for data

    is a data mart or data warehouse. Pre-processing is essential before multivariate

    data sets can be analysed by data mining. The target set is then cleaned. Data

    cleaning removes the observations containing noise and those with missing

    data.
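    A minimal cleaning sketch follows, using pandas; the column names, the sentinel value treated as noise, and the cleaning rules are assumptions chosen only to illustrate dropping duplicate, missing, and noisy observations from a target set.

    # Minimal pre-processing sketch: assemble a target set and clean it.
    import pandas as pd

    raw = pd.DataFrame({
        "customer_id":   [1, 2, 3, 4, 4],
        "age":           [34, None, 29, 51, 51],
        "monthly_spend": [120.0, 80.0, -999.0, 60.0, 60.0],   # -999 is a noise marker
    })

    target = (raw
              .drop_duplicates()               # remove repeated observations
              .dropna()                        # remove records with missing data
              .query("monthly_spend >= 0"))    # remove noisy/sentinel values

    print(target)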

    Data mining involves the following common classes of tasks:

    1. Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data

    errors that require further investigation.

    2. Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather

    data on customer purchasing habits. Using association rule learning, the

    supermarket can determine which products are frequently bought together


    and use this information for marketing purposes. This is sometimes

    referred to as market basket analysis.

    3. Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures

    in the data.

    4. Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-

    mail as "legitimate" or as "spam".

    5. Regression: attempts to find a function which models the data with the least error.

    6. Summarization: providing a more compact representation of the data set,

    including visualization and report generation.

    7. Sequential pattern mining: finds sets of data items that occur together frequently in some sequences (see the sketch after this list). Sequential

    pattern mining, which extracts frequent subsequences from a sequence

    database, has attracted a great deal of interest in recent data

    mining research because it is the basis of many applications, such as: web

    user analysis, stock trend prediction, DNA sequence analysis, finding

    language or linguistic patterns from natural language texts, and using the

    history of symptoms to predict certain kinds of disease.
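    The sketch referred to in item 7 is below: it counts ordered pairs of events that appear one after the other across several invented event sequences, the simplest form of a sequential pattern such as warnings that precede an equipment failure.

    # Minimal sequential-pattern sketch over invented event logs.
    from collections import Counter

    sequences = [
        ["warn_A", "warn_B", "failure"],
        ["warn_A", "ok", "warn_B", "failure"],
        ["ok", "warn_A", "warn_B"],
        ["warn_B", "ok", "ok"],
    ]

    pair_counts = Counter()
    for seq in sequences:
        seen = set()
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if (first, later) not in seen:        # count each pair once per sequence
                    seen.add((first, later))
                    pair_counts[(first, later)] += 1

    min_support = 2
    frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
    print(frequent)   # e.g. ('warn_A', 'failure') appearing in >= 2 sequences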

    In science and engineering

    In recent years, data mining has been used widely in the areas of science and

    engineering, such as bioinformatics, genetics, medicine, education and

    electrical power engineering.


    In the study of human genetics, sequence mining helps address the important

    goal of understanding the mapping relationship between the inter-individual

    variations in human DNA sequence and the variability in disease

    susceptibility. In simple terms, it aims to find out how the changes in an

    individual's DNA sequence affect the risks of developing common diseases

    such as cancer, which is of great importance to improving methods of

    diagnosing, preventing, and treating these diseases. The data mining method

    that is used to perform this task is known as multifactor dimensionality

    reduction.

    In the area of electrical power engineering, data mining methods have been

    widely used for condition monitoring of high voltage electrical equipment.

    The purpose of condition monitoring is to obtain valuable information on,

    for example, the status of the insulation (or other important safety-related

    parameters). Data clustering techniques such as the self-organizing map

    (SOM), have been applied to vibration monitoring and analysis of

    transformer on-load tap-changers (OLTCs). Using vibration monitoring, it

    can be observed that each tap change operation generates a signal that

    contains information about the condition of the tap changer contacts and the

    drive mechanisms. Obviously, different tap positions will generate different

    signals. However, there was considerable variability amongst normal

    condition signals for exactly the same tap position. SOM has been applied to

    detect abnormal conditions and to hypothesize about the nature of the

    abnormalities.
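    A hedged sketch of the SOM idea follows: a small map is trained on hypothetical vibration-signature features recorded under normal conditions, and a new signal is flagged when it lies far from its best-matching unit. It assumes the third-party minisom package and invented feature vectors; it is not the published OLTC study.

    # Sketch only: SOM-based novelty detection on hypothetical vibration
    # features, assuming the third-party `minisom` package.
    import numpy as np
    from minisom import MiniSom

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, size=(500, 8))      # features of normal tap changes

    som = MiniSom(6, 6, input_len=8, sigma=1.0, learning_rate=0.5, random_seed=0)
    som.random_weights_init(normal)
    som.train_random(normal, 5000)

    def quantization_error(x):
        """Distance from a signal to the weights of its best-matching unit."""
        w = som.get_weights()[som.winner(x)]
        return float(np.linalg.norm(x - w))

    threshold = np.quantile([quantization_error(x) for x in normal], 0.99)
    suspect = rng.normal(4.0, 1.0, size=8)            # hypothetical abnormal signature
    print(quantization_error(suspect) > threshold)    # True -> abnormal condition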

    Data mining methods have also been applied to dissolved gas analysis

    (DGA) in power transformers. DGA, as a diagnostics for power

    transformers, has been available for many years. Methods such as SOM have

    been applied to analyze generated data and to determine trends which are not

    obvious to standard DGA ratio methods (such as the Duval Triangle).


    Another example of data mining in science and engineering is found in

    educational research, where data mining has been used to study the factors

    leading students to choose to engage in behaviours which reduce their

    learning, and to understand factors influencing university student retention. A

    similar example of social application of data mining is its use in expertise

    finding systems, whereby descriptors of human expertise are extracted,

    normalized, and classified so as to facilitate the finding of experts,

    particularly in scientific and technical fields. In this way, data mining can

    facilitate institutional memory.

    Other examples of the application of data mining methods are the mining of biomedical data

    facilitated by domain ontologies, mining clinical trial data, and traffic

    analysis using SOM.

    In adverse drug reaction surveillance, the Uppsala Monitoring Centre has,

    since 1998, used data mining methods to routinely screen for reporting

    patterns indicative of emerging drug safety issues in the WHO global

    database of 4.6 million suspected adverse drug reaction incidents. Recently,

    similar methodology has been developed to mine large collections of

    electronic health records for temporal patterns associating drug prescriptions

    to medical diagnoses.
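    Disproportionality analysis is a standard way to screen a spontaneous-report database for drug-reaction patterns. The sketch below computes a proportional reporting ratio (PRR) from a 2x2 contingency table of invented counts; it illustrates the general idea only and is not necessarily the Uppsala Monitoring Centre's exact method.

    # Sketch of a proportional reporting ratio (PRR), a generic
    # disproportionality screen; the counts below are invented.
    def prr(a, b, c, d):
        """a: reports with drug and reaction, b: drug without reaction,
        c: other drugs with reaction, d: other drugs without reaction."""
        return (a / (a + b)) / (c / (c + d))

    # Hypothetical report counts for one drug/reaction combination.
    score = prr(a=30, b=970, c=200, d=98_800)
    print(f"PRR = {score:.1f}")   # values well above 1 suggest a reporting pattern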

    Subject-based data mining

    "Subject-based data mining" is a data mining method involving the search

    for associations between individuals in data. In the context of combating

    terrorism, the National Research Council provides the following definition:

    "Subject-based data mining uses an initiating individual or other datum that

    is considered, based on other information, to be of high interest, and the goal


    is to determine what other persons or financial transactions or movements,

    etc., are related to that initiating datum."

    Spatial data mining

    Spatial data mining is the application of data mining methods to spatial data.

    The end objective of spatial data mining is to find patterns in data with

    respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own

    methods, traditions, and approaches to visualization and data analysis.

    Particularly, most contemporary GIS have only very basic spatial analysis

    functionality. The immense explosion in geographically referenced data

    occasioned by developments in IT, digital mapping, remote sensing, and the

    global diffusion of GIS emphasizes the importance of developing data-

    driven inductive approaches to geographical analysis and modeling.

    Data mining offers great potential benefits for GIS-based applied decision-

    making. Recently, the task of integrating these two technologies has become

    of critical importance, especially as various public and private sector

    organizations possessing huge databases with thematic and geographically

    referenced data begin to realize the huge potential of the information

    contained therein. Among those organizations are:

    1. offices requiring analysis or dissemination of geo-referenced statistical data

    2. public health services searching for explanations of disease clustering

    3. environmental agencies assessing the impact of changing land-use patterns on climate change


    4. geo-marketing companies doing customer segmentation based on spatial location.

    Challenges in spatial data mining: Geospatial data repositories tend to be very

    large. Moreover, existing GIS datasets are often splintered into feature and

    attribute components that are conventionally archived in hybrid data

    management systems. Algorithmic requirements differ substantially for

    relational (attribute) data management and for topological (feature) data

    management. Related to this is the range and diversity of geographic data

    formats, which present unique challenges. The digital geographic data

    revolution is creating new types of data formats beyond the traditional

    "vector" and "raster" formats. Geographic data repositories increasingly

    include ill-structured data, such as imagery and geo-referenced multi-media.

    There are several critical research challenges in geographic knowledge

    discovery and data mining. Miller and Han offer the following list of

    emerging research topics in the field:

    Developing and supporting geographic data warehouses (GDWs): Spatial

    properties are often reduced to simple spatial attributes in mainstream data

    warehouses. Creating an integrated GDW requires solving issues of spatial

    and temporal data interoperability including differences in semantics,

    referencing systems, geometry, accuracy, and position.

    Better spatio-temporal representations in geographic knowledge discovery:

    Current geographic knowledge discovery (GKD) methods generally use very

    simple representations of geographic objects and spatial relationships.

    Geographic data mining methods should recognize more complex

    geographic objects (e.g., lines and polygons) and relationships (e.g., non-

    Euclidean distances, direction, connectivity, and interaction through

    attributed geographic space such as terrain). Furthermore, the time


    dimension needs to be more fully integrated into these geographic

    representations and relationships.

    Geographic knowledge discovery using diverse data types: GKD methods

    should be developed that can handle diverse data types beyond the

    traditional raster and vector models, including imagery and geo-referenced

    multimedia, as well as dynamic data types (video streams, animation).

    Knowledge grid

    Knowledge discovery "On the Grid" generally refers to conducting

    knowledge discovery in an open environment using grid computing

    concepts, allowing users to integrate data from various online data sources,

    as well as make use of remote resources for executing their data mining tasks.

    The earliest example was the Discovery Net, developed at Imperial College

    London, which won the "Most Innovative Data-Intensive Application

    Award" at the ACM SC02 (Supercomputing 2002) conference and

    exhibition, based on a demonstration of a fully interactive distributed

    knowledge discovery application for a bioinformatics application. Other

    examples include work conducted by researchers at the University of

    Calabria, who developed Knowledge Grid architecture for distributed

    knowledge discovery, based on grid computing.