b28129.pdf


  • 8/13/2019 b28129.pdf

    1/94

    Oracle Data Mining Concepts, 11g Release 1 (11.1)

    B28129-01

    January 2007

    Beta Draft


    Oracle Data Mining Concepts, 11g Release 1 (11.1)

    B28129-01

    Copyright 2005, 2007, Oracle. All rights reserved.

    The Programs (which include both the software and documentation) contain proprietary information; they are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright, patent, and other intellectual and industrial property laws. Reverse engineering, disassembly, or decompilation of the Programs, except to the extent required to obtain interoperability with other independently created software or as specified by law, is prohibited.

    The information contained in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. This document is not warranted to be error-free. Except as may be expressly permitted in your license agreement for these Programs, no part of these Programs may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose.

    If the Programs are delivered to the United States Government or anyone licensing or using the Programs on behalf of the United States Government, the following notice is applicable:

    U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the Programs, including documentation and technical data, shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement, and, to the extent applicable, the additional rights set forth in FAR 52.227-19, Commercial Computer Software--Restricted Rights (June 1987). Oracle USA, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

    The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup, redundancy and other measures to ensure the safe use of such applications if the Programs are used for such purposes, and we disclaim liability for any damages caused by such use of the Programs.

    Oracle, JD Edwards, PeopleSoft, and Siebel are registered trademarks of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

    The Programs may provide links to Web sites and access to content, products, and services from third parties. Oracle is not responsible for the availability of, or any content provided on, third-party Web sites. You bear all risks associated with the use of such content. If you choose to purchase any products or services from a third party, the relationship is directly between you and the third party. Oracle is not responsible for: (a) the quality of third-party products or services; or (b) fulfilling any of the terms of the agreement with the third party, including delivery of products or services and warranty obligations related to purchased products or services. Oracle is not responsible for any loss or damage of any sort that you may incur from dealing with any third party.

    Alpha and Beta Draft documentation are considered to be in prerelease status. This documentation is intended for demonstration and preliminary use only. We expect that you may encounter some errors, ranging from typographical errors to data inaccuracies. This documentation is subject to change without notice, and it may not be specific to the hardware on which you are using the software. Please be advised that prerelease documentation is not warranted in any manner, for any purpose, and we will not be responsible for any loss, costs, or damages incurred due to the use of this documentation.


    Contents

    Preface
        Audience
        Documentation Accessibility
        Related Documentation
        Conventions

    What's New in Oracle Data Mining?
        Oracle Data Mining 11g Release 1 (11.1) New Features
        Oracle Data Mining 10g Release 2 (10.2) New Features

    1 Introducing Oracle Data Mining
        Data Mining in the Database Kernel
        Data Mining Functions
        Data Mining Algorithms
        Data Transformations
        How Do I Use Oracle Data Mining?
        Where Do I Find Information About Oracle Data Mining?
        Oracle Data Mining and Oracle Database Analytics

    2 What Is Data Mining?
        What Is Data Mining?
        What Can Data Mining Do and Not Do?
        Data Mining Methodology

    3 Classification
        About Classification
        Classification Algorithms
        Costs
        Priors
        Testing Supervised Models

    4 Regression
        About Regression
        Support Vector Machine
        Generalized Linear Models
        Multivariate Linear Regression

    5 Attribute Importance
        Attribute Importance
        Data Preparation for Attribute Importance
        Algorithm for Attribute Importance

    6 Anomaly Detection
        Anomaly Detection
        Algorithm for Anomaly Detection

    7 Clustering
        Clustering
        Algorithms for Clustering

    8 Market Basket Analysis
        Association
        Data for Association Models
        Difficult Cases for Associations
        Algorithm for Associations

    9 Feature Extraction
        Feature Extraction
        Algorithm for Feature Extraction

    10 Text Mining
        Introduction
        What is Text Mining?
        Oracle Data Mining Support for Text Mining
        Oracle Support for Text Mining

    11 Predictive Analytics
        About Predictive Analytics
        How Do I Use Oracle Predictive Analytics?
        Embedding Predictive Analytics in Applications
        Behind the Scenes

    Glossary

    Index


    Preface

    This manual describes the features of Oracle Data Mining, a comprehensive data mining solution within Oracle Database. It explains the data mining algorithms, and it lays a conceptual foundation for much of the practical information contained in other manuals. (See "Related Documentation".)

    The preface contains these topics:
    - Audience
    - Documentation Accessibility
    - Related Documentation
    - Conventions

    Audience

    Oracle Data Mining Concepts is intended for analysts and data mining specialists. Whether you will use Oracle Data Miner to design models or the programmatic interfaces to build or extend applications, Oracle Data Mining Concepts provides an essential introduction.

    Documentation Accessibility

    Our goal is to make Oracle products, services, and supporting documentation accessible, with good usability, to the disabled community. To that end, our documentation includes features that make information available to users of assistive technology. This documentation is available in HTML format, and contains markup to facilitate access by the disabled community. Accessibility standards will continue to evolve over time, and Oracle is actively engaged with other market-leading technology vendors to address technical obstacles so that our documentation can be accessible to all of our customers. For more information, visit the Oracle Accessibility Program Web site at

    http://www.oracle.com/accessibility/

    Accessibility of Code Examples in Documentation

    Screen readers may not always correctly read the code examples in this document. The conventions for writing code require that closing braces should appear on an otherwise empty line; however, some screen readers may not always read a line of text that consists solely of a bracket or brace.

    Accessibility of Links to External Web Sites in Documentation

    This documentation may contain links to Web sites of other companies or organizations that Oracle does not own or control. Oracle neither evaluates nor makes any representations regarding the accessibility of these Web sites.

    TTY Access to Oracle Support Services

    Oracle provides dedicated Text Telephone (TTY) access to Oracle Support Services within the United States of America 24 hours a day, seven days a week. For TTY support, call 800.446.2398.

    Related Documentation

    The documentation set for Oracle Data Mining is part of the Oracle Database 11g Release 1 (11.1) Online Documentation Library. The Oracle Data Mining documentation set consists of the following documents:
    - Oracle Data Mining Concepts
    - Oracle Data Mining Application Developer's Guide
    - Oracle Data Mining Java API Reference (javadoc)
    - Oracle Data Mining Administrator's Guide

    For information about Oracle Data Miner, the graphical user interface for Data Mining, see the online help. Oracle Data Miner is distributed on Oracle Technology Network at

    http://www.oracle.com/technology/index.html

    For detailed information about the Oracle Data Mining PL/SQL interface, see Oracle Database PL/SQL Packages and Types Reference.

    For detailed information about the SQL data mining functions, see Oracle Database SQL Language Reference.

    For information about the data mining process in general, independent of both industry and tool, a good source is the CRISP-DM project (Cross-Industry Standard Process for Data Mining) at

    http://www.crisp-dm.org

    Note: Information to assist you in installing and using the Data Mining demo programs is provided in Oracle Data Mining Administrator's Guide.

    Conventions

    The following text conventions are used in this document:

    Convention | Meaning
    boldface | Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.
    italic | Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.
    monospace | Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.


    What's New in Oracle Data Mining?

    This section describes new features of Oracle Data Mining 11g Release 1 (11.1) and provides pointers to additional information. Information about new features in the previous release is also retained to help users migrating to the current release.

    The following sections describe the new features in Oracle Data Mining:
    - Oracle Data Mining 11g Release 1 (11.1) New Features
    - Oracle Data Mining 10g Release 2 (10.2) New Features

    Oracle Data Mining 11g Release 1 (11.1) New Features

    - Mining Model schema objects

      In Oracle 11g, Data Mining models are implemented as data dictionary objects in the SYS schema. A set of new data dictionary views presents mining models and their properties. New system and object privileges control access to mining model objects.

      In previous releases, Data Mining models were implemented as a collection of tables and metadata within the DMSYS schema. In Oracle 11g, the DMSYS schema no longer exists.

      See Also:
      - Oracle Data Mining Administrator's Guide for information on privileges for accessing mining models
      - Oracle Database Reference for information on Oracle Data Mining data dictionary views

    - Automatic data preparation

      In most cases, data must be transformed using techniques such as binning, normalization, or missing value treatment before it can be mined. In Oracle 10g Release 2, you had to manually prepare the build data, test data, and apply data with the same transformations, unless you were using Oracle Data Miner.

      In Oracle 11g, the data preparation process can be automated. Algorithm-appropriate transformation instructions are embedded in the model and automatically applied to the build data and scoring data. The automatic transformations can be complemented by or replaced with user-specified transformations.
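    The idea behind automatic data preparation, that transformation instructions are stored with the model and reapplied to scoring data, can be sketched in Python. This is an illustration only and not Oracle's implementation: the class name `PreparedModel`, the choice of min-max normalization, and the use of the mean for missing value treatment are assumptions made for the example.

```python
from statistics import mean


class PreparedModel:
    """Toy 'model' that stores its own data preparation (hypothetical name)."""

    def fit(self, build_values):
        # Learn the transformation parameters from the build data only.
        known = [v for v in build_values if v is not None]
        self.fill = mean(known)   # missing value treatment (assumed: mean)
        self.lo = min(known)      # min-max normalization bounds
        self.hi = max(known)
        return self

    def prepare(self, values):
        # Apply the stored transformation to any data set (build or scoring).
        span = (self.hi - self.lo) or 1.0
        filled = [self.fill if v is None else v for v in values]
        return [(v - self.lo) / span for v in filled]


model = PreparedModel().fit([20, 40, None, 60])
build_prepared = model.prepare([20, 40, None, 60])
score_prepared = model.prepare([30, None])
```

    Because the parameters are learned once and kept with the model, the build data and the scoring data pass through exactly the same transformation, which is the step that had to be repeated by hand in the earlier release.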


    - Scoping of Nested Data and Enhanced Handling of Sparse Data

      Oracle Data Mining supports nested data types for both categorical and numerical data. Multi-record case data must be transformed to nested columns for mining.

      In Oracle Data Mining 10g Release 2, the nested columns were processed as top-level attributes; if two nested columns each contained an attribute with the same name, the algorithm could not distinguish between them. In Oracle Data Mining 11g, nested attributes are scoped with the column name, which prevents this problem from occurring.

      Handling of sparse data and missing values has been standardized across algorithms in Oracle Data Mining 11g. Data is sparse when a high percentage of the cells are empty, but all the values are assumed to be known. This is the case in market basket data. When some cells are empty, and their values are not known, they are assumed to be missing at random. With this definition, only nested data can be sparse. Oracle Data Mining assigns zeros to empty cells in nested columns that are sparse. Oracle Data Mining replaces values that are missing at random with either the mean (numerical) or the mode (categorical).

      In Oracle Data Mining 11g, Adaptive Bayes Network and O-Cluster algorithms do not support nested data.
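    The scoping and missing-value rules described above can be illustrated with a short Python sketch. It is not Oracle code: the function names are hypothetical, and the `column.attribute` naming used to show scoping is an assumption chosen for readability.

```python
from statistics import mean, mode


def scope_nested(column_name, nested_row):
    """Prefix each nested attribute with the name of its column."""
    return {f"{column_name}.{attr}": value for attr, value in nested_row.items()}


def fill_sparse(scoped_row, all_attributes):
    """Sparse nested data: an absent attribute is treated as a known zero."""
    return {attr: scoped_row.get(attr, 0) for attr in all_attributes}


def fill_missing(values):
    """Missing at random: mean for numerical, mode for categorical values."""
    known = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in known)
    fill = mean(known) if numeric else mode(known)
    return [fill if v is None else v for v in values]


# Two nested columns that both contain an attribute called "books".
purchases = scope_nested("purchases", {"books": 2})
returns = scope_nested("returns", {"books": 1})
row = {**purchases, **returns}  # the two "books" attributes no longer collide
dense = fill_sparse(row, ["purchases.books", "purchases.music", "returns.books"])
ages = fill_missing([30, None, 50])
genders = fill_missing(["F", "M", None, "F"])
```

    Without the column prefix, the second `books` entry would overwrite the first when the rows are merged, which is the ambiguity the earlier release could not resolve.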

    - Generalized Linear Models

      A new algorithm, Generalized Linear Models, is introduced in Oracle 11g. It supports two mining functions: classification (Binary Logistic Regression) and regression (Multivariate Linear Regression).

    - New SQL Data Mining Function

      A new SQL Data Mining function, PREDICTION_BOUNDS, has been introduced for use with Generalized Linear Models.

    - Enhanced Support for Cost-Sensitive Decision Making

      Cost matrix support is significantly enhanced in Oracle 11g. A cost matrix can be added to or removed from any classification model using the new procedures DBMS_DATA_MINING.ADD_COST_MATRIX and DBMS_DATA_MINING.REMOVE_COST_MATRIX.

      The SQL Data Mining functions support new syntax for specifying an in-line cost matrix. With this new feature, cost-sensitive model results can be returned within a SQL statement even if the model was not built with a cost matrix.

      Only Decision Tree models can be built with a cost matrix.
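    To show why a cost matrix can change a model's answer at scoring time, the following Python sketch picks the class with the lowest expected cost rather than the highest probability. It is a generic illustration of cost-sensitive decision making, not the logic of the Oracle procedures named above; the function names, class labels, and cost values are all made up for the example.

```python
def expected_costs(probabilities, cost_matrix):
    """Expected cost of each candidate prediction.

    probabilities: {actual_class: probability}
    cost_matrix:   {(actual_class, predicted_class): cost}
    """
    classes = list(probabilities)
    return {
        predicted: sum(
            probabilities[actual] * cost_matrix[(actual, predicted)]
            for actual in classes
        )
        for predicted in classes
    }


def cost_sensitive_prediction(probabilities, cost_matrix):
    """Pick the class with the lowest expected cost."""
    costs = expected_costs(probabilities, cost_matrix)
    return min(costs, key=costs.get)


probabilities = {"responder": 0.3, "non_responder": 0.7}
costs = {
    ("responder", "responder"): 0,
    ("responder", "non_responder"): 10,  # a missed responder is expensive
    ("non_responder", "responder"): 1,   # a wasted mailing is cheap
    ("non_responder", "non_responder"): 0,
}
choice = cost_sensitive_prediction(probabilities, costs)
```

    In this example the most probable class is `non_responder`, but the cost-sensitive choice is `responder`, because the expected cost of a wasted mailing is lower than the expected cost of missing a responder. Since the costs are applied after the probabilities are computed, the same approach works for a model that was not built with a cost matrix.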

    See Also:
    - Oracle Data Mining Application Developer's Guide for information on automatic and custom data transformation for Data Mining
    - Oracle Database PL/SQL Packages and Types Reference for information on DBMS_DATA_MINING_TRANSFORM

    See Also: Oracle Data Mining Application Developer's Guide

    See Also: Oracle Data Mining Concepts

    See Also: Oracle Database SQL Language Reference


    - Enhancements to Predictive Analytics

      The DBMS_PREDICTIVE_ANALYTICS PL/SQL package has an additional procedure called PROFILE. This procedure segments the data based on a specified target value.

    - Enhancements to Java API

      - The Java API fully supports the new 11g Data Mining features, including Generalized Linear Models and automatic data preparation.
      - The Java API provides new support for asynchronous execution of mining tasks. Each task is stored as a DBMS_SCHEDULER job object in the user's schema.

    Oracle Data Mining 10g Release 2 (10.2) New Features

    - Java Data Mining (JDM) compliant Java API

      Oracle 10g Release 2 introduces a completely new Java API for Data Mining. The API implements JSR-000073, developed through the Java Community Process (http://jcp.org).

      The new Java API is layered on the PL/SQL API, and the two APIs are fully interoperable. The new Java API is not compatible with the Java API available in the previous release (Oracle 10g Release 1).

    - SQL built-in functions for Data Mining

      New built-in SQL functions support the scoring of classification, regression, clustering, and feature extraction models. Within the context of standard SQL statements, pre-created models can be applied to new data and the results returned for further processing. The Data Mining SQL functions are:

      - PREDICTION, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, PREDICTION_SET
      - CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET
      - FEATURE_ID, FEATURE_SET, FEATURE_VALUE

    - Predictive Analytics

      Predictive Analytics automates the process of data mining. Without user intervention, Predictive Analytics routines manage data preparation, algorithm selection, model building, and model scoring.

      In the DBMS_PREDICTIVE_ANALYTICS PL/SQL package, Oracle Data Mining provides Predictive Analytics routines that calculate predictions and determine the relative influence of attributes on the prediction.

    See Also:
    - Oracle Data Mining Application Developer's Guide
    - Oracle Database SQL Language Reference

    See Also: Oracle Database PL/SQL Packages and Types Reference

    See Also: Oracle Data Mining Application Developer's Guide and Oracle Data Mining Java API Reference

      Oracle Spreadsheet Add-In for Predictive Analytics implements DBMS_PREDICTIVE_ANALYTICS within the context of an Excel spreadsheet. The Spreadsheet Add-In is distributed on Oracle Technology Network.

    - New and enhanced algorithms

      - The new Decision Tree algorithm generates human-understandable rules for a prediction.
      - The new One-Class Support Vector Machine algorithm supports anomaly detection.
      - The Support Vector Machine algorithm is enhanced with active learning for the management of large build data sets.
      - Both the PL/SQL and Java APIs support the O-Cluster algorithm. In Oracle 10g Release 1, O-Cluster was only supported in the Java API.


    1 Introducing Oracle Data Mining

    This chapter introduces the basics you will need to start using Oracle Data Mining.

    This chapter includes the following sections:
    - Data Mining in the Database Kernel
    - Data Mining Functions
    - Data Mining Algorithms
    - Data Transformations
    - How Do I Use Oracle Data Mining?
    - Where Do I Find Information About Oracle Data Mining?
    - Oracle Data Mining and Oracle Database Analytics

    Data Mining in the Database Kernel

    Oracle Data Mining provides comprehensive, state-of-the-art data mining functionality within Oracle Database.

    Oracle Data Mining is implemented in the Oracle Database kernel, and mining models are first class database objects. Oracle Data Mining processes use built-in features of Oracle Database to maximize scalability and make efficient use of system resources. Oracle Data Miner, the graphical user interface to Oracle Data Mining, runs as a client, but all the processing occurs within the database server.

    Note: If you are not familiar with data mining technology, please read Chapter 2, "What Is Data Mining?" before you read this chapter.

    Data mining within Oracle Database offers many advantages:

    - No Data Movement. Some data mining products require that the data be exported from a corporate database and converted to a specialized format for mining. With Oracle Data Mining, no data movement or conversion is needed. This makes the entire mining process less complex, time-consuming, and error-prone.
    - Security. Your data is protected by the extensive security mechanisms of Oracle Database.
    - Data Preparation and Administration. Most data must be cleansed, filtered, normalized, sampled, and transformed in various ways before it can be mined. Up to 80% of the effort in a data mining project is often devoted to data preparation. Oracle Data Mining automatically manages key steps in the data preparation process. Additionally, Oracle Database provides extensive administrative tools for preparing and managing data.

    - Ease of Data Refresh. Mining processes within Oracle Database have ready access to refreshed data. Oracle Data Mining can easily deliver mining results based on current data, thereby maximizing its timeliness and relevance.
    - Oracle Technology Stack. You can take advantage of all aspects of Oracle's technology stack to integrate data mining within a larger framework for business intelligence or scientific inquiry.
    - Domain Environment. Data mining models have to be built, tested, validated, managed, and deployed in their appropriate application domain environments. Data mining results may need to be post-processed as part of domain specific computations (for example, calculating estimated risks and response probabilities) and then stored into permanent repositories or data warehouses. With Oracle Data Mining, the pre- and post-mining activities can all be accomplished within the same environment.
    - Application Programming Interfaces. PL/SQL and Java APIs and SQL language operators provide direct access to Oracle Data Mining functionality in Oracle Database.

    Data Mining Functions

    A basic understanding of data mining functions and algorithms is required for using Oracle Data Mining. This section introduces the concept of data mining functions. Algorithms are introduced in the following section, "Data Mining Algorithms".

    Each data mining function specifies a class of problems that can be modeled and solved. Data mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.

    Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.

    Supervised Data Mining

    Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.

    Supervised learning generally results in predictive models. This is in contrast to unsupervised learning where the goal is pattern detection.

    The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model "learns" the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

    See Also: "Oracle Data Mining and Oracle Database Analytics" for a summary of additional analytics available within Oracle Database.

    Build Data and Test Data

    Separate data sets are required for building (training) and testing a predictive model. The build data (training data) and test data must have the same column structure.

    Typically, one large table or view is split into two data sets: one for building the model, and the other for testing the model.

    The process of applying the model to test data helps to determine whether the model, built on one chosen sample, is generalizable to other data. In particular, it helps to avoid the phenomenon of overfitting, which can occur when the logic of the model fits the build data too well and therefore has little predictive power.
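    The build/test workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an Oracle Data Mining procedure: the helper names, the 60/40 split, and the deliberately trivial "model" (the most common target value for each value of one predictor) are all invented for the example.

```python
import random


def split_build_test(cases, test_fraction=0.4, seed=0):
    """Shuffle the cases and split them into build and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_by_key(build, key, target):
    """'Learn' the most common target value for each value of one predictor."""
    counts = {}
    for case in build:
        by_target = counts.setdefault(case[key], {})
        by_target[case[target]] = by_target.get(case[target], 0) + 1
    return {k: max(v, key=v.get) for k, v in counts.items()}


def accuracy(model, cases, key, target):
    """Fraction of cases whose known target matches the model's prediction."""
    hits = sum(1 for c in cases if model.get(c[key]) == c[target])
    return hits / len(cases)


# One table of cases with a known target, split into build and test data.
cases = [{"segment": "a" if i % 2 else "b", "responded": i % 2} for i in range(10)]
build, test = split_build_test(cases)
model = train_majority_by_key(build, "segment", "responded")
test_accuracy = accuracy(model, test, "segment", "responded")
```

    Both sets come from the same table, so they share a column structure, and the model never sees the test cases during training. Comparing accuracy on the build data with accuracy on the test data is one simple way to spot overfitting: a large gap suggests the model has memorized the build data.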

    Apply Data

    Apply data, also called scoring data, refers to the actual population to which a model is applied. For example, you might build a model that identifies the characteristics of customers who frequently buy a certain product. To obtain a list of people who are most likely to be your best customers when you introduce a related product, you might apply the model to your customer database. In this case, the scoring data consists of your customer database.

    Most supervised learning is applied to a population of interest. An exception is attribute importance, which cannot be applied to separate data. (See Table 1-1, "Oracle Data Mining Supervised Functions".)

    Unsupervised Data Mining

    Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes. There is no previously-known result to guide the algorithm in building the model.

    Unsupervised learning can be used for descriptive purposes. For example, a clustering model that seeks to divide a customer base into segments based on similarities in age, gender, and spending habits uses descriptive data mining techniques. However, unsupervised learning does not imply descriptive models. Unsupervised models can also be used to make predictions.

    Apply Data

    Although unsupervised data mining makes no distinction between predictors (independent attributes) and target (dependent attribute), unsupervised models can still be applied to obtain predictive information.

    A clustering model uses unsupervised learning, but it can be applied to predict the probability that individual cases will belong to a given cluster. Anomaly detection is also unsupervised since it does not use a target, but an anomaly detection model is typically used to predict whether a data point is typical for a given distribution.

    Most unsupervised learning is applied to a population of interest. An exception is the Oracle Data Mining implementation of association rules. In Oracle Data Mining, association models cannot be applied to separate data. (See Table 1-2, "Oracle Data Mining Unsupervised Functions".)
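    As a rough illustration of applying an unsupervised model, the Python sketch below scores a new case against an already-built clustering model and returns a membership probability for each cluster. The model is represented simply as a set of centroids, and the softmax over negative Euclidean distance is an illustrative stand-in, not the method used by the Oracle Data Mining clustering algorithms; the function name and the centroid values are hypothetical.

```python
import math


def cluster_probabilities(case, centroids):
    """Turn distances to each centroid into membership probabilities.

    Uses a softmax over negative Euclidean distance, which is an
    illustrative choice and not the method used by Oracle Data Mining.
    """
    weights = {
        cluster_id: math.exp(-math.dist(case, centroid))
        for cluster_id, centroid in centroids.items()
    }
    total = sum(weights.values())
    return {cluster_id: w / total for cluster_id, w in weights.items()}


# A "model" built earlier without any target: two cluster centroids.
centroids = {1: (0.0, 0.0), 2: (10.0, 0.0)}
near_one = cluster_probabilities((1.0, 0.0), centroids)
middle = cluster_probabilities((5.0, 0.0), centroids)
```

    No target attribute was involved in producing the centroids, yet applying them to new cases still yields predictive information: a case close to one centroid receives a high probability for that cluster, and a case halfway between two centroids is split evenly between them.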


    Oracle Data Mining Functions

    Oracle Data Mining supports the supervised data mining functions described in Table 1-1.

    Oracle Data Mining supports the unsupervised functions described in Table 1-2.

    Data Mining Algorithms

    An algorithm is a mathematical procedure for solving a specific kind of problem. Oracle Data Mining supports at least one algorithm for each data mining function. For some functions, you can choose among several algorithms. For example, Oracle Data Mining supports five classification algorithms.

    Each data mining model is produced by one algorithm. Some data mining problems can best be solved by using more than one algorithm. This necessitates the development of more than one model. For example, you might first use a feature extraction model to create an optimized set of predictors, then a classification model to make a prediction on the results.

    Table 1-1 Oracle Data Mining Supervised Functions

    Function | Description | Sample Problem
    Classification | Assigns items to discrete classes and predicts the class to which an item belongs. | Given demographic data about a set of customers, predict customer response to an affinity card program.
    Regression | Approximates and forecasts continuous values. | Given demographic and purchasing data about a set of customers, predict customers' age.
    Attribute Importance | Identifies the attributes that are most important in predicting a target attribute. | Given customer response to an affinity card program, find the most significant predictors.

    Table 1-2 Oracle Data Mining Unsupervised Functions

    Function | Description | Sample Problem
    Clustering | Finds natural groupings in the data. | Segment demographic data into clusters and rank the probability that an individual will belong to a given cluster.
    Anomaly Detection | Identifies items (outliers) that do not satisfy the characteristics of "normal" data. | Given demographic data about a set of customers, identify customer purchasing behavior that is significantly different from the norm.
    Association Rules | Finds items that tend to co-occur in the data and specifies the rules that govern their co-occurrence. | Find the items that tend to be purchased together and specify their relationship.
    Feature Extraction | Creates new attributes (features) using linear combinations of the original attributes. | Given demographic data about a set of customers, extract the significant features of the individuals.

    Note: You can be successful at data mining without understanding the inner workings of each algorithm. However, it is important to understand the general characteristics of the algorithms and their suitability for different kinds of applications. Refer to chapters xxxxxx


    Oracle Data Mining Supervised Algorithms

    Oracle Data Mining supports the supervised data mining algorithms described in Table 1-3. These algorithms are described in detail in Chapters x and y. The algorithm abbreviations are used throughout this manual.

    Oracle Data Mining Unsupervised Algorithms

    Oracle Data Mining supports the unsupervised data mining algorithms described in Table 1-4. These algorithms are described in detail in Chapters x and y. The algorithm abbreviations are used throughout this manual.

    Table 1-3 Oracle Data Mining Algorithms for Supervised Functions

    Algorithm | Function | Description
    Decision Tree (DT) | Classification | Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions in XM; they explain the decisions that lead to the prediction.
    Support Vector Machine (SVM) | Classification and Regression | Distinct versions of SVM use different kernel functions to handle different types of data sets. Linear and Gaussian (non-linear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.
    Binary Logistic Regression (GLM) | Classification | Generalized Linear Models (GLM) for classification. GLM uses linear modeling techniques and supports confidence intervals for predictions. Binary logistic regression generates prediction probabilities.
    Multivariate Linear Regression (GLM) | Regression | Generalized Linear Models (GLM) for regression. GLM uses linear modeling techniques and supports confidence intervals for predictions. Multivariate linear regression generates continuous values.
    Naive Bayes (NB) | Classification | Makes predictions using Bayes's Theorem, which derives the probability of a prediction from the underlying evidence, as observed in the data.
    Adaptive Bayes Network (ABN) | Classification | Builds models based on counts observed from the data. Supports three modes of operation: pruned Naive Bayes, single-tree, and boosted multi-tree. In the single-tree mode, ABN provides model transparency with human interpretable rules.
    Minimum Description Length (MDL) | Attribute Importance | An information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data.
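
    The Naive Bayes entry in the table above says that the probability of a prediction is derived from evidence observed in the data. The following is a minimal sketch of that arithmetic for a single predictor and a binary target. It is not Oracle Data Mining code: the class BayesFromCounts, its method, and the customer counts are hypothetical and serve only to illustrate Bayes' Theorem. A full Naive Bayes model would multiply the likelihoods of several predictors, assuming they are independent given the target.

```java
/** Sketch: Bayes' Theorem applied to counts from build data (one predictor, binary target). */
class BayesFromCounts {

    /**
     * Returns P(target = yes | evidence).
     *
     * @param yesWithEvidence cases with target = yes that show the evidence
     * @param yesTotal        all cases with target = yes
     * @param noWithEvidence  cases with target = no that show the evidence
     * @param noTotal         all cases with target = no
     */
    static double posteriorYes(int yesWithEvidence, int yesTotal,
                               int noWithEvidence, int noTotal) {
        double total = yesTotal + noTotal;
        double priorYes = yesTotal / total;                          // P(yes)
        double priorNo = noTotal / total;                            // P(no)
        double likelihoodYes = (double) yesWithEvidence / yesTotal;  // P(evidence | yes)
        double likelihoodNo = (double) noWithEvidence / noTotal;     // P(evidence | no)
        double jointYes = likelihoodYes * priorYes;                  // P(evidence and yes)
        double jointNo = likelihoodNo * priorNo;                     // P(evidence and no)
        return jointYes / (jointYes + jointNo);                      // Bayes' Theorem
    }

    public static void main(String[] args) {
        // Hypothetical counts: 25 of 100 customers accepted the affinity card;
        // 20 of the 25 acceptors and 20 of the 75 non-acceptors are homeowners.
        double p = posteriorYes(20, 25, 20, 75);
        System.out.println("P(accept | homeowner) is approximately " + p);
    }
}
```

    With these counts, 20 homeowners accepted and 20 did not, so the posterior probability works out to 20 / (20 + 20), or about 0.5, even though only 25% of all customers accepted.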

    Table 1-4 Oracle Data Mining Algorithms for Unsupervised Functions

    Algorithm | Function | Description
    Apriori (AP) | Association | Performs market basket analysis by discovering co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence.
    k-Means (KM) | Clustering | A distance-based clustering algorithm that partitions the data into a predetermined number of clusters. Each cluster has a centroid (center of gravity). Cases (individuals within the population) that are in a cluster are close to the centroid. Oracle Data Mining supports an enhanced version of k-Means. It goes beyond the classical implementation by defining a hierarchical parent-child relationship of clusters.
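
    The k-Means entry above describes cases as being close to the centroid of their cluster. The sketch below shows the two steps that the classical algorithm repeats: assign each case to its nearest centroid, then recompute each centroid as the mean of its cases. It illustrates classical k-Means only, not the enhanced hierarchical version in Oracle Data Mining; the class KMeansStep, its methods, and the sample points are hypothetical.

```java
/** Sketch of the two steps that classical k-Means repeats until assignments stop changing. */
class KMeansStep {

    /** Assignment step: index of the centroid closest to the case (squared Euclidean distance). */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Update step: the new centroid is the attribute-wise mean of the cluster's cases. */
    static double[] centroidOf(double[][] members) {
        double[] centroid = new double[members[0].length];
        for (double[] member : members) {
            for (int d = 0; d < centroid.length; d++) {
                centroid[d] += member[d] / members.length;
            }
        }
        return centroid;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(nearestCentroid(new double[] {1, 2}, centroids)); // 0
        System.out.println(nearestCentroid(new double[] {9, 8}, centroids)); // 1
    }
}
```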


    Data Transformations

    Proper preparation of the data is a key factor in any data mining project. The columns of data must be identified, cleansed, and joined into a single view. Most algorithms require some form of data transformation, such as binning or normalization, and the transformations must be applied to the scoring data just as they are to the build data.

    Oracle Data Mining supports automatic data transformation based on the algorithm and other heuristics. The transformations are applied to the data during the model build process, and the transformation instructions are stored in the model. When the model is applied, the transformation instructions are automatically applied to the scoring data. Automatic data preparation can significantly reduce the time and effort involved in developing a data mining model.
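
    The requirement that scoring data be transformed exactly as the build data was can be illustrated with min-max normalization. The sketch below is not how Oracle Data Mining implements automatic data preparation; the class MinMaxNormalizer and the sample ages are hypothetical. It only shows the principle that the parameters are learned once from the build data, stored, and reused unchanged at apply time.

```java
import java.util.Arrays;

/** Sketch: learn min-max normalization from build data, then reuse it on scoring data. */
class MinMaxNormalizer {
    private final double min; // smallest value seen in the build data
    private final double max; // largest value seen in the build data

    private MinMaxNormalizer(double min, double max) {
        this.min = min;
        this.max = max;
    }

    /** Build step: derive the transformation parameters from the build data only. */
    static MinMaxNormalizer fit(double[] buildColumn) {
        double lo = Arrays.stream(buildColumn).min().getAsDouble();
        double hi = Arrays.stream(buildColumn).max().getAsDouble();
        return new MinMaxNormalizer(lo, hi);
    }

    /** Apply step: reuse the stored parameters; do not recompute them from scoring data. */
    double transform(double value) {
        if (max == min) {
            return 0.0; // constant column: map every value to 0
        }
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        double[] buildAges = {20, 30, 40, 60};
        MinMaxNormalizer ageScaler = fit(buildAges);
        // A scoring value of 50 is scaled with the build-time range [20, 60].
        System.out.println(ageScaler.transform(50)); // (50 - 20) / (60 - 20) = 0.75
    }
}
```

    Scoring values outside the build-time range map outside [0, 1] in this sketch; how such values should be treated is a design decision that the sketch does not address.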

    Oracle Data Mining also uses a consistent methodology for managing missing and sparse data.

    How Do I Use Oracle Data Mining?

    Oracle Data Mining is an option to the Enterprise Edition of Oracle Database. It includes a graphical user interface, a spreadsheet add-in, and programmatic interfaces for SQL, Java, and predictive analytics.

    Oracle Data Miner

    Oracle Data Miner is a graphical tool for developing data mining models in Oracle Database. Oracle Data Miner includes an extensive help system and a readme file with installation instructions. You can download Oracle Data Miner, and a tutorial to help you get started, from the Oracle Technology Network:

    http://www.oracle.com/technology/products/bi/odm/index.html

    Oracle Data Miner provides Activity wizards to guide you through the process of building, testing, and applying a model using any of the algorithms described in Table 1-3 or Table 1-4. The Oracle Data Miner Activity menu is shown in Figure 1-1.

    Table 1-4 (Cont.) Oracle Data Mining Algorithms for Unsupervised Functions

    Algorithm | Function | Description
    Non-Negative Matrix Factorization (NMF) | Feature Extraction | Generates new attributes using linear combinations of the original attributes. The coefficients of the linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.
    One Class Support Vector Machine (One Class SVM) | Anomaly Detection | Builds a profile of one class and, when applied, flags cases that are somehow different from that profile. This allows for the detection of rare cases that are not necessarily related to each other.
    Orthogonal Partitioning Clustering (O-Cluster) | Clustering | Creates a hierarchical, grid-based clustering model. The algorithm creates clusters that define dense areas in the attribute space. A parameter called sensitivity defines the baseline density level.

    See Also: Oracle Data Mining Application Developer's Guide for details.


    Figure 1-1 Oracle Data Miner Activity Menu

    The Oracle Data Miner Tools menu is shown in Figure 1-2. The Tools menu provides options for publishing mining results to Oracle BI Discoverer, for accessing SQL Worksheet, and for setting preferences. You can use the Synchronize Repository option to make all mining objects (including those created by the APIs) visible in the Oracle Data Miner Navigator.

    Figure 1-2 Oracle Data Miner Tools Menu


    Oracle Spreadsheet Add-In for Predictive Analytics

    Predictive Analytics automates the data mining process with routines for PREDICT, EXPLAIN, and PROFILE. These routines, which are implemented in the PL/SQL package DBMS_PREDICTIVE_ANALYTICS, automatically prepare the data and build and apply a data mining model. The model does not persist after the routine completes execution.

    You can use Predictive Analytics routines in an Excel spreadsheet. You can download the latest version of the Spreadsheet Add-In, including a readme file, from the Oracle Technology Network:

    http://www.oracle.com/technology/products/bi/odm/index.html

    You can use the Spreadsheet Add-In to mine local data in Excel, or you can connect to an Oracle database. Figure 1-3 shows the Predictive Analytics Add-In.

    Figure 1-3 Spreadsheet Add-In for Predictive Analytics

    Application Programming Interfaces

    Oracle Data Mining provides programmatic interfaces for SQL and Java.

    PL/SQL Packages

    The Oracle Data Mining PL/SQL API is implemented in the following PL/SQL packages:

    - DBMS_DATA_MINING: Contains routines for building, testing, and applying data mining models.
    - DBMS_DATA_MINING_TRANSFORM: Contains routines for transforming the data sets prior to building or applying a model. Use of these routines is not required, since most transformations can be performed automatically by Oracle Data Mining.
    - DBMS_PREDICTIVE_ANALYTICS: Contains automated data mining routines for PREDICT, EXPLAIN, and PROFILE.

    Note: Since the Oracle Data Miner release schedule does not always coincide with a release of Oracle Database, the user interface images in this manual might not exactly match the user interface in the most recent release of Oracle Data Miner.

    See Also: Chapter 11, "Predictive Analytics".

    See Also: Table 1-5, "Oracle Data Mining Documentation" for API documentation.

    The following example shows the PL/SQL routine for creating an SVM classification model called my_model. The algorithm is specified in a settings table called my_settings. The algorithm must be specified as a setting because Naive Bayes, not SVM, is the default classifier.

    CREATE TABLE my_settings(
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(30));

    -- The INSERT runs inside a PL/SQL block so that the package constants resolve.
    BEGIN
      INSERT INTO my_settings VALUES
        (dbms_data_mining.algo_name,
         dbms_data_mining.algo_support_vector_machines);
    END;
    /

    -- Build the classification model from build_data using the settings above.
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'my_model',
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'build_data',
        case_id_column_name => 'cust_id',
        target_column_name  => 'affinity_card',
        settings_table_name => 'my_settings');
    END;
    /

    SQL Functions

    The Data Mining functions are SQL language operators for the deployment of data mining models. They allow data mining to be easily incorporated into SQL queries, and thus into SQL-based applications.

    The following example illustrates the Data Mining PREDICTION_SET operator. The operator applies the classification model my_model (created in "PL/SQL Packages" on page 1-8) to the data set apply_data.

    column prediction format 9;
    SELECT T.cust_id, S.prediction, S.probability
      FROM (SELECT cust_id,
                   PREDICTION_SET(my_model COST MODEL USING *) pset
              FROM apply_data
             WHERE cust_id < 100011) T,
           TABLE(T.pset) S
    ORDER BY cust_id, S.prediction;

    The SELECT statement returns ten customers, listed by customer ID, along with the likelihood that they will use (1) or reject (0) an affinity card.

       CUST_ID PREDICTION PROBABILITY
    ---------- ---------- -----------
        100001          0  .966183575
        100001          1  .033816425
        100002          0  .740384615
        100002          1  .259615385
        100003          0  .909090909
        100003          1  .090909091
        100004          0  .909090909
        100004          1  .090909091
        100005          0  .272357724
        100005          1  .727642276
        100006          0           1
        100006          1           0
        100007          0  .909090909
        100007          1  .090909091
        100008          0  .909090909
        100008          1  .090909091
        100009          0  .272357724
        100009          1  .727642276
        100010          0  .675965665
        100010          1  .324034335

    Java API

    The Oracle Data Mining Java API is an Oracle implementation of the JDM standard Java API for data mining (JSR-73). The Java API is layered on the PL/SQL API, and the two APIs are fully interoperable.

    The following code fragment creates a Decision Tree model called JDM_TREE_MODEL by creating, saving, and executing a mining model build task.

    BuildTask buildTask = m_buildFactory.create(
        "treeBuildData_jdm", "treeBuildSettings_jdm", "JDM_TREE_MODEL");
    buildTask.setDescription("This is a build task");
    saveTask(buildTask, "treeBuildTask_jdm", null);

    This code fragment creates, stores, and executes a mining model apply task to apply the model.

    DataSetApplyTask applyTask = m_dsApplyFactory.create(
        "JDM_APPLY_PDS", "JDM_TREE_MODEL",
        "JDM_APPLY_SETTINGS", "JDM_APPLY_OUTPUT_TABLE");
    dmeConn.saveObject("JDM_APPLY_TASK", applyTask, true);

    Where Do I Find Information About Oracle Data Mining?

    Oracle Data Mining documentation is included in the documentation set for Oracle Database. Four manuals are dedicated to Oracle Data Mining. SQL and PL/SQL semantics for Oracle Data Mining are documented in Database manuals.

    For your convenience, the Oracle Data Mining and related Oracle Database manuals are listed in Table 1-5. The links in this table will take you directly to the referenced documentation.

    Table 1-5 Oracle Data Mining Documentation

    Go to This Document... | ... To Learn
    Oracle Data Mining Concepts | About the mining functions, algorithms, data preparation, predictive analytics, and other special features supported by Oracle Data Mining.
    Oracle Data Mining Application Developer's Guide | How to use the PL/SQL and Java APIs and the SQL operators for Data Mining.
    Oracle Data Mining Administrator's Guide | How to install and administer a database for Data Mining. How to install and use the demo programs.
    Oracle Data Mining Java API Reference | How to use the Oracle Data Mining Java API syntax (javadoc).


    The help systems for Oracle Data Mining and Oracle Spreadsheet Add-In for Predictive Analytics also provide valuable documentation.

    Oracle Data Mining Resources on the Oracle Technology Network

    The Oracle Technology Network (OTN) is easily accessible and provides a wealth of information. You can visit the Oracle Data Mining home page at:

    http://www.oracle.com/technology/products/bi/odm/index.html

    This site provides news and discussion forums as well as tools and educational materials for download. On this site, you will find:

    - Oracle Data Miner, the graphical user interface to Oracle Data Mining. Oracle Data Miner includes a help system that explains the various mining activities and provides valuable background information.
    - A tutorial that guides you through the use of Oracle Data Miner. The tutorial is an excellent hands-on introduction to Oracle Data Mining.
    - Oracle Spreadsheet Add-In for Predictive Analytics. You can use the Add-In to implement one-click data mining in an Excel spreadsheet.
    - White papers and web casts of presentations and training sessions.
    - Demo programs in SQL and Java that illustrate the Oracle Data Mining APIs.
    - Oracle Data Mining discussion forum at http://forums.oracle.com/forums/forum.jspa?forumID=55
    - Blog on Data Mining and Analytics, with a special focus on Oracle, at http://oracledmt.blogspot.com/

    Oracle Data Mining Publications

    The following books are available on Amazon (http://www.amazon.com/).

    - Java Data Mining: Strategy, Standard, and Practice (The Morgan Kaufmann Series in Data Management Systems), by Mark F. Hornick, Erik Marcadé, and Sunil Venkayala.
    - Oracle Data Mining: Mining Gold from Your Warehouse (Oracle In-Focus series), by Dr. Carolyn Hamm.

    Oracle Data Mining and Oracle Database Analytics

    As described in "Data Mining in the Database Kernel" on page 1-1, the advantages of database analytics are considerable. When analytical capabilities are implemented where the data is stored, the data does not have to be exported to an external server for analysis. The results of analysis do not need to be imported; they reside in the database where they can be easily accessed and combined with other data.

    Table 1-5 (Cont.) Oracle Data Mining Documentation

    Go to This Document... | ... To Learn
    Oracle Database PL/SQL Packages and Types Reference | How to use the Oracle Data Mining PL/SQL syntax.
    Oracle Database SQL Language Reference | How to use the SQL Data Mining operator syntax.
    Oracle Database Reference | How to query the data dictionary views of mining models, mining model attributes, and mining model settings.


    Along with data mining and predictive analytics, Oracle Database supports a wide array of analytical features. Since these features are part of a common server, it is possible to combine them efficiently. The results of analytical processing can be integrated with Oracle Business Intelligence tools such as Oracle Discoverer and Oracle Portal. Taken as a whole, these features make the Oracle Database a powerful platform for developing analytical applications.

    The possibilities for combining different analytics are virtually limitless. Example 1-1 shows data mining and text processing within a single SQL query.

    Example 1-1 Combine Oracle Data Mining and Oracle Text in a SQL Query

    SELECT A.cust_name, A.contact_info
      FROM customers A
     WHERE PREDICTION_PROBABILITY(tree_model, 'attrite' USING A.*) > 0.8
       AND A.cust_value > 90
       AND A.cust_id IN
           (SELECT B.cust_id
              FROM call_center B
             WHERE B.call_date BETWEEN '01-Jan-2005' AND '30-Jun-2005'
               AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);

    The query in Example 1-1 selects all customers who have a high propensity to attrite (> 80% chance), are valuable customers (customer value rating > 90), and have had a recent conversation with customer services regarding a Checking Plus account. The propensity to attrite information is computed using a Data Mining model called tree_model. The query uses the Oracle Text CONTAINS operator to search call center notes for references to Checking Plus accounts.

    Some of the analytics supported by Oracle Database are described in Table 1-6. Use the links in the Documentation column to find the referenced documentation.

    Table 1-6 Oracle Database Analytics

    Analytical Feature | Description | Documentation
    Complex data transformations | Data transformation is a key aspect of analytical applications and ETL (extract, transform, and load). You can use SQL expressions to implement data transformations, or you can use the DBMS_DATA_MINING_TRANSFORM package. DBMS_DATA_MINING_TRANSFORM is a flexible data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities. | Oracle Database PL/SQL Packages and Types Reference
    Frequent Itemsets | The DBMS_FREQUENT_ITEMSET package supports frequent itemset counting, a mechanism for counting how often multiple events occur together. DBMS_FREQUENT_ITEMSET is used as a building block for the Association algorithm used by Oracle Data Mining. | Oracle Database PL/SQL Packages and Types Reference
    Image feature extraction | Oracle interMedia supports the extraction of image features such as color histogram, texture, and positional color. Image features can be used to characterize and analyze images. | Oracle interMedia User's Guide
    Linear algebra | The UTL_NLA package exposes a subset of the popular BLAS and LAPACK (Version 3.0) libraries for operations on vectors and matrices represented as VARRAYs. This package includes procedures to solve systems of linear equations, invert matrices, and compute eigenvalues and eigenvectors. | Oracle Database PL/SQL Packages and Types Reference


    Table 1-6 (Cont.) Oracle Database Analytics

    Analytical Feature | Description | Documentation
    OLAP | Oracle OLAP supports multidimensional analysis and can be used to improve performance of multidimensional queries. Oracle OLAP provides functionality previously found only in specialized OLAP databases. Moving beyond drill-downs and roll-ups, Oracle OLAP also supports time-series analysis, modeling, and forecasting. | Oracle OLAP User's Guide
    Spatial analytics | Oracle Spatial provides advanced spatial features to support high-end GIS and LBS solutions. Oracle Spatial's analysis and mining capabilities include functions for binning, detection of regional patterns, spatial correlation, colocation mining, and spatial clustering. Oracle Spatial also includes support for topology and network data models and analytics. The topology data model of Oracle Spatial allows one to work with data about nodes, edges, and faces in a topology. It includes network analysis functions for computing shortest path, minimum cost spanning tree, nearest-neighbors analysis, and the traveling salesman problem, among others. | Oracle Spatial User's Guide and Reference
    Statistical functions | The Oracle Database provides a long list of SQL statistical functions with support for hypothesis testing (such as t-test and F-test), correlation computation (such as Pearson correlation), cross-tab statistics, and descriptive statistics (such as median and mode). The DBMS_STAT_FUNCS package adds distribution fitting procedures and a summary procedure that returns descriptive statistics for a column. | Oracle Database SQL Language Reference and Oracle Database PL/SQL Packages and Types Reference
    Text Mining | Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the web. It also supports automatic classification and clustering of document collections. Many of these analytical features are layered on top of ODM functionality. | Oracle Text Application Developer's Guide


    2 What Is Data Mining?

    This chapter provides a high-level orientation to data mining technology.

    This chapter includes the following sections:

    - What Is Data Mining?
    - What Can Data Mining Do and Not Do?
    - Data Mining Methodology

    What Is Data Mining?

    Data mining is the practice of automatically searching large stores of data for patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge-Discovery in Databases (KDD).

    The key properties of data mining are:

    - Automatic discovery of patterns
    - Prediction of likely outcomes
    - Creation of actionable information
    - Focus on large data sets and databases

    Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

    See Also:

    Information about data mining is widely available. Here are a few suggestions:

    - http://www.kdnuggets.com/ This site is an excellent source of information about data mining. It includes a bibliography of publications.
    - http://www.twocrows.com/ On this site, you will find the free tutorial, Introduction to Data Mining and Knowledge Discovery, and other useful information about data mining.
    - http://www.crisp-dm.org/ Cross Industry Standard Process for Data Mining. The CRISP-DM process model is summarized in "Data Mining Methodology" on page 2-4.


    Automatic Discovery

    Data mining is accomplished by building models and applying them to data. The notion of automatic discovery refers to the application and deployment of data mining models. The process of applying a model is also called scoring.

    Prediction

    Many forms of data mining are predictive. For example, a data mining model might predict income based on education and other demographic factors. The model would be built using a complete set of data including income figures as well as all the predictive attributes. The results of a predictive model can, and should be, compared with the original data. Predictions have an associated confidence (How confident can I be of this prediction?) and support (How frequently does an association appear in the data set?).
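
    The support and confidence measures mentioned above are easiest to see for an association rule such as "customers who buy bread also buy milk." The sketch below uses the common textbook definitions: support is the fraction of transactions that contain all the items, and confidence is the support of the whole rule divided by the support of its antecedent. It is an illustration only, not Oracle Data Mining code; the class RuleMetrics, its methods, and the basket data are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: support and confidence of an association rule over a list of transactions. */
class RuleMetrics {

    /** Fraction of transactions that contain every item in the itemset. */
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    /** Confidence of "antecedent implies consequent": support(both) / support(antecedent). */
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        // The result is NaN if the antecedent never occurs in the data.
        return support(transactions, both) / support(transactions, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
            Set.of("bread", "milk"), Set.of("bread"), Set.of("milk"), Set.of("butter"));
        System.out.println(support(baskets, Set.of("bread", "milk")));            // 0.25
        System.out.println(confidence(baskets, Set.of("bread"), Set.of("milk"))); // 0.5
    }
}
```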

    Grouping

    Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that belongs to AAA, has a good driving record, and leases a new car on a yearly basis.

    Actionable Information

    Data mining can derive actionable information from large volumes of data. For example, a college might use the model described under "Prediction" to target specific high schools for recruiting. A car leasing agency might use the model described under "Grouping" to design a discount policy for AAA members.

    Data Mining and Statistics

    There is a great deal of overlap between data mining and statistics. In fact, most of the techniques used in data mining can be placed in a statistical framework. However, data mining techniques are not the same as traditional statistical techniques.

    Traditional statistical methods, in general, require a great deal of user interaction in order to validate the correctness of a model. As a result, statistical methods can be difficult to automate. Moreover, statistical methods typically do not scale well to very large data sets.

    Data mining methods are suitable for large data sets and can be more readily automated. In fact, data mining algorithms often require large data sets for the creation of quality models.

    Data Mining and OLAP

    On-Line Analytical Processing (OLAP) can be defined as fast analysis of shared multidimensional data. OLAP and data mining are different but complementary activities.

    OLAP supports activities such as data summarization, cost allocation, time series analysis, and what-if analysis. However, most OLAP systems do not have inductive inference capabilities beyond the support for time-series forecast. Inductive inference is a characteristic of data mining.

    See Also: "Data Mining Functions" on page 1-2 for an overview of predictive and descriptive data mining. A general introduction to algorithms is provided in "Data Mining Algorithms" on page 1-4.


    OLAP systems provide a multidimensional conceptual view of the data, including full support for hierarchies. This view of the data is a natural way to analyze businesses and organizations. Data mining, on the other hand, usually does not have a concept of dimensions and hierarchies.

    Data mining and OLAP can be integrated in a number of ways. For example, data mining can be used to select the dimensions for a cube, create new values for a dimension, or create new measures for a cube. OLAP can be used to analyze data mining results at different levels of granularity.

    Data Mining and Data Warehousing

    Data mining does not require a data warehouse. However, data mining is easier within the context of a data warehouse.

    Data cleansing is a central feature of data warehousing. Data cleansing is also required for data mining. If the data has already been cleansed for a data warehouse, then it is probably suitable for mining. Furthermore, the data consolidation and maintenance procedures implemented in a data warehouse also facilitate data mining activities.

    What Can Data Mining Do and Not Do?

    Data mining is a powerful tool that can help you find patterns and relationships within your data. But data mining does not work by itself. It does not eliminate the need to know your business, to understand your data, or to understand analytical methods.

    Data mining discovers information in your data, but it does not tell you the value of the information to your organization. Furthermore, the patterns uncovered by data mining must be verified in the real world.

    You might already be aware of important patterns as a result of working with your data over time. Data mining can confirm or qualify such empirical observations in addition to finding new, subtle patterns that are not discernible through simple observation.

    It is important to remember that the predictive relationships discovered through data mining are not necessarily causes of an action or behavior. For example, data mining might determine that males with incomes between $50,000 and $65,000 who subscribe to certain magazines are likely to buy a given product. You can use this information to help you develop a marketing strategy. However, you should not assume that the population identified through data mining will buy the product because they belong to this population.

    Asking the Right Questions

    Data mining does not automatically discover solutions without guidance. The patterns you find through data mining will be very different depending on how you formulate the problem.

    To obtain meaningful results, you must learn how to ask the right questions. For example, rather than trying to learn how to "improve the response to a direct mail solicitation," you might try to find the characteristics of people who have responded to your solicitations in the past.


    Understanding Your Data

    To ensure meaningful data mining results, you must understand your data. Data mining algorithms are often sensitive to specific characteristics of the data: outliers (data values that are very different from the typical values in your database), irrelevant columns, columns that vary together (such as age and date of birth), data coding, and data that you choose to include or exclude.
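
    One simple way to spot columns that vary together is to compute the Pearson correlation between them; a value close to 1 or -1 suggests that the two columns carry nearly the same information. The sketch below is an illustration only, not part of Oracle Data Mining; the class ColumnCorrelation and the sample values (ages and the corresponding birth years, assuming the data was collected in 2007) are hypothetical.

```java
/** Sketch: Pearson correlation between two numeric columns of equal length. */
class ColumnCorrelation {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("columns must have the same length (at least 2)");
        }
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / x.length;
        double meanY = sumY / y.length;

        double covariance = 0.0; // sum of products of deviations
        double varianceX = 0.0;  // sum of squared deviations of x
        double varianceY = 0.0;  // sum of squared deviations of y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The result is NaN if either column is constant.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        double[] age = {25, 40, 58};
        double[] birthYear = {1982, 1967, 1949};
        // Age falls exactly as birth year rises, so the correlation is -1.
        System.out.println(pearson(age, birthYear));
    }
}
```

    A pair of columns this strongly correlated is a candidate for dropping one of the two before building a model, although that decision depends on the algorithm and the problem.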

    Oracle Data Mining can automatically perform much of the data preparation required by the algorithm. But some of the data preparation is typically specific to the domainor the data mining problem. At any rate, you need to understand the data that wasused to build the model in order to properly interpret the results when the model isapplied.

    Understanding the ToolOracle Data Miner is a graphical tool that guides you through the various activities ofdata mining. It shelters you from the intricacies of the algorithms and the statisticaltechniques used in data mining, and it lets you perform the full range of mining taskswithout having to learn the programmatic interfaces. Nevertheless, you still need tounderstand how the tool works and the choices that it offers. The settings andoptimizations you choose will affect the accuracy and efficiency of your models.

    Data Mining MethodologyThe Cross-Industry Standard Process for Data Mining (CRISP-DM) is anon-proprietary, freely available, standard process model for data mining. CRISP-DMwas developed by a group of early adopters of data mining technology. The purposewas to provide guidelines for new users and to demonstrate the maturity of thetechnology to prospective users. Today, CRISP-DM is the industry standardmethodology for data mining and predictive analytics.

    The principal features of CRISP-DM are summarized in this section. For morecomprehensive information, visit the CRISP-DM web site athttp://www.crisp-dm.org/index.htm . The CRISP 1.0 Process and User Guide isavailable for download on the web site.

    Life Cycle of a Data Mining ProjectCRISP-DM defines six phases in the life cycle of a data mining project, but thesequence of the phases is not strict. Data mining is an iterative process. Moving backand forth between different phases is always required. Figure 21 illustrates the lifecycle of a data mining project. The arrows indicate the most important and frequentdependencies between phases.


    Figure 2-1 CRISP-DM Life Cycle of a Data Mining Project

    The outer circle in the figure symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones.

    Phases of a Data Mining Project

    CRISP-DM defines the phases of a data mining project as follows:

    1. Business Understanding

    This initial phase focuses on understanding the project objectives and requirements. Once you have specified the project from a business perspective, you can formulate it as a data mining problem and develop a preliminary implementation plan.

    For example, your business problem might be: "How can I sell more of my product to customers?" You might translate this into a data mining problem such as: "Which customers are most likely to purchase the product?" A model that predicts who is most likely to purchase the product must be built on data that describes the customers who have purchased the product in the past. Before building the model, you must specify the data set.

    2. Data Understanding

    The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also the time to identify data quality problems and to scan for patterns in the data.

    It often makes sense to begin by working with a reasonable sample of the data, since the final data set might consist of hundreds or thousands of records.

    3. Data Preparation

    The data preparation phase covers all the additional tasks involved in creating the final data set. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as data cleansing and transformation. For example, you might transform a DATE_OF_BIRTH column to AGE; you might insert the average income in records where the INCOME column is null.
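The two transformations mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are held as in-memory dictionaries, the reference date is supplied by the caller, and the function names `age_on` and `prepare` are hypothetical. It does not show how Oracle Data Mining performs these steps.

```python
from datetime import date
from statistics import mean

def age_on(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    years = today.year - birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

def prepare(records, today):
    """Derive AGE from DATE_OF_BIRTH and replace null INCOME with the average."""
    known = [r["INCOME"] for r in records if r["INCOME"] is not None]
    avg_income = mean(known)
    return [
        {
            "AGE": age_on(r["DATE_OF_BIRTH"], today),
            "INCOME": r["INCOME"] if r["INCOME"] is not None else avg_income,
        }
        for r in records
    ]

# Hypothetical sample data, not taken from any Oracle schema.
customers = [
    {"DATE_OF_BIRTH": date(1970, 6, 15), "INCOME": 52000},
    {"DATE_OF_BIRTH": date(1985, 1, 2), "INCOME": None},
    {"DATE_OF_BIRTH": date(1990, 12, 31), "INCOME": 48000},
]
prepared = prepare(customers, today=date(2007, 1, 31))
```

The average is computed only from the non-null incomes, so the second record receives the mean of the other two.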

    4. Modeling

    In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values. Typically, there are several techniques for the same data mining problem type.

    Some data transformations must be performed to meet the requirements of a data mining algorithm. This necessitates stepping back to the previous phase.

    5. Evaluation

    At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

    6. Deployment

    Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a usable format.

    Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.

    Note: Oracle Data Mining can perform algorithm-specific transformations automatically, which greatly facilitates the modeling phase.


    About Classification


    Classification problems can have either binary or multiclass targets. Binary targets are those that take on only two values, for example, good credit risk and poor credit risk. Multiclass targets have more than two values, for example, the product purchased (comb or hair brush or hair pin). Multiclass target values are not assumed to exist in an ordered relation to each other, for example, hair brush is not assumed to be greater or less than comb.

    Classification problems may require the specification of Costs, described on page 3-7, and Priors, described on page 3-8.

    Oracle Data Mining provides the following algorithms for classification:

    - Decision Tree
    - Naive Bayes
    - Adaptive Bayes Network
    - Support Vector Machine
    - Logistic Regression

    Table 3-1 compares several important features of the classification algorithms.

    Data Preparation for Classification

    This section summarizes data preparation that may be required by classification algorithms.

    Outliers

    Outliers affect classification algorithms as follows:

    - Naive Bayes and Adaptive Bayes Network: The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of these algorithms may be significantly reduced. In this case, quantile binning helps to overcome these problems.
    - Support Vector Machine: The presence of outliers can significantly impact models. Use a clipping transformation to avoid the problems caused by outliers.
    - Decision Tree: The presence of outliers does not impact decision tree models.
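The effect of a single outlier on equal-width binning, and the two remedies named above, can be illustrated with a minimal Python sketch. The sample values, the function names, and the clipping bounds are assumptions chosen for illustration; the rank-based quantile binning shown here does not keep tied values together and is not the Oracle Data Mining transformation itself.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    # The maximum value falls into the last bin rather than a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Assign bins by rank so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def clip(values, lower, upper):
    """Clipping transformation: pull values outside [lower, upper] back to the bounds."""
    return [max(lower, min(upper, v)) for v in values]

incomes = [30, 32, 35, 38, 40, 42, 45, 5000]  # hypothetical data with one extreme outlier
ew = equal_width_bins(incomes, 4)   # every typical value lands in bin 0
qb = quantile_bins(incomes, 4)      # two values per bin
clipped = clip(incomes, 30, 45)     # outlier pulled back to the upper bound
```

With equal-width bins, seven of the eight values share one bin, so the binned attribute carries almost no information. Quantile bins spread the same values evenly.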

    NULL Values

    The meaning of NULL values and how to treat them depends on the algorithm as follows:

    Table 3-1 Classification Algorithm Comparison

    Feature                      | Naive Bayes          | Adaptive Bayes Network              | Support Vector Machine    | Decision Tree
    -----------------------------+----------------------+-------------------------------------+---------------------------+---------------------
    Speed                        | Very fast            | Fast                                | Fast with active learning | Fast
    Accuracy                     | Good in many domains | Good in many domains                | Significant               | Good in many domains
    Transparency                 | No rules (black box) | Rules for Single Feature Build only | No rules (black box)      | Rules
    Missing value interpretation | Missing value        | Missing value                       | Sparse data               | Missing value


    - Support Vector Machine: NULL values indicate sparse data. Missing values are not automatically handled. If the data is not sparse and the values are indeed missing at random, it is necessary to perform missing data imputation (that is, perform some kind of missing values treatment) and substitute a non-NULL value for the NULL value. One simple approach is to use the mean for numerical attributes and the mode for categorical attributes. If you do not treat missing values, the algorithm will not handle the data correctly.

    For all other classification algorithms, NULL values indicate missing values:

    - Decision Tree, Naive Bayes, and Adaptive Bayes Network: Missing values are handled automatically.
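The simple imputation approach described for Support Vector Machine (mean for numerical attributes, mode for categorical attributes) might look like the following Python sketch. It assumes each attribute is a plain list in which `None` stands for NULL; the function name `impute` is hypothetical.

```python
from statistics import mean, mode

def impute(column):
    """Replace None with the mean (numerical column) or the mode (categorical column)."""
    present = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]

ages = impute([25, None, 35, 45])                 # numerical: filled with the mean
colors = impute(["red", "blue", None, "red"])     # categorical: filled with the mode
```

This treatment is appropriate only when the values are genuinely missing. When NULL means "not present" in sparse data, substituting a mean or mode would change the meaning of the data.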

    Normalization

    Support Vector Machine may benefit from normalization.

    Classification Algorithms

    Decision Tree

    Decision tree rules provide model transparency so that a business user, marketing analyst, or business analyst can understand the basis of the model's predictions, and therefore, be comfortable acting on them and explaining them to others.

    In addition to transparency, the Decision Tree algorithm provides speed and scalability. The build algorithm scales linearly with the number of predictor attributes and on the order of n log(n) with the number of rows, n. Scoring is very fast. Both build and apply are parallelized.

    The Decision Tree algorithm builds models for binary and multiclass targets. It produces accurate and interpretable models with relatively little user intervention required. The Decision Tree algorithm is implemented in such a way as to handle data in the typical data table formats, to have reasonable defaults for splitting and termination criteria, to perform automatic pruning, and to perform automatic handling of missing values. However, it does not distinguish sparse data from missing data. (See for more information.) Users can specify costs and priors.

    Decision Tree does not support nested tables.

    Decision Tree models can be converted to XML.

    Decision Tree Rules

    A Decision Tree model always produces rules. Decision tree rules are in the form "IF predictive information THEN target," as in "IF income is greater than $70K and household size is greater than 3 THEN the probability of Churn is 0.075."
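To make the "IF predictive information THEN target" form concrete, the example rule can be written as a small Python structure. The `Rule` class, its field names, and the row layout are hypothetical and are used only to show how such a rule reads when applied to one record; they do not represent the format in which Oracle Data Mining stores rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    condition: Callable[[Dict], bool]  # the IF part (predictive information)
    target: str                        # the THEN part (predicted target)
    probability: float                 # probability attached to the target

# "IF income is greater than $70K and household size is greater than 3
#  THEN the probability of Churn is 0.075."
churn_rule = Rule(
    condition=lambda row: row["income"] > 70_000 and row["household_size"] > 3,
    target="Churn",
    probability=0.075,
)

def apply_rule(rule, row):
    """Return (target, probability) when the row satisfies the rule, else None."""
    return (rule.target, rule.probability) if rule.condition(row) else None
```

A record with income $85K and household size 4 satisfies the condition, while a record with income $60K does not, whatever its household size.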

    XML for Decision Tree Models

    You can generate XML representing a decision tree model; the generated XML satisfies the definition specified in the Data Mining Group Predictive Model Markup Language (PMML) version 2.1 specification. The specification is available at http://www.dmg.org.


    Naive Bayes

    The Naive Bayes algorithm (NB) can be used for both binary and multiclass classification problems.

    NB builds and scores models extremely rapidly; it scales linearly in the number of predictors and rows.

    NB makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states that the probability of event A occurring given that event B has occurred (P(A|B)) is proportional to the probability of event B occurring given that event A has occurred multiplied by the probability of event A occurring (P(B|A)P(A)).

    Naive Bayes makes the assumption that each attribute is conditionally independent of the others, that is, given a particular value of the target, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.
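The following Python sketch shows how the proportionality P(A|B) ∝ P(B|A)P(A) and the conditional independence assumption combine into a classifier over categorical attributes. It is a textbook illustration, not the Oracle Data Mining implementation; the add-one smoothing, the function names, and the sample data are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def train(rows, targets):
    """Count class frequencies and, per class, the values seen for each attribute."""
    class_counts = Counter(targets)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    distinct = defaultdict(set)          # attribute index -> values seen in training
    for row, target in zip(rows, targets):
        for j, value in enumerate(row):
            value_counts[(j, target)][value] += 1
            distinct[j].add(value)
    return class_counts, value_counts, distinct

def predict_proba(model, row):
    """Score each class as P(class) times the product of P(value | class)."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(A)
        for j, value in enumerate(row):
            # Independence assumption: multiply per-attribute likelihoods.
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            score *= (value_counts[(j, cls)][value] + 1) / (n_cls + len(distinct[j]))
        scores[cls] = score
    norm = sum(scores.values())  # turn "proportional to" into probabilities
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical training data: (income band, has responded before) -> outcome.
rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no"), ("high", "yes")]
targets = ["buy", "buy", "skip", "skip", "buy"]
model = train(rows, targets)
probs = predict_proba(model, ("high", "yes"))
```

Training is a single pass of counting over the rows and predictors, which is consistent with the linear scaling noted above.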

    Adaptive Bayes Network

    Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)

    ABN, in Single Feature Build mode, can describe the model in the form of human-understandable rules. The rules produced by ABN are one of its main advantages over Naive Bayes. ABN rules provide model transparency so that a business user, marketer, or business analyst can understand the basis of the model's predictions and therefore, be comfortable acting on them and explaining them to others.

    In addition to rules, ABN provides performance and scalability, which are derived from various user parameters controlling the trade-off of accuracy and build time.

    ABN predicts binary as well as multiclass targets.

    ABN can use costs and priors for both building and scoring (see "Costs" on page 3-7 and "Priors" on page 3-8).

    ABN Model Types

    An ABN model is an adaptive conditional independence model that uses the minimum description length principle to construct and prune an array of conditionally independent network features. Each network feature consists of one or more conditional probability expressions. The collection of network features forms a product model that provides estimates of the target class probabilities. There can be one or more network features. The number and depth of the network features in the model determine the model mode. There are three model modes for ABN:

    - Pruned Naive Bayes (Naive Bayes Build)
    - Simplified decision tree (Single Feature Build)
    - Boosted (Multi Feature Build)

    Users can select the ABN model type. Rules are available only for Single Feature Build.


    Each network feature consists of one or more attributes included in a conditional probability expression. An array of single attribute network features is an MDL-pruned Naive Bayes model. A single multi-attribute network feature model is equivalent to a simplified C4.5 decision tree; such a model is simplified in the sense that numerical attributes are binned and treated as categorical. Furthermore, a single predictor is used to split all nodes at a given tree depth. The splits are k-way, where k is the number of unique (binned) values of the splitting predictor. Finally, a collection of multi-attribute network features forms a product model (boosted mode). All three types provide estimates of the target class probabilities.

    ABN Rules

    Rules can be extracted from an Adaptive Bayes Network model as compound predicates. Rules form a human-interpretable depiction of the model and include statistics indicating the number of the relevant build data instances in support of the rule. A record apply instance specifies a pathway in a network feature taking the form of a compound predicate.

    For example, suppose the feature consists of two training attributes: Age {20-40, 40-60, 60-80} and Income {50K}. A record instance consisting of a person age 25 and income $42K is expressed as

    IF AGE IN (20-40) and INCOME IN (


    SVM projects the input data into a kernel space. Then it builds a linear model in this kernel space. A classification SVM model attempts to separate the target classes with the widest possible margin. A regression SVM model tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it. Different types of kernels and different kernel parameter choices can produce a variety of decision boundaries (classification) or function approximators (regression). The Oracle Data Mining SVM implementation supports two types of kernels: linear and Gaussian. Oracle Data Mining also provides automatic parameter estimation based on the characteristics of the data.
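The two kernel types named above, and the idea of a linear model in kernel space, can be sketched in Python as follows. The Gaussian kernel is written in one common textbook parameterization with a width parameter `sigma`; how Oracle Data Mining parameterizes its kernel is not stated here, so treat the formula, the function names, and the default value as assumptions.

```python
import math

def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def decision_value(support_vectors, coefficients, bias, kernel, x):
    """Linear model in kernel space: sum of coef_i * K(sv_i, x), plus a bias."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)) + bias
```

For a classification model, the sign of `decision_value` would indicate the side of the decision boundary on which `x` falls. Swapping `linear_kernel` for `gaussian_kernel` changes the shape of that boundary without changing the rest of the computation.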

    SVM performs well with real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and biosequence analysis. The introduction of SVM in the early 1990s led to an explosion of applications and deepening theoretical analysis that established SVM along with neural networks as one of the standard tools for machine learning and data mining.

    There is no upper limit on the number of attributes and target cardinality for SVMs; the only constraints are those imposed by hardware.

    SVM is the preferred algorithm for sparse data.

    The following new features have been added to the SVM algorithm in Oracle Data Mining 10g Release 2:

    - SVM can also be used to identify novel or anomalous patterns using one-class SVM. For more information, see Chapter 6, "Anomaly Detection".
    - SVM supports active learning. For more information, see "Active Learning" on page 3-6.
    - SVM automatically creates stratified samples for large training sets (see "Sampling for Classification" on page 3-6) and automatically chooses a kernel type for model build (see "Automatic Kernel Selection" on page 3-7).

    Active Learning

    SVM models grow as the size of the build data set increases. This property limits SVM models to small and medium size training sets (less than 100,000 cases). Active learning provides a way to deal with large training sets.

    The termination criterion for active learning is usually an upper bound on the number of support vectors; when the upper bound is attained, the build stops. Alternatively, the stopping criterion can be qualitative, such as no significant improvement in model accuracy on a held-aside sample.

    Active learning forces the SVM algorithm to restrict learning to the most informative training examples and not to attempt to use the entire body of data. In most cases, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

    Active learning can be applied to all SVM models (classification, regression, and one-class).

    Active learning is on by default. It can be turned off.

    Sampling for Classification

    For classification, SVM automatically performs stratified sampling during model build. The algorithm scans the entire build data set and selects a sample that is balanced across target values.
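A sample that is balanced across target values can be drawn as in the Python sketch below. This is a generic illustration: the function name, the fixed per-class sample size, and the seeded random generator are assumptions, and the sketch says nothing about how the SVM build chooses its sample size internally.

```python
import random
from collections import defaultdict

def balanced_sample(rows, target_of, per_class, seed=0):
    """Draw up to per_class rows for each distinct target value."""
    by_target = defaultdict(list)
    for row in rows:  # one scan over the entire build data set
        by_target[target_of(row)].append(row)
    rng = random.Random(seed)
    sample = []
    for group in by_target.values():
        k = min(per_class, len(group))  # a rare class contributes all it has
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical build data: 95 NO cases and only 5 YES cases.
rows = [("a%d" % i, "NO") for i in range(95)] + [("b%d" % i, "YES") for i in range(5)]
sample = balanced_sample(rows, target_of=lambda r: r[1], per_class=5)
```

In the unsampled data the YES class makes up 5% of the cases; in the sample each target value contributes the same number of rows.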


    Automatic Kernel Selection

    SVM automatically determines the appropriate kernel type based on build data characteristics. This selection can be overridden by explicitly specifying a kernel type.

    Data Preparation and Settings Choice for Support Vector Machines

    You can influence both the Support Vector Machine (SVM) model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings. Significant performance degradation can be caused by a poor choice of settings or inappropriate data preparation. Poor settings choices can also lead to inaccurate models.

    For detailed information about data preparation for SVM models, see the Oracle Data Mining Application Developer's Guide.

    SVM has built-in mechanisms that attempt to choose appropriate settings automatically based on the data provided. You may need to override the system-determined settings for some domains.

    Binary Logistic Regression

    Oracle Data Mining supports generalized linear models (GLM) for both classification and regression. GLM supports binary logistic regression for classification and multivariate linear regression for regression. See "Generalized Linear Models" on page 4-2.

    Binary logistic regression predicts the probability for each row of scoring data. The dependent variable (target) is binary and categorical. For example, demographic attributes might be used to predict whether customer response to a promotion is low or high.

    Costs

    In a classification problem, it may be important to specify the costs involved in making an incorrect decision. Doing so can be useful when the costs of different misclassifications vary significantly.

    For example, suppose the problem is to predict whether a user will respond to a promotional mailing. The target has two categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and that it costs $5 to do the mailing. If the model predicts YES and the actual value is YES, the cost of misclassification is $0. If the model predicts YES and the actual value is NO, the cost of misclassification is $5. If the model predicts NO and the actual value is YES, the cost of misclassification is $500. If the model predicts NO and the actual value is NO, the cost is $0. In this case, you would probably want to avoid cases where the model predicts NO and the actual value is YES.

    Exactly how costs are specified depends on the classification algorithm used:

    - NB and ABN use a cost matrix
    - SVM uses weights

    The cost of misclassification is summarized in a cost matrix. The rows of the matrix represent actual values and the columns, predicted values. A cell in the matrix represents the misclassification cost that occurs when the model predicts the class indicated by the column when the class is really the one specified by the row.

    Classification algorithms apply the cost information to the predicted probabilities during scoring to estimate the least expensive prediction. If a cost matrix is specified for scoring, the output of the scoring is the minimum cost for the prediction. If no cost matrix is supplied, the output is the most likely prediction.
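The difference between the most likely prediction and the least expensive prediction can be worked through with the promotional mailing costs given earlier ($500 for a missed responder, $5 for a wasted mailing). The Python sketch below is an illustration only: the nested-dictionary layout `cost[actual][predicted]`, the function names, and the 2% response probability are assumptions, not part of any Oracle Data Mining interface.

```python
# Rows are actual values, columns are predicted values: COST[actual][predicted].
COST = {
    "YES": {"YES": 0, "NO": 500},  # actual responder predicted NO loses $500
    "NO": {"YES": 5, "NO": 0},     # actual non-responder predicted YES wastes $5
}

def expected_cost(probabilities, predicted, cost=COST):
    """Expected cost of a prediction, weighting each actual class by its probability."""
    return sum(p * cost[actual][predicted] for actual, p in probabilities.items())

def least_cost_prediction(probabilities, cost=COST):
    """Prediction with the minimum expected cost (the class labels are the matrix keys)."""
    return min(cost, key=lambda predicted: expected_cost(probabilities, predicted, cost))

def most_likely_prediction(probabilities):
    """Prediction used when no cost matrix is supplied."""
    return max(probabilities, key=probabilities.get)

# A customer the model considers only 2% likely to respond.
probs = {"YES": 0.02, "NO": 0.98}
```

For this customer, predicting YES has an expected cost of 0.98 × $5 = $4.90 and predicting NO has an expected cost of 0.02 × $500 = $10.00. The most likely prediction is NO, but the least expensive prediction is YES, so mailing to this customer is still worthwhile.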

    You must be careful how you assign costs. You are making a trade-off between false positives (falsely accusing someone of fraud) and false negatives (letting a crime go unpunished). Your costs should reflect this trade-off. Perhaps you are willing to let some crimes go unpunished so that you don't falsely accuse millions of committing fraud; for example, you must be sure that you are right before you accuse someone (say 99%, rather than just 50% sure). Predicting on probability means you are indifferent to the type of error you make. If you are concerned about the type of error, a cost matrix or carefully adjusted weights are warranted.

    In classification models, you can specify a cost matrix to represent the costs associated with false positive and false negative predictions. A cost matrix can be used when the model is created or when it is applied, as indicated by the cost_matrix_type_create and cost_matrix_type_score settings.

    cost_matrix_type_create: The cost matrix will be used when the model is created. (Decision Tree only)

    cost_matrix_type_score: The cost matrix will be used when the model is applied.

    To specify the cost matrix, create a cost matrix table with the columns described in Table 3-2 and provide its name in the clas_cost_table_name setting for the model.

    Oracle Data Mining enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.

    The data input to each test computation (a COMPUTE procedure in PL/SQL, or a TestMetrics object in Java) is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the computation generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.
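The idea of trying several cost matrices against one stored set of apply results can be sketched in Python. This is a conceptual illustration and not the PL/SQL COMPUTE procedures or the Java TestMetrics object: the `evaluate` function, the stored probabilities, and the two candidate matrices are all hypothetical.

```python
def evaluate(apply_results, actuals, cost):
    """Re-score stored probabilities under a cost matrix and total the incurred cost."""
    total = 0
    for probs, actual in zip(apply_results, actuals):
        # Choose the prediction with the lowest expected cost for this row.
        predicted = min(
            cost, key=lambda c: sum(p * cost[a][c] for a, p in probs.items())
        )
        total += cost[actual][predicted]  # cost[actual][predicted], as in the cost matrix
    return total

# Stored once: predicted probabilities on test data and the known actual values.
apply_results = [
    {"YES": 0.6, "NO": 0.4},
    {"YES": 0.1, "NO": 0.9},
    {"YES": 0.005, "NO": 0.995},
    {"YES": 0.3, "NO": 0.7},
]
actuals = ["YES", "NO", "NO", "YES"]

# Candidate cost matrices to compare; no model rebuild or re-apply is needed.
candidates = {
    "mailing": {"YES": {"YES": 0, "NO": 500}, "NO": {"YES": 5, "NO": 0}},
    "uniform": {"YES": {"YES": 0, "NO": 1}, "NO": {"YES": 1, "NO": 0}},
}
costs = {name: evaluate(apply_results, actuals, m) for name, m in candidates.items()}
```

Only the cheap re-scoring step is repeated for each candidate matrix, which is what makes the iterative experimentation described above practical.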

    Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY operation along with the results of APPLY on your scoring data. RANK_APPLY wil