


The SPM Salford Predictive Modeler software suite is a highly accurate and ultra-fast platform for developing predictive, descriptive, and analytical models from databases of any size, complexity, or organization. The SPM software suite’s automation accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. While the analyst is always in full control, we optionally anticipate the analyst’s next best steps and package a complete set of results from alternative modeling strategies for easy review. Do in one day what normally requires a week or more using other systems.

Introducing

CART is the definitive classification tree, generating clear and easy-to-understand flow-chart representations of predictive models. It is applicable to data sets of almost any size, from the smallest to the largest.

TreeNet Gradient Boosting is Salford's most flexible and powerful data mining tool, consistently generating extremely accurate models. TreeNet demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error-correcting process to converge to an accurate model. TreeNet has been responsible for the majority of Salford’s modeling competition awards.

MARS is ideal for users who prefer results in a form similar to traditional regression while capturing essential nonlinearities and interactions.

Random Forests features include cluster and segment discovery, anomaly detection and tagging, and multivariate class discrimination. Random Forests has established itself as one of the most powerful and flexible learning machines for both classification and regression. Developed in collaboration with Leo Breiman, Random Forests excels at detecting important predictors even in the presence of tens of thousands of factors.

...and see our recently released data mining technologies, including Generalized PathSeeker™, ISLE™, and RuleLearner™.

9685 Via Excelencia, Suite 208, San Diego, CA 92126
Telephone: (619) 543-8880 Fax: (619) 543-8888

http://www.salford-systems.com

Simply Superior

Salford Systems’ tools have dominated the fiercely contested field of data mining competitions for more than a decade. Salford tools have won an award in almost every year since 2000. No other vendor has come close to our record of excellence. Here is a partial list of our wins:

2015 DMA Analytics Challenge, Repeat Purchase Task

2013 DMA Analytics Challenge, Healthcare Response Task

2010 DMA Analytics Challenge, Make-A-Wish Foundation Targeting Solution, Lapsed Donor Segments

INFORMS 2009 Healthcare Quality Task

2009 KDDCup CRM task, telecom dataset

2008 DMA Analytics Challenge Direct Marketing Optimization task

2008 Scientific Computing Data Mining Readers’ Choice Award

2007 DMA Analytics Challenge Targeted Marketing task

2007 PAKDD Cross-selling task, financial dataset

2006 PAKDD Upselling task, telecom dataset

2004 KDDCup Particle Physics task

2002 Duke/TeraData Churn Modeling, CRM

2000 KDDCup Web Analytics

SALFORD PREDICTIVE MODELER

The Company

Founded in 1983, Salford Systems specializes in advanced data mining and predictive analytics software and consultation services. Applications in both software and consulting span market research segmentation, direct marketing, fraud detection, credit scoring, ad serving, risk management, biomedical research, manufacturing quality control and more. Industries using Salford Systems products and services include telecommunications, transportation, banking, financial services, insurance, health care, manufacturing, retail and catalog sales, and education. Salford Systems software is installed at more than 3,500 sites worldwide, including 300 major universities.

Scientific Pedigree

Salford Systems maintains an active R&D program, staffed by researchers trained at Harvard, MIT, Stanford, and UC Berkeley, and leveraging our ties to leading universities.

The Services

Salford Systems offers corporations and management consulting companies a variety of analytical and strategic consulting services. Salford pairs business consultants and technical Ph.D.s with experienced scientific programmers to find innovative solutions for complex modeling and data analysis problems. Salford Systems maintains a rapid response data mining center equipped with high speed servers and massive storage capacity. Demonstration projects and proof of concept studies can be planned and executed in as little as one week, and assessments of the value of large scale data mining projects can be generated quickly and cost effectively. Salford Systems conducts large scale data mining projects from initial conceptualization to the final installation of productivity software.


Training

Salford Systems offers an ongoing series of data mining training seminars for CART, MARS, TreeNet, RandomForests and modern approaches to regression, classification, and segmentation. Training can be delivered on site, via Webex, and in periodic public courses in major cities worldwide.

SPM accelerates the process of model building via 70+ automated modeling processes including variable reduction, multiple splitting rules, missing value handling strategies, random selection of predictors, sample size optimization, and much more.

Salford Predictive Modeler Software Suite

SPM analysis methods include GUI and non-GUI versions of CART, MARS, TreeNet and RandomForests, as well as command-line versions of TreeNet ICL, Generalized PathSeeker, ISLE, RuleLearner, and linear and logistic regression.

Complete descriptive statistics in table and graphic form, including detailed quantile reports and complete tabulation, are available. All summary tables can be saved to Excel spreadsheets.


The SPM Salford Predictive Modeler software suite is a highly accurate and ultra-fast platform for developing predictive, descriptive, and analytical models from databases of any size, complexity, or organization. SPM is in use in major organizations and by leaders in fraud detection, credit risk, insurance, direct marketing, online analytics, manufacturing, pharmaceuticals, logistics, natural resources, auditing, security, national defense, and more.

The SPM software suite’s data mining technologies span classification, regression, survival analysis, missing value analysis, and clustering/segmentation to cover all aspects of your data science projects. SPM algorithms are considered to be essential in sophisticated data science circles.

SPM incorporates algorithms by the original creators of CART: Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Friedman continues to collaborate with Salford Systems to enhance the algorithms with proprietary technology. Breiman and Friedman are responsible for many of the ideas now taught in all graduate courses in data mining and predictive modeling. No other organization can boast of such a powerful and influential brain trust.

The SPM software suite’s automation accelerates the process of model building by conducting substantial portions of the model exploration and refinement process for the analyst. While the analyst is always in full control, we optionally anticipate the analyst’s next best steps and package a complete set of results from alternative modeling strategies for easy review. Do in one day what normally requires a week or more using other systems!

Classification and Regression Trees

Engines

CART is the ultimate classification tree: it revolutionized the field of advanced analytics and inaugurated the current era of data science. CART, which is continually being improved, is one of the most important tools in modern data science. Others have tried to copy CART, but none has matched its accuracy, performance, feature set, built-in automation, and ease of use.

Designed for both non-technical and technical users, CART can quickly reveal important data relationships that could remain hidden using other analytical tools.

Technically, CART is based on landmark mathematical theory introduced in 1984 by four world-renowned statisticians at Stanford University and the University of California, Berkeley. Salford Systems’ implementation of CART is the only decision tree software embodying the original proprietary code. The CART creators continue to collaborate with Salford Systems to continually enhance CART with proprietary advances. Patented extensions to CART are specifically designed to enhance results for market research and web analytics. CART supports high-speed deployment, allowing Salford Systems models to predict and score in real time on a massive scale.

Results are displayed as a tree-shaped visual diagram and presented as a flow chart. Compare this to complex parameter coefficients in a logistic regression output or a stream of numbers in a neural-net.

CART’s Hotspot Detection is specifically designed to search many trees to find nodes of ultra-high response.

CART users can dictate the splitting variable to be used. More specific controls allow the user to specify the split values for both continuous and categorical variables.


Gradient Boosting


TreeNet is Salford Systems’ most flexible and powerful data mining tool, responsible for at least a dozen prizes in major data mining competitions since its introduction in 2002.

The algorithm typically generates up to thousands of small decision trees built in a sequential error-correcting process to converge to an accurate model. TreeNet models are usually complex and thus the software generates a number of special reports designed to extract the meaning of the model. Graphs produced by TreeNet software display the impact of any relevant predictor, or pair of predictors, on the target, thus revealing the underlying data structure.

TreeNet's robustness extends to data contaminated with erroneous target labels. This type of data error can be very challenging for most data mining methods and catastrophic for many. In contrast, TreeNet is generally immune to such errors, because it dynamically rejects training data that is too much at variance with the existing model. In addition, TreeNet offers a degree of accuracy usually not attainable by a single model or by ensembles such as bagging or conventional boosting. Unlike neural networks, TreeNet is not sensitive to data errors and needs no time-consuming data preparation, preprocessing, or imputation of missing values.

Interaction Detection establishes whether interactions of any kind are needed in a predictive model and serves as a search engine to discover specifically which interactions are required. The interaction detection system not only helps improve model performance (sometimes dramatically) but also assists in the discovery of valuable new segments and previously unrecognized patterns.

RuleLearner is a rule search engine that discovers and rank orders the most predictively powerful rules. RuleLearner identifies individual rules and also creates an optimally weighted combination of rules that can function as a high performance predictive model. RuleLearner also allows discovery of segment specific variable importance rankings.

This is the TreeNet results window that opens once the analysis is done. One can review model performance on both the learn and test samples.

This graph shows predicted f(X) after 200 iterations. The resulting model is nice and smooth. It can be further approximated by a simple collection of first or second order splines.

The user has the ability to view one or two variable dependency plots. This allows quick and easy hotspot identification by simply checking the color of the response surface.



MARS is ideal for users who prefer results in a form similar to traditional regression while capturing essential nonlinearities and interactions. The MARS approach to regression modeling effectively uncovers important data patterns and relationships that are difficult, if not impossible, for other regression methods to reveal.

Conventional regression models typically fit straight lines to data. MARS approaches model construction more flexibly, allowing for bends, thresholds, and other departures from straight-line methods. MARS builds its model by piecing together a series of straight lines with each allowed its own slope. This permits MARS to trace out any pattern detected in the data.
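The bends MARS uses are built from simple hinge ("basis") functions. A minimal sketch, illustrative only and not MARS itself:

```python
# A hinge pair at a knot: MARS-style models are weighted sums of such terms.
def hinge_pos(x, knot):
    return max(0.0, x - knot)   # zero left of the knot, slope 1 to the right

def hinge_neg(x, knot):
    return max(0.0, knot - x)   # slope 1 to the left of the knot, zero to the right

def mars_like(x):
    """Toy model: flat at 10 until x = 5, then rising with slope 2."""
    return 10.0 + 2.0 * hinge_pos(x, 5.0)
```

Each basis function contributes a straight-line segment with its own slope, which is how a sum of such terms traces out bends and thresholds.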

The MARS model is designed to predict continuous numeric outcomes such as the average monthly bill of a mobile phone customer or the amount that a shopper is expected to spend in a web site visit. MARS is also capable of producing high quality probability models for a yes/no outcome. MARS performs variable selection, variable transformation, interaction detection, and self-testing, all automatically and at high speed.

Areas where MARS has exhibited especially high-performance results include forecasting electricity demand for power generating companies, relating customer satisfaction scores to the engineering specifications of products, and presence/absence modeling in geographical information systems (GIS).

Automated Non-linear Regression

Three-dimensional, rotatable surface plots depict the relationship between a pair of predictor variables and the target variable.

MARS fits non-linear functions using piecewise linear segments. Each segment is defined using a basis function.

MARS produces simple graphs displaying the relationship between each important variable and the target.


Random Forests is a bagging tool that leverages the power of multiple alternative analyses, randomization strategies, and ensemble learning. Its strengths are spotting outliers and anomalies in data, displaying clusters, predicting future outcomes, identifying important predictors, replacing missing values with imputations, and providing insightful graphics.

Much of the insight provided by Random Forests is generated by methods applied after the trees are grown, including new technology for identifying clusters or segments in data as well as new methods for ranking the importance of variables.

The method was developed by Leo Breiman and Adele Cutler of the University of California, Berkeley, and is licensed exclusively to Salford Systems. Ongoing research is being undertaken by Salford Systems in collaboration with Professor Adele Cutler, the surviving co-author of RandomForests.

Random Forests is a collection of many CART-style trees that are not influenced by each other when constructed. The sum of the predictions made from decision trees determines the overall prediction of the forest. Random Forests has been successfully used in both small and huge data sets and is especially effective in identifying the important predictors even in the presence of hundreds or thousands of features.
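The recipe can be caricatured in a few lines of Python. This sketch uses one-split "stumps" in place of full CART trees, but it keeps the ingredients named above: each tree sees a bootstrap sample and a randomly chosen predictor, and the forest predicts by majority vote. It is an illustration of the idea, not Salford's implementation:

```python
import random
from collections import Counter

def fit_stump(rows, f):
    """Best one-split rule on feature f: 'predict 1 iff x[f] > thr' or its flip."""
    best = (None, -1.0, False)                  # (threshold, accuracy, flipped)
    for thr in {x[f] for x, _ in rows}:
        acc = sum((x[f] > thr) == (y == 1) for x, y in rows) / len(rows)
        for flipped, a in ((False, acc), (True, 1 - acc)):
            if a > best[1]:
                best = (thr, a, flipped)
    thr, _, flipped = best
    return lambda x: int((x[f] > thr) != flipped)

def fit_forest(rows, n_trees=50, seed=1):
    """Each 'tree' is fit on a bootstrap sample using a random predictor."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]   # bootstrap resample
        f = rng.randrange(n_features)             # random predictor choice
        trees.append(fit_stump(boot, f))
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

# Both toy predictors carry the class signal, so nearly every stump is
# accurate and the majority vote is stable.
rows = [((i, i + 10), int(i >= 5)) for i in range(10)]
forest = fit_forest(rows)
```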

A nearly perfect separation of two classes is clearly seen.

The user can view a Parallel Coordinates plot to see how predictor values are associated with different target classes. The graph can be scrolled forward to view all of the predictors.

Breiman and Cutler’s Random Forests

A RandomForest is a collection of single trees grown in a special way. Combining trees always improves performance, with the optimal number of trees around 108 in this example.


Salford Systems offers state-of-the-art regression technology. Our regression algorithms are vastly enhanced to incorporate the key concepts of modern data mining approaches. The algorithms are specifically designed to work with massive datasets and with data including missing values, nonlinear relationships, local patterns and interactions. SPM includes several tools for regression:

MARS models look somewhat like conventional regression models but are constructed automatically by first breaking predictors into regions exhibiting different effects on the outcome variable. MARS also automatically discovers interactions among predictors and displays results in conventional regression formats and graphically.

TreeNet is a multi-purpose learning machine/data mining tool that also excels in the development of nonlinear regressions but via boosted regression trees. The results are understood via graphical dependency plots illustrating how changes in each predictor, or a pair of predictors, affect the output.

Modern Regularized Regression (GPS Generalized PathSeeker) is provided for both linear and logistic regression models. Regularization comes in the form of ridge regression, the lasso, mixtures of the two (the elastic net), and a super regularizer that delivers ultra-compact models only slightly less accurate than far more complex ones. Incorporating Jerome Friedman's recent advances in the field, GPS is extraordinarily fast and comfortably handles both deep (many rows) and wide (many columns) data sets.
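For intuition about how the penalties differ, the one-standardized-predictor case has simple textbook closed forms; the following sketch is an illustration of ridge, lasso, and elastic net shrinkage, not GPS itself:

```python
# Ridge shrinks a coefficient smoothly; the lasso soft-thresholds it (and
# can zero it out); the elastic net blends both via a mixing parameter.
def ridge(beta_ols, lam):
    return beta_ols / (1.0 + lam)

def soft_threshold(b, t):
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

def lasso(beta_ols, lam):
    return soft_threshold(beta_ols, lam)

def elastic_net(beta_ols, lam, alpha):
    """alpha = 1 gives the lasso, alpha = 0 gives ridge."""
    return soft_threshold(beta_ols, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

Sweeping alpha from 1 to 0 traces the same lasso-to-ridge spectrum of "elasticities" described above.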

Logistic Regression incorporates a sophisticated set of options and controls for the advanced user and ultra-flexibility in model setup for the everyday user. Binary and multinomial logistic regression and regularized binary logistic regression are supported, as well as conversion of text and categorical variables into collections of dummy variables.

MARS nonlinearity plots: target versus individual predictors.

GPS model performance comparison using multiple elasticities from lasso to ridge.

Logistic Regression risk prediction (Hosmer Lemeshow Goodness of Fit): assessing whether or not the observed event rates match expected event rates for both responders and non-responders.

What’s New? Modern Regression Technology


Salford Systems’ enhanced technology accelerates the model building process by automating substantial portions of the model exploration, discovery, and refinement process. While the analyst is always in full control, we anticipate the next best steps to take and offer to run them for the analyst. The software offers complete sets of results from alternative modeling strategies for easy review. The parameters, settings, and options giving the best results are clearly highlighted according to various criteria such as classification accuracy, area under the ROC curve (sensitivity vs. specificity, precision vs. recall), and lift in the top percentiles. Modelers can quickly review graphs displaying model performance and model complexity, assisting in the choice of a model best tuned to the requirements and constraints of any real-world predictive environment.

Included are 60+ pre-packaged experiments extracted from our extensive real world consulting experience. These experiments codify best practice techniques used routinely by the world's leading predictive analytics modelers. Multiple models, using different control settings, test samples, learning machine engines, and modeling strategies, are run automatically and conveniently summarized in clear tables and graphs so that the analyst can easily see options and trade-offs. The result is the elimination of much of the gruntwork, allowing the analyst to focus on the creative aspects of model development.

The Salford philosophy of modeling automation is to assist the modeler as far as possible by anticipating the routine stages of modeling experimentation, allowing the modeler to rapidly make good decisions and to avoid mistakes such as failing to run useful tests and diagnostics. Our goal is to help the analyst do a better and faster job of predictive model development.

Expert modelers typically devote a lot of time to optimizing their variable importance list; Automate Shaving automates this process with minimal (if any) sacrifice to model accuracy.

Expert analysts will typically try a large number of different configurations of prior probabilities. This process is fully automated in Automate Priors.

Automate Target Shuffle (Monte Carlo Shuffling of the Target) allows the analyst to determine whether the model performance is as accurate as it appears to be by automatically constructing a large number of “no signal” models based on randomly shuffled target variables. If a dataset with deliberately destroyed target dependency can give you a model with good accuracy, then relying on the original model becomes rather dubious.
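The idea can be sketched in a few lines. Here a toy single-threshold classifier stands in for the real learner; the point is only the comparison between the real score and the distribution of "no signal" scores:

```python
# Sketch of target shuffling: refit after randomly permuting the target and
# compare scores. A toy single-threshold classifier stands in for the model.
import random

def fit_and_score(xs, ys):
    """Training accuracy of the best rule 'predict 1 iff x > threshold' (or its flip)."""
    best = 0.0
    for thr in xs:
        acc = sum((x > thr) == (y == 1) for x, y in zip(xs, ys)) / len(ys)
        best = max(best, acc, 1 - acc)
    return best

rng = random.Random(42)
xs = list(range(20))
ys = [0] * 10 + [1] * 10              # real signal: the class flips at x = 10
true_score = fit_and_score(xs, ys)

shuffled_scores = []
for _ in range(200):
    perm = ys[:]
    rng.shuffle(perm)                 # deliberately destroy target dependency
    shuffled_scores.append(fit_and_score(xs, perm))

# A trustworthy model should beat nearly every "no signal" model.
beat = sum(true_score > s for s in shuffled_scores)
```

If the shuffled models scored nearly as well as the real one, the apparent accuracy of the original model would be suspect.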

What’s New? Enhanced Automation Technology


What is CART? Classification And Regression Trees is a decision-tree procedure introduced in 1984 by world-renowned statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Their landmark work created the modern field of sophisticated, mathematically- and theoretically-founded decision trees. A decision tree is a flow chart or diagram representing a classification system or predictive model. The tree is structured as a sequence of simple questions with the answers tracing a path down the tree. The end point reached determines the classification or prediction made by the model, which can be a qualitative judgment or a numerical forecast. The CART methodology solves performance, accuracy, and operational problems.

What makes Salford Systems’ CART the only “true” CART?

Salford Systems’ CART is the only decision tree formulated from the original code of Breiman, Friedman, Olshen, and Stone. Since the code is proprietary, CART is the only true implementation of this methodology. In addition, the procedure has been substantially enhanced with new features and capabilities in exclusive collaboration with CART’s creators. While other decision-tree products claim to implement selected features of this technology, they are unable to reproduce genuine CART trees and lack key performance and accuracy components. Furthermore, CART’s creators continue to collaborate with Salford Systems to advance CART and to develop the next generation of analytical tools.

What makes CART so easy to interpret?

The results of a decision tree are displayed as a tree diagram using a simple set of if-then rules. Discovered relationships and patterns - even in massively complex datasets with thousands of variables - are presented as a flow chart. Compared to complex parameter coefficients in a logistic regression, or a stream of numbers in neural-nets, the visual display enables users to see the hierarchical interaction of the variables.

How are CART decision trees grown?

CART uses an exhaustive, recursive partitioning routine to generate binary splits that divide each parent node into two child nodes by posing a series of yes-no questions. CART searches for questions that split nodes into relatively homogenous child nodes. As the tree evolves, the nodes become increasingly more homogenous, identifying segments.
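The split search can be illustrated with a short sketch (Gini impurity on one predictor; CART's actual implementation is far more elaborate):

```python
# Sketch of a CART-style split search on one predictor: try every candidate
# threshold and keep the yes/no question with the lowest weighted impurity.
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Best question 'is x <= threshold?' by weighted Gini of the two children."""
    n = len(ys)
    best = (None, float("inf"))
    for thr in sorted(set(xs))[:-1]:          # candidate split points
        left  = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (thr, score)
    return best

# A cleanly separable node: the search finds the pure split at x <= 3.
thr, score = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```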

Why is CART unique among decision tree tools?

• Reliable pruning strategy - CART’s developers determined that no stopping rule could be relied on to discover the optimal tree, so they introduced the strategy of over-growing trees and then pruning them back. This fundamental idea ensures that important structure is not overlooked by stopping too soon.

• Powerful binary-split search approach - CART’s binary decision tree is sparing with data and detects structure before too little data is left for further splitting.

• Automatic self-validation procedures - CART’s test methods ensure that the tree structure will retain its predictive power when applied to new data. The testing and selection of the optimal tree are integral parts of the CART algorithm.

FAQ

What splitting criteria does CART provide?

CART includes several single-variable splitting criteria for classification: gini, symgini, twoing, ordered twoing, class probability, and class entropy. For regression, CART provides least squares and least absolute deviation. Additionally, CART offers one multi-variable splitting criterion using linear combinations.

What are adjustable misclassification penalties?

Misclassification penalties accommodate situations in which some misclassified segments are more serious than others. CART users can specify an adjustable penalty for misclassifying certain segments, and the software will direct the tree away from that type of error. When CART cannot guarantee a correct classification, it will try to ensure that the misclassification is less costly.
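The underlying decision rule is simple to state: assign each case to the class with the lowest expected cost, not simply the most probable class. A minimal sketch, with a hypothetical fraud-detection cost matrix:

```python
# Cost-sensitive classification: choose the prediction with the lowest
# expected misclassification cost given the node's class probabilities.
def cheapest_class(probs, cost):
    """probs[i]: P(true class = i); cost[i][j]: cost of predicting j when i is true."""
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

# Hypothetical penalties: missing a fraud case (class 1) costs 10x a false
# alarm, so a node that is only 20% fraud is still labeled "fraud".
probs = [0.8, 0.2]
cost = [[0, 1],    # true class 0: predicting 1 costs 1
        [10, 0]]   # true class 1: predicting 0 costs 10
label = cheapest_class(probs, cost)
```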

What are intelligent surrogates for missing values?

CART handles missing values by use of surrogate splits, a splitting rule that closely mimics the action of a primary split. Not only must a good surrogate split the parent node into descendant nodes similar in size and composition to the primary descendant nodes, but, to the extent possible, the surrogate must also match the primary split on the specific cases that go to the left and right child nodes. A surrogate is thus evaluated by its ability to match a primary split on a case-by-case basis.
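The case-by-case matching criterion can be sketched directly. The variables, thresholds, and values below are hypothetical:

```python
# Sketch of surrogate evaluation: score a candidate surrogate by its
# case-by-case agreement with the primary split.
def split_direction(value, threshold):
    """Which child a case goes to: 'L' if value <= threshold else 'R'."""
    return "L" if value <= threshold else "R"

def surrogate_agreement(primary_vals, primary_thr, surr_vals, surr_thr):
    """Fraction of cases the surrogate sends to the same child as the primary."""
    matches = sum(
        split_direction(p, primary_thr) == split_direction(s, surr_thr)
        for p, s in zip(primary_vals, surr_vals)
    )
    return matches / len(primary_vals)

# Income is the primary splitter; age agrees with it on 4 of 5 cases, so
# age could stand in for income on cases where income is missing.
income = [20, 35, 50, 80, 95]
age    = [25, 30, 50, 61, 35]
agree = surrogate_agreement(income, 45, age, 40)
```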

What are CART’s test procedures?

CART uses two test procedures to select the optimal tree with the lowest overall misclassification cost, thus the highest accuracy. Both test disciplines are automated and ensure the optimal tree will accurately classify existing data and predict results. For smaller datasets, the user can employ cross-validation where ten different trees are typically grown, each built from a different ten percent of the total sample. When the results of the ten trees are considered, the optimal tree size is obtained. For larger datasets, the user may specify a random test sample, a separate test file, or a pre-determined set of test records.
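The ten-fold mechanics look like this in outline (a trivial majority-class learner stands in for CART):

```python
# Outline of ten-fold cross-validation: train on nine tenths, test on the
# held-out tenth, and average the ten error rates.
from collections import Counter

def ten_fold_error(data, fit, predict, k=10):
    folds = [data[i::k] for i in range(k)]        # k disjoint subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = [row for j in range(k) if j != i for row in folds[j]]
        model = fit(train)
        wrong = sum(predict(model, x) != y for x, y in test)
        errors.append(wrong / len(test))
    return sum(errors) / k

# Toy learner: always predict the training set's majority class.
fit = lambda train: Counter(y for _, y in train).most_common(1)[0][0]
predict = lambda model, x: model

data = [(i, "no") for i in range(70)] + [(i, "yes") for i in range(30)]
err = ten_fold_error(data, fit, predict)          # 30% of each fold is "yes"
```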

What is a committee-of-experts or bootstrap aggregation?

The use of multiple trees in a committee-of-experts is an effective way of combining trees. Prediction errors can be considerably reduced by selecting many different random samples from the training data, growing a different tree on each random sample, and finally, by allowing the different trees to “vote” on the best classification.
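A toy sketch of the committee idea, using a 1-nearest-neighbor rule as the base learner purely to keep the code short (a real committee would grow a CART tree on each resample):

```python
# Committee of experts: fit a learner on each bootstrap resample of the
# training data and let the learners vote on the classification.
import random
from collections import Counter

def nn_predict(sample, x):
    """Base learner: the label of the training point nearest to x."""
    return min(sample, key=lambda pt: abs(pt[0] - x))[1]

def committee_predict(train, x, n_members=25, seed=0):
    """Majority vote of members, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    votes = [
        nn_predict([rng.choice(train) for _ in train], x)   # bootstrap sample
        for _ in range(n_members)
    ]
    return Counter(votes).most_common(1)[0][0]

train = [(1, "no"), (2, "no"), (3, "no"), (8, "yes"), (9, "yes"), (10, "yes")]
label = committee_predict(train, 9.5)
```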

How can CART complement other data mining packages and/or suites?

CART is an excellent pre-processing complement to classical statistical packages, such as SAS®. In the first stage of a data mining project, CART can extract the most important variables from a large list of potential predictors. Focusing on the top variables from the CART model can significantly speed up neural networks and other data mining techniques. For neural nets, CART bypasses noise and irrelevant variables, effectively selecting the best variables for input. In addition, CART outputs or predicted values can be used as inputs to the neural net. CART can also be used to establish performance benchmarks, detect important interactions, and impute missing values.


How does TreeNet work and what does a TreeNet model look like?

A TreeNet model normally consists of from several dozen to several hundred small trees, each typically no larger than two to eight terminal nodes. The model is similar in spirit to a long series expansion (such as a Fourier or Taylor series) - a sum of terms that becomes progressively more accurate as the expansion continues. The expansion can be written as:

F(X) = F0 + β1 T1(X) + β2 T2(X) + ... + βM TM(X)

where each Ti is a small tree. An example of the first few terms of a model to predict home values based on the 1970 Census Boston Housing data set is:

Value = $22,533 + $13,541 × (larger home) + $2,607 × (good socioeconomic status neighborhood) + ...

The model tells us that we start with the mean home value (in 1970) of $22,533 and adjust that estimate upwards by $13,541 for larger homes, and upwards again by $2,607 for neighborhoods with good socioeconomic status indicators. In practice the adjustments are usually much smaller than shown in this regression example and hundreds of adjustments may be needed. The final model is thus a collection of weighted and summed trees. For binary classification problems, a yes or no response is determined by whether the sign of the predicted outcome is positive or negative. For multi-class problems a score is developed separately for each class via class-specific expansions, and the scores are converted into a set of probabilities of class membership.

The example above uses the smallest possible two-node tree in each stage. More complicated models tracking complex interactions are possible with three or more nodes at each stage.

What is the technology underlying TreeNet and how does it differ from boosting?

TreeNet uses gradient boosting to achieve the benefit of boosting (accuracy) without the drawback of a tendency to be misled by bad data. In boosting, each tree grown would normally be a fully articulated stand-alone model, with each boosted tree combined with its mates via a weighted voting scheme. In contrast, each TreeNet component is a small tree, often no larger than two terminal nodes; trees are summed together with very small weights on each component.
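The stagewise logic can be sketched for least-squares regression: fit a two-node tree to the current residuals, add it in with a small weight, repeat. This is an illustration of gradient boosting generally, not TreeNet's proprietary algorithm:

```python
# Minimal least-squares boosting loop: tiny two-node "trees" (one threshold,
# one constant per side) summed with small weights (shrinkage).
def fit_two_node_tree(xs, residuals):
    """Best single split, predicting the mean residual on each side."""
    best = None
    for thr in xs[:-1]:                       # assumes distinct, sorted xs
        left  = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, n_trees=200, rate=0.1):
    """F(x) = mean + small-weighted sum of trees, each fit to residuals."""
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(ys)
    trees = []
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, preds)]
        tree = fit_two_node_tree(xs, resid)
        trees.append(tree)
        preds = [p + rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + rate * sum(t(x) for t in trees)

xs = [1, 2, 3, 4, 5, 6]
ys = [5, 5, 5, 20, 20, 20]                    # a step function
model = boost(xs, ys)
```

Because each stage corrects only a small fraction of the remaining error, no single tree dominates, which is the contrast with conventional boosting described above.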

SALFORD PREDICTIVE MODELER

TreeNet generates extremely accurate models. TreeNet's level of accuracy is usually not attainable by single models or by ensembles such as bagging or conventional boosting. Independent real-world tests in text mining, fraud detection, and credit worthiness have shown TreeNet to be dramatically more accurate on test data than competing methods. TreeNet is able to deliver results within a few hours comparable to or better than results requiring months of hands-on development by expert data mining teams.

What are the advantages of TreeNet?

TreeNet's advantages include:
• Automatic selection from thousands of candidate predictors - no prior variable selection or data reduction is required
• Ability to handle data without preprocessing - data do not need to be rescaled, transformed, or modified in any way
• Resistance to outliers in predictors or the target variable
• Automatic handling of missing values
• General robustness to dirty and partially inaccurate data
• High speed - trees are grown quickly, and small trees are grown extraordinarily quickly
• Ability to focus on the data that are not easily predictable as the model evolves - as additional trees are grown, fewer and fewer data need to be processed; in many cases, TreeNet is able to train effectively on 20% of the data
• Resistance to overtraining - when working with large databases, even models with 2,000 trees show little evidence of overtraining, and most models reach maximum accuracy well before 1,000 trees are grown

What does TreeNet output look like?

The TreeNet model is a complex structure not easily understood by studying its individual components. However, TreeNet produces a number of clear reports and graphs that reveal the core message and predictive content of the model. These include:
• Variable importance ranking
• Graphs of the typical relationship between the target and any one predictor - all other variable effects are taken into account to arrive at a typical relationship; technically, we graph E(Y|Xi) for a single predictor Xi, integrating out all other relevant predictors
• 3-D graphs of the target against any pair of predictors
• The first few trees of the model, which may also be displayed as a set of text rules
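The single-predictor graphs described above are partial dependence curves: the predictor of interest is pinned to each value on a grid while all other predictors keep their observed values, and the model's predictions are averaged. A minimal sketch of that computation, using a hypothetical stand-in `predict` function rather than SPM's actual API:

```python
# Sketch of a partial dependence curve, estimating E(Y | Xj = v) by pinning
# column j to each grid value and averaging predictions over all rows.
import numpy as np

def partial_dependence(predict, X, j, grid):
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                     # pin predictor j; keep all others as observed
        curve.append(predict(Xv).mean())
    return np.array(curve)

# Toy model: y = 2*x0 + x1, so the x0 curve should rise with slope ~2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
predict = lambda X: 2 * X[:, 0] + X[:, 1]
grid = np.array([-1.0, 0.0, 1.0])
pd_curve = partial_dependence(predict, X, 0, grid)
print(float(pd_curve[2] - pd_curve[1]))  # ~2.0, the slope of x0
```

Because every other predictor is averaged over, the curve shows the "typical" relationship rather than one conditional slice of the data.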

What are the advantages of TreeNet over a neural net?

TreeNet is not sensitive to data errors and needs no time-consuming data preparation, preprocessing or imputation of missing values. TreeNet is resistant to overtraining and is over 100 times faster than a neural net. Finally, TreeNet is not troubled by hundreds or thousands of predictors.

What is MARS? Multivariate Adaptive Regression Splines was developed in the early 1990s by world-renowned Stanford physicist and statistician Jerome Friedman. It is an innovative, flexible modeling tool that automates the building of accurate predictive models for continuous and binary dependent variables. It excels at finding optimal variable transformations and interactions, the complex data structure that often hides in high-dimensional data. This approach to regression modeling effectively uncovers important data patterns and relationships that are difficult, if not impossible, for other methods to reveal.

How does MARS differ from conventional regression?

Conventional regression models typically fit straight lines to data. MARS approaches model construction more flexibly, allowing for bends, thresholds, and other departures from straight-line methods. MARS builds its model by piecing together a series of straight lines with each allowed its own slope. This permits MARS to trace out any pattern detected in the data.
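The "series of straight lines, each with its own slope" can be expressed with hinge basis functions of the form max(0, x - knot) and max(0, knot - x). The knot and coefficients below are illustrative values, not output of an actual MARS fit:

```python
# Sketch of the piecewise-linear idea behind MARS: a hinge term lets a
# fitted line change slope at a knot while staying continuous.
import numpy as np

def hinge(x, knot):
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

x = np.array([0.0, 2.0, 5.0, 7.0, 10.0])
h_plus, h_minus = hinge(x, 5.0)

# A model like y = 10 + 3*max(0, x - 5) is flat until x = 5, then rises with slope 3.
y = 10 + 3 * h_plus
print(y.tolist())   # [10.0, 10.0, 10.0, 16.0, 25.0]
```

Summing several such terms, each with its own knot and slope, lets the fitted curve bend to trace whatever pattern the data contain.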

How does MARS help analysts with regression modeling?

The major advantage of MARS is that it automates aspects of regression modeling that are difficult and time-consuming. These include:
• selecting which predictor variables to use
• handling missing values
• transforming variables to account for nonlinear relationships
• detecting interactions
• self-testing, ensuring that the model will perform well on future data
The results are more accurate and complete than handcrafted models.


How does MARS handle missing values?

MARS automatically creates a missing value indicator – a dummy variable – that becomes one of the available predictors. Each indicator records the presence or absence of data for the predictor in question.
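The indicator idea can be shown in a few lines. This is a generic sketch of the pattern, with hypothetical column names, not MARS's internal code:

```python
# Sketch of missing-value indicators: for a predictor with missing values,
# add a 0/1 dummy so missingness itself can enter the model as a predictor.
import math

rows = [{"income": 52000.0}, {"income": None}, {"income": 48000.0}]

for row in rows:
    val = row["income"]
    missing = val is None or (isinstance(val, float) and math.isnan(val))
    row["income_missing"] = 1 if missing else 0

print([r["income_missing"] for r in rows])   # [0, 1, 0]
```

If missingness is informative (for example, respondents who decline to state income), the dummy itself can carry predictive value.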

How does MARS ensure that a model will perform as claimed on future data?

Almost all modeling technologies can track training data accurately. MARS protects users from misleading results through its two-stage modeling process. MARS overfits its model initially but then prunes away all components that would not hold up with new data. MARS provides assessments through use of one of two built-in testing regimens: cross-validation or reference to independent test data. Using these tests, MARS determines the degree of accuracy that can be expected from the best predictive model.

How can MARS models be implemented for predictive purposes?

A MARS predictive model can be implemented in two ways. First, new databases can be scored directly by identifying the MARS model and the data to be scored. MARS will perform all the required data transformations and calculations automatically and output the predicted scores. Second, the MARS predictive equation can be exported as ready-to-run C and SAS® compatible code that can be deployed in the user’s application framework.

How does MARS compare with neural nets?

MARS is not a black box. It is faster, more interpretable, and more accurate than neural nets.

Why is MARS better than a decision tree for regression?

MARS is capable of predicting with much higher resolution and accuracy, typically producing unique scores for every record in a database. In this way, MARS expands on the capabilities of decision trees for regression.

What is RandomForests? RandomForests is a data analysis tool for data mining and predictive modeling. It generates and combines decision trees into predictive models and displays data patterns with a high degree of accuracy. The method was developed by Leo Breiman of the University of California, Berkeley, and Adele Cutler, and is licensed exclusively to Salford Systems.

What are the advantages of RandomForests?

• Automatic predictor selection from any number of candidates
  - the analyst does not need to do any variable selection or data reduction
  - the best predictors are identified automatically

• Ability to handle data without preprocessing
  - data do not need to be rescaled, transformed, or modified
  - resistant to outliers
  - automatic handling of missing values

• Resistance to overtraining
  - generates numerous trees based on two forms of randomization
  - growing a large number of RandomForests trees does not create a risk of overfitting
  - each tree is an independent, random experiment

• Self-testing using "out-of-bag" data
  - self-testing is based on an extension of cross-validation
  - self-tests provide highly reliable assessments of the model

• Cluster identification
  - can be used to generate tree-based clusters
  - predictor variables defining clusters are chosen automatically

• Visualization
  - RandomForests offers graphics that yield new insights into data

How does RandomForests work?

RandomForests is a collection of many CART trees that are grown independently of one another. The combined predictions of the individual trees determine the overall prediction of the forest. Two forms of randomization occur in RandomForests: one at the tree level and one at the node level. At the tree level, randomization takes place by bootstrap sampling of the observations. At the node level, it occurs by using a randomly selected subset of predictors. Each tree is grown to maximal size and left unpruned. This process is repeated until a user-defined number of trees has been created; the collection is a random forest. Once the forest is created, the predictions of the individual trees are combined in a "voting" process: the overall prediction is determined by voting for classification and by averaging for regression.
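The two randomizations and the voting step can be sketched compactly. This is an illustrative toy, not Breiman and Cutler's implementation: each "tree" is simplified to a one-split stump on a single randomly chosen predictor (a degenerate case of the node-level predictor subset), and rows are bootstrapped per tree.

```python
# Sketch of a random forest classifier: bootstrap rows per tree, random
# predictor choice per split, majority vote across trees.
import numpy as np

rng = np.random.default_rng(42)

def fit_random_stump(X, y):
    feat = rng.choice(X.shape[1])            # node-level randomization (subset of size 1 here)
    t = np.median(X[:, feat])
    left, right = y[X[:, feat] <= t], y[X[:, feat] > t]
    lmaj = int(round(left.mean())) if left.size else 0
    rmaj = int(round(right.mean())) if right.size else 0
    return lambda Xs: np.where(Xs[:, feat] <= t, lmaj, rmaj)

def random_forest(X, y, n_trees=75):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), len(y))   # tree-level randomization: bootstrap the rows
        trees.append(fit_random_stump(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    votes = np.mean([t(X) for t in trees], axis=0)
    return (votes > 0.5).astype(int)            # majority vote for classification

X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)                   # toy class depends only on feature 0
trees = random_forest(X, y)
acc = float((forest_predict(trees, X) == y).mean())
print(round(acc, 2))
```

Although many individual stumps pick an uninformative feature, the vote across many independent random trees still recovers the signal - the ensemble is far stronger than its weak components.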


What are RandomForests’ strengths?

RandomForests specializes in classification and regression problems. Its strengths are spotting outliers and anomalies in data, displaying proximity clusters, predicting future outcomes, identifying important predictors, discovering data patterns, replacing missing values with imputations, and providing insightful graphics. Additionally, it can provide clustering and density estimations.

Is RandomForests a black box? RandomForests is not a black box. It produces descriptive reports and displays that allow the user to gain insight into the data.

Feature Matrix: Salford Systems Predictive Modeler Suite
(Editions: Ultra, Basic, Pro, ProEX; per-feature availability columns not reproduced here.)

• Modeling Engine: CART (Decision Trees)
• Modeling Engine: MARS (Nonlinear Regression)
• Modeling Engine: TreeNet (Stochastic Gradient Boosting)
• Modeling Engine: RandomForests for Classification
• Reporting ROC curves during model building and model scoring
• Model performance stats based on cross-validation
• Model performance stats based on out-of-bag data during bootstrapping
• Reporting performance summaries on learn and test data partitions
• Reporting gains and lift charts during model building and model scoring
• Automatic creation of command logs
• Built-in support to create, edit, and execute command files
• Reading and writing datasets in all current database/statistical file formats
• Option to save processed datasets into all current database/statistical file formats
• Select Cases in Score Setup
• TreeNet scoring offset in Score Setup
• Setting of focus class supported for all categorical variables
• Scalable limits on terminal nodes - a special mode that enforces the ATOM and/or MINCHILD size
• Descriptive statistics: summary stats, stratified stats, charts, and histograms
• Activity Window: brief data description, quick navigation to the most common activities
• Additional modeling engines: Regularized Regression (LASSO/Ridge/LARS/Elastic Net/GPS)
• Translating models into SAS®-compatible language
• Data analysis Binning Engine
• Automatic creation of missing value indicators
• Option to treat missing values in a categorical predictor as a new level
• License to any level supported by RAM (currently 32MB to 1TB)
• License for multi-core capabilities


• Using the built-in BASIC programming language during data preparation
• Automatic creation of lag variables based on user specifications during data preparation
• Automatic creation and reporting of key overall and stratified summary statistics for a user-supplied list of variables
• Display charts, histograms, and scatter plots for user-selected variables
• Command Line GUI Assistant to simplify creating and editing command files
• Translating models into SAS/PMML/C/Java/Classic and ability to create classic and specialized reports for existing models
• Unsupervised learning - Breiman's column scrambler
• Scoring any Automate (pre-packaged scenario of runs) as an ensemble model
• Summary-statistics-based missing value imputation using the scoring mechanism
• Impute options in Score Setup
• Quick Impute Analysis Engine: one-step statistical and model-based imputation
• Advanced imputation via Automate TARGET - control over fill selection and new impute variable creation
• Computation of over 10 different types of correlation
• Save OOB predictions from cross-validation models
• Custom selection of a new predictor list from an existing variable importance report
• User-defined bins for cross-validation
• Automation: Build two models reversing the roles of the learn and test samples (Automate FLIP)
• Automation: Explore model stability by repeated random drawing of the learn sample from the original dataset (Automate DRAW)
• Automation: For time series applications, build models based on a sliding time window using a large array of user options (Automate DATASHIFT)
• Automation: Explore mutual multivariate dependencies among available predictors (Automate TARGET)
• Automation: Explore the effects of learn sample size on model performance (Automate LEARN CURVE)

• Automation: Build a series of models by varying the random number seed (Automate SEED)

• Automation: Explore the marginal contribution of each predictor to the existing model (Automate LOVO)
• Automation: Explore model stability by repeated repartitioning of the data into learn, test, and possibly hold-out samples (Automate PARTITION)
• Automation: Explore nonlinear univariate relationships between the target and each available predictor (Automate ONEOFF)
• Automation: Bootstrapping process (sampling with replacement from the learn sample) with a large array of user options (Random Forests-style sampling of predictors, saving in-bag and out-of-bag scores, proximity matrix, and node dummies) (Automate BOOTSTRAP) *not available in RandomForests
• Automation: Shifts the crossover point between learn and test samples with each cycle of the Automate (Automate LTCROSSOVER)
• Automation: Build a series of models using different backward variable selection strategies (Automate SHAVING)
• Automation: Build a series of models using the forward-stepwise variable selection strategy (Automate STEPWISE)
• Automation: Explore nonlinear univariate relationships between each available predictor and the target (Automate XONY)
• Automation: Build a series of models using randomly sampled predictors (Automate KEEP)
• Automation: Explore the impact of a potential replacement of a given predictor by another one (Automate SWAP)
• Automation: Parametric bootstrap process (Automate PBOOT)
• Automation: Build a series of models for each stratum defined in the dataset (Automate STRATA)
• Automation: Build a series of models using every available data mining engine (Automate MODELS)
• Automation: A model is built in each possible data mining engine (Automate EVERYTHING)
• Automation: Run TreeNet for predictor selection, auto-bin predictors, then build a series of models using every available data mining engine (Automate GLM)

• Modeling Pipelines: RuleLearner, ISLE

Feature Matrix: Engine-Specific Additional Features

CART Classification and Regression Trees
• Linear combination splits
• Optimal tree selection based on area under the ROC curve
• User-defined splits for the root node and its children
• Translating models into Topology
• Manually edit and modify CART trees via FORCE command structures
• Automation: Explore alternative strategies for handling missing values (Automate MISSING_PENALTY)
• Automation: Build a series of models using all available splitting strategies (six for classification, two for regression) (Automate RULES)
• Automation: Build a series of models varying the depth of the tree (Automate DEPTH)
• Automation: Build a series of models changing the minimum required size of parent nodes (Automate ATOM)
• Automation: Build a series of models changing the minimum required size of child nodes (Automate MINCHILD)
• Automation: Explore the accuracy-versus-speed trade-off due to potential sampling of records at each node in a tree (Automate SUBSAMPLE)
• Automation: Generate a series of N unsupervised-learning models (Automate UNSUPERVISED)
• Automation: Vary the RIN (Regression In the Node) parameter through a series of values (Automate RIN)
• Automation: Vary the number of "folds" used in cross-validation (Automate CVFOLDS)
• Automation: Repeat the cross-validation process many times to explore the variance of estimates (Automate CVREPEATED)
• Automation: Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN)
• Automation: Check the validity of model performance using Monte Carlo shuffling of the target (Automate TARGETSHUFFLE)
• Automation: Build two linked models, where the first predicts a binary event while the second predicts the amount (Automate RELATED) - for example, predicting whether someone will buy and how much they will spend
• Automation: Indicate whether a variable importance matrix report should be produced when possible (Automate VARIMP)
• Automation: Save the variable importance matrix to a comma-separated file (Automate VARIMPFILE)
• Hotspot detection for Automate UNSUPERVISED
• Differential Lift Modeling (Netlift/Uplift)
• Profile tab in the CART Summary window
• Hotspot detection for Automate TARGET
• Multiple user-defined lists for linear combinations
• Constrained trees
• Ability to create and save dummy variables for every node in the tree during scoring
• Report basic stats on any variable of the user's choice at every node in the tree
• Comparison of learn vs. test performance at every node of every tree in the sequence
• Hotspot detection to identify the richest nodes across multiple trees
• Automation: Build a series of models limiting the number of nodes in a tree, thus controlling the order of interactions (Automate NODES)
• Automation: Build a series of models trying each available predictor as the root node splitter (Automate ROOT)
• Automation: Explore the impact of favoring equal-sized child nodes by varying CART's end-cut parameter (Automate POWER)
• Automation: Vary the priors for the specified class (Automate PRIOR)
• Automation: Build a series of models by progressively removing misclassified records, thus increasing the robustness of trees and possibly reducing model complexity (Automate REFINE)
• Automation: Bagging and ARCing using the legacy code (COMBINE)
• Automation: Explore the impact of a penalty on categorical predictors (Automate PENALTY=HLC)
• Automation: Explore the impact of a penalty on missing values (Automate PENALTY=MISSING)
• Build a Random Forests model utilizing the CART engine to gain alternative handling of missing values via surrogate splits (Automate BOOTSTRAP RSPLIT)

TreeNet Stochastic Gradient Boosting
• Spline-based approximations to the TreeNet dependency plots

• Exporting TreeNet dependency plots into an XML file
• Interactions: allow an interactions penalty which inhibits TreeNet from introducing new variables (and thus interactions) within a branch of a tree
• Automation: Build a series of models changing the minimum required size of child nodes (Automate MINCHILD)
• Automation: Vary the number of "folds" used in cross-validation (Automate CVFOLDS)
• Automation: Repeat the cross-validation process many times to explore the variance of estimates (Automate CVREPEATED)
• Automation: Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN)
• Automation: Check the validity of model performance using Monte Carlo shuffling of the target (Automate TARGETSHUFFLE)
• Automation: Indicate whether a variable importance matrix report should be produced when possible (Automate VARIMP)
• Automation: Save the variable importance matrix to a comma-separated file (Automate VARIMPFILE)
• Automation: Build two linked models, where the first predicts a binary event while the second predicts the amount (Automate RELATED) - for example, predicting whether someone will buy and how much they will spend
• Automatic creation of new spline-based approximation variables - one-step creation and saving of transformed variables to a new dataset
• Flexible control over interactions in a TreeNet model (ICL)
• Interaction strength reporting
• Interactions: Generate reports describing pairwise interactions of predictors
• Subsample separately by target class - specify separate sampling rates for the target classes in binary logistic models
• Control the number of top-ranked models for which performance measures will be computed and saved
• Advanced controls to reduce required memory (RAM)
• Extended influence trimming controls: ability to limit influence trimming to the focus class and/or correctly classified records
• Differential Lift Modeling (Netlift/Uplift)
• QUANTILE: specifies which quantile will be used with LOSS=LAD
• POISSON: designed for regression modeling of integer COUNT data
• GAMMA: Gamma distribution loss, used strictly for positive targets
• NEGBIN: Negative Binomial distribution loss, used for count targets (0, 1, 2, 3, ...)
• COX: the target (MODEL) variable is the non-negative survival time while the CENSOR variable indicates censoring
• Tweedie loss function
• Automation: Build a series of models limiting the number of nodes in a tree, thus controlling the order of interactions (Automate NODES)
• Automation: Convert (bin) all continuous variables into categorical (discrete) versions using a large array of user options (equal width, weights of evidence, Naive Bayes, supervised) (Automate BIN)
• Automation: Produce a series of three TreeNet models, making use of the TREATMENT variable specified on the TreeNet command (Automate DIFFLIFT)
• Automation: Build a series of models varying the speed of learning (Automate LEARNRATE)
• Automation: Build a series of models by progressively imposing additivity on individual predictors (Automate ADDITIVE)
• Automation: Build a series of models utilizing different regression loss functions (Automate TNREG)
• Automation: Build a series of models by varying the subsampling fraction (Automate TNSUBSAMPLE)
• Automation: Build a series of models using varying degrees of penalty on added variables (Automate ADDEDVAR)
• Automation: Explore the impact of influence trimming (outlier removal) for logistic and classification models (Automate INFLUENCE)
• Modeling Pipelines: RuleLearner, ISLE
• Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting
• Build a RandomForests model utilizing the TreeNet engine to gain speed as well as alternative reporting
• RandomForests-inspired sampling of predictors at each node during model building
• Automation: Exhaustive search and ranking of all interactions of the specified order (Automate ICL)
• Automation: Vary the number of predictors that can participate in a TreeNet branch, using interaction controls to constrain interactions (Automate ICL NWAY)

Random Forests Tree Ensembles
• Automation: Build a series of models changing the minimum required size of child nodes (Automate MINCHILD)
• Automation: Vary the bootstrap sample size (Automate RFBOOTSTRAP)
• Automation: Vary the number of randomly selected predictors at the node level (Automate RFNPREDS)
• Flexible control over interactions in a Random Forests for Regression model (requires TreeNet license)
• Interaction strength reporting (requires TreeNet license)
• Spline-based approximations to the Random Forests for Regression dependency plots (requires TreeNet license)
• Exporting Random Forests for Regression dependency plots into an XML file (requires TreeNet license)
• Automation: Explore the impact of influence trimming (outlier removal) for logistic and classification models (Automate INFLUENCE)
• Automation: Exhaustive search and ranking of all interactions of the specified order (Automate ICL)

MARS Multivariate Adaptive Regression Splines
• Save MARS basis functions in Score Setup

• Automation: Build a series of models varying the maximum number of basis functions (Automate BASIS)
• Automation: Vary the number of "folds" used in cross-validation (Automate CVFOLDS)
• Automation: Repeat the cross-validation process many times to explore the variance of estimates (Automate CVREPEATED)
• Automation: Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN)
• Automation: Build a series of models varying the smoothness parameter (Automate MINSPAN)
• Automation: Build a series of models varying the order of interactions (Automate INTERACTIONS)
• Automation: Build a series of models varying the modeling speed (Automate SPEED)
• Automation: Explore the impact of a penalty on categorical predictors (Automate PENALTY=HLC)
• Automation: Explore the impact of a penalty on missing values (Automate PENALTY=MISSING)
• Automation: Build a series of models using varying degrees of penalty on added variables (Automate PENALTY MARS)

GPS Generalized PathSeeker
• Regularized Regression (LASSO/Ridge/LARS/Elastic Net/GPS)
• Automation: Build a series of models by forcing different limits on the maximum correlation among predictors (Automate MAXCORR)

Regression (OLS)
• Automation: Generate detailed univariate distributional reports for every continuous variable on the KEEP list (Automate OUTLIERS)