spm users guide - salford systemsmedia.salford-systems.com/pdf/spm7/exploredata.pdfspm is a...

SPM Users Guide

Exploring Data

This guide describes the facilities in SPM to gain initial insights about a dataset by viewing and generating descriptive statistics.

Data Exploration 2

Introduction to Exploring Data SPM is a comprehensive set of tools to produce predictive, descriptive, and analytical models from datasets of any size, complexity, or organization. In many cases, though, you need to gain better understanding of the data first. The typical challenges an analyst faces when working with an unfamiliar dataset are.

• The quality of the data is not known. No matter how reputable a source of the data is it might still require data cleaning.

• Data dictionary not available or incomplete. The primary role of Variable Name is to identify a column of data. We would like it to convey the purpose and nature of the data too but this might be quite hard to achieve in many cases. Data Technologies (e.g. RDMS) are quite good at enforcing the identity of the column but are pretty indifferent to the descriptive power of the name of a variable.

These challenges usually occur at the beginning of the analysis. In SPM we made sure you have tools to get up and running with a new data as soon as possible. Once a dataset is loaded you can browse raw data and obtain simple and elaborate statistics in both tabular and graphical forms.

Opening and Viewing raw data

Once you open a dataset, for example, using Open button on the toolbar data exploring features become available. In this chapter we will work with sample dataset SAMPLE.CSV supplied as part of the SPM installation. It is located in the Sample Data folder. Please refer to general SPM documentation about the ways to bring your data into SPM.

To browse raw data Select View>View Data from the View menu.

You may simply click on the button in the toolbar.

As a result View Data window will appear.

Copyright, 2013 Salford Systems

Data Exploration 3

This display is tailored to handle large amounts of data. The grid works in so-called “Virtual Mode”. Only current “page” of data and some cached pages are retained in memory and the dataset is queried for more pages on demand. Sometimes querying the dataset multiple times is not what you want. A good example is browsing

content from an RDBMS (SQL Server, Oracle etc). If data access latency is too large consider extracting the data into a local file in, for example, CSV format before browsing.

Vertical scroll bar has special features to access pages of data. The buttons allow you, from top to bottom to ♦ Jump to the beginning of dataset.

♦ Jump one page up.

♦ Move one record.

♦ Move one record down.

♦ Jump one page down.

♦ Jump to the end of the dataset.

There’s thumb-bar on the vertical scroll bar of View Data window.

Descriptive Statistics To examine descriptive statistics of the currently open dataset Select View>Descriptive Stats… from the

View menu. You can also use toolbar button.

As a result Descriptive Stats Setup window will appear.


Data Exploration 4

The window is already configured to obtain detailed Descriptive Statistics of all of the variables in the dataset. You can press OK button right away. The defaults are configured so that computations finish in reasonable time for small to mid-sized datasets. For this run we will use most of the features and explain the controls along the way.

Selecting variables

Variable Selection grid allows configuring which variables are included in the computation.

Limiting number of variables to compute by excluding ones you are not interested in can speed up computation. This is especially handy when there’re variables with very high number of levels (hundreds and thousands).

To facilitate navigation through the list of variables you can sort them Alphabetically or in File Order. Search functionality is accessible through mouse right-click menu of the grid. Use Select button under Include column to set and reset multiple checkboxes at once.

Variables can be assigned special roles.

Strata variable


Data Exploration 5

STRATA <variable>

In addition to full dataset Descriptive Stats you can request stats for sub-samples identified by levels of a specific variable. In our current dataset variable T defines Learn and Test partitions for analysis. Let’s mark T as a Strata variable.

Weight variable

WEIGHT <variable>

By default each observation is accounted for only once in Descriptive Statistics computations. But you can assign any positive integer or fractional weight to each observation via a Weight variable. Let’s specify W as a weight variable.

As a result your Variable Selection should look as follows

Pre-defined Variable List Filters

There’s an alternative way to quickly select a category of variables. Filter group of controls allows you to quickly request

• Only Character variables

• Only Numeric variables.

This setting overrides the selections made in Variables Selection grid.

Configuring computation process

Computation of Descriptive Statistics can become quite resource-intensive on large and complex datasets. You can tailor the process to get the information you need in an acceptable time.


Data Exploration 6

For our Sample.csv analysis please select Detailed Stats and set both Max. distinct values to track and Max. distinct values to display to 9997.

Below each computation process configuration setting is described in more details.

Fast Stats

DATAINFO <varlist> / BRIEF = YES

Sometimes all you need is a quick lookup of some numeric statistics (Minimum, Maximum, Mean). Combined with variable selection feature you can get this information quickly.

Detailed Stats

DATAINFO <varlist> / BRIEF = NO

In this mode the full set of descriptive statistics is computed. This could be quite performance-intensive even if you select just a few variables with high number of levels and the dataset is pretty large. There’re additional controls to tailor the computation process in this mode.

Max. distinct values to track

DISCRETE MAX=<n,n>

This setting allows you to limit the number of slots to track distinct values for a variable. If a variable has more than n levels then frequency information on first n levels encountered will be available in UI. Such Frequency Tables will be labeled incomplete in the GUI.

Lowering the limit can save significant computation resources, especially when you don’t care about tabulation for continuous variables with a lot of distinct levels.

Max. distinct values to display

DATAINFO <varlist> / N=<n>

This setting limits how many levels will be displayed in the resulting frequency table. In contrast to Max. distinct values to track this parameter has no effect on the construction of the frequency table for a specific variable. If Max. distinct values to track is greater than the number of levels large enough but Max. distinct values to display is smaller you will get all the stats derived from the frequency table (e.g. number of distinct values) but frequency tables themselves will be printed incomplete. But, also in contrast to Max. distinct values to track, n most frequent levels will be displayed.


Data Exploration 7

Lower the limit if you do want all the information on continuous and high-level categorical variables, but you don’t need to see full frequency tables in the results window. Showing frequencies for all the distinct levels of continuous variables in a large dataset could easily exhaust UI resources.

Separate display for most and least common values.

DATAINFO <varlist> / EXTREMES = <n>

Some variables with many levels, both continuous and categorical in nature, might have a significant number of observations sharing the same value. While a full frequency table would be expensive to compute and in many cases useless these most frequent levels might provide some useful insight. This setting allows specifying a cap on how many most and least common values to track.

Saving Descriptive Stats

Let’s save the descriptive stats for our dataset. For this please check Save to Grove checkbox and specify a file name. Your Descriptive Stats dialog should look like this

You can press OK button now to start the computation.

The following controls configure how results of Descriptive Stats computation are saved.

Details to Classic Output.

DATAINFO <varlist> / SILENT = NO

When this setting is ON Classic Output window will contain all the Descriptive Stats in textual form. This might be useful if you need to compare descriptive stats for several datasets.

Save to Grove.

GROVE "<File name>"

Just like other SPM analysis methods Descriptive Stats can save the results into a Grove file. You can open this file in GUI at any time to access Descriptive stats without recomputing them.


Data Exploration 8

Browsing Descriptive Stats

Once the computation is done or if you open a Grove file containing previously computed results you will see Descriptive Stats window. Here’s how results window looks like for our example.

Note the Full/Brief switch in top right corner.

Descriptive Stats in Brief Mode

By default the display is in Brief mode. You can quickly get the idea of ♦ How much of the data is missing?

♦ How many distinct values each column has?

♦ What are the range and the Mean in each column?

♦ What are the boundaries of specific percentiles?

Note that some variables (e.g. X2, X4) have “Many Values” as the number of distinct levels. This means these variables have too many levels to tabulate given how the run was configured. The maximal number of levels for a frequency table is configured in Descriptive Stats Setup dialog. If full tabulation is not available some of the stats are zero or empty. Since we specified the Strata variable these stats are available both for the overall dataset and for each stratum (T=0 and T=1 sheets in this particular example).


Data Exploration 9

You can use Find combobox on the top pane to lookup variable by name. To do this please either

♦ Select a variable from the drop-down list. Variables are always sorted in File Order.

♦ Start typing variable name into the edit. As you type the grid will reposition itself to the variable that starts with the sub-string you typed.

Sort combobox allows resorting variables in the grid Alphabetically or in File Order. Navigation tools are quite helpful when the list of variables is large. Other controls on the top pane allow you to

♦ Specify Precision of the numbers in the grid.

♦ Let the grid figure out the precision by checking Auto checkbox.

♦ Align content of the cells in the grid.

Descriptive Stats in Full Mode

Now let’s switch to Full mode.


Data Exploration 10

In this mode Descriptive Stats for a variable are organized in vertical sections. Each variable occupies a column. You can use Find and Sort comboboxes to navigate to a specific variable. For example please put focus into Find combobox and start typing X1. As a result you should now see X1 in the left-most column.

Continue to type in X10. Now the grid scrolled to X10.


Data Exploration 11

Buttons and near the title of each section allow you to expand and collapse the content. Buttons

in the top left corner of the grid are helpful to collapse or expand all the sections.

This display can also be used to get a quick idea about a particular stat or group of stats across all the variables. A convenient way to do this is to expand just the section of interest and scroll through all the variables.

The following sections are available.

Descriptive


Data Exploration 12

This section contains an extended set of descriptive stats computed by the Engine. There’re quite a few of them in addition to ones displayed in Brief mode. It contains the following stats. ♦ N

♦ N Missing

♦ N = 0

♦ N <> 0

♦ N Distinct Values

♦ Mean

♦ Std. Deviation

♦ Skewness

♦ Coefficient Variation

♦ Conditional Mean

♦ Sum of Weights

♦ Sum

♦ Variance

♦ Kurtosis

♦ Std. Error Mean

Note that we don’t have Median and Std. Deviation in the list. The reason is these values are more conveniently represented by the sections below.


Data Exploration 13

Location

This section helps you understand the location of the variable on the number line. The relevant stats are grouped together ♦ Mean

♦ Median

♦ Range

Variability

This section contains stats to describe the dispersion of the variable. It contains the following stats ♦ Std. Deviation

♦ Variance

♦ Interquartile range

Quantiles

This section allows you to assess probability distribution of the variable by showing most important percentiles.

Frequency Tables


Data Exploration 14

This section contains frequency tabulation of the variables. Note that for this particular variable the limit on frequency tabulation prevented from capturing the full frequency table.

Note that the stats for T are on Overall pane, but not on T=0 or T=1 pane. The stats for the Strata variable in a particular stratum are degenerate.

Exploring Frequency Distribution

You can explore Frequency Distributions of variables in a graphical form. Select. Explore>Frequency Distribution from the Explore menu.


Data Exploration 15

As a result Generate Charts dialog will appear.

Note that Level(s) column is not populated by default. To obtain level information full Descriptive Stats have to be computed. This is a potentially lengthy operation. By skipping it you can proceed directly to requesting charts for the variables of interest.

You can bring in level information by clicking Show Info button. This will run Descriptive Stats for all the variables and populate Level(s) column. Variable Levels Threshold control will become available to filter out variables with too many levels.

Since Frequency Distribution is powered by Descriptive stats results you can always navigate from


Data Exploration 16

Descriptive stats to Generate Charts window via right-click menu.

The controls in Generate Charts window allow you to specify a group of charts you’re interested in seeing in a single display. You can specify Workspace Label for the group for better reference. If a variable has more than Max Bins levels the values get binned and a histogram is displayed. Otherwise the chart will show full frequency distribution. Variable selection grid let’s you specify variables of interest.

Let’s request Frequency Distribution plots for all the variables in the dataset. You can select the entire column and use Select checkbox above to get all the individual variables selected. Once a selection is made Generate Charts button will become enabled.

As a result you will see Frequency Distribution window.

Histograms are incomplete for variables with too many levels (e.g. Y1 on the screenshot above). This

is due to the cap on number of levels to tabulate during Descriptive Stats. Generate chars dialog uses default caps to ensure acceptable performance. You can configure your Descriptive Stats run and then request Frequency Distribution display from it.


Data Exploration 17

Generate Charts window that you used to configure the run is still available. It keeps track of all the chart groups it produced and provides an easy way to navigate to each of them via right-hand side navigation panel.

Frequency Distribution window shows up to four plots at the same time. If there’re more than four plots you can scroll up and down to get to a plot of interest. Goto Variable combobox is a handy way to navigate to a variable by name. You can make Y axis scale the same for all charts axis using Same Scale button. These are just a few of the controls on lower tool panel of the window.

Variables treated as Continuous are displayed as histograms. For Categorical variables full frequency table is displayed. By default the information whether a variable is Continuous or Categorical comes from SPM Engine. Treat Vars As group of controls let’s you override this. In addition to the default you can opt to treat all variables as either Continuous or Categorical. The dataset we are examining contains quite a few variables with large number of levels. If you request to treat them as Categorical Frequency Distribution will try to plot each individual level. So if you navigate to X1 and click Categorical you might see something like this


Data Exploration 18

It is apparent that usability of these charts is limited. To improve this you can switch X-Axis combobox from All Values to Optimal. Now you can scroll each chart horizontally and explore the levels of interest

Let’s switch Treat Vars as back to Auto and navigate to X9. Since X9 and X10 are Continuous Binning controls determine how the levels are binned. T is also treated as continuous but it only has two levels.


Data Exploration 19

Now let’s reduce Binning number down to 5.

You can also switch between the following Bins Presets ♦ For Custom preset Binning slider defines the number of bins for all the Continuous variables. This is

the default.


Data Exploration 20

♦ For Optimal preset the number of bins is determined individually for each variable. This is useful when looking side-by-side on variables with radically different distribution.

Let’s switch to Optimal number of bins. Note that you’re seeing essentially the same histogram for X9. Note that the left-most 6 bins are obscured by much more populated ones to the right. Let’s truncate 50% of levels using Truncate Right control. As a result you can now see the left side of the distribution in more details.

Sometimes it is convenient to use an alternative chart type for a histogram. For each chart type there’s a button in Chart Types group of controls. For example, here is the same screenshot with chart type changed to Dot.


Data Exploration 21


spm users guide - salford systemsmedia.salford-systems.com/pdf/spm7/exploredata.pdfspm is a...

Documents