clustering_via_kmeans_and_kohonen_som

Data Mining with IBM SPSS Modeler 14.2

University of Arkansas

David Douglas

Clustering via K-Means and Kohonen SOM

Last Updated 4/7/2023 11:45:02 PM

IBM SPSS Modeler 14.2 and clustering

To illustrate clustering with k-means and Kohonen/SOM in IBM SPSS Modeler 14.2, create a stream flow as shown below.

Some of the nodes are just for viewing data, so the stream flow is not as complicated as it initially appears. First, edit the Excel File node to connect to the Prospect.xls file. The ID and LOC variables need to be excluded, which can be done on the Type tab of the Excel source node. As always, click the Read Values button on the Type tab. Again, because this is unsupervised modeling, there is no target variable, and the Direction for all the variables to be used in the model should be set to Input.

Open the k-means node; the default tab is Model. As always, you have the option of providing a custom name. By default, the Use partitioned data option is checked. For illustration purposes, set the Specified number of clusters to 5. For more detailed output, you can check the Generate distance field check box.

Clicking the Expert tab allows setting the maximum number of iterations as well as tolerance levels. It also lets you change the encoding value for sets—which has a default value of 0.70711 instead of 1. See below for an explanation but for now, no further changes are needed so execute the node.
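To make the Model and Expert settings more concrete, here is a minimal k-means sketch in plain Python. It is not Modeler's implementation; the function name `kmeans`, the random choice of starting centres, and the sample points are assumptions made for illustration. It shows how a specified number of clusters, a maximum number of iterations, and a tolerance interact, and how a per-record distance to the assigned centre (similar in spirit to a generated distance field) can be produced.

```python
import math
import random


def euclid(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=20, tolerance=1e-6, seed=0):
    """Plain k-means; returns (centres, labels, distances).

    The loop stops when max_iter passes have run or when no centre moved
    by more than tolerance during the last pass.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: start from k random records

    def nearest(p):
        return min(range(k), key=lambda c: euclid(p, centres[c]))

    for _ in range(max_iter):
        labels = [nearest(p) for p in points]           # assignment step
        new_centres = []
        for c in range(k):                              # update step
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(centres[c])          # empty cluster: centre stays put
        shift = max(euclid(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift <= tolerance:
            break

    labels = [nearest(p) for p in points]
    distances = [euclid(p, centres[lab]) for p, lab in zip(points, labels)]
    return centres, labels, distances


# Made-up records already scaled to the 0 to 1 range.
sample = [(0.1, 0.1), (0.12, 0.08), (0.9, 0.9), (0.88, 0.92), (0.1, 0.9), (0.12, 0.88)]
centres, labels, distances = kmeans(sample, k=3)
for record, label, dist in zip(sample, labels, distances):
    print(record, "-> cluster", label, "distance", round(dist, 4))
```

A looser tolerance or a smaller maximum number of iterations stops the loop earlier, at the cost of centres that may not have fully settled.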

As with most of IBM SPSS Modeler 14.2’s model nodes, executing the k-means node results in a model nugget on the canvas as well as one placed in the GMP. Run the k-means node and right-click the model nugget on the canvas to review the results.

A model summary and a cluster quality chart appear in the left pane, and a pie chart with the sizes of the clusters appears in the right pane. These are the default settings; note that both panes have a dropdown box that lets you select the desired view. The Cluster Sizes pie chart provides the percent for each of the five clusters.

Click the View: dropdown box in the left pane and select Clusters. Select Cluster Comparison from the dropdown box in the right pane. Then click cluster-1 in the left pane, not the Cluster Comparison in the right pane. Because of the Auto Data Prep node, all the variables have a suffix of _transformed. Note that the legend at the top of the left pane indicates that the darker the color, the more important the variable. Also note that moving the mouse into a cell displays the importance value and frequency.

From the cluster comparison, you can see that cluster-1 has an average age very close to the population average age, and that the cluster contains only records for married males with a climate value of 20 who do not own homes. A cluster-by-cluster comparison can be made in this way.

Clusters are presented in order of the number of records in the cluster. The first three clusters (cluster-1, cluster-3 and cluster-4) are considerably larger than the last two clusters (cluster-2 and cluster-5).
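The cluster sizes and their ordering can be reproduced from any list of cluster labels. The sketch below uses only the Python standard library; the function name `cluster_sizes` and the label counts are made up for illustration and do not come from the Prospect data.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return (cluster, count, percent) rows, largest cluster first."""
    counts = Counter(labels)
    total = len(labels)
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]


# Made-up cluster labels, one per record.
labels = ["cluster-1"] * 4 + ["cluster-3"] * 3 + ["cluster-4"] * 2 + ["cluster-2"]
for name, n, pct in cluster_sizes(labels):
    print(f"{name}: {n} records ({pct:.1f}%)")
```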

The right pane has the additional options of variable importance and cell distributions.

Also note that double clicking a cell in the left pane will create a distribution as shown for OwnHome to the right.

Double-click the Age_transformed variable and review its distribution.

To provide additional information about the clusters, you can select the desired clusters and generate a Select node. For our illustration, the three most populous clusters (the ones to the left) are selected in order to generate the Select node. Using standard Windows selection techniques, select the first three clusters (columns). While the columns are selected, click the Generate menu option and choose Select Node from the drop-down list. The generated node will be placed in the upper left-hand corner of the stream canvas.

As shown in the first canvas drawing, drag the generated Select node from the upper left-hand corner of the canvas to the right of the k-means model nugget, and connect the model nugget to the generated Select node.

Add three nodes to view the data—a Histogram, Plot, and Distribution node. Connect the nodes as was previously shown. No editing is required for the generated Select node.

Open the Plot node and try combinations of variables against the created variable $KM-K-Means. This particular illustration selects Sex for the X field, Married for the Y field and $KM-K-Means for the Overlay Color: field. If you really want to jazz up the display, also select a variable such as Climate for the Animation field.

See the plot below, where cluster-3 contains married and single females and cluster-4 contains married and single males. The size of the dots should indicate comparatively how many records are in each cluster for each level of gender and marital status.

Open the Distribution Node and set the Field and Overlay entries as shown; also check the Normalize by color checkbox. Then Run the node—the graphic output displays and is saved in the Outputs tab in the upper right window.

The display below is shown with the Normalize by color check box checked. Proportionally, cluster-1 has considerably fewer homeowners and cluster-4 has considerably more homeowners.

The columns on the right provide the percents and counts for each cluster. Review the clusters for the other categorical variables.
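Normalizing by color amounts to comparing proportions within each cluster rather than raw counts. The following sketch shows that calculation in plain Python; the function name `normalized_distribution` and the sample values are assumptions for illustration, not output from the Distribution node.

```python
from collections import Counter, defaultdict


def normalized_distribution(field_values, overlay_values):
    """For each field value, return the share of each overlay value.

    The shares within one field value sum to 1, so bars of different
    sizes can be compared on the same footing.
    """
    counts = defaultdict(Counter)
    for field, overlay in zip(field_values, overlay_values):
        counts[field][overlay] += 1
    return {
        field: {overlay: n / sum(c.values()) for overlay, n in c.items()}
        for field, c in counts.items()
    }


# Made-up records: cluster membership and home ownership.
clusters = ["cluster-1"] * 4 + ["cluster-4"] * 4
own_home = ["N", "N", "N", "Y", "Y", "Y", "Y", "N"]
for cluster, shares in normalized_distribution(clusters, own_home).items():
    print(cluster, shares)
```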

For interval variables, use the Histogram Node. Open the Histogram Node and select Income for the Field value and $KM-K-Means for the Overlay variable (this window is not shown). Run the Histogram Node to get the following graph; note that the higher income values are all in cluster-4. Review the other interval variables, Age and FICO.

IBM SPSS Modeler’s Kohonen (SOM) Node

The Kohonen (SOM) Node, although using a different algorithm, also does clustering. Its basic assumption that clusters are formed from patterns that share similar features is consistent with the k-means clustering algorithm. See the initial discussion of Kohonen (SOM) for a conceptual understanding of how it works.

Open the Kohonen Node. As always, you have the option of providing a custom name for the Node. If you wish to replicate the run, you would need to provide a random seed. For our example, no changes are needed for the Model tab.

Click the Expert tab.

For illustrative purposes, set the Width to 2 and the Length to 2. This requires selecting the Expert option. Because this algorithm works similarly to a neural net without hidden layer(s), some of the expert settings follow a similar logic. For our example, leave the remaining settings at their defaults.

When the model is run, a grid will appear and the colors will change as the data is passed through the Kohonen Node. Red indicates the cells winning the most instances. This may happen quickly enough that you miss it.

After setting the Width and Length values, Run the node.

Double-click the model nugget to review the results. Note that with a setting of 2 by 2, only one cluster is created.

Change the setting to 1 by 5 and run the model again. Although 5 possible clusters could have been generated, only 4 clusters were created as shown on the right. Also, note that they are referred to as: X=0, Y=0; X=0, Y=1; X=0, Y=2; and X=0, Y=4. Expand and review all 4 clusters—looking for uniqueness in each cluster via the Cluster Comparison pane. Note that the browse views are identical to the K-Means Node so no further explanation is needed.
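To see why a grid can produce fewer clusters than it has cells, it helps to look at a toy self-organising map. The sketch below is a simplified illustration in plain Python, not the algorithm as implemented in the Kohonen node; the function names `train_som` and `winner`, the linear decay of the learning rate and neighbourhood radius, and the sample records are all assumptions. Each record is assigned to the grid cell whose weights are closest to it, and a cell that never wins a record yields no cluster.

```python
import math
import random


def euclid(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def winner(weights, record):
    """Grid coordinates (x, y) of the cell whose weights are closest to the record."""
    return min(weights, key=lambda cell: euclid(weights[cell], record))


def train_som(records, width, length, epochs=50, eta=0.3, seed=1):
    """Train a tiny map and return {(x, y): weight vector}."""
    rng = random.Random(seed)
    dim = len(records[0])
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(width) for y in range(length)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                 # shrinks from 1 towards 0
        rate = eta * frac                           # decaying learning rate
        radius = max(width, length) * frac / 2      # decaying neighbourhood radius
        for record in records:
            win = winner(weights, record)
            for cell, w in weights.items():
                if euclid(cell, win) <= radius:     # the winner and nearby cells
                    for i in range(dim):
                        w[i] += rate * (record[i] - w[i])
    return weights


# Made-up records already scaled to the 0 to 1 range.
records = [(0.1, 0.1), (0.15, 0.1), (0.9, 0.9), (0.85, 0.95), (0.5, 0.5)]
som = train_som(records, width=1, length=5)
winners = [winner(som, r) for r in records]
print("cells in grid:", len(som), "cells that won records:", len(set(winners)))
```

Changing the seed changes the starting weights and therefore possibly the winning cells, which is why a random seed is needed to replicate a run.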

One could generate a Select Node as before, but this is not necessary when using all the clusters (four in this case). Further analysis of the clusters can be accomplished by using the Plot, Distribution and Histogram Nodes directly attached to the model nugget. Part of the overall stream flow is shown below.

The Kohonen node creates three new variables: $KX-Kohonen, $KY-Kohonen and $KXY-Kohonen. The last of these is useful for representing a cluster. You may wish to use a Table node to review the values of these variables.

Because the Plot, Distribution and Histogram Nodes are used in the same way as in the K-Means example, they will not be discussed here. Illustrative examples are:

Plot Node:
X Field: $KX-Kohonen
Y Field: $KY-Kohonen
Overlay: Sex

Distribution Node:
Field: $KXY-Kohonen SOM
Overlay: OWNHOME

Histogram Node:
Field: Income
Overlay: $KXY-Kohonen SOM

Of course, you will want to try other combinations of variables in exploring the clusters.

The TwoStep Clustering was also run but no additional output is illustrated here.

Also shown is the IBM SPSS Modeler 14.2 Auto Cluster node. If you connect the Auto Data Prep node to the Auto Cluster node and run it, you will find that it determines the best cluster model to be TwoStep, based on a Silhouette value. According to the Help menu, the Silhouette value is an index that measures both cluster cohesion and separation. Search for Silhouette Ranking Measures in Help for details.

A portion of the results is shown below.
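For readers who want to see how a silhouette-style index combines cohesion and separation, here is a sketch of the textbook silhouette coefficient in plain Python. It is an illustration only: Modeler's own Silhouette calculation may differ in detail, and the function name `silhouette` and the sample points are assumptions. For each record, a is the average distance to the other records in its own cluster (cohesion), b is the smallest average distance to the records of any other cluster (separation), and the record's score is (b - a) / max(a, b).

```python
import math


def euclid(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels):
    """Average textbook silhouette coefficient over all records."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)          # common convention for single-record clusters
            continue
        # The record itself contributes a distance of 0, so divide by len(own) - 1.
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        b = min(sum(euclid(p, q) for q in other) / len(other)
                for name, other in clusters.items() if name != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up points: the first labelling follows the two obvious groups, the second does not.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print("sensible labels:", round(silhouette(points, ["a", "a", "b", "b"]), 3))
print("poor labels:    ", round(silhouette(points, ["a", "b", "a", "b"]), 3))
```

Values near 1 indicate tight, well-separated clusters, while values near 0 or below indicate overlapping or poorly assigned clusters. Because the result is an average, it says nothing about whether any individual cluster is useful.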

What you should think about:
1. Clustering requires being inquisitive and having domain knowledge.
2. There is no quantitative method to determine which is the most useful cluster or clusters.
3. A small cluster may be the most useful cluster.
4. Even though the Silhouette Ranking Measure is a way to determine "best" clusters, it by no means can identify which of the clusters created via the various cluster nodes could be a useful cluster. It can only average cohesiveness and separation.

IBM SPSS Modeler 14.2 notes on missing values and standardization

IBM SPSS Modeler 14.2 uses the same techniques for standardizing values and handling missing values in both k-means and Kohonen/SOM models. Fields of Range type are transformed to a 0 to 1 range as follows:

New Value = (Value – Lower bound) / Range

Flag fields are coded such that false = 0 and true = 1. For Set fields, each value of the set is assigned a new temporary input field, coded as either 0 or 1 (a dummy variable). Thus, a set with three values will have three new inputs. In practice, these new inputs use the encoding value of 0.70711 (the Expert tab default, about .707) instead of 1, because 1's tend to dominate the cluster.
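One way to read the 0.70711 encoding value is that it equals the square root of 0.5. With that value, two records that differ on a Set field end up a Euclidean distance of about 1 apart, the same as two records that differ on a Flag field, whereas an encoding of 1 would put them about 1.414 apart. The sketch below checks this arithmetic; the function name `encode_set` and the category values are made up for illustration.

```python
import math

SET_ENCODING = math.sqrt(0.5)  # about 0.70711, the default encoding value for sets


def euclid(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def encode_set(value, categories, on_value=SET_ENCODING):
    """One derived input per category: on_value for the matching category, else 0."""
    return [on_value if value == c else 0.0 for c in categories]


categories = ["A", "B", "C"]  # hypothetical values of a Set field

# Two records with different set values, encoded with 0.70711 and with 1.
d_default = euclid(encode_set("A", categories), encode_set("B", categories))
d_ones = euclid(encode_set("A", categories, 1.0), encode_set("B", categories, 1.0))
d_flag = euclid([0.0], [1.0])  # two records that differ on a single Flag field

print("set mismatch, 0.70711 encoding:", round(d_default, 4))
print("set mismatch, 1 encoding:      ", round(d_ones, 4))
print("flag mismatch:                 ", round(d_flag, 4))
```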

The approach for missing values is to replace them with neutral values. For range and flag fields with missing values (blanks and nulls), the missing value is replaced with 0.5; recall that these types of variables were transformed to values from 0 to 1, so 0.5 is in theory relatively neutral. Values below the lower bound are set to the lower bound, and likewise values above the upper bound are set to the upper bound. For set fields with missing values, the derived fields are all set to zero.
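The range and flag rules described in these notes can be sketched in a few lines of Python. This is only an illustration of the stated rules, not Modeler code; the function names `scale_range` and `encode_flag`, the use of `None` to stand for a missing value, and the age bounds are assumptions.

```python
def scale_range(value, lower, upper):
    """Scale a Range field to 0..1 using (value - lower bound) / range.

    Missing values (None here) become 0.5, and values outside the bounds
    are pulled back to the nearest bound first. Assumes upper > lower.
    """
    if value is None:
        return 0.5
    clipped = min(max(value, lower), upper)
    return (clipped - lower) / (upper - lower)


def encode_flag(value):
    """Code a Flag field as false = 0 and true = 1, with 0.5 for a missing value."""
    if value is None:
        return 0.5
    return 1.0 if value else 0.0


# Hypothetical Age field with a lower bound of 18 and an upper bound of 78.
for age in (48, 12, 90, None):
    print("age", age, "->", scale_range(age, 18, 78))
for flag in (True, False, None):
    print("flag", flag, "->", encode_flag(flag))
```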

All of this is done automatically for you.
