
Neural Processing Letters 5: 219–226, 1997. © 1997 Kluwer Academic Publishers. Printed in the Netherlands.

Promoting Software Reuse Using Self Organizing Maps

SUSHIL ACHARYA and R. SADANANDA
Computer Science and Information Management Program, School of Advanced Technologies, Asian Institute of Technology, PO Box 4, Klong Luang 12120, Pathumthani, Thailand

Key words: data clustering, self-organizing maps, software reuse, unix

Abstract. Reusability of software, regardless of the technique used, is widely believed to be a promising means for improving software productivity and reliability. However, it is not practiced adequately due to the lack of techniques that facilitate the locating of reusable components that are functionally close. In this paper we apply Kohonen’s Self-Organizing Maps to develop an approach for promoting Software Reuse. We look at the details of how Self-Organization can arrange and regularize data from the original pattern space into a topology preserving map. We describe a practical implementation of the SOM methodology for Software Reuse using a database of UNIX commands. Finally, we briefly present our proposed Software Reuse Methodology.

1. Introduction

Reusability of software, regardless of the technique used, is widely believed to be a promising means for improving software productivity and reliability. Though the benefits of Software Reuse are obvious, the methods that promote Software Reuse are not encouraging. What is required are libraries of reusable software components that are accessible to software developers and that meet their requirements. Software libraries should be structured according to the semantic similarity of the stored components so that components exhibiting similar behaviour can be easily identified. Although libraries of reusable software do exist for commercial applications, real-time work, and engineering and scientific problems, very few systematic techniques exist for retrieval, update or structuring of software components.

There are two approaches to software reusability: the object-oriented approach and the controlled vocabulary approach, which includes Free-Text Indexing and Knowledge Based Indexing [1]. Free-Text Indexing Systems automatically extract keyword attributes from the natural language specifications provided by the user and use these attributes to localise software components. The GURU [2] and Reusable Software Library [3] systems are commercial systems following this approach. Knowledge Based Indexing Systems perform syntactic and semantic analysis of the natural language specification. LASSIE (Large Software System Information Environment) [4] follows this approach.

One reason for the only moderate success of reuse is the lack of techniques that facilitate the locating of reusable components that are functionally close. An efficient mechanism to help the reuser retrieve functionally close components will be an important step in promoting Software Reuse. There has been work in the Information Retrieval (IR) area that adopts different approaches to the structuring problem. We have chosen the Artificial Neural Network (ANN) approach, namely Kohonen’s Self-Organising Maps (SOM) [5, 6, 7], to organise various software components into clusters based on their characteristics.

SOM involves the non-linear projection of a stream of input patterns of arbitrary dimension to a one- or two-dimensional neuron plane (feature space). The projection preserves, and at the same time exposes, the relationships implicitly existing between the patterns in the original multidimensional pattern space. However, the projection preserves only the most important neighborhood relationships and ignores the less important ones. The inputs that appear more frequently are mapped to larger and stronger domains at the expense of less frequent inputs. Input patterns which are close in the input space and appear very frequently are projected into a cluster of images forming localized images on the map. These images are then arranged into topological neighborhoods that display the overall relationships within the entire input data set [8].

A significant problem in Software Reuse is the lack of appropriate methods and tools for representing and retrieving reusable software. The major issues are:

• Representation of reusable software items (semantic and syntactic conventions for describing the items)

• Retrieval of reusable software (user-system interaction and retrieval methods)

We believe the SOM approach can properly handle these issues because:

i. ANN is capable of tolerating noisy or inexact input data.

ii. ANN serves as an associative memory.

In the following sections we describe the SOM Data Clustering approach and the practical implementation of this approach for Software Reuse using a set of UNIX commands. We also present our proposed Software Reuse Methodology.

2. Data Clustering Using SOM

SOM can be used to classify data by mapping them into clusters in a two-dimensional plane provided that the network training vectors are properly generated, the data items are properly represented, and the formed clusters are properly interpreted.

2.1. DATA CLUSTERING PHASES

Our approach to Data Clustering using the unsupervised learning technique consists of three phases [9]:

(i) Phase 1: Data Representation

In this phase the training, testing and production data sets are represented using a proper representation scheme for input as input vectors to the Self-Organizing network. For explanation purposes we will concentrate on data domains where the data is normalized, the data consists of common attribute types (either nominal or ordinal), and the data attributes are represented in a common unit type.

Based on these assumptions we have defined the following data representation scheme:

• For databases of ‘yes’ or ‘no’ response type – assign either ‘1’ or ‘0’.

• For databases of continuous inputs type – assign a value for a range of inputs.

For a Database consisting of, say, R rows and A attributes, the Connectivity Matrix [10] can be created by denoting the presence and absence of an attribute by ‘X1’ and ‘X0’ respectively.
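The representation scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_connectivity_matrix` and `bin_value`, the sample records and the range boundaries are hypothetical and do not come from the original implementation. Presence and absence are written as 1 and 0, as in the ‘yes’/‘no’ rule.

```python
def build_connectivity_matrix(records, attributes):
    """Return one 0/1 row per record: 1 if the attribute is present, else 0.

    `records` maps a row name to the set of attributes it exhibits
    (hypothetical input format chosen for this sketch).
    """
    return {
        name: [1 if attr in present else 0 for attr in attributes]
        for name, present in records.items()
    }


def bin_value(value, upper_edges):
    """Map a continuous value to the index of the range it falls in.

    `upper_edges` lists the exclusive upper bound of each range in
    ascending order; values beyond the last edge fall in a final range.
    """
    for index, upper in enumerate(upper_edges):
        if value < upper:
            return index
    return len(upper_edges)


# Illustrative data only: R = 2 rows, A = 4 attributes.
attributes = ["display", "contents", "file", "directory"]
records = {
    "more": {"display", "contents", "file"},
    "cp": {"file", "directory"},
}
matrix = build_connectivity_matrix(records, attributes)
```

Each row of `matrix` can then be used directly as an input vector for the network.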

(ii) Phase 2: Cluster Formation through Network Training

In this phase the Connectivity Matrix formed in phase 1 is provided as training data (input vectors) to the Self-Organizing network. The SOM algorithm first initializes the network size and training parameters. The adaptive weights (synaptic weights) are randomly initialized if the training is being performed for the first time. Otherwise pre-stored values are loaded and the training is initiated. Each input vector is normalized to prevent larger values from dominating the training process. The input vector is randomly (or sequentially) drawn from the Connectivity Matrix and presented to the network. As the input vectors are provided to the system, the winner node is located and the adaptive weights of the neighborhood are adjusted. The adaptation of the nodes is important for the forthcoming input vectors. As the training progresses, the neighborhoods are linearly decremented. The training stops after the training constant approaches zero. Once the training is complete, the initialization weights and the final weights (simulated during training) are stored and, if necessary, used for future training tasks. The final weights are very close to the corresponding input patterns and represent the knowledge of the whole training process. Two criteria are used to check whether the training is optimal or not:

a. There are no changes in positions in the output space at the end of the training and no changes in weights.

b. The Index of Order is sufficiently low and overlapping is minimum.

To validate the formation of the clusters the following techniques are used:

a. Comparison with findings from other sources.

b. Hide and Seek Methodology.
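The training loop described in this phase can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the study: the square neighborhood, the linear decay schedules, the default parameter values and the function names `normalize`, `find_winner` and `train_som` are all assumptions made for the example.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length so large values do not dominate."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def find_winner(weights, vector):
    """Return (row, col) of the node whose weights are closest to vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, node in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(node, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=200, rate0=0.5, seed=0):
    """Train a rows x cols map; returns the grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initialization, as for a first-time training run.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    inputs = [normalize(v) for v in data]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs        # decays linearly toward zero
        rate = rate0 * frac                # training constant (assumed schedule)
        radius = int(radius0 * frac)       # neighborhood shrinks linearly
        for vector in rng.sample(inputs, len(inputs)):  # random presentation order
            wr, wc = find_winner(weights, vector)
            # Adjust the winner and every node in its square neighborhood.
            for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
                for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                    node = weights[r][c]
                    for i, value in enumerate(vector):
                        node[i] += rate * (value - node[i])
    return weights


# Illustrative input vectors only.
data = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
trained = train_som(data, rows=4, cols=4)
winner = find_winner(trained, normalize(data[0]))
```

Storing `trained` and reloading it in place of the random initialization would correspond to resuming from pre-stored weights.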

(iii) Phase 3: Cluster Analysis and Formation of Rules

The output vector produced by the training algorithm in phase 2 will be in the form of pattern maps which depict the clusters. The database can now be grouped based on these clusters, and rules of the following format can be defined:

IF (A1 OP C1) and (A2 OP C2) and ...... and (An OP Cn)
THEN CDj

Here the Ai are the attributes of the input tuple, the Ci are constants and OP stands for the relational operators.
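A rule of this format can be represented and evaluated with a small amount of code. The sketch below is illustrative only: it assumes that CDj stands for the label of cluster j, and the rule set, the cluster names and the helper names `rule_matches` and `classify` are hypothetical.

```python
import operator

# Relational operators that may appear as OP in a rule.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def rule_matches(conditions, record):
    """True if every (attribute, op, constant) condition holds for record."""
    return all(OPS[op](record[attr], const) for attr, op, const in conditions)


def classify(rules, record):
    """Return the cluster label of the first matching rule, or None."""
    for conditions, cluster in rules:
        if rule_matches(conditions, record):
            return cluster
    return None


# Hypothetical rules: IF (display == 1) and (file == 1) THEN "viewers", etc.
rules = [
    ([("display", "==", 1), ("file", "==", 1)], "viewers"),
    ([("directory", "==", 1), ("display", "==", 0)], "file-management"),
]

record = {"display": 1, "contents": 1, "file": 1, "directory": 0}
label = classify(rules, record)
```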

3. Implementation of SOM for Software Reuse

In our case we emphasize the possibility of using one software command in place of another by relating software commands. For this purpose similar commands need to be clustered. To apply the concept of software reusability using the approach defined earlier, a set of commonly used UNIX commands was taken, represented through the features of each command, and then mapped into a Self-Organizing Feature Map.

3.1. REPRESENTATION OF THE SOFTWARE COMMANDS

Since the software commands need to be properly represented for input to the SOM Data Clustering approach, they were defined in terms of the following features [1]:

Name: software command that implements an action

Action: the verb in a sentence, other than a ‘be-verb’ such as is or are

Object: the thing affected by an action

Location: where an action occurs

Manner: the mode by which an action is performed

For example, ‘GREP’ is defined as ‘Grep searches the named file for a line containing a given pattern’. Hence its features are:

Name: Grep

Action: search

Object: line

Location: file

Manner: pattern

From these features a list of attributes was identified. The underlying idea is that commands having similar characteristics have similar features and show similar attributes. A table was then created with the attributes as headers and the command name as the record index field. A cell was represented by 1 if the listed attribute was present for the command and by 0 if the attribute was absent. Table I represents a portion of a sample data table (i.e. a portion of the Connectivity Matrix).

Table I. UNIX Commands and Attributes.

Name/Attributes  ....  display  contents  file  directory  remove  rename  copy  search  lines  page  ....
....             ....  .        .         .     .          .       .       .     .       .      .     ....
cp               ....  0        0         1     1          0       0       1     0       0      0     ....
more             ....  1        1         1     0          0       0       0     0       0      0     ....
mv               ....  0        0         1     1          0       1       0     0       0      0     ....
rm               ....  0        0         1     1          1       0       0     0       0      0     ....
page             ....  1        1         1     0          0       0       0     0       0      0     ....
grep             ....  0        0         1     0          0       0       0     1       1      0     ....
man              ....  0        0         0     0          0       0       0     0       0      1     ....
....             ....  .        .         .     .          .       .       .     .       .      .     ....

Figure 1. UNIX Commands mapped in a 10 × 10 matrix.

Each tuple in the table above was then provided as an input vector to the SOMclustering algorithm. After the training phase the feature map in Figure 1 wasobtained.
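To make the input vectors concrete, the rows shown in Table I can be written out and compared directly. The sketch below only inspects the raw vectors with a Hamming distance; it does not reproduce the trained map, and the names `TABLE_I`, `to_vector` and `hamming` are introduced here for illustration. Only the ten attribute columns visible in the excerpt are used.

```python
# Rows of Table I, restricted to the ten attribute columns shown:
# display contents file directory remove rename copy search lines page
TABLE_I = {
    "cp":   "0011001000",
    "more": "1110000000",
    "mv":   "0011010000",
    "rm":   "0011100000",
    "page": "1110000000",
    "grep": "0010000110",
    "man":  "0000000001",
}


def to_vector(bits):
    """Turn a row such as '0011001000' into a list of 0/1 integers."""
    return [int(b) for b in bits]


def hamming(a, b):
    """Number of attribute positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


vectors = {name: to_vector(bits) for name, bits in TABLE_I.items()}

# Pairwise distances between all commands in the excerpt.
names = sorted(vectors)
distances = {
    (a, b): hamming(vectors[a], vectors[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

In this excerpt `more` and `page` have identical vectors, while `cp`, `mv` and `rm` differ from one another in two attributes each, which is the kind of closeness the map is expected to reflect.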

Such a mapping provides assistance for software developers in locating reusable software commands that meet specified functionality by depicting the relationship between the commands based on their properties. For example, more and page (and cat) are located nearby since their properties are quite similar. Queries can also be performed by providing them in the form of input vectors to the system. The system will then identify the location of the query vector in the trained map. Let us consider as a query command Q1, with attributes display, contents and file. The query name, attributes and input vector are listed below. Figure 2 represents the mapping of this query in the trained map depicted in Figure 1. The query is seen to be mapped into neuron 9 and is near to the commands more, page and cat, which have similar properties.


Figure 2. Query Q1 mapped to an existing network.

This approach shows that by using Self-Organizing Maps software commands having similar properties can be grouped together, subsequently aiding software programmers.

Sample Query

Name: Q1

Attributes: display, contents, file

Input vector: .... 1110000000 ....
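The construction of the query vector for Q1 can be sketched as below. As a stand-in for locating the query on the trained map, the example ranks the commands of the Table I excerpt by Hamming distance to the query in attribute space; this is an assumption made for illustration, not the map lookup itself. The names `encode_query` and `rank_commands` are hypothetical, and cat is omitted because its row does not appear in the excerpt.

```python
ATTRIBUTES = ["display", "contents", "file", "directory", "remove",
              "rename", "copy", "search", "lines", "page"]

# Rows of the Table I excerpt, one character per attribute column.
TABLE_I = {
    "cp":   "0011001000",
    "more": "1110000000",
    "mv":   "0011010000",
    "rm":   "0011100000",
    "page": "1110000000",
    "grep": "0010000110",
    "man":  "0000000001",
}


def encode_query(present):
    """Build the 0/1 input vector (as a string) for a set of attribute names."""
    return "".join("1" if attr in present else "0" for attr in ATTRIBUTES)


def rank_commands(query_bits):
    """Sort commands by Hamming distance to the query, ties broken by name."""
    scored = [
        (name, sum(q != b for q, b in zip(query_bits, bits)))
        for name, bits in TABLE_I.items()
    ]
    return sorted(scored, key=lambda item: (item[1], item[0]))


q1 = encode_query({"display", "contents", "file"})
ranking = rank_commands(q1)
```

Here `q1` reproduces the input vector listed for the sample query, and the two closest commands in the excerpt are `more` and `page`.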

4. A Software Reuse Methodology

This section briefly presents a proposed Software Reuse methodology that uses Kohonen’s Self-Organizing Maps for data clustering. Figure 3 depicts an example of two technologies, Artificial Intelligence and DBMS, working in a united effort to shape the future of the computing world.

In this methodology the DBMS (Data Base Management System), the core of the system, is responsible for the data management activities. The Data Set Generator module creates Training Data Sets (training input vectors) from the Real World Database’s Attributes Domain. Necessary Data Conversion and Scaling is also carried out by this module. Statistical methods are used by the Data Analyzer to ensure that the Training Data Set exhibits a real world situation. The Training Data Set is then provided as input to the SOM algorithm by the Cluster Generator module and training is commenced. Once a minimum mismatch value has been reached the network is assumed to be trained. The Cluster Visualization module is then used for identifying the clusters and differentiating the various plateaus and valleys. The 3-D hit method is used for this purpose [9]. The Rule Generator module then creates the necessary rules which govern the cluster formation and stores these rules in a Rules Database. Once this is completed, the DBMS is capable of performing queries put forward by software developers.


Figure 3. A Software Reuse Methodology.

5. Conclusions and Recommendations

In this study we have used Kohonen’s Self-Organizing Maps to present an approach to developing a methodology to promote Software Reuse. We have described our three phases for Data Clustering and have used the same approach for clustering software commands. A set of UNIX commands has been used to create a UNIX Commands Attributes Database and to generate a Connectivity Matrix. Clustering has been carried out and verified with test data. Finally a proposed Software Reuse Methodology has been presented where statistical methods are used to monitor the creation of the Training Data Sets, Kohonen’s SOM algorithm is used to create the clusters, the 3-D hits method is used to visualize the clusters, and the clusters are represented by rules which are placed in a database for easy access through queries.

It is difficult to devise a general methodology for data representation, but a proper representation scheme that would be applicable to the majority of cases would certainly reduce the complexity involved. Likewise it is difficult to identify cluster boundaries when the boundaries overlap. The 3-D Hits Method gives an approximation of the cluster boundaries but fails when the boundaries overlap. Further research is needed in this area. In our methodology a proper Rule Generation algorithm has not been developed yet. The possibility of using Back-Propagation is being looked into. An algorithm capable of visualizing the cluster boundaries and accordingly creating rules should be developed.


References

1. S. Pandey, “Self-Organizing Map to promote Software Reuse”, Thesis Report, Asian Institute of Technology, Bangkok, 1994.

2. B. Maarek and Kaiser, “An information retrieval approach for automatically constructing software library”, IEEE Transaction on Software Engineering, Vol. 17, No. 8, 1991.

3. Burton et al., “The reusable software library”, IEEE Software, pp. 25–33, July 1987.

4. B. Devanbu and B. Selfridge, “LASSIE: A knowledge based software information system”, CACM, Vol. 4, No. 5, pp. 34–49, May 1991.

5. T. Kohonen, “The Self Organizing Map”, IEEE Proceedings, Vol. 78, No. 9, pp. 1464–1480, 1990.

6. T. Kohonen and H. Ritter, “Self-Organizing Semantic Maps”, Biological Cybernetics, 1989.

7. T. Kohonen, “Self-Organization and Associative Memory”, Springer Verlag, Heidelberg, Germany, 1989.

8. J.M. Zurada, “Introduction to Artificial Neural Systems”, Info Access Distribution Limited, Singapore, 1992.

9. S. Acharya and R. Sadananda, “A knowledge discovery methodology using Self Organizing Maps”, Proc. of the International Conference on Information Systems Analysis and Synthesis (ISAS’96), Orlando, USA, 1996.

10. R. Sadananda, A. Shrestha and N. Khosla, “The choice of neighborhood in Self-Organization Scheme for VLSI”, IEEE Conference in Expert Systems, AIT, 1994.
