
Hugh M. Cartwright, Les M. Sztandera (Eds.)

Soft Computing Approaches in Chemistry

Springer

Berlin Heidelberg

New York Hong Kong

London Milano

Paris Tokyo


Studies in Fuzziness and Soft Computing, Volume 120 http://www.springer.de/cgi-bin/search_book.pl?series=2941

Editor-in-chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6 01-447 Warsaw Poland E-mail: [email protected]

Further volumes of this series can be found at our homepage

Vol. 102. B. Liu Theory and Practice of Uncertain Programming, 2002 ISBN 3-7908-1490-3

Vol. 103. N. Barnes and Z.-Q. Liu Knowledge-Based Vision-Guided Robots, 2002 ISBN 3-7908-1494-6

Vol. 104. F. Rothlauf Representations for Genetic and Evolutionary Algorithms, 2002 ISBN 3-7908-1496-2

Vol. 105. J. Segovia, P.S. Szczepaniak and M. Niedzwiedzinski (Eds.) E-Commerce and Intelligent Methods, 2002 ISBN 3-7908-1499-7

Vol. 106. P. Matsakis and L.M. Sztandera (Eds.) Applying Soft Computing in Defining Spatial Relations, 2002 ISBN 3-7908-1504-7

Vol. 107. V. Dimitrov and B. Hodge Social Fuzziology, 2002 ISBN 3-7908-1506-3

Vol. 108. L.M. Sztandera and C. Pastore (Eds.) Soft Computing in Textile Sciences, 2003 ISBN 3-7908-1512-8

Vol. 109. R.J. Duro, J. Santos and M. Grana (Eds.) Biologically Inspired Robot Behavior Engineering, 2003 ISBN 3-7908-1513-6

Vol. 110. E. Fink Changes of Problem Representation: Theory and Experiments, 2003 ISBN 3-7908-1523-3

Vol. 111. P.S. Szczepaniak, J. Segovia, J. Kacprzyk and L.A. Zadeh (Eds.) Intelligent Exploration of the Web, 2003 ISBN 3-7908-1529-2

Vol. 112. Y. Jin Advanced Fuzzy Systems Design and Applications, 2003 ISBN 3-7908-1537-3

Vol. 113. A. Abraham, L.C. Jain and J. Kacprzyk (Eds.) Recent Advances in Intelligent Paradigms and Applications, 2003 ISBN 3-7908-1538-1

Vol. 114. M. Fitting and E. Orłowska (Eds.) Beyond Two: Theory and Applications of Multiple Valued Logic, 2003 ISBN 3-7908-1541-1

Vol. 115. J.J. Buckley Fuzzy Probabilities, 2003 ISBN 3-7908-1542-X

Vol. 116. C. Zhou, D. Maravall and D. Ruan (Eds.) Autonomous Robotic Systems, 2003 ISBN 3-7908-1546-2

Vol. 117. O. Castillo, P. Melin Soft Computing and Fractal Theory for Intelligent Manufacturing, 2003 ISBN 3-7908-1547-0

Vol. 118. M. Wygralak Cardinalities of Fuzzy Sets, 2003 ISBN 3-540-00337-1

Vol. 119. Karmeshu (Ed.) Entropy Measures, Maximum Entropy Principle and Emerging Applications, 2003 ISBN 3-540-00242-1


Hugh M. Cartwright Les M. Sztandera (Eds.)

Soft Computing Approaches in Chemistry

Springer


ISBN 978-3-642-53507-9 ISBN 978-3-540-36213-5 (eBook) DOI 10.1007/978-3-540-36213-5

Dr. Hugh M. Cartwright Oxford University Physical and Theoretical Chemistry South Parks Road Oxford OX1 3QZ UK E-mail: [email protected]

ISSN 1434-9922

Prof. Les M. Sztandera Philadelphia University CIS Department Philadelphia, PA 19144 USA

E-Mail: [email protected]

Library of Congress Cataloging-in-Publication-Data

Soft computing approaches in chemistry / Hugh M. Cartwright, Les M. Sztandera (eds.). p. cm. -- (Studies in fuzziness and soft computing; v. 120) Includes bibliographical references and index. ISBN 978-3-642-53507-9 1. Soft computing. 2. Chemistry--Data processing. I. Cartwright, Hugh M. II. Sztandera, Les M., 1961- III. Series.

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitations, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003 Softcover reprint of the hardcover 1st edition 2003

The use of general descriptive names, registered names trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: camera-ready pages delivered by editors Cover design: E. Kirchner, Springer-Verlag, Heidelberg Printed on acid-free paper 6213020/M - 5 4 3 2 1 0


Editors' Preface

The contributions to this book cover a wide range of applications of Soft Computing to the chemical domain. The early roots of Soft Computing can be traced back to Lotfi Zadeh's work on soft data analysis [1], published in 1981. 'Soft Computing' itself became fully established about ten years later, when the Berkeley Initiative in Soft Computing (BISC), an industrial liaison program, was put in place at the University of California, Berkeley.

Soft Computing applications are characterized by their ability to:

• approximate many different kinds of real-world systems;

• tolerate imprecision, partial truth, and uncertainty; and

• learn from their environment.

Such characteristics commonly lead to a better ability to match reality than other approaches can provide, generating solutions of low cost, high robustness, and tractability.

Zadeh has argued that soft computing provides a solid foundation for the conception, design, and application of intelligent systems employing its methodologies symbiotically rather than in isolation. There exists an implicit commitment to take advantage of the fusion of the various methodologies, since such a fusion can lead to combinations that may provide performance well beyond that offered by any single technique.

This book brings together original work from a number of authors who have made significant contributions to the evolution and use of nonstandard computing methods in chemistry. Ali and his co-authors present a wide-ranging summary of fuzzy classification techniques and their use in the development of "electronic noses". Bianucci and her co-workers discuss the topic of Quantitative Structure-Activity Relationships, an area of crucial importance to the pharmaceutical industry, and explain how neural networks can be of value in such studies.

The use of Genetic Algorithms for optimization of the structure of atom clusters is the topic of a chapter by Johnston and Roberts, while Hanai and co-authors intriguingly bring together Fuzzy Logic, Neural Networks and Genetic Algorithms in a study of how to improve the production of Japanese sake.

Sztandera and co-authors introduce another industrially significant area - that of the use of Soft Computing in the design of safe textile materials. Cartwright and co-authors consider several applications, covering the combination of neural networks with Fourier Transform Infrared Spectroscopy for the real-time monitoring of pollutants in the workplace, the use of Genetic Algorithms to help


evolve production rules for the real-time control of industrial resin plants, and the use of Bioinformatics in the clustering of data within large biochemical databases.

Genetic Algorithms also form the central technique used by Lottermoser and co-workers in the analysis of Mössbauer spectra. Another topic of direct relevance to both industry and chemistry, the synthesis of methanol, is discussed by Potočnik and co-workers. Gillet discusses how evolutionary algorithms are of value in the design of combinatorial libraries, a further contribution illustrating the extent to which Soft Computing now permeates industrial processes and research.

This book illustrates the remarkable degree to which Soft Computing in chemistry has developed since Rouvray and Kirby organized a conference in 1995 entitled Are the Concepts of Chemistry All Fuzzy?, to discuss the application of fuzzy logic to the chemical domain. With such a diverse range of applications, the book will appeal to professionals, researchers and developers of software tools for the design of Soft Computing-based systems in the chemical domain, and to many others within the computational intelligence community. It should also be of value to computer scientists who wish to apply their skills to real-world problems, and it forms a sound basis for graduate-level seminars on soft computing methods in chemistry.

The editors are grateful to Wojciech Slezak, an undergraduate student at Philadelphia University, for his enthusiastic assistance in preparing the camera ready manuscript.

Les M. Sztandera
Philadelphia University
Philadelphia, U.S.A.

Hugh Cartwright
Oxford University
Oxford, England

References

1. Zadeh L.A. (1981), Possibility theory and soft data analysis, in Mathematical Frontiers of the Social and Policy Sciences, Cobb L. and Thrall R.M. (Eds.), Westview Press, Boulder, CO, U.S.A., pp. 69-129.

"j the world of human thought generally, and in physical science in particular, the most important and most fruitfol concepts are those to which it is

impossible to attach a well defined meaning. "

- Hendrik Kramers -


Contents

Preface ............................................................................................................................... v

Application of Evolutionary Algorithms to Combinatorial Library Design .............. 1
V. J. Gillet

1 Introduction ...................................................... 2
2 Overview of a Genetic Algorithm ................................... 3
3 De Novo Design .................................................... 4
4 Combinatorial Synthesis ........................................... 6
5 Combinatorial Library Design ...................................... 9
6 Reactant Versus Product Based Library Design ...................... 9
7 Reactant-Based Combinatorial Library Design ...................... 12
8 Product-Based Combinatorial Library Design ....................... 13
9 Library-Based Designs ............................................ 17
10 Designing Libraries on Multiple Properties ...................... 19
11 Conclusion ...................................................... 26
References ......................................................... 27

Clustering of Large Data Sets in the Life Sciences ........................................... 31
K. Patel and H. M. Cartwright

1 Introduction ..................................................... 31
2 The Grouping Problem ............................................. 32
3 Unsupervised Algorithms .......................................... 34
4 Supervised Algorithms ............................................ 41
5 Evaluation of Clustering Results ................................. 44
6 Interpretation of Clustering Results ............................. 47
7 Conclusion ....................................................... 47
References ......................................................... 48

Application of a Genetic Algorithm to the Refinement of Complex Mössbauer Spectra ......................................................................................... 51

W. Lottermoser, T. Schell and K. Steiner
1 Introduction ..................................................... 51
2 Theoretical ...................................................... 54
3 Experimental ..................................................... 57
4 Results .......................................................... 60
5 Discussion ....................................................... 62
6 Conclusions ...................................................... 64
References ......................................................... 65


Soft Computing, Molecular Orbital, and Functional Theory in the Design of Safe Chemicals ................................................................................ 67
L. Sztandera, M. Trachtman, C. Bock, J. Veiga, and A. Garg

1 Introduction ..................................................... 68
2 Computational Methods ............................................ 71
3 Neural Network Approach .......................................... 84
4 Feed-Forward Neural Network Architecture ......................... 89
5 Azo Dye Database ................................................. 90
6 Concluding Remarks ............................................... 91
Acknowledgement .................................................... 92
References ......................................................... 92

Fuzzy Logic and Fuzzy Classification Techniques ............................................. 95
S. M. Scott, W. T. O'Hare and Z. Ali

1 Introduction ..................................................... 95
2 Fuzzy Sets ....................................................... 96
3 Case Studies of Fuzzy Classification Techniques ................. 101
4 Conclusion ...................................................... 133
References ........................................................ 133
Further Reading ................................................... 134

Application of Artificial Neural Networks, Fuzzy Neural Networks, and Genetic Algorithms to Biochemical Engineering .................................................. 135
T. Hanai, H. Honda, and T. Kobayashi

1 Introduction .................................................... 135
2 Application of Fuzzy Reasoning to the Temperature Control of the Sake Mashing Process ................................................ 137
3 Conclusion ...................................................... 157
Acknowledgements .................................................. 157
References ........................................................ 158

Genetic Algorithms for the Geometry Optimization of Clusters and Nanoparticles ............................................................................. 161
R. L. Johnston and C. Roberts

1 Introduction: Clusters and Cluster Modeling ..................... 161
2 Overview of Applications of GAs for Cluster Geometry Optimization .................................................................... 163
3 The Birmingham Cluster Genetic Algorithm Program ................ 169
4 Applications of the Birmingham Cluster Genetic Algorithm Program .................................................................... 175
5 New Techniques .................................................. 194
6 Concluding Remarks and Future Directions ........................ 200
Acknowledgements .................................................. 200
References ........................................................ 200


Real-Time Monitoring of Environmental Pollutants in the Workplace Using Neural Networks and FTIR Spectroscopy ........................................................ 205
H. M. Cartwright and A. Porter

1 Introduction .................................................... 205
2 FTIR in the Detection of Pollutants ............................. 206
3 The Limitations of FTIR Spectra ................................. 207
4 Potential Advantages of Neural Network Analysis of IR Spectra .. 210
5 Application of the Neural Network to IR Spectral Recognition ... 210
6 Spectral Interpretation Using the Neural Network ................ 220
7 Factors Influencing Network Performance ......................... 221

8 Comparison of Two and Three Layer Networks for Spectral Recognition ............. 225

9 A Network for Analysis of the Spectrum of a Mixture of Two Compounds .......... 227

10 Networks for Spectral Recognition and TLV Determination ................................. 229

11 Networks for Quantitative Spectral Analysis ......................................................... 232

References ................................................................................................................... 235

Genetic Algorithm Evolution of Fuzzy Production Rules for the On-line Control of Phenol-Formaldehyde Resin Plants ...................................................... 237
H. M. Cartwright and D. Issott

1 Introduction .................................................... 237
2 Resin Chemistry and Modelling ................................... 239
3 Simulation of Chemical Reactions ................................ 245
4 Model Comparison ................................................ 246
5 Automated Control in Industrial Systems ......................... 247
6 Program Development ............................................. 252
7 Comment ......................................................... 261
References ........................................................ 262

A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures ....... 265
A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita

1 Introduction .................................................... 265
2 Recursive Neural Networks in QSPR/QSAR .......................... 268
3 Representational Issues ......................................... 278
4 QSPR Analysis of Alkanes ........................................ 280
5 QSAR Analysis of Benzodiazepines ................................ 283
6 Discussion ...................................................... 291
7 Conclusions ..................................................... 293
References ........................................................ 294
A Appendix ........................................................ 295

Hybrid Modeling of Kinetics for Methanol Synthesis .......................................... 297
P. Potočnik, I. Grabec, M. Setinc, and J. Levec

1 Introduction .................................................... 297
2 Neural Networks ................................................. 298


3 Hybrid Modeling ................................................. 301
4 Feature Selection ............................................... 302
5 Modeling of Methanol Synthesis Kinetics ......................... 306
6 Conclusions ..................................................... 312
A Appendix - Analytical Model of Methanol Synthesis Kinetics ..... 313
Acknowledgements .................................................. 314
References ........................................................ 314

About the Editors ............................................................................... 317

List of Contributors ......................................................................................................... 319


Application of Evolutionary Algorithms to Combinatorial Library Design

Valerie J. Gillet

Department of Information Studies, University of Sheffield, Western Bank, Sheffield, S10 2TN.

Summary: The last decade has seen a revolutionary change in the processes used to discover novel bioactive compounds in the pharmaceutical and agrochemical industries. This change is due to the introduction of automation techniques which allow tens or hundreds of thousands of compounds to be synthesised simultaneously and then screened rapidly for activity. These techniques of combinatorial synthesis and high throughput screening have vastly increased the throughput of the traditional structure-activity cycle. Despite the initial enthusiasm for the methods, early results have been disappointing, producing fewer hits than were expected, or hits whose properties make them unsuitable as new drugs or agrochemicals. It is now realised that the number of compounds that could potentially be considered as new bioactive compounds is enormous compared to the numbers that can be handled in practice, even using automated techniques. Thus, efficient and effective methods are required for designing the sets of compounds to be used in combinatorial syntheses and screened in high throughput screening experiments. It is not possible to explore such large search spaces systematically, and hence many methods have been developed for designing combinatorial libraries. Evolutionary algorithms are well suited to searching for solutions to large combinatorial problems, and this chapter reviews the application of genetic algorithms, a sub-branch of evolutionary algorithms, to combinatorial library design.

Keywords: combinatorial synthesis; high throughput screening; combinatorial libraries; diversity analysis; de novo design; evolutionary algorithms; genetic algorithms; multiobjective optimisation


1 Introduction

The discovery of novel bioactive compounds as new drugs or agrochemicals is a complex and expensive process. It has been estimated to take in the region of 12 years to bring a new drug to the market place at a cost of some $300 million [1]. The traditional approach to drug discovery involves an iterative structure-activity cycle in which a medicinal chemist synthesises a compound, tests it for activity and then uses the results to suggest a new compound for synthesis, and so on. Using manual synthesis techniques a typical medicinal chemist might synthesise approximately fifty compounds a year. An increasing number of computer-aided approaches to drug discovery have been developed over the last 2 or 3 decades in an attempt to reduce the time, and hence the cost, required to find new and useful compounds. The techniques include molecular modelling, deriving quantitative structure activity relationships (QSARs), similarity searching, pharmacophore mapping, ligand docking and attempting to design novel compounds from scratch in a process known as de novo design [2, 3].

During the last decade, the drug discovery process itself has undergone a revolutionary change as a result of the application of automation techniques [4]. Robots are now used routinely both to screen compounds for biological activity, in a process known as high-throughput screening, and also to synthesise large numbers of compounds simultaneously, in a process known as combinatorial synthesis. In contrast to manual synthesis, combinatorial synthesis allows tens or even hundreds of thousands of compounds to be made in a single experiment, in what are known as combinatorial libraries, and correspondingly a high throughput screening experiment can be performed on hundreds of thousands of compounds. Thus, the throughput of the structure-activity cycle has increased enormously.

When the automated techniques were first introduced, it was believed that the massive increase in throughput would in itself be sufficient to increase the probability of finding novel bioactive compounds [4]. However, it is now realised that the number of compounds that could potentially be made is vastly larger than could ever be handled practically, and if the automation techniques are to be successful there is a strong requirement for combinatorial libraries and high throughput screening experiments to be designed very carefully.

Since the early 1990s evolutionary algorithms (EAs) have been applied to many techniques in computer-aided drug design [5]. EAs attempt to model the processes of Darwinian evolution [6]. They operate on a population of individuals where each individual (or chromosome) represents a potential solution to the problem to be solved. Genetic operators are applied in an iterative manner to evolve new


potential solutions. EAs include three different classes of algorithms: evolutionary programming (EP), evolutionary strategies (ES), and genetic algorithms (GAs). The algorithms differ in the genetic operators that are applied to evolve new potential solutions.

This chapter is focused on the application of EAs to combinatorial library design and the associated problem of selecting libraries of compounds for high throughput screening. Most of the applications described here are based on GAs, and so the chapter begins with a brief overview of the basic algorithm. This is followed by a review of methods developed to evolve single molecules in a process known as de novo design, and finally methods for the design of libraries of molecules are discussed.

2 Overview of a Genetic Algorithm

A typical GA begins with a population of randomly assigned chromosomes, where each chromosome is usually a linear string of bits or integers. Each individual is given a score via a fitness function that measures how well it satisfies the solution requirements. Individuals are chosen for breeding using a strategy that mimics survival of the fittest and breeding takes place via the genetic operators of crossover and mutation. Crossover involves the exchange of information between parents while mutation involves altering one or more bits in the chromosome at random. The new individuals are scored and inserted into the population replacing some existing members. The GA iterates through breeding and scoring cycles and, over time, better and better potential solutions evolve.

There are many parameters that can be varied in GAs, such as the population size and the rate at which crossover and mutation are applied; however the main considerations when implementing a GA are the chromosome encoding scheme, i.e. the mapping between the problem states and the chromosomes, and the fitness function that is used to determine the quality of a chromosome as a potential solution. The basic outline of a GA is shown in Figure 1.


Figure 1. Basic outline of a genetic algorithm: initialise population → select parents → apply genetic operators → apply fitness function to children → insert children into population → test for convergence, looping back to parent selection until converged.
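Since this loop recurs throughout the chapter, a minimal sketch may help fix the ideas. The Python fragment below is purely illustrative and comes from none of the programs discussed here: the bitstring encoding and the bit-counting fitness function are placeholder assumptions standing in for a real chromosome scheme and scoring function.

    import random

    def fitness(chrom):
        # Placeholder fitness: count of set bits.  A real application
        # would score a molecule or a library against the design criteria.
        return sum(chrom)

    def select(pop):
        # Tournament selection: a simple "survival of the fittest" strategy.
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    def crossover(p1, p2):
        # One-point crossover: exchange information between two parents.
        cut = random.randrange(1, len(p1))
        return p1[:cut] + p2[cut:]

    def mutate(chrom, rate=0.05):
        # Flip each bit with a small probability.
        return [1 - g if random.random() < rate else g for g in chrom]

    def run_ga(n_bits=20, pop_size=30, generations=200):
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            child = mutate(crossover(select(pop), select(pop)))
            # Insert the child into the population, replacing the worst member.
            worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
            pop[worst] = child
        return max(pop, key=fitness)

    print(run_ga())  # converges towards the all-ones chromosome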

3 De Novo Design

The early nineties was a time of very active research into de novo design methods which attempt to build molecules from small building blocks to fit a set of constraints. The building blocks are typically atoms or small molecular fragments and the joining together of all possible building blocks in all possible ways very quickly results in a combinatorial explosion of possibilities and a search space that is much too large to explore systematically. Over twenty different programs for de novo design have been reported in the literature [7] based on a variety of different techniques for overcoming the combinatorial explosion. The techniques include:

• limiting the building blocks that are available and hence the types of molecules that can be built;

• using random numbers to select a fraction of the available building blocks at each building step;

• the use of GAs as a technique for exploring large search spaces.


The GA based programs can be divided into those that generate molecules to fit 3D constraints, for example, the design of a ligand to fit a receptor, and those that generate molecules to fit 2D constraints, for example, molecules that are similar in 2D to a known active compound and molecules with certain physicochemical properties.

The chromosome representations involve the encoding of molecules as potential solutions and a variety of different encoding schemes have been used. When designing molecules in 2D, the encoding schemes usually involve representing a molecule as a linear string of atoms or substructural fragments, for example, using SMILES notation which is a linear notation that encodes 2D chemical structure [8]. Nachbar [9] has developed a program to evolve molecules in 2D to fit a QSAR or QSPR (Quantitative Structure Property Relationship) that uses genetic programming. Genetic programming is a subclass of GAs where the chromosomes are trees rather than linear strings. A tree representation maps well to the 2D representation of an acyclic chemical structure, although special labels are required to accommodate molecules containing rings. Thus the genetic programming approach evolves solutions that are represented by trees. Globus et al. [10] have subsequently described a method for designing molecules based on 2D similarity to a target compound. Their method is an extension of genetic programming that they call genetic graphs. The method evolves graphs, that is, the chromosome is a graph in which cycles, or rings, can be encoded directly.
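To make the tree idea concrete, the following sketch applies subtree crossover to toy tree-encoded acyclic structures. It is a schematic of the genetic-programming idea only, not Nachbar's or Globus's actual representation; the nested-list node format and all names are assumptions made for illustration.

    import copy
    import random

    # Toy encoding of an acyclic structure: a node is a list whose first
    # element is an atom label and whose remaining elements are child nodes.

    def all_paths(tree, path=()):
        # Enumerate the index paths of every subtree, including the root.
        paths = [path]
        for i, child in enumerate(tree):
            if isinstance(child, list):
                paths.extend(all_paths(child, path + (i,)))
        return paths

    def get(tree, path):
        for i in path:
            tree = tree[i]
        return tree

    def subtree_crossover(t1, t2):
        # Swap a randomly chosen subtree of one parent with a randomly
        # chosen subtree of the other (the roots themselves are excluded).
        c1, c2 = copy.deepcopy(t1), copy.deepcopy(t2)
        p1 = random.choice(all_paths(c1)[1:] or [()])
        p2 = random.choice(all_paths(c2)[1:] or [()])
        if p1 and p2:
            parent1, parent2 = get(c1, p1[:-1]), get(c2, p2[:-1])
            parent1[p1[-1]], parent2[p2[-1]] = parent2[p2[-1]], parent1[p1[-1]]
        return c1, c2

    mol_a = ["C", ["C", ["O"]], ["N"]]   # a toy branched fragment
    mol_b = ["C", ["S", ["C"]]]
    print(subtree_crossover(mol_a, mol_b))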

In the 3D approaches, molecules have also been encoded in 2D as linear strings [11], in which case a 3D conformation of the molecule must be generated before the fitness function can be applied. The representation here is relatively simple and allows the standard genetic operators to be applied; however, generating the 3D conformations that are required to apply the fitness function is a non-trivial task. In the Chemical Genesis [12], Leapfrog [13] and Pro-Ligand [14] programs the GAs operate directly on the 3D molecules themselves. In these cases the chromosome is no longer a linear string and so the normal genetic operators of crossover and mutation have been modified. For example, crossover has been implemented as the exchange of molecular fragments between two molecules via equivalent bonds and mutation has been implemented by changing an atom from one element to another, e.g., the mutation of a carbon atom to a nitrogen atom.

The fitness functions vary according to the constraints on de novo design. In the Chemical Genesis program [12] the aim is to design molecules that fit into the active site of a receptor, and hence the fitness function involves finding the best orientation of the molecule in the active site and measuring the goodness of fit. In the approach of Globus et al. [10] the fitness function compares the 2D similarity of the molecule to that of a target molecule.


The effectiveness of a de novo design program is usually evaluated by applying it to a known system, for example by designing molecules that could potentially bind to a known receptor; several examples of suggested molecules that are similar to known ligands have been reported in the literature [7]. However, despite the initial interest in these approaches, a significant disadvantage of the programs is that they have a tendency to suggest molecules that are synthetically intractable. This has proved to be a difficult problem to solve, and with the advent of combinatorial chemistry and high throughput screening in the second half of the nineties, efforts turned to the design of libraries of molecules for synthesis and screening.

4 Combinatorial Synthesis

In traditional synthesis, one reactant is combined with another via some chemical reaction. For example, a dipeptide results from the joining together of two amino acids via a peptide bond, as shown in Figure 2. There are 20 naturally occurring amino acids, which vary in the nature of the side-chain substituents indicated by the variable R groups in Figure 2, and in combinatorial synthesis it is possible to generate all 400 (20×20) dipeptides in a single experiment.

Figure 2. A dipeptide is formed by joining two amino acids via a peptide bond. The R groups indicate the substitution positions; the twenty commonly occurring amino acids differ in the nature of their side chains (R groups).

Tripeptides can be generated by simultaneously reacting all 400 dipeptides with the 20 amino acids to give 8000 products, and so on, as the length of the peptide chain increases. The explosion in numbers is shown in Table 1.

Combinatorial synthesis was first developed for peptide chemistry but was quickly adapted to the synthesis of small organic molecules where the number of potential compounds increases much more rapidly. For example, an amide bond is formed by reacting an amine with a carboxylic acid. A search for commercially available amines and acids to be used in the synthesis of amides would result in many thousands of examples of each, resulting in millions or more potential product


molecules, for just this one reaction step. In fact, it has been estimated that there are in the region of 10^40 molecules that could be considered as potential drug candidates [4].

No. amino acid residues (n)    No. peptides NH2-Xn-COOH
1                              20
2                              400
3                              8,000
4                              160,000
8                              25,600,000,000

Table 1. The number of possible peptides increases rapidly as the number of amino acids in the sequence increases.

A general scheme for combinatorial synthesis is shown in Figure 3. Figure 3a shows a traditional reaction that involves two reactants of different types, for example an amine and a carboxylic acid, and the formation of an amide bond. Figure 3b shows a two-component combinatorial reaction as a two dimensional array. Here there are several examples of each type of reactant, for example several amines and several carboxylic acids. The rows of the array represent the reactants in one pool, the amines, and the columns represent the reactants in the second pool, the carboxylic acids. The elements of the array represent the product molecules which result from the combinatorial joining of all reactants of type A to all reactants of type B.

A + B → AB

Figure 3a. A traditional single-step synthesis involves reacting two reactants together.


Figure 3b. A two-component combinatorial reaction involving multiple reactants (n examples of reactant type A and m examples of reactant type B) can be represented as a two-dimensional array of n×m products.
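In code, the array of Figure 3b is simply the Cartesian product of the reactant pools; a short sketch with made-up reactant labels:

    from itertools import product

    # Hypothetical reactant pools for a two-component reaction,
    # e.g. amines (A) and carboxylic acids (B).
    amines = ["A1", "A2", "A3"]          # n = 3
    acids = ["B1", "B2", "B3", "B4"]     # m = 4

    # The virtual library: every reactant of type A joined
    # combinatorially with every reactant of type B.
    library = [f"{a}-{b}" for a, b in product(amines, acids)]
    print(len(library))   # n x m = 12 products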

Despite the increased throughput due to automation, the practical limits are such that only 10^5 or 10^6 compounds can be handled in a single experiment; thus it is clear that only a tiny fraction of the available space (~10^40) can be explored. The de novo design programs described earlier attempt to explore the entire space of theoretically possible molecules, based on rules of chemical bonding, that fit a given set of constraints. In combinatorial library design, however, the chemical space to be explored is usually limited to a small number of reaction steps with finite lists of reactants, or building blocks, that are available for each substitution position. While this represents a more restricted chemistry space than is explored in de novo design, it is still much too large to allow systematic exploration. Choosing which compounds to synthesise and screen from this vast chemistry space is therefore very important for effective drug design. It is also very computationally demanding, and combinatorial library design, like de novo design, lends itself to evolutionary optimisation and GAs in particular [15-17].

The set of compounds that could potentially be made in a combinatorial synthesis is often referred to as a virtual library and the computational techniques used to reduce a virtual library to a size that can actually be synthesised as real molecules are known as virtual screening [18, 19].


5 Combinatorial Library Design

There are two different criteria used for designing combinatorial libraries [4]. Diverse libraries are used in lead generation programs for screening against a variety of targets. Here the assumption is made that maximum structural diversity will result in maximum coverage of bioactivity space and thus increase the chances of finding actives across the different screens. Targeted, or focused, libraries, on the other hand, are usually biased towards a single therapeutic target such as HIV-protease, a structural class such as the kinases, or a series of known active compounds. Both strategies involve selecting a subset of compounds from some large virtual library using the concept of molecular similarity. In diverse library design the aim is to select compounds that are maximally dissimilar from one another, whereas in targeted or focused library design the aim is to select compounds that are similar to known actives or that are complementary to a known receptor.

Whether libraries are designed to be diverse or focused, the design itself can be done in reactant space, when optimised subsets of reactants are selected, or the design can be done directly in product space. The next section compares these two approaches from the viewpoint of computational complexity, where it will be seen that reactant-based selection is computationally more efficient than product-based design; however, product-based design can lead to better optimised libraries.

6 Reactant Versus Product Based Library Design

Consider the design of an amide library consisting of 100×100 products selected from an available 1000 amines and 1000 carboxylic acids. In reactant-based design, there are

\[
\prod_{i=1}^{R} \frac{n_i!}{k_i!\,(n_i - k_i)!}
\]

possible libraries where, in the amide example, there are two reactant pools, i.e., R is 2, and the number of reactants to be selected is 100 (k_i) from an available 1000 (n_i) for both reactant pools.
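For the amide example this count is easy to evaluate directly; in the short sketch below, the numbers come from the text and the code is purely illustrative.

    from math import comb

    # Reactant-based design: choose k = 100 reactants from n = 1000,
    # independently for each of the R = 2 pools (amines and acids).
    n, k = 1000, 100
    n_libraries = comb(n, k) ** 2
    print(len(str(n_libraries)))   # 280 digits: roughly 10**280 designs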

Product-based selection is more computationally demanding than reactant-based selection. Prior to subset selection it requires the computational enumeration of the


full virtual combinatorial library and calculation of the descriptors for all molecules in the virtual library, see later. There are then two subset selection strategies. Cherry picking refers to the selection of a subset of products without taking into account the combinatorial constraint required in combinatorial synthesis, where every reactant chosen at one substitution position must be used in all combinations with all reactants at all other substitution positions. Cherry picking is computationally straightforward since no such restriction is placed on the products selected and hence any of the methods that have been developed for reactant-based design can be used. Cherry picking in product space is, however, much more demanding than reactant-based selection: there are

\[
\frac{n_j!}{k_j!\,(n_j - k_j)!}
\]

possible libraries; however, n_j is now 10^6 (1000×1000) and k_j is 10^4.
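This binomial coefficient is far too large to enumerate, but its magnitude can be estimated with the log-gamma function; a sketch:

    from math import lgamma, log

    def log10_comb(n, k):
        # log10 of n-choose-k via the log-gamma function,
        # avoiding the enormous integers themselves.
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)

    # Cherry picking 10**4 products from the 10**6-member virtual library:
    print(log10_comb(10**6, 10**4))   # ~2.4e4, i.e. roughly 10**24000 subsets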

Figure 4. Cherry picking 4 diverse compounds from a two-component product space. Synthesising these 4 compounds (black) via a combinatorial synthesis would require synthesis of a 4×4 combinatorial library (grey). Thus cherry picking is synthetically inefficient.

The main disadvantage of cherry picking is that it is synthetically inefficient in terms of a combinatorial chemistry experiment which requires the combinatorial constraint to be satisfied. This is illustrated in Figure 4. Assume that the 4 most diverse compounds in this hypothetical library are those shown by the solid circles in Figure 4 and that these 4 compounds have been found by performing cherry


picking using diversity as the selection criterion. Synthesising these 4 compounds (A2B5, A3B3, A4B2 and A6B4) using combinatorial synthesis techniques would require synthesis of a 4×4 library, i.e. the 16 molecules produced by reacting A2, A3, A4 and A6 with B2, B3, B4 and B5, shown by the partially shaded circles.

Product-based selection can also be implemented by taking account of the combinatorial constraint. This is a synthetically efficient strategy since combinatorial subsets are selected directly. The process is equivalent to intersecting the rows and columns of the array, as shown in Figure 5 for a 2×2 subset. In this case, there are

\[
\prod_{i=1}^{R} \frac{n_i!}{k_i!\,(n_i - k_i)!}
\]

possible combinatorial subsets, where in the amide example R is 2, n_i is 1000 and k_i is 100 for each reactant pool. Experiments have shown that, despite the additional computational cost associated with product-based library design, it can be more effective than reactant-based design [20-22]. This is particularly true when library optimisation requires the calculation of whole-library properties, such as diversity, rather than the properties of individual molecules contained within a library, such as similarity to a target molecule.

Figure 5. Selecting a 2×2 combinatorial subset in product space.

The rest of the chapter focuses on the application of EAs to the design of combinatorial libraries.


7 Reactant-Based Combinatorial Library Design

Many methods have been developed for selecting subsets of compounds from existing collections, such as in-house databases, to be used for screening, and these methods can also be used to select diverse reactants for combinatorial library design experiments. As already mentioned, identification of diverse subsets of compounds requires ways of calculating how similar or dissimilar they are, and in library design generally the methods have to be sufficiently rapid to allow large collections of compounds to be handled.

The two most important components of any similarity measure are the structural descriptors that are used to characterise the molecules, and the similarity coefficient that is used to quantify the degree of similarity between pairs of molecules.

Many different types of descriptor have been suggested for calculating structural similarity and for diversity analysis [23]. The most commonly used descriptors are whole-molecule properties such as molecular weight and lipophilicity, descriptors derived from the 2D representation of molecules such as topological indices and fragment-based fingerprints, and descriptors that represent molecules in 3D such as pharmacophore keys. Whole-molecule properties and topological indices are usually represented as real-numbered vectors. Fragment-based 2D fingerprints and pharmacophore keys record the presence or absence of fragment substructures or pharmacophoric patterns, respectively, within a molecule in a bit-vector.

Once molecules have been characterised using some descriptors, the similarity between a pair of them is calculated by means of a similarity coefficient, which quantifies the degree of resemblance between two sets of such characterisations [24]. Similarity calculations based on substructural data have generally used association coefficients such as the Tanimoto coefficient, whereas similarity calculations using property data have generally used distance coefficients, typically Euclidean distance.
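Both kinds of coefficient take only a few lines of code; in the sketch below, the bit-lists and property tuples are stand-ins for real fingerprints and descriptors.

    def tanimoto(fp1, fp2):
        # Tanimoto coefficient on binary fingerprints:
        # bits set in both molecules / bits set in either.
        both = sum(a & b for a, b in zip(fp1, fp2))
        either = sum(a | b for a, b in zip(fp1, fp2))
        return both / either if either else 1.0

    def euclidean(p1, p2):
        # Euclidean distance on real-valued property vectors.
        return sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5

    print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))   # 2 common / 4 total = 0.5
    print(euclidean((300.0, 2.1), (320.0, 3.4)))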

The similarities or dissimilarities between molecules provide the input to various methods that are available for selecting a structurally diverse set of compounds. As has already been seen it is computationally infeasible to compare all possible subsets of a given size and hence more computationally efficient approximations have been developed. The four main techniques for subset selection include dissimilarity-based compound selection, clustering, partitioning and a variety of optimisation techniques including genetic algorithms, simulated annealing and experimental design techniques. These techniques are described in brief here.


Dissimilarity-based compound selection methods attempt to identify a diverse subset in an iterative process. One compound is selected to seed the subset and subsequent compounds are selected as those that are most dissimilar to those already selected. A number of different variations on this basic approach have been developed [25].
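One widely used variant of this greedy scheme is often called MaxMin; the sketch below is a generic illustration of the idea just described, not any specific published implementation, and the one-dimensional "descriptors" are placeholders.

    def maxmin_select(mols, k, dist):
        # Seed with an arbitrary compound, then repeatedly add the
        # compound whose nearest already-selected neighbour is farthest.
        subset = [mols[0]]
        rest = list(mols[1:])
        while len(subset) < k and rest:
            best = max(rest, key=lambda m: min(dist(m, s) for s in subset))
            subset.append(best)
            rest.remove(best)
        return subset

    # Toy usage: 1-D descriptors with absolute difference as the distance.
    print(maxmin_select([0.0, 0.1, 0.5, 0.9, 1.0], 3, lambda a, b: abs(a - b)))
    # -> [0.0, 1.0, 0.5]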

Clustering involves dividing a set of molecules into groups, or clusters, so that the compounds within a cluster are similar whereas compounds from different clusters are dissimilar. A diverse subset of compounds can then be obtained by choosing one compound from each cluster [26].

Partition- or cell-based selection requires the definition of a low dimensional chemistry space based on a small number of descriptors, for example, physicochemical properties such as molecular weight and lipophilicity. The range of values for each descriptor is divided into a set of bins and the combinatorial product of all possible bins then defines a set of cells. Each molecule is assigned to the cell that matches the descriptors for that molecule. A diverse subset of molecules is then obtained by selecting one molecule from each of the resulting cells [27, 28].
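A cell-based scheme amounts to binning each descriptor and keying molecules by their tuple of bin indices; a minimal sketch with invented molecules and descriptor ranges:

    from collections import defaultdict

    def assign_cells(mols, n_bins, ranges):
        # mols: {name: (descriptor values...)}; ranges: (lo, hi) per descriptor.
        # Each axis is cut into n_bins equal bins; a cell is a tuple of bin indices.
        cells = defaultdict(list)
        for name, props in mols.items():
            key = tuple(min(int((p - lo) / (hi - lo) * n_bins), n_bins - 1)
                        for p, (lo, hi) in zip(props, ranges))
            cells[key].append(name)
        return cells

    # Invented molecules described by (molecular weight, lipophilicity).
    mols = {"m1": (120.0, 1.2), "m2": (480.0, 4.8), "m3": (130.0, 1.4)}
    cells = assign_cells(mols, n_bins=4, ranges=[(0.0, 500.0), (0.0, 5.0)])
    # A diverse subset: one molecule from each occupied cell.
    print([members[0] for members in cells.values()])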

Optimisation-based approaches involve the definition of a diversity index, which is a quantitative measure of diversity, and then adopting an optimisation technique to find a subset that maximises the index. Martin et al. [29] developed an approach to reactant-based selection based on the experimental design procedure known as D-optimal design. Simulated annealing and GAs have been applied to product-based library design as described in the next section.

8 Product-Based Combinatorial Library Design

Product-based library design programs have recently been classified as "molecule-based" and "library-based" [30]. Product-based designs require the use of optimisation techniques, and typically these are either simulated annealing [31-33] or GAs [33-36]. The focus here is on GA-based methods. In molecule-based library design, each chromosome represents a single molecule and the fitness functions involve analysing a single molecule, for example, for its similarity to a known target molecule. Molecule-based methods are based on the cherry picking approach described earlier; however, they attempt to overcome the synthetic inefficiency of cherry picking by analysing the products in the final population to


identify reactants that occur frequently within them and then using the frequently occurring reactants to define a combinatorial library.

In library-based methods each chromosome represents a combinatorial library directly. These methods are much more computationally demanding, since the fitness function requires the analysis of an entire combinatorial library of molecules. Despite the increased computational cost, these approaches are required when the design criteria are library-based, such as diversity or the optimisation of physicochemical property profiles.
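The contrast between the two encodings can be shown in a few lines. In the sketch below, which is illustrative rather than drawn from any published program, a library-based chromosome carries one gene block per reactant pool, and the fitness function must enumerate the whole implied sub-library; the per-product scoring lambda is a placeholder for a real whole-library measure such as diversity.

    import random
    from itertools import product

    def random_library_chromosome(pool_sizes, subset_sizes):
        # One gene block per pool: the indices of the reactants selected
        # for that substitution position.
        return [random.sample(range(n), k)
                for n, k in zip(pool_sizes, subset_sizes)]

    def library_fitness(chrom, score_product):
        # Library-based fitness: enumerate the full combinatorial subset
        # implied by the chromosome and score it as a whole.
        products = list(product(*chrom))
        return sum(score_product(p) for p in products) / len(products)

    chrom = random_library_chromosome(pool_sizes=[1000, 1000],
                                      subset_sizes=[10, 10])
    # Placeholder per-product score standing in for diversity/properties.
    print(library_fitness(chrom, score_product=lambda p: sum(p) % 7))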

8.1 Molecule-Based Designs

Sheridan and Kearsley [30, 34] were among the first to publish a GA approach to designing focused combinatorial libraries. Their method is molecule-based, where each chromosome encodes a molecule as a linear string of integers that represents particular instances of reactants from which the molecule is constructed and which are extracted from pools of available reactants, as shown in Figure 6.

Figure 6. A molecule-based chromosome representation is illustrated for a molecule built from two reactants, A2 and B1.
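A molecule-based chromosome of the kind shown in Figure 6 is just a short integer string indexing the reactant pools; a toy sketch with hypothetical pools:

    import random

    # Hypothetical reactant pools; a chromosome holds one integer per
    # substitution position, indexing into the corresponding pool.
    pools = [["A1", "A2", "A3"], ["B1", "B2"]]

    def decode(chrom):
        # Map the integer string back to the reactants it encodes.
        return [pool[i] for pool, i in zip(pools, chrom)]

    def point_mutate(chrom):
        # Mutation: swap one reactant for another from the same pool.
        pos = random.randrange(len(chrom))
        child = list(chrom)
        child[pos] = random.randrange(len(pools[pos]))
        return child

    chrom = [1, 0]            # the molecule built from A2 and B1 (Figure 6)
    print(decode(chrom))      # ['A2', 'B1']
    print(decode(point_mutate(chrom)))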

Molecules can be optimised via a variety of fitness functions, such as similarity to a target molecule using atom-pair descriptors, or fit to a receptor site, which involves generating 3D conformations and docking them within the receptor site.

Once the GA has terminated, the entire population is analysed to identify reactants that occur frequently across all the molecules in the population. The frequently occurring reactants can then be used to design a combinatorial library experiment. They tested the algorithm on the construction of tripeptoid libraries, where there are 3 positions of variability, with 2507 amines available for two of the substitution positions and 3312 for the third position. This represents a virtual library of ~20 billion possible tripeptoids. The GA was able to find molecules that were very similar to given target molecules after exploring a very small fraction of the total search space.

The molecule-based method is a relatively fast procedure, especially when optimisation is based on 2D properties, since the fitness function involves a pairwise molecular comparison rather than analysis of an entire library. Although there is no guarantee that building libraries from frequently occurring reactants will result in optimised libraries, they showed that for targeted libraries molecule-based approaches can be just as effective as library-based approaches. They also showed that basing the optimisation on product molecules is more effective than optimising reactants.

A similar approach has also been developed in the program Focus-2D [37, 38], where molecules are described using MolconnX topological descriptors and are evolved either to be similar to a known target compound or to maximise predicted activity based on a precomputed QSAR. Both a GA and simulated annealing have been implemented as optimisation techniques.

8.2 Experimentally Determined Fitness Functions

Weber et al. [39] developed a strategy for the selection and synthesis of active compounds that is based on a GA. Interestingly, they used experimental activity data to guide the GA, so that the fitness function required the actual synthesis and biological testing of compounds. The approach was developed for the Ugi reaction, which is a four-component reaction. The virtual library consisted of 160,000 possible products that could be made from 10 isonitriles, 40 aldehydes, 10 amines and 40 carboxylic acids.

The approach is molecule-based, with individual reactants encoded by an arbitrary bit pattern and each chromosome representing a molecule by the concatenation of four bit patterns, one for each substitution position. The GA was initialised with a population of 20 randomly assigned chromosomes. Each chromosome was then scored by synthesising and testing the compound it represented. The best product molecule in the initial population exhibited an IC50 of 300 µM. A thrombin inhibitor with submicromolar IC50 was found after just 20 generations of the GA, i.e. after synthesising and testing just 400 molecules.

In a follow-up study [40] a full combinatorial library of 15 360 products was synthesised from a three-component Ugi reaction scheme using 12x8x60 substituents and the products tested for activity against the serine protease thrombin. The resulting structure-activity data was then used to investigate the behaviour of various GAs, including encoding schemes, mutation versus crossover rates and population size. Similar approaches with experimentally determined fitness functions based on peptide libraries were published around the same time by Singh et al. [41] and Yokobayashi et al. [42].

Gobbi et al. [43] developed a molecule-based method that uses a different chromosome representation. In their GA, each chromosome represents a molecule as a binary fingerprint. As already described, a fingerprint is a binary string in which each bit represents a particular substructure fragment. For a given molecule a bit is set to "1" if the fragment it represents is contained in the molecule, otherwise it is set to "0".

The advantage of this approach over that used by Singh et al. and Weber et al. is that the children produced by applying crossover do not have to contain the same reactants as their parents, nor are they limited to the same reaction. A disadvantage is that crossover and mutation can generate molecules that do not exist in the collection, as well as chromosomes that are chemical nonsense, since the fragments represented by the bits in the bitstring are not independent.

When the method is applied to select a subset of compounds for screening, the fitness function involves finding the molecule in the dataset that has a fingerprint most similar to the chromosome and then testing that molecule for activity. In combinatorial library design, this strategy is not possible directly since it would involve enumerating the entire virtual library and calculating descriptors for all the compounds, which is not feasible for large libraries (>1 million compounds). Instead, the fitness function uses a TABU search method that samples the virtual library to find a compound similar to the chromosome. This sampling procedure can take of the order of 1 CPU hour for a virtual library of 10^10 compounds. Thus for a population of 20 compounds one optimisation cycle can take around 1 day. This is sufficiently fast since the actual synthesis and testing cycle takes considerably longer.

They have reported the use of their GA in simulated screening experiments involving collections of molecules for which the activities are already known; they were able to find all the active compounds by screening approximately 10% of the datasets, representing a 100-fold improvement on random selection.

8.3 Evolutionary Strategy

The TOPAS program, reported recently, represents an interesting development both in terms of the problem being tackled and in the algorithm employed [44-46]. TOPAS is a program for de novo design that explores a larger search space than is typically covered by the more restricted combinatorial library design programs. It also overcomes some of the limitations of earlier approaches to de novo design, which have a tendency to suggest compounds that are synthetically intractable.

TOPAS uses building blocks that have been derived by applying retrosynthesis, based on a restricted number of well known reaction steps, to fragment a database of druglike molecules (the World Drug Index, WDI [47]). New molecules are evolved starting from the druglike fragments using the same well known reactions, in an attempt to build synthetically accessible molecules. The molecules are evolved to be similar to a known active molecule, where similarity is measured via a topological pharmacophore, which describes the 2D arrangement of pharmacophoric atoms.

An interesting feature of the approach is that it uses an adaptive (1, λ) evolutionary strategy. In this strategy, a set of λ variant structures is generated from a randomly selected parent structure, where the variants satisfy a bell-shaped distribution centred on the chemical space coordinates of the parent structure. The variants closest to the parent will be very similar to it, with similarity decreasing with increasing distance from the parent. The width of the distribution is determined by the variance, or standard deviation σ. Large values of σ correspond to large jumps in the search space, with small values being used to facilitate local hill climbing. The value of σ adapts automatically as the search progresses.
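The following Python sketch illustrates the general shape of an adaptive (1, λ) strategy of this kind. It treats individuals as plain real vectors standing in for chemical space coordinates and uses a standard log-normal rule to adapt σ; the TOPAS papers should be consulted for the actual adaptation scheme, so everything here is an illustrative assumption.

```python
import math
import random

def one_comma_lambda(parent, sigma, lam, score, n_gen=50):
    """Adaptive (1, lambda) evolution strategy sketch: each generation, lam
    variants are drawn from a bell-shaped distribution centred on the parent,
    and the best variant replaces the parent (comma selection)."""
    tau = 1.0 / math.sqrt(len(parent))  # common self-adaptation constant
    for _ in range(n_gen):
        offspring = []
        for _ in range(lam):
            s = sigma * math.exp(tau * random.gauss(0.0, 1.0))  # adapt step size
            child = [x + random.gauss(0.0, s) for x in parent]  # Gaussian variant
            offspring.append((score(child), s, child))
        best_score, sigma, parent = max(offspring, key=lambda t: t[0])
    return parent, sigma
```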

9 Library-Based Designs

Brown and Martin [35] describe a library-based GA for combinatorial library design in a program called GALOPED. Their method was developed for the design of diverse combinatorial libraries synthesised as mixtures. The mixtures approach to combinatorial synthesis uses a technique known as split-and-mix, in which several compounds are synthesised and screened in the same vessel. (Parallel synthesis, on the other hand, involves the synthesis of compounds as discretes, with one compound per vessel.) The synthesis of mixtures allows much higher throughputs to be achieved than parallel synthesis; however, if activity is seen in a vessel the mixture must be deconvoluted, with the individual compounds contained within it synthesised and tested to identify the particular compound responsible for the activity. Deconvolution can be achieved using mass spectrometry techniques, where the amount of resynthesis and testing is minimised by reducing the redundancy in molecular weights. GALOPED attempts to optimise mixtures based on their diversity and ease of deconvolution simultaneously.

Each chromosome encodes a combinatorial subset as a binary string. The chromosome is partitioned with one partition for each component, or substitution position, in the library. The number of bits in the chromosome is equal to the sum of reactants available in each reactant pool so that each bit represents a different reactant, as shown in Figure 7.


Figure 7. The library-based chromosome representation used in GALOPED. A two-component library of configuration 3x2 is shown (bit string 1 1 0 0 1 | 1 0 0 1 0), constructed by combining A1, A2 and A5 with B1 and B4 combinatorially.

Thus a virtual library of 1000x1000 potential products will require chromosomes with 2000 bits. A bit value of "1" indicates that a reactant is included in the combinatorial subset and a value of "0" indicates that the reactant has not been selected. The size of the subset selected can vary according to the number of bits set to "1", so minimum and maximum thresholds are set by the user and libraries outside the desired size range are penalised in the fitness function.
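A small Python sketch of this bit-string decoding, with hypothetical pool sizes, may make the representation clearer; the size thresholds below are illustrative values, not ones used by GALOPED.

```python
# Hypothetical pool sizes for a two-component library (cf. Figure 7).
POOL_SIZES = [5, 5]

def decode_bits(bits):
    """Split the bit string into one partition per reactant pool and return
    the indices of the selected reactants in each pool."""
    subsets, start = [], 0
    for size in POOL_SIZES:
        part = bits[start:start + size]
        subsets.append([i for i, b in enumerate(part) if b == 1])
        start += size
    return subsets

def size_penalty(subsets, lo=4, hi=8):
    """Penalise combinatorial subsets whose product count lies outside
    the user-defined range [lo, hi]."""
    n_products = 1
    for s in subsets:
        n_products *= len(s)
    return 0.0 if lo <= n_products <= hi else 1.0

# decode_bits([1, 1, 0, 0, 1, 1, 0, 0, 1, 0]) -> [[0, 1, 4], [0, 3]]
```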

The fitness function involves maximising the diversity of the library while minimising the molecular weight redundancy. Optimising diversity requires a diversity index that can be maximised. In GALOPED, diversity is measured by first enumerating the library represented by a chromosome and then clustering it based on 2D descriptors and counting the number of different clusters occupied by the library. Clustering is a computationally expensive process and the size of combinatorial libraries that can be handled by this method is limited.

At about the same time, Gillet et al. developed a library-based GA called SELECT [20, 21, 36]. SELECT was developed for designing diverse libraries using parallel synthesis, where the size and configuration (the number of reactants selected from each pool) of a library are predetermined by the experimental equipment. Thus, the chromosome representation differs from that used in GALOPED.

In SELECT, as in GALOPED, the chromosome is partitioned so that there is one partition for each component of the library; however, in SELECT the size of a partition is determined by the number of reactants to be selected from each reactant pool (rather than the number of available reactants), see Figure 8. The chromosome is an integer string with each integer corresponding to a reactant that has been selected. The crossover and mutation operators have been modified to ensure that there are no duplicate integers in a partition.
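One way to realise such a duplicate-free operator is sketched below in Python; this is an illustrative implementation, not the one used in SELECT.

```python
import random

def mutate_partition(partition, pool_size):
    """Replace one selected reactant index with one not already present in
    the partition, so that no duplicates can arise."""
    unused = [r for r in range(pool_size) if r not in partition]
    if unused:
        i = random.randrange(len(partition))
        partition[i] = random.choice(unused)
    return partition

# mutate_partition([0, 4, 7], pool_size=100) might return [0, 4, 63]
```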


Figure 8. The library-based representation used in SELECT, showing the same combinatorial subset as in Figure 7.

The study began by considering the relative effectiveness of product-based library design versus reactant-based design [20, 21]. Diversity was measured using distance-based diversity indices, including the sum of pairwise dissimilarities calculated using the cosine coefficient, which is a very fast diversity index and allows large combinatorial libraries to be processed efficiently. The experiments were performed for several different libraries using several different molecular descriptors and several different distance-based diversity indices, and showed that product-based designs are more effective at generating diverse libraries than are reactant-based designs. Similar results have also been reported by Jamois et al. [22].

Lewis et al. [33] have developed both simulated annealing and GA approaches to product-based library design in a method called Rpick. The GA version is library-based and was used to design a subset of a benzodiazepine library, of configuration 4x4x3x2 products, from a virtual library of 11x7x5x4 molecules. The GA was designed to maximise the coverage of pharmacophores in the library compared with the full coverage of the virtual library. Generation of the pharmacophore descriptors is computationally expensive since it involves a full conformational analysis of the virtual library; hence the size of libraries that can be handled is restricted.

10 Designing Libraries on Multiple Properties

Despite the initial enthusiasm for combinatorial chemistry and high throughput screening, early results were disappointing, with libraries either producing fewer hits than expected [4] or producing hits with physicochemical properties that make them undesirable as drug candidates. It is now recognised that libraries designed to be diverse should also be constrained to contain molecules that have druglike properties. Methods that attempt to achieve this are now beginning to be reported in the literature [32, 36, 48].


Multiple properties are handled in the SELECT program described in the previous section [36] via a weighted-sum fitness function as shown below.

f(n) = w1·diversity + w2·cost + w3·property1 + w4·property2 + ...

Typically, SELECT would be configured to design libraries that simultaneously have maximum diversity, minimum cost and druglike physicochemical properties. The physicochemical property profiles are optimised by minimising the difference between the distribution of a property in the library and some reference distribution, for example, the distribution of the property in a collection of known drugs. Each of the properties is standardised and then relative weights are defined by the user at run time.
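A minimal Python sketch of such a weighted-sum fitness is given below. The dictionary keys and the sign conventions (diversity maximised; cost and profile differences minimised) are illustrative assumptions about how the standardised objectives might be combined.

```python
def weighted_sum_fitness(library, weights):
    """Combine standardised objective scores into a single fitness value.
    'library' is assumed to be a dict of precomputed, standardised scores."""
    contributions = {
        "diversity": library["diversity"],         # maximised
        "cost": -library["cost"],                  # minimised
        "property1": -library["property1_delta"],  # distance to drug profile
        "property2": -library["property2_delta"],
    }
    return sum(weights[name] * value for name, value in contributions.items())
```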

Figure 9. The molecular weight profiles of libraries designed on diversity alone (LIB1) and on diversity simultaneously with molecular weight profile (LIB2) are shown superimposed on the molecular weight profile found in WDI.

The advantage of optimising multiple properties simultaneously via a weighted-sum fitness function is clearly demonstrated in Figure 9, which shows the molecular weight profiles for 30x30 amide subsets selected from a 100x100 virtual amide library. The molecular weight profile found in the World Drug Index is shown in black; the profile of a library designed on diversity alone is shown in white; and the profile of a library designed to be both diverse and have a druglike distribution of molecular weights is shown in grey. It can be seen that libraries designed on diversity alone tend to contain molecules that have higher molecular weights than typical drug molecules, and that a better profile can be achieved by optimising both properties simultaneously.

Several other library design programs use a weighted-sum fitness function for the simultaneous optimisation of multiple properties [31, 32, 49, 50]; however, there are some limitations associated with this approach. For example:

• the setting of appropriate weights is often non-intuitive - in the SELECT program it is often done by trial and error [51];

• when the objectives to be optimised are non-commensurate, for example, diversity and cost, it is not obvious how they should be combined;

• and when there are more than two objectives it is difficult to monitor the progress of the search.

Some of these limitations are illustrated in Figure 10 which shows the results of a number of runs of SELECT for the previous amide library design with the fitness function:

f(n) = w1·diversity + w2·ΔMW

Diversity is measured as the normalised sum of pairwise dissimilarities using the cosine coefficient and is plotted on the y axis, with the normal direction of the axis reversed so that solutions nearer to the origin are more favourable (i.e. have higher diversity). The difference between the molecular weight profile of the library and the profile found in the WDI is plotted on the x axis, with the direction of improvement towards the origin. Three series of runs were performed: with equal weights (black triangles); with w1=2.0 and w2=0.5 (grey triangles); and with w1=10 and w2=1.0 (white triangles). The runs show that as the relative weight given to diversity increases there is a tendency for SELECT to find more diverse libraries, but that this is achieved at the expense of the molecular weight profile. So it can be seen that the two objectives are in competition and that in fact a family of solutions exists. A single run of SELECT will find one solution whose position in the objective space depends on the relative weights assigned to the properties being optimised.


Figure 10. Results are shown for a number of SELECT runs using three different relative weightings of the two objectives, diversity and molecular weight profile (w1=1.0, w2=1.0: black triangles; w1=2.0, w2=0.5: grey triangles; w1=10, w2=1.0: white triangles).

Many multiobjective problems, including library design, are characterised by the existence of a family of solutions, all of which can be seen as equivalent in the absence of further information. Evolutionary algorithms such as GAs are well suited to multiobjective optimisation since they operate on a population of individuals and hence can easily be adapted to search for multiple solutions in parallel. Fonseca and Fleming [52] have developed an approach to multiobjective optimisation known as MOGA (MultiObjective Genetic Algorithm). The method treats each objective independently, without summation and without the need to choose relative weights. In MOGA, a set of non-dominated solutions is sought rather than a single solution. A non-dominated solution is one for which an improvement in one objective would result in the deterioration of one or more of the other objectives, when compared with the other solutions in the population. Thus, one solution dominates another if it is equivalent or better in all the objectives and strictly better in at least one objective.

The MOGA approach has been adopted in a new development of SELECT called MoSELECT [53, 54]. In MoSELECT, each objective is handled independently without the need to assign relative weights. Most of the components of the algorithm are the same as for SELECT; however, instead of using a weighted-sum fitness function, the fitness of a chromosome is calculated from the number of solutions by which it is dominated, in a procedure known as Pareto ranking. All non-dominated individuals are given fitness 1, individuals dominated by one other individual in the population are given fitness 2, and so on. The probability of choosing an individual for reproduction is inversely proportional to its fitness; thus all non-dominated solutions have equal probability of being selected, and an individual with fitness 1 is more likely to be selected than one with fitness 2. In this way a family of equivalent solutions is progressed along what is known as the Pareto frontier.
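The Pareto-ranking step is straightforward to express in code; a compact Python sketch follows, assuming all objectives are to be maximised and that each individual is represented by its tuple of objective values.

```python
def dominates(a, b):
    """a dominates b if it is at least as good in every objective and
    strictly better in at least one (all objectives maximised here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_rank(population):
    """MOGA-style fitness: 1 plus the number of individuals dominating each
    solution, so every non-dominated solution receives fitness 1."""
    return [1 + sum(dominates(other, p) for other in population if other is not p)
            for p in population]

# pareto_rank([(3, 1), (1, 3), (2, 2), (1, 1)]) -> [1, 1, 1, 4]
```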

Figure 11 shows the progress of a MoSELECT run for the same amide library design problem in Figure 10 where the library is optimised on two objectives, namely diversity and molecular weight. The non-dominated solutions are shown as black circles and the dominated solutions are shown as white triangles. It can be seen that the entire Pareto frontier of non-dominated solutions moves in the direction of improvement for both objectives simultaneously as the search progresses. The percentage of non-dominated solutions in the population also increases.

The search was terminated after 5000 iterations and the final population is shown enlarged in Figure 12. The entire family of solutions was found in a single run, which takes approximately the same amount of time to complete as a single run of SELECT, which generates only one solution. Some of the solutions found for the individual SELECT runs reported in Figure 10 are superimposed on the MoSELECT solutions. Once a family of solutions has been found, the user can browse through them and choose one that is acceptable based on the objectives used in the search, while also taking into account other criteria, for example the availability of reactants.


Figure 11. The progress of a run of MoSELECT is reported after initialisation of the chromosomes and after 100, 500 and 5000 iterations. Diversity is plotted against ΔMW in each panel, with non-dominated solutions shown as black circles and dominated solutions as white triangles.

MoSELECT is readily applied to the optimisation of libraries with increased numbers of objectives. Figure 13 shows the results of designing 15x30 subsets from a virtual library of 12 850 2-aminothiazoles (74 α-bromoketones coupled with 170 thioureas) optimised on six objectives simultaneously. The objectives are molecular weight profile (MW); hydrogen bond donor profile (HBD); hydrogen bond acceptor profile (HBA); rotatable bond profile (RB); diversity, which is measured using a cell-based diversity index; and cost.

The Pareto frontier is illustrated by parallel graphs in which each objective is plotted along the x-axis and a single solution is shown as a line. The objectives have been scaled to allow them to be plotted on the same graph; however, this scaling was not used internally during the MoSELECT run, where each objective was treated independently. The competing nature of some of the objectives is clearly seen in the crossing lines, with a good value of cell-based diversity corresponding to a high-cost library and vice versa.


Figure 12. The final population is shown on an expanded scale, where it can be seen that a family of non-dominated solutions is evenly spread throughout the descriptor space; all of these solutions are equivalent.

Figure 13. The non-dominated solutions in the final population are shown after 5000 iterations of MoSELECT configured to design 15x30 2-aminothiazole subsets optimised on six properties simultaneously.

MoSELECT represents a significant improvement over the original SELECT program for the design of combinatorial subsets that are optimised on multiple objectives simultaneously. Many of the limitations of the weighted-sum approach have been overcome, for example: non-commensurate objectives are easily handled; there is no need to assign weights to the various objectives; and the progress of the search can be viewed using parallel graphs that allow multiple objectives to be monitored simultaneously.

MoSELECT results in a family of solutions, all of which are equivalent in the absence of further information. The user can then make an informed choice about which solution(s) to explore, rather than proceeding with the single solution generated by SELECT, which may lie anywhere on the Pareto frontier. MoSELECT also allows the relationships between the different objectives to be explored, with competing objectives easily identified. There are no significant overheads in terms of computing time for adopting Pareto ranking: a single run of MoSELECT takes approximately the same time as a run of SELECT, but with the advantage of finding a whole family of solutions.

11 Conclusion

Combinatorial library design is a computationally demanding task due to the enormous numbers of potential druglike molecules that are theoretically possible. This type of problem has proved to be well suited to the application of EAs. To date most applications of EAs in this area have involved GAs, although recently there have been some interesting developments, notably the application of an evolutionary strategy in the TOPAS program, the adoption of the MOGA approach to multiobjective optimisation in MoSELECT, and the combination of a GA with a fitness function based on a TABU search method in the program developed by Gobbi et al. There is still much progress to be made in library design, and future developments are likely to include continued efforts in multiobjective design; the integration of combinatorial library design and structure-based drug design techniques; the design of multiple combinatorial subsets; and the extension of these approaches so that multiple reactions can be handled simultaneously. The adoption of the MOGA approach in MoSELECT was the result of a collaboration between researchers in two different disciplines, automatic control and systems engineering and computer-aided drug design, in which a method developed in engineering was found to be extremely well suited to combinatorial library design. Efforts in library design and other areas of computer-aided drug design can benefit enormously from such sharing of ideas across different fields.


References

1. Fassina G. and Miertus S. (Eds) Combinatorial Chemistry and Technology: Principles, Methods and Applications, Marcel Dekker Inc., New York, 1999.
2. Martin Y.C. and Willett P. (Eds) Designing Bioactive Molecules, American Chemical Society, Washington DC, 1998.
3. Downs G.M. and Willett P. Similarity Searching in Databases of Chemical Structures, in Lipkowitz K.B. and Boyd D.B. (Eds) Reviews in Computational Chemistry, Wiley-VCH, New York, 1995, Volume 7, pp 1-66.
4. Valler M.J. and Green D. Diversity Screening Versus Focussed Screening in Drug Discovery, Drug Discovery Today, 2000, 5, 286-293.
5. Clark D.E. (Ed) Evolutionary Algorithms in Molecular Design, Wiley-VCH, Weinheim, 2000.
6. Parrill A.L. Introduction to Evolutionary Algorithms, in Clark D.E. (Ed) Evolutionary Algorithms in Molecular Design, Wiley-VCH, Weinheim, 2000, pp 1-13.
7. Gillet V.J. and Johnson A.P. Structure Generation for De Novo Design, in Martin Y.C. and Willett P. (Eds) Designing Bioactive Molecules, American Chemical Society, Washington DC, 1998, pp 149-174.
8. Venkatasubramanian V., Chen K. and Caruthers J. Evolutionary Design of Molecules with Desired Properties Using the Genetic Algorithm, J. Chem. Inf. Comput. Sci., 1995, 35, 188-195.
9. Nachbar R.B. Molecular Evolution: a Hierarchical Representation for Chemical Topology and its Automated Manipulation, in Proceedings of the Third Annual Genetic Programming Conference, University of Wisconsin, Madison, Wisconsin, 22-25 July, 1998, pp 246-253.
10. Globus A., Lawton J. and Wipke T. Automatic Molecular Design Using Evolutionary Techniques, Nanotechnology, 1999, 10, 290-299.
11. Blaney J.M., Dixon J.S. and Weininger D.J. Evolution of Molecules to Fit a Binding Site of Known Structure. Paper presented at the Molecular Graphics Society Meeting on Binding Sites: Characterising and Satisfying Steric and Chemical Restraints, York, UK, March 1993.
12. Glen R.C. and Payne A.W.R. A Genetic Algorithm for the Automated Generation of Molecules Within Constraints, J. Comput.-Aided Mol. Des., 1995, 9, 181-202.
13. LeapFrog is available from TRIPOS Inc., 1699 South Hanley Road, Suite 303, St. Louis, MO 63144.
14. Westhead D.R., Clark D.E., Frenkel D., Li J., Murray C.W., Robson B. and Waszkowycz B. PRO_LIGAND: An Approach to De Novo Molecular Design. 3. A Genetic Algorithm for Structure Refinement, J. Comput.-Aided Mol. Des., 1995, 9, 139-145.
15. Brown R.D. and Clark D.E. Genetic Diversity: Applications of Evolutionary Algorithms to Combinatorial Library Design, Exp. Opin. Ther. Patents, 1998, 8, 1447-1460.
16. Weber L. Evolutionary Computational Chemistry: Application of Genetic Algorithms, Drug Discovery Today, 1998, 3, 379-385.


17. Weber L. Molecular Diversity Analysis and Combinatorial Library Design, in Clark D.E. (Ed) Evolutionary Algorithms in Molecular Design, Wiley-VCH, Weinheim, 2000, pp 137-157.
18. Walters W.P., Stahl M.T. and Murcko M.A. Virtual Screening - An Overview, Drug Discovery Today, 1998, 3, 160-178.
19. Bohm H.-J. and Schneider G. (Eds) Virtual Screening for Bioactive Molecules, Wiley-VCH, Weinheim, 2000.
20. Gillet V.J., Willett P. and Bradshaw J. The Effectiveness of Reactant Pools for Generating Structurally Diverse Combinatorial Libraries, J. Chem. Inf. Comput. Sci., 1997, 37, 731-740.
21. Gillet V.J. and Nicolotti O. New Algorithms for Compound Selection and Library Design, Perspect. Drug Discov. Design, 2000, 20, 265.
22. Jamois E.A., Hassan M. and Waldman M. Evaluation of Reagent-Based and Product-Based Strategies in the Design of Combinatorial Library Subsets, J. Chem. Inf. Comput. Sci., 2000, 40, 63.
23. Brown R.D. Descriptors for Diversity Analysis, Perspect. Drug Discov. Design, 1997, 7/8, 31-49.
24. Barnard J.M., Downs G.M. and Willett P. Chemical Similarity Searching, J. Chem. Inf. Comput. Sci., 1998, 38, 983-996.
25. Lajiness M.S. Dissimilarity-Based Compound Selection Techniques, Perspect. Drug Discov. Design, 1997, 7/8, 65-84.
26. Dunbar Jr. J.B. Cluster-Based Selection, Perspect. Drug Discov. Design, 1997, 7/8, 51-63.
27. Mason J.S. and Pickett S.D. Partition-Based Selection, Perspect. Drug Discov. Design, 1997, 7/8, 85-114.
28. Lewis R.A., Mason J.S. and McLay I.M. Similarity Measures for Rational Set Selection and Analysis of Combinatorial Libraries: The Diverse Property-Derived (DPD) Approach, J. Chem. Inf. Comput. Sci., 1997, 37, 599-614.
29. Martin E.J., Blaney J.M., Siani M.S., Spellmeyer D.C., Wong A.K. and Moos W.H. Measuring Diversity - Experimental Design of Combinatorial Libraries for Drug Discovery, J. Med. Chem., 1995, 38, 1431-1436.
30. Sheridan R.P., SanFeliciano S.G. and Kearsley S.K. Designing Targeted Libraries with Genetic Algorithms, J. Mol. Graph. Model., 2000, 18, 320-334.
31. Agrafiotis D.K., Lobanov V.S. and Rassokhin D.N. The Measurement of Molecular Diversity, in Bohm H.-J. and Schneider G. (Eds) Virtual Screening for Bioactive Molecules, Wiley-VCH, Weinheim, 2000, pp 265-300.
32. Zheng W., Hung S.T., Saunders J.T. and Seibel G.L. PICCOLO: A Tool for Combinatorial Library Design Via Multicriterion Optimization, in Altman R.B., Dunker A.K., Hunter L., Lauderdale K. and Klein T.E. (Eds) Pacific Symposium on Biocomputing 2000, World Scientific.
33. Lewis R.A. and Good A.C. Quantification of Molecular Similarity and Its Application to Combinatorial Chemistry, in van de Waterbeemd H., Testa B. and Folkers G. (Eds) Computer-Assisted Lead Finding and Optimization, Wiley-VCH, Weinheim, 1997, pp 137-156.
34. Sheridan R.P. and Kearsley S.K. Using a Genetic Algorithm to Suggest Combinatorial Libraries, J. Chem. Inf. Comput. Sci., 1995, 35, 310-320.


35. Brown R.D. and Martin Y.C. Designing Combinatorial Library Mixtures Using a Genetic Algorithm, J. Med. Chem., 1997, 40, 2304-2313.
36. Gillet V.J., Willett P. and Bradshaw J. Selecting Combinatorial Libraries to Optimise Diversity and Physical Properties, J. Chem. Inf. Comput. Sci., 1999, 39, 167-177.
37. Zheng W., Cho S.J. and Tropsha A. Rational Combinatorial Library Design. 1. Focus-2D: A New Approach to the Design of Targeted Combinatorial Chemical Libraries, J. Chem. Inf. Comput. Sci., 1998, 38, 251-258.
38. Cho S.J., Zheng W. and Tropsha A. Rational Combinatorial Library Design. 2. Rational Design of Targeted Combinatorial Peptide Libraries Using Chemical Similarity Probe and the Inverse QSAR Approaches, J. Chem. Inf. Comput. Sci., 1998, 38, 259-268.
39. Weber L., Wallbaum S., Broger C. and Gubernator K. A Genetic Algorithm for the Automated Generation of Molecules within Constraints, Angew. Chem. Int. Ed. Engl., 1995, 107, 2453-2454.
40. Weber L. Molecular Diversity Analysis and Combinatorial Library Design, in Clark D.E. (Ed) Evolutionary Algorithms in Molecular Design, Wiley-VCH, Weinheim, 2000, pp 137-157.
41. Singh J., Ator M.A., Jaeger E.P., Allen M.P., Whipple D.A., Soloweij J.E., Chowdhary S. and Treasurywala A.M. Application of Genetic Algorithms to Combinatorial Synthesis: A Computational Approach to Lead Identification and Lead Optimisation, J. Am. Chem. Soc., 1996, 118, 1669-1676.
42. Yokobayashi Y., Ikebukuro K., McNiven S. and Karube I. Directed Evolution of Trypsin Inhibiting Peptides Using a Genetic Algorithm, J. Chem. Soc. Perkin Trans. 1, 1996, 2435-2437.
43. Gobbi A. and Poppinger D. Genetic Optimization of Combinatorial Libraries, Biotechnol. Bioeng., 1998, 61, 47-54.
44. Schneider G., Clement-Chomienne O., Hilfiger L., Schneider P., Kirsch S., Bohm H.-J. and Neidhart W. Virtual Screening for Bioactive Molecules by Evolutionary De Novo Design, Angew. Chem. Int. Ed., 2000, 39, 4130-4133.
45. Schneider G., Lee M.L., Stahl M. and Schneider P. De Novo Design of Molecular Architectures by Evolutionary Assembly of Drug-Derived Building Blocks, J. Comput.-Aided Mol. Des., 2000, 14, 487-494.
46. Schneider G. Evolutionary Molecular Design in Virtual Fitness Landscapes, in Bohm H.-J. and Schneider G. (Eds) Virtual Screening for Bioactive Molecules, Wiley-VCH, Weinheim, 2000, pp 161-186.
47. WDI: The World Drug Index is available from Derwent Information, 14 Great Queen St., London W2 5DF, UK.
48. Martin E.J. and Critchlow R.E. Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery, J. Comb. Chem., 1999, 1, 32-45.
49. Brown J.D., Hassan M. and Waldman M. Combinatorial Library Design for Diversity, Cost Efficiency, and Drug-like Character, J. Mol. Graph. Model., 2000, 18, 427-437.
50. Rassokhin D.N. and Agrafiotis D.K. Kolmogorov-Smirnov Statistic and its Application in Library Design, J. Mol. Graph. Model., 2000, 18, 427-437.
51. Bravi G., Green D.V.S., Hann M.M. and Leach A.R. PLUMS: A Program for the Rapid Optimization of Focused Libraries, J. Chem. Inf. Comput. Sci., 2000, 40, 1441-1448.


52. Fonseca C.M. and Fleming P.J. An Overview of Evolutionary Algorithms in Multiobjective Optimization, Evolutionary Computation, 1995, 3(1), 1-16.
53. Gillet V.J., Khatib W., Willett P., Fleming P.J. and Green D.V.S. Multiobjective Approach to Combinatorial Library Design, Abstr. Pap. Am. Chem. Soc., 2001, 221st, COMP-075.
54. UK Patent Application No. 0029361.3.


Clustering of Large Data Sets in the Life Sciences

Ketan Patel and Hugh M. Cartwright
Physical and Theoretical Chemistry Laboratory, University of Oxford
South Parks Road, Oxford OX1 3QZ, England

Summary: With the growing amount of genetic data available to scientists there is a pressing need to characterise the functions of genes. Such knowledge will enable us to better understand organisms at the molecular level and to elucidate the mechanisms by which diseases disrupt biological processes. With the advent of whole genome expression technologies such as DNA microarrays and proteomics, scientists can at last determine how genes and proteins change their rates of expression under specific experimental conditions. The data sets generated from such studies are large and require sophisticated tools for proper analysis. In this chapter we review several techniques employed in clustering data sets of this type. Clustering can often reveal broad patterns which show that certain genes or proteins are performing common functions; this is a useful way in which one can attribute functions to newly discovered genes. A wide variety of clustering algorithms exists; we consider several of the most promising and look at how the techniques perform when tested with different types of data from gene expression and protein expression experiments.

Keywords: clustering, grouping, visualisation, gene expression, protein expression, data analysis.

1 Introduction

With the near completion of the Human Genome Project, biologists now have access to an enormous amount of sequence data. However most of the sequences which have been identified have not been linked to a known function, and thus the next focus for researchers will be to assign functions to genes. Several key technologies have been developed to help in this task; one is the analysis of mRNA transcripts using cDNA microarrays [1, 2]. This technique is very useful in studying gene expression, and has been proven to be reliable and efficient. Another key technique is proteomics [3, 4], the analysis of protein expression patterns using 2D-gel electrophoresis and subsequent characterisation by mass spectrometry. Through proteomics researchers can study directly the expression of proteins under a variety of physiological and pathological conditions.

Both techniques generate large multivariate data sets which are very hard to interpret in their raw form. It would be beneficial to summarise the data automatically so that they are easier to interpret. Such a summary must not lose the essential information contained in the data, but should reduce the amount of data that the researcher has to interpret actively. Several types of clustering algorithm allow the user to organise data into meaningful groups, and thus reduce the overall amount which has to be interpreted. Below we describe some of the latest techniques which have been used successfully in current life science research; these techniques can also be used to assess other large multivariate data sets.

2 The Grouping Problem

The grouping problem can be defined as the task of grouping n objects into k groups, such that the objects within a group are similar to each other in some way, but the groups are different to each other. In the case of multivariate data each object has several variables, which might not always be equally significant. In some cases the variables might have to be weighted to signify their relative importance. Figure 1 shows an outline of the clustering process.

2.1 Measures of Similarity

Every clustering algorithm must include some way to measure how similar two objects are. Two popular measures of similarity are the Euclidean distance and the Pearson correlation coefficient.


Figure 1. The clustering process: raw data are clustered, clusters are defined using visualisation together with other tools, and the defined clusters are then interpreted in the light of biological knowledge.

The Euclidean distance between points i and j in p dimensions is given by:

d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}

The Pearson correlation coefficient between any two series of numbers X = {X_1, X_2, ..., X_N} and Y = {Y_1, Y_2, ..., Y_N} is defined as:

r = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)


where \bar{X} is the average of the values in X and \sigma_X is the standard deviation of these values. These similarity measures are used in several of the algorithms discussed below.
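Both measures are easy to compute directly from their definitions; the Python sketch below mirrors the formulas above.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
```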

3 Unsupervised Algorithms

3.1 Hierarchical Clustering

Hierarchical clustering has been widely used in gene expression studies and, more recently, in the study of protein expression. The algorithm uses a similarity measure (as described in 2.1) to hierarchically cluster the data.

It operates as follows: first the distance matrix is computed using the appropriate similarity measure, then the pair of items with the minimum distance (or maximum correlation) is selected. Once this pair has been identified, these two items are then merged, and removed from the list being processed. The average of the two items is used to represent a new item, and the algorithm then computes distances from all other items to this new one. The procedure is repeated until only one item remains.

This algorithm uses the centroid method to define the distance between two clusters. Another method, called single linkage clustering, uses the minimum of the distances between all possible pairs of objects in the two clusters. Similarly, in complete linkage clustering the distance between two clusters is defined as the maximum of the distances between all possible pairs of objects in the two clusters.
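For concreteness, the following Python sketch implements the centroid-method agglomeration just described, recording each merge so that a dendrogram could be drawn from the result; it is a naive O(n^3) illustration rather than an implementation suited to large data sets.

```python
def hierarchical_cluster(points, distance):
    """Repeatedly merge the closest pair of items, replacing them by their
    average (the centroid method), until a single item remains. Returns the
    merge history: (members_a, members_b, distance) triples."""
    items = [(tuple(p), (i,)) for i, p in enumerate(points)]
    merges = []
    while len(items) > 1:
        d, i, j = min((distance(a[0], b[0]), i, j)
                      for i, a in enumerate(items)
                      for j, b in enumerate(items) if i < j)
        (pa, ma), (pb, mb) = items[i], items[j]
        centroid = tuple((x + y) / 2 for x, y in zip(pa, pb))
        merges.append((ma, mb, d))
        items = [it for k, it in enumerate(items) if k not in (i, j)]
        items.append((centroid, ma + mb))
    return merges
```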

Figure 2. Dendrogram from a hierarchical clustering.


The normal output of a hierarchical clustering procedure is a dendrogram of which a simplified example is shown in Fig. 2. This output is an aid to determining the final cluster memberships. Sometimes the dendrogram is confusing, so other visualisation techniques may be used to aid the user in determining the cluster membership.

3.1.1 Clustering of data from an arthritis study

Expression data is usually presented in the form of a matrix, whose rows represent either genes or proteins, and columns the expression values of these in each experiment. The hierarchical clustering algorithm was tested using a variety of data drawn from both simulated and real data sets.

The algorithm performed well with simulated data sets, and was then tested on a protein expression data set taken from a study of arthritis in rats [5]. This data was derived by injecting the rats with adjuvant to induce arthritis, then taking blood serum samples at 3 day intervals for 15 days. Standard 2d gel techniques were used to derive expression data for proteins in these samples. The final data matrix consisted of 5 columns and approximately 500 rows. The hierarchical clustering algorithm performed well and identified some definite clusters in the data. However, after the most obvious clusters were identified the remaining data seemed to consist of numerous very small clusters. The data had to be normalised when using Euclidean distance measures, but not when using Pearson correlation.

Eisen et al. used hierarchical clustering to cluster gene expression data from a yeast model [6]. Many other studies have used hierarchical clustering to find commonly related genes in expression data. In these studies it was often found that genes that clustered together had a common function and were sometimes co-regulated.

3.2 Self-Organising Maps

Self-Organising Maps (SOMs) [7] have a number of features that make them well suited to clustering problems. They are particularly appropriate for exploratory data analysis and also facilitate easy visualisation and interpretation. SOMs have good computational properties and are easy to implement, reasonably fast, and scalable to large data sets.

SOMs work as follows: a SOM has a set of nodes with a simple topology (e.g. a two-dimensional grid) and a distance function d(N_1, N_2) on the nodes. The nodes are mapped onto k-dimensional space, and then iteratively adjusted according to a function f_i(N). The initial mapping f_0 is random. On each subsequent iteration a data point P is selected and the node N_P which maps nearest to P is identified. The mapping of nodes is then adjusted by moving points toward P, employing a formula such as that given below from [8]:

f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i) \, (P - f_i(N))


The learning rate τ decreases with the distance of node N from N_P and with the iteration number i. The point P used at each iteration is determined by random selection from the n data points. The function τ is defined by an expression such as

\tau(x, i) = \frac{0.02T}{T + 100i} \text{ for } x \le \rho(i), \quad \tau(x, i) = 0 \text{ otherwise},

where the radius ρ(i) decreases linearly with i (ρ(0) = 3), eventually becoming zero, and T is the maximum number of iterations. This results in the closest node N_P being moved the most, whereas other nodes are moved by smaller amounts depending on their distance from N_P in the initial geometry. After many iterations neighbouring points in the initial geometry tend to map onto neighbouring points in k-dimensional space (see Fig. 3).
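A single SOM iteration can be sketched in a few lines of Python. The linear shrinking of the radius from ρ(0) = 3 follows the description above; the node and data representations are illustrative assumptions.

```python
import random

def som_step(nodes, grid_dist, data, i, T):
    """One iteration of the SOM update: pick a random data point P, find the
    winning node N_P, and move every node within the current radius toward P.
    'nodes' maps node ids to positions in k-dimensional space; 'grid_dist'
    is the distance function on the fixed node topology."""
    p = random.choice(data)
    winner = min(nodes, key=lambda n: sum((a - b) ** 2
                                          for a, b in zip(nodes[n], p)))
    radius = max(0.0, 3.0 * (1.0 - i / T))     # rho(i) shrinks linearly to zero
    rate = 0.02 * T / (T + 100 * i)            # learning rate from the text
    for n, pos in nodes.items():
        if grid_dist(n, winner) <= radius:
            nodes[n] = [x + rate * (q - x) for x, q in zip(pos, p)]
```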

Figure 3. A Schematic Self-Organising Map.

3.2.1 Clustering of Gene Expression data

SOMs have been successfully used to cluster gene expression data. Tamayo et al., who tested their SOM on gene expression data sets from yeast and human cancer cell lines [8], have shown that SOMs are sometimes better than hierarchical clustering for large gene expression data sets.

The data was first pre-processed with a variation filter to remove any genes with no significant change across the samples. The data was also normalised across experiments. For the yeast cell cycle data a 6 x 5 SOM was used. The SOM automatically identified the cell-cycle periodicity as among the most prominent features in the data, since there were several nodes with this feature. The genes identified as having peak expression in the late G1 phase corresponded well with those identified by visual inspection in [9].

A second set of data was taken from a myeloid leukemia cell line (HL-60) which undergoes macrophage differentiation on treatment with the phorbol ester PMA. Nearly 100% of HL-60 cells become adherent and exit the cell cycle within 24 hours of PMA treatment. The process of hematopoietic differentiation is largely controlled at the transcriptional level, and blocks in the developmental program likely underlie the pathogenesis of leukemia. Cells were harvested at 0, 0.5, 4 and 24 hours after PMA stimulation (for the full method see [8]). 567 genes passed the variation filter, and a 4 x 3 SOM was used to organise the genes into 12 clusters. The results uncovered many genes which have previously been identified as being co-regulated, but also discovered some that have not been identified. Some of these genes would not normally have been associated with each other. This generated new hypotheses about the role of certain gene families in macrophage differentiation (for full details see [8]).


Figure 4. SOM output screen showing the average expression profile for each SOM node.

Page 49: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

38 K.Patel, H. M.Cartwright

We also tested a SOM using protein expression data from [5], and the algorithm performed well, especially with noisy data. A sample output screen is shown in Fig. 4; this screen illustrates how similar expression patterns tend to be located on adjacent nodes in the SOM structure. It is easy to group patterns visually into clusters this way, and then look at the members of those clusters in more detail. As with all exploratory data analysis tools, the use of SOMs involves inspection of the clustered data to extract insights.

3.3 Genetic Clustering Algorithms

3.3.1 Genetic Algorithms

In a Genetic Algorithm [10] the process of evolution is used to design novel solutions to problems. A population of individuals, each of which represents a possible solution to the problem, is maintained. With each 'generation' of the algorithm the population is changed, by selecting 'fit' individuals, reproducing these and then combining them with others to create a new population. The fitness measure gives an indication of the quality of an individual; in our case it is an assessment of the quality of the clustering. Over time the population gets fitter as solutions of better quality emerge. Genetic Algorithms have been used successfully for the problem of clustering n objects into k clusters [11].

3.3.2 Representations used in GCA's

There are several schemes to represent a potential clustering within the GA. One such approach is the group-number representation [12], which provides an easy way of representing a clustering using a string of numbers. This method, however, requires that the user define k, the number of clusters, a priori. The encoding represents a clustering of n objects as a string of n integers where the ith integer signifies the group number of the ith object. For example, 11122223333 signifies that objects 1 to 3 are in cluster 1, objects 4 to 7 are in cluster 2, and so on. The advantage of this is that the string is always of a fixed length, which makes processing more efficient. The disadvantage is that the user must specify how many clusters are expected, and in most cases this is not known prior to analysis.
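Decoding the group-number representation is straightforward; the short Python sketch below makes the example explicit.

```python
def decode_group_numbers(chromosome, k):
    """Turn a group-number string such as '11122223333' into k clusters of
    object indices (objects numbered from 1, as in the text)."""
    clusters = {g: [] for g in range(1, k + 1)}
    for obj, group in enumerate(chromosome, start=1):
        clusters[int(group)].append(obj)
    return clusters

# decode_group_numbers("11122223333", 3)
# -> {1: [1, 2, 3], 2: [4, 5, 6, 7], 3: [8, 9, 10, 11]}
```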

Another approach is to use a string of numbers in which the numbers represent the appropriate objects, with a separator character used to denote the cluster boundaries. For example, if the letter Z were chosen as the separating character, then 345Z2167 would denote that objects 3, 4 and 5 are in the first cluster, and objects 2, 1, 6 and 7 are in the second cluster. This encoding is known as the Permutation with Separators encoding.


The Greedy permutation encoding uses a similar string of numbers but without the separator characters. The first k objects are used to seed k clusters. The remaining objects are then, in the order they appear in the permutation, added to the cluster which yields the best objective value (typically the cluster with the closest centroid).

3.3.3 Move operators

The genetic operators are used in each generation to reproduce and mutate the current population. Standard genetic algorithms use two main operators, the crossover and mutation operators. The mutation operator occasionally mutates members of the population; this maintains genetic diversity in the population. The crossover operator recombines genetic material from two population members and produces a new population member which sometimes is fitter than the 'parent' members.

In GCA's the operators depend upon the representations used. Standard crossover operators can be used with most representations, and a popular one is one-point crossover. In this operation a point on the chromosome is picked at random and material is exchanged between the two 'parents', from either side of the point, to create two new chromosomes. Another standard crossover operator is uniform crossover, in which at each point in the chromosome one parent is picked at random. The object from this parent is inserted into the first child, the object from the parent picked second is inserted into the second child. Thus two new chromosomes are created.

Standard operators are insensitive to the parent clusterings (i.e. they do not take into account the parent clusterings when creating the child). One operator which is sensitive to the parent clustering is the Edge-based crossover operator. It works as follows:

1. Initialise the child to the set of non-empty intersections of the clusters of the two parents. Let L denote the number of non-empty intersections.

2. If L = K, then stop, otherwise go to step 3.

3. Select the pair of groups with the minimum number of non-inherited edges (between-group edges not present in either parent), breaking ties at random. Join this pair of groups, set L = L - 1, and go to step 2.

We can illustrate this operator with an example. Suppose we have the following parent clusterings:

{{X1}, {X3, X4, X5}, {X2, X6}}

{{X3}, {X2, X4, X6}, {X1, X5}}

The non-empty intersections of these clusterings are:

{{X1}, {X3}, {X4}, {X5}, {X2, X6}}

We initialise the child to the set of intersections and then merge clusters until the correct number of clusters is reached. In this example one possible child is:

{{X1, X5}, {X2, X6}, {X3, X4}}

which inherits {X3, X4} from parent 1, {X1, X5} from parent 2, and {X2, X6} from both parents. This operator ensures valid children, and works for all representations.
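Step 1 of the operator, collecting the non-empty intersections, can be sketched directly in Python using sets; the example reproduces the parent clusterings above.

```python
def nonempty_intersections(parent1, parent2):
    """Seed the child with every non-empty intersection of a cluster from
    each parent (step 1 of the edge-based crossover)."""
    return [a & b for a in parent1 for b in parent2 if a & b]

p1 = [{"X1"}, {"X3", "X4", "X5"}, {"X2", "X6"}]
p2 = [{"X3"}, {"X2", "X4", "X6"}, {"X1", "X5"}]
# nonempty_intersections(p1, p2)
# -> [{'X1'}, {'X3'}, {'X4'}, {'X5'}, {'X2', 'X6'}]
```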

The aim of the mutation operator is to introduce random mutations into the population members; this helps the GA to explore new parts of the search space. A popular mutation operator (based on the standard GA mutation operator described in [13]) takes a random part of the string and changes it, making sure the result is still valid. This effectively moves an object from one cluster to another randomly. The idea is that good mutations are kept and bad ones are selected out by evolution.

3.3.4 Fitness Function

The fitness function must take in a GA string and return a value which represents the 'quality' of the clustering. Most GA's work to maximise the fitness value, and so any fitness function must give a higher value to optimal clusterings. Two popular fitness functions are given below.

The first is the trace of W, where W is defined as:

W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'

Here n_i is the number of objects in cluster i, x_{ij} is the jth object of the ith cluster and \bar{x}_i is the centroid of cluster i. The minimisation of trace(W) is equivalent to minimising the sum of squared Euclidean distances between individuals and their cluster centroids. This clustering criterion (used in [14]) favours spherical clusters, since the correlation between the attributes is not considered. Other distance measures, such as the diagonal distance, can also be used to account for different cluster shapes. Since the GA must maximise the fitness value, the trace is transformed using the function f' = C_max - f, where f is the raw fitness, f' is the scaled fitness, and C_max is the value of the poorest string in the population. This value is also linearly scaled to provide a greater range of fitness values for the GA to work with.

Another fitness function is the maximisation of the between sum of squares and within sum of squares ratio. Since the aim is to maximise this value no other transformation is necessary.
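Since trace(W) is just the pooled sum of squared distances to cluster centroids, it can be computed without forming W explicitly, as the following Python sketch shows.

```python
def trace_W(clusters):
    """Sum of squared Euclidean distances of objects to their cluster
    centroids, i.e. trace(W) for the within-cluster scatter matrix.
    'clusters' is a list of clusters, each a list of feature vectors."""
    total = 0.0
    for objects in clusters:
        p = len(objects[0])
        centroid = [sum(o[d] for o in objects) / len(objects) for d in range(p)]
        total += sum((o[d] - centroid[d]) ** 2 for o in objects for d in range(p))
    return total

# trace_W([[(0, 0), (2, 0)], [(5, 5)]]) -> 2.0  (each point is 1.0 from (1, 0))
```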


3.3.5 Results

We used a GCA to cluster protein expression data derived from a rat model of arthritis. Two fitness functions were implemented: minimisation of trace(W), and the ratio of the between sum of squares to the within sum of squares. The group-number representation was used with standard one-point and uniform crossover operators, and a standard mutation operator. The algorithm worked well on simple simulated data, but became increasingly poor at determining clusters as the data set grew more complex. The results with the protein expression data were not as good as those found with other clustering methods. However, clustering using GA's has been tested successfully on a wide variety of other multivariate data sets (see [11] for a full discussion).

4 Supervised Algorithms

4.1 Growing Cell Structure Networks

The Growing Cell Structure (GCS) network [15] is a neural network related to Kohonen's self-organising feature map. The difference between GCS and SOMs is that the network topology is not fixed in a GCS. It grows and changes until it accurately models the data distribution. Given a set of data, the data is first sorted into two sets, so that the data within each set is similar, but the two sets are dissimilar. Next a new node is added, and data from the first two sets is transferred to this new node, to minimise the error in each node. Thus data which is similar ends up in each node. This continues until the addition of a new node does not decrease the overall amount of error in the system. Each new node is placed adjacent to the two nodes in the system containing the most error (see Fig. 5). Once the network is constructed it can also be used to predict the classification of a new data point.

Figure 5. A growing cell structure network shown at successive stages of growth.


4.1.1 Analysis of Cytology data

The GCS system was used by Walker and co-workers [16] to cluster and classify breast cancer cytology data. The data was made up of several cases, each consisting of several variables together with an outcome for that particular patient: whether the cancer was malignant or benign. The outcomes of each of the cases were already known, so a portion of the data was reserved for testing the predictive capabilities of the network. The GCS successfully identified the input variables which were most important in determining whether an outcome was malignant or benign (for the full set of results see [16]).

4.2 Support Vector Machines

Support Vector Machines (SVM's) [17] are a supervised learning technique, because they exploit prior knowledge. SVM's need to be trained on a training set of data and they can then be used to classify new data into previously identified classes. In the case of gene expression data a set of genes with a common function (e.g. genes coding for ribosomal proteins) would be classified as a distinct class. Several such examples can be derived from previously classified data sets, as well as genes that could not be classified with a function. The SVM learns to classify expression data into a functional class based on this training set.

If we think of each vector of gene expression data as a point in m-dimensional expression space, then a simple way of classifying points in this space is to construct a hyperplane which separates the class members. However, in most real-world problems there is no such hyperplane that can successfully separate the positive from the negative examples. One solution to the inseparability problem is to map the data into a higher-dimensional 'feature space' and define a separating hyperplane there. There are problems with this approach, however, in that the system sometimes finds trivial solutions by overfitting the data set. Furthermore, mapping into feature space is computationally expensive.

SVMs avoid these problems in two ways. Firstly, they avoid overfitting by choosing the maximum-margin hyperplane from among the many that can separate the positive from the negative examples. Secondly, SVMs avoid explicitly representing the feature space. This is because the algorithm that finds the hyperplane can be stated entirely in terms of vectors in the input space and dot products in the feature space, by defining a function, called a kernel function, that plays the role of the dot product in the feature space; the feature-space vectors never have to be represented explicitly. On occasion the SVM may not be able to find a separating hyperplane in feature space. This problem can be solved by specifying a soft margin that allows some training examples to fall on the wrong side of the hyperplane. Therefore, to specify a support vector machine one needs two parameters: the kernel function and the penalty for violating the soft margin. The settings of these parameters depend on the specific data to hand.
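
As a minimal sketch (scikit-learn is our choice of library, not mentioned in the text; the kernel and penalty values are arbitrary assumptions), these two parameters appear directly as the kernel and C arguments of a soft-margin classifier:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# kernel -> the kernel function (here a radial basis function)
# C      -> the penalty for violating the soft margin
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```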

[Figure 6 (image): two classes of points, open and filled circles, separated by hyperplanes.]

Figure 6. Difference between a separating hyperplane and the optimal separating hyperplane.

Given an expression vector X for each gene or protein, the simplest kernel K(X, Y) that we can use to measure the similarity between genes X and Y is the dot product in the input space, K(X, Y) = X · Y. When this dot-product kernel is used, the feature space is essentially the same as our m-dimensional expression space, and the SVM will classify with a separating hyperplane in this space. Raising this kernel to higher powers (e.g. (X · Y)²) yields polynomial separating surfaces of higher degree in the input space. In general, the kernel function of degree d is defined by

K(X, Y) = (X · Y + 1)^d

In the feature space of this kernel there are d-fold interactions between expression measurements. There are also other forms of kernel function one can use besides the above, such as a radial basis kernel function, which has a Gaussian form

K(X, Y) = exp(−‖X − Y‖² / 2σ²)

where σ is the width of the Gaussian.
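
The three kernels just described can be written down in a few lines; the sketch below is our own illustration, with hypothetical variable names:

```python
import numpy as np

def linear_kernel(x, y):
    """Dot product in input space: K(x, y) = x . y"""
    return np.dot(x, y)

def polynomial_kernel(x, y, d=2):
    """Degree-d kernel: K(x, y) = (x . y + 1)^d"""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.5, -0.2])
y = np.array([0.8, 0.4, 0.1])
print(linear_kernel(x, y), polynomial_kernel(x, y, d=2), rbf_kernel(x, y))
```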

Training the SVM consists of error minimisation using gradient-based learning techniques. For small problems standard techniques (such as conjugate gradient methods) can be used; for larger problems more advanced optimisation techniques need to be employed [18]. The optimisation finds a saddle point which is a global optimum in the feature space. Since this optimisation problem has only a global optimum, it is not prone to becoming trapped in local minima in the feature space. Thus the algorithm will find the optimal separating hyperplane between the two classes of points, not just any hyperplane which accurately separates the two sets of points (see Figure 6). For a mathematical proof of how SVMs find the optimal hyperplane see [19]. This ability of SVMs to generalise accurately from a set of training data without overfitting that data is very useful, and allows SVMs sometimes to outperform other comparable methods such as neural networks.

4.2.1 SVM Treatment of Test Data

Brown and co-workers tested a variety of SVMs with different kernel functions on gene expression measurements [20]. The data was taken from a study of the cell cycle in yeast. The SVMs were trained using 2,467 annotated genes, and were trained to recognise six functional classes of protein. The gene classes were chosen because they represent categories of genes that are expected to exhibit similar expression profiles. One of the six classes was a control group and consisted of genes which were not expected to have similar expression profiles. The results showed that the best-performing method was an SVM using a high-dimensional kernel function or an SVM with a radial basis kernel function. The SVMs were also able to predict the functional classes of some previously unannotated genes based on their expression profiles.

5 Evaluation of Clustering Results

5.1 Visualisation Techniques

5.1.1 Application-Specific Visualisation

In order to see the results of the clustering and to evaluate its quality, it is often useful to have an appropriate visual representation of the data. Visualisation can help the user to assess the quality of the clustering and also to arbitrate when clustering is ambiguous. In the case of expression patterns a popular technique is to use a grid of rectangular cells, with a colour representing the normalised expression value. The colour ranges from bright green (for high expression) to bright red (for low expression). Colour matrices have been widely used to represent gene expression data [21], and so the same colour conventions were adopted for protein expression.
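
A colour matrix of this kind takes only a few lines to produce; the following sketch (our own, using matplotlib with made-up data) follows the green-high/red-low convention described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

expression = np.random.default_rng(1).normal(size=(20, 12))  # genes x samples
red_green = LinearSegmentedColormap.from_list("red_green",
                                              ["red", "black", "green"])
plt.imshow(expression, cmap=red_green, aspect="auto")
plt.xlabel("Sample")
plt.ylabel("Gene / protein")
plt.colorbar(label="Normalised expression")
plt.show()
```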

5.1.2 General Visualisation Techniques

General visualisation techniques also exist that can be used to assess clustering output. For example, by mapping our p-dimensional input data points into 2- or 3-dimensional coordinates, we can simply plot the points as a 2D or 3D graph and easily see which points lie close together in the new dimensions. A commonly used technique to reduce the dimensionality of data is Principal Components Analysis (PCA). This is a dimensional reduction technique which maps the p-dimensional input vectors onto a new coordinate space spanned by the principal components of the original data. Gilbert and Schroeder employed this technique in Space Explorer [22], a 3D interactive visualisation system which allows the user to see the points mapped onto a 3D coordinate space, each point coloured to reflect the cluster it is in (Fig. 7). Using this technique it is easy to identify data points which are outliers, and also data points which should belong in a different cluster, since one can readily spot a blue data point in a clump of red data points.
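
A sketch of this kind of PCA-based view is given below (our own illustration; scikit-learn, matplotlib and the k-means labelling are all assumptions, since any clustering algorithm could supply the colours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 12))               # p = 12 dimensional points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

coords = PCA(n_components=3).fit_transform(data)  # map onto first 3 PCs
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()
```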

Such visualisation techniques can also be used to cluster the data interactively. For example, one could map the data onto a new coordinate space and let the user employ a selection tool to decide which data points fall in which cluster. Alternatively, such interactive tools could be used to fine-tune and correct a clustering solution found by an automatic clustering algorithm.

Figure 7. A screen from Space Explorer.


5.2 Statistical Measures to Evaluate Clusterings

Various statistical measures exist by which one may assess the quality of a clustering. In some cases these can also be used during clustering, to find the optimal number of clusters.

The most widely used measure is the root-mean-square standard deviation (RMSSTD) of all the variables forming the cluster. The RMSSTD is the pooled standard deviation, found by first calculating the pooled variance:

pooled variance = (pooled sum of squares for all variables) / (pooled degrees of freedom for all variables)

The square root of the pooled variance is the RMSSTD. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. When used during clustering (for example during hierarchical clustering), the RMSSTD should not increase significantly on forming a new cluster, as this would indicate that the new cluster is not homogeneous. It should be noted, however, that there are no guidelines to decide what is a 'small' value for RMSSTD and what is 'large'.

R-squared (RS) is another good statistical measure of the quality of a clustering. RS is the ratio of SSb (the between-group sum of squares) to SSt, where SSt = SSb + SSw (SSw being the within-group sum of squares). The greater the SSb, the smaller the SSw, and vice versa: for a given data set, the greater the differences between groups, the more homogeneous each group is. The RS therefore measures the extent to which groups or clusters differ. RS ranges from 0 to 1, where 0 indicates no differences between groups and 1 indicates maximum differences between groups.

The distance between clusters can also be used as a measure for how similar two clusters are to each other. This can be found by calculating the centroid distance (CD) between two clusters which is the Euclidean distance between two cluster centroids. When using hierarchical clustering, this measure could be used to assess whether two clusters should be merged into one; if the distance is small then the two clusters should be merged, otherwise they should remain as two separate clusters. A summary of these statistics is given in table 1.

Statistic | Concept measured                | Ideal value
RMSSTD    | Homogeneity of cluster          | Value should be small
RS        | Heterogeneity of clusters       | Value should be large
CD        | Homogeneity of merged clusters  | Value should be small

Table 1. Summary of the statistics for evaluating clustering solutions.
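
For concreteness, the three statistics in Table 1 can be computed as follows (a sketch of our own; the function names and the NumPy dependency are assumptions):

```python
import numpy as np

def rmsstd(cluster):
    """Root-mean-square standard deviation: sqrt(pooled SS / pooled d.o.f.)."""
    n, p = cluster.shape
    ss = np.sum((cluster - cluster.mean(axis=0)) ** 2)  # pooled sum of squares
    return np.sqrt(ss / (p * (n - 1)))                  # pooled d.o.f.

def r_squared(data, labels):
    """RS = SSb / SSt, where SSt = SSb + SSw."""
    sst = np.sum((data - data.mean(axis=0)) ** 2)
    ssw = sum(np.sum((data[labels == k] - data[labels == k].mean(axis=0)) ** 2)
              for k in np.unique(labels))
    return (sst - ssw) / sst

def centroid_distance(a, b):
    """CD: Euclidean distance between two cluster centroids."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
lab = rng.integers(0, 2, size=30)
print(rmsstd(X[lab == 0]), r_squared(X, lab),
      centroid_distance(X[lab == 0], X[lab == 1]))
```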


6 Interpretation of Clustering Results

The final step of cluster analysis is to determine what the clusters represent. This requires specific knowledge of the subject area and, in some cases, further research. In many gene expression studies the genes which have similar expression patterns (i.e. are in the same cluster) may be related, having a common functional role. Studies by Gerstein and Jansen [23] indicate that proteins which have common structures also cluster together in expression data sets. This helps in identifying the function of those genes to which no function has yet been assigned. Genes may also cluster together because they are commonly regulated by a transcriptional regulator [24]. These can be identified by looking at the upstream sequence of the co-clustered genes and finding a common pattern. This region is called a promoter sequence, and indicates where the regulator binds to the DNA to turn the gene 'on'. Thus clustering of gene expression data gives many clues about the commonly clustered genes, and can aid in annotation of the genome as well as in the elucidation of biological mechanisms.

7 Conclusion

Clustering of large multivariate data sets has been discussed and a variety of solutions presented. In testing, the hierarchical clustering algorithm worked well for well-defined data (i.e. where the clusters were well separated and there was minimal noise), but performed fairly poorly with real data sets containing noisy data. The self-organising map performed well with noisy data, and was also informative as to the structure of the data. Similarly, with the growing cell structure network the clustering results could easily be interpreted as groups.

Some algorithms are amenable to easy interpretation, whereas others require further measures and visualisation of the data to arbitrate cluster boundaries. Once all clusters are found, interpretation of the meaning of the resulting groups is a separate task and requires further data from genomic and protein databases. After analysing these data one can sometimes give a 'name' to the clusters or label them in some way. This insight can generate new hypotheses about certain genes and organise those genes which were previously unannotated. Gene expression studies have now moved on to more complex organisms such as humans [25,26], and such analyses will only increase as more data from the genome projects become available. Although several such studies will be necessary to find all the relationships between gene expression and other biological factors, analyses such as those described above will be essential in speeding up this process.


References

1. M. Schena, D. Shalon, R. Davis and P. O. Brown, Quantitative monitoring of gene expression patterns with a cDNA microarray, Science 270:467-470, (1995).

2. P. O. Brown and D. Botstein, Exploring the New World of the genome with DNA microarrays, Nature Genetics 21:33-37, (1999).

3. M.R. Wilkins, K. L. Williams, R.D. Appel, D. F. Hochstrasser, (Eds.), Proteome Research: New Frontiers in Functional Genomics, Springer-Verlag Berlin, Heidelberg, New York, (1997).

4. I. Humphrey-Smith, S. J. Cordwell and W. P. Blackstock, Proteome Research: Complementarity and limitations with respect to the RNA and DNA worlds, Electrophoresis 18(8):1217-1242, (1997).

5. D. Shipton, Autoimmune disease in rodents: control and specificity, DPhil Thesis, University of Oxford, (1999).

6. M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863-14868, (1998).

7. T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybern. 43:59-69, (1982).

8. P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander and T. R. Golub, Interpreting patterns of gene expression with self-organising maps: Methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, 96:2907-2912, (1999).

9. R. J. Cho, J. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, A genome-wide transcriptional analysis of the mitotic cell cycle, Mol. Cell, 2(1):65-73, (1998).

10. Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs (3rd edition), Springer-Verlag, Berlin, Heidelberg, New York, (1996).

11. R. Cole, Clustering with Genetic Algorithms, MSc Thesis, Department of Computer Science, University of Western Australia, (1998).

12. D. R. Jones and M. A. Beltramo, Solving partitioning problems with genetic algorithms, In R. K. Belew and L. B. Booker (editors), Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 442-449, Morgan Kaufmann Publishers, San Mateo, California, (1991).

13. D. E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning, Addison-Wesley Publishing Company, Inc., (1989).

14. J. Bhuyan, A combination of genetic algorithm and simulated evolution techniques for clustering, In C. J. Hwang and B. W. Hwang (editors), Proceedings of the 1995 ACM Computer Science Conference, pp. 127-134, The Association for Computing Machinery, Inc., (1995).

15. B. Fritzke, Unsupervised clustering with growing cell structures, Proc. IJCNN-91, (1991).


16. A. J. Walker, S. S. Cross and R. F. Harrison, Visualisation of biomedical datasets by use of growing cell structure networks: a novel classification technique, Lancet 354:1518-1521, (1999).

17. V. Vapnik, Statistical Learning Theory, Wiley, Chichester, England, (1998).

18. J. C. Platt, Fast training of support vector machines using sequential minimal optimization, In B. Schölkopf, C. J. C. Burges and A. J. Smola (editors), Advances in Kernel Methods, MIT Press, Boston, (1999).

19. C. J. C. Burges, A Tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, Kluwer Academic Publishers, Boston, (1998).

20. M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr. and D. Haussler, Knowledge-based analysis of microarray gene expression data by using support vector machines, Proc. Natl. Acad. Sci. USA, vol. 97:262-267, (2000).

21. R. D. Meyer and D. Cook, Visualisation of data, Current Opinion in Biotechnology 11:89-96, (2000).

22. D. Gilbert, M. Schroeder and J. van Helden, Space Explorer: Interactive visualisation of relationships between biological objects, Trends in Biotechnology 18(12):487-493, (2000).

23. M. Gerstein and R. Jansen, The current excitement in bioinformatics - analysis of whole-genome expression data: how does it relate to protein structure and function?, Current Opinion in Structural Biology 10:574-584, (2000).

24. M. Q. Zhang, Large-scale gene expression data analysis: a new challenge to computational biologists, Genome Research 9:681-688, (1999).

25. V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore, J. C. F. Lee, J. M. Trent, L. M. Staudt, J. Hudson, M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein and P. Brown, The transcriptional program in the response of human fibroblasts to serum, Science 283:83-87, (1999).

26. U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack and A. J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA, vol. 96:6745-6750, (1999).


Application of a Genetic Algorithm to the Refinement of Complex Mossbauer Spectra

Werner Lottermoser, Thomas Schell & Konrad Steiner

Institute of Mineralogy, University of Salzburg, Hellbrunnerstrasse 34, A-5020 Salzburg, Austria

Summary: The present contribution describes the results of applying a program consisting of a genetic algorithm routine and a conventional refinement part ("hybrid method") to the evaluation of Mossbauer spectra published elsewhere. The saving in total evaluation time compared with conventional refinement routines is very high, owing to the rapid finding of adequate starting parameters. In contrast to previous work on a similar topic, our algorithm provides solutions of the combined interaction Hamiltonian with a minimum of conventional input data. The reader is referred to a web address where the routine may be tested.

Keywords: Fitting of Mossbauer spectra, hybrid method, genetic algorithm

1 Introduction

1.1 Basics of Mossbauer spectroscopy

Mossbauer spectroscopy (in the following abbreviated as MBS) is a very widespread and successful method in nuclear and solid state physics, chemistry, biology, medicine, metallurgy, the geosciences and other scientific fields, owing to its singular ability to use a certain nucleus (in most cases ⁵⁷Fe) as a super-micro probe detecting the surroundings of a special site within a given crystal lattice. In biology, for example, it is possible to examine the environment of Fe in the hemoglobin molecule in order to detect certain blood pathologies. We, however, use MBS to deduce relations between structural and physical properties of Fe-containing minerals. As in many other spectroscopic tools, resonance is the basic principle, applied here in a sophisticated way with one essential extension: on the nuclear scale, resonance of emitted and absorbed γ-rays is only possible by suppressing the nuclear recoil (which normally prevents the two γ-frequency lines from overlapping, because of their incomparably low linewidth-to-intensity ratio). This is achieved by fixing the emitting and absorbing nuclei in a crystal lattice (Mossbauer effect, Nobel prize 1961). The very narrow lines can then be observed by applying another effect, the famous Doppler relation between frequency and relative velocity: by moving the nuclear transmitter of the emitted γ-rays in a definite manner we can run through the whole relevant frequency range of the fixed crystal sample's nuclei, rather like, to use a trivial analogy, "searching for a broadcasting station on the radio". We thus obtain a zero level of non-resonant γ-count rates (the background) combined with one or more resonance peaks; this dependence of absorbed intensity on frequency (or relative velocity) forms the so-called Mossbauer spectrum. The positions of the lines on the frequency (or velocity) scale are strongly revealing of certain physical properties, represented by special parameters:

If a single line, or the centre of the spectrum, is shifted relative to zero velocity, this is produced by a shift of the nuclear levels due to the valence states of the outer s-electrons surrounding the nucleus. The corresponding parameter is the isomer shift δ [mm/s].

If the single peak is split into two, this is caused by inhomogeneous electric fields acting on the nuclear levels, produced by the ion's own electronic shell and the surrounding ligands. This "quadrupole splitting" QS (the relevant parameter) is thus influenced by a mathematical quantity which describes the local charge distribution around the nucleus, the so-called electric field gradient (efg). Geometrically, this efg tensor (the efg is the second derivative of the potential, ∂²V/∂xᵢ∂xⱼ) is represented by an ellipsoid with the semiaxes Vxx, Vyy, Vzz, where |Vzz| ≥ |Vyy| ≥ |Vxx|. The Mossbauer parameter QS and Vzz are related to each other by the equation

QS = (1/2) e Q Vzz (1 + η²/3)^(1/2) [mm/s]

with e = electronic charge unit, Q = nuclear quadrupole moment and Vzz = z-component of the electric field gradient, and, as a further parameter,

η = (Vxx − Vyy) / Vzz

which represents the flattening of the efg ellipsoid.
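
As a small worked illustration of these two relations (our own sketch; the lumped prefactor eQVzz/2 is supplied directly in mm/s, and all numbers are invented):

```python
import math

def eta(Vxx, Vyy, Vzz):
    """Asymmetry parameter: eta = (Vxx - Vyy) / Vzz."""
    return (Vxx - Vyy) / Vzz

def quadrupole_splitting(half_eQVzz_mms, eta_value):
    """QS = (eQVzz/2) * sqrt(1 + eta^2 / 3), in mm/s."""
    return half_eQVzz_mms * math.sqrt(1.0 + eta_value ** 2 / 3.0)

# e.g. eta = 0.3 raises QS by about 1.5 % over the eta = 0 value
print(quadrupole_splitting(3.0, eta(0.2, -0.1, 1.0)))
```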

If the initial line is split into a sextet, this is caused by magnetic fields acting on the nuclear levels, producing a sort of Zeeman effect. The relevant parameter is the internal magnetic field H(0) [T]; it is even possible to derive the direction of the field with respect to the efg axes system through two more parameters: the colatitude angle Θ [°] between Vzz and H(0), and the azimuthal angle Φ [°] between Vxx and the projection of H(0) onto the (Vxx, Vyy)-plane.


The parameters mentioned so far are valid for a powder sample. In the case of single crystal spectra we observe an influence of the crystal orientation on the relative intensities of the Mossbauer lines, so that we can add two more parameters: an angle β [°] between the k-vector of the incident γ-rays (which may be aligned with a defined crystallographic direction) and Vzz, and an angle α [°] between the projection of k onto the (Vxx, Vyy)-plane and Vxx.

In most cases, the Mossbauer parameters mentioned above act partly or wholly together to influence the resulting spectrum in a complex manner ("combined hyperfine interaction"). In other words, it is a very difficult job to extract the underlying parameters from a given experimental Mossbauer spectrum, as each symmetrically non-equivalent site in the crystal produces its own Mossbauer spectrum; the resulting superimposed peak distribution may then become very complex to resolve. The commonly used approach to this problem is to calculate a theoretical spectrum from the relevant Schrödinger equation (Hamiltonian formalism) and to compare it with the experimental one, obtaining the relevant parameters by trial and error. One of the first routines proceeding in this manner was the program of Varret/Teillet ([1], early versions certainly from the mid-seventies), which takes the common Mossbauer parameters as input, sets up the elements of the Hamiltonian matrix from these, diagonalizes it and modifies the parameters by a least-squares algorithm, always comparing experimental and calculated spectra. In subsequent iterations the parameters are refined towards the goal of minimal deviation between theoretical and experimental spectra. Most of the currently available commercial programs are based on this principle. All these programs, however, have in common that, despite minimized iteration times thanks to very powerful hardware and software, the total evaluation time for a complex Mossbauer spectrum may still be very high. This is due to the sensitivity of the algorithms to the choice of the starting parameters, because of the correlations that commonly occur between them; it therefore depends mostly on the intuition of the Mossbauer scientist to avoid time-consuming trials and divergences of the parameters due to inappropriate selection of the initial values. In complex cases, say two superimposed magnetic hyperfine patterns with 8 parameters each, that method may consume several days even for a very smart and experienced scientist to find appropriate starting parameters.

1.2 Genetic algorithms - a fundamentally new approach

The idea is to replace the guessing phase by a genetic algorithm (in the following abbreviated as GA). The GA starts with an initial population of arbitrary input parameters. Each parameter set is called an individual. The first step is to evaluate the fitness of all individuals. Afterwards the current population is recombined to form a new population. The process of recombination involves the successive application of the genetic operators: selection, crossover and mutation. These operators mimic the process of natural evolution, i.e. the concept of the survival of the fittest. Even though there is no formal proof that the genetic algorithm will eventually converge, there is an excellent chance that there will be a population with very good individuals after some generations. The genetic algorithm is monitored for good individuals while it runs; thus there will be a list of good parameter sets at the end of the processing.

This concept was used in the study of Ahonen et al. [2], though in a very early-stage manner: a spectrum was fitted with a combination of single Lorentzian lines, e.g. a doublet was decomposed into two lines, a sextet into six lines and so on. As input the genetic algorithm took the peak positions, the expected number of lines, the measured spectrum and the maximum velocity. Each line was represented by three parameters, corresponding to the intensity, the full line width at half maximum and the peak position. At the end of a GA run the calculated geometric parameters were translated into physical ones by a separate program step. As a result, this genetic algorithm fits a Mossbauer spectrum geometrically.

In contrast to Ahonen et al. [2], the algorithm presented in the following combines a very modern version of a genetic algorithm with a conventional least-squares routine (the combined algorithm is called the hybrid method, HM), solving the combined interaction Hamiltonian, i.e. providing a physical solution in terms of the original Mossbauer parameters, with a minimum of input.

2 Theoretical

2.1 The applied genetic algorithm (GA)

The basic idea behind a genetic algorithm is the concept of natural population-based evolution, and the basic components of natural evolution can be identified in a GA. The genotype of a natural individual corresponds to a binary string in the GA (alphabets other than a binary one are also possible). The natural phenotype is the equivalent of a parameter set, i.e. a solution to the underlying problem which is to be solved by the GA. The parameter set itself is encoded in a binary string, the genotype. The environment is represented by a fitness function, which assigns a fitness value to each phenotype/individual. The natural competition for resources is replaced by a selection procedure which favours fit phenotypes for reproduction. The reproduction cycle is carried out by a crossover operator and a mutation operator: the crossover operator exchanges parts of the binary strings between two fit genotypes, the mutation operator changes arbitrary bits in the binary string. The initial population is replaced by a new one created by successive application of the selection, crossover and mutation operators. The new population is then the input for a new reproduction cycle [3].


The most critical steps in the design of a genetic algorithm are the encoding of the parameter set and the definition of the fitness function because the performance of the GA is strongly influenced thereby.

The GA which has been applied is almost canonical; in other words, no special operators were necessary to achieve convergence. As a selection scheme, binary tournament selection [4] has been applied because of its excellent performance, i.e. its linear time complexity [5], and its simple implementation. Furthermore, binary tournament selection is known to perform better than roulette wheel selection (as applied in Ahonen et al. [2]), which was originally presented by John Holland [6]. Other selection schemes, such as proportionate selection (i.e. roulette wheel selection with sigma scaling or windowing) and linear or exponential fitness ranking, have been introduced, but they are more complex to implement than binary tournament selection and there is no evidence that they perform better [7, 8]. A two-point crossover operator (a standard GA operator) with an optimized crossover rate was applied, together with a mutation operator and various population sizes. In every generation the whole population was replaced by a new one; for other generational schemes see Dejong [9]. In theory there are convergence criteria, but from a practical point of view a limit on the number of generations was sufficient; this was set to 400.
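
For illustration, the operators named above can be sketched as follows (our own Python code, not the authors' C++ implementation; genotypes are taken to be bit strings):

```python
import random

def binary_tournament(population, fitness):
    """Pick two individuals at random; the fitter one is selected."""
    a, b = random.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]

def two_point_crossover(parent1, parent2):
    """Exchange the segment between two random cut points."""
    i, j = sorted(random.sample(range(1, len(parent1)), 2))
    return (parent1[:i] + parent2[i:j] + parent1[j:],
            parent2[:i] + parent1[i:j] + parent2[j:])

def mutate(bits, rate=0.015):
    """Flip each bit independently with the given probability."""
    return "".join(b if random.random() > rate else "10"[int(b)] for b in bits)

child1, child2 = two_point_crossover("110010101101", "001101010010")
print(mutate(child1), mutate(child2))
```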

For each parameter pj of the n parameters in the physical model of the combined hyperfine interaction, an interval [cj, dj] and a fixed number of bits Bj are defined. The parameter pj may take any value in the interval [cj, dj]. Each parameter pj corresponds to a binary string bj = bj1, bj2, …, bjBj. The number Bj is chosen according to the required resolution of the interval [cj, dj]. The concatenated binary strings bj form the genotype b1 | b2 | … | bn. For the initial population the genetic algorithm creates a random binary string of length Σ_{j=1}^{n} Bj for each genotype. Each genotype is decoded into its phenotype, i.e. parameter set, by the following linear transformation:

p_j = c_j + [(d_j − c_j) / (2^Bj − 1)] Σ_{i=1}^{Bj} b_ji 2^(i−1)

where:

p = (p1, p2, …, pn) is a complete parameter set of a Mossbauer spectrum,
pj is the value of the jth input parameter for the iteration method,
bji is the ith bit of the jth parameter,
Bj is the number of bits of the jth parameter,
cj, dj are the lower/upper limits of the jth parameter interval, respectively.
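
A minimal sketch of this decoding step (our own code; the interval choices in the example are hypothetical):

```python
def decode(genotype, intervals, n_bits):
    """Decode a concatenated bit string into parameter values.

    genotype  : string of 0/1 characters, the concatenated b_j
    intervals : list of (c_j, d_j) pairs
    n_bits    : list of B_j, one per parameter
    """
    params, pos = [], 0
    for (c, d), B in zip(intervals, n_bits):
        bits = genotype[pos:pos + B]
        pos += B
        # sum_i b_ji * 2^(i-1), with b_j1 as the least significant bit
        value = sum(int(b) << i for i, b in enumerate(bits))
        params.append(c + (d - c) / (2 ** B - 1) * value)
    return params

# e.g. two parameters, 8 bits each: an isomer shift in [0, 2] mm/s
# and a quadrupole splitting in [0, 4] mm/s
print(decode("1000000001000000", [(0.0, 2.0), (0.0, 4.0)], [8, 8]))
```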

The fitness of a single phenotype corresponds to the least-squares error χ² of the iteration method (see below). In fact the negative χ² was used, because the genetic algorithm maximizes the fitness; the fitness of a phenotype was therefore set to −χ². In addition, the fitness of a phenotype was also influenced by the areas of the subspectra: individuals with a negative area of a subspectrum were penalized by a constant value of −100, added to the −χ², because negative subspectra do not make any sense from a practical point of view.

2.2 The applied least-squares routine (iteration method IM)

The conventional part of our program has been set up according to the principles already mentioned in the Introduction. For a detailed description of the course of the necessary calculations the reader is referred to Barb [10]. The experimental Mossbauer spectrum with the intensities Y_obs,i (with i = channel number or relative velocity, respectively) is approximated by a spectral function Y_cal(x_i, π1, …, πn), with x_i = channel number and π1, …, πn = the parameters to be refined. The quantity to be minimized is the function

χ² = [1/(N − n)] Σ_{i=1}^{N} (Y_cal(x_i, π1, …, πn) − Y_obs,i)² w(x_i)

with N = total channel number, w(x_i) = 1/Y_obs,i (the weight of the ith value) and n = the number of refined parameters.
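
As a sketch (our own code, not the FORTRAN/C routine described in section 3), this figure of merit and the GA fitness built on it in section 2.1 look as follows:

```python
import numpy as np

def chi_squared(y_calc, y_obs, n_params):
    """chi^2 = sum_i (y_calc_i - y_obs_i)^2 * w_i / (N - n), w_i = 1/y_obs_i."""
    N = len(y_obs)
    return np.sum((y_calc - y_obs) ** 2 / y_obs) / (N - n_params)

def ga_fitness(y_calc, y_obs, n_params, subspectrum_areas):
    """GA fitness: -chi^2, with a -100 penalty for negative subspectrum areas."""
    fitness = -chi_squared(y_calc, y_obs, n_params)
    if any(a < 0 for a in subspectrum_areas):
        fitness -= 100.0  # penalise physically meaningless solutions
    return fitness
```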

For this purpose the spectral function Y_cal(x_i, π) (with π = π1, …, πn) was developed around the initial values π⁰ (zero approximation) up to first order:

F_l⁰ = Σ_{i=1}^{N} (Y_cal(x_i, π⁰) − Y_obs,i) (∂Y_cal(x_i, π)/∂π_l)|_{π⁰} w(x_i)

With the condition Σ_{m=1}^{n} F_lm⁰ (π_m¹ − π_m⁰) = F_l⁰, the first corrections π_m¹ to the initial values π_m⁰ are obtained. These corrections are used as the next starting values, and the iteration is continued until χ² reaches a minimum. The input parameters for the refinement procedure were already described in the Introduction.

2.3 The combination of GA and IM: the hybrid method HM

In the approach presented here, the iteration method for the combined hyperfine interaction is coupled with the genetic algorithm. The HM program takes three input files: the first contains the values of the measured Mossbauer spectra, the second contains the parameters of the relevant Mossbauer experiment and the intervals [cj, dj] for the parameters pj which are to be evolved, and the third contains the string lengths Bj.

The genetic algorithm is initialized with a random population, which can be interpreted as a set of Mossbauer parameter sets/individuals. Afterwards the fitness is determined for each of the individuals. The individuals with the best fitness values are selected for the recombination process (crossover, mutation), and the new population is the input for the recombination process of the next generation.

The genetic algorithm is stopped when a certain number of generations is exceeded. During a genetic algorithm run the best individuals, i.e. parameter sets, are recorded in a separate file; the criterion for recording a parameter set is a fixed limit on χ².

3 Experimental

The IM was originally programmed in FORTRAN, but for this project it was converted to C by a cross-compiler and modified to interface with the genetic algorithm. The GA was coded in C++. The development of the project took place on Sun workstations, and the GA runs were executed mostly on Sun SPARC Ultra machines. Depending on the parameters of the hybrid method, the execution of the program lasted from one to three days.

The hybrid method was extensively tested on already conventionally evaluated Mossbauer spectra which were published elsewhere, i.e. on extant problems. They are listed in the following with increasing complexity:

(1) Na-acmite powder, 1 symmetrical doublet (Table 1, [11])

(2) Li-acmite single crystal, 1 asymmetrical doublet (Table 1, Fig. 1a, Lottermoser et al. 1996)

(3) Neptunite single crystal, 3 asymmetrical doublets (Table 1, Fig. 1b, [13])

(4) Li-acmite powder, 1 symmetrical sextet (Table 1, Fig. 2a, [14, 15])

(5) Li-acmite single crystal, 1 asymmetrical sextet (Table 1, Fig. 2b, [14])

(6) Fayalite powder, 2 symmetrical multiplets, first parameter set (Table 1, Fig. 3a, [12])

(7) Fayalite single crystal, 2 asymmetrical multiplets (Table 1, Fig. 3b, [12])

In every case, conventionally calculated Mossbauer parameters are compared to the Mossbauer parameters calculated by the HM. The corresponding experimental and calculated (HM) spectra are displayed in Figs. 1-3.


Sample/method | δ      | Γ      | QS    | H(0)   | η     | Θ      | Φ     | β      | α   | χ²
1, cv.        | 0.375  | 0.276  | 0.281 | -      | -     | -      | -     | -      | -   | 0.71
   HM         | 0.3752 | 0.282  | 0.284 | -      | -     | -      | -     | -      | -   | 0.70
2, cv.        | 0.364  | 0.319  | 0.3"  | -      | -     | -      | -     | 4120   | -   | 0.40
   HM         | 0.364  | 0.319  | 0.3"  | -      | -     | -      | -     | 4120   | -   | 0.41
3, cv., I     | 1.072  | 0.312  | 2.135 | -      | -     | -      | -     | 652    | -   | 0.35
        II    | 1.052  | 0.2602 | 2.606 | -      | -     | -      | -     | 521    | -   |
        III   | 0.22   | 0.201  | 0.64" | -      | -     | -      | -     | 90"    | -   |
   HM,  I     | 1.072  | 0.334  | 2.145 | -      | -     | -      | -     | 662    | -   | 0.38
        II    | 1.062  | 0.2581 | 2.596 | -      | -     | -      | -     | 562    | -   |
4, cv.        | 0.4743 | 0.3946 | 0.557 | 53.212 | -     | 126.35 | -     | -      | -   | 0.89
   HM         | 0.4763 | 0.3936 | 0.556 | 53.212 | -     | 53.85  | -     | -      | -   | 0.88
5, cv.        | 0.4746 | 0.351  | 0.579 | 49.814 | -     | 125.45 | -     | 47.48  | -   | 0.71
   HM         | 0.4736 | 0.286  | 0.579 | 49.814 | -     | 54.65  | -     | 132.68 | -   | 0.70
6, cv., I     | 1.2405 | 0.2826 | 3.132 | 21.244 | 0.322 | 72.99  | 90    | -      | -   | 2.88
        II    | 1.292  | 0.260  | 3.027 | 9.84   | 0.8"  | 3.1x   | 90    | -      | -   |
   HM,  I     | 1.2405 | 0.2816 | 3.142 | 21.244 | 0.322 | 72.99  | 90    | -      | -   | 2.87
        II    | 1.292  | 0.259  | 3.027 | 9.74   | 0.8"  | 3"     | 90    | -      | -   |
7, cv., I     | 1.251  | 0.318  | 3.125 | 22.11  | 0.67  | 1052   | 618   | 401    | 115 | 3.62
        II    | 1.292  | 0.368  | 2.987 | 10.16  | 0.91  | 4"     | 162   | 682    | 90  |
   HM,  I     | 1.241  | 0.3277 | 3.135 | 22.01  | 0.53  | 1062   | 24612 | 1393   | 65  | 3.17
        II    | 1.292  | 0.349  | 2.996 | 10.05  | 0.91  | 4      | 146   | 68     | 81  |

Table 1. Parameters of Mossbauer spectra calculated conventionally (cv.) or with the hybrid method (HM), arranged with increasing complexity: Na-acmite powder (1), Li-acmite single crystal sc (2), neptunite sc, 400 K (3), Li-acmite powder, 11 K (4), Li-acmite sc, 10 K (5), fayalite powder, 50 K (6), fayalite sc, 50 K (7). Roman numerals indicate the corresponding subspectra. Mossbauer parameters and correspondence parameter χ² as defined in the text. Errors are given in smaller digits; where the error exceeds the parameter value, an "x" is marked instead.


[Figures 1-3 (image panels, absorption plotted against velocity in mm/s): Fig. 1a: Li-acmite single crystal, RT. Fig. 1b: Neptunite single crystal, 400 K. Fig. 2a: Li-acmite powder, 11 K. Fig. 2b: Li-acmite single crystal, 10 K. Fig. 3a: Fayalite powder, 50 K. Fig. 3b: Fayalite single crystal, 50 K.]

Figs. 1-3. Mossbauer spectra of samples with different complexity corresponding to the parameters of Table 1. The calculated total intensities are represented by solid lines, the observed ones by dots; subspectra are indicated either by a solid or a broken line.


4 Results

A run of the hybrid method lasts from one to three days, depending on the hardware used, without any input from the user. The long duration of an HM run is due to the complex evaluation of the fitness function, in other words to the time-intensive evaluation of the IM. The number of iterations for the IM can be varied from 25 up to 75; generally speaking, complex Mossbauer spectra require more iterations than simple ones. Other, more sophisticated demonstrations of "fuzzy logic" concern the GA part of the program.

We made a comparison of different selection schemes; the results are shown in Fig. 4.

[Figure 4 (image): fitness (0.00 to 0.06) plotted against the number of generations (0 to 70) for the different selection schemes.]

Fig. 4. The fitness of different selection schemes (random, proportional (PS), tournament (TS) and exponential ranking (ERk)) as a function of the number of generations

The binary tournament selection (TS) performs best in comparison with other schemes, especially from the point of view of rapid convergence. This is certainly true for the M6ssbauer application presented here but may not be valid for other concrete refinement problems.

On the whole, we found that the following GA parameter values provided a good compromise between low evaluation times, complexity of spectra and goodness of fit (they are collected in the configuration sketch after this list):


- a population size of 150 (= number of individuals, i.e. parameter sets). As the processing time increases linearly with this number, it should be chosen very carefully. Generally speaking, a setting of, say, 100 is possible, but is not recommended for complex spectra.

- a crossover rate of 0.8 (lower and upper limits are 0 and 1, respectively). At comparably high values the individuals exchange more genetic information, which favours new combinations, i.e. the chance of finding suitable Mossbauer parameter values increases. This is especially valid for difficult problems; in simpler cases the rate may be lowered.

- a mutation rate of 0.015 (lower and upper limits again 0 and 1, respectively). Higher values turned out to be a failure.
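
Collected together, the recommended settings can be written as a small configuration sketch (the dictionary keys are our own naming, not the authors' program input format):

```python
ga_config = {
    "population_size": 150,   # individuals (parameter sets); time scales linearly
    "crossover_rate": 0.8,    # two-point crossover
    "mutation_rate": 0.015,   # higher values proved a failure
    "max_generations": 400,   # practical stopping criterion
    "selection": "binary_tournament",
}
```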

Concerning the estimated resolution of the initial parameter values (which can be preselected in our program), rather low values proved to be useful. For a possible explanation of this unexpected feature the reader is referred to the Discussion.

From a rough inspection of Table 1 it may be stated that most of the conventionally evaluated Mossbauer parameters were well reproduced by the hybrid method within error bounds. In some cases, where two or more mathematical solutions were possible, the hybrid method found both. Only in the case of the neptunite single crystal did the HM not detect the third subspectrum; this subspectrum was found in the powder case and proved to be rather low in area, so that the total spectrum could equally well be refined with only two subspectra.

The genetic algorithm thus depends, like the conventional routines, on a certain resolution of the superimposed subspectra. The unrefined central intensity in Fig. 2 is presumed to be due to a relaxation peak, which could be fitted neither by the conventional routine nor by the HM. The application of constraints to the genetic algorithm (a very common device in Mossbauer spectrum refinement) turned out to be detrimental to the finding of solutions; nevertheless, the hybrid method normally offers a multitude of 'mathematical' solutions (including the 'constrained' result) among which the user has to find the correct one by physical reasoning.

Differences in some angle parameters between the conventional and hybrid methods are due to the fact that in the Hamiltonian matrix these angles enter through their sine or cosine functions, so that they are not unequivocal: adding or subtracting multiples of 90 degrees may lead to the same result. Taking this property into account, the angle values of Table 1, bottom, may easily be transformed into each other.

Especially in the more complex cases with two magnetic subspectra, the hybrid method was by far the faster: three days at maximum, without any intervention by the researcher, in contrast to several weeks of successive evaluations with the conventional method. This was also confirmed by tests with other fayalite single crystal spectra (from other sections), which are not cited here.

5 Discussion

The present study should be seen in connection with a recently published work on the application of a genetic algorithm to the fitting of Mossbauer spectra [2]. In this earlier publication it was stated that "only the spectral model, i.e. the peak positions of each subspectrum have to be determined by the analyst".

In our opinion, however, the main analytical work of a scientist in this field consists precisely of the assignment of the peaks to the different subspectra and the determination of the respective peak positions.

We state that in most cases in the literature it is nearly impossible to say at a first glance (or even at a second) how a heavily superimposed spectrum should be decomposed into the different subspectra. This may only be feasible for the examples of the earlier study, though we suspect that these are not really representative, because of their very good line resolution together with very similar parameters for the different subspectra.

Moreover, the magnetic subspectra show only negligible quadrupole splittings. So we may conclude that the spectra displayed in Ahonen et al. [2] do not represent combined hyperfine interactions, and may equally be fitted by conventional routines without a genetic algorithm.

A second point of criticism is that, according to the authors' statements, the refinement procedure is merely geometrical; the physical parameters are calculated afterwards from the detected positions. This may cause severe problems: if the model is not correct, there may be artefacts, e.g. unrealistically large isomer shifts caused by selecting the wrong pair of peaks. So in our opinion only a combination of a genetic algorithm with a calculation of the original Mossbauer parameters can be successful in practical cases.

The hybrid method here is designed according to this premise: the input parameters are the quantities cited above, of common importance in Mossbauer spectroscopy. These parameters are put in with broad boundaries, e.g. distinguishing the characteristic range of the isomer shift for Fe²⁺ and Fe³⁺, respectively, or for their hyperfine fields. This information may easily be obtained from crystallographic data and a rough inspection of the experimental spectra.

As already mentioned in the Results, the input resolution values should be comparably low in order to give good results. Normally one would expect a high resolution to raise the probability of detecting a good individual, i.e. parameter set; the contrary is true. We explain this unexpected behaviour by a trivial analogy: imagine two groups of balloons of strongly different sizes floating under a ceiling. Is it easier to hit, with a dart, one specimen in the smaller group of big balloons, or one in the bigger group of small balloons? We state that the former is true. In the first group the probability for the dart (i.e. a well-fitting individual) to sideslip at a balloon shell (i.e. a given parameter limit) and "bounce off" is much lower than in the second case. Hence, with comparably low resolution, a promising individual has a better chance to develop towards a good fit than one near a boundary, the latter being more probable at high resolution.

Generally, it can be said that in the cases of increasing complexity mentioned above the hybrid method detected all conventionally evaluated solutions. In the most difficult example (fayalite single crystal, Table 1, bottom, Fig. 3b) the time saving was around a factor of 20.

Another advantage of the described method is the fact that in ambiguous cases a multitude of different equivalent solutions may be offered in one run, which the user can check against physical or crystallographic requirements. This is hardly possible using conventional refinement programs, where a certain solution is often very "stable", even if only a side minimum of the error function has been detected. A disadvantage of the presented method lies in the fact that comparably small subspectra with minor influence on the total spectrum may not be detected, as in the case of neptunite (Table 1, Fig. 1b). But this is a common problem of most refinement programs.

Another inconvenience is that the hybrid method routine currently requires a powerful workstation (Sun UltraSPARC), but a PC version of the program is at present in preparation.


6 Conclusions

The hybrid method presented here is a distinct improvement on the algorithm published by Ahonen et al. [2], as practically occurring, very complex spectra can be evaluated with a minimum of input data and without the necessity of any user interaction during analysis. The input values consist of conventional Mossbauer parameters and do not need to be processed afterwards.

The applications of genetic algorithms are not confined, however, to the special field presented here. They may be used wherever a non-trivial spectral function should be adjusted to a complex experimental dependence.

A rather similar scientific problem to the case treated here is the evaluation of powder and single crystal diffractograms: the multitude of Bragg reflections for a given crystallized sample depends in a complex manner on the diffraction angle (which is significant for the metric of the relevant elementary cell) and on the measured intensity (which is characteristic of the atomic components and their relative positions within the unit mesh). Parameters to be refined are, e.g., lattice constants, fractional coordinates of the atoms or ions, temperature factors and so on.

The problem of losing the phase information in a diffraction peak is commonly met by different methods, perhaps the most widespread of which is the construction of calculated intensities from trial-and-error atomic positions. This procedure could be easily replaced by a GA-based algorithm - the processing performance, however, must then be enhanced considerably compared to the application presented here. The latter certainly forms the limitation for possible other examples of use. The advantage of getting a set of solutions rather "automatically" with a high probability of obtaining all possible ones, may be compensated by the disadvantage of lengthy evaluation periods. But as computer power doubles annually, this limit may diminish rather quickly for the problem under consideration.

For users in the Mossbauer field, a commonly accessible internet version of the hybrid method program, with a convenient web interface, is at present available at the web address http://www.users.sbg.at/~moe.


Acknowledgements

The authors would like to thank G. J. Redhammer for contributing the Na-acmite input data and results. We are indebted to the Austrian "Fonds zur Förderung der wissenschaftlichen Forschung" for funding this project under contract number P11727-GEO.

References

1. Varret F, Teillet J (1983) Mode d'emploi du programme VARFIT, Université du Mans.

2. Ahonen H, de Souza PA Junior, Garg VK (1997) A genetic algorithm for fitting Lorentzian line shapes in Mossbauer spectra. NIM B 124: 633-638

3. Mayer H (1997) ptGAs - Genetic Algorithms Using Promoter/Terminator Sequences: Evolution of Number, Size, and Location of Parameters and Parts of the Representation. PhD thesis, University of Salzburg, pp 8-27

4. Brindle A (1981) Genetic Algorithms for Function Optimization. PhD thesis, University of Alberta, p 93

5. Goldberg DE, Deb K (1991) A Comparative Analysis of Selection Schemes Used in Genetic Algorithms. In: Rawlins GJE (ed) Foundations of Genetic Algorithms. Morgan Kaufmann, San Mateo, CA, p 69

6. Holland JH (1975) Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, Mass.

7. Hancock PJB (1994) An empirical comparison of selection methods in evolutionary algorithms. In: Fogarty TC (ed) Evolutionary Computing. Springer Verlag, Berlin, p 80

8. Blickle T, Thiele L (1995) Computer Engineering and Communication Networks Lab (TIK), TIK-report No. 11, version 2, 2nd edition, ETH-Zürich.

9. Dejong K (1975) The Analysis and Behavior of a Class of Genetic Adaptive Systems. PhD thesis, University of Michigan, p 77

10. Barb D (1980) Grundlagen und Anwendungen der Mössbauer-Spektroskopie. Akademie Verlag, Berlin, p 118

11. Redhammer GJ (1996) Untersuchungen zur Kristallchemie und Kristallphysik von synthetischen Klinopyroxenen im System Hedenbergit-Akmit CaFe²⁺Si₂O₆-NaFe³⁺Si₂O₆. PhD thesis, University of Salzburg

12. Lottermoser W, Forcher K, Amthauer G, Fuess H (1995) Powder- and Single Crystal Mossbauer Spectroscopy on Synthetic Fayalite. Phys Chem Minerals 22: 259 - 267

13. Lottermoser W, Forcher K, Amthauer G, Kunz M, Armbruster T (1997) Site occupation and electric field gradient in acentric neptunite: measurements and evaluations concerning powder- and single crystal-Mossbauer spectroscopy and X-ray diffractometry. Phys Chem Minerals 24: 2-6


14. Lottermoser W, Redhammer GJ, Forcher K, Amthauer G, Paulus W, Andre G, Treutmann W (1998) Single crystal Mossbauer and neutron powder diffraction measurements on the synthetic clinopyroxene Li-acmite LiFeSi₂O₆. Z. Kristallographie 213: 101-107

15. Baum E, Treutmann W, Behruzi M, Lottermoser W, Amthauer G (1988) Structural and magnetic properties of the clinopyroxenes NaFeSi₂O₆ and LiFeSi₂O₆. Z. Kristallographie 183: 273-284


Soft Computing, Molecular Orbital, and Density Functional Theory in the Design of Safe Chemicals

Les Sztandera1,*, Mendel Trachtman2, Charles Bock2, Janardhan Velga3, Ashish Garg1

Summary: This research focuses on the use of soft computing to aid in the development of novel, state-of-the-art, non-toxic dyes which are of commercial importance to the U.S. textile industry. Where appropriate, modern molecular orbital (MO) and density functional (DF) techniques are employed to establish the necessary databases of molecular properties to be used in conjunction with the neural network approach. In this research we focused on: 1) using molecular modeling to establish databases of various molecular properties of azo dyes required as input for our neural network approach; 2) designing and implementing a neural network architecture suitable to process these databases; and 3) investigating combinations of molecular descriptors needed to predict various properties of the azo dyes.

Keywords: Fuzzy entropy, Feed-forward neural networks, Molecular modeling, Density Functional Theory

1 Computer Information Systems Department
2 Chemistry Department
3 School of Textiles and Materials Technology
Philadelphia University, Philadelphia, PA 19144, USA
* To whom all correspondence should be addressed


1 Introduction

This research involves the integration of fuzzy entropies (used in the context of measuring uncertainty and information) with computational neural networks. An algorithm for the creation and manipulation of fuzzy entropies, extracted by a neural network from a data set, is designed and implemented. The neural network is used to find patterns in terms of structural features and properties that correspond to a desired level of activity in various azo dyes. Each molecule is described by a set of structural features, a set of physical properties and the strength of some activity under consideration. After developing an appropriate set of input parameters, the neural network is trained with selected molecules, then a search is carried out for compounds that exhibit the desired level of activity. High level molecular orbital and density functional techniques are employed to establish databases of various molecular properties required by the neural network approach.

The structural and electronic properties of the positional isomers of monomethoxy-4-aminoazobenzene (n-OMe-AAB) have been investigated using density functional theory with a basis set that includes polarization functions on all the atoms. These aminoazo dyes are of interest because their carcinogenic activities depend dramatically on the position (n) of the methoxy group: e.g. 3-OMe-AAB is a potent hepatocarcinogen in the rat, whereas 2-OMe-AAB is a noncarcinogen. Although the various isomers of OMe-AAB require metabolic activation via N-hydroxylation prior to reaction with cellular macromolecules, we have shown that there are structural and electronic features present in these isomers that correlate with their carcinogenic behavior.

3-Methoxy-4-aminoazobenzene (3-OMe-AAB) is a potent hepatocarcinogen in the rat [1]. This aminoazo dye requires metabolic activation to N-hydroxy-3-methoxy-4-aminoazobenzene (N-OH-3-OMe-AAB) prior to reaction with cellular macromolecules [2]. This conclusion is in accord with the observation that 3-OMe-AAB is mutagenic in the Ames' Salmonella system only after treatment with S-9, the 9,000 g supernatant fraction of liver homogenate, whereas N-OH-3-OMe-AAB is strongly mutagenic without S-9 treatment [3,4]. Interestingly, changing the position of the methoxy group on the phenyl rings dramatically influences the carcinogenic behavior of the resulting compound [5]. For example, 2-OMe-AAB is noncarcinogenic in rats whereas 4'-OMe-AAB is carcinogenic, but to a lesser degree than 3-OMe-AAB. The carcinogenic potency of 2- and 4'-OMe-AAB correlates well with their mutagenic activity in the Ames' Salmonella test, where neither 2-OMe-AAB nor its N-hydroxy derivative, N-OH-2-OMe-AAB, is mutagenic even after treatment with S-9; 4'-OMe-AAB is very slightly mutagenic on Salmonella (TA98) and N-OH-4'-OMe-AAB is definitely mutagenic without S-9 treatment [6]. Unfortunately, the carcinogenic/mutagenic activities of the remaining monomethoxy derivatives of 4-aminoazobenzene and their N-hydroxy analogs have not been reported [7]⁴. For comparison, we note that the parent compound, 4-aminoazobenzene, is only weakly carcinogenic in rats, nonmutagenic on Salmonella (TA98) with or without S-9 treatment, and mutagenic on Salmonella (TA100) only in the presence of S-9 [6].

Although it is not entirely clear why there is such a radical difference in the carcinogenic behavior of 2- and 3-OMe-AAB, Kojima et al. [1] have determined that N-OH-3-OMe-AAB has a significantly greater effect than N-OH-2-OMe-AAB on DNA synthesis in vivo. This suggests that the observed differences in the carcinogenic activity of 2-OMe-AAB and 3-OMe-AAB may be linked to differences in the inhibitory effects of their N-hydroxy derivatives on DNA replication. Hashimoto et al. [8] have established that the cytochrome P-450 enzymes efficiently catalyze the mutagenic activation of 3-OMe-AAB and that, in contrast to other carcinogenic aromatic amines, the activation is mediated by phenobarbital-P-450 rather than by 3-methyl-cholanthrene-P-450.

Despite significant interest in the carcinogenic behavior of the various positional isomers of OMe-AAB, relatively little is known about their structural or electronic properties. No experimental results from X-ray or electron diffraction studies have been reported for any of the OMe-AAB isomers [9]⁵. Furthermore, no high-level computational results that compare the various OMe-AAB isomers using either molecular orbital or density functional theory calculations are currently available in the literature. It is important to note that substitution at the 2- and 6-positions, or at the 3- and 5-positions, in 4-aminoazobenzene is not equivalent; see Figure 1.

However, it is not evident that distinctions of this type were considered in the carcinogenic/mutagenic studies involving 2- and 3-OMe-AAB [1-5]. Thus, it is probably more appropriate to describe these studies as involving methoxy substitution at the meta and ortho positions respectively.

4 It is known that N,N-dimethyl-3'-OMe-AAB is carcinogenic.

5 Only a few experimental structures of azo dyes have been reported: o-Aminoazotoluene (X-ray), Kurosaki S., Kashino S., Haisa M. (1976), Acta Cryst., B32, p. 3160; Disperse Red 167 (X-ray), Freeman H.S., Posey J.C. Jr., Singh P. (1992), Dyes and Pigm., 20, p. 279; C.I. Disperse Yellow 86 (X-ray), Lye J., Hink D. and Freeman H.S. (1997), Comp. Chemistry applied to synthetic dyes.



The purpose of this chapter is to describe the results of an extensive computational study using density functional theory (DFT) to establish the conformational preferences and relative energies of the positional isomers of OMe-AAB. Our goal is to identify any electronic and/or structural features that may be present among these positional isomers that can be correlated with their diverse carcinogenic behaviors and lead to a better understanding of the underlying molecular mechanism(s) involved.

Fig 1. The structures, coordinate system and numbering conventions for A. Azobenzene (AB) and B. 4-Aminoazobenzene (AAB). The bond lengths (Å) shown were calculated at the BP/DN**//BP/DN** computational level.



2 Computational Methods

Density functional calculations were performed at the BP/DN** computational level with SPARTAN v5.0 on Silicon Graphics computers [10]. This level uses the nonlocal Becke-Perdew (BP86) functional and employs the numerically defined DN** basis set, which includes polarization functions on all the atoms [11,12]. Complete optimizations for a variety of conformers of each OMe-AAB derivative were carried out; no symmetry constraints were employed, in order to minimize the likelihood of optimizing to a transition state. In a few cases frequency analyses were performed to ensure that the optimized structures were local minima on the potential energy surfaces (PESs). The graphics utilities of SPARTAN were used to examine the electron densities, electrostatic potentials and various Kohn-Sham orbitals for each conformer. Mulliken and electrostatic charges were also calculated.

2.1 Molecular Modeling Results and Discussion

Since no experimental structural data are available even for the parent compounds azobenzene (AB) and 4-aminoazobenzene (AAB), we first optimized these molecules at the BP/DN** computational level. The initial structures of both AB and AAB were taken as nearly trans about the azo linkage [13]⁶ and, in the case of AAB, the amino group was taken as pyramidal with both hydrogen atoms on the same side of the ring [14]⁷. The optimized structure of AB is found to be planar and a frequency analysis confirms that this is a local minimum on the PES. As can be seen in Figure 1, the azo linkage in AB significantly distorts the carbon-carbon bond lengths in the phenyl rings compared to their values in benzene, where the carbon-carbon bond distances are 1.403 Å at this computational level. The length of the N=N bond in AB, 1.270 Å, suggests considerable electron delocalization; the N=N bond lengths in CH3-N=N-CH3 and CH3-N=N-C6H5 are shorter, 1.250 Å and 1.258 Å respectively. In the optimized structure of AAB, the two phenyl rings are practically planar and nearly coplanar with each other, see Figure 1. A frequency analysis confirms that this is a local minimum on the PES. To a large extent, the calculated structural parameters of AAB show that the amine group at the 4-position reinforces the geometrical changes already induced by the azo linkage. This is a consequence of electron delocalization using the lone pair of electrons from the amine nitrogen atom to give the C4-N bond partial double bond character, which results in predictable adjustments of the bond lengths in the remainder of the molecule. For comparison, we note that the lengths of the C-N bonds in CH3-NH2 and C6H5-NH2 are 1.478 Å and 1.408 Å respectively, considerably longer than that found in AAB, 1.388 Å. Nevertheless, the structure at the amine nitrogen in AAB remains pyramidal; the sum of the three bond angles is 346.7° at this computational level, compared to 318.4° and 325.6° for NH3 and NH2-CH3 respectively.

6 A second conformer of AB, with the phenyl rings twisted some 80°, was found to be 2.3 kcal/mol higher in energy at the BP/DN** computational level.

7 We also optimized a conformer of AAB in which the hydrogen atoms bonded to the amine nitrogen atom are on opposite sides of the ring but otherwise the structure is planar. This conformer is 8.0 kcal/mol higher in energy than the form shown in Figure 1. The optimized structure of a completely planar form of AAB is a transition state that is 0.15 kcal/mol higher in energy than the lowest energy form shown in Figure 1.

It is of interest to compare a few of the Kohn-Sham molecular orbitals of AB and AAB. The highest occupied molecular orbital (HOMO) in AB is a lone-pair orbital localized primarily in the vicinity of the azo linkage, which is 14.4 kcal/mol above the next highest occupied orbital (HOMO{-1}), a delocalized pi-bonding orbital. The lowest unoccupied molecular orbital (LUMO) in AB is a pi-antibonding orbital, some 46.7 kcal/mol above the HOMO. In AAB the HOMO also involves the azo lone-pair electrons, see Figure 2. It is nearly identical in shape to the HOMO found in AB, but it is 7.4 kcal/mol higher in energy. The HOMO{-1} in AAB involves the lone pair of electrons on the amine nitrogen atom, see Figure 2, but otherwise it is similar in shape to the HOMO{-1} (pi-bonding orbital) in AB. However, this HOMO{-1} is 18.6 kcal/mol above its counterpart in AB, reducing the energy gap between the two highest occupied orbitals in AAB to only 3.2 kcal/mol. The LUMO in AAB is similar in shape to the pi-antibonding LUMO in AB except that it includes a contribution from the amine nitrogen atom, see Figure 2. The energy gap between the HOMO and LUMO is 47.5 kcal/mol.

Two general types of conformers were considered for each of the positional isomers of monomethoxy-AAB: one in which the O-Me bond is essentially in the nominal plane of the phenyl ring to which it is bonded, and the other in which the O-Me bond is nearly perpendicular to this ring. For all nine isomers, the conformers in which the O-Me bond lies essentially in the ring plane are found to be lower in energy at the BP/DN**//BP/DN** computational level; the energies for these conformers are listed in Table 1 along with those for AB and AAB. Selected geometrical parameters and properties of the various OMe-AAB isomers calculated at the BP/DN** level are listed in Tables 2 and 3.

Fig. 2. The HOMO{-1}, HOMO, and LUMO of 4-aminoazobenzene (AAB) calculated at the BP/DN**//BP/DN** computational level. (The LUMO lies 47.5 kcal/mol above the HOMO (azo), which in turn lies 3.2 kcal/mol above the HOMO{-1} (amino).)



Since several orientations of the methoxy methyl group are possible, their positions for the lowest energy conformers at the BP/DN** level are shown in Figure 3. It should be noted that the energy differences between some of the conformers of these monomethoxy isomers can be quite small. For example, rotating 180° about the C-O bond in 4'-OMe-AAB and reoptimizing yields a structure only 0.3 kcal/mol higher in energy, whereas a conformer with the methyl group nearly perpendicular to the ring is 3.8 kcal/mol higher in energy.

Fig 3. The orientation of the methyl group for the positional isomers of monomethoxy-4-aminoazobenzene calculated at the BP/DN**//BP/DN** computational level. (The orientation of the methyl group in 6-OMe-AAB is analogous to that for 2-OMe-AAB, etc.)



2.1.1 3- and 5-OMe-AAB

As can be seen from Table 1, 3-OMe-AAB has the lowest total molecular energy among all the positional isomers at the BP/DN**//BP/DN** computational level. However, the other ortho derivative, 5-OMe-AAB, is less than 1 kcal/mol higher in energy. As expected, the presence of a methoxy group ortho to the amine group perturbs the pattern of carbon-carbon bond lengths in the phenyl rings compared to those in AAB. The most prominent changes occur at the point of attachment, e.g. in the case of 3-OMe-AAB, the length of the shorter bond (C2-C3) decreases while the length of the longer bond (C3-C4) increases, see Table 2. As might be expected, the lengths of the bonds in the unsubstituted phenyl ring are not significantly altered by the presence of the methoxy group. In both 3- and 5-OMe-AAB the length of the C4-N bond is shorter than that found in AAB and the amine group is less pyramidal, see Table 2. This suggests a further delocalization of the lone-pair electron density on the amine nitrogen atom; the calculated Mulliken and electrostatic charges on this nitrogen are less negative than those found in AAB. The C-O bond lengths in the 3- and 5-isomers, 1.378 Å and 1.380 Å respectively, are longer than those calculated for the other positional isomers, see Table 2. In order to examine the effect of the amine group on the length of the C-O bond, we replaced the amine group in 3-OMe-AAB with a hydrogen atom and reoptimized the structure. In the resulting 3-OMe-AB compound (as well as in MeO-C6H5) the C-O bond length, 1.373 Å, is slightly shorter than that found in 3-OMe-AAB. Thus, the amine group at the 4-position tends to impede the delocalization of lone-pair density on the methoxy oxygen atom in 3-OMe-AAB.

The presence of a methoxy group ortho to the amine group in AAB has an interesting effect on the two highest occupied Kohn-Sham molecular orbitals. The orbital localized at the azo linkage in 3(5)-OMe-AAB is similar in shape to, and only 0.9 (1.5) kcal/mol higher in energy than, the corresponding orbital in AAB; the in-plane lone pair on the methoxy oxygen is represented in this orbital, but only to a small extent. The orbital involving the amine lone pair of electrons, which includes a significant contribution from the out-of-plane oxygen lone pair, is 5.3 (5.7) kcal/mol higher in energy than the corresponding orbital in AAB. These relatively large increases in energy make this orbital the HOMO in both 3- and 5-OMe-AAB.



Table 1. Total Molecular Energies (a.u.) of n-Methoxy-4-Aminoazobenzene calculated at the BP/DN**//BP/DN** Computational Level.

n      Total Molecular Energy (a.u.)   Relative Energy (kcal/mol)
2      -742.931442                     +6.7
3      -742.942450                      0.0
5      -742.941042                     +0.9
6      -742.936054                     +4.0
2'     -742.935475                     +4.4
3'     -742.940108                     +1.5
4'     -742.940449                     +1.3
5'     -742.941441                     +0.6
6'     -742.930842                     +7.3
AAB    -628.365269                     -
AB     -572.974339                     -

It is important to note that methoxy substitution at each of the positions on the phenyl rings increases the energy of this orbital, but the increases for the 3- and 5-isomers are more than triple the smallest increase we observed, 1.7 kcal/mol for 5'-OMe-AAB. The relatively large increase in the energy of this orbital appears to be a result of electron overcrowding involving the proximate lone pairs on the methoxy oxygen and amine nitrogen atoms. The shape of the LUMO in 3(5)-OMe-AAB is quite similar to the LUMO in AAB, having relatively little contribution from the out-of-plane lone-pair orbital on the methoxy oxygen atom.



Furthermore, the energies of the LUMOs in both the 3- and 5-isomers are less than 2 kcal/mol above the LUMO in AAB. The energies separating the HOMO and LUMO in 3- and 5-OMe-AAB, 47.2 and 46.5 kcal/mol, are just slightly lower than that found in AAB.

2.1.2 2- and 6-OMe-AAB

The positional isomers 2- and 6-OMe-AAB are 6.7 and 4.0 kcal/mol higher in energy than the lowest energy isomer, 3-OMe-AAB, at the BP/DN**//BP/DN** computational level, see Table 1. The length of the C4-N bond in both of these isomers is slightly longer than that found in AAB and the amine group is more pyramidal, see Table 2. The calculated Mulliken and electrostatic charges on the amine nitrogen atom in the 2- and 6-isomers are nearly the same as those in AAB. However, the C-O bond lengths in 2(6)-OMe-AAB, 1.356 Å (1.362 Å), are some 0.02 Å shorter than the corresponding bond lengths in 3(5)-OMe-AAB. This indicates further delocalization involving the oxygen out-of-plane lone pair, which gives the C-O bond additional double bond character. The Mulliken charge on the oxygen atom in 2-OMe-AAB is not as negative as that found in 3-OMe-AAB. To examine the effect of the amine group on the length of the C-O bond in 2-OMe-AAB, we optimized the structure of 2-OMe-AB. The length of the C-O bond increases, but only by about 0.002 Å; this change, however, is in a direction opposite to what we observed in going from 3-OMe-AAB to 3-OMe-AB. The shorter C-O bond in 2(6)-OMe-AAB results in an elongation of both carbon-carbon bonds in the ring at the point of methoxy attachment when compared to that in AAB, see Table 2. Again, there are no significant changes in the carbon-carbon bond lengths in the unsubstituted phenyl ring compared to those in AAB.

The presence of a methoxy group meta to the amine group in AAB alters the two highest occupied Kohn-Sham orbitals differently than when the replacement occurs at an ortho position. In particular, the azo lone-pair orbital in 2(6)-OMe-AAB is 10.0 (5.3) kcal/mol higher in energy than the corresponding orbital in AAB, but only 0.9 (1.5) kcal/mol higher for substitution at the 3(5)-position. This orbital involves contributions from the azo nitrogen lone pairs and from the in-plane methoxy oxygen lone pair; its relatively large increase in energy is clearly the result of adverse lone-pair interactions in the region. The particular geometrical arrangement of atoms in the vicinity of the trans azo linkage causes the electron overcrowding in this region to be more severe for methoxy substitution at the 2-position than at the 6-position. This leads to a greater increase in the energy of the azo lone-pair orbital in 2-OMe-AAB and results in the largest energy gap between the two highest occupied orbitals we observed in this study, 9.0 kcal/mol. The orbital involved with the amine nitrogen lone pair in 2(6)-OMe-AAB is similar in shape to the corresponding orbital in AAB, although it includes a contribution from the out-of-plane oxygen lone pair. Its energy is raised to a slightly lesser extent than it is when the methoxy group is at an ortho position. Thus, for 2(6)-OMe-AAB and AAB the orbital involving the azo lone pair is higher in energy than the orbital involving the amine nitrogen lone pair, whereas for 3(5)-OMe-AAB the order of these two orbitals is reversed. It is also interesting to note that the energy gap between the two highest occupied molecular orbitals in 2(6)-OMe-AAB, 9.0 (3.6) kcal/mol, is greater than that in AAB, 3.2 kcal/mol, and in 3(5)-OMe-AAB, 1.3 (1.1) kcal/mol. The LUMO in both the 2- and 6-isomers is similar in shape to that in AAB, but involves a significant contribution from the out-of-plane lone-pair orbital on the methoxy oxygen atom. The LUMO energies of 2- and 6-OMe-AAB are higher than those of 3- and 5-OMe-AAB, whereas the separations in energy between the HOMO and LUMO are smaller, 44.1 and 45.6 kcal/mol respectively.

Table 2. Structural Parameters (bond lengths (Å), bond angles (°)) of AB, AAB and n-OMe-AAB.

n     C1-C2  C2-C3  C3-C4  C4-C5  C5-C6  C6-C1  C4-N   C1-N   N=N
2     1.436  1.398  1.409  1.409  1.386  1.410  1.391  1.396  1.278
3     1.416  1.382  1.428  1.404  1.393  1.403  1.380  1.402  1.278
5     1.407  1.388  1.409  1.421  1.388  1.412  1.384  1.404  1.275
6     1.409  1.384  1.415  1.408  1.400  1.427  1.390  1.396  1.277
2'    1.411  1.384  1.415  1.410  1.392  1.407  1.392  1.406  1.277
3'    1.411  1.383  1.416  1.411  1.390  1.408  1.385  1.405  1.274
4'    1.412  1.385  1.415  1.410  1.390  1.407  1.390  1.404  1.276
5'    1.411  1.384  1.416  1.410  1.389  1.408  1.388  1.401  1.274
6'    1.412  1.383  1.414  1.410  1.392  1.406  1.392  1.412  1.277
AAB   1.412  1.384  1.416  1.411  1.390  1.406  1.388  1.405  1.274
AB    1.409  1.390  1.405  1.399  1.396  1.406  -      1.417  1.270

n     C1'-N  C1'-C2' C2'-C3' C3'-C4' C4'-C5' C5'-C6' C6'-C1' C-O    O-Me   Σ angles(a)
2     1.414  1.406   1.395   1.399   1.404   1.392   1.408   1.356  1.430  344.5
3     1.413  1.405   1.396   1.399   1.404   1.392   1.409   1.378  1.433  348.9
5     1.417  1.406   1.396   1.398   1.403   1.392   1.410   1.380  1.431  346.8
6     1.416  1.406   1.397   1.399   1.404   1.392   1.410   1.362  1.430  346.0
2'    1.404  1.425   1.404   1.398   1.401   1.392   1.406   1.364  1.431  344.6
3'    1.416  1.408   1.398   1.406   1.398   1.395   1.405   1.374  1.433  349.2
4'    1.411  1.408   1.389   1.406   1.408   1.391   1.406   1.370  1.432  345.7
5'    1.415  1.402   1.399   1.393   1.411   1.394   1.411   1.373  1.432  347.2
6'    1.401  1.410   1.390   1.397   1.398   1.404   1.432   1.359  1.433  345.2
AAB   1.416  1.406   1.397   1.399   1.404   1.392   1.409   -      -      346.7
AB    1.417  1.406   1.396   1.399   1.405   1.390   1.404   -      -      -

(a) Sum of the three bond angles at the amine nitrogen atom.

2.1.3 3'- and 5'-OMe-AAB

The positional isomers 3'(5')-OMe-AAB are only 1.5 (0.6) kcal/mol higher in energy than 3-OMe-AAB, see Table 1. Interestingly, methoxy substitution at the 3'- or 5'-position has very little effect on the carbon-carbon bond lengths in either of the phenyl rings when compared to those in AAB, see Table 2. The C4-N bond length in 3'-OMe-AAB is slightly shorter than that found in AAB, while that of 5'-OMe-AAB is nearly the same as in AAB. The small differences in the geometrical parameters and the lack of any severe lone-pair interactions in 3'(5')-OMe-AAB are consistent with the observation that the two highest occupied Kohn-Sham orbitals of AAB are closer in energy to those of 3'(5')-OMe-AAB than to those of the other positional isomers. The azo and amine lone-pair orbitals are only 1.0 (0.3) and 2.4 (1.7) kcal/mol higher in energy than the corresponding orbitals in AAB, leading to a small energy gap of 1.8 (1.7) kcal/mol. The LUMOs are again similar in shape to that found in AAB, with relatively little contribution from the out-of-plane oxygen lone-pair orbital; the HOMO-LUMO energy gaps, 47.6 and 48.0 kcal/mol, are slightly greater than that found in AAB.

2.1.4 2'- and 6'-OMe-AAB

The positional isomers 2'- and 6'-OMe-AAB are found to be 4.4 and 7.3 kcal/mol higher in energy than 3-OMe-AAB, see Table 1. As can be seen in Table 2, methoxy substitution at the 6'-position of AAB leads to greater changes in several of the geometrical parameters than those found with the other monomethoxy derivatives. The relatively short C6'-O bond length, 1.359 Å, in 6'-OMe-AAB indicates significant electron donation from the out-of-plane lone pair on the methoxy oxygen. This short C6'-O bond is compensated for by an elongation of both carbon-carbon bonds in the ring involving the C6' atom, a shortening of the C1'-N bond and a slight elongation of the N=N bond. Analogous changes in the bond lengths occur for substitution at the 2-position, but the presence of the amine group buffers the magnitude of these changes somewhat, particularly at the azo linkage, see Table 2. The energies of the two highest Kohn-Sham molecular orbitals in AAB are both significantly increased by substitution at the 6'-position.

The lone-pair orbital localized at the azo linkage also includes a contribution from the in-plane oxygen lone pairs and remains higher in energy than the orbital involving the amine nitrogen lone pair; the energy separation, 5.8 kcal/mol, is the second highest we observed for any of the monomethoxy derivatives. The structure of the LUMO is similar to that in AAB, but contains a contribution from the out-of-plane oxygen lone pair, similar to that observed for the 2- and 6-isomers. The energy separations between the HOMO and LUMO are 45.1 and 44.6 kcal/mol respectively.



Table 3. Selected properties of AB, AAB, and n-OMe-AAB calculated at the BP/DN**//BP/DN** Computational Level.

n     HOMO{-1} (a.u.)  HOMO (a.u.)     LUMO (a.u.)  Log P(d)  Electrostatic Charge on Amine N  Dipole Moment (D)
2     -0.186080(a)     -0.171712(b)    -0.101479    2.25      -0.72                            4.85
3     -0.186278(b)     -0.184263(a)    -0.109082    2.13      -0.67                            3.52
5     -0.185290(b)     -0.183553(a)    -0.109506    2.19      -0.67                            3.54
6     -0.185048(a)     -0.179274(b)    -0.106652    2.26      -0.72                            5.08
2'    -0.183749(a)     -0.179501(b)    -0.107565    2.29      -0.72                            2.77
3'    -0.188844(a)     -0.186038(b)    -0.110120    2.24      -0.73                            4.54
4'    -0.183084(c)     -0.182205(c)    -0.104728    2.34      -0.72                            2.55
5'    -0.189970(a)     -0.187243(b)    -0.110805    2.53      -0.71                            4.50
6'    -0.183052(a)     -0.173803(b)    -0.102735    2.30      -0.72                            2.04
AAB   -0.192206(a)     -0.187648(b)    -0.111910    2.47      -0.72                            3.61
AB    -0.222337(a)     -0.199380(b)    -0.126611    3.30      -                                0.07

a. Orbital involves the amine lone pair. b. Orbital involves the azo lone pairs. c. Orbital is mixed, see text. d. Log P is the logarithm of the octanol-water partition coefficient calculated using the Dixon-Hehre algorithm in Spartan 5.0 [10]; this involves explicit evaluation of AM1oct and AM1aq solvation models. The Ghose-Crippen approach gives Log P = 3.54 for all the OMe-AAB isomers (J. Comp. Chem., 9, 80 (1988)).



2.1.5 4'-OMe-AAB

The positional isomer 4'-OMe-AAB is only 1.3 kcal/mol higher in energy than 3-OMe-AAB, see Table 1. As can be seen in Table 2, the C4'-O bond length, 1.370 Å, is intermediate between that found for the 3-, 3'-, 5- and 5'-isomers and that found for the 2-, 2'-, 6- and 6'-isomers. The pattern of carbon-carbon bond lengths in the phenyl ring to which the methoxy group is attached is generally enhanced above that already found in AAB. The structures of the two highest occupied Kohn-Sham molecular orbitals are radically different from those observed for the other positional isomers. They are nearly degenerate (separated by only 0.5 kcal/mol) and appear as a mixture of the orbitals involving the azo and the amine lone pairs that are found for the other positional isomers. For comparison, we optimized several other AAB derivatives with substitution at the 4'-position. Similar combination orbitals are obtained for the HOMO and HOMO{-1} of 4'-OH-AAB, but for 4'-F-AAB the HOMO is clearly an azo lone-pair type orbital, whereas for 4'-SMe-AAB the HOMO is an amine lone-pair type orbital. The structure of the LUMO in 4'-OMe-AAB is similar to that observed for AAB, but with a contribution from the out-of-plane lone-pair orbital on the oxygen atom; the energy gap between the HOMO and LUMO is 48.6 kcal/mol.

2.2 Remarks

The methoxy azo dyes 2-OMe-AAB, 4'-OMe-AAB and 3-OMe-AAB are noncarcinogenic, moderately carcinogenic and strongly carcinogenic respectively [5]. The studies that established these results, however, have not made a clear distinction between methoxy substitution at the 2- and 6-positions or at the 3- and 5-positions. Ames' Salmonella mutagenicity tests suggest that none of these molecules is mutagenic per se; they require activation to their N-hydroxy derivatives prior to reaction with cellular macromolecules. Nevertheless, there appear to be some differences in the structures and electronic properties of the monomethoxy-AAB compounds themselves that may provide a basis for understanding their diverse carcinogenic behavior.

Many of the structural features in the monomethoxy AAB derivatives are determined to a large extent by electron delocalization at the azo linkage, which establishes a pattern of carbon-carbon bond lengths in the phenyl rings of AB; this pattern is enhanced by electron delocalization at the amine nitrogen atom in AAB. The presence of a methoxy group, with its two lone pairs of electrons, provides yet another site where delocalization is an issue, but it also introduces the possibility of lone-pair interactions involving the azo and amine nitrogen lone pairs.

Comparing the structures of 3(5)-OMe-AAB with those of 2(6)-OMe-AAB suggests that there is competition to delocalize lone-pair electron density at the amine nitrogen and methoxy oxygen atoms. For the 3(5)-isomers, where the methoxy oxygen is in close proximity to the amine nitrogen, it is energetically favorable to delocalize at the (less electronegative) amine nitrogen atom by further increasing the double bond character of the C4-N bond; for these isomers the C-O bond is relatively long. For the 2(6)-isomers, where the methoxy oxygen is instead in close proximity to the azo linkage, it becomes favorable to delocalize more at the methoxy oxygen atom by further increasing the double bond character of the C-O bond; for these isomers the C4-N bond is relatively long.

The Kohn-Sham HOMO and HOMO{-1} of AAB involve the azo and amine lone pairs respectively, and these orbitals are relatively close in energy. For most of the monomethoxy AAB derivatives the two highest occupied orbitals are similar in shape to those found in AAB, but involve contributions from one of the two lone pairs on the methoxy oxygen atom. In these n-OMe-AAB compounds, the energies of the two highest occupied orbitals are sensitive to the position (n) of the methoxy group because there is the potential for its lone pairs to be forced into close proximity with those on the AAB backbone.

It is interesting to note that the HOMO of the strongest carcinogen, 3(5)-OMe-AAB, involves the amine lone pair, whereas the HOMO of the noncarcinogen, 2(6)-OMe-AAB, involves the azo lone pairs. In the case of AAB itself, which is weakly carcinogenic, the HOMO involves the azo lone pairs, but the separation in energy between the two highest orbitals is smaller than that for 6-OMe-AAB and much smaller than that for 2-OMe-AAB. The carcinogenic potency of 4'-OMe-AAB is in between that of 2(6)-OMe-AAB and 3(5)-OMe-AAB, and its HOMO is a mixed orbital that includes a contribution from the amine nitrogen lone pair.

The results of our investigation suggest that the carcinogenic activity of an OMe-AAB isomer is increased as the energy of the orbital involving the amine nitrogen lone pair is raised relative to that of the orbital involving the azo nitrogen lone pairs.⁸

8 The energies of the LUMOs do not seem to correlate well with the carcinogenic potency of the AAB compounds, e.g. the LUMO of the noncarcinogen 2-OMe-AAB is 4.4 kcal/mol above the LUMO of the strong carcinogen 3-OMe-AAB but 6.5 kcal/mol above the LUMO of the weak carcinogen AAB.



This correlation can be further tested by noting that N-methyl-AAB and N,N-dimethyl-AAB compounds are usually more carcinogenic than the corresponding AAB compounds [6]. The HOMOs of both AAB and N-methyl-AAB are localized at the azo linkage. However, the separation in energy between the HOMO and HOMO{-1} in N-methyl-AAB is only about 50% of the corresponding separation in AAB. On the other hand, the HOMOs of both 3-OMe-AAB and N-methyl-3-OMe-AAB involve the amine lone pair, but for these compounds the energy gap between the two highest occupied orbitals is three times greater in the N-methyl compound. Furthermore, the HOMO and HOMO{-1} of N,N-dimethyl-AAB are of the mixed type we observed in 4'-OMe-AAB, where both orbitals involve the amine nitrogen lone pair. These results are consistent with an increase in the carcinogenic potency of a methoxy AAB derivative when the primary amine is monomethylated or dimethylated. It must be pointed out that a variety of effects can influence the carcinogenic activity of a particular compound [13-15]. For example, the HOMO of N,N-dimethyl-4'-OH-AAB involves the amine nitrogen lone pair and it is 2.7 kcal/mol higher in energy than the orbital involving the azo lone pairs. Based on our results for 2- and 3-OMe-AAB, this would suggest that N,N-dimethyl-4'-OH-AAB should be a strong hepatocarcinogen in the rat, but this is not the case [16]. It is likely that the hydroxy group provides a site for the metabolic breakdown of this dye before it can act as a carcinogen [16]. In fact, studies have shown that N,N-dimethyl-4'-OH-AAB is formed from N,N-dimethyl-AAB during its metabolism by rat homogenates [17], and that demethylated hydroxyazo derivatives are present in the urine of rats fed the dye [18]. Additional calculations and further experimental carcinogenic/mutagenic studies on AAB derivatives will be required to establish the extent to which knowing the relative energies of the orbitals involving the azo and amine lone pairs in these compounds can be used as a predictive tool for the carcinogenic behavior of azo dyes. These studies are currently in progress.

3 Neural Network Approach

In the last several years there has been a large and energetic upswing in research efforts aimed at synthesizing fuzzy logic with computational neural networks in the emerging field of soft computing in AI. The enormous success of commercial applications (primarily by Japanese companies), which are dependent to a large extent on soft computing technologies, has led to a surge of interest in these techniques for possible applications throughout the US textile industry.

The marriage of fuzzy logic with computational neural networks has a sound technical basis, because the two approaches generally attack the design of "intelligent" systems from quite different angles. Neural networks are essentially low-level computational algorithms that offer good performance in dealing with the large quantities of data often required in pattern recognition and control. Fuzzy logic, introduced in 1965 by Zadeh [19], is a means for representing, manipulating and utilizing data and information that possess non-statistical uncertainty. Thus, fuzzy methods often deal with issues such as reasoning on a higher (i.e., semantic or linguistic) level than do neural networks. Consequently, the two technologies often complement each other: neural networks supply the brute force necessary to accommodate and interpret large amounts of data, and fuzzy logic provides a structural framework that utilizes and exploits these low-level results.

This research is concerned with the integration of fuzzy logic and computational neural networks. Therefore, an algorithm for the creation and manipulation of fuzzy membership functions, which have previously been learned by a neural network from the data set under consideration, is designed and implemented. In the opposite direction we are able to use fuzzy tree architecture to construct neural networks and take advantage of the learning capability of neural networks to manipulate those membership functions for classification and recognition processes. In this research, membership functions are used to calculate fuzzy entropies for measuring uncertainty and information. That is, the amount of uncertainty regarding some situation represents the total amount of potential information in this situation. The reduction of uncertainty by a certain amount (due to new evidence) indicates the gain of an equal amount of information.

3.1 Fuzzy Entropy Measures

In general, a fuzzy entropy measure is a function f: P(X) → R, where P(X) denotes the set of all fuzzy subsets of X. That is, the function f assigns a value f(A) to each fuzzy subset A of X that characterizes the degree of fuzziness of A. Thus, f is a set-to-point map or, in other words, a fuzzy set defined on fuzzy sets [20].

DeLuca and Termini [21] first axiomatized non-probabilistic entropy. Their axioms are intuitive and have been widely accepted in the fuzzy literature. We adopt them here. In order to qualify as a meaningful measure of fuzziness, f must satisfy the following axiomatic requirements:

Axiom 1. f(A) = 0 if and only if A is a crisp (non-fuzzy) set.

Axiom 2. f(A) assumes the maximum if and only if A is maximally fuzzy.

Axiom 3. If A is less fuzzy than B, then f(A) ≤ f(B).



Axiom 4. f(A) = f(Aᶜ).

Only the first axiom is unique; axioms two and three depend on the meaning given to the concept of the degree of fuzziness. For example, assume that the "less fuzzy" relation is defined, after DeLuca and Termini [21], as follows:

μA(x) ≤ μB(x) for μB(x) < 1/2,
μA(x) ≥ μB(x) for μB(x) > 1/2,

and the term maximally fuzzy is defined by the membership grade 0.5 for all x ∈ X.

Motivated by the classical Shannon entropy function, DeLuca and Termini proposed the following fuzzy entropy function [21]:

f(A) = −Σx [μA(x) log2 μA(x) + (1 − μA(x)) log2(1 − μA(x))].

Its normalized version is given by f(A)/|X|, where |X| denotes the cardinality of the universal set X. Similarly, taking into account the distance from set A to its complement Aᶜ, another measure of fuzziness, referred to as an index of fuzziness [22], can be introduced. If the "less fuzzy" relation of Axiom 3 is defined by:

μC(x) = 0 if μA(x) ≤ 1/2,
μC(x) = 1 if μA(x) > 1/2,

where C is the crisp set nearest to the fuzzy set A, then the measure of fuzziness is expressed by the function [22]:

f(A) = Σx |μA(x) − μC(x)|

when the Hamming distance is used, and by the function [22]:

f(A) = (Σx (μA(x) − μC(x))²)^(1/2)

when the Euclidean distance is employed.

It is clear that other metric distances may be used as well [23]. For example, the Minkowski class of distances yields a class of fuzzy measures:

fw(A) = (Σx |μA(x) − μC(x)|^w)^(1/w)



where w ∈ [1, ∞).
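To make these definitions concrete, here is a minimal Python sketch (the function names, the clipping constant and the sample grades are ours, not the chapter's) of the DeLuca-Termini entropy and the Minkowski-class index of fuzziness; w = 1 gives the Hamming form and w = 2 the Euclidean form:

import numpy as np

def deluca_termini(mu, normalized=False):
    # DeLuca-Termini fuzzy entropy of a finite fuzzy set, given its
    # membership grades mu; 0*log2(0) is taken as 0 via clipping.
    mu = np.asarray(mu, dtype=float)
    m = np.clip(mu, 1e-12, 1.0 - 1e-12)
    f = -np.sum(m * np.log2(m) + (1.0 - m) * np.log2(1.0 - m))
    return f / mu.size if normalized else f

def index_of_fuzziness(mu, w=1.0):
    # Minkowski-class distance between A and its nearest crisp set C.
    mu = np.asarray(mu, dtype=float)
    crisp = (mu > 0.5).astype(float)   # nearest crisp set C
    return float(np.sum(np.abs(mu - crisp) ** w) ** (1.0 / w))

grades = [0.1, 0.4, 0.5, 0.9]
print(deluca_termini(grades), index_of_fuzziness(grades, 1), index_of_fuzziness(grades, 2))

Both sketched measures vanish on crisp sets and peak when every grade equals 0.5, as the axioms require.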

However, both the DeLuca-Termini measure and the Kaufmann measure are only special cases of the measures suggested by Knopfmacher [24] and Loo [25], expressed in the form [23]:

f(A) = h(Σx gx(μA(x))),

where the gx: [0, 1] → R+ are functions that are monotonically increasing in [0, 0.5], monotonically decreasing in [0.5, 1], and satisfy the requirements that gx(0) = gx(1) = 0 and that gx(0.5) is the unique maximum of gx, and h is a monotonically increasing function. It has been shown that the degree of fuzziness of a fuzzy set can be expressed in terms of the lack of distinction between the set and its complement [26-28].

It has also been established that a general class of measures of fuzziness based on this lack of distinction is exactly the same as the class of measures of fuzziness expressed in terms of a metric distance based on some form of aggregating the individual differences [23]:

fc(A) = |X| − Σx |μA(x) − c(μA(x))|.

To obtain the normalized version of fuzzy entropy, the above expression is divided by the cardinality |X|. These definitions can also be extended to infinite sets [23].

Another fuzzy entropy measure was proposed and investigated by Kosko [20,29]. He established that

f(A) = Σcount(A ∩ Aᶜ) / Σcount(A ∪ Aᶜ),

where Σcount is a fuzzy cardinality [30,31].

Kosko [29] claims that his entropy measure and the corresponding fuzzy entropy theorem do not hold when Zadeh's operations [19] are replaced with other generalized fuzzy operations. If any of the generalized Dombi operations [32] are used, the resulting measure is still an entropy measure and is maximized; however, it does not equal unity at the midpoints [33].

The generalized Dombi operations have proved to perform well in different applications: they were used by Sztandera [34] for detecting coronary artery disease and were suggested for image analysis by Sztandera [35]. However, we still have to use Zadeh's complement [19], since Kosko's theorem does not hold for any other class of fuzzy complements.

3.2 A New Concept of Fuzzy Entropy

In our experiments we used the fuzzy entropy suggested by Kosko [29] and the generalized fuzzy operations introduced by Dombi [32].

Generalized Dombi operations form one of several classes of functions which possess the appropriate axiomatic properties of fuzzy unions and intersections. The operations are defined below. From our experience, the parameter λ = 4 gives the best results [33].

3.2.1 Dombi's Fuzzy Union

μA∪B(x) = {1 + [(1/μA(x) − 1)^(−λ) + (1/μB(x) − 1)^(−λ)]^(−1/λ)}^(−1)

where λ is a parameter by which different unions are distinguished, and λ ∈ (0, ∞).

3.2.2 Dombi's Fuzzy Intersection

μA∩B(x) = {1 + [(1/μA(x) − 1)^λ + (1/μB(x) − 1)^λ]^(1/λ)}^(−1)

where λ is a parameter by which different intersections are distinguished, and λ ∈ (0, ∞).

It is interesting to examine the properties of these operations. By definition, generalized fuzzy union and intersection operations are commutative, associative, and monotonic. It can be shown that they satisfy neither the law of the excluded middle nor the law of contradiction. They are also not idempotent, nor distributive. However, they are continuous and satisfy De Morgan's laws (when the standard Zadeh complement is used) [32]. Zadeh's complement (c(a) = 1 − a) is by definition monotonically nonincreasing. It is also continuous and involutive. Other properties and the proofs can be found in Dombi's [32] and Zadeh's [19] papers.
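To illustrate how these definitions combine, the following Python sketch (ours; the clipping merely guards the formulas at crisp grades of exactly 0 or 1) implements the Dombi union and intersection with λ = 4 together with the Kosko entropy ratio built from them and the Zadeh complement:

import numpy as np

def dombi_union(a, b, lam=4.0):
    # Dombi fuzzy union (t-conorm) with parameter lambda in (0, inf).
    a = np.clip(np.asarray(a, dtype=float), 1e-9, 1 - 1e-9)
    b = np.clip(np.asarray(b, dtype=float), 1e-9, 1 - 1e-9)
    return 1.0 / (1.0 + ((1/a - 1)**(-lam) + (1/b - 1)**(-lam))**(-1.0/lam))

def dombi_intersection(a, b, lam=4.0):
    # Dombi fuzzy intersection (t-norm) with parameter lambda in (0, inf).
    a = np.clip(np.asarray(a, dtype=float), 1e-9, 1 - 1e-9)
    b = np.clip(np.asarray(b, dtype=float), 1e-9, 1 - 1e-9)
    return 1.0 / (1.0 + ((1/a - 1)**lam + (1/b - 1)**lam)**(1.0/lam))

def kosko_entropy(mu, lam=4.0):
    # sigma-count(A AND A^c) / sigma-count(A OR A^c), with Dombi AND/OR
    # and the standard Zadeh complement 1 - mu, as used in this research.
    mu = np.asarray(mu, dtype=float)
    comp = 1.0 - mu
    return float(dombi_intersection(mu, comp, lam).sum()
                 / dombi_union(mu, comp, lam).sum())

print(kosko_entropy([0.1, 0.4, 0.5, 0.9]))

At a grade of exactly 0.5 the per-element ratio is about 0.84 rather than 1, illustrating the remark above that with Dombi operations the measure is maximized at the midpoints but does not equal unity there.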



4 Feed-Forward Neural Network Architecture

The proposed algorithm generates a feed-forward network architecture for a given data set and, after having generated fuzzy entropies at each node of the network, switches to fuzzy decision making on those entropies. Nodes and hidden layers are added until the learning task is accomplished. The algorithm operates on numerical data and equates a decision tree with a hidden layer of a neural network [33]. The learning strategy used in this approach is based on achieving an optimal goodness function. This process of optimizing the goodness function translates into adding new nodes to the network until the desired values are achieved; when this is the case, all training examples are regarded as correctly recognized. The incorporation of fuzzy entropies into the algorithm seems to result in a drastic reduction of the number of nodes in the network and a decrease in convergence time. Connections between the nodes have a "cost" function equal to the weights of a neural network. The directional vector of a hyperplane that divides decision regions is taken as the weight vector of a node.

The outline of the algorithm follows:

Step i) For a given problem with N samples, choose a random initial weight vector.

Step ii) Make use of the learning rule

Δwij = −ρ ∂f(F)/∂wij

where ρ is a learning rate and f(F) is a fuzzy entropy function, and search for a hyperplane that minimizes the fuzzy entropy function:

min f(F) = min entropy(L, r)

where L is the level of the decision tree, r is the number of nodes (out of R, the total number of nodes in a layer), and f(F) is the fuzzy entropy.

Step iii) If the minimized fuzzy entropy is not zero but is smaller than the previous value, compute a new node in the current layer and repeat the previous step. Otherwise go to the next step.



Step iv) If there is more than one node in a layer, compute a new layer with inputs from all previous nodes, including the input data, then go to step ii). Otherwise terminate.
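The chapter does not give a closed form for ∂f(F)/∂wij, so the drastically simplified sketch below (all names are ours) stands in a central finite difference for the analytic gradient and covers steps i) and ii) only, for a single logistic node; it reuses kosko_entropy() from the sketch in Section 3.2 and does not implement the node- and layer-adding logic of steps iii) and iv):

import numpy as np

def node_entropy(w, X, lam=4.0):
    # Fuzzy entropy of one logistic node's outputs over the data set;
    # the sigmoid output of the node is treated as a membership grade.
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    return kosko_entropy(mu, lam)   # from the Section 3.2 sketch

def train_node(X, rho=0.5, eps=1e-4, steps=200, seed=0):
    # Steps i) and ii): random initial weights, then repeated updates
    # dw_ij = -rho * df(F)/dw_ij with a finite-difference gradient.
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        grad = np.empty_like(w)
        for j in range(w.size):
            wp, wm = w.copy(), w.copy()
            wp[j] += eps
            wm[j] -= eps
            grad[j] = (node_entropy(wp, X) - node_entropy(wm, X)) / (2 * eps)
        w -= rho * grad
    return w, node_entropy(w, X)

A node whose entropy can no longer be driven lower would trigger step iii), adding a further node or layer.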

5 Azo Dye Database

We have conclusively demonstrated that density functional techniques can efficiently be used to investigate the structure and properties (charge distribution, band gap, logP, etc.) of a wide range of azo dyes. (Most prior calculations on dyes have used lower-level semi-empirical methods.) We employed the gradient-corrected density functional (Becke-Perdew) method incorporated into the Spartan 5.0 molecular modeling package [10] using the polarized numerical DN** basis set (BP/DN**//BP/DN** level), which provides an exceptionally good description of the bonding in most organic molecules. (This computational level can also be used with dyes that contain metals such as Cr, Co, Cu, etc.) The calculated structural and physicochemical properties of these dyes, augmented with experimental results (optical properties, toxicological activity, etc.), were incorporated into a database that was used to train the neural network.

Preliminary results from several trials suggest that, given a collection of dye molecules, each described by a set of structural features, a set of physical properties, and the strength of some activity under consideration, a neural network algorithm could be used to find patterns in terms of the structural features and properties that correspond to a desired level of activity.

To determine the effectiveness of the proposed algorithm, its performance was evaluated on a database of molecular properties involving 22 selected azo dyes (11 carcinogenic/mutagenic and 11 non-carcinogenic). We used 80% of the database (18 molecules) for training purposes and 20% (4 molecules) for testing, and repeated the process five times (20% jackknife procedure). After several trial-and-error approaches with different input sets, we opted for three input parameters (logP, surface area, and volume). Using those parameters, in conjunction with experimental toxicological data, the network was able to learn and differentiate between mutagenic/carcinogenic and non-mutagenic/non-carcinogenic dyes. We expect the neural network to predict the mutagenic/carcinogenic nature of other chemical structures.
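A minimal Python sketch of that evaluation protocol (the names are ours, and train_fn/predict_fn stand in for the fuzzy-entropy network of Section 4) might look like:

import numpy as np

def repeated_80_20(X, y, train_fn, predict_fn, repeats=5, seed=0):
    # Repeat the 80/20 split described above: with 22 dyes this trains
    # on 18 molecules and tests on the held-out 4, five times over.
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        cut = int(0.8 * len(y))          # 18 of 22 for training
        train, test = order[:cut], order[cut:]
        model = train_fn(X[train], y[train])
        accuracies.append(np.mean(predict_fn(model, X[test]) == y[test]))
    return float(np.mean(accuracies))

# X holds the three descriptors per dye (logP, surface area, volume);
# y holds 1 for carcinogenic/mutagenic dyes and 0 otherwise.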



We are currently looking into using so-called topological indices (the modified Wiener index, modified Balaban index, modified Schultz index, etc.) [36,37] that have been used successfully in the pharmaceutical industry in QSPR and QSAR studies. We plan to use one of these topological indices, or develop one ourselves if none of these is adequate, in conjunction with logP and selected electronic properties from our density functional calculations as descriptors in our soft computing approach.
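As an illustration of what such a descriptor computes, here is a short Python sketch (ours) of the classic, unmodified Wiener index: the sum of bond-count distances over all atom pairs of a molecular graph:

from collections import deque

def wiener_index(adj):
    # Classic Wiener index: sum of shortest-path (bond-count) distances
    # over all unordered atom pairs, for a molecular graph given as an
    # adjacency list {atom: [neighbours]}; distances found by BFS.
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2   # each pair was counted twice

# n-butane carbon skeleton as a 4-atom chain: W = 1+2+3 + 1+2 + 1 = 10
print(wiener_index({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))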

6 Concluding Remarks

From the soft computing point of view, the proposed approach shows a way in which neural network technology can be used as a "tool" within the framework of fuzzy set theory. Generating membership functions with the aid of a neural network has been shown to be an extremely powerful and promising technology. In this research, membership functions are used to calculate fuzzy entropies for measuring uncertainty and information. The proposed neural network is a building block towards combining the two soft computing paradigms. It allows for self-generation of a neural network architecture suited to a particular problem.

The main features and advantages of the proposed approach are: 1) it is a general method of how to use numerical information, via neural networks, to provide good approximations to the membership functions; 2) it is a simple and straightforward quick-pass build-up procedure, where no time-consuming iterative training is required, resulting in much shorter design time than most neural networks; 3) there is a lot of freedom in choosing the membership functions and corresponding fuzzy entropies; this provides flexibility for designing systems satisfying different requirements; and 4) it performs successfully on data where neither a pure neural network nor a fuzzy system would work perfectly.

Molecular modeling has allowed us to investigate the properties of a large number of azobenzene derivatives in a short period of time. It is clear that there are correlations between our calculated properties and their toxicological behavior. We are also certain that such correlations exist between molecular properties and various textile parameters, such as light fastness. Often these correlations are not evident until calculations on a sufficiently large number of related structures have been performed and the data carefully analyzed. Then appropriate molecular descriptors can more readily be identified and used as input into a neural network.

Acknowledgement

The authors would like to acknowledge the US Department of Commerce, National Textile Center (Grant #I98-P01) for financial support of this research.

References

1. Kojima M., Degawa M., Hashimoto Y. and Tada M. (1991), Biochem. Biophys. Res. Commun., 179, p. 817.
2. Hashimoto Y., Degawa M., Watanabe H.K. and Tada M. (1981), Gann, 72, p. 937.
3. Degawa M., Miyairi S. and Hashimoto Y. (1978), Gann, 69, p. 367.
4. Degawa M., Shoji Y., Masuko K. and Hashimoto Y. (1979), Cancer Lett., 8, p. 71.
5. Miller J.A. and Miller E.C. (1961), Cancer Res., 21, p. 1068.
6. Hashimoto Y., Watanabe H.K. and Degawa M. (1981), Gann, 72, p. 921.
7. Freeman H.S., Posey J.C. Jr. and Singh P. (1992), Dyes and Pigm., 20, p. 279.
8. Degawa M., Kojima M. and Hashimoto Y. (1985), Mutation Res., 152, p. 125.
9. Lye J., Hink D. and Freeman H.S. (1997), Computational chemistry applied to synthetic dyes. In: Cisneros G., Cogordan J.A., Castro M. and Wang C. (eds), Computational chemistry and chemical engineering, World Scientific, Singapore.
10. Spartan v5.0, Wavefunction Inc., 18401 Von Karman Avenue, Suite 370, Irvine, CA 92612.
11. Perdew J.P. (1986), Phys. Rev., B33, p. 8822.
12. Perdew J.P. (1987), Phys. Rev., B34, p. 7046.
13. Chung K.T. and Cerniglia C.E. (1992), Mutation Res., 277, p. 201.
14. Ashby J., Paton D., Lefevre P.A., Styles J.A. and Rose F.L. (1982), Carcinogenesis, 3, p. 1277.
15. Cunningham A.R., Klopman G. and Rosenkranz H.S. (1998), Mutation Res., 405, p. 9.
16. Miller J.A., Sapp R.W. and Miller E.C. (1949), Cancer Res., 9, p. 652.
17. Mueller G.C. and Miller J.A. (1948), J. Biol. Chem., 176, p. 535.
18. Miller J.A. and Miller E.C. (1947), Cancer Res., 7, p. 39.
19. Zadeh L. (1965), Fuzzy Sets, Information and Control, 8, pp. 338-353.
20. Kosko B. (1986), Fuzzy Entropy and Conditioning, Information Sciences, 40, pp. 165-174.
21. DeLuca A. and Termini S. (1972), A Definition of a Nonprobabilistic Entropy in the Setting of Fuzzy Sets Theory, Information and Control, 20, pp. 301-312.
22. Kaufmann A. (1975), Introduction to the Theory of Fuzzy Subsets, Academic Press, New York.
23. Klir G.J. and Folger T.A. (1988), Fuzzy Sets, Uncertainty and Information, Prentice Hall, Englewood Cliffs.
24. Knopfmacher J. (1975), On Measures of Fuzziness, J. Math. Anal. Appl., 49, pp. 529-534.
25. Loo S.G. (1977), Measures of Fuzziness, Cybernetica, 20, pp. 201-210.
26. Yager R.R. (1979), On the Measure of Fuzziness and Negation. Part I: Membership in the Unit Interval, International Journal of General Systems, 5, pp. 221-229.
27. Yager R.R. (1980), On the Measure of Fuzziness and Negation. Part II: Lattices, Information and Control, 44, pp. 236-260.
28. Higashi M. and Klir G.J. (1982), On Measures of Fuzziness and Fuzzy Complements, International Journal of General Systems, 8, pp. 169-180.
29. Kosko B. (1992), Neural Networks and Fuzzy Systems, Prentice Hall, Englewood Cliffs.
30. Zadeh L. (1983), A Computational Approach to Fuzzy Quantifiers in Natural Languages, Comput. Math. Appl., 9, pp. 149-184.
31. Zadeh L. (1983), The Role of Fuzzy Logic in the Management of Uncertainty in Expert Systems, Fuzzy Sets and Systems, 11, pp. 199-227.
32. Dombi J. (1982), A General Class of Fuzzy Operators, the De Morgan Class of Fuzzy Operators and Fuzziness Measures, Fuzzy Sets and Systems, 8, pp. 149-163.
33. Cios K.J. and Sztandera L.M. (1992), Continuous ID3 Algorithm with Fuzzy Entropy Measures, In: Proceedings of the 1st International Conference on Fuzzy Systems and Neural Networks, IEEE Press, San Diego, pp. 469-476.
34. Cios K.J., Goodenday L.S. and Sztandera L.M. (1994), Hybrid Intelligence Systems for Diagnosing Coronary Stenosis, IEEE Engineering in Medicine and Biology, 13, pp. 723-729.
35. Sztandera L.M. (1990), Relative Position Among Fuzzy Subsets of an Image, M.S. Thesis, Computer Science and Engineering Department, University of Missouri-Columbia, Columbia, MO.
36. Vedrina M., Markovic S., Medic-Saric M. and Trinajstic N. (1997), Computers Chem., 21, pp. 355-361.
37. Balaban A.T. (1982), Chem. Phys. Letters, 89, pp. 399-404.


Fuzzy logic and fuzzy classification techniques

S.M. Scott, W.T. O'Hare and Z. Ali
School of Science and Technology, University of Teesside, Middlesbrough TS1 3BA, England

Summary: This chapter presents some basic fuzzy theory and then demonstrates how this may be used for the classification of data. A variety of fuzzy pattern recognition systems (fuzzy c-means, fuzzy ARTMAP, SFAM and radial basis function neural networks) are described and compared using the standard circle-in-square and iris datasets. A fuzzy classifier for the analysis of volatiles using data from an electronic nose is described as an example of constructing a specialised fuzzy based system.

Keywords: Classification, Neural network, c-means, Fuzzy, Radial basis function, Iris, Electronic nose.

1. Introduction

Fuzzy sets and systems provide an alternative to the traditional forms of logic and set membership that have predominated since the time of the ancient Greeks. In 1965, Lotfi A. Zadeh published the work "Fuzzy Sets" [1][2], which describes the mathematics of fuzzy set theory and fuzzy logic. This theory extends the classical notion of true and false to include a range of real numbers [0.0, 1.0]. New operations for the calculus of logic were proposed which were generalisations of classic logic. In classical set theory an object either belongs to a set or it does not. Fuzzy logic explains situations in which there is imprecision due to vagueness rather than randomness; probability explains how events occur in a random space [3][4].

A requirement of probability is additivity, i.e. the probabilities of the mutually exclusive outcomes of a particular system must add to one. Fuzzy membership functions do not possess this property. Fuzzy membership functions can be developed using a wide range of techniques, including probability density functions. Probability deals with the likelihood of an outcome; fuzzy logic deals with the degree of ambiguity. A probability of 1 indicates that the event is certain to occur. In fuzzy logic a membership of 1 means a complete lack of ambiguity. The statement "there is a 50% chance of a cloudy day" states the chance (0.5) of an ambiguous (cloudy) outcome. In many situations approximate reasoning is more practical than exact reasoning; e.g. it is more appropriate to say "apply the clutch just before the car is due to stop" rather than "apply the clutch 0.638 seconds before the car is due to stop". Fuzzy set theory may also be used for pattern recognition when the categories are imprecisely defined.

Figure 1.1 Assignment of point C to clusters A and B

In figure 1.1, conventionally the point C would be assigned either to cluster A or B. Fuzzy clustering techniques are able to assign C in a more reasonable fashion by giving C a separate grade of membership to each cluster.

2. Fuzzy sets

If U is a classical set of objects, called the universe, whose generic elements are denoted by x, then membership in a classical subset A of U is viewed as a function

$$\mu_A : U \to \{0,1\}$$


such that:

$$\mu_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases} \qquad (2.1)$$

{0,1} is called a valuation set. If the valuation set is allowed to be the real interval [0.0, 1.0], then A is called a fuzzy set. $\mu_A(x)$ is a number in the closed interval [0.0, 1.0] indicating the degree or grade of membership of x in A. The closer the value of $\mu_A(x)$ is to 1, the more certainly x belongs to A. When U is a finite and countable set $\{x_1, \ldots, x_n\}$, the fuzzy set A is expressed as:

$$A = \sum_{i=1}^{n} \mu_A(x_i)/x_i \qquad (2.2)$$

When the universal set is infinite, a fuzzy set A is often written in the form:

$$A = \int_U \mu_A(x)/x \qquad (2.3)$$

Figure 2.1 Triangular membership set


A triangular-shaped membership function, characterised by the parameters a, b and c and peak height d, shown in figure 2.1, may be represented by equation (2.4).

$$\mu_A(x) = \begin{cases} \dfrac{(x-a)\,d}{b-a}, & a \le x \le b \\[6pt] \dfrac{(c-x)\,d}{c-b}, & b \le x \le c \\[6pt] 0, & \text{otherwise} \end{cases} \qquad (2.4)$$

A trapezoidal membership function, shown in figure 2.2, characterised by the parameters a, b, c, d and e can be represented by equation (2.5).


Figure 2.2 Trapezoidal membership set


$$\mu_A(x) = \begin{cases} \dfrac{(x-a)\,e}{b-a}, & a \le x \le b \\[6pt] e, & b \le x \le c \\[6pt] \dfrac{(d-x)\,e}{d-c}, & c \le x \le d \\[6pt] 0, & \text{otherwise} \end{cases} \qquad (2.5)$$

A Gaussian membership function, characterised by the parameters a (mean) and σ (standard deviation), shown in figure 2.3, can be represented by equation (2.6).


Figure 2.3 Gaussian membership function

$$\mu_A(x) = \exp\!\left[-\frac{(x-a)^2}{2\sigma^2}\right] \qquad (2.6)$$

The information contained by the linguistic terms is expressed by membership functions. A wide variety of membership functions can be used including triangular, trapezoidal and Gaussian.
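As an illustration, the three membership functions of equations (2.4) to (2.6) can be written compactly in code. The following is a minimal Python sketch; the function names and vectorised NumPy style are our own, and the peak heights d and e default to 1.

```python
import numpy as np

def triangular(x, a, b, c, d=1.0):
    """Triangular membership (eq. 2.4): rises from a to peak height d at b,
    then falls to zero at c."""
    x = np.asarray(x, dtype=float)
    up = (x - a) / (b - a) * d
    down = (c - x) / (c - b) * d
    return np.where((x >= a) & (x <= b), up,
                    np.where((x > b) & (x <= c), down, 0.0))

def trapezoidal(x, a, b, c, d, e=1.0):
    """Trapezoidal membership (eq. 2.5): plateau of height e between b and c."""
    x = np.asarray(x, dtype=float)
    up = (x - a) / (b - a) * e
    down = (d - x) / (d - c) * e
    return np.where((x >= a) & (x < b), up,
                    np.where((x >= b) & (x <= c), e,
                             np.where((x > c) & (x <= d), down, 0.0)))

def gaussian(x, a, sigma):
    """Gaussian membership (eq. 2.6) with mean a and spread sigma."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(x - a) ** 2 / (2.0 * sigma ** 2))

# Grade of membership of x = 4 in a triangular set over [2, 8] peaking at 5
print(triangular(4.0, a=2.0, b=5.0, c=8.0))   # 0.666...
```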


2.1 Basic operations of fuzzy sets

Membership functions measure the degree to which objects satisfy imprecisely defined properties. Standard operations are used to manipulate fuzzy sets.

If U is a set and x is a member or element of U, then the complement of a fuzzy set A, denoted $\bar{A}$, has a membership function described by equation (2.7). The total shaded area in figure 2.4 represents the complement of a Gaussian fuzzy set A.

$$\mu_{\bar{A}}(x) = 1 - \mu_A(x) \qquad (2.7)$$

Figure 2.4 Union of fuzzy set A and its complement

The UNION of fuzzy sets A OR B, written $A \cup B$, has the membership function:

$$\mu_{A \cup B}(x) = \max[\mu_A(x), \mu_B(x)] \qquad (2.8)$$


where 'max' represents the maximum of the two grades of membership. The union of A and its complement is illustrated in figure 2.4 as the bold line over both set areas. In classical set theory the union of any set A with its complement $\bar{A}$ yields the universal set U: all of the elements of U must belong to either A or $\bar{A}$. This law of the excluded middle does not hold for fuzzy sets, since an element x may belong to both A and $\bar{A}$ with partial membership, so their union need not reach full membership.

The INTERSECTION of fuzzy sets A AND B, written $A \cap B$, has the membership function:

$$\mu_{A \cap B}(x) = \min[\mu_A(x), \mu_B(x)] \qquad (2.9)$$

Where 'min' denotes the minimum of the two membership grades, shown as the lightly shaded area in figure 2.4.

A set A is EMPTY if, for all elements x within the set A:

$$\mu_A(x) = 0.0 \qquad (2.10)$$

Two sets A and B are EQUAL if, for all x:

$$\mu_A(x) = \mu_B(x) \qquad (2.11)$$

A fuzzy set A is CONTAINED in a fuzzy set B, written as $A \subset B$, if and only if:

$$\mu_A \le \mu_B \qquad (2.12)$$
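The standard operations of equations (2.7) to (2.9), together with the containment test of equation (2.12), reduce to elementwise maxima and minima. A minimal sketch, assuming fuzzy sets represented as NumPy arrays of membership grades over a common discretised universe:

```python
import numpy as np

# Discretised universe and two Gaussian fuzzy sets A and B
x = np.linspace(0.0, 10.0, 101)
muA = np.exp(-(x - 4.0) ** 2 / 2.0)
muB = np.exp(-(x - 6.0) ** 2 / 2.0)

complement_A = 1.0 - muA                  # eq. (2.7)
union_AB = np.maximum(muA, muB)           # eq. (2.8)
intersection_AB = np.minimum(muA, muB)    # eq. (2.9)

# Law of the excluded middle fails: max(muA, 1 - muA) < 1 wherever 0 < muA < 1
print(np.max(np.maximum(muA, complement_A)))
# Containment test A in B (eq. 2.12): muA <= muB everywhere? (False here)
print(np.all(muA <= muB))
```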

3. Case studies of fuzzy classification techniques

3.1 Pattern recognition systems


A variety of fuzzy pattern recognition systems (fuzzy c-means, fuzzy ARTMAP, SFAM and RBF neural networks) will be described using the standard circle-in-square and iris datasets. A fuzzy classifier for the analysis of volatiles using an electronic nose (chemical sensor array) will also be discussed.


3.2 Standard data sets

Standard data sets allow algorithms to be tested independently of specific problems; one algorithm may then be compared for efficiency and accuracy with any other. For this reason we use two standard datasets to present the standard pattern recognition techniques.

3.2.1 Circle in square problem

The circle in the square problem consists of a square of unit side length; inscribed within this square is a circle that has the same centre as the square and an area of one half that of the square, as shown in figure 3.1. The test is to correctly classify whether an (x, y) point lies within the circle or not. The problem looks easy but is not for a machine: the circular region makes this an exclusive-or problem that requires a relatively large number of hidden neurones for most back-propagation networks to solve.


Figure 3.1 Circle in square problem

To test the classification techniques on this standard problem, one thousand points were generated in an even grid through the problem space of the square. These points were then split randomly into training and test sets at a ratio of 2:1 (667 training, 333 test), as sketched below.
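A sketch of how such a dataset can be generated follows. The grid resolution and random seed are illustrative choices; the circle radius is fixed by the requirement that the circle have half the area of the unit square.

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly a thousand points on an even grid over the unit square
side = np.linspace(0.0, 1.0, 32)              # 32 x 32 = 1024 points
xx, yy = np.meshgrid(side, side)
points = np.column_stack([xx.ravel(), yy.ravel()])

# Circle centred on the square with half its area: pi r^2 = 0.5
r = np.sqrt(0.5 / np.pi)
labels = (np.hypot(points[:, 0] - 0.5, points[:, 1] - 0.5) <= r).astype(int)

# Random 2:1 split into training and test sets
perm = rng.permutation(len(points))
n_train = 2 * len(points) // 3
X_train, y_train = points[perm[:n_train]], labels[perm[:n_train]]
X_test, y_test = points[perm[n_train:]], labels[perm[n_train:]]
print(len(X_train), len(X_test), labels.mean())  # about half lie inside
```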


3.2.2 Iris data

The Iris data of Anderson [5] consists of 4 measurements (in cm) from each of 150 iris plants, the sepal length, sepal width, petal length and petal width. The first 50 sets belong to iris setosa; the second 50 sets to iris versicolor and the last 50 sets are iris virginica. Iris versicolor is a hybrid of iris setosa and iris virginica, but is more similar to virginica. Consequently setosa is easily identified but the other two are more difficult to separate. For classification this data was split into 100 training sets and 50 test sets. Table 3.1 shows an extract from the iris data.

Sepal Length  Sepal Width  Petal Length  Petal Width
         5.1          3.5           1.4          0.2
         7.0          3.2           4.7          1.4
         5.8          2.7           5.1          1.9

Table 3.1 Extract from iris data

3.2.3 Principal Component Analysis


Principal Component Analysis (PCA) is used here for visualisation of the data. It is a commonly used multivariate technique [6][7], which acts unsupervised. PCA finds an alternative set of axes about which a data set may be represented. It indicates along which axis there is the most variation; axes are orthogonal to one another. PCA is designed to provide the best possible view of variability in the independent variables of a multivariate data set. If the principal component scores are plotted they may reveal natural clustering in the data and outlier samples. Using this technique provides an insight into how effective the pattern recognition system will be at classifying the data. PCA is a simple and fast method for dimensionality reduction but remains a linear approach.
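For readers who wish to reproduce such scores plots, PCA reduces to a singular value decomposition of the mean-centred data matrix. A minimal sketch (the random stand-in data is illustrative only):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centred data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)                   # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T           # scores for plotting

# Example with random stand-in data (150 samples x 4 features, like the iris set)
X = np.random.default_rng(1).normal(size=(150, 4))
scores = pca_scores(X)
print(scores.shape)                           # (150, 2): PC1 vs PC2 scatter
```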


Figure 3.2 PCA scores plot for iris data (setosa, virginica, versicolor)

Figure 3.2 shows the first two principal components in a PCA scores plot for the 150 iris samples. The categories are clearly visible as clusters. The setosa forms a tight cluster to the right of the plot with a centre of (0.18, 0), the versicolor forms a less tight cluster to its left, centre (-0.025, 0). The virginica data forms a loose cluster to the left of the plot, centre (-0.15, 0). There is a small degree of overlap between the versicolor and virginica categories, and there are several outliers to the virginica some of which are closer to the versicolor cluster centre than the virginica cluster centre. Overall this dataset forms a good test of classification techniques due to the overlapping clusters and the loose cluster with outliers.

Determination of clusters may be performed by a number of methods, including Kohonen's self-organising feature map (SOM). Kohonen's SOM does, however, have a number of limitations: it has no well-defined cost function and no guarantee of convergence, the procedure for shrinking the neighbourhood is arbitrary, and the parameters of the learning process need to be changed to achieve the best results. Generative Topographic Mapping has been suggested as an improvement on the SOM [8]. An assumption often made about data is that its distribution is Gaussian or nearly Gaussian; this is often not true.

3.3 Fuzzy c-means

Fuzzy c-means is a clustering method of data analysis based on the fuzzy membership of each data point to each of the clusters of data formed. Conceived in 1973 by Dunn [9] and generalised by Bezdek [10], the family of algorithms is based on an iterative optimisation of a fuzzy objective function.


Due to their efficacy, simplicity and computational efficiency, these have become very popular techniques.

Figure 3.3 2D fuzzy set (Gaussian fuzzy set membership plotted against standard deviations in x and y)

The classification of a set of entities by a learning system is a powerful tool for acquiring knowledge from data. Given a set of feature vectors, a process may cluster them into groups of similar feature values. A ball of uniformly distributed vectors has no cluster structure; but if a set of vectors is partitioned into multiple subsets that are compactly distributed about their centres, and the centres are relatively far apart, then there is a strong cluster structure.

3.3.1 Fuzzy c-means explanation

The fuzzy c-means algorithm uses fuzzy weighting with positive weights to determine the centres of the c cluster prototypes; c must be given. The weights are set to minimise a constrained functional. As a point approaches a prototype centre its weight increases towards unity, but as the distance increases the weight decreases and tends to become more uniform, as shown in figure 3.3.


The fuzzy c-means algorithm allows each feature vector to belong to multiple clusters with varying fuzzy membership values. It should be noted that convergence to a fuzzy weight set that minimises the functional is not assured, due to local minima and saddle points. To overcome this, the initial weights of the feature vectors are randomly chosen and the process is repeated several times to obtain a mean solution.

The aim of cluster analysis is to group data vectors according to the similarities amongst them. A cluster is a group of objects that have more similarities with objects within the group than with members of other clusters. Typically this similarity is defined as the distance between vectors based on the length from a data vector to some prototypical object of the cluster. The prototypes are not usually known beforehand, and are calculated by the clustering algorithm simultaneously with the partitioning of the data. Accordingly clustering techniques are among the unsupervised learning methods, as they do not use prior knowledge of class identification. The prototypes may be vectors of the same dimension as the data objects, but may also be defined as higher-level geometrical shapes.

A cluster is a subset of the full data set; classification may be either classical hard clustering or fuzzy clustering. Hard clustering methods are based on set theory and require that an object either does or does not belong to a specific cluster. Fuzzy clustering allows objects to belong to clusters with a degree of membership. The dataset Z is partitioned into c fuzzy subsets. Objects on the boundaries between classes are not forced to belong fully to any one of the classes; they are instead assigned a membership between 0 and 1 indicating the degree to which the data vector belongs to each cluster.

Each data vector consists of n measured variables grouped into an n-dimensional column vector $z_k = [z_{1k}, \ldots, z_{nk}]^T$, $z_k \in \mathbb{R}^n$. A set of N observations is denoted by $Z = \{z_k \mid k = 1, 2, \ldots, N\}$ and may be represented as an n-row by N-column matrix:

$$Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1N} \\ z_{21} & z_{22} & \cdots & z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{nN} \end{bmatrix} \qquad (3.1)$$


In typical pattern recognition terminology, the columns of Z are the patterns or objects, the rows are called the features or attributes, and Z is called the pattern matrix.

Clustering divides the dataset Z into c clusters. A c by N matrix $U = [\mu_{ik}]$ represents a fuzzy partition if its elements satisfy the following conditions:

The fuzzy membership of each object to each cluster lies in the range [0, 1]:

$$\mu_{ik} \in [0,1], \quad 1 \le i \le c, \ 1 \le k \le N \qquad (3.2)$$

The sum of the fuzzy memberships to all clusters for each object is 1:

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N \qquad (3.3)$$

The sum of the fuzzy memberships of all objects to each cluster must be greater than 0 and less than N:

$$0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c \qquad (3.4)$$

where c is the number of fuzzy clusters and $\mu_{ik}$ denotes the degree of membership with which the k-th observation $z_k$ belongs to the i-th cluster.

The objective of the fuzzy c-means algorithm is to minimise the sum of the weighted squared distances between the data points $z_k$ and the cluster centres $v_i$. The distances $D_{ik}$ are weighted by the membership values $\mu_{ik}$. The objective function is then:


$$J(Z, U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ik}^2 \qquad (3.5)$$

where:

$U = [\mu_{ik}]$ is the fuzzy partition matrix of Z,

$V = [v_1, v_2, \ldots, v_c]$ is a vector of cluster prototypes (centres), and

$m \in (1, \infty)$ is a weighting exponent that determines the fuzziness of the resulting clusters; it is commonly chosen to be m = 2.

$D_{ik}$ may be determined by any appropriate norm, for example the Euclidean norm distance:

$$D_{ik}^2 = \|z_k - v_i\|^2 = (z_k - v_i)^T (z_k - v_i) \qquad (3.6)$$

The minimisation of the c-means functional represents a non-linear optimisation problem that may be solved using the alternating optimisation algorithm also known as the fuzzy c-means algorithm.

The Euclidean distance results in point prototypes and develops spherical clusters. The Gustavson and Kessel algorithm [11] replaces the Euclidean distance by a metric that is induced by a positive definite matrix. It can therefore detect ellipsoidal clouds of data vectors. The clusters are still assumed to be approximately the same size.

3.3.2 The Fuzzy c-means algorithm

Initialisation

Given the dataset Z, choose the number of clusters c, the weighting exponent m and the termination tolerance ε > 0, and initialise the partition matrix randomly.


Loop (l = 1, 2, ...) (calculate for at most a maximum number of iterations):

Compute the cluster centres:

$$v_i^{(l)} = \frac{\sum_{k=1}^{N} (\mu_{ik}^{(l-1)})^m z_k}{\sum_{k=1}^{N} (\mu_{ik}^{(l-1)})^m}, \quad 1 \le i \le c$$

Compute the distances (Euclidean):

$$D_{ik}^2 = (z_k - v_i^{(l)})^T (z_k - v_i^{(l)}), \quad 1 \le i \le c, \ 1 \le k \le N$$

Update the partition matrix: if $D_{ik} > 0$,

$$\mu_{ik}^{(l)} = \frac{1}{\sum_{j=1}^{c} \left(D_{ik} / D_{jk}\right)^{2/(m-1)}}$$

else the distance is zero, so the membership is 1.

Until $\left\| U^{(l)} - U^{(l-1)} \right\| < \varepsilon$ (the partition matrix alters, in the Euclidean norm, by less than the tolerance).

The calculation will continue until the partition matrix alters by less than a tolerance (Euclidean Norm) value or a maximum number of iterations has been reached. A variation of this is to use the change in cluster centres V.
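The loop above translates directly into vectorised code. A minimal Python sketch, following the notation of this section; the function name, defaults and random stand-in data are our own:

```python
import numpy as np

def fuzzy_c_means(Z, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Alternating optimisation of the fuzzy c-means functional (eq. 3.5).
    Z: (N, n) pattern matrix (one object per row); returns centres V (c, n)
    and partition matrix U (c, N)."""
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0)                  # memberships per object sum to 1 (eq. 3.3)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ Z) / Um.sum(axis=1, keepdims=True)     # cluster centres
        # squared Euclidean distances D_ik^2 (eq. 3.6)
        D2 = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)
        D2 = np.fmax(D2, 1e-12)         # zero distance -> membership forced to ~1
        W = D2 ** (-1.0 / (m - 1.0))    # D_ik^{-2/(m-1)}
        U_new = W / W.sum(axis=0, keepdims=True)         # partition update
        if np.linalg.norm(U_new - U) < tol:              # termination test
            U = U_new
            break
        U = U_new
    return V, U

# Example: three clusters with m = 2, then defuzzify by maximum membership
Z = np.random.default_rng(2).normal(size=(150, 4))
V, U = fuzzy_c_means(Z, c=3)
labels = U.argmax(axis=0)
```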


3.3.3 Results using fuzzy c-means

The results of applying the fuzzy c-means algorithm to the full iris data set (with m = 2) give 133 correctly classified samples; details are shown in table 3.2 below.

Category           Setosa  Versicolor  Virginica  Total  Correct (%)
Setosa                 50           0          0     50          100
Versicolor              0          47          3     50         94.0
Virginica               0          14         36     50         72.0
Total Correct (%)                                               88.6

Table 3.2 Results of fuzzy c-means analysis on irises

The results shown in table 3.2 above show that the iris virginica has a poor classification rate. This is not surprising as all of the clusters are assumed to be spherical and of the same size, which is not true for this data set.

Figure 3.4 shows the fuzzy c-means partitions for the irises mapped onto the first two principal component surface. The outlying points for the virginica have been partitioned to have a higher membership value for the versicolor iris. Similarly the outliers for the iris versicolor that are closer to the virginica cluster centre are prescribed a higher membership value for virginica.


Figure 3.4 Fuzzy c-means memberships for irises

Fuzzy c-means may not be the ideal classification technique for this data; it may, however, be used as a pre-processing device to reduce dimensionality whilst simultaneously normalising the data. For the three classes shown here, the output from the c-means algorithm will be three values for each data set; table 3.3 shows the output for the first few iris data sets.

  Setosa  Versicolor  Virginica
0.972446    0.019955   0.007599
0.976282    0.016956   0.006763
0.963284    0.026492   0.010224
0.992346    0.005496   0.002158

Table 3.3 Example fuzzy c-means output for irises


3.4 Fuzzy Adaptive Resonance Theory Mapping (Fuzzy ARTMAP)

3.4.1 Fuzzy ARTMAP

The fuzzy adaptive resonance theory (Fuzzy ART) neural network is part of a family of self-organising neural architectures that cluster the pattern space and produce weight vector templates. One of the problems of simple competitive nets is that under certain circumstances the assignment of best matching nodes to input patterns may become unstable [12][13]. Carpenter and Grossberg refer to this phenomenon as the plasticity-stability dilemma [14]: how may a network retain learned patterns (stable) while remaining able to learn new ones (plastic)? Kohonen's self-organising network uses a gradually reducing learning rate; this, however, simply limits the plastic period of the net. Another problem in neural network computing is to fix the number of nodes required to describe the pattern space. If a large number of nodes are used then a finely graded solution will be obtained but computation times will increase; too few nodes and the granularity will be too coarse, resulting in imprecise calculation. It is far better to allow the network to organise itself in this respect, so that the number of nodes produced gives the appropriate accuracy according to a single 'vigilance' parameter. The ART family of neural networks addresses these issues in a biologically plausible way [15], underpinned by a rigorous mathematical description.

Carpenter and Grossberg [14][16][17] developed the Adaptive Resonance Theory (ART) family of neural networks to solve the stability-plasticity dilemma that other neural networks suffer from. The aim was to have a stable memory structure, even with fast on-line learning, that was capable of adapting to new data input, even forming totally new category distinctions. Fuzzy ARTMAP is a specialisation of the general ART case, developed for supervised slow learning. Unlike parametric probability estimators, Fuzzy ARTMAP does not depend on a priori assumptions about the underlying data. Online computation is able to achieve probability estimates and compression by partitioning the input space into categories. Recognition categories, large or small, are produced to output the best predictions, and a variable number of recognition categories may predict each output. The network has a small number of parameters and does not require guesswork to determine the initial configuration, since the network is self-organising. In a standard back-propagation network used for pattern classification an output node is assigned to every class of object that the network is expected to learn. In Fuzzy ARTMAP the assignment of output nodes to categories is left up to the network. Input into the network must be normalised to a value from 0 to 1.


Hence a suitable normalisation value must be chosen so that no input will fall outside the valid range.

3.4.2 Mapping

Fuzzy ARTMAP consists of two Fuzzy ART modules (ARTa and ARTb); the F2 layers of these modules are linked by an inter-ART associative memory referred to as a 'match tracking' system. The Fuzzy ARTMAP architecture is shown in figure 3.5. During supervised learning ARTa receives a stream of input patterns a(m) and ARTb receives a stream of patterns b(m), where b(m) is the correct prediction for a(m). When ARTb does not confirm a prediction by ARTa, inhibition of the inter-ART associative memory activates a match tracking process. This increases the ARTa vigilance by the minimum amount needed for the system to activate an ARTa category that matches the ARTb category, or to learn a new ARTa category.

Figure 3.5 Fuzzy ARTMAP architecture (ARTa and ARTb modules linked by the map field and match tracking)


3.4.3 ART modules

Input into the ART module consists of a vector of normalised data. The F0 layer is a complement coder that transforms the input vector of dimension m into a complement-coded vector of dimension 2m. The F1 layer is passed the complement-coded vector and compares it to each node in the F2 layer according to a fuzzy match criterion. If there are no nodes in the F2 layer, a new node is created with its weights set to the complement-coded input vector (fast learning). If nodes already exist, then the node with the highest match is the winning node. If this winning node matches better than a vigilance criterion, the module is said to be in resonance and the node's weights are updated by an amount dictated by the learning rate. If the node does not match by at least the vigilance criterion, then a new node is created and added to the F2 layer with weights set to the complement-coded input vector. If the winning node from the ARTb module does not confirm the prediction of the ARTa module, then the inter-ART map field induces the match tracking process. Match tracking raises the ARTa vigilance to just above the F1a to F0a match ratio. This triggers an ARTa search that leads to the activation of either an ARTa category that correctly predicts b, or to the creation of a new ARTa category node in the F2a layer.

3.4.4 Complement coding

Complement coding ensures that both the presence and the absence of a particular feature in the input are visible. For a given input vector a of d features, the complement vector $\bar{a}$ represents the absence of each feature:

$$\bar{a}_i = 1 - a_i \qquad (3.7)$$

The internal complement-coded input vector I is then of dimension 2d:

$$I = (a, \bar{a}) = (a_1, \ldots, a_d, \bar{a}_1, \ldots, \bar{a}_d) \qquad (3.8)$$

The norm of a fuzzy vector is the sum of all of its points: if a fuzzy vector x contains n points, its norm |x| is given by equation (3.9).


$$|x| = \sum_{i=1}^{n} x_i \qquad (3.9)$$

3.4.5 F2 output node activation

If a new category is detected then a new F2 output node is created with weights set to:

$$w_j^{new} = I \qquad (3.10)$$

When the F1 layer receives a complement-coded input pattern I, all of the output nodes in the F2 layer are activated to some extent. If the activation level of a node is denoted T, then the activation of the j-th output node with weights $w_j$ is $T_j$:

$$T_j = \frac{|I \wedge w_j|}{\alpha + |w_j|} \qquad (3.11)$$

where α is a small number, typically 0.0000001; this avoids unity activation for a node.

The winning node is then the node that has the highest activation value.

$$T_{win} = \max_j \{T_j\} \qquad (3.12)$$

If two or more output nodes share the winning value, then the node with the lowest index j is arbitrarily chosen to win. The category associated with this node becomes the network's classification for that input pattern.

A match function compares the complement-coded input features and the weights of the winning, selected output node to determine if learning should occur:


$$M = \frac{|I \wedge w_j|}{|I|} \qquad (3.13)$$

This equation may be simplified, due to the fact that the norm of any complement-coded vector is equal to the dimension d of the original input vector [18]:

$$M = \frac{|I \wedge w_j|}{d} \qquad (3.14)$$

3.4.6 Resonance and mismatch

If M is greater than or equal to the vigilance parameter ρ, then the selected j-th output node is capable of encoding the input I (provided node j represents the same category as the input vector I) and the network is said to be in a state of resonance. The output node may then update its weights; only one output node is allowed to alter its weights for any given training input vector. The resonance condition is:

$$M \ge \rho \qquad (3.15)$$

If the output node encodes a different category from the input vector, there is a 'category mismatch' condition: the node's activation is suppressed and its weights are not updated. If the match function value is less than the vigilance, a 'mismatch reset' condition applies: the current output node does not meet the granularity represented by the vigilance, its activation is suppressed and its weights are not updated. This prevents the category from becoming increasingly non-specific (low vigilance). The vigilance value is set to the match value of the winning node plus a small value α, equation (3.16). A new output node must then be formed with its initial weights set to match the input vector, equation (3.10).

$$\rho_{new} = M + \alpha \qquad (3.16)$$

The selected output node has its weight vector $w_j$ updated according to the rule:


$$w_j^{new} = \beta (I \wedge w_j^{old}) + (1 - \beta) w_j^{old}, \quad 0 \le \beta \le 1 \qquad (3.17)$$

The learning rate β may be set to 1 for 'fast learning'. In this case equation (3.17) reduces to a simple fuzzy AND of the input vector and the top-down weights of the selected output node:

$$w_j^{new} = I \wedge w_j^{old} \qquad (3.18)$$
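Equations (3.7) to (3.18) describe a small set of vector operations. The sketch below collects them as plain Python functions; it is illustrative rather than a complete Fuzzy ARTMAP implementation (no category search, map field or match tracking is shown), and the numeric values are made up:

```python
import numpy as np

def complement_code(a):
    """Complement coding (eqs. 3.7-3.8): a in [0,1]^d -> I in [0,1]^(2d)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

def choice(I, w, alpha=1e-7):
    """Category choice function T_j (eq. 3.11); fuzzy AND is the elementwise min."""
    return np.minimum(I, w).sum() / (alpha + w.sum())

def match(I, w, d):
    """Match function M (eq. 3.14); |I| = d for complement-coded input."""
    return np.minimum(I, w).sum() / d

def learn(I, w, beta=1.0):
    """Weight update (eq. 3.17); beta = 1 gives fast learning (eq. 3.18)."""
    return beta * np.minimum(I, w) + (1.0 - beta) * w

# A single resonance test against one stored category (illustrative values)
d = 2
I = complement_code([0.3, 0.7])
w = complement_code([0.25, 0.75])
rho = 0.5                                  # vigilance parameter
if match(I, w, d) >= rho:                  # resonance: the node may learn
    w = learn(I, w)
print(choice(I, w), w)
```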

Once trained, a classification may be made by presenting the network with an input vector. The ARTa module encodes this input and the nodes in the F2 layer are activated. The winning node is selected and the inter-ART module looks up the mapping for the category from the ARTb module. For testing, the returned category is compared to the correct category.

The network was trained on the circle in square data using a vigilance of 0.5, a learning rate of 0.5 and the number of training epochs set at 1. There were 667 training sets and 333 test sets; the data is split into training and test sets because Fuzzy ARTMAP is a supervised learning method, i.e. it must be told the correct classifications for a representative sample of the data to train it. This resulted in 72 ARTa categories and 2 ARTb categories being formed. It is fairly obvious that two ARTb categories will be formed, as there are two categories: inside the circle and outside the circle. The 72 ARTa categories formed show that, to adequately map the input vectors to the output categories, 72 sub-regions needed to be formed in the input space.

Category           Square  Circle  Total  Correct (%)
Square                173       9    182        95.05
Circle                 22     129    151        85.43
Total Correct (%)                               90.69

Table 3.4 Confusion matrix of circle in square for Fuzzy ARTMAP

These may not necessarily be of the same size; it is most likely that most of these nodes are at the boundary of the circle and square. Larger, sparser nodes would cover the extremes of the square.


When the network was tested using the 333 test cases, a total of 302 cases were correctly classified and 31 were incorrectly classified; table 3.4 gives a breakdown of these results. Better results could be obtained by varying the values of the vigilance, learning rate and number of training epochs. If, however, the vigilance value is set too high, then there is a tendency for the network to form too many ARTa nodes and therefore not to generalise sufficiently. A balance needs to be found for any problem so that good generalisation is obtained without compromising the ability to discriminate at the boundaries of categories.

Table 3.5 shows the results of using a Fuzzy ARTMAP network on the iris data, trained on 100 sets and tested using the remaining 50 sets. Of the 50 test sets, 48 test cases were correctly classified.

Class              Setosa  Versicolor  Virginica  Total  Correct (%)
Setosa                 16           0          0     16          100
Versicolor              0          16          1     17        94.12
Virginica               0           1         16     17        94.12
Total Correct (%)                                                 96

Table 3.5 Confusion matrix of irises for Fuzzy ARTMAP

3.5 Simplified Fuzzy Adaptive Resonance Theory Mapping (SFAM)

3.5.1 Overview

Simplified Fuzzy Adaptive Resonance Theory Mapping is a simplified version of Fuzzy ARTMAP [18]. A complement coder normalises the input and also provides the fuzzy complement of each value. This expanded input (I) is then passed to the input layer. Weights (w) from each output node sample the input layer, making the weighting top-down. The category layer replaces the ARTb module and merely holds the names of the (m) categories that the network is expected to classify. There is no need for an inter-ART module, as the output nodes hold the individual mappings for the categories.

Figure 3.6 Block diagram of the SFAM network (input layer, output category layer and category layer, fed by a raw input pattern of size d)

The vigilance parameter (ρ) is used in the learning phase of the network; its range is 0 to 1 and it is used to control the granularity of the output nodes. In general, higher vigilance values result in a greater number of output category nodes being formed. The network is able to self-adjust its vigilance during learning from some user-defined base value in response to errors found in classification. It is through this 'match tracking' that the network is able to adjust its own learning parameters to enable the production of a new output node or to reshape the decision regions. A block diagram of the SFAM network showing the main architecture is shown in figure 3.6.

3.5.2 Classification

Once SFAM has been trained, a feed-forward pass of a data set through the complement coder into the input layer triggers a classification. The output node activation function is evaluated for each output node in the network. The category of the input vector is found by assigning it the category of the most highly activated node $T_{win}$.


Class              Setosa  Versicolor  Virginica  Total  Correct (%)
Setosa                 16           0          0     16          100
Versicolor              0          16          1     17        94.12
Virginica               0           1         16     17        94.12
Total Correct (%)                                                 96

Table 3.6 SFAM results with iris data

Table 3.6 shows that one virginica sample was misclassified as versicolor and one versicolor sample was misclassified as virginica, exactly as for Fuzzy ARTMAP. Both of the fuzzy networks produce the same results for the classification of the irises. This was expected, as both techniques work in the same way. It may be seen that the misclassifications appear to occur around the boundary of the virginica and versicolor data sets. With more data obtained from this area, it is reasonable to assume that the classifiers could achieve higher classification rates. The SFAM network also produced the same results as the Fuzzy ARTMAP network for the circle in square data with the same network parameters.

3.5.3 Summary of ARTMAP and SFAM

Fuzzy ARTMAP and SFAM carry out supervised learning much like a back-propagation network, but are more sensitive to noisy data. If the vigilance parameter is initially set too high, the network can overtrain and map an output node to each input vector, becoming a look-up table. The networks are, however, self-organising, self-stabilising and suitable for real-time learning [19].

For a classification regime to be effective the training data must fully satisfy two criteria:
• Every class must be represented.
• For each class, statistical variation must be adequately represented.

In general, a large number of training sets will allow for noise effects if these are truly random. If the noise is not random, then the regime will learn the noise pattern, possibly masking the true data patterns. If the data classes are well separated, then few training sets may be needed to adequately describe the pattern.


However, if there are classes that fall near a decision boundary, then it is important to use a larger number of data sets from near that boundary.

3.6 Radial basis neural network

3.6.1 Overview

Radial basis function neural networks were popularised by Broomhead and Lowe in the late 1980s [20]; they are quick to train and conceptually elegant. The feature space is normalised to $[0,1]^n$ and is filled with M overlapping radial basis functions. The functions are continuous and reach a maximum value at the centre of the specific region covered, but assume a near-zero value outside it. There are several types of radial function, the most popular being the Gaussian. One way of describing an RBF network is that each radial function is a fuzzy set membership function in the feature space. Any feature vector x belongs to one or more of the response regions; it is fuzzified by each radial basis function, and these outputs are then summed to determine the match level for each class. This is very similar to the fuzzy based classifier that was constructed to map the sensor responses. The analogy, we feel, is a good one; the major difference between the simple fuzzy based classifier and the RBF network is the method of determining the Gaussian centres and widths and, for the network, the optimisation of the weights.

3.6.2 Architecture

The centre of each RBF is placed on a small cluster that represents a subclass; M functions therefore cover the feature space. The spread parameter (σ²) may be adjusted so that each function covers a larger area; adjacent RBFs usually overlap to some degree. The neurones represented by the M centres make up the single hidden layer of an N-M-C feed-forward artificial neural network, as shown in figure 3.7. The output layer C contains summing neurones with weighted connections to the hidden layer M that must be trained in a similar way to a multi-layer perceptron network.


Figure 3.7 Architecture of RBF classification network (inputs x, hidden radial basis layer, outputs $c_j$)

3.6.3 Operation

The operation of a trained network consists of presenting an input vector x; the input layer normalises the vector to [0,1]. The hidden layer then processes the normalised vector to produce a scaled response. Any input vector close to one of the M neurone centres will produce an output y that is greater than any other. The vector $y = (y_1, \ldots, y_M)$ output from the hidden layer is processed by each neurone of the output layer. It is usual to use a summing function (equation 3.19) or an averaging squashing function (equation 3.20) rather than the multi-layer perceptron sigmoid function.

$$c_j = \sum_{m=1}^{M} u_{mj} y_m \qquad (3.19)$$


$$c_j = \frac{\sum_{m=1}^{M} u_{mj} y_m}{y_1 + \cdots + y_M} \qquad (3.20)$$

The output vector c is then tested against each of the target vectors that identify the classes. The greatest output represents the highest activation and thus the input vector x is recognised.
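A minimal sketch of this forward pass, using Gaussian basis functions and the summing output of equation (3.19); the network sizes and parameter values are illustrative:

```python
import numpy as np

def rbf_forward(x, centres, sigma2, U):
    """Forward pass of an N-M-C RBF classifier.
    centres: (M, N) hidden-neurone centres; sigma2: spread parameter;
    U: (M, C) output weights. Returns hidden outputs y and class outputs c."""
    d2 = ((centres - x) ** 2).sum(axis=1)     # squared distance to each centre
    y = np.exp(-d2 / (2.0 * sigma2))          # Gaussian radial basis outputs
    c = U.T @ y                               # summing output layer (eq. 3.19)
    return y, c

# Illustrative 2-3-2 network on a point from the circle-in-square problem
rng = np.random.default_rng(3)
centres = rng.random((3, 2))
U = rng.uniform(-0.5, 0.5, size=(3, 2))
y, c = rbf_forward(np.array([0.5, 0.5]), centres, sigma2=0.1, U=U)
print(c.argmax())                             # index of the recognised class
```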

3.6.4 Training

The full training algorithm for radial basis function networks of Looney [21] allows adjustment of the hidden neurone centres v, the spread parameter σ² and the output weights u.

Typically the steepest descent algorithm is used to train the output weights u; the total sum-squared error E over all Q input vectors is minimised. t is the target output vector that identifies the classes. The spread parameters are initialised to 0.05 and the weights $u_{mj}$ are set randomly between -0.5 and 0.5.

$$E = \sum_{q=1}^{Q} \sum_{j=1}^{C} \left(t_j^{(q)} - z_j^{(q)}\right)^2 \qquad (3.21)$$

If η is the network learning rate, the steepest descent formula to optimise the output weights u is:

$$u_{mj}^{new} = u_{mj}^{old} + \left(\frac{2\eta}{M}\right) \sum_{q=1}^{Q} \left(t_j^{(q)} - z_j^{(q)}\right) y_m^{(q)} \qquad (3.22)$$

The function centres v (equation 3.23) and the spread parameter σ² (equation 3.24) are updated by analogous steepest descent steps; the explicit update formulas are given by Looney [21].
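As a sketch of the weight training just described, the batch update of equation (3.22) can be written in a few lines. The learning rate, epoch count and random stand-in data are illustrative, and the function name and interface are our own:

```python
import numpy as np

def train_output_weights(Y, T, eta=0.1, epochs=1000, seed=4):
    """Steepest-descent training of RBF output weights (eqs. 3.21-3.22).
    Y: (Q, M) hidden-layer outputs for all Q training vectors;
    T: (Q, C) target vectors identifying the classes."""
    rng = np.random.default_rng(seed)
    Q, M = Y.shape
    C = T.shape[1]
    U = rng.uniform(-0.5, 0.5, size=(M, C))   # random initial weights
    for _ in range(epochs):
        Z = Y @ U                             # network outputs z_j^(q)
        U += (2.0 * eta / M) * Y.T @ (T - Z)  # batch update over all Q vectors
    E = ((T - Y @ U) ** 2).sum()              # total sum-squared error (eq. 3.21)
    return U, E

# Tiny example: 10 training vectors, 3 hidden neurones, 2 classes
Y = np.random.default_rng(5).random((10, 3))
T = np.eye(2)[np.random.default_rng(6).integers(0, 2, size=10)]
U, E = train_output_weights(Y, T)
print(round(E, 4))
```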

3.6.5 Results using RBF networks

Class              Setosa  Versicolor  Virginica  Total  Correct (%)
Setosa                 16           0          0     16          100
Versicolor              0          16          1     17        94.12
Virginica               0           0         17     17          100
Total Correct (%)                                                 98

Table 3.7 RBF 4-4-3 network iris results

Table 3.7 shows the results of using an RBF network on the iris data. All setosa and virginica samples were correctly classified; one versicolor was misclassified as virginica.

As mentioned previously, fuzzy c-means is often used to pre-process data; here we have used it to reduce the four-dimensional data sets down to three-dimensional sets. An RBF network using a 3-3-3 configuration was trained on 120 data sets. We show the results of testing the network on all 150 data sets for direct comparison with the fuzzy c-means output (used as input to this network). Table 3.8 shows that 1 virginica sample was misclassified as versicolor, a significant improvement on the 14 misclassified samples from the c-means; this is due to the optimised supervised training that the neural network uses.

Class              Setosa  Versicolor  Virginica  Total  Correct (%)
Setosa                 50           0          0     50          100
Versicolor              0          41          9     50           82
Virginica               0           1         49     50           98
Total Correct (%)                                              93.33

Table 3.8 RBF 3-4-3 network (fuzzy c-means input) results for iris data


For comparison, a radial basis function neural network was trained on 667 sets of the circle in square data, then tested on the same 333 sets as the Fuzzy ARTMAP network. The architecture used was 2-20-1; training was carried out for 5000 epochs. The output node was set to 1 (circle) if the output was 0.5 or above and 0 (square) if the output was below 0.5. Of the 333 test cases, 320 were correctly classified and 13 were incorrect. Table 3.9 gives details of the results.

Class              Circle  Square  Total  Correct (%)
Circle                139      12    151        92.05
Square                  1     181    182        99.45
Total Correct (%)                               96.09

Table 3.9 2-20-1 RBF network for circle in square problem

3.7 Fuzzy classification of chemical sensor array data

In this section we will discuss a fuzzy logic based pattern recognition system for analysis of volatile compounds by chemical sensor array.

3.7.1 Introduction

Analysis of volatile compounds is important in a number of sectors, including manufacturing, medical and environmental. Volatile analysis is also important in sensory evaluation. Conventional instrumental techniques for the analysis of odours or volatile compounds are expensive, laboratory based and require technical skill to operate. Sensory evaluation by trained panellists is also expensive and may be susceptible to imprecision due to fatigue and physiological differences between the judges. There is therefore a great deal of interest in the development of inexpensive and portable instrumental methods for the analysis of odours.

Persaud and Dodd [22] proposed the concept of an electronic nose system. An electronic nose is a system comprising an array of electronic chemical sensors with partial specificity and an appropriate pattern-recognition system. Since the sensor array has only partial specificity to the odour or volatile compounds, it has good reversibility. Selectivity for the analyte is achieved from the pattern of the sensor array responses, which acts as a fingerprint for the analyte.

The electronic nose and the mammalian nose perform the same function, but clearly have many differences in operating principle, type and number of sensors, sensitivity and selectivity.


Electronic nose systems can employ a variety of gas sensors or, in some cases, a combination of sensor types. The gas sensors used may be divided into those that operate at high temperatures, e.g. the metal oxide semiconductor (MOS) and metal oxide field effect transistor (MOSFET) devices, and those that operate at around room temperature, such as conducting polymers, piezoelectric quartz crystals (also known as bulk acoustic wave or BAW devices) and surface acoustic wave (SAW) sensors [23]. Optical sensor arrays are also being investigated; these devices are often termed artificial noses [24].

Piezoelectric quartz crystal (PZQ) and SAW sensors are two of the most common mass sensors. They differ in that, in the former, an acoustic wave travels through the bulk of the material, while in the latter the acoustic wave travels on the surface. A mathematical relationship between the mass of material on the piezoelectric quartz crystal and the frequency shift was first derived by Sauerbrey [25] and is given in equation (3.25).

$$\Delta f = -2.3 \times 10^6 f_0^2 \frac{\Delta M_s}{A} \qquad (3.25)$$

where Δf is the change in frequency of the quartz crystal (Hz), f₀ is the resonant frequency of the quartz crystal (MHz), ΔMs is the mass of the coating or substance sorbed (g) and A is the coated area (cm²). These devices are converted into chemical or biosensors by incorporation of a chemical or biochemical layer on the device surface, which abstracts the analyte from the sample stream. Since a wide range of coatings can be applied to the device surface, these sensors have very broad selectivity. The responses from the sensor array can be analysed using pattern recognition methods [26]. Unsupervised methods such as principal component analysis and cluster analysis are used in exploratory data analysis, since they attempt to identify a gas mixture without prior information. These techniques are most appropriate when no example of the different sample groups is available or when a hidden relationship between samples or variables is suspected. Supervised learning techniques such as artificial neural networks and fuzzy logic can be used to classify a sample by developing a mathematical model relating training data to a set of descriptors. The test samples are then evaluated against a knowledge base and the predicted class membership determined. Neural network and fuzzy logic methods are attractive since they are able to deal with non-linear problems. Electronic noses have been used with some success: Gardner and co-workers [19][26][27] claim a classification rate of 97% on coffee and 79% on cow's breath, and 100% on cyanobacteria samples in water, using a neural network to classify the data.
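A short worked example of equation (3.25); the numerical inputs (a 10 MHz crystal, 50 ng sorbed over a 0.2 cm² coated area) are illustrative values, not data from the study:

```python
# Worked example of the Sauerbrey relation (eq. 3.25); inputs are illustrative
f0 = 10.0      # resonant frequency (MHz)
dMs = 50e-9    # sorbed mass (g)
A = 0.2        # coated area (cm^2)
df = -2.3e6 * f0 ** 2 * dMs / A
print(f"frequency shift = {df:.1f} Hz")   # about -57.5 Hz
```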

Figure 3.8 Schematic of volatile sensing rig (air supply split into reference and sample flows, sample held at constant temperature, sensor array)

A schematic of the flow rig for a piezoelectric quartz crystal based array for headspace analysis is shown in figure 3.8. This system was used to obtain data for three types of vegetable oil: extra-virgin olive oil, non-virgin olive oil and sunflower oil. These data are used to demonstrate a fuzzy based classification system. The sensor array consists of six crystal sensors, each having a fundamental frequency of 10 MHz; each PZQ was coated with a gas chromatography stationary phase containing a different functional group, providing limited selectivity to components in the analysis stream. A reference PZQ allows for base reading compensation. The sensors were conditioned prior to use by passing nitrogen over their surface for six hours. A valve switches between the reference and sample gas streams. Sampling was performed over a 3-minute cycle: 1 minute base line reading (reference) and 2 minutes response (sample). After each reading the sample chamber was purged with reference nitrogen for 5 minutes prior to the introduction of the next sample. A total of 346 samples were taken, consisting of 112 extra-virgin olive oil, 126 non-virgin olive oil and 108 sunflower oil samples.



Figure 3.9 Typical dynamic sensor responses

Figure 3.9 shows a typical sensor response to an analyte; eight response sets are shown for an OV-210 coated sensor responding to sunflower oil. The response curves approximate to an open-loop first-order response. To enable a reasonably stable reading to be made for classification, the response at 120 seconds was chosen. At this time the sensor has almost reached equilibrium with the analyte, so small experimental errors will have little effect on the frequency change observed. The actual equation for the sensor response is (as described by Freeman [28]) the sum of two exponentials, given by equation (3.26).

$$\Delta f(t) = a_1 \left(1 - e^{-a_2 t}\right) + a_3 \left(1 - e^{-a_4 t}\right) \qquad (3.26)$$

where a₁ to a₄ are constants and t is time (seconds).

The attributes of the gas (frequency change for each sensor, giving 6 features) were calculated and stored in a pattern matrix.

Figure 3.10 Histogram of OV-210 sensor frequency change with sunflower oil (response at 120 seconds; mean 38.3056 Hz, standard deviation 6.2905 Hz, sample size 108)

3.7.2 Analysis of histogram of sensor frequency shifts

The histogram of figure 3.10 shows that, for the OV-210 coated sensor, the response to sunflower oil has a mean frequency change of 38 Hz and a standard deviation of 6.2 Hz. The overall shape of the histogram may be likened to a Gaussian function as defined in equation (3.27); we may therefore model the sensor response using a Gaussian radial basis function (RBF). Similar analysis of the other sensors with all of the analytes gives similar results.

$$\mu(x_i) = \exp\!\left[-\frac{(x_i - \bar{x}_i)^2}{2\sigma_i^2}\right], \quad i = 1, 2, \ldots, n \qquad (3.27)$$

where $\bar{x}$ is the mean of x, the analyte data set for that sensor, and σ is the corresponding standard deviation.

3.7.3 Classifier design

The classifier is modelled on the sensor response sets; it therefore uses Gaussian functions to describe the data in fuzzy terms.


Each fuzzy set has a mean frequency and a standard deviation for that frequency. All sensors used have such a fuzzy set for each analyte under investigation. The grade of membership for a given frequency change from the base value is given by the Gaussian equation (3.27).

3.7.4 Fuzzy processing

Given an unknown analyte, an n by c partition matrix is constructed, where n is the number of sensors and c the number of categories. For example, for the OV-210 coated sensor with sunflower oil as the analyte, the mean frequency is 38 Hz and the standard deviation is 6.2 Hz. If an unknown analyte gives a sensor response of 35 Hz for this sensor, then from equation (3.27) above the fuzzy membership is 0.871. All memberships (each sensor for each known analyte) are calculated:

$$U = \begin{bmatrix} \mu_{11} & \mu_{12} & \mu_{13} \\ \mu_{21} & \mu_{22} & \mu_{23} \\ \vdots & \vdots & \vdots \\ \mu_{n1} & \mu_{n2} & \mu_{n3} \end{bmatrix} = [\mu_{GO} \ \ \mu_{GE} \ \ \mu_{GS}] \qquad (3.28)$$

where $\mu_{GO}$ is the column-vector fuzzy group set for non-virgin olive oil, $\mu_{GO} = [\mu_{11}, \ldots, \mu_{n1}]^T$, $\mu_{GE}$ that for extra-virgin olive oil and $\mu_{GS}$ that for sunflower oil;

then

$$C_G = \{\mu_{GO}, \mu_{GE}, \mu_{GS}\} \qquad (3.29)$$

To calculate the group sets we must defuzzify the partition matrix. There are several defuzzification methods available, for example the minimum or the maximum of the vector; the one used here was the mean of height [29][30].

$$\mu_{G_i} = \frac{1}{n} \sum_{x=1}^{n} \mu_{xi}, \quad 1 \le i \le c \qquad (3.30)$$


Figure 3.11 Mean of height defuzzification

Equation (3.30) and figure 3.11 show the mean of height defuzzification method for an array of n sensors. The mean is taken of the grades of membership of each sensor for the analyte under investigation. The classification category is $C_G(\alpha)$, where α indexes the maximum of the fuzzy set $C_G$:

$$C_G(\alpha) = \max\{\mu_{GO}, \mu_{GE}, \mu_{GS}\} \qquad (3.31)$$

This method also gives a match level for each analyte, a perfect match being 1. The analyte with the best match is the winner; if this match equals or exceeds 0.5 then a classification has been made. If the match is below 0.5 then no classification is made.
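Pulling equations (3.27) to (3.31) together, the whole classifier fits in a few lines. A minimal sketch; the two-sensor parameter table is hypothetical except for the OV-210/sunflower values (38 Hz mean, 6.2 Hz standard deviation) quoted in the text:

```python
import numpy as np

def classify_volatile(response, means, stds, class_names, threshold=0.5):
    """Fuzzy sensor-array classifier following eqs. (3.27)-(3.31).
    response: (n,) frequency shifts; means, stds: (n, c) Gaussian parameters
    per sensor and analyte, estimated from training histograms."""
    U = np.exp(-(response[:, None] - means) ** 2 / (2.0 * stds ** 2))  # eq. 3.27
    group = U.mean(axis=0)            # mean-of-height defuzzification (eq. 3.30)
    best = int(group.argmax())        # best-matching analyte (eq. 3.31)
    if group[best] >= threshold:      # match must reach 0.5 to classify
        return class_names[best], group[best]
    return None, group[best]          # no classification made

# Hypothetical 2-sensor, 3-analyte parameter table (first entry from the text)
means = np.array([[38.0, 55.0, 70.0],
                  [20.0, 31.0, 45.0]])
stds = np.array([[6.2, 5.0, 7.0],
                 [3.0, 4.0, 5.0]])
names = ["sunflower", "non-virgin olive", "extra-virgin olive"]
print(classify_volatile(np.array([35.0, 22.0]), means, stds, names))
```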


3.7.5 Analysis of edible oil data

Figure 3.12 PCA plot for edible oil data

The first two principal components in a PCA scores plot for the 346 oil samples are shown in figure 3.12. The data set consists of 112 extra-virgin olive (EVO) oil, 126 olive oil (Ol) and 108 sunflower oil (SFO) samples. The data classes are clearly visible as clusters. The non-virgin olive oil forms a tight cluster to the right of the plot with a centre of (0.12, 0); the sunflower forms a less tight cluster to its left, centre (0.03, 0), and there is a small degree of overlap between these two classes. The extra-virgin olive oil data forms a loose cluster to the left of the plot, centre (-0.18, 0); there are several outliers to this cluster, some of which are closer to the sunflower cluster centre than to the extra-virgin olive oil cluster centre. Here the data has been split into 233 training sets and 113 sets to test the classifier. This ratio is usual for supervised classification methods, as it avoids the model simply memorising the data and becoming a look-up table, while retaining enough data sets to adequately test the model. The test set is chosen randomly from the entire data set and is therefore independent.

3.7.6 Fuzzy classifier results with edible oil data

Table 3.10 shows the details of the results of using the fuzzy classifier. The fuzzy classifier achieved a high overall classification rate of 99% on the test set; one sunflower oil sample was incorrectly classified as non-virgin olive oil. The mapping of one fuzzy set per sensor for each analyte enables accurate modelling of the system.


Class               Extra-Virgin  Non-Virgin  Sunflower  Total  Correct (%)
                       Olive Oil   Olive Oil
Extra-Virgin Olive            37           0          0     37          100
Non-Virgin Olive               0          42          0     42          100
Sunflower                      0           1         33     34           97
Total Correct (%)                                                    99.12

Table 3.10 Confusion matrix of oil data results using fuzzy classifier

4. Conclusion

This chapter has presented some basic fuzzy theory and demonstrated how it may be used for the classification of data. Other methods of classification and applications of fuzzy logic in analytical science may be found in the references listed as further reading.

References

1. L.A. Zadeh. Fuzzy Sets. Info. and Control, Vol. 8, 1965, 338-353
2. L.A. Zadeh. Fuzzy Algorithms. Info. and Control, Vol. 12, 1968, 94-102
3. G.J. Klir, B. Yuan. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, 1995
4. E. Cox. The Fuzzy Systems Handbook, Second Edition. A.P. Professional, Academic Press, 1999
5. E. Anderson. The Irises of the Gaspe Peninsula. Bull. Amer. Iris Soc. 59 (1934) 2-5
6. H.G. Byun, K.C. Persaud, S.M. Khaffaf, P.J. Hobbs, T.H. Misselbrook. Computers and Electronics in Agriculture, (1997), 17, 233-247
7. R. Callan. The Essence of Neural Networks. Prentice Hall, 1999
8. C.M. Bishop, M. Svensen, C.K.I. Williams. Neural Computation, 1998, Vol. 10, No. 1, 215
9. J.C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernetics, Vol. 3, No. 3, 32-57, 1973
10. J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, N.Y., 1981
11. D.E. Gustavson, W.C. Kessel. Proc. IEEE CDC, 761-766, 1979


12. S. Grossberg. Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics 23, 1976, 121-134
13. S. Grossberg. Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics 23, 1976, 187-202
14. G.A. Carpenter, S. Grossberg. A massively parallel architecture for a self-organizing neural pattern-recognition machine. Computer Vision, Graphics and Image Processing 37, 116-165, 1995
15. K. Gurney. An Introduction to Neural Networks. UCL Press, 1997
16. G.A. Carpenter, S. Grossberg, J.H. Reynolds. A Fuzzy ARTMAP Nonparametric Probability Estimator for Nonstationary Pattern Recognition Problems. IEEE Trans. on Neural Networks, Vol. 6, No. 6, Nov. 1995
17. G.A. Carpenter, S. Grossberg, et al. Fuzzy ART: An adaptive resonance algorithm for rapid, stable classification of analog patterns. International Joint Conference on Neural Networks, 411-416, 1991
18. T. Kasuba. Simplified Fuzzy ARTMAP. AI Expert, November (1993), 18-25
19. E. Llobet, et al. Fuzzy ARTMAP based electronic nose data analysis. Sensors and Actuators B 61 (1999) 183-190
20. D.S. Broomhead, D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, Vol. 2, 321-355, 1988
21. C.G. Looney. Pattern Recognition using Neural Networks. Oxford University Press, 1997
22. K. Persaud, G.H. Dodd. Nature, 1982, 299, 352
23. Z. Ali. J. Thermal Analysis Calorimetry, 55(2), 397, 1999
24. T.A. Dickinson, J. White, J.S. Kauer, D.R. Walt. Nature, 1996, 382, 697
25. G.Z. Sauerbrey. Z. Phys., (1959), 155, 206
26. J.W. Gardner, E.L. Hines. Handbook of Biosensors and Electronic Noses: Medicine, Food and the Environment, Ed. E. Kress-Rogers, CRC Press, Ch. 27, 633, 1996
27. J.W. Gardner, H.W. Shin, E.L. Hines, C.S. Dow. An electronic nose system for monitoring the quality of potable water. Sensors and Actuators B 69 (2000) 336-341
28. N.J. Freeman, et al. J. Chem. Soc. Faraday Trans., 1994, 90(5), 751-754
29. B. Kosko. Neural Networks and Fuzzy Systems. Prentice-Hall, Englewood Cliffs, NJ, 1992
30. D. Drainkov, et al. An Introduction to Fuzzy Control, 2nd Ed. Springer, 1996

Further Reading 1. D.H. Rouvary. Fuzzy Logic in Chemistry. Academic Press. 1997 2. GJ. Klir, Bo Yuan. Fuzzy sets, fuzzy logic and fuzzy systems. World

Scientific. 1996


Application of Artificial Neural Networks, Fuzzy Neural Networks, and Genetic Algorithms to Biochemical Engineering

Taizo Hanai, Hiroyuki Honda and Takeshi Kobayashi

Department of Biotechnology, Graduate School of Engineering, Nagoya University

Furo-cho, Chikusa-ku, Nagoya 464-8603, Japan

Summary: In bioengineering processes, many complex and nonlinear biochemical reactions occur simultaneously, since a variety of microorganisms and enzymes are present in the system. Thus, it may be difficult to describe the process with conventional mathematical models and use such models for process control. Recently, soft computing methods such as artificial neural networks, fuzzy reasoning, fuzzy neural networks, and the genetic algorithm have been applied to the modeling and control of bioengineering processes. In this chapter, three applications to the Japanese sake making process are reviewed, and the manner in which soft computing methods can help in the interpretation and control of this process is discussed. Knowledge extraction from a sake brewing expert, called TOJI, was carried out with the aim of optimizing the temperature control of the mashing process using fuzzy reasoning and fuzzy neural networks. We also discuss the determination of optimum process temperature and humidity using artificial neural networks and genetic algorithms.

Keywords: biochemical engineering, sake mashing process, artificial neural network, fuzzy neural network, genetic algorithm

1 Introduction

In the field of chemical engineering, simulation models are widely used for the control of chemical plants. Structured mathematical models based on the appropriate physical and chemical theory are usually used. However, in some cases, it is not possible to express these models in closed mathematical form, due to the complexity of conditions in the reactor, the nonlinearity of the chemical reactions taking place within it, and the disturbance of liquid flow. One means of tackling these problems is to use soft computing algorithms [1-4]. In this section,



we discuss the application of soft computing methods to biochemical processes, with particular reference to the sake mashing process, which presents a more complex system than those often encountered in chemical plants.

Alcoholic beverages such as beer and wine are manufactured through microbial fermentation. In Japan, sake is a particularly important product. The industrial sake mashing process has in the past often been under the control of an expert known as TOJI. However, most TOJI are getting older, and the shortage of suitable successors has been recognized as a serious problem for the sake industry. To address these problems, a wide variety of approaches to the automation and mechanization of the mashing processes have been reported [5-19]. Generally, it is difficult to describe such a biochemical process using mathematical models, since the relationships among the state variables are nonlinear. The control of these processes based on conventional control theory is therefore difficult. Moreover, in the sake mashing process, microorganisms such as Saccharomyces cerevisiae and Aspergillus oryzae are present together with enzymes, and these interact. The saccharification of rice starch and the conversion of the resulting glucose to alcohol take place concurrently. To model this system mathematically, a large number of equations and parameters are required. Consequently, in practice the process has been controlled by expert operators, TOJI, drawing on their practical experience.

For the control and the simulation of this type of process, the soft computing methods that have been developed over the last three decades are potentially very valuable. These methods can - in favorable circumstances - substitute for the process operator and sometimes the expert operator also, if the algorithm can extract and reproduce the expert's knowledge. The most widely used soft computing methods for this purpose are Expert Systems, Fuzzy Reasoning [1], Artificial Neural Networks (ANN) [3], Fuzzy Neural Networks (FNN) [2] and the Genetic Algorithm (GA) [4].

Many researchers, including those in our group, have studied the application of these methods to the temperature control of the sake mashing process [5-19]. One of the critical parameters in sake mashing is the temperature in the fermentation tank. In these studies, an interview with the expert operator, TOJI, was carried out. Using information from the interview and historical data for the mashing process controlled by TOJI, temperature control strategies were extracted. Fuzzy reasoning and FNNs were then used to reproduce the strategies. The resulting strategy was applied to the sake mashing, and the sake produced was compared with that produced by TOJI.

KOJI is the rice grain on which Aspergillus oryzae is cultured, and provides enzymes for the mashing. The quantitative balance of these enzymes is very important for sake mashing [20]. Humidity and temperature both affect the KOJI making process, and our group has constructed an ANN model which can estimate the enzyme activity of KOJI, given the humidity and temperature of the room. To



calculate the temperature and humidity required to obtain given enzyme activities, the combination of this ANN model and a GA was investigated [21].

This chapter is divided into the following sections: 1. the extraction and reproduction of control knowledge for the mashing process using fuzzy reasoning, 2. the extraction and reproduction of control knowledge for the mashing process using FNN, 3. the construction of a simulation for the KOJI making process using ANN, and 4. the determination of the appropriate temperature and humidity using ANN and GA. Some possibilities for future work are also discussed.

2 Application of fuzzy reasoning to the temperature control of the sake mashing process

2.1 Fuzzy reasoning

Fuzzy reasoning was proposed by Zadeh in 1965 [1] and has been widely used by researchers both in biochemical engineering and in other fields. This method can transform human reasoning into rules suitable for process control.

Various studies on the temperature control of the mashing process based on fuzzy reasoning have been published [6, 9-11, 16]. Production rules and membership functions are necessary for calculation within fuzzy reasoning. A production rule represents expert knowledge in a form such as "IF A is B, THEN C should be D"; these are known as "IF-THEN" rules, and can readily be summarized in tabular form. A membership function is used to relate a numerical value of a variable (such as temperature or specific gravity) to its "grade", which shows the degree to which the variable belongs to a certain class (such as "high" or "medium").

                DB:  NB   NM   NS   ZE   PS   PM   PB
    DD:  NB          NS   ZE   PS   PS   PM   PB   PB
         NM          NS   ZE   ZE   PS   PM   PB   PB
         NS          NS   ZE   ZE   ZE   PM   PB   PB
         ZE          NS   ZE   ZE   ZE   PS   PM   PB
         PS          NS   NS   ZE   ZE   PS   PM   PB
         PM          NS   NS   ZE   ZE   ZE   PS   PM
         PB          NS   NS   ZE   ZE   ZE   ZE   PS

Fig. 1 Production rules of fuzzy control

Key: PB: positive big; PM: positive medium; PS: positive small; ZE: zero; NS: negative small; NM: negative medium; NB: negative big

[Figure: membership functions "Low", "Medium" and "High" plotted as grade against alcohol concentration [%] over the range 14-20%.]

Fig. 2 Membership function of alcohol concentration




In the sake mashing process, the specific gravity (Baume) is determined largely by the glucose and alcohol concentrations; the mashing temperature is the only controllable variable. Since the fermentation progresses slowly, the mashing temperature, specific gravity and alcohol concentration are typically measured just once a day. When the temperature of the process is to be changed, a new set point is determined by fuzzy reasoning from these measured values. Oishi et al. reported production rules for temperature control [6], using fuzzy reasoning to relate the alcohol concentration and the specific gravity to a reference value calculated from "historical" data, that is, all previous data available from the progress of a particular mashing run. Figure 1 shows the production rules when the glucose concentration is low. DB is the difference between the reference specific gravity and that calculated assuming that mashing continues at the present temperature. DD is the difference between the rate of change of the reference and estimated specific gravity. From the production rules in Fig. 1, it was found that the following IF-THEN rule is included:

" IF the estimated specific gravity is much higher than the reference (DD is PB) and the rate of change of the specific gravity is much less than the reference (DB is NB), THEN the set point mashing temperature for the fol1owing day should be much higher than that for today (set point should be PB)".

Figure 2 shows an example of a membership function, which was reported by Tsuchiya et al. [16]. Time course data for the mashing process from many sake companies in the Hiroshima prefecture were collected, and the membership functions for every mashing day were prepared. Figure 2 shows the membership functions for the alcohol concentration on the 13th day. If the alcohol concentration is 15.9%, the grade for "Low concentration" is 60% and that for "Medium concentration" is 40%.
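This grade calculation can be sketched in a few lines of Python. The breakpoints below are illustrative assumptions chosen so that the worked example above is reproduced (15.9% gives a grade of about 0.6 for "Low" and 0.4 for "Medium"); they are not read off the published figure.

def trapezoid(x, a, b, c, d):
    # Trapezoidal membership: 0 below a, rises a->b, flat b->c, falls c->d.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def grades(alcohol):
    # Breakpoints are assumptions for illustration, not taken from Fig. 2.
    return {
        "Low":    trapezoid(alcohol, 12.0, 14.0, 15.0, 17.25),
        "Medium": trapezoid(alcohol, 15.0, 17.25, 17.25, 19.0),
        "High":   trapezoid(alcohol, 17.25, 19.0, 20.0, 22.0),
    }

print(grades(15.9))   # -> Low about 0.6, Medium about 0.4, High 0.0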

To use fuzzy reasoning for the control of practical processes, the membership functions and production rules generally need to be tuned to achieve optimum results. Tuning is important, but is typically rather time-consuming. A simulation taking advantage of historical data is very useful for this step.

The performance of fuzzy reasoning systems, including ours, was checked against real data for the mashing process. During the entire period, the specific gravity, the alcohol concentration and the temperature in these studies were almost identical to those in a control mashing operated by TOJI, and the sake produced from the plant using fuzzy reasoning had almost the same taste and flavor as that produced by TOJI.

The tuning step relies upon trial-and-error. Whether a good result for this step can be obtained from fuzzy reasoning depends on the quality of the data provided by the engineers. To overcome this difficulty, a fuzzy neural network (FNN) [2] or a

Page 148: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

Application of Artificial Neural Networks, Fuzzy Neural Networks 139

fuzzy reasoning combined with a GA [22], which can generate suitable membership functions automatically, is used.

2.2 Fuzzy neural network (FNN)

Fuzzy reasoning has been used in many cases in which an expert's knowledge can be extracted and reconstructed in a computer. However, it takes a relatively long time to tune membership functions by trial-and-error, and the expert's experience is still needed to formulate rules for fuzzy reasoning and to assess the quality of any rules which are produced during this refinement. An ANN can be constructed automatically if a large quantity of historical data is available. Nevertheless, it is difficult to interpret the link between input and output variables in the final model, because the structure of an ANN is very complex; in essence, it behaves like a black box.

Recently, fuzzy neural networks (FNNs) have been proposed as a tool for fuzzy modeling. FNNs have neural network structures in which the connection weights have a direct interpretation in terms of fuzzy production rules and membership functions. FNNs have been applied to many processes in chemical and biochemical engineering.

In this section, "Type 1" FNNs, proposed by Horikawa et at. [2], which have been used in our study, are explained. The FNN realizes a simplified fuzzy inference of which the consequences are described with singletons. The inputs are non-fuzzy numbers.

Fig. 3 Structure of a fuzzy neural network

Page 149: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

140 T. Hanai, H.Honda, and T. Kobayashi

Figure 3 shows the structure of the FNN, in which the FNN has two inputs x1 and x2, one output y* and three membership functions in each premise. The circles and squares in Fig. 3 indicate units of the neural network, while wc, wg, wf, 1 and -1 are the connection weights. The connection weights wc and wg determine the positions and gradients of the sigmoid functions "f" in the units in the (C)-layer, where the sigmoid functions "f" are defined as follows:

f(x) = 1 / [1 + exp{-wg (x + wc)}]

Each membership function consists of one or two sigmoid functions.

In the FNN, the membership functions in the premises are tuned and the fuzzy rules are identified by adjusting the connection weights wc, wg and wf through the back-propagation learning algorithm [3]. The connection weights wc and wg are initialized so that the membership functions in the premises are equal. The value of wf is initialized to zero. The FNN thus has no rules at the start of learning.
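A minimal sketch of this simplified fuzzy inference is given below; it is an illustration under stated assumptions, not the authors' implementation. The membership functions are built from one or two sigmoids with position wc and gradient wg, the rule firing strengths are products of the premise grades, and the output y* is the firing-strength-weighted average of the singleton consequents wf. All parameter values are illustrative (the consequents used are those of the rule table identified later, in Table 4):

import numpy as np

def sigmoid(x, wg, wc):
    # Position wc and gradient wg, as in f(x) = 1/[1 + exp{-wg(x + wc)}].
    return 1.0 / (1.0 + np.exp(-wg * (x + wc)))

def memberships(x, wg=10.0, wc=(-0.35, -0.65)):
    # Three fuzzy sets ("small", "medium", "big"), each built from one or
    # two sigmoids; the values of wg and wc here are illustrative only.
    s1, s2 = sigmoid(x, wg, wc[0]), sigmoid(x, wg, wc[1])
    return np.array([1.0 - s1, s1 - s2, s2])

def fnn_output(x1, x2, wf):
    # Simplified fuzzy inference: firing strengths are products of the
    # premise grades; y* is their weighted average of the singletons wf.
    mu = np.outer(memberships(x1), memberships(x2)).ravel()
    return float(mu @ wf / mu.sum())

# Nine singleton consequents (temperature rows x Baume columns, cf. Table 4):
wf = np.array([8.7, 9.0, 9.0, 9.7, 9.8, 10.1, 10.0, 10.3, 11.1])
print(fnn_output(0.4, 0.7, wf))   # set temperature implied by the rules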

In our study, it was recommended that the data should be divided in half. If all data are used for training, neural networks tend to overfit the data and lose generalization. Those data sets that were not used for training were retained for evaluation and used to help monitor for overfitting. If insufficient data were available, the fraction of the data set used for training was increased at the expense of data in the evaluation set.

The mean difference between the estimated and actual values, given by the following equation, was used as a performance index J for assessing the model:

J = (J1 + J2) / 2

in which J1 and J2 are the mean values of the squares of the error for the training and evaluation data, respectively.

To encourage convergence, training was carried out in the following fashion. At first, training was done for the consequence only. The learning rates for the connection weights wc, wg and wf were initially set to 0.0, 0.0 and 0.01, respectively. The maximum learning time was 10,000 cycles. Learning was stopped when the value of J reached a minimum within the maximum learning time. Next, learning was done for both the premise and the consequence. The learning rates for wc, wg and wf were initially set to 0.01, 0.01 and 0.01, respectively. Other learning conditions were the same as those for the consequence-only case.

When many candidates for the input variables are available, it is necessary to select the input variables that most strongly influence the output variable. The number of membership functions for each input variable must also be decided. A parameter increasing method (PIM) has been proposed by Horikawa [2]. This method identifies a fuzzy model by increasing such parameters as the number of


Table 1 Minimum and maximum values of input and output variables in each control region

                                              Control region 1  Control region 2  Control region 3    Control region 4
                                              (day 1-9)         (day 10 - BOUZU)  (BOUZU - Baume 2.0) (Baume under 2.0)
Input variable                                Min.    Max.      Min.    Max.      Min.    Max.        Min.    Max.
x1   Mashing time [d]                         1       9         10      15        11      21          16      35
x2   Baume [-]                                3.95    7.34      4.00    6.12      2.00    5.16        -0.82   1.99
x3   Alcohol concentration [%]                0.00    9.71      4.14    10.10     8.53    13.00       11.60   17.10
x4   Temperature [°C]                         5.5     10.0      8.5     10.5      5.5     10.5        3.5     9.0
x5   ΔBaume [-]                               -0.46   1.29      -0.53   0.46      -0.53   -0.23       -0.50   0.00
x6   Max. Baume [-]                           3.95    7.49      5.16    7.49      5.16    7.49        5.16    7.49
x7   Init. Baume [-]                          3.95    7.34      3.95    7.34      3.95    7.34        3.95    7.34
x8   Max. Baume - Init. Baume [-]             0.00    2.50      0.00    2.50      0.00    2.50        0.00    2.50
x9   Ave. Baume [-]                           -0.71   5.55      -0.94   1.06      -0.67   1.11        -0.62   1.17
x10  Ave. alcohol conc. - alcohol conc. [%]   -6.94   4.11      -3.14   2.81      -2.78   2.04        -1.77   1.49
x11  Reference BMD - BMD [d]                                                                          -18.12  23.80
Output variable: Set temperature [°C]         6.2     10.4      8.5     10.5      5.0     10.5        4.5     9.0

BOUZU: the disappearance of the fermentation foam from the fermentation tank. We defined the BOUZU day as that on which the alcohol concentration increases over 10.5% and the Baume decreases under 3.5.

membership functions and/or the number of input variables step by step. PIM will be explained using an example from our study.

Time course data from twenty-five sake mashings were used. The entire mashing process takes 33 to 36 days. Since data were collected once a day, the number of data sets per mashing is accordingly 33 to 36. The data available for any given day (x1) consisted of the temperature (x4), specific gravity (x2) and alcohol concentration (x3); these values are listed in Table 1.

From the values of these parameters, the following values were calculated, since TOJI uses them to set the temperature for the following day: rate of change of specific gravity (Δ specific gravity; x5), maximum specific gravity (x6), initial specific gravity (x7), difference between the maximum specific gravity and its initial value (x8), difference between the average specific gravity in the past mashing period and the present value (x9), difference between the average alcohol concentration in the past mashing period and the present alcohol concentration (x10), and the difference between the reference BMD (Baume Multiple Day) and the actual BMD (x11). These 11 variables were selected as candidates for the input variables to the FNN. The temperature to be set for the next day (y*) was selected as the output variable. The reference BMD curve was calculated by least squares fitting of the 25 time course data sets.
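A sketch of these derived inputs, with hypothetical function and variable names, is given below; the BMD is assumed here to be the specific gravity (Baume) multiplied by the mashing day, which the text does not spell out:

import numpy as np

def derived_features(day, baume, alcohol, ref_bmd):
    # baume, alcohol: daily measurements for days 1..day (day >= 2);
    # ref_bmd: the reference BMD for this day from the least-squares curve.
    x5 = baume[-1] - baume[-2]            # delta specific gravity (daily change)
    x6 = max(baume)                       # maximum specific gravity so far
    x7 = baume[0]                         # initial specific gravity
    x8 = x6 - x7                          # maximum minus initial value
    x9 = np.mean(baume) - baume[-1]       # past average gravity minus present
    x10 = np.mean(alcohol) - alcohol[-1]  # past average alcohol minus present
    x11 = ref_bmd - baume[-1] * day       # reference BMD minus actual BMD
    return x5, x6, x7, x8, x9, x10, x11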

The period of the mashing process was separated into four control regions, as listed in Table 1, since different control strategies are used by TOJI in each control region. The last day of the mashing is decided by TOJI as the day when the specific gravity equals that of water, and is normally between day 29 and day 33. The minimum and maximum values of the input and output variables in each control region are also listed in Table 1. BOUZU means the disappearance of the fermentation foam from the fermentation tank. We defined the BOUZU day as that on which the alcohol concentration exceeds 10.5% and the specific gravity decreases below 3.5.



Table 2 Identification of the structure of the FNN model by PIM in control region 2

[Table: each PIM step lists the candidate model structures with their J1, J2 and J values, from single-input models with two membership functions up to three-input models with up to four membership functions per variable. xj[k] denotes k membership functions in the premise for input variable xj; * marks the better model in each step, ** the best model with the smallest J. The best model found was x2[3] x4[3], with J1 = 0.009, J2 = 0.016 and J = 0.013. The variable indices of most intermediate candidates are illegible in the source.]

Two thirds of the input and output data in each control region were selected at random for training the FNN, and one third was used for evaluation. Membership functions were constructed using sigmoid functions, and the input and output data were normalized to the range 0.05 to 0.95.
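The normalization used here is ordinary min-max scaling to 0.05-0.95, sketched below; the margin presumably keeps values away from the flat tails of the sigmoids, although the text does not state the reason:

import numpy as np

def scale(x, lo, hi):
    # Map a raw value in [lo, hi] onto [0.05, 0.95].
    return 0.05 + 0.90 * (np.asarray(x, dtype=float) - lo) / (hi - lo)

def unscale(y, lo, hi):
    # Inverse mapping, back to the real scale.
    return lo + (np.asarray(y, dtype=float) - 0.05) * (hi - lo) / 0.90

print(scale(9.5, 8.5, 10.5))   # temperature 9.5 °C in region 2 -> 0.5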



Table 2 shows the process of identification of the FNN model by PIM in control region 2. xj[k] in the input variable column means that there are k membership functions in the premise for the input variable xj; j corresponds to the numbering described in Table 1. The model with the lowest performance index J was selected in each step. In the best model, specific gravity and temperature, each with three membership functions, were finally chosen as input variables.
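Schematically, the PIM search can be written as below; train_fnn is a placeholder (not from the original work) that trains an FNN with the given structure and returns J = (J1 + J2)/2, and the stopping rule shown (stop when J no longer decreases) is one plausible reading of "increasing parameters step by step":

def pim(candidates, train_fnn, max_steps=10):
    # Structures are dicts {variable: number of membership functions}.
    best, best_j = None, float("inf")
    for _ in range(max_steps):
        current = best or {}
        # Either add one new input variable with two membership functions...
        trials = [{**current, v: 2} for v in candidates if v not in current]
        # ...or add one membership function to a variable already included.
        trials += [{**current, v: k + 1} for v, k in current.items()]
        scored = [(train_fnn(t), t) for t in trials]
        step_j, step_best = min(scored, key=lambda s: s[0])
        if step_j >= best_j:       # stop once J stops improving
            break
        best, best_j = step_best, step_j
    return best, best_j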

Results of the FNN modeling for the mashing process are given in Fig. 4 and Table 3. Figure 4 shows the relationship between the actual normalized temperatures and the estimated temperatures in each control region. Open circles show the evaluation data and filled circles the training data. If the models of the FNNs were completely precise, all data would lie on the solid line. In control region 2 some scatter is apparent, but almost all data lie close to the solid line in the other control regions.

Table 3 shows the number of membership functions, input variables selected by PIM and performance index J in each control region. In control region 2, J was

Table 3 Results of the FNN modeling in all regions

Control region   Selected variables               J1       J2       J
1                x4[2], x10[3]                    0.0020   0.0016   0.0018
2                x2[3], x4[3]                     0.0095   0.0162   0.0129
3                four variables, each [2]*        0.0016   0.0023   0.0020
4                five variables*, incl. x11[2]    0.0015   0.0024   0.0020

* the indices of these variables are illegible in the source

[Figure: four scatter plots, one per control region, of estimated normalized temperature against actual normalized temperature.]

Fig. 4 Relationships between actual normalized temperature and estimated temperature. Filled and open circles show the data for learning and evaluation, respectively.



bigger than in the other regions. However, the mean error was still small, being 0.19, 0.23, 0.24 and 0.27 °C in control regions 1, 2, 3 and 4, respectively. This shows the model has developed a good ability to determine the appropriate temperature for the mashing process.

The model in region 2 was analyzed as an example. As shown in Table 3, the input variables selected in this region were the specific gravity and temperature. It is reasonable to assume that the temperature to be used on day n is correlated with that used on day (n-1). When the mashing is under way, TOJI normally determines the temperature by consideration of the specific gravity. Therefore, it was concluded that the input variables of the model were in agreement with the experience of TOJI.

Figure 5 shows the membership functions for specific gravity and temperature after training. The membership functions were tuned automatically to express the grade of each input variable. Since the estimated value is normalized to unity, the value was converted back to the real scale. As shown in Fig. 5, the medium region for temperature was broader than that for specific gravity, indicating that control was more strongly correlated with specific gravity than with temperature. This again was in accord with the experience of TOJI.

Table 4 shows the acquired fuzzy rules. In an FNN, the acquired knowledge can be described in the form of IF-THEN rules by analysis of the connection weights wf in the FNN after training. Table 4 means, for example,

"If the current temperature is low and the specific gravity is medium, set the temperature for the next day to 9.0 °C".

Table 4 Production rules for the set temperature identified after learning in control region 2

                            Baume
                            Small   Medium   Big
Temperature   Small         8.7     9.0      9.0
              Medium        9.7     9.8      10.1
              Big           10.0    10.3     11.1

"Frr 'fIT i"~ i "l1J..L 0.0 0.0

&0 ~1 U U U ~s

Baum. [-J Temperature [-J Figure 5. Membership functions identified after learning in control region 2.



The following rules were also found after the analysis.

"If the specific gravity is high, set temperature for the next day to high."

"If today's temperature is high, set the temperature for next day to high."

These rules also agree well with the experience of TOJI.

Using the FNN model constructed above, a simulation based on historical data was performed. Simulated results were compared with the average values of 25 sake mashings. It was found that the specific gravity and alcohol concentration were in good agreement with the average values throughout the mashing period. The temperature predicted by the simulation also coincided with the average.

[Figure: time courses of temperature [°C], specific gravity (Baume) [-] and alcohol concentration [%] against mashing time [day].]

Figure 6. Time courses of temperature, specific gravity and alcohol concentration. Filled and open circles show the data for conventional and FNN control, respectively.



A test sake mashing using 100 kg rice was carried out, and the sake produced was compared with that obtained from a mashing under the control of TOJI. It was found that the quality of both sake products was comparable. A commercial sake mashing process using 1500 kg rice was then performed under the control of an FNN model. For this mashing, data were collected and the FNN model was reconstructed.

Mashing results are described in Fig. 6. The time courses of specific gravity and alcohol concentration agreed well throughout the mashing period with those of the conventional method using TOJI. The temperature in the mashing using the FNN model reached a maximum 2 days later and was a little lower during the 20th to 28th days. As a consequence, the overall process lasted one day longer. This difference may be due to slight differences in the initial conditions, since in sake mashing many factors, such as the activity of enzymes, the fermentation activity of yeast, the water content of polished rice, and the solubility of steamed rice, influence the course of the fermentation.

The specific gravity, alcohol concentration, acidity and BMD values were checked on completion of the mashing. These were first compared with those of seven other mashings performed conventionally by the TOJI in the same month, using the same type of rice with the same polishing ratio. As shown in Table 5, the mashing period and all other values were in the same range as the other seven mashings, confirming that the mashing was performed well using the FNN model.

It is apparent, then, that an FNN can adjust membership functions and decide production rules automatically from historical data. However, an FNN must be used as carefully as an ANN. The reliability of predictions depends on the Euclidean distance between the estimation data and the data sets used for learning. If some data are far from the data used for training, values estimated by the FNN may contain a relatively large error. To overcome this, PENN (policy and experience-driven neural network [23]) and CF (confidence function [24]) have been proposed. PENN learns not only from the training data but also from additional data, such as the values which experts regard as the maximum and minimum values for parameters. The CF value reflects the distance between the data sets for estimation and learning, and it is available to assess the accuracy of the FNN for any

Table 5. Comparison of sake produced using FNN and conventional control.

                       FNN control   Conventional control   References*
                                                            Average   Minimum   Maximum
Mashing period [d]     36            35                     34        33        36
Baume [-]              0.2           0.25                   0.16      0.17      0.3
Alcohol [%]            17.2          17.3                   17.1      17        17.3
BMD [d]                7             8.5                    6.1       0         9.9
Acidity [ml]           1.5           1.5                    1.5       1.4       1.6

*These data were obtained from seven other mashes operated under similar conditions



estimation data.
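The caveat about distance from the training data can be made concrete with a simple check such as the following; this is only an illustration of the idea, not the published CF formulation:

import numpy as np

def nearest_training_distance(x_query, x_train):
    # Euclidean distance from the query point to its nearest training point.
    return np.linalg.norm(x_train - x_query, axis=1).min()

rng = np.random.default_rng(0)
x_train = rng.random((30, 11))    # e.g. 30 training patterns with 11 inputs
x_new = rng.random(11)
if nearest_training_distance(x_new, x_train) > 0.5:  # threshold is arbitrary
    print("query lies far from the training data; treat the estimate warily")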

2.3 Artificial neural network (ANN)

ANNs, the multi-layered network models normally trained by a back-propagation algorithm, were developed from the perceptron, which was proposed as a model for information circulation in the brain [3]. There have been numerous applications of ANNs in engineering, since they are powerful methods for pattern recognition and learning.

Figure 7 shows the structure of a three-layered ANN with input, hidden and output layers. Outputs from the units in the hidden and output layers are calculated by applying a sigmoid function to the sum of the inputs to each unit. ANNs can be applied to many types of data sets, but are particularly effective when a non-linear relationship exists between input and output.
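The forward pass of such a network can be sketched in a few lines of numpy; the weights below are random placeholders, whereas in the work described they are fitted by back-propagation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    h = sigmoid(x @ w1 + b1)          # hidden layer
    return sigmoid(h @ w2 + b2)       # output layer

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 15, 23, 1        # e.g. the ANN-15 model for AAase
w1 = rng.normal(scale=0.5, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
w2 = rng.normal(scale=0.5, size=(n_hid, n_out)); b2 = np.zeros(n_out)
x = rng.uniform(0.05, 0.95, size=n_in)   # inputs normalized to 0.05-0.95
print(forward(x, w1, b1, w2, b2))        # normalized activity estimate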

To illustrate the use of an ANN model for practical process control, the ANN model estimating the enzyme activities of KOJI from the temperature and humidity in the KOJI making room is reviewed here [21]. KOJI consists of rice grains and fungi which grow on and within the grain [25]. KOJI has three main functions in the sake mashing process [26]. First, it produces enzymes that can solubilize and saccharify the steamed rice. Second, it contains nutrients for the growth of yeast and thereby accelerates fermentation. Third, it metabolizes some chemical constituents of sake. If KOJI cannot produce enzymes, glucose is not supplied to the yeast and no alcohol is produced. Therefore, the enzyme activity of KOJI is one of the most important factors in evaluating KOJI quality. The KOJI fungi produce various enzymes, of which α-amylase (AAase), glucoamylase (GAase), acid proteinase (APase), and acid carboxypeptidase (ACPase) are typical. If the α-amylase and glucoamylase activities are high, alcohol formation may be suppressed by the high sugar content in the tank. It is therefore important that KOJI is prepared with well-balanced enzyme activities suitable for the mashing process [20].

Figure 7. Structure of a three-layered artificial neural network.



The KOJI making process is a solid-state fermentation using rice grain. The grains are spread in a layer of uniform thickness. Both temperature and humidity are controlled. The rice is periodically mixed to prevent clumping, and the mixing interval constitutes a further variable. Mathematical models describing mycelial growth on the rice grain have been proposed after analysis of the characteristics of the KOJI fungi [27]. However, many parameters influence the production of enzymes, and a considerable quantity of KOJI making data is needed to construct such a model. For this reason, no adequate mathematical model for enzyme production is yet available.

Forty-three time course data sets for KOJI production were used, obtained in 1995 from the Sekiya Brewing Co. using a fully automatic KOJI preparation machine, NOS (Maruko Co., Niigata, Japan). Figure 8 illustrates the time course data for temperature and humidity. The rice bed is normally mixed 4 times during fermentation, in steps known as YUKAMOMI, KIRIKAESHI, MORI, NAKASHIGOTO, and SHIMAISHIGOTO. In the Sekiya Brewing Co., the SHIMAISHIGOTO step has been omitted.

DEKOJI is the end of the process, when the KOJI is taken out of the NOS machine. Open circles in Fig. 8 indicate the temperature of the rice and the atmospheric humidity at each mixing step. After MORI, the humidity is decreased and the temperature is increased in a typical KOJI process. After mixing, an expert TOJI decides the temperature to be used until the next mixing, the humidity during fermentation, and the time intervals between the mixing steps. These factors were therefore taken as process variables. Initial conditions of the process, that is, the polishing ratio, seed mycelial weight, sucking ratio of the steamed rice, and total rice weight, were also regarded as variables that affect the quality of KOJI. As a result, the total number of input variables was 21, and these are listed

[Figure: time courses of temperature and humidity over 0-50 h, with the mixing steps marked by open circles.]

Figure 8. Time courses of temperature and humidity in the KOJI making process.



in Table 6. The activities of four enzymes, AAase, GAase, APase, and ACPase, were selected as representative of the enzyme activities in KOJI. The amount of KOJI fungi is also an important variable in the KOJI making process, but this was not used as a variable since the value was difficult to measure quantitatively. In Table 6, the minimum and maximum values of the variables are also shown. They were used for the analysis after normalization to the range 0.05 - 0.95.

A three-layered ANN with input, hidden, and output layers was employed for the analysis. The process variables and initial conditions were used as the input variables and the enzyme activities as the output variables. ANN models were constructed with 10, 11, 15, and 21 input variables (ANN-10, -11, -15, and -21), as shown in Table 6. A back-propagation algorithm was used for learning. The initial connection weights of the ANN models were decided at random. Since the final connection weights of an ANN after learning are influenced by the initial values of the weights, learning from various initial values should be done several times. In this study, learning was carried out five times and the result with the lowest J value was selected. In order to determine the optimum structure of the ANN models, the number of hidden units was changed from (n - 10) to (n + 10), where n is the number of input units, and the structure with the highest accuracy was selected.

Table 6. Input and output variables for ANN-21, -15, -11 and -10.

Input variable                                        Minimum   Maximum
Type of rice         Gohyakumangoku                   0         1
                     Yamadanishiki                    0         1
                     Miyamanishiki                    0         1
Initial conditions   Polishing ratio [%]              40        60
                     Initial mycelial weight [g]      30        70
                     Sucking ratio of steamed rice [%] 40.5     45.5
                     Total rice [kg]                  320       500
Temperature          T0 [°C]                          28.4      31.0
                     T1                               29.6      30.9
                     T2                               30.5      32.3
                     T3                               29.8      32.5
                     T4                               33.0      38.4
                     T5                               33.0      37.4
                     T6                               41.5      43.6
Time                 t0 [h]                           10.75     15.67
                     t1                               10.00     18.00
                     t2                               6.00      11.30
                     t3                               3.75      12.50
                     t4                               2.50      14.25
                     t0+t1+t2                         29.50     40.00
                     t3+t4                            12.00     20.50
Humidity             H0 [%]                           40.0      61.5
                     H1                               43.4      66.7
Output variables     AAase [units/g-KOJI]             393       1305
                     GAase                            102       159
                     APase                            1370      2678
                     ACPase                           1556      3858

Note: the original table also marks which of these variables serve as inputs to ANN-21, ANN-15, ANN-11 and ANN-10; those marks are not recoverable from the source.



The 43 time course data sets of the KOJI process were divided into 3 groups: the first 30 sets were used to train the ANN models, the next 10 sets for model evaluation, and the last 3 sets for evaluation of searching by the GA, which is discussed later.

Estimation of the activities of the 4 enzymes - AAase, GAase, APase, and ACPase - was investigated using the ANN models. Four models, one for each enzyme, were independently constructed, with only one output unit in each model. In the case of ANN-15, for instance, the model for AAase estimation had 15 input units, 23 hidden units and 1 output unit. For GAase, APase, and ACPase, the numbers of hidden units were 19, 22, and 13, respectively.

During training, the connection weights of the ANN models were changed according to the estimated error on the 30 learning data sets. The maximum training time was 2000 iterations. Training was stopped when J reached a minimum within the maximum learning time. The learning rate for the connection weights was 0.1.

ANN models using 21 variables (ANN-21) were constructed to estimate the enzyme activities. Figure 9 shows the relationships between the actual and the estimated values for each enzyme activity. For the estimation of the AAase, GAase, APase, and ACPase activities, almost all data plots for learning and evaluation are close to the diagonal line, indicating that the ANN models estimated the enzyme activities with a high degree of accuracy. J1, J2, and J for

[Figure: four scatter plots (AAase, GAase, APase, ACPase) of estimated against actual normalized enzyme activity.]

Figure 9. Estimation of enzyme activities by ANN-21. Filled and open circles show the data for learning and evaluation, respectively.



Table 7. J values for ANN-21, -15, -11 and -10.

          Enzyme    J1       J2       J
ANN-21    AAase     0.0002   0.0018   0.0010
          GAase     0.0087   0.0205   0.0146
          APase     0.0005   0.0050   0.0028
          ACPase    0.0026   0.0093   0.0060
          Average   0.0030   0.0092   0.0061
ANN-15    AAase     0.0009   0.0072   0.0041
          GAase     0.0088   0.0086   0.0087
          APase     0.0029   0.0044   0.0037
          ACPase    0.0022   0.0095   0.0059
          Average   0.0037   0.0074   0.0056
ANN-11    AAase     0.0044   0.0098   0.0071
          GAase     0.0244   0.0068   0.0156
          APase     0.0056   0.0046   0.0051
          ACPase    0.0024   0.0084   0.0054
          Average   0.0092   0.0074   0.0083
ANN-10    AAase     0.0065   0.0097   0.0081
          GAase     0.0290   0.0073   0.0182
          APase     0.0065   0.0070   0.0068
          ACPase    0.0047   0.0103   0.0075
          Average   0.0117   0.0086   0.0101

each model are given in Table 7. The respective J values for AAase, APase, and ACPase were 0.001, 0.0027, and 0.0059, giving estimated errors of about 3%, 5%, and 8% for the activities of these enzymes. In the case of GAase, J was 0.0096 and the estimated error was about 10%. In Fig. 9, some of the data points for GAase lie a little away from the line. However, the mean error was only 8 units/g-KOJI, since the range of enzyme activity of GAase was much smaller than those of the other enzymes, as shown in Table 6. This was comparable to the experimental error of the enzyme activity. Therefore, it was concluded that these trained ANN models had good estimating ability for all the enzyme activities.

2.4 Genetic algorithm (GA)

A genetic algorithm imitates some aspects of natural evolution [28], and is based upon the presumption that well-adapted individuals will survive. In a GA, the genes in chromosomes constitute the genetic information, and adaptation is brought about by crossover and mutation. This algorithm can search rapidly for the maximum or minimum value of a function, and can be used for global searching and multivariable optimization in a variety of fields.

As an example of the application of a GA, our results on the KOJI making process will be explained, in which the optimal temperature and humidity were determined using a GA combined with the ANN models described above to make KOJI with the desired balance of enzyme activities.



In the construction of an ANN, the process variables and initial conditions are the input variables. We wish to determine the values of these which should be selected to make KOJI with the desired enzyme activities. For this purpose, the input values of the ANN must be determined in the reverse direction from the desired values of the output variables. This results in a multi-variable optimization problem, to which we applied a simple GA [16]. The input variables selected for ANN-10, -11, -15, and -21 were used for the algorithms named GA-10, -11, -15, and -21, respectively.

Gene coding of the normalized variables was done using the actual number coding method [16]. Numbers with eight figures after the decimal point were coded to a line of bits. A line of an input variable is called a gene, and a chromosome consisted of 21 of these genes. The number of chromosomes (population size) was 1000.

Initial chromosomes were generated at random. The usual procedure in a GA is that genes in a chromosome are changed by mutation and crossover to create a new chromosome. However, genes relating to the initial conditions in all the chromosomes were fixed at the actual values and not changed. To evaluate the fitness of each chromosome, the genes were decoded from the line of bits. When the decoded values were input to the ANN models, the output values, i.e. the four enzyme activities, were obtained as the estimated values. The fitness was evaluated by the following equation from the average error between the desired and the calculated activities. A chromosome with a small error has a high fitness:

Fitj = 1 - Σi |yij - yi*| / N

[Figure: loop of "random generation of individuals" → "calculation of fitness using ANN" → "selection by fitness" → "crossover and mutation".]

Figure 10. Flowchart of a genetic algorithm.



where yij is the calculated value of the i-th enzyme activity for the j-th chromosome, yi* is the desired enzyme activity, and Fitj is the fitness of the j-th chromosome. N is the number of estimation models (4 in this study, corresponding to the activities of the enzymes AAase, GAase, APase, and ACPase).

After calculating the fitness of each chromosome, those with higher fitness were selected by roulette selection based on the fitness-proportional strategy [16]. One-point crossover was performed with a crossover rate of 0.9. The bits on the chromosomes were changed at random with a mutation rate of 0.01. The number of generations was 1000, and the chromosome with the highest fitness over all generations was selected as the solution. These procedures are shown in Fig. 10.
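The search loop described above can be sketched as follows. This is a schematic reconstruction, not the authors' code: ann_predict stands for the trained ANN model mapping decoded (normalized) process variables to normalized enzyme activities, the bit width per gene is an assumption (the original codes eight decimal figures), and POP and GENS should be reduced for a quick test:

import numpy as np

rng = np.random.default_rng(1)

BITS = 16               # bits per gene (assumed width)
N_VARS = 14             # process variables searched (initial conditions fixed)
POP, GENS = 1000, 1000  # population size and generations, as in the text
P_CROSS, P_MUT = 0.9, 0.01

def decode(chrom):
    # Decode one bit-string chromosome into N_VARS values in [0, 1].
    genes = chrom.reshape(N_VARS, BITS)
    w = 2.0 ** -np.arange(1, BITS + 1)
    return genes @ w / w.sum()

def ga_search(ann_predict, y_target):
    pop = rng.integers(0, 2, size=(POP, N_VARS * BITS))
    best, best_fit = None, -np.inf
    for _ in range(GENS):
        # Fitness: Fit_j = 1 - mean |y_ij - y_i*| (cf. the equation above).
        fit = np.array([1.0 - np.abs(ann_predict(decode(c)) - y_target).mean()
                        for c in pop])
        if fit.max() > best_fit:
            best, best_fit = pop[fit.argmax()].copy(), fit.max()
        p = fit - fit.min() + 1e-9            # roulette (fitness-proportional)
        pop = pop[rng.choice(POP, size=POP, p=p / p.sum())]
        for i in range(0, POP - 1, 2):        # one-point crossover, rate 0.9
            if rng.random() < P_CROSS:
                cut = rng.integers(1, N_VARS * BITS)
                pop[i, cut:], pop[i + 1, cut:] = (
                    pop[i + 1, cut:].copy(), pop[i, cut:].copy())
        flip = rng.random(pop.shape) < P_MUT  # bitwise mutation, rate 0.01
        pop = np.where(flip, 1 - pop, pop)
    return decode(best), best_fit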

To determine the temperature and humidity of the KOJI process using the ANN-21 model, the GA was used to search for the 21 variables which predicted the desired activities. As a preliminary test, one time course data set was chosen which had not been used for the ANN learning and evaluation. The enzyme activities of this data set were 493, 145, 2160, and 2260 units/g-KOJI for AAase, GAase, APase, and ACPase, respectively, and these were used as the target values for the enzymes. The initial conditions for polishing ratio, initial mycelial weight, sucking ratio of steamed rice, total rice, and type of rice were fixed at 40%, 40 g, 41.8%, 420 kg, and YAMADANISHIKI. The GA was used to search for the other 14 process variables pertaining to temperature, humidity, and time interval.

GA searching was carried out using GA-21. The fitness of the best chromosome was high (0.999), and the enzyme activities (units/g-KOJI) for AAase, GAase, APase, and ACPase were calculated to be 506, 145, 2143, and 2233, respectively, almost identical to the actual values. Figure 11 shows the temperature and humidity profiles determined by the GA. Although the enzyme activities

[Figure: temperature [°C] and humidity [%] profiles against time [h].]

Figure 11. Temperature and humidity orbits acquired by GA-21. Filled and open circles show the actual data and the acquired orbit, respectively.



agreed well with the target values, the temperature and humidity in Fig. 11 were very different from the actual ones; in particular, the initial and final humidities were too high and the total time was too short. Thus, it was evident that this GA solution predicted conditions unsuited to the actual process.

When a GA is used to search in the reverse direction, a large number of candidate solutions exists. If too many variables must be determined, there is a high probability that an inappropriate solution will be selected. In our research, the number of data sets available for ANN learning was apparently too small compared to the number of variables. The dimensions of the space to be searched by the GA increase with the number of ANN input variables. In such cases, the data sets used for learning and evaluation may be distributed sparsely in the search space. Since an ANN is a learning method, its accuracy becomes higher as the space includes more data. From a space with a high ANN accuracy, the GA can estimate a solution that is closer to the actual solution.

Table 8. Relative errors of process variables and enzyme activities acquired by GA-21, -15, -11 and -10.

Relative error [%]                      GA-21   GA-15   GA-11   GA-10
T0                                      2.16    1.40    1.40    1.40
T1                                      1.31    0.86    0.86    0.86
T2                                      2.19    1.87    1.87    1.87
T3                                      1.94    1.84    1.20    1.20
T4                                      1.96    4.24    3.35    4.46
T5                                      4.15    1.43    1.90    1.90
T6                                      1.27    0.63    1.57    0.87
Average error for temperature           2.14    1.75    1.74    1.80
t0                                      10.60
t1                                      15.70
t2                                      25.70
t3                                      64.40
t4                                      67.00
t0+t1+t2                                        3.87    1.89    7.53
t3+t4                                           9.31    5.34    5.34
Average error for time                  36.70   6.59    3.62    6.44
H0                                      16.30   6.35    1.51    1.51
H1                                      9.13    4.55    2.33    4.44
Average error for humidity              12.70   5.45    1.92    2.97
Average error for process variables     16.00   3.30    2.11    2.85
AAase                                   1.84    2.34    15.40   17.20
GAase                                   0.48    0.48    8.25    9.02
APase                                   1.62    0.78    10.10   10.40
ACPase                                  3.22    1.51    5.07    5.18
Average error for enzyme activities     1.79    1.28    9.73    10.40



To overcome this problem, two methods can be used. One is to decrease the number of input variables to reduce the dimensionality of the search space. Alternatively, we may search only within regions of the space which are well stocked with learning data. In the latter case, a region located a short Euclidean distance from the learning data should be used. Mahalanobis's distance [29], a relative distance that is the square of the Euclidean distance divided by the dispersion of the space, could be utilized instead of the Euclidean distance. In this study, the former method was used; we decreased the number of input variables of the ANN models and used these models for searching.
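For reference, the Mahalanobis distance mentioned here can be computed as below; the pseudo-inverse is a small robustness choice of this sketch, not something taken from the text:

import numpy as np

def mahalanobis_sq(x, data):
    # Squared distance of x from the mean of the learning data, scaled by
    # the data's covariance (its "dispersion"), so directions in which the
    # data are widely spread count for less than tightly clustered ones.
    mu = data.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(data, rowvar=False))
    d = x - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
data = rng.random((30, 15))       # e.g. 30 learning sets with 15 inputs
print(mahalanobis_sq(rng.random(15), data))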

We then tried to search for temperature and humidity in the KOJI process using GA-15, -11, and -10. The input variables were reduced by trial-and-error. From an interview with a TOJI, the temperature before NAKASHIGOTO (T4) and the time of NAKASHIGOTO (t0 + t1 + t2) were found to be very important process variables. These were therefore selected for ANN-15, -11 and -10. The input variables used in the three models are listed in Table 6.

The J1, J2, and J values of these models are given in Table 7. The smaller the number of input variables used, the larger the J values for all enzyme activities obtained, except for the GAase activity with ANN-15 and the ACPase activities with ANN-15 and -11. The ANN-11 and -10 estimations of the GAase activity showed more than 13% error, but the other models could estimate the activities with satisfactory accuracy. Some of the variables not used as input variables in the ANN-10, -11 or -15 models are known to be important for the production of the enzymes, but the estimation errors of the ANN models lacking these variables were not large compared with those of the models using them. In this study, the actual process data were utilized, and the distribution of several process variables,

[Figure: temperature [°C] and humidity [%] profiles against time [h].]

Figure 12. Temperature and humidity orbits acquired by GA-15. Filled and open circles show the actual data and the acquired orbit, respectively.



that is, the difference between their maximum and minimum values, was not very large. Therefore, variables with a narrow distribution did not significantly influence the enzyme activities. For example, T6 is a very important process variable for the GAase and AAase activities [30], but the estimation error for GAase in ANN-10 was comparable with that in ANN-11.

To investigate GA performance using these models, the process orbits from three time course data sets that were not used for ANN learning and evaluation were used. Using the GA method described above, the temperature and humidity were calculated. In GA-15, -11 and -10, the numbers of process variables to be determined were 8, 4 and 3, respectively. Search results are listed in Table 8, which shows the average relative errors between the actual values and those from the GA using the ANN models. For the variables not used for modeling, such as T0, T1, and T2 in GA-15, the relative errors between the actual values and the average values of the past data are calculated and used as a matter of convenience. Among the four algorithms, GA-21 gave the largest average errors; those for time and humidity were particularly large. Using GA-15, -11 and -10, the maximum errors among the process variables were 9.31, 5.34, and 7.53%, respectively. GA-11 showed the smallest average error for all the process variables, only 2.11%. However, the average error of GA-11 for the enzyme activities was about 10%, which is relatively high in comparison with the results obtained by GA-21 and -15. Considering the average errors for enzyme activities and all process variables together, GA-15 was judged the most suitable for determining temperature and humidity.

Figure 12 shows one example of the temperature and humidity determined by GA-15 against the same target values as shown in Fig. 11. The estimated values (units/g-KOJI) for the activities of AAase, GAase, APase, and ACPase were 471, 146, 2191, and 2272, respectively, and the fitness was 0.999. This is the same fitness as that achieved by GA-21, and the calculated enzyme activities were almost identical to the actual ones. Furthermore, the temperature and humidity were similar to the actual ones, whereas those determined by GA-21 were quite different (Fig. 11).

For processes in which the relationship between input and output is not known, the combination of a GA with an FNN or ANN is potentially very effective. When there are few learning data near the answer calculated by the GA with the ANN or FNN, it cannot be guaranteed that the answer is valid. In such a case, the method proposed here, or the CFGA [24] proposed by us, which combines a GA with the CF value as the evaluation function, will be useful.



3 Conclusion

Fuzzy reasoning, ANNs, FNNs and GAs are powerful tools for tackling chemical and biochemical problems [31-33]. However, every method has both merits and drawbacks, and the following factors need to be considered when applying these methods to an actual process.

For fuzzy reasoning to be effective, the interview with an expert and the tuning of the membership functions and production rules are both important. This method requires the knowledge and experience of an expert, so it self-evidently cannot be realized without one. In many cases, tuning is done by trial-and-error, which is time-consuming. However, if this step were done perfectly, fuzzy reasoning would lead to a satisfactory result.

For ANN and FNN modeling, the collection of data, its pretreatment, the size of the model structure and the degree of generality which the model offers are all important. These methods may be used with relatively little effort if appropriate data are available. The problem of overfitting can be avoided or mitigated by the collection of more data, but this is not always an option. If only limited data are available, a model should be constructed taking into account the possible dangers of overfitting. For this purpose, the use of an evaluation data set is helpful. The estimation accuracy for the evaluation data can be thought of as a measure of generality, so the selection of the input variables and decisions regarding the size of the model can be made using this accuracy. However, the accuracy for new estimation data depends on the distance between them and the data sets used for training the FNN or ANN. If the estimation data are far from the training data, estimated values may contain substantial error.

In GA searching, the function used for the fitness calculation is important, so this function must be carefully constructed. In our example, the accuracy of the ANN model influenced the quality of the solution found.

These methods have potential not only in biochemical engineering but also in chemical and other fields of engineering. In our study of the application of FNN and GA to a polymerization process, we obtained better results than with mathematical models [34]. If appropriate data are available, and it is hard to construct a mathematical model, these methods are often eminently suitable.

Acknowledgements

We would like to thank Dr. Iwao Fukaya and Dr. Yoshio Nishida for collaboration on the sake mashing process, the late Professor Yoshio Uchikawa and Professor Takeshi Furuhashi for providing the FNN, and many collaborators in and around our laboratory.



References

1. Zadeh, L. A.: Fuzzy sets. Information and Control, 8, 338-358 (1965).
2. Horikawa, S., Furuhashi, T., and Uchikawa, Y.: A study on fuzzy modeling using fuzzy neural networks. Proc. of IFES '91, 562-573 (1991).
3. Rumelhart, D. E., Hinton, G. E., and Williams, R. J.: Learning representations by back-propagating errors. Nature, 323, 533-536 (1986).
4. Holland, J. H.: Adaptation in natural and artificial systems. University of Michigan Press, Michigan, USA (1975).
5. Sugimoto, Y. and Fujita, E.: Process control of sake fermentation. Hakkokogaku, 65, 199-215 (1987). (In Japanese)
6. Oishi, K., Tominaga, M., Kawato, A., Abe, Y., Imayasu, S., and Nanba, A.: Application of fuzzy control theory to the sake brewing process. J. Ferment. Bioeng., 72, 115-121 (1991).
7. Oishi, K., Tominaga, M., Kawato, A., and Imayasu, S.: Analysis of the state characteristics of sake brewing with a neural network. J. Ferment. Bioeng., 73, 153-158 (1992).
8. Matsuura, K., Hirotsune, M., Nagata, F., and Hamaguchi, M.: A kinetic study on continuous sake fermentation. Hakkokogaku, 69, 345-354 (1991). (In Japanese)
9. Nishida, Y., Fukaya, I., Takahashi, N., Hanai, T., Honda, H., and Kobayashi, T.: Construction of fuzzy rules based on statistically analyzed data for control of the sake (Ginjo) making process. Seibutsu-kogaku, 72, 267-274 (1994). (In Japanese)
10. Hanai, T., Honda, H., Takahashi, N., Nishida, Y., Fukaya, I., and Kobayashi, T.: Framework rules for control of the sake (Ginjo) making process and their application in fuzzy control. Seibutsu-kogaku, 72, 275-281 (1994). (In Japanese)
11. Hanai, T., Nishida, Y., Ohkusu, E., Honda, N., Fukaya, I., and Kobayashi, T.: Experimental fermentations of Ginjo sake with two fuzzy controls. Seibutsu-kogaku, 73, 283-289 (1995). (In Japanese)
12. Hanai, T., Katayama, A., Honda, H., and Kobayashi, T.: Automatic fuzzy modeling for Ginjo sake brewing process using fuzzy neural networks. J. Chem. Eng. Japan, 30, 94-100 (1997).
13. Matsuura, K., Hirotsune, M. and Hamachi, M.: Cell density identification in alcohol fermentation by neural network. Seibutsu-kogaku, 69, 463-469 (1991). (In Japanese)
14. Nishida, Y., Hanai, T., Katayama, A., Honda, H., Fukaya, I., and Kobayashi, T.: Experimental Ginjo-sake brewing by using fuzzy neural network. J. Brew. Soc. Japan, 92, 447-451 (1997). (In Japanese)
15. Sugimoto, Y., Tanaka, N., Furukawa, A., Watanabe, K., Yoshida, T., and Taguchi, H.: Adaptive control strategies for the mashing process in sake brewing. J. Ferment. Technol., 64, 187-197 (1986).
16. Tsuchiya, Y., Koizumi, J., Suenari, K., Teshima, Y., and Nagai, S.: Construction of fuzzy rules and a fuzzy simulator based on the control techniques of Hiroshima toji (experts). Hakkokogaku, 68, 123-129 (1990). (In Japanese)
17. Matsuura, K., Shiba, H., Hirotsune, M., and Hamachi, M.: Optimizing control of sensory evaluation in the sake mashing process by decentralized learning of fuzzy inferences using a genetic algorithm. J. Ferment. Bioeng., 80, 251-258 (1995).
18. Honda, H., Hanai, T., Katayama, A., Tohyama, H., and Kobayashi, T.: Temperature control of Ginjo sake mashing process by automatic fuzzy modeling using fuzzy neural networks. J. Ferment. Bioeng., 85, 107-112 (1998).
19. Hanai, T., Ueda, N., Honda, H., Tohyama, H. and Kobayashi, T.: Decision of temperature course in mashing process using GA-FNN for quality control of Ginjo sake. Seibutsu-kogaku, 76, 331-337 (1998). (In Japanese)
20. Nihon Jyouzou Kyoukai: Saishin shuzou kouhon, p. 60-82. Nihon Jyouzou Kyoukai, Tokyo (1991). (In Japanese)
21. Hanai, T., Honda, H., Ohkusu, E., Ohki, T., Tohyama, H., Murakami, T. and Kobayashi, T.: Application of an artificial neural network and genetic algorithm for determination of process orbit in the koji making process. J. Biosci. Bioeng., 87, 507-512 (1999).
22. Shiraishi, H. and Nakahara, S.: Application of fuzzy reasoning using genetic algorithm for control of an activated sludge process. Kagaku Kogaku Ronbunshu, 22, 1-6 (1996). (In Japanese)
23. Ishida, M. and Ohba, T.: Control by new policy and experience-driven neural network to follow a desired trajectory. J. Chem. Eng. Japan, 27, 137-138 (1994).
24. Hanai, T., Iwata, N., Furuhashi, T., Yoshikawa, Y., Honda, H., and Kobayashi, T.: Proposal of confidence function for reverse problem of fuzzy neural network. Proc. 63rd Annual Meeting of Soc. Chem. Eng. Japan, 1, 61, Osaka, Japan (1998). (In Japanese)
25. Ito, K.: Studies on the mycelial growth of Aspergillus oryzae on rice grain. Seibutsu-kogaku, 71, 115-127 (1993). (In Japanese)
26. Inoue, T., Tanaka, J., and Mitsui, S.: Recent advances in Japanese brewing technology, p. 13-27. Gordon and Breach, New York (1993).
27. Okazaki, N. and Sugiyama, S.: Growth estimation of Aspergillus oryzae cultured on solid media. J. Ferment. Technol., 57, 408-412 (1979).
28. Kitano, H.: Identeki arugorizumu. Sangyou Tosho, Tokyo (1993). (In Japanese)
29. Suga, T.: Tahenryou Kaiseki no Jissen, p. 97-103. Gendai Sugaku Sha, Kyoto, Japan (1993). (In Japanese)
30. Nihon Jyouzou Kyoukai: Seishu seizou gijutsu, p. 91-124. Nihon Jyouzou Kyoukai, Tokyo (1992). (In Japanese)
31. Shi, Z. and Shimizu, K.: Pattern recognition of the change in dissolved oxygen concentration by neural network in E. coli cultivation. Kagaku Kogaku Ronbunshu, 19, 692-694 (1993). (In Japanese)
32. Ye, K., Jin, S. and Shimizu, K.: Fuzzy neural network for the control of high cell density cultivation of recombinant E. coli. J. Ferment. Bioeng., 77, 663-673 (1994).
33. Yoshida, T. and Shioya, S. (eds.): Application of knowledge engineering to bioprocess systems. Seibutsu-kogaku, 74, 185-213 (1996). (In Japanese)
34. Horiuchi, J., Kamasawa, M., Miyakawa, H., and Kishimoto, M.: Phase control of fed-batch culture for α-amylase production based on culture phase identification using fuzzy inference. J. Ferment. Bioeng., 76, 207-212 (1993).
35. Hanai, T., Ohki, T., Honda, H., and Kobayashi, T.: Estimation model of the butadiene quality from polymerization conditions using fuzzy neural network. Proc. 63rd Annual Meeting of Soc. Chem. Eng. Japan, 1, 61, Osaka, Japan (1998). (In Japanese)


Genetic Algorithms for the Geometry Optimization of Clusters and Nanoparticles

Roy L. Johnston and Christopher Roberts

School of Chemical Sciences, University of Birmingham, Birmingham B15 2TT, UK

Abstract. A review is presented of the design and application of Genetic Algorithms for the geometry optimization (energy minimization) of clusters and nanoparticles, where the interactions between atoms, ions or molecules are described by a variety of potential energy functions (force fields). A detailed description is presented of the Birmingham Cluster Genetic Algorithm Program, developed in our group, and a number of specific applications are highlighted. Finally, a number of recent innovations and possible future developments are discussed.

1 Introduction: Clusters and Cluster Modelling

Clusters are aggregates of between 2 and 10^6 (or more) atoms or molecules. They may consist of identical atoms, or molecules, or two or more different species. Clusters are formed by most of the elements in the periodic table and can be found in a number of environments, for example: coinage metals in stained glass windows; silver clusters on photographic film; molecular clusters in the atmosphere; and carbon clusters in soot. A selection of different types of clusters can be seen in Fig. 1.

There is considerable experimental and theoretical interest in the study of elemental clusters in the gas phase and in the solid state [1,2]. Clusters are of fundamental interest both due to their intrinsic properties and because of the central position that they occupy between molecular and condensed matter science. One of the most compelling reasons for studying clusters is that they span a wide range of particle sizes, from the molecular (with well separated, quantized states) to the micro-crystalline (where states are quasi-continuous). Clusters constitute a new type of material (nanoparticles) which may have properties which are distinct from those of discrete molecules or bulk matter. The study of the evolution of the geometric and electronic structures of clusters and their chemical and physical properties is also of great fundamental interest, as it opens up the possibility of fine tuning the properties of electronic and optical devices, constructed from clusters, by varying the sizes of the component clusters.


Fig. 1. Examples of types of clusters. (Clockwise from top left: fullerenes; metal clusters; molecular clusters; ionic clusters.)

Since ab initio calculations are, at present, unfeasible for large clusters (of hundreds or thousands of atoms), there has been much interest in developing empirical atomistic potentials for the simulation of such species. A recent review of empirical potentials, some of which have been used to study clusters, has been presented by Erkoç [3].

Whether one is using empirical atomistic potentials or ab initio molecular orbital or density functional theory to describe the bonding in clusters, one of the objectives is to find, for a given cluster size, the arrangement of atoms (or ions or molecules) corresponding to the lowest potential energy, i.e. the global minimum on the potential energy hypersurface. As it is possible that a number of different states (isomers or local minima) will be thermally populated at a given finite temperature, one may pose the question "Why is it important to locate the global minimum?" Of course, clusters corresponding to global minima (or low-lying local minima) are the most likely candidates for the most probable structure formed in a cluster experiment [4], though, depending on experimental conditions, there may be some competition between thermodynamic and kinetic structure preferences. It is also important to locate the global minimum when testing how accurate and physically reasonable a particular potential energy function (or other theoretical model) is: for example, it is no use trying to model carbon clusters with a potential that predicts close packed structures!

Global minimization, however, is a non-trivial problem. As the number of minima rises exponentially with increasing cluster size, finding the global minimum is actually a member of the class of NP-hard problems [5,6]. It has been found that "traditional" Monte Carlo and Molecular Dynamics Simulated Annealing approaches often encounter difficulties finding global minima for particular types of interatomic interactions (such as, for example, the short ranged Morse potential discussed below) [7]. It is for this reason that Genetic Algorithms have found increasing use in the area of finding global minima for clusters, i.e. for cluster geometry optimization.

The Genetic Algorithm (GA) [8-11] is a search technique based on the principles of natural evolution. It uses operators that are analogues of the evolutionary processes of genetic crossover, mutation and natural selection to explore multi-dimensional parameter spaces. A GA can be applied to any problem where the variables to be optimised (genes) can be encoded to form a string (chromosome). Each string represents a trial solution of the problem. The GA operators exchange information between the strings to evolve new and better solutions. A crucial feature of the GA approach is that it operates effectively in a parallel manner, such that many different regions of parameter space are investigated simultaneously. Furthermore, information concerning different regions of parameter space is passed actively between the individual strings by the crossover procedure, thereby disseminating genetic information throughout the population. The GA is an intelligent search mechanism that is able to learn which regions of the search space represent good solutions, via the concept of schemata [8].

In this Chapter, we will briefly discuss the various GAs which have been applied to the problem of cluster geometry optimization, before moving on to discuss, in detail, our own cluster optimization program and its application to a number of cluster types. Finally, we will discuss new and possible future directions in the use of GAs to study clusters.

2 Overview of Applications of GAs for Cluster Geometry Optimization

The use of GAs for optimizing cluster geometries was pioneered in the early 1990s by Hartke (for small silicon clusters) [12] and Xiao and Williams (for molecular clusters) [13]. In both cases the cluster geometries were binary encoded, with the genetic operators acting in a bitwise fashion on the binary strings. Hartke has subsequently published the results of GA geometry optimizations for a number of different types of cluster: large silicon clusters bound by an empirical potential [14]; small silicon clusters bound by a model potential fitted to ab initio data [15]; rare gas clusters [16]; water clusters [17,18]; large Lennard-Jones clusters [4]; solvated ion clusters [19,20] and mercury clusters [21]. In his study of large Lennard-Jones clusters [4], Hartke used "directed mutation", whereby the "worst" atom (that with the lowest calculated binding energy) is moved into the "best" vacancy (i.e. where it will have the highest binding energy). He also used "niching" to prevent a single geometry type dominating the GA search. For Lennard-Jones clusters of up to 150 atoms, Hartke was able to show approximately cubic scaling of the cpu time taken (to find the global minimum) with cluster size (t ≈ 0.05 × N^2.8), as shown in Fig. 2 [4]. This compares well with the results of other GA, Monte Carlo and MD studies. Hartke also found all of the minima previously collected together in the Cambridge Cluster Database by Wales and co-workers [22].


Fig. 2. Scaling of cpu time taken with cluster size (N) for Hartke's cluster GA. [Taken from: "Global cluster geometry optimization by a phenotype algorithm with niches: location of elusive minima and low-order scaling with cluster size", B. Hartke, J. Comput. Chem., 20 (1999) 1752. Copyright © 1999 B. Hartke. Reprinted by permission of John Wiley & Sons, Inc.]

An important stage in the evolution of GAs for cluster optimization occurred when Zeiri [23-25] introduced a GA that operated on the real-valued cartesian coordinates of the clusters. This approach allowed for a representation of the cluster in terms of continuous variables and removed the requirement for encoding and decoding binary genes. Zeiri used the GA to optimize ArNH2 (N = 4-7 and 12) clusters [23], Lennard-Jones clusters (with 4-10 atoms) [24] and XeNY clusters (N = 1-16; Y = Cl, Br, Cl-, Br-) [25]. The mating operators used in Zeiri's GA include multi-point crossover routines, which simply cut and paste the strings in their array form, and operators which average the cluster coordinates to produce a single new offspring. The Zeiri GA is self-optimizing, as the parameters controlling the GA operation are varied by the GA during its run to produce a more efficient search of the potential energy surface.

The next significant step in the development of GAs for cluster optimization was due to Deaven and Ho [26,27], who performed a gradient driven local minimization of the cluster energy after each new cluster was generated. As Wales has pointed out [28], the introduction of local minimization effectively transforms the cluster potential energy hypersurface into a stepped surface, where each step corresponds to a basin of attraction of a local minimum on the potential energy surface, as shown in Fig. 3. This simplification of the surface greatly facilitates the search for the global minimum by reducing the space that the GA has to search. This principle also underpins the Basin Hopping Monte Carlo method developed by Wales [28] and the "Monte Carlo plus energy minimization" approach of Scheraga [29]. These related methods have proved very efficient for the structural optimization of clusters, crystals and biomolecules [30]. In the GA context, such local minimization corresponds to Lamarckian, rather than Darwinian evolution, as individuals pass on a proportion of the characteristics that they have acquired to their offspring. In the case of clusters, these acquired characteristics are the geometries after local minimization, rather than the characteristics they themselves inherited. Such hybrid or "Lamarckian" GAs, which couple local minimization with GA searching, have been found to improve GA efficiency for a number of different applications of GAs in global optimization [31].

Fig. 3. Simplification of the potential energy surface in the Lamarckian GA. Initially generated clusters which are in the same basin of attraction (e.g. A and A') are minimized to the same structure (A0), while cluster B minimizes to B0.


Another significant development in cluster optimization GAs, also due to Deaven and Ho, was the introduction of the 3-dimensional "cut and splice" crossover operator [26,27]. This operator, which has been employed in most subsequent cluster GA work, gives a more physical meaning to the crossover process. In this crossover mechanism, which is shown in Fig. 4, good schemata correspond to regions of the parent clusters which have low energy local structure. Deaven and Ho applied their GA to the optimization of carbon clusters bound by a tight binding potential [26] and Lennard-Jones clusters with 2-100 atoms [27]. Their work on Lennard-Jones clusters yielded many more low energy minima than had previously been found by alternative geometry optimization methods. Ho has subsequently employed a GA to study silicon clusters, using tight binding potentials [32].

Fig. 4. The Deaven-Ho cut and splice crossover operation

Hartke has stated that, because the genetic operators in the Deaven and Ho GA act on the clusters themselves, in configuration space, rather than on a string representation of the problem, this algorithm may better be described as "phenotypic", rather than "genetic" [4]. While this is strictly true, in this Review we shall continue to describe the Deaven-Ho algorithm, and subsequent variants (including our own), as "genetic algorithms", due to the pseudo-genetic nature of the operators, rather than the way in which the problem is coded.

Mestres and Scuseria [33] studied the geometries of Lennard-Jones clusters with less than 13 atoms and C8 clusters modelled with a tight binding potential. They studied the benefits of having an initial population containing structures seeded with the geometries of low energy (N − 1)-atom clusters over completely random initial geometries. The seeded populations always converged on the global minimum in fewer generations, though one has to question whether such a biased search may encounter difficulties when there is an abrupt change in atom packing type on going from the (N − 1)-atom to the N-atom global minimum. The genes in the Mestres and Scuseria GA were expressed using graph theory, i.e. through distance and adjacency matrices. The local optimization step used by Mestres and Scuseria involved another GA that operated on a binary representation of a cluster. Bokal has also used a GA based on graph theory to generate polyhedral clusters with maximized sphericity [34].

Further work on cluster geometry optimization using GAs, for a variety of potentials and types of cluster, has been reported by Niesse and Mayne: Ar clusters and water clusters [35]; Lennard-Jones clusters with up to 55 atoms [36]; small silicon clusters [37] and hydrocarbon clusters [38]. The GAs used by Niesse and Mayne have used both binary and real-valued encoding of the atomic coordinates, in conjunction with a number of operators adapted from the work of Zeiri [23] and Deaven and Ho [26]. They found that a GA using real-coded parameters is more efficient (requires fewer function evaluations to locate the global minimum structures) than an equivalent GA using binary parameter encoding. Tutein [39] used the GA of Niesse and Mayne [36], with a semi-empirical potential, to optimize the geometries of Na(3²P)ArN clusters, where N was in the range 2-17. Iwamatsu has used a modified version of the Niesse-Mayne GA to optimize the geometries of silicon clusters, bound by Stillinger-Weber and Gong potentials [40]. Iwamatsu uses a simplex algorithm for local cluster energy minimization.

Pullan [41-43] has applied a GA to the geometry optimization of mixed Ar-Xe clusters and benzene clusters. He also performed a comparative study of a number of mating and mutation operators which were presented in previous work on cluster geometry optimization with GAs, in which he found that the cut and splice operator of Deaven and Ho produces the most efficient results [41].

Hobday and Smith [44] optimized carbon clusters with 6-60 atoms bound by the Brenner potential and with 10-22 atoms bound by the Murrell-Mottram potential (see Section 4.3). The Hobday and Smith GA differs from the most recent GAs in that it uses a binary encoding of the parameters. They also use a variation of the Deaven and Ho crossover operator in which the clusters are rotated and translated, so that the low energy region of a cluster lies in its lower half. Then the cluster halves from each parent are exchanged in such a way that the high energy region of one parent is replaced with the low energy region of the other parent. The mutation operator introduced by Hobday and Smith also uses biasing, in that it replaces a high energy atom of a cluster with a low energy atom from one of its parents. The use of biased crossover and mutation operations has also been investigated by Hartke [4], who prefers to use an approximate 50/50 mixture of random and biased (or "directed") crossover.

The basic Deaven and Ho GA was applied by Curotto et al. [45] to the geometry optimization of Ni and NiH clusters (2-13 atoms) using a potential derived from the Hückel model. The potential was parameterized by fitting to ab initio and experimental data. These authors used simulated annealing to relax the geometries of the clusters after mating and mutation, instead of the more usual gradient driven methods. The geometries of small (AlP)N (N < 6) clusters were predicted using a GA by Tomasulo [46], to generate input for higher level DFT-LDA calculations. The GA used the geometries of SiN clusters as templates.

In their study of Lennard-Jones clusters with up to 100 atoms, Wolf and Landman [47] aimed to improve the Deaven and Ho GA by introducing several variations to the basic cluster GA scheme. Their first change was to the cut and splice crossover operator; they removed the step that rotates the cluster by a random plane prior to cutting the parent clusters. Instead, the same cluster halves from each parent were used during every mating of a particular parent. A novel and intuitive feature of their GA was to vary the mutation rate according to the energy distribution amongst the population: a small variation between the highest and lowest energy clusters in the population over a number of generations (i.e. low population diversity) would be used as a measure to indicate that the mutation rate should be increased. Wolf and Landman also pioneered two new mutations: a "twinning" operation that rotates half the cluster and then pastes the rotated and un-rotated parts back together, and an "etching" mutation. The etching mutation adds M atoms to an N-atom cluster, performs a rough energy minimization of the cluster and then removes the highest energy atom. The energy minimization followed by highest energy atom removal is repeated until the cluster has N atoms again. Wolf and Landman reported new global minima for 68 and 75-78 atom Lennard-Jones clusters, as well as finding that their GA required fewer generations on average to find the global minima than the standard Deaven and Ho GA.

GA studies by Chaudhury and Bhattacharyya [48,49] on Lennard-Jones clusters have concentrated not just on finding the global minimum geometries of the clusters but also on other critical points on the potential energy surface. In their most recent work [49], they searched for the first order saddle points on the Lennard-Jones potential energy surface.

Luo and Zhao [50] studied (C60)N (N = 3-25) clusters of clusters with another variant of the Deaven and Ho GA. They used a simple pair potential and a 2+3-body potential to describe the interactions between the C60 clusters. The global minimum of (C60)N clusters was typically found within 1000-4000 mating operations.

The GA has been extended to optimize large Lennard-Jones clusters with 148-309 atoms by Romero [51], who combined the GA with a stochastic search procedure on icosahedral lattices (i.e. using prior knowledge of likely cluster packing). A population of "spherical slices" of icosahedral lattices is used together with crossover, mutation and local minimization operators to explore the geometries of the Lennard-Jones clusters. Hartke has used his GA program to confirm the structures up to N = 250, though he has actually found lower energy structures (than Romero) in the range 185-187 atoms [52,53].

Michaelian, Garzón and co-workers [54-58] have developed a "symbiotic algorithm", based on GA principles, and have applied it to metal clusters bound by many-body Gupta potentials. The symbiotic algorithm partitions the cluster into cells, each centred on a cluster atom. The atoms within a cell are then optimized using a GA based method with crossover and mutation operators. Once the cell is considered optimized, i.e. there is no further improvement in its energy over a certain number of generations, the coordinates of the optimized cell are used to update the coordinates of the cluster. The symbiotic algorithm then moves on to the next cell and the GA is again used to optimize the atoms within the new cell. This process is repeated many times for all cells until the energy of the cluster as a whole remains constant over a number of iterations of the symbiotic algorithm.

A database of the application of Genetic Algorithms (and other Evolutionary Algorithms) to clusters (and other optimization problems) is maintained by Clark [59].

The remainder of this Chapter consists of a description of our own GA program for cluster geometry optimization and an overview of some of the applications (for different types of cluster) of this program.

3 The Birmingham Cluster Genetic Algorithm Program

A flow chart representing the operation of our cluster geometry optimization GA program [60] is shown in Fig. 5 and the basic features of the GA are described below.

Fig. 5. Flow chart for the Birmingham Cluster Genetic Algorithm Program

3.1 Generation of the initial population

For a given cluster size (nuclearity N), a number of clusters, Nclus (typically ranging from 10 to 30), are generated at random to form the initial population (the "zeroth generation"). We have followed the approach of Zeiri [23-25] in using the real-valued cartesian coordinates of the cluster atoms as the genes. The x, y and z coordinates are chosen randomly in the range [0, N^(1/3)]. This ensures that the cluster volume scales correctly with cluster size (i.e. linearly with N). All of the clusters in the initial population are then relaxed into the nearest local minima, by minimizing the cluster potential energy as a function of the cluster coordinates, using the quasi-Newton L-BFGS routine [61]. This routine utilizes analytical first derivatives of the potential.
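As an illustration of this step, the sketch below generates and relaxes a random starting population. It is not taken from the Birmingham program itself: the potential (a scaled Lennard-Jones function) is a stand-in, the population size and all function names are illustrative assumptions, and numerical gradients are used in place of the analytical derivatives mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def lj_energy(flat_coords):
    """Stand-in cluster potential: scaled Lennard-Jones energy (eps = sigma = 1)."""
    xyz = flat_coords.reshape(-1, 3)
    e = 0.0
    for i in range(len(xyz) - 1):
        r2 = np.sum((xyz[i + 1:] - xyz[i]) ** 2, axis=1)   # squared r_ij for j > i
        inv6 = 1.0 / r2 ** 3
        e += np.sum(4.0 * (inv6 ** 2 - inv6))
    return e

def random_cluster(n_atoms, rng):
    """Random coordinates in [0, N^(1/3)]: cluster volume scales linearly with N."""
    return rng.uniform(0.0, n_atoms ** (1.0 / 3.0), size=3 * n_atoms)

def relax(flat_coords):
    """Relax into the nearest local minimum (the original program uses L-BFGS
    with analytical first derivatives; numerical gradients are used here)."""
    res = minimize(lj_energy, flat_coords, method="L-BFGS-B")
    return res.x.reshape(-1, 3), res.fun

rng = np.random.default_rng(0)
population = [relax(random_cluster(20, rng)) for _ in range(10)]  # Nclus = 10
```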

The GA operators of mating (crossover), mutation and selection (on the basis of fitness) are performed to evolve one generation into the next. In this Chapter, we will use the term "mating" to refer to the process by which two parent clusters are combined to generate offspring. The mechanism, at the chromosome level, by which genetic material is combined, will be termed "crossover".

3.2 Fitness

The fitness of a string gives a measure of its quality; in this case, the lowest energy (most negative Vclus) clusters have the highest fitness and the highest energy (least negative Vclus) clusters have the lowest fitness. The cluster GA uses what is known as dynamic fitness scaling: the fitnesses of the strings in each generation are scaled to lie within a defined range. The fitness of the lowest energy cluster in the population is always 1 (or close to 1), with the highest energy cluster having a fitness of zero (or close to zero). The dynamic scaling is achieved by using a normalised value of the energy, ρ, in the fitness calculations:

ρi = (Vi − Vmin)/(Vmax − Vmin)    (1)

where Vmin and Vmax are the energies of the lowest and highest energy clusters in the current population, respectively.

The different fitness functions used are:

Exponential The fitness depends exponentially on ρi (α is typically set to 3):

    fi = exp(−α ρi)    (2)

Linear A linear relationship exists between the fitness and ρi:

    fi = 1 − 0.7 ρi    (3)

Power A quadratic (or other power) dependence on ρi:

    fi = 1 − ρi²    (4)

Hyperbolic Tangent A tanh dependence on ρi:

    fi = ½ [1 − tanh(2ρi − 1)]    (5)

The choice of fitness function controls which clusters are preferentially selected as parents. A function, such as the exponential, which slopes steeply away from its maximum value will give a large range of fitness values for clusters with ρ < 0.4. The selection of parents will be biased towards only the very low energy clusters in the population, i.e. those with values of ρ close to 0. If the power function is used, then for the same range ρ < 0.4 the range of fitness values is significantly smaller and there will be far less bias towards the lowest energy clusters. At the other end of the scale, the power function has a value of 0 when ρ = 1.0, so the highest energy cluster in the population will never be selected.
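The scaling and the four fitness functions might be coded as follows (a sketch, not the original implementation; the function and parameter names are illustrative, and the formulas simply mirror Eqs. (1)-(5) as reconstructed above):

```python
import numpy as np

def fitness(energies, kind="exp", alpha=3.0):
    """Dynamically scaled fitness (Eqs. 1-5) for a population of cluster energies."""
    v = np.asarray(energies, dtype=float)
    rho = (v - v.min()) / (v.max() - v.min())           # Eq. (1)
    if kind == "exp":
        return np.exp(-alpha * rho)                      # Eq. (2)
    if kind == "linear":
        return 1.0 - 0.7 * rho                           # Eq. (3)
    if kind == "power":
        return 1.0 - rho ** 2                            # Eq. (4)
    if kind == "tanh":
        return 0.5 * (1.0 - np.tanh(2.0 * rho - 1.0))    # Eq. (5)
    raise ValueError(f"unknown fitness function: {kind}")
```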

Page 180: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

172 Roy L. Johnston and Christopher Roberts

3.3 Selection of parents for mating

The selection of parents is accomplished in one of two ways: using either the roulette wheel or the tournament selection method [9]. In the roulette wheel method, a cluster is chosen at random and selected for mating if its fitness value (fi) is greater than a randomly generated number between 0 and 1 (i.e. if fi > R[0, 1]); otherwise another cluster is chosen at random and tested against another random number. This process continues until a pair of clusters has been selected. The tournament selection method picks a number of clusters at random from the population to form a "tournament" pool. The two lowest energy clusters are then selected as parents from this tournament pool.

In both of these selection schemes, low energy clusters (with high fitness values) are more likely to be selected for mating and therefore to pass their structural characteristics on to the next generation. Although each cluster may be chosen more than once for mating, the same cluster cannot be chosen as both parents for a single mating event. Once a pair of parents have been selected, they are subjected to the crossover operation.
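The two schemes might be sketched as follows (illustrative only: the tournament pool size of 4 is an assumption, since only "a number of clusters" is specified above; `fits` and `energies` come from the fitness step):

```python
import numpy as np

def roulette_pick(fits, rng):
    """Accept a randomly chosen cluster if its fitness beats R[0, 1]; else retry."""
    while True:
        i = rng.integers(len(fits))
        if fits[i] > rng.random():
            return i

def select_parents_roulette(fits, rng):
    """Pick two distinct parents by roulette wheel selection."""
    i = roulette_pick(fits, rng)
    j = i
    while j == i:            # the same cluster cannot act as both parents
        j = roulette_pick(fits, rng)
    return i, j

def select_parents_tournament(energies, rng, pool_size=4):
    """Pick a random pool and return its two lowest-energy (fittest) members."""
    pool = rng.choice(len(energies), size=pool_size, replace=False)
    i, j = sorted(pool, key=lambda k: energies[k])[:2]
    return i, j
```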

3.4 Crossover

Crossover (the exchange of "genetic" information) is carried out using a variant of the cut and splice crossover operator of Deaven and Ho [26,27]. In the original work of Deaven and Ho (see Fig. 4), a random plane was chosen which passes through the centre of mass of each cluster. The clusters were then cut about this plane and complementary halves were spliced together in order to generate the offspring or child clusters. In our implementation of the cut and splice operation, random rotations (about two perpendicular axes) are performed on both parent clusters and then both clusters are cut horizontally about one or two positions, parallel to the xy plane, and complementary fragments are spliced together. Several different crossover routines have been developed that create an offspring by making one or two cuts and putting together complementary slices. For the single cut method, the cutting plane can be chosen at random, it can be defined to pass through the middle of the cluster, or weighted according to the relative fitnesses of the two parents. For the double cut method, the cutting planes are chosen at random.

In practice, the cut and splice operation (after rotation of the parent clusters) is accomplished by ranking the coordinates of the component atoms of each rotated cluster in order of decreasing z coordinate and then selecting the first (highest z) N − mpos coordinates from the first parent and the last (lowest z) mpos coordinates from the second parent and combining them to generate a child cluster with N atoms, as shown in Fig. 6. The choice of a random crossover point, which reduces to the selection of a random integer mpos in the range [1, N − 1], leads to a greater number of possible offspring from a given pair of parents, thereby helping to maintain population diversity. Though we have chosen only to generate one child from each crossover operation, the creation of two children may be desirable in cases where mating leads to too few children of comparable fitness to their parents.

""",",

+7:

m ... --

Fig. 6. A diagrammatic representation of the crossover operation adopted in our GA

Mating continues until a predetermined number of offspring (Noff) have been generated. The number of offspring is generally set to approximately 80% of the population size (i.e. Noff = 0.8 × Nclus). Unless selected for mutation (see below), the offspring clusters are subsequently relaxed into the nearest local minima, as described above. The local minimization step obviously changes the structure of the child cluster, and this structural rearrangement will be greatest in the region of the join between the two fragments donated by its parents. As the clusters get larger, however, the perturbation due to the local minimization should become relatively smaller and confined to the join region. In this way, the principle of schemata [8], whereby parents with high fitness are more likely to have fit children (by passing on fragments with low energy arrangements of atoms), should apply.
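In outline, the single-cut variant described above might look as follows (a sketch: the choice of the x and z rotation axes and the centring on the centroid are assumptions, and `parent1`/`parent2` are (N, 3) coordinate arrays):

```python
import numpy as np

def random_rotation(xyz, rng):
    """Rotate a cluster (centred on its centroid) about two perpendicular axes."""
    a, b = rng.uniform(0, 2 * np.pi, size=2)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    rz = np.array([[np.cos(b), -np.sin(b), 0],
                   [np.sin(b),  np.cos(b), 0],
                   [0, 0, 1]])
    c = xyz - xyz.mean(axis=0)
    return c @ rx.T @ rz.T

def cut_and_splice(parent1, parent2, rng):
    """Single random cut: top N - mpos atoms of parent 1 + bottom mpos of parent 2."""
    n = len(parent1)
    m_pos = rng.integers(1, n)                      # random integer in [1, N-1]
    p1 = random_rotation(parent1, rng)
    p2 = random_rotation(parent2, rng)
    top = p1[np.argsort(-p1[:, 2])][: n - m_pos]    # highest-z atoms of parent 1
    bottom = p2[np.argsort(-p2[:, 2])][n - m_pos:]  # lowest-z atoms of parent 2
    return np.vstack([top, bottom])                 # child cluster with N atoms
```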

3.5 Mutation

While the mating/crossover operation leads to a mixing of genetic material in the offspring, with the exception of the small perturbation in the join region, no new genetic material is introduced. For small populations, this can lead to population stagnation and premature convergence on a non-optimal structure. In an attempt to avoid stagnation and to maintain population diversity, a mutation operator is introduced. Each string has a probability (Pmut) of undergoing mutation. A random number between 0 and 1 is generated, and if the random number is less than Pmut then the cluster undergoes mutation. The mutation perturbs some or all of the atomic positions within a cluster.

A number of mutation schemes have been adopted:

Atom Replacement Mutation This mutation involves replacing the atomic coordinates of a certain number of the atoms with randomly generated values. The number of atomic coordinates replaced is set to be approximately one third of the total number of atoms, N.

Twisting Mutation In this mutation scheme, which is analogous to the twinning mutation of Wolf and Landman [47], the cluster is mutated by rotating the upper half of the cluster about the z axis by a randomly generated angle, relative to the bottom half.

Cluster Replacement Mutation This mutation involves the replace­ment of an entire cluster with a new, randomly generated cluster. The cluster is generated in an identical way to that used for the generation of the initial population.

Atom Permutation Mutation This mutation operator swaps the atom types of a pair of atoms without perturbing the structure of the cluster. Approximately N/3 atom label swaps are performed per cluster mutation. This type of mutation is used for hetero-elemental clusters, such as ionic clusters and bimetallic clusters.

After mutation, each "mutant" cluster is subsequently relaxed into the nearest local minimum, using the L-BFGS minimization routine, as described above.
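Three of these schemes might be sketched as follows (illustrative assumptions: the "upper half" in the twisting mutation is taken here as the atoms above the median z coordinate, which the text does not specify, and the swap count is passed in explicitly):

```python
import numpy as np

def atom_replacement(xyz, rng):
    """Replace ~N/3 atomic positions with fresh random coordinates."""
    new = xyz.copy()
    n = len(xyz)
    idx = rng.choice(n, size=max(1, n // 3), replace=False)
    new[idx] = rng.uniform(0.0, n ** (1.0 / 3.0), size=(len(idx), 3))
    return new

def twisting(xyz, rng):
    """Rotate the upper half of the cluster about the z axis by a random angle."""
    new = xyz - xyz.mean(axis=0)
    theta = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0, 0, 1]])
    upper = new[:, 2] > np.median(new[:, 2])   # assumed definition of "upper half"
    new[upper] = new[upper] @ rot.T
    return new

def atom_permutation(types, rng, n_swaps):
    """Swap the labels of pairs of atoms (hetero-elemental clusters only)."""
    t = list(types)
    for _ in range(n_swaps):
        i, j = rng.choice(len(t), size=2, replace=False)
        t[i], t[j] = t[j], t[i]
    return t
```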

3.6 Diversity Checking

The program contains an option of removing clusters from the population that have a difference in energy of less than a value δE, which is typically set to 1 × 10⁻⁶ eV. If two or more clusters have energies less than δE apart, then the lowest energy cluster is retained and the other clusters are discarded. Use of this operator ensures that population diversity is maintained. (A sketch combining this filter with the selection step of Section 3.7 is given after that section.)


3.7 Selection of the New Population

This corresponds to "natural selection" or "survival of the fittest" in biological evolution. The new population is selected from the Nclus lowest energy, and therefore highest fitness, clusters in the set containing the old population, the new offspring clusters and the mutated clusters. The inclusion of clusters from the previous generation makes the GA "elitist". If the previous generation contains clusters of lower energy than the offspring or mutant clusters, then these will be copied into the new population, ensuring that the lowest energy clusters are always present in the population; i.e. the best member of the population cannot get worse from one generation to the next.
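Sections 3.6 and 3.7 together amount to a sort-and-filter step, which might be sketched as follows (illustrative; `pool` is assumed to hold (energy, coordinates) pairs for the old population, the offspring and the mutants):

```python
def next_generation(pool, n_clus, delta_e=1e-6):
    """Keep the N_clus lowest-energy survivors (elitism), discarding
    near-duplicates whose energies lie within delta_e of a retained cluster."""
    survivors = []
    for energy, coords in sorted(pool, key=lambda c: c[0]):
        if all(abs(energy - e) > delta_e for e, _ in survivors):
            survivors.append((energy, coords))
        if len(survivors) == n_clus:
            break
    return survivors
```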

3.8 Subsequent Generations

Once the new generation has been formed, the potential energies of the best (Vmin) and worst (Vmax) members of the population are recorded and the fitness values are calculated for the entire population. The whole process of mating, mutation and selection is then repeated for a specified number (Ngen) of generations, or until the population is deemed to have converged. The population is considered to be converged if the range of cluster energies in the population does not change for a prescribed number of generations.
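One possible reading of this convergence test, as a sketch (the window length and tolerance are illustrative assumptions; the text does not specify them):

```python
def converged(vmin_history, vmax_history, window=10, tol=1e-8):
    """True if the population energy range Vmax - Vmin has stayed (numerically)
    constant over the last `window` generations."""
    if len(vmin_history) < window:
        return False
    spans = [vmax - vmin
             for vmin, vmax in zip(vmin_history[-window:], vmax_history[-window:])]
    return max(spans) - min(spans) < tol
```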

A considerable amount of effort has been expended in optimizing the GA operations and parameters [62]. In the next Section, we shall detail applications of our GA to a number of types of cluster. Values of the GA parameters, and any other options adopted, are listed where appropriate. It should be noted that, because of the stochastic nature of the GA, the GA program is run several times for each cluster nuclearity and for each set of GA operations/parameters.

4 Applications of the Birmingham Cluster Genetic Algorithm Program

Many of the structures of the global minima found by our GA, for a number of cluster types, can be found on the Birmingham Cluster Web site [63].

4.1 Morse Clusters

The first type of potential that we studied with our cluster GA was the Morse potential [62,64] because, although Morse clusters had not previously been studied with GAs, Doye and Wales had described the structural consequences of varying the range of the Morse potential [7]. Furthermore, using the Basin Hopping Monte Carlo approach [28], they had found global minima for Morse clusters with different range parameters and noted that the short range Morse potential (which has many local minima and a very "noisy" potential energy surface) presents a particular challenge for global optimization techniques [7]. We therefore decided to apply our GA to find global minima for Morse clusters and to compare our results with Doye and Wales' tabulated coordinates and energies, which are available on the Cambridge Cluster Database website [22].

The Morse Potential

The Morse potential is a pair-wise additive potential [65] which depends only on the separations rij between pairs of atoms:

vij = De exp[α(1 − rij/re)] {exp[α(1 − rij/re)] − 2}    (6)

where De is the bond dissociation energy (assumed constant for all interactions in a homonuclear cluster), re is the equilibrium bond length and α is the range exponent of the potential. Short range Morse potentials correspond to high values of α.

As in the work of Doye and Wales [7], we have adopted a simplified, scaled version of the Morse potential with De and re both set to one:

vij = exp[α(1 − rij)] {exp[α(1 − rij)] − 2}    (7)

This provides a non-atom-specific potential which depends on a single parameter: the range exponent α. We have compared short range (α = 14) and medium range (α = 6) Morse potentials.

The total potential energy of a cluster of N atoms is obtained by summing over all atom pairs:

Vclus = Σ_{i=1}^{N−1} Σ_{j>i} vij    (8)

It should be noted that, as all Vclus values are negative, the expression "low energy" actually refers to high values of −Vclus.
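Equations (7) and (8) translate directly into code; the sketch below (illustrative, with the coordinates as an (N, 3) array) is the kind of energy function that would be handed to the L-BFGS minimizer of Section 3.1:

```python
import numpy as np

def morse_cluster_energy(xyz, alpha):
    """Scaled Morse energy, Eqs. (7) and (8): De = re = 1, range exponent alpha."""
    e = 0.0
    for i in range(len(xyz) - 1):
        r = np.linalg.norm(xyz[i + 1:] - xyz[i], axis=1)   # r_ij for all j > i
        x = np.exp(alpha * (1.0 - r))                      # exp[alpha(1 - r_ij)]
        e += np.sum(x * (x - 2.0))                         # v_ij, summed over pairs
    return e
```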


Comparison of GA and Random Search Methods for Global Optimization of Morse Clusters

The efficiency of the GA was compared with that of a simple random search algorithm (RSA). The RSA generates a number of structures using an identical method to that used to generate the initial population of the GA and then minimizes the energy of these clusters using the L-BFGS local minimization routine. Table 1 presents the success or failure of the RSA in finding the global minimum (GM) energy structures of clusters bound by a Morse potential with α = 6. It can be seen that the RSA is only successful in locating the GM for the 20-atom cluster, even though as many as 30,000 random geometries were generated for the 50-atom cluster.

Table 1. The success of the random search algorithm in locating the global minima of Morse (α = 6) clusters with N = 20, 30, 40 and 50 atoms

N     Nsearch   GM found?
20    5000      yes
30    10000     no
40    20000     no
50    30000     no

The GA was used to search for the GM of Morse clusters of the same nuclearities, with the GA being run from 40 different initial populations for each nuclearity. The GM was found during each run of the GA for all four cluster nuclearities; this is a considerable improvement over the RSA, where the GM was found only for the smallest cluster. The average number of energy minimizations (Nmin) required to locate the GM for these clusters, measured over the 40 runs for each nuclearity, can be seen in Table 2. The GA requires an average of 472 energy minimizations to locate the GM for the 50-atom Morse cluster; this is significantly fewer than the 30,000 energy minimizations performed by the unsuccessful RSA. The table also reveals that it is particularly challenging to find the GM for N = 30, which takes nearly as many minimizations (on average) as for N = 40.

The success of the GA compared with the random search algorithm is undeniable. We are currently working on a detailed comparison of the efficiency and reliability of our GA relative to other global optimization techniques, such as Monte Carlo, Molecular Dynamics Simulated Annealing and Monte Carlo Basin Hopping.


Table 2. The average number of minimizations (Nmin) required by the GA to locate the global minima of Morse (α = 6) clusters with N = 20, 30, 40 and 50 atoms

N        20    30    40    50
(Nmin)   31    301   323   472

GA Optimization of 19-50 Atom Morse Clusters

The cluster geometry optimization GA program was used to study Morse clusters (α = 6 and 14) with N = 19-50. Preliminary calculations were performed on a number of trial clusters and the following optimal values were obtained for the GA program parameters: Nclus = 10; Nmat = 0.8 × Nclus (i.e. Noff = 8); Pmut = 0.1; and Ngen = 10-300 (increasing with cluster size). The exponential fitness function was adopted, with roulette wheel selection, random single cut crossover and atom replacement mutation.

The GA located all of the previously published global minima [7] for Morse clusters with 19 to 50 atoms, both for medium (α = 6) and short range (α = 14) Morse potentials. The GA found a lower energy structure for the N = 30 (α = 14) Morse cluster than was initially reported by Doye and Wales [7], though this structure has now been found by their Basin Hopping algorithm [22].

The structures of some of the global minima for α = 6 and α = 14 are shown in Fig. 7. These structures have been discussed in detail by Doye and Wales [7], the most obvious difference between the two potentials being that the longer range (α = 6) potential tends to favour poly-tetrahedral, icosahedral geometries, while the short range (α = 14) potential favours decahedral and fcc-like packing (such as the truncated octahedral cluster which is the global minimum for N = 38).

Fig. 7. Geometries of selected Morse clusters with (from left to right) 19, 38 and 50 atoms for α = 6 (top row) and α = 14 (bottom row).

Cluster Evolution

One way of monitoring the progress of the cluster GA is to construct an Evolutionary Progress Plot (EPP) [66] of the lowest energy (Vmin), highest energy (Vmax) and average energy (Vave), as a function of generation number. Such EPPs are shown in Fig. 8 for N = 38, with α = 6 and α = 14. The EPPs show that there is a rapid improvement in the population (a sharp drop in Vmin, Vmax and Vave) in the early generations, relative to the initial, randomly generated population ("generation 0"). This early improvement is due entirely to the mating process. Subsequent, less dramatic improvement occurs in a stepwise fashion and may be due to mating or mutation.

Fig. 8. Evolutionary Progress Plots for 38-atom Morse clusters with α = 6 and α = 14.

Fig. 8 shows that, in both cases, the population converges on a single structure, as evidenced by Vmin, Vmax and Vave becoming equal. The converged structure was found to be the global minimum in each case. The fcc-like truncated octahedral geometry of the 38-atom (α = 14) Morse cluster is difficult to find with most global optimization techniques [7], but is here found before the 100th generation, even for a small population size of 10. Comparison with EPPs for larger clusters confirms that the GA must, in general, be run for a greater number of generations for larger cluster sizes. Similarly, comparison of the EPPs for α = 14 with those for α = 6 confirms that the short range Morse potential is more difficult to search [7], taking 2-3 times as many generations to find the global minima.


4.2 Ionic Clusters

Ionic clusters are derived from ionic solids, compounds formed between electropositive metals and electronegative non-metals. The bonding in ionic clusters is primarily due to electrostatic interactions, and the simplicity of modelling ionic clusters makes them ideal systems in which to study size-dependent properties. Ionic clusters have a number of practical applications, such as: silver halides in the photographic process; ZnS clusters as gas sensors; CdS and CdSe clusters as photodetectors; and polar semiconductors made from GaAs and InP clusters. The study of NaCl and NaBr clusters is also important in determining the mechanisms of ozone depletion and pollution in marine environments.

In this Section, we will consider the simplest ionic oxide species, magnesium oxide (MgO), which is known to crystallize in the rock-salt (NaCl) structure. The suitability of the GA for investigating the structure of mixed element (MgO) and ionic clusters will be studied using the rigid ion model to describe the electrostatic bonding.

The Rigid Ion Potential

The rigid ion model is a simple model of the bonding in ionic solids and their clusters, in which the ions are of fixed size and carry fixed charges. The interaction between a pair of ions is given by the sum of the long range electrostatic Coulomb energy and a repulsive Born-Mayer potential (which reflects the short range repulsive energy due to the overlap of the electron density of the ions):

Vij = qi qj e² / (4πε0 rij) + Bij exp(−rij/ρij)    (9)

where qi and qj are the charges on ions i and j, rij is the inter-ion distance, and Bij and ρij are Born-Mayer energy and distance scaling parameters, respectively. Bij and ρij are zero when ions i and j are both Mg2+.

The potential parameters, listed in Table 3, were derived by Lewis and Catlow [67], for formal charges of Mg2+ and O2-.

Table 3. Rigid Ion Potential Parameters for MgO

Parameter      Value
Bij (Mg-O)     821.6 eV
Bij (O-O)      22764 eV
ρij (Mg-O)     0.3242 Å
ρij (O-O)      0.1490 Å

Comparison of Mg+O- and Mg2+O2- Cluster Structures

The ions in MgO have formal charges of Mg2+ and O2-, but, for reasons which are described below, we have used the cluster GA to optimize the geometries of stoichiometric (MgO)N clusters (N = 10-35), with formal ionic charges of Mg+O-, as well as Mg2+O2- [68]. The following GA parameters were used: roulette wheel selection, with the tanh fitness function; weighted single-cut crossover; rotate mutation, with mutation probability Pmut = 0.05; population size Nclus = 20; Nmat = 0.8 × Nclus (i.e. 16 offspring produced per generation); and maximum number of generations Ngen = 200.
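As a sketch, Eq. (9) with the Table 3 parameters might be coded as follows (illustrative only: the Coulomb prefactor e²/4πε0 ≈ 14.3996 eV Å is the standard value, coordinates are assumed to be in Å, and the zero Mg-Mg Born-Mayer entry follows the statement above):

```python
import numpy as np

KE2 = 14.3996  # e^2 / (4 pi eps0) in eV * angstrom (standard constant)

BORN = {("Mg", "O"): (821.6, 0.3242),   # B_ij (eV), rho_ij (angstrom), Table 3
        ("O", "O"): (22764.0, 0.1490),
        ("Mg", "Mg"): (0.0, 1.0)}       # B_ij and rho_ij are zero for Mg-Mg

def rigid_ion_energy(xyz, species, charges):
    """Sum of Coulomb and Born-Mayer terms (Eq. 9) over all ion pairs."""
    e = 0.0
    for i in range(len(xyz) - 1):
        for j in range(i + 1, len(xyz)):
            r = np.linalg.norm(xyz[i] - xyz[j])
            b, rho = BORN[tuple(sorted((species[i], species[j])))]
            e += KE2 * charges[i] * charges[j] / r
            if b > 0.0:
                e += b * np.exp(-r / rho)
    return e
```

For an (MgO)N cluster one would call this with alternating species labels and charges of ±q, e.g. `rigid_ion_energy(xyz, ["Mg", "O"] * n, [q, -q] * n)`.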

Using the GA and the Rigid Ion Potential, the lowest energy Mg+O- clusters were found to be compact cubic structures based on the NaCl structure (the structure of bulk MgO), whereas the Mg2+O2- clusters are more spherical, cage like structures with square, hexagonal and octagonal faces. The shorter equilibrium bond length of a pair of doubly charged ions forces all ions closer together in the cluster, which increases the repulsion between like ions. The repulsion is much greater between ions of like charge in the Mgq+Oq- clusters with q = 2, compared with those with q = 1. This causes the structures to open out from the highly coordinated lattice structures seen in the clusters when q = 1 to the cage like structures seen when q = 2.

The rigid ion potential does not contain any terms to describe the polarization of ions in a system, and, as the O2- ions are highly polarizable and the Mg2+ ions are strongly polarizing, the rigid ion model provides a poor description of Mg2+O2- clusters. The effect of polarization in the clusters will be to reduce the effective charges of both ions, thereby reducing the repulsion felt between like ions. This will lead to the more highly coordinated (bulk-like) cuboidal structures being preferred over the cage like structures, due to the higher degree of bonding present. The rigid ion model calculations used singly charged ions in order to mimic, partially, the effect of polarizability. Calculations using potentials that include terms for the polarizability of the cluster ions [69] and ab initio calculations [70-72] predict cubic structures to be the ground state configurations of MgO clusters.

Ziemann and Castleman [73] have studied (MgO)N+ and (MgO)NMg+ clusters experimentally, using laser-ionization time-of-flight mass spectrometry. The resulting spectra showed magic number clusters (corresponding to relatively intense peaks) for (MgO)N+ at N = 2, 4, 6, 9, 12, 15, 18, 24, 27, and 30 and for (MgO)NMg+ at N = 8, 11, 13, 16, 19, 22, 25, and 27. These results indicate that MgO clusters have compact cuboidal (rock salt) structures, as predicted by the rigid ion model with ionic charges of +1 and −1. Similar structures are known to be preferred for alkali-halide clusters [74]. Ziemann and Castleman also performed calculations based on the rigid ion model [73]. They found similar trends in the structures of clusters formed from singly and doubly charged ions to those produced by our GA optimization: namely cuboidal structures for q = 1 and cage like structures for q = 2.

Our GA found clusters with complete lattice structures for (Mg+O-)N clusters with N = 12 (4 × 3 × 2), 15 (5 × 3 × 2), 16 (4 × 4 × 2), 18 (4 × 3 × 3), 20 (5 × 4 × 2), 24 (4 × 4 × 3), 30 (5 × 4 × 3) and 32 (4 × 4 × 4), with most of these matching the magic numbers observed experimentally by Ziemann and Castleman [73].

Variation of Cluster Structure with Formal Charge

As noted above, MgO clusters with formal ionic charges of ±1 and ±2 have different structures. To investigate the dependence of structure on charge further, we have studied the effect, on the lowest energy structures of small (Mgq+Oq-)N clusters, of varying the magnitude of the charge (q) in increments of 0.1 from 1.0 to 2.0. Here, we report the results for N = 8 and 9. In the following discussion, our results are compared with work by Puente and Malliavian [71], who studied neutral MgO clusters of between 6 and 13 units using HF (Hartree-Fock) and CHF (Correlated Hartree-Fock) ab initio techniques. The CHF method includes intra-ion electron correlation corrections.

Fig. 9. The structures of (Mgq+Oq-)8 clusters with q in the range: 1.0-1.3 (left); 1.4-1.6 (centre); and 1.7-2.0 (right)

As shown in Fig. 9, the (MgO)8 cluster has three possible structures: a 4 × 2 × 2 lattice for q = 1.0-1.3, a cage structure for q = 1.4-1.6 and a structure consisting of two stacked octagonal rings for q = 1.7-2.0. The two more open structures can both be generated by lengthening different Mg-O contacts in the 4 × 2 × 2 cuboid. The stacked ring structure is the ground state structure predicted by the HF calculations of Puente and Malliavian, while the CHF calculations predict a structure of two hexagonal MgO rings with an Mg2O2 square capping one of the square faces [71].

Genetic Algorithms for Geometry Optimization 183

hexagonal MgO rings with an Mg20 2 square capping one of the square faces [71].

In contrast to the N = 8 cluster, the same geometry was found for (MgO)9 for q = 1.0-2.0, corresponding to three stacked hexagonal Mg3O3 rings. This geometry, which is shown in Fig. 10, is the ground state structure found, using both the HF and CHF methods, by Puente and Malliavian [71].

Fig. 10. The structure of (Mgq+Oq-)9 clusters with q in the range 1.0 to 2.0.

Evolutionary Progress Plots (EPPs) showing the convergence of the GA on the lowest energy (MgO)30 clusters, with formal charges of ±1 and ±2, are shown in Fig. 11.


Fig. 11. EPPs for (Mg+O-)30 and (Mg2+O2-)30 clusters

The GA required 74 generations to find the lowest energy structure of (Mg2+O2-)30 but only 7 generations to find the lowest energy structure of (Mg+O-)30. In both cases the minimum, maximum and average energies converge, indicating that the GA has converged on a single solution. The GA converges in 27 generations for q = ±1 and in 100 generations for q = ±2. These results are typical of the other cluster nuclearities studied. The rigid ion potential with formal charges of ±2 is shorter ranged, and the potential energy surface is therefore likely to have more local minima for the GA to search, leading to greater difficulty in finding the global minimum.

Non-stoichiometric Clusters The GA was subsequently used to search for the global minima of non-stoichiometric (Mgq+Oq-)NMgq+ clusters with N in the range 5 to 29, for q = 1 and 2.

The trends in the geometries of the clusters are similar to those seen in the geometries of the stoichiometric clusters; the clusters with ions of charge ±1 have cubic structures and those with ions of charge ±2 are cage-like. This is to be expected, as the same factors influence the geometries of the stoichiometric and non-stoichiometric clusters modelled with the rigid ion potential. The same cluster geometries are found for N = 12 and N = 13 (MgO)NMgq+ clusters with both singly and doubly charged ions. The N = 12 structure has an Mg4O5 3 x 3 square array of ions with two Mg4O4 rings stacked above it. The N = 13 structure is a 3 x 3 x 3 cube. Ab initio calculations on non-stoichiometric (MgO)NMg2+ clusters [75,76] confirm that these clusters adopt structures based on NaCl, though there is some uncertainty as to the actual geometries [75,77].

4.3 Carbon Clusters

There has been considerable interest in carbon clusters, and in particular the fullerenes, since the experimental discovery of the icosahedral ("buckyball") fullerene structure of C60 by Kroto, Smalley and co-workers [78]. Fullerenes are composed of an even number of 3-coordinate sp2 carbon atoms that arrange themselves into hexagonal and pentagonal faces. Much of the interest in fullerenes has arisen because of their unique electronic properties, which give rise to numerous possible applications: superconductors made from K3C60; new semi-conducting materials; molecular containers with possible medical applications; and nanometre-thickness carbon fibres.

The Murrell-Mottram Potential In this study, the cohesion of the carbon clusters is described by the Murrell-Mottram (MM) 2+3-body potential [79,80]. The MM potential is based on a many-body expansion of the potential energy:

$$V = V^{(1)} + V^{(2)} + V^{(3)} + \cdots + V^{(n)} \qquad (10)$$

in which the atomic term $V^{(1)}$ is set to zero and the series is truncated at the 3-body level, $V^{(3)}$.


The 2-body (pair) potential, between atoms i and j, is expressed as:

$$V^{(2)}_{ij} = -D\,(1 + a_2\rho_{ij})\,e^{-a_2\rho_{ij}} \qquad (11)$$

where D is the dissociation energy of the pair potential, $\rho_{ij}$ is the reduced interatomic distance:

$$\rho_{ij} = (r_{ij} - r_e)/r_e \qquad (12)$$

and $r_e$ is the equilibrium distance of the pair potential. The 3-body term, for triangle $(i, j, k)$:

$$V^{(3)}_{ijk} = D\,P(Q_1, Q_2, Q_3)\,F(a_3, Q_1) \qquad (13)$$

is restricted by the requirement that it be unchanged upon interchanging identical atoms. This is achieved by defining the 3-body potential in terms of the symmetry coordinates $Q_i$:

$$\begin{pmatrix} Q_1 \\ Q_2 \\ Q_3 \end{pmatrix} = \begin{pmatrix} \sqrt{1/3} & \sqrt{1/3} & \sqrt{1/3} \\ 0 & \sqrt{1/2} & -\sqrt{1/2} \\ \sqrt{2/3} & -\sqrt{1/6} & -\sqrt{1/6} \end{pmatrix} \begin{pmatrix} \rho_{ij} \\ \rho_{jk} \\ \rho_{ki} \end{pmatrix} \qquad (14)$$

A totally symmetric polynomial can be written in terms of sums and products of the functions $Q_1$, $Q_2^2 + Q_3^2$ and $Q_3^3 - 3Q_3Q_2^2$, which are invariant with respect to the interchange of identical atoms.

$V^{(3)}$ is defined by an exponent $a_3$ and a set of polynomial coefficients $c_i$. For carbon, a quartic polynomial was adopted:

$$P(Q_1, Q_2, Q_3) = c_0 + c_1 Q_1 + c_2 Q_1^2 + c_3 (Q_2^2 + Q_3^2) + c_4 Q_1^3 + c_5 Q_1 (Q_2^2 + Q_3^2) + c_6 (Q_3^3 - 3Q_3Q_2^2) + c_7 Q_1^4 + c_8 Q_1^2 (Q_2^2 + Q_3^2) + c_9 (Q_2^2 + Q_3^2)^2 + c_{10} Q_1 (Q_3^3 - 3Q_3Q_2^2) \qquad (15)$$

$F(a_3, Q_1)$ is a damping function which makes $V^{(3)}$ go to zero exponentially as $Q_1$ goes to infinity. Several forms for the damping function have been investigated [80]. The carbon potential used in this work has the damping function:

(16)

The parameters for the MM carbon potential were derived by Eggen et al. [81] and are listed in Table 4. These parameters were obtained by a least squares fitting to experimental data from the diamond allotrope of bulk carbon. The interlayer spacing of graphite was also included in the fitting of the potential parameters.
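Putting Eqs. (10)-(15) together, the following Python sketch evaluates the MM 2+3-body energy for an arbitrary set of coordinates, using the carbon parameters of Table 4. Since the printed form of the damping function in Eq. (16) is not recoverable here, a simple exponential decay, exp(-a3*Q1), is assumed in its place; the overall structure of the evaluation, rather than that particular choice, is what the sketch is meant to show.

```python
import itertools
import math

# Parameters for carbon from Table 4 (Eggen et al. [81]).
D, RE, A2, A3 = 6.298, 1.570, 8.200, 8.200
C = [8.087, -13.334, 26.882, -51.646, 12.164, 51.629,
     25.697, -5.964, -7.306, 2.208, 13.707]          # c0 .. c10

def rho(r):
    return (r - RE) / RE                              # reduced distance, Eq. (12)

def v2(r):
    """MM pair term, Eq. (11)."""
    p = rho(r)
    return -D * (1.0 + A2 * p) * math.exp(-A2 * p)

def v3(ri, rj, rk):
    """MM 3-body term, Eqs. (13)-(15), with an ASSUMED exponential damping."""
    pij = rho(math.dist(ri, rj))
    pjk = rho(math.dist(rj, rk))
    pki = rho(math.dist(rk, ri))
    q1 = (pij + pjk + pki) / math.sqrt(3.0)           # symmetry coords, Eq. (14)
    q2 = (pjk - pki) / math.sqrt(2.0)
    q3 = (2.0 * pij - pjk - pki) / math.sqrt(6.0)
    s2 = q2 * q2 + q3 * q3                            # Q2^2 + Q3^2
    s3 = q3 ** 3 - 3.0 * q3 * q2 * q2                 # Q3^3 - 3*Q3*Q2^2
    poly = (C[0] + C[1] * q1 + C[2] * q1 ** 2 + C[3] * s2 + C[4] * q1 ** 3
            + C[5] * q1 * s2 + C[6] * s3 + C[7] * q1 ** 4
            + C[8] * q1 ** 2 * s2 + C[9] * s2 ** 2 + C[10] * q1 * s3)
    damp = math.exp(-A3 * q1)                         # assumption standing in for Eq. (16)
    return D * poly * damp

def mm_energy(coords):
    """Sum of all pair and triangle contributions, Eq. (10) truncated at V(3)."""
    e = sum(v2(math.dist(a, b)) for a, b in itertools.combinations(coords, 2))
    e += sum(v3(a, b, c) for a, b, c in itertools.combinations(coords, 3))
    return e

# Example: an equilateral C3 triangle at the pair-potential equilibrium distance.
tri = [(0.0, 0.0, 0.0), (RE, 0.0, 0.0), (RE / 2, RE * math.sqrt(3) / 2, 0.0)]
print(f"V(C3 triangle) = {mm_energy(tri):.3f} eV")
```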


Table 4. Parameters defining the MM potential for carbon [81]

Parameter   Value
a2          8.200
a3          8.200
D / eV      6.298
re / Å      1.570
c0          8.087
c1          -13.334
c2          26.882
c3          -51.646
c4          12.164
c5          51.629
c6          25.697
c7          -5.964
c8          -7.306
c9          2.208
c10         13.707

Small Carbon Clusters For the MM potential, the cluster optimization GA [62] finds small CN clusters (N < 20) to have cage structures, with 2- and 3-coordinate atoms. Experiments, however, have shown that these clusters have linear chain or mono-/bicyclic ring structures [82-84].

For N = 20 and for even-N clusters with N > 24, the GA finds fullerene structures [85]. Fullerenes are hollow, pseudo-spherical, 3-connected cages with 12 pentagonal faces and any number (except one) of hexagonal faces. The 20 atom carbon cluster is the lowest nuclearity fullerene possible, and the generation of the dodecahedral C20 fullerene has recently been reported [86]. It is known, from experiments, that clusters with an even number of atoms, from N = 24 onwards, do tend to adopt fullerene structures [85].

Clusters with an odd number of atoms cannot form fullerene cages, as it is impossible for all the atoms to be 3-connected. In the clusters predicted by the MM potential, the 'extra' atom lies on the surface and bridges an edge of the fullerene cage.


C60: The Archetypal Fullerene The search for the GM for C60, using the MM potential, was performed using the sub-population parallel version of the GA program (see Section 5.2). Sub-populations of 40 clusters per processor were adopted, on 8 processors, giving a total population size of 320. Calculations were run for a maximum of 250 generations.

Previous theoretical studies indicate that the icosahedral "Buckyball" structure is the lowest energy fullerene isomer for C60, as it is the only isomer which has no (unfavourable) adjacent pentagonal rings [87-89]. Unfortunately, the lowest energy configuration of C60 found by the GA in conjunction with the MM carbon potential is not the icosahedral structure, but a less spherical fullerene structure with Cs symmetry (see Fig. 12). In fact, the GA does find the Ih fullerene structure as the second lowest energy isomer of C60 for the MM potential.

Fig. 12. The icosahedral Buckyball structure of C60 (left) and the Cs structure of C60 found as the GM for the MM potential (right)

The energies of the two structures were calculated by the RHF ab initio method using the vdz(p) basis set in the MOLPRO [90] quantum chemistry package. The energy of the Ih fullerene structure is -2233.8494 hartrees, whereas the energy of the Cs structure is -2231.9643 hartrees. These high level calculations confirm the expectation that the Ih fullerene structure is the lowest energy configuration for C60 [87-89]. The hexagonal rings in the Cs structure are puckered, rather than planar, perhaps reflecting the fact that the parameters of the MM carbon potential were fitted to experimental data from the diamond structure of bulk carbon, where the carbon atoms are sp3 hybridized. The neglect of electronic effects in the MM potential may explain the incorrect energy ordering of the Ih and Cs isomers of C60, since the driving force for avoiding adjacent pentagons has electronic, as well as steric, origins.

This finding highlights an important point: namely that a GA (or any other search method) is only able to find the lowest energy structure consistent with the potential function which has been adopted. If the potential gives an incorrect description of the lowest energy cluster, the GA will find clusters with geometries different from those predicted by more accurate potentials. In this way, the efficiency of the GA in searching a potential energy surface enables it to be used to test the quality of a particular potential.

Hobday and Smith [44] have performed geometry optimizations of carbon clusters modelled with the Brenner and MM potentials using their own GA program. They optimized clusters with 6-60 atoms using the Brenner potential and clusters with 10-22 atoms using the MM potential. The lowest energy clusters found using the MM potential agree with those found in this work, with the exception of the 18-atom cluster, where the lowest energy cluster they find is higher in energy than our global minimum. The clusters modelled with the Brenner potential have significantly different geometries from those modelled with the MM potential for many nuclearities, especially for small clusters (N < 20), for which the Brenner potential predicts ring structures for clusters with 9-17 atoms. For the Brenner potential, the Hobday and Smith GA correctly finds the Ih Buckyball fullerene to be the lowest energy structure for C60.

The MM potential does not provide an accurate prediction of the structures of carbon clusters because it is only able to model the geometric interactions and not the important electronic interactions present in carbon clusters. The deficiencies of the MM potential were highlighted by the efficiency of the GA in searching the potential energy surface and locating the lowest energy cluster isomers. A less thorough search of the potential energy surface could have failed to find the anomalous global minimum energy structures predicted by the potential.

4.4 Metal Clusters

There is continuing interest in metal clusters because of potential applications in fields such as catalysis and nano-electronics (e.g. in single electron tunnelling devices). It is known that alkali metal clusters, with sizes of up to thousands of atoms, conform to the jellium model, in that certain nuclearities are relatively stable (the so-called magic numbers) due to their having filled electronic shells [91]. The same model also explains the stabilities of small clusters of the noble metals (Cu, Ag and Au). By contrast, clusters of transition metals and some alkaline earth elements (e.g. Ca and Sr) exhibit magic numbers which correspond to clusters consisting of concentric polyhedral shells (geometric shells) of atoms, where the relative stability of a given cluster is determined by the competition between packing and surface energy effects [92].

We have applied our GA to the study of monometallic clusters (composed of a single metallic element), such as Al, Ni, Cu, Au and Ir, described by Murrell-Mottram and/or Gupta potentials. Recently, we have also considered mixed-metal bimetallic Cu-Au and Ni-Al clusters. Here, however, we will briefly discuss one example of a monometallic cluster (Al) and one of a bimetallic cluster (Cu-Au).

GA Optimization of Aluminium Clusters Al21-Al55 Aluminium occupies a central position, where the crossover from the regime in which electronic factors determine cluster stability to that in which packing and surface energy effects dominate occurs at relatively low nuclearities [93-95]. The mass spectroscopic studies of Martin and co-workers indicate that aluminium clusters with upwards of a few hundred atoms have octahedral shell structures, based on fcc packing [92]. These experimental interpretations have been backed up by theoretical calculations using empirical potentials [96] and Density Functional Theory (DFT) [97]. Ahlrichs and Elliott performed detailed DFT calculations on clusters up to Al15, as well as studying selected geometries for higher nuclearities [97]. They found structures which indicate competition between icosahedral, decahedral and fcc-like cluster structures. A more restricted DFT study, by Rao and Jena, found similar lowest energy geometries for clusters with up to 15 atoms [98].

We have previously reported the use of Random Search and Monte Carlo Simulated Annealing to find the global minima for Al clusters with between 2 and 20 atoms [99], using an MM potential. Here, we extend the study to search for global minima for Al21-Al55, using the same many-body potential, but applying the GA method [100].

The MM potential for Al (see Section 4.3 for details of the MM potential) was derived by Cox, by fitting experimental data (phonon frequencies, elastic constants, vacancy energy etc.) for solid (fcc) aluminium, and has previously been used in a study of the bulk and surface melting of aluminium [101]. The parameters defining the potential are listed in Table 5.

The potential has the 3-body damping function:

(17)

After confirming that our GA was suitable for finding the GM for aluminium clusters, by performing a detailed study of Al19 and Al38, the GA program was used to find the global minima for Al21-Al55, using the GA parameters: Nclus = 10-30, Nmat = 0.8 x Nclus, Pmut = 0.1, Ngen = 20-60 [102].
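The overall flow of such a run can be summarized in a schematic Lamarckian GA in Python. The sketch below uses a Lennard-Jones energy as a stand-in for the MM potential, a simple one-point coordinate crossover in place of the full rotation/cut-and-splice mating operator, and L-BFGS-B local relaxation (cf. the minimizer of ref. [61]); the population sizes and rates mirror the parameters quoted above, but everything else is illustrative rather than a transcription of the actual program.

```python
import random
import numpy as np
from scipy.optimize import minimize  # L-BFGS-B local minimizer

def lj_energy(flat):
    """Lennard-Jones energy as a placeholder for the MM potential."""
    xyz = flat.reshape(-1, 3)
    d = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)
    iu = np.triu_indices(len(xyz), 1)
    r6 = (1.0 / d[iu]) ** 6
    return float(np.sum(4.0 * (r6 * r6 - r6)))

def relax(flat):
    """Lamarckian step: relax a cluster to its nearest local minimum."""
    res = minimize(lj_energy, flat, method="L-BFGS-B")
    return res.x, res.fun

def run_ga(natoms=13, nclus=20, ngen=40, pmut=0.1, box=2.5):
    pop = [relax(np.random.uniform(-box, box, 3 * natoms)) for _ in range(nclus)]
    nmat = int(0.8 * nclus)                      # offspring per generation
    for gen in range(ngen):
        pop.sort(key=lambda m: m[1])             # fitness = (low) energy
        for _ in range(nmat):
            p1, p2 = random.sample(pop[: nclus // 2], 2)   # fitter half mates
            cut = 3 * random.randrange(1, natoms)          # one-point crossover
            child = np.concatenate([p1[0][:cut], p2[0][cut:]])
            if random.random() < pmut:                     # mutation: kick one atom
                i = 3 * random.randrange(natoms)
                child[i:i + 3] += np.random.normal(0.0, 0.3, 3)
            pop.append(relax(child))
        pop.sort(key=lambda m: m[1])
        pop = pop[:nclus]                                  # elitist truncation
        print(f"gen {gen:3d}: Vmin = {pop[0][1]:9.4f}")
    return pop[0]

if __name__ == "__main__":
    coords, energy = run_ga()
    print("GM estimate:", energy)
```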

With the MM potential, the structures predicted for the GM of Al clusters are strongly size-dependent. Thus, a number of clusters (e.g. Al24, Al26, Al27 and Al33) have structures derived from hexagonal close packing (hcp).


Table 5. Parameters defining the MM potential for Al [101]

Parameter   Value
a2          7.000
a3          8.000
D / eV      0.907
re / Å      2.757
c0          0.253
c1          -0.467
c2          4.490
c3          -1.172
c4          1.650
c5          -5.3580
c6          1.633

Other structure-types encountered include: face centred cubic (fcc, e.g. Al37, Al38 and Al41); icosahedral (e.g. Al51-Al55); and decahedral (e.g. Al19). Some of the other clusters have structures which are intermediate between these regular packing types and others are amorphous.

In order to demonstrate how the GA leads to successive improvement of the "best" member of the population, Fig. 13 shows an EPP for Al38, in which each new lowest-energy structure (with Vclus = Vmin) is drawn and labelled according to the generation when it was first found. The Figure shows that there is a sharp drop in Vmax after the first cycle of the GA, which is also reflected in a significant decrease in Vave. The structure labelled 34 (i.e. the lowest energy member of the population at generation 34) is the fcc-like truncated octahedron, which is the GM for Al38, as found by a number of previous semi-empirical and DFT calculations [96,97]. The GA initially finds low energy structures based on icosahedral or decahedral packing. From the 4th generation until the 34th, the best member of the population is a structure (labelled 4 in Fig. 13) which can be regarded as a distorted version of the truncated octahedral GM.

Fig. 13. EPP for Al38, showing intermediate "best" members of the population and the GM, first found in generation 34

GA Optimization of Cu-Au Nano-alloy Clusters Bimetallic "nano-alloy" clusters are of applied interest as regards catalysis and materials applications. They are also of fundamental interest, because their chemical and physical properties may be tuned by varying the composition and atomic ordering, as well as the size of the clusters. There have been a number of MD studies of alloy clusters (e.g. Ni-Al and Cu-Au) using many-body potentials [103-107]. Here, we will discuss the application of the cluster GA to study Cu-Au clusters modelled by the Gupta potential [108].

The Gupta potential [109] was derived from Gupta's expression for the cohesive energy of a bulk material [110] and is the sum of repulsive pair ($V^r$) and attractive many-body ($V^m$) terms:

$$V_{clus} = \sum_{i=1}^{N} \left\{ V^r(i) - V^m(i) \right\} \qquad (18)$$

where

$$V^r(i) = \sum_{j \neq i}^{N} A \exp\left[-p\left(\frac{r_{ij}}{r_0} - 1\right)\right] \qquad (19)$$

and

$$V^m(i) = \left[\sum_{j \neq i}^{N} \zeta^2 \exp\left(-2q\left(\frac{r_{ij}}{r_0} - 1\right)\right)\right]^{1/2} \qquad (20)$$


In the Gupta potential, rij is the distance between atoms i and j in the cluster and the parameters A, r0, ζ, p and q are fitted to experimental values of the cohesive energy, lattice parameters and independent elastic constants for the crystal structure at 0 K. For AuxCuy alloy clusters, the parameters take different values for each of the different types (Cu-Cu, Au-Cu and Au-Au) of interaction. The Gupta parameters used in this study are listed in Table 6 [109].

Table 6. Parameters defining the Gupta potential for Au-Cu clusters [109]

Parameter   Cu-Cu    Au-Cu    Au-Au
A / eV      0.0855   0.1539   0.2061
p           10.960   11.050   10.229
r0 / Å      2.556    2.556    2.884
ζ / eV      1.2240   1.5605   1.7900
q           2.2780   3.0475   4.0360
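A direct transcription of Eqs. (18)-(20) and Table 6 into Python might look as follows; an element-pair lookup is the only addition needed to handle bimetallic clusters, and the dimer test geometry at the end is arbitrary.

```python
import math

# Table 6 parameters (Cleri and Rosato [109]); keys are unordered element pairs.
GUPTA = {
    frozenset(["Cu"]):       dict(A=0.0855, p=10.960, r0=2.556, zeta=1.2240, q=2.2780),
    frozenset(["Au", "Cu"]): dict(A=0.1539, p=11.050, r0=2.556, zeta=1.5605, q=3.0475),
    frozenset(["Au"]):       dict(A=0.2061, p=10.229, r0=2.884, zeta=1.7900, q=4.0360),
}

def gupta_energy(coords, elements):
    """Eqs. (18)-(20): repulsive pair term minus the square-rooted
    many-body attraction, summed over all atoms."""
    n = len(coords)
    energy = 0.0
    for i in range(n):
        rep, att = 0.0, 0.0
        for j in range(n):
            if i == j:
                continue
            par = GUPTA[frozenset([elements[i], elements[j]])]
            x = math.dist(coords[i], coords[j]) / par["r0"] - 1.0
            rep += par["A"] * math.exp(-par["p"] * x)          # Eq. (19)
            att += par["zeta"] ** 2 * math.exp(-2.0 * par["q"] * x)  # Eq. (20)
        energy += rep - math.sqrt(att)
    return energy

dimer = [(0.0, 0.0, 0.0), (2.6, 0.0, 0.0)]
print(f"V(Cu-Au dimer) = {gupta_energy(dimer, ['Cu', 'Au']):.3f} eV")
```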

We have made a study of stoichiometric clusters with the compositions of the common bulk Au-Cu alloy phases: (AuCu3)N; (AuCu)N; and (Au3Cu)N, and have compared them with pure copper and gold clusters, also modelled by Gupta potentials [108]. Standard GA parameters and operators were used, except for the introduction of atom permutation mutation for the alloy clusters - which was found to greatly improve the reproducibility of the results and the likelihood of finding the GM. Jellinek has introduced the term "homotops" to describe AaBb alloy cluster isomers, for fixed number of atoms (N) and composition (a/b ratio), which have the same geometrical arrangement of atoms, but differ in the way in which the A and B-type atoms are arranged [103]. As the number of homotops rises combinatorially with cluster size, global optimization (in terms of both geometrical isomers and homotops) is an extremely difficult task. (Hartke has pointed out that full global optimization of molecular clusters (e.g. (H2O)N) is likewise complicated by the requirement to find the correct orientations, as well as positions, of the molecules in the cluster [21].)
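The combinatorial growth is easy to quantify: for a fixed geometry, an N-atom cluster containing nAu gold atoms has C(N, nAu) homotops. The snippet below prints the counts for the 40-atom compositions studied here, and sketches the atom-permutation (swap) mutation described above; the representation of a cluster as a list of element labels is an assumption of the sketch.

```python
import random
from math import comb

# Homotop counts for one fixed 40-atom geometry: choose the Au sites.
for n_au in (10, 20, 30):
    print(f"Cu{40 - n_au}Au{n_au}: {comb(40, n_au):,} homotops")

def permute_mutation(elements):
    """Atom-permutation mutation: swap one Cu/Au pair, preserving both
    the geometry and the overall composition."""
    i = random.choice([k for k, e in enumerate(elements) if e == "Cu"])
    j = random.choice([k for k, e in enumerate(elements) if e == "Au"])
    elements[i], elements[j] = elements[j], elements[i]
    return elements

print(permute_mutation(["Cu"] * 3 + ["Au"] * 2))
```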

Using our GA, pure copper clusters were found to adopt regular, symmetric structures based on icosahedral packing, while gold clusters have a greater tendency towards amorphous structures, as found previously by Garzón et al. [55]. In many cases (e.g. for 14, 16 and 55 atoms), the replacement of a single Au atom by Cu was found to change the GM


structure to that of the pure Cu cluster, which has also been observed by López et al. [107].

The "global minima" found by the GA for 40-atom clusters of varying composition - CU40j (AUCU3ho ( = CU30AulO)j (AuCuho ( = CU20Au20) j (Au3Cuho ( = CUlOAu30)j and AU40 -are shown in Fig. 14. While we are confident that the GM are correct for the pure Au and Cu clusters, it is quite possible that the true GM for the alloy clusters is an alternative homotop. We are currently working on increasing the reliabilty of the GA for the permutational optimization of homotops .

Fig. 14. Structures of 40 atom Cu, Au and Cu-Au clusters (dark atoms = Cu, light atoms = Au). Top row (left to right): Cu40; Cu30Au10; Cu20Au20. Bottom row: Cu10Au30; Au40.

It is apparent from Fig. 14 that the clusters Cu40, Cu30Au10 and Cu20Au20 have the same decahedral geometry. In the decahedral alloy clusters, the Au atoms generally lie on the surface, while the Cu atoms are encapsulated. Although the Au40 cluster has a low symmetry, amorphous structure, the gold-rich Cu10Au30 cluster has a structure which is more symmetrical than the Au40 GM, but it is not decahedral. The Cu10Au30 GM has a flattened (oblate) topology, in which all of the Au atoms, except one, lie on the surface of the cluster and 7 of the Cu atoms occupy interior sites. The observed atomic segregation in the alloy clusters may be driven by lowering the surface energy and/or relieving the internal strain of the cluster, and further studies are underway to rationalise these observations. Similar tendencies have been found for larger Au-Cu clusters [111] and for Ni-Al clusters [112].

5 New Techniques

5.1 The Predator Operator

GAs have been very successful in determining global minima, but in a number of physical applications, structures corresponding to higher local minima may be more important than the GM. For example, carbon cluster ions formed in laser-ablation experiments [113] are observed in several different geometries, distinguished by their relative mobilities. In some experiments, kinetically favoured, higher energy isomers may be formed, and the distribution of, and interconversion between, isomers is also of great interest [7]. Finally, it is worth noting that the biologically active forms of proteins do not always correspond to global minima [114].

As in the original idea of Genetic Algorithms [8], many of the subsequent developments have been inspired by natural evolution. For example, self-optimizing [9] (or self-adapting) GAs are motivated by the fact that there are genes in nature which can alter mutation rates [115].

In recent work by Manby et al. [116], the analogy with natural evolution has been taken one step further by considering the use of "predators" to remove unwanted (although otherwise potentially optimal) individuals or traits from the population. Sometimes unwanted members of a population can be removed by imposing a constraint on the fitness function; however, in seeking minima other than the GM, a suitable modification of the fitness function is not always possible.

In principle, predation can be carried out using any property of the cluster, for example a shape selective predator could be used to remove individuals from the population (with a certain probability) if they show (or fail to show) certain topological features, such as sphericity, ring size or adjacent/non-adjacent pentagons (in the case of fullerene clusters).

The simplest predator is the energy predator, in which clusters with energies at or below a certain value are removed from the population with a probability of 1. The energy predator can thus be used to search for low energy minima other than the global minimum, or to enhance the efficiency of the GA by removing specific low energy non-global minima that the GA may be incorrectly converging towards.

The energy at the local minimum is purely a function of the interatomic distances in a cluster and is invariant to the exchange of atom labels (for single element clusters). Relaxation of clusters to the nearest local minimum (as in the Lamarckian cluster optimization GA) enables one to use the total energies of the minimized clusters to distinguish them (except for enantiomers). (If the clusters were not relaxed to the nearest local minimum on the potential energy surface then it would be possible to have clusters with almost identical energies but different geometries.)

In the energy predator for finding low energy minima (other than the GM), a cluster is "killed" by the predator if its potential energy is less than Vtarg + δV, where Vtarg is the target energy and δV = 1 x 10^-6 eV, for example. The GA is first run without the predator to find the GM cluster and its energy. Then the predator is invoked to remove the GM, so that the GA finds the next lowest energy isomer (the first metastable structure). The energy of the first excited state is then used as the target energy, which ensures that both the GM and the first metastable isomer are predated. This cycle is continued until the required number of low-lying isomers have been found.
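In pseudocode form, the energy predator is no more than a filter applied to the population. The sketch below assumes a population of (coordinates, energy) pairs and a list of the energies of the isomers found so far; the run_ga driver named in the comments is hypothetical.

```python
def apply_energy_predator(population, target_energies, dV=1.0e-6):
    """Kill (remove) any cluster whose energy lies at or below the highest
    target energy plus dV. Population entries are (coords, energy) pairs."""
    if not target_energies:          # no predator targets yet: first GA run
        return population
    v_targ = max(target_energies)    # energy of the last isomer found
    return [(c, v) for (c, v) in population if v > v_targ + dV]

# Iterative use, as described in the text (run_ga is a hypothetical driver
# that applies the predator filter to the population each generation):
# targets = []
# for _ in range(6):                 # the GM plus five metastable isomers
#     coords, v = run_ga(predator_targets=targets)
#     targets.append(v)
```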

Use of the Energy Predator to Find the Lowest Energy Isomers of C40 The energy predator operator has been used to find the six lowest energy isomers of C40. The results listed in Table 7 are the lowest energy clusters found when clusters with an energy equal to or below that of the previous lowest energy isomer are removed from the population. The energies given are the lowest energies found over five runs of the GA from different initial populations. Isomer C40(1) is the lowest energy cluster (the GM) found for C40. The geometries of the six most stable isomers of C40 are shown in Fig. 15.

Table 7. Potential energies (Vclus) of the six lowest energy isomers of C40 obtained using the GA Predator

Isomer    Point Group    Vclus / eV
C40(1)    C2             -287.92
C40(2)    Cs             -287.90
C40(3)    Td             -287.88
C40(4)    C3             -287.85
C40(5)    D2             -287.84
C40(6)    Cs             -287.78


Albertazzi et al. [117] have carried out detailed studies of the isomers of C40 using molecular mechanics, tight binding and ab initio methods. The aim of their work was to confirm that the minimization of pentagon adjacency is a major factor in determining the relative stability of fullerenes. The cluster isomer of C40 with the smallest pentagon adjacency count is the D2 cage, corresponding to isomer C40(5) found by the GA (for the MM potential). Albertazzi et al. found that the D2 cage was predicted by 11 out of the 12 methods used to be the lowest energy configuration. The only method that disagreed with the consensus on the lowest energy isomer was that using the Tersoff semi-empirical potential. This potential has only 2- and 3-body terms. Albertazzi et al. argued that the Tersoff potential is only able to model steric effects and, as the energetic penalty for pentagon fusion has both electronic and steric origins, the Tersoff model is not well adapted to modelling this system. The two lowest energy structures predicted by the Tersoff model were isomers C40(2) and C40(1).

Fig. 15. Geometries of the six lowest energy isomers of C40 obtained using the GA Predator. Top row (left to right): C40(1); C40(2); C40(3). Bottom row: C40(4); C40(5); C40(6).

The Murrell-Mottram potential that we have used suffers from the same limitations as the Tersoff potential, in that it does not describe the electronic interactions within the clusters, as evidenced by the prediction of the incorrect global minimum fullerene structure for C60 (see earlier discussion). However, the predator GA does find the true global minimum among the top 6 isomers, for the MM potential. This emphasizes the utility of the predator in searching for low energy local minima for a given empirical potential, which may then be checked (recalculated) at a higher level of theory.

5.2 Parallel Genetic Algorithms

The genetic algorithm has an intrinsically parallel nature: the production of each new cluster during a generation is independent of the production of any other new cluster. In this way the GA can study different regions of the potential energy surface in parallel. This parallel nature makes it easy to create a parallel version of the GA code that will utilize many processors simultaneously, thereby spreading the computational effort and reducing the time taken for a run of the algorithm [31]. The availability of bigger, faster (and cheaper!) multi-processor computers, Beowulf clusters etc. means that parallel programming is an important and growing area in modern computational science.

We have compared three parallel GA paradigms [62], which are briefly discussed below.

Master-Slave Algorithm The Master-Slave algorithm is essentially identical in its function to the sequential GA, the only difference being that the work is spread over several processors. The run of the GA is controlled by a single processor (the master), which determines how the work is divided up between the remaining processors (the slaves). The master processor holds the only copy of the population. It selects a pair of parent clusters and sends them to a slave processor, which receives the parents and then performs the rotation, mating, mutation and minimization operations to produce a single offspring cluster. The slave then sends the new offspring back to the master processor. This scheme is repeated across all the slave processors until the required number of offspring have been produced.

Distributed Algorithm This is another parallel model that attempts to mimic the operation of the serial GA, but on a number of processors. A single population is used, with each processor holding a copy. A portion of the generation and energy minimization of the initial population is performed simultaneously on each processor.


Sub-population Algorithm The sub-population algorithm differs from the other two parallel algorithms in that it does not have a single population. Each processor runs what is effectively the serial GA on a population of reduced size. Because of the small size of the sub-populations, sub-population stagnation can be a problem, so it is essential for there to exist a mechanism to exchange individuals between the sub-populations, thereby helping to maintain genetic diversity within each sub-population. The nature or topology of this exchange can be treated in a number of ways. In the "island" model, for example, exchange is allowed between any pair of sub-populations, while in the "ring" or "stepping-stone" model, exchange may only take place between neighbouring populations. In our work, we have considered only the island model.
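A minimal sketch of the island-model exchange step, assuming each sub-population is a list of (coordinates, energy) pairs, is shown below; the number of migrants and the random any-to-any topology are the only design choices it encodes, and both are illustrative.

```python
import random

def migrate(islands, n_exchange=2):
    """Island-model exchange: each sub-population sends copies of randomly
    chosen members to another randomly chosen island (any-to-any topology),
    where they replace the highest-energy (least fit) members."""
    for src in range(len(islands)):
        dst = random.choice([k for k in range(len(islands)) if k != src])
        migrants = random.sample(islands[src], n_exchange)
        islands[dst].sort(key=lambda m: m[1])        # fitness = low energy
        islands[dst][-n_exchange:] = migrants        # overwrite the worst
    return islands
```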

Comparison of Parallel Algorithms The three different parallel algorithms were compared for the test example of C60, bound by the MM potential (see previous discussion). The test job consisted of a single run of each algorithm, with a total population size of 40, producing 32 offspring per generation, running for a total of 20 generations. The algorithms were run on an IBM SP2 machine with 8 nodes. Each node contains 4 Power3/II processors running at 375 MHz. The nodes are connected using the SP high performance switch, which is capable of delivering data at 150 Mb/s to a node. It should be noted that this data bandwidth is divided amongst all 4 processors within the node. The communication between processors in the same node is also carried over the switch.

The speed up (relative to the equivalent serial run) for these calculations is shown in Fig. 16. (Master-Slave 1 and 2 refer to versions of the Master-Slave algorithm which differ in the number of parents sent out to the slave processors during mating.) The sub-population and distributed parallel algorithms are found to afford essentially linear speed-up (with number of processors) and to be faster than either of the master-slave algorithms.

Fig. 16. The speed-up obtained by each of the parallel GAs when run on 2-8 processors for the C60 test case

Subsequent testing has shown that, as well as being faster for a fixed total population size and number of generations, the sub-population algorithm tends to find the global minimum more quickly and more reliably than the other algorithms, provided that the sub-populations are not too small [62]. This work is being continued in order to optimize the parallel cluster GA fully.

5.3 Hybrid Genetic Algorithms

The inclusion of local minimization in the cluster GA (i.e. the Lamarckian approach) can be said to generate a hybrid algorithm, combining the heuristic GA with deterministic minimization. Many other types of hybrid GAs have been studied (for a variety of applications), combining "pure" GAs with other heuristic search algorithms (such as Tabu Search or Simulated Annealing (SA)), with neural networks, and with fuzzy computing methods [31]. By adopting such hybrid methods, it is hoped that the best features of the GA and its partner may be combined, to produce an even better hybrid.

Zacharias et al. have developed a combined SA/GA approach for optimizing the structures of silicon clusters [118]. They report that the SA/GA method outperforms individual SA or GA by an order of magnitude (in terms of the CPU time required for convergence).

5.4 Combining Ab Initio and Empirical Potentials

Hartke has introduced the concept of using an empirical potential to guide an ab initio calculation towards the GM on the ab initio, rather than the empirical, hypersurface, without having to perform a large number of costly ab initio calculations [15,119]. His method involves on-the-fly reparameterization of the empirical potential to a limited number of ab initio calculations, within a GA framework, and has proved successful in the geometry optimization of small silicon [119] and water clusters [17]. With the advent of faster computers, this technique promises to open up the possibility of global optimization of moderately sized clusters at high levels of computational sophistication.
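The essential idea can be caricatured in a few lines: fit an adjustable empirical form to whatever ab initio points are currently available, search on the refitted surface, and repeat. The sketch below refits a Morse pair potential to five invented "ab initio" dimer energies by least squares; it is a schematic of the concept only, not of Hartke's actual scheme, and all numerical values are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def morse(r, d, a, re):
    """Adjustable empirical pair potential (Morse form)."""
    x = np.exp(-a * (r - re))
    return d * (x * x - 2.0 * x)

# Invented 'ab initio' single-point energies for a handful of dimer separations:
r_ab = np.array([1.8, 2.2, 2.6, 3.0, 3.6])
e_ab = np.array([1.13, -0.88, -0.93, -0.65, -0.30])

params, _ = curve_fit(morse, r_ab, e_ab, p0=[1.0, 1.5, 2.5])
print("refitted (D, a, re):", params)
# A GA would now continue its search on the refitted surface, periodically
# adding new ab initio points at promising geometries and repeating the fit.
```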


6 Concluding Remarks and Future Directions

In this Review, we have shown how Genetic Algorithms have been used to study a wide variety of cluster types. The field of cluster geometry optimization has benefitted significantly from the application of GAs, which have generally been shown to be as efficient as (and often more efficient than) "traditional" Simulated Annealing techniques.

The future is likely to bring important developments in a number of areas, such as parallel GAs, hybrid GAs, and the combination of ab initio and empirical calculations, which will enable the study of ever-larger clusters at increasing levels of sophistication. Another important area may be the application of self-optimizing GAs [31] and Genetic Programming [120] (where the functional form, as well as the parameter values, defining a cluster potential energy function could be refined in an evolutionary fashion, in order to fit experimental or ab initio calculated data).

Acknowledgements The authors wish to thank the following people for their collaboration in the work described here: Dr Fred Manby, Dr Lesley Lloyd, Dr Nicholas Wilson, Tom Mortimer-Jones and Sarah Darby; and the following who have also contributed to our research in this area: Dr Jadson Belchior, Mark Bailey, Freddy Fernandes-Guimaraes, Francesco Sebastianelli and Christopher Thorley. RLJ also thanks Dr Bernd Hartke for helpful discussions and for permission to reproduce Fig. 2, and Professor Julius Jellinek for helpful discussions on alloy clusters. Finally, the authors are grateful to Tom Mortimer-Jones for helping in the preparation of the manuscript.

Our research has been supported by funding from: EPSRC (Ph.D. studentship for CR); HEFCE (JREI Grant for computing equipment); The Royal Society; and The University of Birmingham.

References

1. H. Haberland (ed.), Clusters of Atoms and Molecules (Springer-Verlag, Berlin, 1994).
2. R. L. Johnston, Atomic and Molecular Clusters (Taylor and Francis, London, 2002).
3. S. Erkoç, Phys. Rep. 218 (1997) 80.
4. B. Hartke, J. Comput. Chem. 20 (1999) 1752.
5. L. T. Wille, J. Phys. A 18 (1985) L419.
6. R. S. Berry and R. E. Kunz, in Large Clusters of Atoms and Molecules (Ed. T. P. Martin, Kluwer, Dordrecht, 1996), p. 299.
7. J. P. K. Doye and D. J. Wales, J. Chem. Soc., Faraday Trans. 93 (1997) 4233.
8. J. Holland, Adaptation in Natural and Artificial Systems (University of Michigan Press, Ann Arbor, MI, 1975).
9. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).
10. H. M. Cartwright, Applications of Artificial Intelligence in Chemistry (Oxford University Press, Oxford, 1993).
11. D. E. Clark (ed.), Evolutionary Algorithms in Molecular Design (Wiley-VCH, Weinheim, 2000).
12. B. Hartke, J. Phys. Chem. 97 (1993) 9973.
13. Y. Xiao and D. E. Williams, Chem. Phys. Lett. 215 (1993) 17.
14. B. Hartke, Chem. Phys. Lett. 240 (1995) 560.
15. B. Hartke, Chem. Phys. Lett. 258 (1996) 144.
16. S. K. Gregurick, M. H. Alexander, and B. Hartke, J. Chem. Phys. 104 (1996) 2684.
17. B. Hartke, M. Schütz, and H.-J. Werner, Chem. Phys. 239 (1998) 561.
18. B. Hartke, Z. Phys. Chem. 214 (2000) 1251.
19. B. Hartke, A. Charvat, M. Reich and B. Abel, J. Chem. Phys. 116 (2002) 3588.
20. F. Schulz and B. Hartke, ChemPhysChem 3 (2002) 98.
21. B. Hartke, H.-J. Flad and M. Dolg, Phys. Chem. Chem. Phys. 3 (2001) 5121.
22. D. J. Wales, J. P. K. Doye, A. Dullweber, M. P. Hodges, F. Y. Naumkin and F. Calvo, The Cambridge Cluster Database, http://www-wales.ch.cam.ac.uk/CCD.html.
23. Y. Zeiri, Phys. Rev. E 51 (1995) 2769.
24. Y. Zeiri, Comp. Phys. Comm. 103 (1997) 28.
25. Y. Zeiri, J. Phys. Chem. A 102 (1998) 2785.
26. D. M. Deaven and K. M. Ho, Phys. Rev. Lett. 75 (1995) 288.
27. D. M. Deaven, N. Tit, J. R. Morris, and K. M. Ho, Chem. Phys. Lett. 256 (1996) 195.
28. J. P. K. Doye and D. J. Wales, J. Phys. Chem. A 101 (1997) 5111.
29. Z. Li and H. A. Scheraga, J. Mol. Struct. (THEOCHEM) 179 (1988) 333.
30. D. J. Wales and H. A. Scheraga, Science 285 (1999) 1368.
31. A. Tuson and D. E. Clark, in Evolutionary Algorithms in Molecular Design (Ed. D. E. Clark, Wiley-VCH, Weinheim, 2000), p. 241.
32. K.-M. Ho, A. A. Shvartsburg, B. C. Pan, Z. Y. Lu, C. Z. Wang, J. Wacker, J. L. Fye and M. F. Jarrold, Nature 392 (1998) 582.
33. J. Mestres and G. E. Scuseria, J. Comput. Chem. 16 (1995) 729.
34. D. Bokal, MATCH (Comm. Math. Comp. Chem.) 38 (1998) 99.
35. J. A. Niesse and H. R. Mayne, Chem. Phys. Lett. 261 (1996) 576.
36. J. A. Niesse and H. R. Mayne, J. Chem. Phys. 105 (1996) 4700.
37. J. A. Niesse and H. R. Mayne, J. Comput. Chem. 18 (1997) 1233.
38. R. P. White, J. A. Niesse, and H. R. Mayne, J. Chem. Phys. 108 (1998) 2208.
39. A. B. Tutein and H. R. Mayne, J. Chem. Phys. 108 (1998) 308.
40. M. Iwamatsu, J. Chem. Phys. 112 (2000) 10976.
41. W. J. Pullan, Comput. Phys. Commun. 107 (1997) 137.
42. W. J. Pullan, J. Chem. Inf. Comput. Sci. 37 (1997) 1189.
43. W. J. Pullan, J. Comput. Chem. 18 (1997) 1096.
44. S. Hobday and R. Smith, J. Chem. Soc., Faraday Trans. 93 (1997) 3919.
45. E. Curotto, A. Matro, D. L. Freeman, and J. D. Doll, J. Chem. Phys. 108 (1998) 729.
46. A. Tomasulo and M. V. Ramakrishna, Z. Phys. D 40 (1997) 483.
47. M. D. Wolf and U. Landman, J. Phys. Chem. A 102 (1998) 6129.
48. P. Chaudhury and S. P. Bhattacharyya, Chem. Phys. 241 (1999) 313.
49. P. Chaudhury, S. P. Bhattacharyya, and W. Quapp, Chem. Phys. 253 (2000) 295.
50. Y. Luo and J. Zhao, Phys. Rev. B 59 (1999) 14903.
51. D. Romero, C. Barrón, and S. Gómez, Comp. Phys. Comm. 123 (1999) 87.
52. B. Hartke, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-2001 (Eds. L. Spector, E. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M. Garzon and E. Burke, Morgan Kaufmann, San Francisco, 2001), p. 1284.
53. B. Hartke, in Nonconvex Optimization and its Applications (Ed. J. D. Pinter, Kluwer, Dordrecht, in press).
54. K. Michaelian, Chem. Phys. Lett. 293 (1998) 202.
55. I. L. Garzón, K. Michaelian, M. R. Beltrán, A. Posada-Amarillas, P. Ordejón, E. Artacho, D. Sánchez-Portal, and J. M. Soler, Phys. Rev. Lett. 81 (1998) 1600.
56. K. Michaelian, Am. J. Phys. 66 (1998) 231.
57. I. L. Garzón, K. Michaelian, M. R. Beltrán, A. Posada-Amarillas, P. Ordejón, E. Artacho, D. Sánchez-Portal, and J. M. Soler, Eur. Phys. J. D 9 (1999) 211.
58. K. Michaelian, N. Rendón, and I. L. Garzón, Phys. Rev. B 60 (1999) 2000.
59. D. E. Clark, Evolutionary algorithms in computer-aided molecular design, URL http://panizzi.shef.ac.uk/cisrg/links/ea_bib.html.
60. R. L. Johnston and C. Roberts, Cluster Geometry Optimization Genetic Algorithm Program, University of Birmingham (1999).
61. R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, SIAM J. Scientific Computing 16 (1995) 1190.
62. C. Roberts, Genetic Algorithms for Cluster Optimization, Ph.D. Thesis (University of Birmingham, 2001).
63. R. L. Johnston and N. T. Wilson, Birmingham Cluster Web, URL http://www.tc.bham.ac.uk/bcweb (1999).
64. C. Roberts, R. L. Johnston, and N. T. Wilson, Theor. Chem. Acc. 104 (2000) 123.
65. P. M. Morse, Phys. Rev. 34 (1929) 57.
66. K. D. M. Harris, R. L. Johnston, and B. M. Kariuki, Acta Cryst. A 54 (1998) 632.
67. G. V. Lewis and C. R. A. Catlow, J. Phys. C: Solid State Phys. 18 (1985) 1149.
68. C. Roberts and R. L. Johnston, Phys. Chem. Chem. Phys. 3 (2001) 5024.
69. M. Wilson, J. Phys. Chem. B 101 (1997) 4917.
70. E. de la Puente, A. Aguado, A. Ayuela, and J. M. López, Phys. Rev. B 56 (1997) 7607.
71. M.-J. Malliavin and C. Coudray, J. Chem. Phys. 106 (1997) 2323.
72. J. M. Recio, R. Pandey, A. Ayuela, and A. B. Kunz, J. Chem. Phys. 98 (1993) 4783.
73. P. J. Ziemann and A. W. Castleman Jr., J. Chem. Phys. 94 (1991) 718.
74. R. L. Whetten, Acc. Chem. Res. 26 (1993) 49.
75. A. Aguado, F. López-Gejo, and J. M. López, J. Chem. Phys. 110 (1999) 4788.
76. C. Coudray, G. Blaise, and M. J. Malliavin, Eur. Phys. J. D 11 (2000) 127.
77. A. Aguado and J. M. López, J. Phys. Chem. B 104 (2000) 8398.
78. H. W. Kroto, J. R. Heath, S. C. O'Brien, R. F. Curl, and R. E. Smalley, Nature 316 (1985) 162.
79. J. N. Murrell and H. E. Mottram, Mol. Phys. 69 (1990) 571.
80. H. Cox, R. L. Johnston and J. N. Murrell, J. Solid State Chem. 145 (1999) 517.
81. B. R. Eggen, R. L. Johnston and J. N. Murrell, J. Chem. Soc., Faraday Trans. 90 (1994) 3029.
82. V. Parasuk and J. Almlöf, Theor. Chim. Acta 83 (1992) 227.
83. M. Feyereisen, M. Gutowski, J. Simons, and J. Almlöf, J. Chem. Phys. 96 (1992) 2926.
84. J. M. L. Martin and P. R. Taylor, J. Phys. Chem. 100 (1996) 6047.
85. P. W. Fowler and D. E. Manolopoulos, An Atlas of Fullerenes (Oxford University Press, Oxford, 1995).
86. H. Prinzbach, A. Weiler, P. Landenberger, J. Wörth, F. Wahl, L. T. Scott, M. Gelmont, D. Olevano, and B. von Issendorff, Nature 407 (2000) 60.
87. H. W. Kroto, Nature 329 (1987) 529.
88. T. G. Schmalz, W. A. Seitz, D. J. Klein, and G. E. Hite, J. Am. Chem. Soc. 110 (1988) 1113.
89. P. W. Fowler, Contemp. Phys. 37 (1996) 235.
90. MOLPRO is a package of ab initio programs written by H.-J. Werner and P. J. Knowles, with contributions from R. D. Amos, A. Berning, D. L. Cooper, M. J. O. Deegan, A. J. Dobbyn, F. Eckert, C. Hampel, T. Leininger, R. Lindh, A. W. Lloyd, S. J. McNicholas, W. Meyer, M. E. Mura, A. Nicklass, P. Palmieri, K. Peterson, R. Pitzer, P. Pulay, G. Rauhut, M. Schütz, H. Stoll, A. J. Stone, and T. Thorsteinsson.
91. W. D. Knight, K. Clemenger, W. A. de Heer, W. A. Saunders, M. Y. Chou, and M. L. Cohen, Phys. Rev. Lett. 52 (1984) 2141.
92. T. P. Martin, Phys. Rep. 273 (1996) 199.
93. J. Lermé, M. Pellarin, B. Baguenard, C. Bordas, E. Cottancin, J. L. Vialle, and M. Broyer, in Large Clusters of Atoms and Molecules (Ed. T. P. Martin, Kluwer, Dordrecht, 1996), p. 71.
94. K. E. Schriver, J. L. Persson, E. C. Honea, and R. L. Whetten, Phys. Rev. Lett. 64 (1990) 2539.
95. P. Milani, W. A. de Heer, and A. Chatelain, Z. Phys. D 19 (1991) 133.
96. G. T. Turner, R. L. Johnston and N. T. Wilson, J. Chem. Phys. 112 (2000) 4773.
97. R. Ahlrichs and S. D. Elliott, Phys. Chem. Chem. Phys. 1 (1999) 13.
98. B. K. Rao and P. Jena, J. Chem. Phys. 111 (1999) 1890.
99. L. D. Lloyd and R. L. Johnston, Chem. Phys. 236 (1998) 107.
100. L. D. Lloyd, R. L. Johnston, C. Roberts and T. V. Mortimer-Jones, ChemPhysChem 3 (2002) 408.
101. H. Cox, R. L. Johnston, and J. N. Murrell, Surf. Sci. 373 (1997) 67.
102. L. D. Lloyd, R. L. Johnston, C. Roberts and N. T. Wilson (manuscript in preparation, 2001).
103. J. Jellinek and E. B. Krissinel, in Theory of Atomic and Molecular Clusters (Ed. J. Jellinek, Springer, Berlin, 1999).
104. J. Jellinek and E. B. Krissinel, Chem. Phys. Lett. 258 (1996) 283.
105. E. B. Krissinel and J. Jellinek, Chem. Phys. Lett. 272 (1997) 301.
106. E. B. Krissinel and J. Jellinek, Int. J. Quant. Chem. 62 (1997) 185.
107. M. J. López, P. A. Marcos and J. A. Alonso, J. Chem. Phys. 104 (1996) 1056.
108. S. Darby, T. V. Mortimer-Jones, R. L. Johnston and C. Roberts, J. Chem. Phys. 116 (2002) 1536.
109. F. Cleri and V. Rosato, Phys. Rev. B 48 (1993) 22.
110. R. P. Gupta, Phys. Rev. B 23 (1981) 6265.
111. N. T. Wilson and R. L. Johnston, J. Mater. Chem. (in press).
112. M. S. Bailey, N. T. Wilson, C. Roberts and R. L. Johnston, Eur. Phys. J. D (submitted).
113. G. von Helden, M.-T. Hsu, N. Gotts and M. T. Bowers, J. Phys. Chem. 97 (1993) 8182.
114. A. R. Leach, Molecular Modelling: Principles and Applications (Addison-Wesley-Longman, Harlow, 1996).
115. A. Aafif and J. Lin, Phys. Rev. E 57 (1998) 2471.
116. F. R. Manby, R. L. Johnston, and C. Roberts, MATCH (Comm. Math. Comp. Chem.) 38 (1998) 111.
117. E. Albertazzi, C. Domene, P. W. Fowler, T. Heine, G. Seifert, C. van Alsenoy, and F. Zerbetto, Phys. Chem. Chem. Phys. 1 (1999) 2914.
118. C. R. Zacharias, M. R. Lemes and A. Dal Pino, J. Mol. Struct. (THEOCHEM) 430 (1998) 29.
119. B. Hartke, Theor. Chem. Acc. 99 (1998) 241.
120. J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, MA, 1992).


Real-time Monitoring of Environmental Pollutants in the Workplace Using Neural Networks and FTIR Spectroscopy

Hugh M. Cartwright* and Andrew Porter

Physical and Theoretical Chemistry Laboratory, Oxford University, South Parks Road, Oxford OX1 3QZ, England

Summary: The monitoring of pollution in the environment is of increasing importance. Atmospheric pollutants can generally be detected using Fourier Transform Infrared Spectroscopy (FTIR), but automatic identification of these pollutants from their spectrum is not trivial. It is often desirable not only to identify pollutants but also to estimate their concentration, so that it is possible to determine whether the level in the environment exceeds safe working limits. This chapter describes the use of Neural Networks, combined with Genetic Algorithm optimization, to develop software tools that are capable of identifying atmospheric pollutants, determining whether their concentration exceeds a defined threshold, and, if required, determining that concentration.

Key words: Infrared spectrum, pollutant, Neural Network, Artificial Intelligence, Genetic Algorithm, health, Fourier Transform Infrared.

1 Introduction

Regulatory pressure to ensure that the workplace is safe from toxic chemicals is considerable, and continues to grow. A safe working environment requires not only strong and effective legislation, but also suitable methods of monitoring for harmful chemicals. It is evident that this monitoring must be both continuous and capable of delivering results rapidly, in order that any rise in the level of pollutants can be detected before it constitutes a risk to health.

* Author to whom correspondence should be addressed: [email protected]


The load of toxic chemicals in the environment is already very great, and it is inevitable that it will increase further. Reliable detection of gas-phase pollutants is an essential requirement for safe working; it is the manner in which Artificial Intelligence methods may be used to help in this task which is the subject of this chapter.

2 FTIR in the Detection of Pollutants

Differences between the laboratory and the workplace environment are numerous and marked. In the laboratory, experimental conditions can be made artificially simple. Indeed, the reduction or elimination of outside interference in laboratory experiments as an aid to the interpretation of data is standard scientific practice. Beyond the laboratory, experimental data may be less simple to gather, more complex, and therefore less straightforward to interpret. In a factory or workshop, the environment can be only loosely controlled, and a wide variety of chemicals may be present in the atmosphere, where, in the worst case, they may represent a significant hazard.

The detection of such chemicals is made possible through a number of analytical techniques, including Fourier Transform Infrared Spectroscopy (FTIR). The IR spectra of molecules contain information of sufficient detail that many important atmospheric pollutants may be identified from their IR spectrum alone. FTIR is fast, even modest instruments being capable of delivering a complete spectrum within seconds, and for the detection of gas-phase pollutants no sample preparation is needed. FTIR spectrometers are sensitive to low concentrations of pollutants, and can detect many environmentally-significant compounds at or below their Threshold Limit Values (TLVs) [1].

The data required for identification and - in favourable cases - quantitation of pollutants can thus be generated by an FTIR instrument in near real time. This speed of data collection represents a clear advantage for FTIR over competitive techniques in the monitoring of the environment in locations such as mines, foundries or garages, where a variety of pollutants may present a hazard to health, and their concentrations may change abruptly over short time periods.

FTIR is superior not just to other forms of spectroscopy in this type of application, but also to dispersive IR spectroscopy. In a dispersive instrument a prism or diffraction grating disperses the light before it is passed through a slit to select a narrow range of wavelengths. The slit unavoidably reduces the intensity of light falling upon the sample and hence the sensitivity of the system. In a Michelson Interferometer (and in other types of interferometer) all wavelengths are monitored at the same time and there is no reduction in intensity, creating what is known as the Throughput (or Jacquinot) advantage [2].


The signal to noise ratio in an FTIR instrument is proportional to the square root of the length of time for which data are collected. By monitoring all wavelengths simultaneously, the time during which data are collected at any one wavelength is significantly greater than in a dispersive spectrometer, with a consequent improvement in the s/n ratio. Multiple scans may be taken rapidly and summed, leading to a further s/n gain.

3 The Limitations of FTIR Spectra

We see, therefore, that FTIR is a promising method for the detection, identification and quantitation of pollutants. However, its practical use is complicated by several factors.

3.1 The Need for a Spectral Search Algorithm

In order to identify and quantify a compound from its IR spectrum, access to a spectral database is required, so that the sample spectrum can be compared to the spectra of possible matches. A suitable search algorithm must be available which can be used for this comparison [3,4]. Several algorithms are available [3-5], and these typically normalize the spectra to allow for differences in concentration between the unknown spectrum and the database spectrum.

In Euclidean search the total area under the curve is normalized; each data point for both the reference spectrum and the unknown spectrum is divided by the square root of the dot product of that spectrum with itself. A Hit Quality Index (HQI) [3] is calculated as the sum of the squares of the difference between each point in a data pair.

In the Least Squares algorithm the absorption at each point is compared to the absorption of the strongest peak. A constant is subtracted from each point, to make the minimum zero, and the spectrum is normalized to unit maximum absorbance. The Hit Quality Index is calculated in the same way as for a Euclidean search.

In effect, both Euclidean and Least Squares searches attempt to match band areas, and function without regard to the spectroscopic significance of a specific band. This has serious implications for the reliability with which different types of pollutants may be detected. A broad band such as an O-H stretch is heavily weighted by band area normalization, whilst a sharp band such as a C-N stretch may have minimal impact. Searching may also be compromised if the baseline slopes or is offset. Since all data points are taken into account in the algorithm, a sloping baseline may be interpreted as a difference between spectra, and the Hit Quality Index will suffer as a result.


These problems are addressed by the First Derivative algorithm, which compares the difference between a pair of points in the sample spectrum and between the same pair of points in the reference spectrum, eliminating the effect of a sloping baseline. However, the algorithm is much less tolerant of shifts in the position of peaks (greater than 2 cm-1) than the previous two algorithms. None of these algorithms is entirely suited to the dual task of identification and the quantitative analysis of pollutants.
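The three search indices can be compared side by side in a short Python sketch; each function follows the normalization described above and returns a Hit Quality Index of zero for a perfect match. Spectra are assumed to be numpy arrays sampled on a common wavenumber grid.

```python
import numpy as np

def hqi_euclidean(sample, reference):
    """Euclidean search: normalize each spectrum by the square root of its
    dot product with itself, then sum squared point-by-point differences."""
    s = sample / np.sqrt(sample @ sample)
    r = reference / np.sqrt(reference @ reference)
    return float(np.sum((s - r) ** 2))

def hqi_least_squares(sample, reference):
    """Least Squares search: offset each spectrum to zero minimum and scale
    to unit maximum absorbance before computing the same index."""
    s = sample - sample.min()
    s = s / s.max()
    r = reference - reference.min()
    r = r / r.max()
    return float(np.sum((s - r) ** 2))

def hqi_first_derivative(sample, reference):
    """First Derivative search: compare successive-point differences,
    which removes the effect of a sloping or offset baseline."""
    return hqi_euclidean(np.diff(sample), np.diff(reference))
```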

3.2 Complications due to Overlapping Bands

The IR spectrum of a sample is generally compared electronically with a background spectrum. Water and carbon dioxide absorb strongly in the IR region and are inevitable contributors to atmospheric spectra, so in the measurement of the spectra of pollutants in the atmosphere it is necessary to correct for absorption by these compounds.

Fig 1. The spectrum of an 'unpolluted' atmosphere (4000-1000 cm-1): atmospheric absorptions due to water and carbon dioxide.

This is not as simple as it may sound. A 'standard' atmospheric background spectrum could be used for this correction, but what is the composition of a 'standard' atmosphere? The atmosphere is not unchanging, and any variation in composition must affect its spectrum. This makes reliable identification of atmospheric pollutants challenging. One could try to interpret the spectrum of a sample by simply disregarding the background, but background peaks are often much larger than those of any pollutant and dominate the spectrum. Furthermore, in the workplace the background may include bands due to a variety of materials which would not be present in a standard atmosphere and which, unless they are explicitly allowed for, may fatally undermine spectral interpretation.


3.3 Beer's Law Limitations

Quantitative analysis of the spectrum presents a further challenge. Search algorithms generally invoke spectral normalisation, which may hamper efforts to derive reliable quantitative data. Beer's law is valid over a wide concentration range in the UV/visible region of the spectrum, but is applicable over only a small range of absorbance in the IR region. This complicates interpretation of spectra and attempts to derive quantitative information.

The use of spectroscopy for any type of quantitative analysis requires some form of calibration. In the IR, spectra are recorded at sample concentrations covering the range of interest and the absorbance, as measured by variation in the area (or height) of a given peak, determined. If the spectra cover a concentration range in which Beer's Law is obeyed, a plot of absorbance versus concentration will be linear and can be used to determine the concentration of the analyte. If the calibration curve is non-linear, which is often the case in IR spectra, then Beer's Law is not obeyed and quantitative analysis is more difficult [6].
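As an illustration of such a calibration, the sketch below fits a straight line to invented absorbance-concentration pairs within the Beer's Law region and inverts it for an unknown sample; all numbers are hypothetical:

```python
import numpy as np

# Invented calibration points within the linear (Beer's Law) region.
conc = np.array([2.0, 4.0, 6.0, 8.0, 10.0])            # concentration / ppm
absorbance = np.array([0.11, 0.21, 0.30, 0.41, 0.50])  # peak area or height

slope, intercept = np.polyfit(conc, absorbance, 1)     # linear calibration fit

unknown_absorbance = 0.35
estimated_conc = (unknown_absorbance - intercept) / slope
print(f"estimated concentration: {estimated_conc:.2f} ppm")
```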

3.4 Problems Caused by Low Absorption

The absolute concentration of pollutants such as carbon monoxide, chlorinated hydrocarbons or nitrogen oxides is typically very low. Consequently, even when these concentrations are at a hazardous level, the IR signature of pollutants may be weak compared to the absorption which can be attributed to other atmospheric compounds. Using a conventional search algorithm, this low pollutant absorption creates a high potential for a false recognition - or no recognition at all.

In summary, the problems which need to be addressed by any algorithm hoping to quantitatively analyze spectra are:

• Loss of quantitative information consequent upon data manipulation prior to identification.

• Difficulties in identification due to shifts in band wavelengths between spectra, low total absorption by compounds of interest and the obscuring of information by overlapping bands.

• Complications caused by the effects of a sloping baseline and deviations from Beer's law.


4 Potential Advantages of Neural Network Analysis of IR Spectra

Options for the simultaneous identification and quantitation of atmospheric pollutants are limited. Monitoring systems are available for recognition of a single compound or class of compound, but methods by which we might determine several compounds simultaneously are restricted by algorithmic rather than by instrumental difficulties.

Neural networks have been used in a number of spectroscopic applications, such as UV spectral recognition [7], infrared spectral interpretation [8,9] and classification [10,11] and the analysis of NMR spectra [12]. This range of applications suggests that a neural network may be effective in identifying gas-phase pollutants and perhaps also in providing quantitative data.

We can identify several possible advantages to the use of a neural network:

• With access to a training database of suitable breadth, it should be possible for a network to assess the spectra of arbitrary mixtures of compounds (from within the database) over a range of concentrations.

• By focussing on key wavelengths for each pollutant, Neural Networks may overcome two fundamental difficulties in the interpretation of IR spectra: ambiguity caused by small changes in peak position due to varying background signals or miscalibration of instruments, and the swamping of pollutant bands by stronger overlapping bands.

• Neural Networks can cope effectively with non-linear data. They should thus be applicable in regions in which Beer's Law is not obeyed.

• A Neural Network has the potential to complete the process of recognition and quantitation at high speed.

5 Application of the Neural Network to IR Spectral Recognition

Neural networks are strong in pattern recognition, and, in a sense, spectral interpretation is exactly that. However, treating the interpretation of an IR spectrum as an exercise in image analysis may not be the most efficient approach. A human expert, when analysing a spectrum, would attempt to identify key bands, rather than trying to match every feature in an entire spectrum. It is logical to design a neural network along similar lines.


To prepare an effective network, two central points need to be addressed: choice of target compounds and choice of input data to be passed to the network.

5.1 Target Compounds

No algorithm in the near future is likely to be able to interpret every IR spectrum, so in assessing pollutant spectra it is unreasonable to set this as a goal. There is an extensive literature on chemicals that pose a danger to health [13,14]. Since the purpose of the network discussed here is to recognise chemicals that are hazardous in the workplace, it will be helpful to concentrate on compounds that are most likely to be of environmental concern. We have used three criteria to select the most significant compounds as targets for recognition:

• Danger to human health.

• Likelihood of their presence in the laboratory or workplace.

• Feasibility of identification via infrared spectrum analysis.

Bearing these criteria in mind, we shall illustrate the development of the method using twelve compounds and two groups of compounds. The concentrations of these compounds that are hazardous to human health are given in Table 1.

Compound                  OT(a)         IDLH(b)       TLV(c)      LOD(d)
Ammonia                   0.043         300           30          N/A
Aniline                   0.5           100           20          0.220
Benzene(e)                4.68          500           10          0.280
Carbon monoxide           odourless     1200          200         0.010
Carbon tetrachloride(e)   >50           200           5           0.005
Chlorobenzene             1.3           1000          75          0.060
Chloroform(e)             206           500           10          0.006
Cyclohexane               780           1300          600         0.010
Methylamine               0.021         100           15          N/A
Phenol(e)                 0.06          250           10          0.540
Pyridine                  0.021         1000          5           0.830
Toluene                   0.017         500           100         0.067
Methane                   2000          N/A           N/A         0.015
Ethane                    900           N/A           N/A         0.015
Propane                   5000-10000    2100          1000        0.015
Butane                    1200-5000     800           N/A         0.015
Methanol                  4.2-6000      6000          200         0.020
Ethanol                   50-115        3300          1000        0.020
Propanol                  0.03-40       800           200         0.020
Butanol                   0.12-5000(f)  1400-2000(f)  50-100(f)   0.020

(a) Odour Threshold (ppm); (b) Immediately Dangerous to Life and Health Value (ppm); (c) Threshold Limit Value (ppm, ceiling values); (d) Limit of Detection (ppm); (e) Suspected human carcinogen; (f) Dependent upon isomer.

Table 1. Target compounds and concentration thresholds.

5.2 Network Inputs

A complete IR spectrum is rich in detail. It might seem desirable to use all this information in trying to elucidate the composition of a sample, but as we suggested earlier, this is not necessarily the case. To develop an efficient recognition system it is wise to sample the spectrum judiciously. If we take input only at those wavelengths that are likely to contain useful information, rather than at every wavelength, areas of the spectrum that contain irrelevant data can be avoided.

It makes sense to be selective in the choice of wavelengths for analysis, but what criteria should be used? For maximum discrimination, the network must focus on key bands in the spectra of the target compounds. This criterion can lead to conflicts. Some atmospheric constituents such as carbon dioxide and water vapour are infrared active, so the strongest features in the IR spectrum of pollutants may be obscured by the background. For example, the most significant spectral feature of benzene is found near 670 cm-1, but this is almost completely obscured by a peak due to atmospheric carbon dioxide [1].

The molar extinction coefficient of the 670 cm-1 peak is around ten times greater than that of the next most significant feature, which lies close to 1038 cm-1, but the latter band is well away from any likely atmospheric interference, and hence is more suitable for use in recognition. Though this band is less intense, it still allows detection of benzene to a limiting value of 0.28 ppm [1], below its TLV.


Carbon dioxide is an unavoidable atmospheric constituent, but fortunately presents a relatively minor problem in terms of overlap with important pollutant bands. The strongest feature in the spectrum of carbon dioxide lies around 2350 cm-1, and few other species display significant absorptions in this region. However, while the presence of carbon dioxide in a sample should not unduly complicate analysis, the spectrum of water displays strong absorption in important regions of the infrared (Fig. 1). The concentration of water in the atmosphere (and hence the strength of its spectrum) is dependent upon the humidity, but even at low humidity the broad O-H stretching bands of water (ν1 symmetric and ν3 antisymmetric) mask all but the strongest spectral bands in the range 3400-3900 cm-1.

The spectral wavelengths chosen to provide input to the network were therefore selected according to the following criteria:

• Significance in the spectrum of a particular compound.

• Absence of overlap with the spectra of carbon dioxide and water.

Several environmentally-significant compounds exhibit only one strong absorbance in the infrared. These features must therefore be used as network inputs. The most notable example in the target group is tetrachloromethane, which shows a single very strong spectral feature around 790-800 cm-1, with an extinction coefficient of 5.1 x 10-3 ppm-1 m-1.

Similarly, hydrocarbons such as butane, pentane, and cyclohexane exhibit (in addition to other weaker bands) a strong C-H stretch around 2930-2950 cm-1. This region of the infrared is therefore very important in the identification of any short chain saturated compound [10].

Since the neural network is required to identify individual compounds, each network input was set to target a specific band in the spectrum of one compound. Sometimes a compound will exhibit an absorption close to that of another compound, making detection relying upon a single input potentially ambiguous. Where possible therefore, two inputs were specified for each target compound, to reduce the possibility of false identifications.

IR bands are not, even in the gas phase, infinitely sharp. Further, as a result of limited spectrometer resolution and variations in calibration, there may be some uncertainty in the position of the absorption maximum in experimental bands. Accordingly, input ranges were specified to be between 5 cm-1 and 10 cm-1 in width, and an average absorbance calculated from points within this range. Broad absorptions may span one hundred wavenumbers or more, leading to the possibility that several inputs to the network might be 'triggered' by a single peak. However, an input targeting a given absorption was defined so that the crest of the peak should fall within that range.
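The window-averaging just described might be sketched as follows; the array names and the two example windows (taken from Table 2) are illustrative only:

```python
import numpy as np

# Two example windows from Table 2; a full system would list all 29.
WINDOWS = [(790, 800),     # input 1: tetrachloromethane, chloroform
           (1035, 1040)]   # input 4: benzene

def network_inputs(wavenumbers, spectrum, windows=WINDOWS):
    """Average the absorbance inside each targeted wavenumber window.
    'wavenumbers' and 'spectrum' are parallel arrays."""
    inputs = []
    for lo, hi in windows:
        mask = (wavenumbers >= lo) & (wavenumbers <= hi)
        inputs.append(spectrum[mask].mean())
    return np.array(inputs)
```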


Extra inputs are necessary for the recognition of alcohols and alkanes, due to the unique nature of the spectra of methanol and methane when compared to other short chain species with more than one carbon atom (Fig. 2).

Fig. 2. The IR spectra of methane (solid line) and ethane (dotted line) (absorbance versus wavenumber/cm-1).

Methane exhibits an absorption around 1305 cm-1, whereas the equivalent peak for ethane is shifted to around 1470 cm-1.

Input   Lower limit (cm-1)   Upper limit (cm-1)   Target compound(s)
1       790                  800                  Tetrachloromethane, Chloroform
2       960                  970                  Ammonia
3       1027                 1032                 Pyridine
4       1035                 1040                 Benzene
5       1053                 1059                 Toluene
6       1067                 1072                 Methylamine
7       1083                 1089                 Chlorobenzene
8       1182                 1187                 Phenol
9       1210                 1220                 Chloroform
10      1270                 1275                 Aniline
11      1304                 1309                 Methane
12      1430                 1443                 Pyridine
13      1448                 1455                 Methanol, Ethanol, Propanol, Butanol
14      1455                 1460                 Cyclohexane
15      1468                 1478                 Butane, Ethane, Propane
16      1620                 1629                 Ammonia, Aniline, Methylamine
17      2113                 2118                 Carbon monoxide
18      2172                 2177                 Carbon monoxide
19      2925                 2930                 Methylamine
20      2932                 2937                 Cyclohexane, Toluene
21      2942                 2947                 Methanol
22      2952                 2957                 Ethane
23      2965                 2970                 Butane, Butanol
24      2971                 2976                 Propane, Ethanol, Propanol
25      3018                 3023                 Methane
26      3040                 3048                 Aniline, Toluene
27      3055                 3063                 Benzene, Phenol
28      3075                 3080                 Pyridine
29      3082                 3087                 Chlorobenzene

Table 2. Network Input Lower and Upper Limits (cm-1) and their Targets.

The input ranges and the compounds they target are shown in Table 2. This series of inputs focuses on important bands in the spectra of all target compounds, whilst avoiding regions where the background atmospheric spectrum is particularly strong, that is:

• Below 800 cm-1 (except for tetrachloromethane).

• Near 1500-1750 cm-1, where the strong, broad vibrational progression of water is located (the input range around 1625 cm-1 occurs in the gap between the two bands).

• Around 2300-2400 cm-1, to avoid a sharp absorption by carbon dioxide.


• Above 3100 cm-1, and especially in the 3600-3900 cm-1 region, again to avoid broad water bands.

5.3 Data Pre-processing

The data for training and testing the network were compiled from several sources, including Infrared Spectral Handbooks [15-18], the Internet [18,19] and spectra gathered in the authors' laboratory. Further data were generated where necessary from known extinction coefficients. These additional synthetic spectra made it possible to train and test the networks more thoroughly.

Each spectrum presented to the network was pre-processed as follows:

(i) The maximum absorbance in each specified range was determined.

(ii) All values were scaled relative to the baseline absorbance.

(iii) Absorbances below a specified threshold were set to zero.
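A sketch of steps (i)-(iii), assuming the per-range maxima have already been extracted; the threshold value shown is our assumption, not taken from the chapter:

```python
import numpy as np

def preprocess(range_maxima, baseline, threshold=0.05):
    """Steps (i)-(iii): 'range_maxima' holds the maximum absorbance found in
    each specified range (step i); values are scaled relative to the baseline
    absorbance (step ii); anything below 'threshold' is set to zero (step iii).
    The threshold value here is an assumption."""
    x = np.asarray(range_maxima, float) / baseline
    x[x < threshold] = 0.0
    return x
```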

The effect of pre-processing upon the spectrum of methylamine is shown in Fig. 3.

Each spectrum used for training or testing was assigned a characteristic output signature. This is the set of target output values used to train the network. Fourteen output neurons were used, to indicate the presence or absence of each of the fourteen pollutants in the database. A value of one delivered through neuron 6, for example, indicated recognition of chlorobenzene by the network, and should ideally be accompanied by an output of zero from all other nodes.

Fig. 3. The IR Spectrum of Methylamine (left); after Transformation (right; horizontal axis: input number).


Output neuron   Target compound
1               Ammonia
2               Aniline
3               Benzene
4               Carbon monoxide
5               Tetrachloromethane
6               Chlorobenzene
7               Chloroform
8               Cyclohexane
9               Methylamine
10              Phenol
11              Pyridine
12              Toluene
13              Alkane
14              Alcohol

Table 3. Allocation of Output Neurons.

5.4 Optimisation

Parameter optimisation is a crucial step in the development of a neural network. Various procedures for optimisation have been proposed, ranging from trial and error [9] to Simplex optimisation [12,20] and the Genetic Algorithm (GA) [21].

Optimisation is not trivial. The initial recognition system described here employed a three layer network, for which eleven variables influence performance; the values of all of these need to be optimised. Even for a two layer network the problem is still five dimensional. To complicate matters further, the network parameters are not independent, and during optimisation they must therefore be treated as an interacting set.


No.   Variable
1     Number of hidden neurons (H)
2     Stiffness of threshold function for hidden layer (α1)
3     Threshold for hidden layer (β1)
4     Step size for training hidden layer (η1)
5     Lower limit for weights to hidden layer (l1)
6     Upper limit for weights to hidden layer (u1)
7     Stiffness of threshold function for output layer (α2)
8     Threshold for output layer (β2)
9     Step size for training output layer (η2)
10    Lower limit for weights to output layer (l2)
11    Upper limit for weights to output layer (u2)

Table 4. Neural Network Variables to be Optimised.

Genetic Algorithms have been found to be valuable in the parametric optimisation of neural networks for several reasons:

• They have a lower tendency than other algorithms to become trapped at false maxima.

• Their method of operation allows them to reach parts of the search space which are inaccessible to, or unlikely to be reached by, conventional algorithms.

• They do not use the variables themselves for optimisation, but instead a representation of those variables, usually in the form of a binary string. This allows the maximum amount of information to be contained within that string [22].

The genetic algorithm has proven to be a powerful optimisation tool for many problems [7,21,23,24], and the methods and benefits intrinsic in its operation make it of particular value for use with a neural network. We have therefore adopted this method in the determination of suitable network parameters.


5.5 Training and Testing

During training, pre-processed spectra were presented to the network in a random order, and the weights adjusted using Back Propagation. Following presentation of each member of the training set, test sets of data were shown, without adjustment of weights, to determine the performance of the network. This process constitutes one training epoch, and was repeated for a pre-determined number of iterations.

The success of a training session is measured in terms of its fitness. This function measures the accuracy of the network's predictions for the test data. The function used varied between 0 (an unsuccessful, or ambiguous prediction) and 1 (a successful prediction) for each spectrum:

$$C_1 = \sum_{n=1}^{g} \left(1 - \mathrm{output}_{1,n}\right)^2 \qquad (1)$$

$$C_0 = \sum_{m=1}^{p} \left(\mathrm{output}_{0,m}\right)^2 \qquad (2)$$

$$T = \sum_{n=1}^{g} 1 \quad (\text{if } g = 0,\; T = 1) \qquad (3)$$

$$(\text{if } F_i < 0 \text{ then } F_i = 0) \qquad (4)$$

In these equations C_1 is the contribution to the fitness of neurons with a target output of 1, and g the number of neurons with a target of 1. C_0 is the contribution of those neurons with target 0 (p being the number of such neurons), and F_i the fitness of the i-th test data set. The fitness for the complete test set is then the sum of the fitnesses of the individual members, scaled to lie between 0 and 100 (s is the number of members of the test set):

$$\mathrm{Fitness} = \frac{100}{s} \sum_{i=1}^{s} F_i \qquad (5)$$

As with all neural networks the length of the training period is important. Too short a period of training, and the maximum amount of useful information will not be extracted from the training set. Too long, and the network will begin to recognise individual spectra in the training set, and start to learn specifics rather than generalities.

Responsibility for this "overtraining" lies with the network operator rather than the network itself, which is merely doing its job too well [25-28]. Overtraining is related to two factors:


• A lack of diversity in the training database, which is thus an inadequate representation of the experimental data with which the network will eventually be presented. This difference can be addressed by expanding the database.

• Too many hidden neurons, which allows the neural network to learn data at too fine-grained a level (and therefore to learn the characteristics of individual spectra, rather than more general rules). This problem can be tackled by reducing the number of hidden neurons, and in this way restricting the power of the network to learn detail from the training set.

6 Spectral Interpretation Using the Neural Network

The quality of spectral recognition by the network is measured by the deviation of the overall network output from the target signature and may be quantified in terms of the Root Mean Square (RMS) error:

$$\mathrm{RMS\ Error} = \sqrt{\frac{\sum_{j=1}^{n} \left(T_j - O_j\right)^2}{n}} \qquad (6)$$

In equation (6) n is the number of output neurons, and T_j and O_j the target and output values. The RMS error is a useful quantitative assessment of the performance of a network.
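Equation (6) translates directly into code; the 14-neuron example target below corresponds to chlorobenzene (output neuron 6 in Table 3):

```python
import numpy as np

def rms_error(targets, outputs):
    """Equation (6): RMS deviation of the network output from the target
    signature, over the n output neurons."""
    t = np.asarray(targets, float)
    o = np.asarray(outputs, float)
    return np.sqrt(np.mean((t - o) ** 2))

# Example: a 14-neuron target signature for chlorobenzene (neuron 6 active).
target = np.zeros(14)
target[5] = 1.0
```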

For a network with 14 output neurons, an RMS error of 0.2 or above suggests an unsuccessful prediction (the identity of the target compound is predicted incorrectly), whilst a value below 0.05 denotes an accurate identification. A distinction can be made between output neurons with a target value of 1 (t1 neurons) and those with a target value of 0 (t0 neurons). Division of the total RMS error into the error at the t1 neuron(s) and that at the t0 neurons gives a more useful representation of where a particular test spectrum succeeds or fails.

Network performance may also be measured qualitatively. The output of the network (with respect to all output neurons) is termed correct if the result is a conclusive correct output: t1 neurons outputting 0.8 or greater, and t0 outputs negligibly different from 0. An inconclusive output occurs when only the correct t1 output(s) respond, but yield values below conclusive levels (typically around 0.5). A result specified as ambiguous implies that both t0 and t1 neurons are triggered (i.e. yield output values greater than 0.8), or that they return significant values below conclusive levels. In an incorrect result only wrong output(s) are triggered.


7 Factors Influencing Network Performance

7.1 Parameter Optimisation

Network performance is, as has been suggested earlier, strongly dependent upon the values of network parameters. The importance of optimisation can be demonstrated by considering the effect on the performance of the network of small variations in the values of these parameters. The following three sets of variables were applied during the training of otherwise identical three layer networks:

Variable   Set 1       Set 2       Set 3
H          23          23          23
α1         2.2421      2.2421      2.2421
β1         4.20149     4.20149     4.20149
η1         0.27512     0.27512     2.27512
l1         -0.46313    -0.00929    -0.00929
u1         0.17588     0.873828    0.873828
α2         4.47573     4.47573     4.47573
β2         4.11231     4.11231     3.11231
η2         0.285539    0.285539    0.285539
l2         -1.74357    -0.10746    -0.10746
u2         0.314137    0.714126    0.714126
Fitness    41.2065     63.8194     85.7669

Table 5. Variation of fitness with network parameters for a 3-layer network.

These parameter sets differ only slightly, but the fitness of the network defined by the first set of parameters is less than half that of the third. The first network struggles to interpret every one of the 14 test spectra, and succeeds in identifying only one. Parameter sets two and three differ only in the values of η1 and β2, but the fitness of networks constructed according to the latter is still significantly higher. The second network produces conclusive correct outputs for 3 test spectra and inconclusive predictions for 6 more, whilst the third correctly predicts outputs for 11 members of the test set.


In order to determine suitable network parameters we invoke the Genetic Algorithm. Between 30 and 100 different chromosomes were constructed at the start of a typical calculation, using random initial values (within specified limits) for each parameter. Full training and testing of the network was carried out using each chromosome in turn, and the fitness of each calculated. Standard reproduction, crossover and mutation operators were applied.
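The sketch below shows the shape of such a GA loop. It is a simplified real-coded variant rather than the binary-string representation the chapter describes; the parameter bounds, selection and mutation rates, and the placeholder train_and_test fitness function are all our assumptions:

```python
import random

# Illustrative bounds for the eleven variables of Table 4; the true search
# limits are not given in the chapter.
BOUNDS = {"H": (5, 60),
          "alpha1": (0.1, 5.0), "beta1": (0.0, 8.0), "eta1": (0.01, 3.0),
          "l1": (-2.0, 0.0), "u1": (0.0, 2.0),
          "alpha2": (0.1, 5.0), "beta2": (0.0, 8.0), "eta2": (0.01, 3.0),
          "l2": (-2.0, 0.0), "u2": (0.0, 2.0)}

def random_chromosome():
    # H is treated as continuous here and would be rounded before use.
    return {k: random.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def crossover(a, b):
    return {k: (a[k] if random.random() < 0.5 else b[k]) for k in a}

def mutate(c, rate=0.1):
    for k, (lo, hi) in BOUNDS.items():
        if random.random() < rate:
            c[k] = random.uniform(lo, hi)
    return c

def optimise(train_and_test, pop_size=50, generations=100):
    """train_and_test(chromosome) must run a full training/testing cycle
    and return the fitness of equation (5)."""
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_test, reverse=True)
        parents = ranked[:pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=train_and_test)
```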

Network optimisation generates numerous different chromosomes. The number of cycles before the maximum fitness occurs is dependent on the complexity of the problem, and the number of chromosomes per generation. Some networks are more 'robust' than others, and their performance is less dependent upon variations in parameters. While an increase in the number of chromosomes generates more promising solutions in the initial 'random' stage of the genetic algorithm, this must be balanced against the extra computing time taken to process each generation and the benefits brought by the genetic operators.

High fitness is usually achieved early in the optimisation, but small improvements continue for many GA generations. In a typical run, a fitness of 96.8% was attained after 97 generations, but a fitness of 98% was achieved after a further 600 cycles of gradual improvement. This may seem a small gain, but it can be very consequential. Using a test set of 14 spectra, it corresponds to a decrease in the mean deviation of all outputs from their target values, over all spectra, from 0.0478 to 0.0378. During this particular optimisation, the improvement was largely confined to the network output for the test spectrum of one target compound, and equated to the difference between an incorrect prediction and a correct one.

The number of hidden neurons (H) is least dependent upon the values of other parameters; in effect each neuron in the hidden layer has a distinct environment, which is little influenced by other parameters. By keeping all variables except H fixed, the effect of any variation on the performance of the network can be analysed.

Fig. 4 shows how the fitness of the network depends upon H, and illustrates several points:

• In the region of the optimum H value, small variations in the number of hidden neurons have little effect on the overall fitness.

• A decrease in H has a much more pronounced effect on the quality of prediction than an increase. If the number of hidden neurons is reduced below a certain level, the ability of the network to learn is substantially reduced, as one would expect.

Fig. 4. Variation of Fitness with the Number of Hidden Neurons (optimum number of hidden neurons: 30).

• Increasing H reduces the performance of the network, but within the range of values studied here never leads to a network that cannot learn enough during training to achieve a reasonable level of operation.

One can ascribe the detrimental effect of increasing H to both the potential for overtraining and to the greater complexity of the network, leading to a higher chance of reaching a deficient solution.

The slightly irregular appearance of the line in Fig. 4 illustrates that each training session is unique, and to an extent stochastic, so the optimum set of network parameters is not always located. To increase the chance that the optimum will be found, it is helpful to train the network using the same set of variables several times. The probability of reaching a sub-optimal solution is also reduced by making appropriate adjustment to the momentum term in the back propagation algorithm. In addition, the choice of a suitable value for the momentum makes training more rapid, by reinforcing positive changes in weights, whilst negating small anomalous changes. The value of the momentum, μ, is variable, but a small positive value (0.2 to 0.3) appears to give the best results.
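A one-line illustration of the momentum-augmented weight update assumed here (the function and symbol names are ours):

```python
def momentum_update(weight, gradient, previous_step, eta, mu=0.25):
    """One weight update with momentum: eta is the step size (eta1 or eta2
    above) and mu the momentum; mu of 0.2-0.3 gave the best results here."""
    step = -eta * gradient + mu * previous_step  # reinforce consistent changes
    return weight + step, step
```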

7.2 Targeted Inputs

The benefit of using the 'targeted' set of 29 network inputs can be clearly demonstrated by comparing the performance of a network using these inputs with that of a network which uses a 'universal' series of inputs, embracing the whole infrared region apart from those regions affected by water and carbon dioxide absorptions. This series of inputs can be achieved by dividing the spectral ranges 700-1500 cm-1, 1700-2300 cm-1 and 2400-3100 cm-1 into forty-two sections, each 50 wavenumbers wide.


A resolution of 50 cm-1 was chosen so that the number of inputs in the targeted and universal sets would be comparable. Although the latter network has the greater number of inputs, its performance falls well below that of the targeted set. Using a test set of fourteen spectra, the identities of only six compounds were correctly predicted by a three layer network trained with data in the universal format. In contrast, both two and three layer networks using data in the targeted format correctly identified all fourteen spectra.

This reduction in the accuracy of prediction can be attributed to two limitations:

• Many of the inputs in the universal series focus on regions of the spectrum in which none of the target compounds exhibit absorptions. Removal of these null inputs (mainly lying in the 2000-3000 cm-1 region) has no detrimental effect on the performance of the network, but reduces the number of input neurons to 27.

Resolution (cm-1)   Number of inputs
100                 21
50                  42
40                  53
30                  71
20                  106
10                  212
7.5                 263
5                   424

Table 6. Variation of the Number of Network Inputs with Resolution.

• Unlike the targeted set, the universal set is unable to distinguish between small differences (5-10 cm-1) in the frequencies of absorbances which are close together, because the input bands are too wide.

The resolution of the universal set can be increased, and this improves accuracy. However, it is not until a resolution of 10 cm-1 or less is used that the performance is comparable to that of a three layer network with the targeted input set.

This level of resolution necessitates over 200 inputs (Table 6), resulting in a much more complex network and consequently a large increase in processing time.


8 Comparison of Two and Three Layer Networks for Spectral Recognition

To examine the hypothesis that a three layer network should be more suited to spectral recognition than one of two layers, networks of each type were optimised and tested. The networks were treated in exactly the same way:

• 29 input neurons.

• Sigmoidal thresholds, to both hidden (in the three layer network) and output layers.

• 14 output neurons.

• Optimisation using the genetic algorithm.

The networks were trained using a set of 70 spectra (5 per output neuron) and a test set of 14 spectra (1 per output neuron). This division of the data creates the potential for different combinations of spectra forming the two sets. Each of these combinations has unique consequences during training and testing, with the possibility that one combination might form a particularly suitable or unsuitable training set. Six separate networks were optimised and trained, each using a different sixth of the total database as the testing set. Assessment of performance was based upon an average taken over all networks.

After optimisation, all six three layer networks identified sample spectra for all of the target compounds. The most capable two-layer network also achieved this, but the other five failed to recognise one or more of the target compounds.

The outputs of the two layer networks exhibit a higher overall degree of error than those of the three layer networks. Across both networks, the highest degree of error is in the network output for the spectra of alcohols and alkanes. This is predictable, since both of these categories target more than one compound, and hence focus less precisely on features in a particular spectrum. For both two and three layer networks, the lowest RMS error values are seen in the predictions made for compounds whose spectra contain unique bands, notably carbon monoxide and tetrachloromethane. Using the three layer network, outputs for pyridine spectra display unusually high RMS error compared to other single compound categories, but error values for all test spectra still lie within acceptable limits.

Using a two layer network, one training set is seen by the network as more representative of the data than others, whereas the performance of the three layer network is virtually independent of the training set used. The most successful two and three layer networks are analysed in greater detail in Table 7.

The principal failure of the two-layer network is a large t1 error produced for the methylamine test spectrum, which indicates a larger than desirable deviation from the target value. The t1 output for this spectrum is 0.72, slightly below a conclusive level, but still high enough to constitute a reasonably reliable prediction. Error values in the outputs of the three layer network are on the whole slightly greater than those generated by the two layer network. However, using this network, correct predictions are made for all test spectra, with t1 output values within 0.08 of their targets, and very few notable peaks at t0 outputs.

Variable   2 layer network   3 layer network
Epochs     150               100
H          n/a               30
α1         n/a               1.93958
β1         n/a               3.24237
η1         n/a               2.84821
l1         n/a               -0.30585
u1         n/a               35.8517
α2         1.37622           0.836524
β2         7.1744            3.73226
η2         2.54116           0.974481
l2         -1.74879          -0.75296
u2         40.1336           0.020615
Fitness    97.89339          98.40411

Table 7. Optimisation Values for Two and Three Layer Networks.

It is barely possible to differentiate between the performance of the most successful two layer and three layer neural networks. If a network were to be used only for spectral recognition, the level of attainment could potentially be equalised by normalisation of the data (non-normalised data were used here). Alternatively, a step threshold added to the output layer of each network (so that outputs take only Boolean values) would remove the difference in performance.


The performance of networks trained using spectra in the form of transmittance values was inferior to the best two and three layer networks trained with absorbance data. There were no false predictions, but the t1 outputs for the spectra of both toluene and methanol were below conclusive levels (around 0.7). This reduction in network performance when compared to a system using absorbance data could be a consequence of several factors:

• Unsuccessful optimisation.

• An inappropriate value for the bias. The bias used in this system was 1, which is significantly less than the magnitude of the inputs to the network (in the range 0-100).

• The non-linear nature of transmittance data.

9 A Network for Analysis of the Spectrum of a Mixture of Two Compounds

We turn now to the analysis of the spectra of samples that contain more than a single pollutant species.

The networks described in earlier sections were trained using only the spectra of individual compounds, and so, unsurprisingly, were not able to give consistently correct t1 outputs for both chemicals in a binary mixture. A commonly observed outcome was that one of the two target compounds was correctly identified, but identification was inconclusive for the other. A general increase in the magnitude of outputs at t0 neurons was also detected.

To combat this problem, a network was trained using both individual spectra and combinations of those spectra. The standard training set contains 70 spectra, 5 for each target compound, and every one of these formed part of the new training set. In addition, the spectrum of each binary mixture was required. All possible combinations of two spectra of different compounds were generated (91 compound pairs, each in 25 pairings of their five spectra), resulting in 2275 new spectra and a total training set of 2345 spectra. Similarly, the test set consisted of 14 individual spectra and 91 binary combinations, a total of 105 spectra.

After optimisation and training, network performance was found to be well below that of the networks trained for identification of single compounds. All of the individual spectra gave outputs of 0.7 or greater at the t1 neuron. Using a t1 threshold of 0.8, 44 of the combined spectra (almost 50%) gave inconclusive outputs for one or both target compounds (41 spectra) or gave an incorrect output (4 spectra). Lowering the t1 threshold to 0.65 reduces the number of inconclusive results to 19, but this is still unacceptably high. This modest performance arises principally, as one would anticipate, from problems caused by overlapping bands. The difficulty is most pronounced when the spectra of the two compounds share many peaks, for instance phenol and aniline (Fig. 5).

Fig. 5. IR Spectra of Phenol (solid line) and Aniline (dotted line).

The peaks targeted by the network are marked in Fig. 5 by solid arrows; all four highlight features in both spectra. The only discriminating feature of the four is the peak around 1185 cm-1, strong in the spectrum of phenol, but weak in that of aniline.

The result is that conflicting information is presented to the network. An enhancement in performance is seen if a further input is created, targeting the peak in the phenol spectrum close to 1335 cm-1. (The peak near 3400 cm-1 is severely overlapped by an absorbance due to water.) For the individual spectra, network outputs now reach conclusive levels (previously inconclusive), although the t1 outputs for the combined spectrum are still below conclusive levels (Fig. 6).

Fig. 6. Network Output for a Combined Phenol/Aniline Spectrum (i) before and (ii) after the Addition of an Extra Input Neuron to Target the 1335 cm-1 Band in the Spectrum of Phenol.


As one would anticipate, the best performance tends to occur where one or both compounds exhibit strong and unique spectral features, particularly carbon monoxide and tetrachloromethane. In almost all spectra featuring either of these compounds, both pollutants are correctly identified.

These results demonstrate that a neural network has the potential to analyse the information provided by the spectrum of a mixture, and suggest that a suitably diverse network should also be able to handle the spectra of more complex mixtures.

10 Networks for Spectral Recognition and TLV Determination

10.1 Combined Recognition-Quantitation Network

A neural system which is to be usefully deployed in the workplace will need not only to recognise compounds from an experimental spectrum, but also to determine whether the concentrations of compounds within the sample are above a certain level. The following sections cover the development of such networks.

The data subset (including several theoretical data sets) for each target compound was arranged in order of concentration. Each subset was then split: spectra at a concentration below the TLV for that compound were assigned a target of 0 at all output neurons (t0 spectra), and those at a concentration above the TLV a target of 1 at the output neuron corresponding to that compound (tcomp), with all other targets set to 0 (t1 spectra).

Two spectra were selected at random for each compound, one below and one above the TLV. These 28 spectra formed the test set, with the remaining spectra forming the training set. Three layer networks were then optimised and trained, and the best performing network was able to correctly predict outputs for both t0 and t1 test spectra of nine of the target compounds (ammonia, aniline, benzene, carbon monoxide, tetrachloromethane, chlorobenzene, methylamine, phenol and pyridine) (Fig. 7).

Of the remaining five compounds, the network output was correct for the t1 spectrum of chloroform, but incorrect for the t0 spectrum, whilst outputs were correct for the t0 spectra of cyclohexane and toluene, but inconclusive for the t1 spectra. The output for ethane (t1 spectrum) was correct, but for methane (t0 spectrum) inconclusive at the tcomp neuron (both compounds target the alkane output). The network performance was worst for the alcohol category, giving inconclusive outputs for both test spectra.


Fig. 7. Error in t0 spectra (top) and t1 spectra (bottom): tcomp error and RMS t0 error for each target compound.

Building upon the partial success of this network, a second network was trained using the same data. This network had twenty-eight output neurons, two per target compound: one to determine the identity of a compound (tcomp), and the second focusing solely on the concentration of that compound (tconc).

There were two primary reasons for increasing the number of output neurons:

• It was anticipated that the increased network complexity, combined with a simplification of the role of each output neuron, would enhance the performance of the network.


• A practical system needs to be able to recognise the spectrum of a chemical even if its concentration is below the TLV. For example, tetrachloromethane has a TLV of 5 ppm, but is a carcinogen, and so its presence even at lower concentrations is potentially dangerous.

Optimisation showed that an increased number of hidden neurons was necessary to cope with the greater number of outputs. This network employed 32 hidden neurons, in contrast to only 23 in the previous network which had 14 outputs. This also led to an increase in the number of epochs required during training from an average of 150 to over 200.

The performance of this network was disappointing, with network outputs being correct for both t0 and t1 spectra for only five target compounds (aniline, benzene, tetrachloromethane, phenol and alcohol).

10.2 Separation of the Recognition and TLV-determination Tasks

Splitting the problem into two stages is one means by which network performance may be improved. The first stage is to classify the spectrum as showing evidence of one of the fourteen target species, and the second is to determine whether the concentration of that species is above the TLV.

The first step was achieved using one of the 'recognition' networks described previously. The second step involved training a separate three layer network, with a larger number of input neurons. The input set for this 'concentration' network combined the output of the recognition network (14 values) and the initial input set (29 values) to form a total of 43 inputs. Again 14 output neurons were used, with one neuron corresponding to the concentration of each target compound. An output close to 1 at tconc means the concentration of the compound is above the TLV, whilst an output approaching 0 means the concentration is below the TLV.
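The two-stage arrangement can be sketched as a simple cascade; recognition_net and concentration_net below are placeholders for the trained networks, not code from the chapter:

```python
import numpy as np

def identify_and_check_tlv(spectral_inputs, recognition_net, concentration_net):
    """Two-stage cascade: the 14 recognition outputs are concatenated with
    the 29 spectral inputs to form the 43 inputs of the concentration
    network. Outputs near 1 at a tconc neuron flag a concentration above
    the TLV for that compound."""
    recognised = recognition_net(spectral_inputs)                # 14 values
    conc_inputs = np.concatenate([recognised, spectral_inputs])  # 43 values
    above_tlv = concentration_net(conc_inputs)                   # 14 values
    return recognised, above_tlv
```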

Successful predictions were made by this network for all compounds, using both t0 and t1 test spectra. The accuracy of the network output is superior with t0 spectra, but the errors in outputs with t1 spectra are still low enough for reliable predictions to be made; all tconc outputs are within 0.15 of their target values, and RMS t0 errors are very low.

This combination of two neural networks proves to be very effective in completing the dual task of spectral identification and concentration determination. Compounds exceeding the TLV give significant outputs at the relevant recognition and concentration neurons, whilst those below the TLV still output at the recognition neuron, thus acknowledging the presence of that compound in the sample.


However, the accuracy of network predictions with spectra recorded at concentrations close to the TLV is strongly dependent upon the compound. Consider, for example, the spectrum of tetrachloromethane. The TLV is very low (5 ppm), but the network should easily be able to distinguish between t0 and t1 spectra because the extinction coefficient of the 790 cm-1 peak is high (51.0 x 10-4 ppm-1 m-1). Even small changes in concentration have a large effect on the strength of the absorbance.

In contrast, the 1030 cm-1 peak in the spectrum of pyridine has an extinction coefficient of only 0.52 x 10-4 ppm-1 m-1. The TLV is again 5 ppm, but distinguishing between the t0 and t1 spectra on the basis of this peak is much more difficult. It is therefore to be expected that a neural network would distinguish more effectively between tetrachloromethane spectra at 4 or 6 ppm than between pyridine spectra at similar concentrations.

11 A Network for Quantitative Spectral Analysis

The system would be of further value if, in addition to identifying a pollutant, it could also give some quantitative indication of the sample concentration. Using data within the confines of Beer's Law, it should be possible to train a network to give an output proportional to concentration. Accordingly, a three layer network was optimised and trained, using data for the spectral feature of benzene close to 1038 cm-1, which has an extinction coefficient of 0.76 x 10-4 ppm-1 m-1. Five network inputs were defined, spanning the spectral region around this peak from 1023 to 1052 cm-1. Several inputs were used since this is, in effect, equivalent to using peak areas instead of heights for spectroscopic calibration.

Fig. 8. The 1000-1080 cm-1 Region in the Spectrum of Benzene.


Input number   Lower limit (cm-1)   Upper limit (cm-1)
1              1031                 1033
2              1034                 1036
3              1037                 1039
4              1040                 1042
5              1043                 1045

Table 8. Network Inputs for Benzene Concentration Determination.

One output neuron was used, at which the target output for the compound at highest concentration was set to 1. A reference data set was created, with all input values set to 0, and this sample was given a target output of 0. The target outputs for all other training spectra were scaled proportionally between these two limits.

The spectra in the database which obeyed Beer's Law were used as the basis for the training and testing sets. In addition, 25 theoretical data sets were generated, and a small random error (of ±5%) was added to each point in the theoretical set. This created a total of 30 spectra, 20 of which formed the training set and the remaining 10 the test set.
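A sketch of how such theoretical data sets might be generated, assuming a reference spectrum known to obey Beer's Law; the scaling interface is our invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_spectrum(reference, scale):
    """Scale a Beer's-Law reference spectrum to a new concentration and
    perturb every point by a random error of up to +/-5%."""
    noise = rng.uniform(-0.05, 0.05, size=np.shape(reference))
    return scale * np.asarray(reference, float) * (1.0 + noise)
```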

A three layer network was optimised and trained, using sigmoidal thresholds to both hidden and output layers. The results of this were promising, but the effect of the sigmoidal threshold is prominent (at low and high concentrations in Fig. 9).

Fig. 9. Predicted and Actual Concentration Using a Sigmoidal Threshold (horizontal axis: actual concentration, i.e. target output).


Test spectra with concentrations close to the threshold (β) of the sigmoidal function achieved values close to their targets, as did those with outputs around 0.15 and 0.85.

In the lower and upper regions performance deteriorates somewhat, with output values deviating from their targets as the asymptotes of the sigmoid are approached. By scaling target outputs to values in the region of the sigmoid with an output between 0.15 and 0.85, performance would improve noticeably, but the sigmoidal function still deviates somewhat from linearity in this region, and it is this departure from linearity which is responsible for the deterioration.

To counteract this effect, a linear threshold was used in the output layer, retaining the sigmoidal threshold in the hidden layer. The purpose of this was to remove the non-linear character of the sigmoidal function from the output, whilst maintaining the contribution of the sigmoid in training the hidden layer. The linear threshold was far more successful in predicting concentrations over the complete range presented by the test set (Fig. 10).

Fig. 10. Predicted and Actual Concentration Using a Linear Threshold (horizontal axis: actual concentration, i.e. target output).
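The hybrid threshold scheme might look as follows in a forward pass; weight shapes and parameter defaults are illustrative assumptions:

```python
import numpy as np

def sigmoid(x, alpha=1.0, beta=0.0):
    """Sigmoidal threshold with stiffness alpha and threshold beta."""
    return 1.0 / (1.0 + np.exp(-alpha * (x - beta)))

def forward(x, W1, W2, alpha1=1.0, beta1=0.0):
    """Sigmoidal threshold on the hidden layer, linear output layer.
    Weight shapes: W1 is (hidden x inputs), W2 is (outputs x hidden)."""
    hidden = sigmoid(W1 @ x, alpha1, beta1)
    return W2 @ hidden   # linear output: no squashing of the prediction
```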

The TLV of benzene is 10 ppm, and this is within the range of concentrations encompassed by Beer's Law. A linear threshold of this type could therefore be applied to a 'critical concentration region' bracketing the TLV. In this critical region the network output would be proportional to the concentration, whilst below it the output would be 0 and above it the output would be 1.
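One way to realise such a critical region is a clipped linear map; the 5-15 ppm bracket around benzene's 10 ppm TLV below is purely illustrative:

```python
import numpy as np

def critical_region_output(conc, low=5.0, high=15.0):
    """0 below the critical region, a linear ramp through it, 1 above it."""
    return float(np.clip((conc - low) / (high - low), 0.0, 1.0))
```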

The success of this network suggests it should be possible to train a network to give a quantitative output for the majority of the target compounds. However, for several of these compounds the TLV lies above the concentration range to which Beer's Law can be applied. A neural network could be used for quantitative prediction beyond the range of Beer's Law, provided that two criteria were met:

• Sufficient spectra recorded at concentrations above the limit of Beer's Law were available.


• A threshold function for the output layer, which mimicked the spectral data, was used.

The linear threshold used above is successful because it correlates closely with the linear variation of absorbance with concentration. However, above the limit of Beer's Law the linear relationship fails, and the linear threshold becomes ineffective. The calibration curve for a given compound may nevertheless be well described by a suitable mathematical function; if so, use of a threshold which maps this function would enable the network to make predictions beyond the range of Beer's Law. Such predictions would be fast, and useful for estimating the concentration of atmospheric pollutants on a short timescale.

References

1. Li-Shi Y., Levine S.P., Strang C.R. and Herget W.R. Fourier Transform Infrared (FTIR) Spectroscopy for Monitoring Airborne Gases and Vapours of Industrial Hygiene Concern. J. American Industrial Hygiene Association, 1989, 50(7), 354-359.
2. Jacquinot P. How the Search for a Throughput Advantage Led to Fourier Transform Spectroscopy. Infrared Physics, 1984, 24(2/3), 99-101.
3. Bio-Rad Laboratories, Sadtler Division. http://www.bio-rad.com, 1999.
4. Smith B.C. Fundamentals of Fourier Transform Infrared Spectroscopy, 1996. CRC Press, Boca Raton, Florida, USA.
5. George W.O. and Willis H.A. (Eds). Computer Methods in UV, Visible and IR Spectroscopy, 1990. Royal Society of Chemistry, Cambridge.
6. Springsteen A.W. and Workman J. Jr. (Eds). Applied Spectroscopy: A Compact Reference for Practitioners, 1998. Academic Press, London.
7. Clark C. and Canas A. Spectroscopic Identification by Artificial Neural Networks and Genetic Algorithms. International Journal of Remote Sensing, 1995, 16(12), 2255-2275.
8. Madison M.S. and Munk M.E. A Neural Network Approach to Infrared Spectrum Interpretation. Mikrochimica Acta, 1990, I, 131-155.
9. Madison M.S., Munk M.E. and Robb E.W. Neural Network Models for Infrared Spectrum Interpretation. Mikrochimica Acta, 1991, II, 505-514.
10. Mackenzie M.D. Counterpropagation Network Applied to the Classification of Alkanes through Infrared Spectra. Neural Computation and Applications, 1994, 2(2), 111-116.
11. Tanabe K., Tamura T. and Uesaka H. Neural Network System for the Identification of Infrared Spectra. Applied Spectroscopy, 1992, 46(5), 807-810.
12. Madison M.S., Munk M.E. and Robb E.W. The Neural Network as a Tool for Multispectral Interpretation. J. Chemical Information and Computer Science, 1996, 36, 231-238.
13. Greene S.A. and Pohanish R.P. Hazardous Materials Handbook, 1996. Van Nostrand Reinhold, London.
14. Luxon S.G. (Ed). Hazards in the Chemical Laboratory (5th Edition), 1992. Royal Society of Chemistry, Cambridge.
15. The Interpretation of Vapour Phase Infrared Spectra (Vol. 2), 1984. Sadtler Research Laboratories, Philadelphia, Pennsylvania, USA.
16. The Sadtler Handbook of Infrared Spectra, 1978. Sadtler Research Laboratories, Philadelphia, Pennsylvania, USA.
17. Schrader B. Raman/Infrared Atlas of Organic Compounds (2nd Edition), 1989. VCH, Weinheim.
18. Linstrom P.J. and Mallard W.G. (Eds). NIST Chemistry WebBook (http://webbook.nist.gov), NIST Standard Reference Database 69, 1998. National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
19. Galactic Industries Corporation. http://www.galactic.com/galactic/~ala/data.htm, 1999.
20. Walters F.H., Parker L.R. Jr., Morgan S.L. and Deming S.N. Sequential Simplex Optimisation, 1991. CRC Press, Boca Raton, Florida, USA.
21. Jones A.J. Genetic Algorithms and their Application to the Design of Neural Networks. Neural Computation and Applications, 1993, 1(1), 32-45.
22. Goldberg D.E. Genetic Algorithms in Search, Optimisation and Machine Learning, 1989. Addison-Wesley, Reading, Massachusetts, USA.
23. Cartwright H.M. Applications of Artificial Intelligence in Chemistry, 1993. Oxford University Press, Oxford.
24. Reeves C.R. Modern Heuristic Techniques, 1993. Blackwell Scientific, Oxford.
25. Masters T. Practical Neural Network Recipes in C++, 1993. Academic Press, London.
26. Carling A. Introducing Neural Networks, 1992. Sigma Press, Wilmslow.
27. Zupan J. and Gasteiger J. Neural Networks for Chemists, 1993. VCH, Weinheim.
28. Jain A.K. and Mao J. Artificial Neural Networks: A Tutorial. Computer, 1996, 29(3), 31-44.


Genetic Algorithm Evolution of Fuzzy Production Rules for the On-line Control of Phenol-Formaldehyde Resin Plants

Hugh M. Cartwright(1,*) and David Issott(1)

(1) Physical and Theoretical Chemistry Laboratory, Oxford University, South Parks Road, Oxford, England OX1 3QZ

(*) Author to whom correspondence should be addressed. [email protected]

Summary: The composition and properties of the products of batch chemical processes, such as phenol-formaldehyde resins, may be subject to significant variation. Variations may arise from an inability to control the production process to the desired degree of precision, or from a lack of detailed knowledge about how physical parameters such as temperature, pressure and concentration affect the course of the reaction. The operating parameters for batch processes are generally derived by scaling up laboratory-scale experiments, and then gradually modified, taking advantage of experience with industrial scale production. Modelling of batch production is often used, but its value is inextricably linked to the quality of models for both the reaction and the production process. Production rules for batch processes are frequently determined by the plant operators, but potentially effective ways of running the plants may then never be explored, if they lie outside the experience of the operators. In this chapter we investigate the possibility that fuzzy production rules for plants producing phenol-formaldehyde resins (and, more generally, plants involved in a variety of industrial chemical syntheses) may be evolved using a genetic algorithm.

Keywords: Fuzzy rules, Genetic Algorithm, Phenol-formaldehyde resin, Artificial Intelligence, batch process.

1 Introduction

The widespread production and use of synthetic polymers is a comparatively recent development. It was not until the late 19th century that the first truly synthetic polymers were prepared [1]. Their range of desirable properties and low production cost has since led to a rapid increase in the number of commercially-available polymers, and these are now ubiquitous in consumer and commercial products.




Although the number of polymers in production is substantial, many well-established polymers and their precursors retain a position of commercial importance. Amongst the most notable of these is the polymer formed by phenol and formaldehyde. The reaction between these chemicals was first investigated in the 19th century by Lederer and Manasse. Their studies were extended in pioneering work by von Baeyer [1], who produced the first phenol-formaldehyde resins, which are condensate polymerisation products (Fig. 1).

Fig. 1. The generalized structure of a phenolic resin (resol).

These early syntheses yielded a reddish-brown intractable mass of little commercial value, and it was not until 1907 that Leo Baekeland developed a method by which the resins could be converted into moldable forms. Baekeland's resins could be transformed by heat and pressure into hard, chemically-resistant products, and because of their clear superiority to competing materials they were adopted for a variety of consumer products. Phenolic resins appeared at a time when the electrical industry was entering a period of rapid growth. The ease of fabrication which these resins offered, together with their desirable insulation and cost characteristics compared to alternatives such as metal or wood, led to their rapid adoption. The first commercial phenolic resin plant, Bakelite GmbH, was founded in 1910, and the name Bakelite became synonymous with the early polymers.

Phenolic resins have more recently been replaced in some applications by cheaper plastics, but the market for them is still substantial. Resins are used in industries from wood working to rubber and adhesive manufacture and in decorative laminates. This variety of markets requires products in a range of forms, such as


powdered or flaked solid resins, and also solvent-based or aqueous solution resins; this in turn demands flexibility from manufacturers.

Resins are generally prepared in a batch, rather than continuous, process, since in the latter cured resin may collect on hot reactor surfaces, leading to a loss in production efficiency. In a batch process the appropriate amounts of phenol, formaldehyde and water are combined under strict pH and temperature control in a reactor which is discharged once reaction is complete.

Resin production does not require substantial capital investment. The methodology of production is well understood and publicly available. Since relatively little is done to the starting materials to form the resin, the economic value added during production is small. Furthermore, competition in resin production has traditionally been high, because of the low cost of entry. For these reasons, profit margins are slim, and small increases in production efficiency have a disproportionately large effect on profitability. It is the manner in which such efficiency gains may be achieved which is the subject of this chapter.

2 Resin Chemistry and Modelling

Phenolic resins are obtained by the step-growth polymerisation of a difunctional monomer, such as formaldehyde, with monomers of functionality greater than two, such as phenol [2]. The temperature and pH of the reaction have a significant effect on the characteristics of the product.

There are three main reaction sequences in resin manufacture: the addition of formaldehyde to phenol, condensate prepolymer formation or chain growth, and the cross-linking or curing reaction. The rate of the initial phenol-formaldehyde reaction at pH < 4 is found to be proportional to the hydrogen ion concentration [1], whilst at pH > 5 it is proportional to the hydroxide ion concentration; this indicates a change in reaction mechanism. Prepolymer molecules formed at low pH are known as novolaks, whilst those produced at high pH are termed resols.

This chapter deals with resols, whose generalised structure is shown in Fig. 1. The chemistry leading to the formation of a resol can be divided into the behaviour of formaldehyde in aqueous solution, the addition of formaldehyde to phenol, and the formation of prepolymer species.


2.1 Formaldehyde Equilibria

In aqueous media formaldehyde undergoes an extremely rapid hydration reaction which results in the reversible production of methylene glycol [3].

$$\mathrm{CH_2O + H_2O \rightleftharpoons HOCH_2OH} \tag{1}$$

$$\mathrm{HO(CH_2O)_{i-1}H + HOCH_2OH \rightleftharpoons HO(CH_2O)_iH + H_2O} \tag{2}$$

Methylene glycol can also undergo further polymerisation by reaction with itself to form higher products.

The polymerisation reaction is pH dependent, base-catalysed and the equilibrium between methylene glycol and polymers is attained slowly [4] at low temperature (35°C) and low pH (3-5). As would be expected, the average degree of polymerisation increases as the concentration of formaldehyde in solution increases [5].

Formaldehyde is prepared industrially by the oxidation of methanol, so most industrial sources of formaldehyde solution contain methanol as an impurity. This becomes involved in the polymerisation reactions through side reactions that must be taken into account. Hahnenstein et al [6] have investigated the polymerisation kinetics of aqueous formaldehyde and accounted for side reactions as well as the temperature and pH variations of the equilibrium and rate constants.

They used the following simplifying assumptions:

1. The NMR data showed no signal at equilibrium conditions attributable to free formaldehyde; the amount of free monomeric formaldehyde was therefore assumed to be zero.

2. The rates of the higher polymerisation reactions above i=2 were assumed to be equal.

On the basis of the equilibria shown above and these assumptions, the following rate equations can be derived:


dx" =k"xFAxSOL -k*" X" -2(k"2X"X" -k*"2 k"2 X SOL) ~ ~

- I:/k"iXi_I X" -k*"i X"iXSOL)

dx"o dt I = k"iX"i_I X " - k * "i-! X"i-IXSOL - k"i+IX"i X "

+ k * "i+1 X"i+1 X SOL

where Xa denotes the mole fraction of species a

ka denotes the rate constant of the forward reaction forming species a

k*a denotes the rate constant of the back reaction removing species a

(6)

in aqueous solution, SOL = water, A = methylene glycol, and Ai = ith poly(oxymethylene) glycol; in methanol solution, SOL = methanol, A = methylene hemiformal, and Ai = ith poly(oxymethylene) hemiformal.

These equations can be combined with experimental data to give expressions for the equilibrium and rate constants. The equilibrium constants obey the generalised form of the Van't Hoff Isochore, whilst the rate constants for the formation of the methylene glycol or hemiformal obey:

$$k/\mathrm{s^{-1}} = \left(1 + c_1\,10^{-\mathrm{pH}} + c_2\,10^{+\mathrm{pH}}\right)e^{(a - b/T)} \tag{7}$$

and the rate constants for the formation of the polymers of methylene glycol or hemiformal obey:

(8)

in which a, b, c1 and c2 are constants.
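As an illustration of the functional form of equation (7), the sketch below evaluates such a pH- and temperature-dependent rate constant. The numerical values of a, b, c1 and c2 used here are placeholders chosen only for demonstration, not the constants fitted by Hahnenstein et al [6].

```python
import math

def rate_constant(pH, T, a, b, c1, c2):
    """Rate constant for methylene glycol (or hemiformal) formation,
    k / s^-1 = (1 + c1*10^-pH + c2*10^+pH) * exp(a - b/T),
    following the functional form of equation (7)."""
    return (1.0 + c1 * 10.0**(-pH) + c2 * 10.0**(+pH)) * math.exp(a - b / T)

# Illustrative only: a, b, c1 and c2 are not the fitted literature values.
print(rate_constant(pH=7.0, T=308.15, a=10.0, b=5000.0, c1=1e4, c2=1e-10))
```

The two pH terms capture the acid- and base-catalysed channels respectively, so the rate passes through a minimum at intermediate pH.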

To understand the dynamics of phenolic resin formation, the behaviour of formaldehyde in solution must be known, and Hahnenstein's scheme can be used to calculate the distribution of formaldehyde, methylene glycol and higher polymers at a given pH and temperature.


2.2 Phenol-formaldehyde Addition

The kinetics of the phenol-formaldehyde addition reaction have been extensively investigated [7-11] since Freeman and Lewis [8] published their ground-breaking paper in 1954. Virtually all subsequent work on this topic stems from that paper.

Freeman and Lewis observed that the kinetics of the base-catalyzed phenol-formaldehyde reaction were second order.

The general expression for the overall reaction rate is given in equation (9) where X is a hydroxymethylphenol.

$$\frac{d[X]}{dt} = k\,[\mathrm{Ph^-}][\text{methylene glycol}] \tag{9}$$

However, the structure of the hydroxymethylating species in the alkaline-catalyzed reaction is not fully understood, nor is the mechanism by which methylene glycol reacts with the phenoxide ion, as the concentration of non-hydrated formaldehyde is too low to substantiate a mechanism [12, 13] such as that shown in Fig. 2.

Fig. 2. Possible mechanisms of addition of formaldehyde to phenol.

The hydroxyl group of the phenol directs addition to the ortho and para positions, since the products are generated via the more stabilised intermediates. The para-addition product is the thermodynamic product due to lower steric hindrance,


whilst the ortho-compound is the kinetic product. Formaldehyde can undergo reaction twice more, adding to the mono-hydroxymethylated product to form the di- and tri-hydroxymethylphenol. The mechanism in Fig. 2 shows the hydroxymethylated products that result from the addition of formaldehyde to phenol.

Values of the rate constants for the steps in Fig. 3 were determined by Freeman and Lewis, and, by using acid dissociation constants for the substituted phenols, rate equations can be constructed for the scheme in Fig. 3 using equations (9) and (10).

Fig. 3. Hydroxymethylphenols.

$$\frac{d[X]}{dt} = k\,[\mathrm{Ph^-}][\text{methylene glycol}] \tag{9}$$

$$\Rightarrow \quad \frac{d[X]}{dt} = k\,\frac{K_a}{[\mathrm{H_3O^+}]}\,[\mathrm{PhH}][\text{methylene glycol}] \tag{10}$$

The temperature dependence of the rate constant can be calculated if it is assumed that the rate constant follows the Arrhenius equation. deJong and deJonge [14] gave the activation energy for the hydroxymethylation of phenol as 22 kcal mol⁻¹. The reaction can also be modelled thermodynamically using values from Kakiuchi [15] and Imoto and Kimura [16], who agree on the value of 23 kcal mol⁻¹ for the average enthalpy of formation of a hydroxymethylphenol. The nature of the reaction is such that individual enthalpies of formation are difficult to obtain for all the hydroxymethylated species.
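Assuming Arrhenius behaviour as described above, the temperature dependence of the hydroxymethylation rate constant can be computed as in the minimal sketch below; the activation energy of 22 kcal mol⁻¹ is taken from deJong and deJonge [14], while the pre-exponential factor A is a hypothetical placeholder.

```python
import math

R = 1.987e-3   # gas constant in kcal mol^-1 K^-1
EA = 22.0      # activation energy (kcal mol^-1) from deJong and deJonge [14]

def arrhenius(T, A=1.0e12):
    """k(T) = A * exp(-Ea / RT); the pre-exponential factor A is a
    placeholder, not a fitted value."""
    return A * math.exp(-EA / (R * T))

# Ratio of rate constants at 80 degC and 60 degC, independent of A:
print(arrhenius(353.15) / arrhenius(333.15))  # roughly a 6-7x speed-up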


2.3 Prepolymer Species

We define a prepolymer as a molecule containing from two to five joined phenol rings. Analysis shows that, in industrial plants, the resin typically contains negligible amounts of prepolymer with more than five rings.

Prepolymer formation is observed during normal reaction conditions when the temperature is in the range 60-100°C. Below 60°C and at high pH the condensation reaction is negligible. Megson [17] and Martin [18] have proposed two main reaction pathways (Fig. 4).

Fig. 4. Prepolymer Reaction Pathways.

Grenier-Loustalot et al [19] have determined, using HPLC and NMR, that the reaction pathway marked a) occurs only in acid media, and that the pathway shown as b) is the major one in basic media. Further analysis has revealed that, under the conditions used in production of this resin, only prepolymers bonded through the para position are formed in significant quantity.

Jones [20] has investigated the self-condensation of trihydroxymethylphenol and formulated a rate expression that accounts for changes of reaction rate with pH and temperature. The rate expression for the reaction shown in Fig. 5 is first order, but the rate constant is complex.

Other side reactions exist but are little documented. Analysis of a typical industrial production process yields no species that cannot be explained using the chemistry shown above, although values for certain constants contain significant uncertainty.


Fig. 5. Self-condensation of Trihydroxymethylphenol.

3 Simulation of Chemical Reactions

To determine optimum conditions for resin production, it is necessary to simulate all aspects of the reaction as accurately as possible. To do this at a level that acceptably mimics reality, a sound knowledge of the reaction's kinetics is essential, together with values of relevant kinetic and thermodynamic parameters. A compromise between simulation speed and accuracy is almost inevitable.

In recent years advances in computing power have made simulation of large systems of chemicals more feasible and such simulations have been particularly valuable in areas such as atmospheric chemistry [21] and diffusion modelling [22].

Most simulation of chemical reactions relies upon numerical integration of the rate equations. The simplest procedure for this is Euler's method, which approximates a function by a polygon whose first side is tangent to the curve at an initial value. Using Euler's method (and indeed any numerical method of integration), the error between the calculated and true values of the function grows with time. Greater accuracy can be obtained using a smaller timestep, but the computational time required grows rapidly as the timestep is decreased.

The Runge-Kutta method is generally more accurate than Euler's method, and has been used to numerically integrate the rate equations employed in the model discussed in this chapter. It was found that at a pH of 14, and a temperature of 373 K (the upper bounds of reasonable reaction conditions), a timestep of 0.1 s was required for realistic simulations. A larger timestep resulted in oscillations in temperature and concentration which quickly became unstable.
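A minimal sketch of the classical fourth-order Runge-Kutta step used for this kind of integration follows. The fixed 0.1 s timestep mirrors the value found necessary in the text; the first-order decay used as the rate law is an arbitrary toy example, not the resin kinetics.

```python
def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy first-order decay, dc/dt = -k c, integrated with the 0.1 s timestep
# the text reports as necessary for stability at pH 14 and 373 K.
k, c, t, h = 0.05, 1.0, 0.0, 0.1
while t < 10.0:
    c = rk4_step(lambda t, y: -k * y, t, c, h)
    t += h
print(c)  # close to the exact value exp(-0.5) ~ 0.607
```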

A model of the reactions involved in the production of the phenolic resin precursor was built using the Runge-Kutta method and known values of the rate constants, or values calculated from the Arrhenius equation. A comparative model was built


using Runge-Kutta and empirical data from a commercial process. In this model the isomers of mono- and di-substituted hydroxymethylphenol are ignored and the rate constants are calculated from experimental data. Both models are approximate, since it is assumed that side reactions are negligible and that no further self-condensation beyond the formation of dimers can occur.

4 Model Comparison

An analysis of samples of commercial resin by HPLC and GC revealed a weight average molecular weight of 320 and a number average molecular weight of 188. The maximum number of linked phenolic units was between 12 and 14 but there were very few molecules of this size: molecules with 5 or more linked phenolic units accounted for only 1.5% of the total.

Both theoretical and empirical models were run using the parameters corresponding to commercial production. Table 1 contrasts the composition of the commercial resin with predictions from the models.

Chemical Species          Actual Product   Theoretical Model   Empirical Model
Phenol                    33.1             71.0                44.0
Hydroxymethylphenols      32.2             0.0009              2 × 10⁻⁴⁵
di- and tri-HMPs          23.7             29.0                56.0
Dimers                    11.0             7 × 10⁻¹⁶           3 × 10⁻¹⁵

Table 1. Resin Composition (%): Model Predictions.

The percentage composition of the real resin has been rebased assuming that only phenol, mono-, di- and tri-hydroxymethylphenols and dimers are found in the resin.

As Table 1 shows, the performance of neither model is satisfactory. Both fail to predict any significant amount of hydroxymethylphenol although this constitutes 32% of the commercial product. Likewise, virtually no dimerized species are produced. The models differ in their production of further addition products of


hydroxymethylphenols and in the phenol concentration. The theoretical model produces a more accurate prediction of the di- and tri-hydroxymethylphenols but more phenol, while the opposite is true for the empirical model. This can be attributed to the relative differences in the rate constants of formation of mono- and di-hydroxymethylphenol.

Better agreement with the composition of the commercial resin might be reached if more of the kinetic constants were available. Given the economic significance of the reaction, this suggests that further investigations in this area would be valuable.

5 Automated Control in Industrial Systems

A wide variety of methods of control are available for industrial processes. Of these, the methods most applicable to the control of resin production are fuzzy, adaptive and neural net (FAN) approaches, Model Predictive Control (MPC) and Advanced Proportional-Integral-Derivative (Adv. PID) methods. Recent research [23-27] suggests that FAN methods offer the highest degree of design transparency, ease of use, operator satisfaction and low maintenance, and it is therefore these methods on which work has been undertaken.

We discuss in the remainder of this chapter how the manufacture of a phenolic resin precursor may be optimised using an adaptive controller. Such controllers are self-referential [28] and can change to suit varying circumstances, so would be able to respond to changing feedstock or product requirements, and offer the prospect of enhanced efficiency and lower manufacturing cost. The adaptive controller described in this chapter combines the use of Fuzzy Logic with the Genetic Algorithm.

5.1 Applications of Hybrid Genetic Algorithm - Fuzzy Logic Controllers

In the first application of a Genetic Algorithm - Fuzzy Logic controller (GA-FLC), Karr and Gentry [29] showed that the pH of a test solution could be controlled in a laboratory-based experiment, and quantified the adaptability of a controller to respond to changes in pH and buffer level. Karr and Gentry concluded that the Genetic Algorithm attached to the fuzzy controller was able to learn to drive the pH to the desired set point in a reasonable length of time.

In 1995 Homaifar and McCormick [30] used a GA-FLC to balance a pole mounted on a moveable cart, which simulates an inverted pendulum, and commented that a combined Genetic Algorithm - Fuzzy Logic controller could be


created without any human expert. The success of the GA-FLC in this quite modest problem encouraged work with GA-FLCs in other areas.

Ishibuchi et al [31] applied a GA-FLC to the classification of complex patterns and pointed out that valuable information about systems can be gleaned on inspection of the IF-THEN rules bred by the GA, leading to previously undiscovered knowledge about the system.

Chan et al [32] discussed the use of a GA-FLC hybrid in breeding the fuzzy rules used to control a target tracking radar. This is of particular relevance in the control of systems subject to rapid fluctuations such as chemical reactions. Target trackers using Fuzzy Logic ('fuzzy trackers') were shown to be very effective and combined GA - fuzzy trackers were shown to be clearly superior to all other methods studied. This suggests that GA-FLCs might be capable of controlling complex chemical systems in which large and rapid shifts in concentrations and physical conditions are possible.

5.2 Fuzzy Logic Control

Fuzzy control was developed by Zadeh [33, 34] to describe physical phenomena that were less amenable to description by the discrete mathematical models dictated by Boolean logic. Fuzzy sets represent the ambiguity inherent in events [34] and measure the degree to which an event occurs rather than whether it occurs. In a Boolean system, an item can have a set membership of only 1 or 0, whilst fuzzy systems allow the membership of the item to lie anywhere between these two extremes. In essence, Fuzzy Logic control approximates a human's "rule-of-thumb" or heuristic approach to problem solving.

A recent publication [35] identifies areas in chemistry that are themselves fuzzy, for instance the investigation of chiral substances. The spatial distribution in a vessel containing a statistically significant number of molecules is almost certainly asymmetric at any instant in time. As there are too many degrees of freedom for a mirror-image configuration to occur a moment later, the system is known as stochastically achiral. In the absolute, geometrical sense, 'chiral' and 'achiral' are non-fuzzy, yet when used in the context of chemical usage they are two extremes of a spectrum, the intermediates being described as 'weakly chiral'.

In Fuzzy Logic two variables are used to represent a non-fuzzy value. These comprise a linguistic term that defines in which set the non-fuzzy value is found, and a number, the membership function, that defines what membership the non-fuzzy value has within the set. In this way, a discrete input, such as temperature, is 'fuzzified' into a member of a set such as 'hot' or 'cold', with a value representing the degree of 'hot'-ness or 'cold'-ness. A temperature of 299K might be 'fuzzified' as 'WARM' with a membership function of 0.8, and 'HOT' with a membership function of 0.2.
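A minimal fuzzification sketch follows. The triangular set shapes and their breakpoints are assumptions chosen so that 299 K reproduces the memberships quoted above (WARM 0.8, HOT 0.2); they are not the sets used in the actual controller.

```python
def triangle(x, left, peak, right):
    """Triangular membership function rising from `left` to `peak`
    and falling back to zero at `right`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical temperature sets (K), placed so 299 K -> WARM 0.8, HOT 0.2.
def fuzzify(T):
    return {"WARM": triangle(T, 290.0, 297.5, 305.0),
            "HOT":  triangle(T, 297.5, 305.0, 315.0)}

print(fuzzify(299.0))  # {'WARM': 0.8, 'HOT': 0.2}
```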


A fuzzy logic controller possesses a set of rules which determine what actions are performed if a particular set of inputs is present. The form of these rules is:

If < condition a, condition b > THEN < action x, action y >

The membership functions of each of the input conditions are used to determine how far the output conditions should be followed. This is termed 'defuzzification', and the centroid defuzzification method is usually used, according to which the result of defuzzification is a weighted mean of the elements belonging to the fuzzy subset.
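The centroid method described above can be sketched as follows, assuming each output fuzzy set is summarized by a single representative (centre) value; the set centres and firing strengths in the example are illustrative.

```python
def centroid_defuzzify(clipped):
    """Weighted mean of the output set centres, weighted by the
    membership (firing strength) assigned to each set."""
    num = sum(mu * centre for centre, mu in clipped)
    den = sum(mu for _, mu in clipped)
    return num / den if den else 0.0

# Two fired output sets, e.g. heater power levels (kW) with strengths:
print(centroid_defuzzify([(2.0, 0.8), (8.0, 0.2)]))  # -> 3.2 kW
```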

Figure 6 shows an overview of the actions of a fuzzy logic controller, which can be summarized as follows:

• sensor reads the state of the physical system

• input from each sensor is fuzzified

• the rules are consulted and the appropriate output sets selected

• from the output sets and the membership functions, an output signal is produced

• the actuator acts on the output signal to affect the physical system

• the process loops indefinitely

Fig. 6. The Operation of a Fuzzy Logic Controller (sensor → controller → actuator → physical system, closed in a loop).


5.3 Genetic Algorithms

Genetic Algorithms (GAs) were developed as a method of examining the processes of natural selection in artificial, computer-based systems. GAs mimic the genetic operators of reproduction and mutation on a population tempered by natural selection. Just as natural evolution optimises species to face the challenges of their environment, so GAs can be used to optimise some forms of information problems if encoded appropriately.

GAs operate upon a population of 'strings', each of which represents a possible solution to the problem under investigation. Each string is analogous to a chromosome and has a set of points along it which code to information about parts of the overall solution. Each coding point takes a value, with each point being analogous to a gene, and the value of the point being analogous to an allele.

Each string in the population is assigned a 'fitness', which is a measure of the quality of the solution encoded by the string. The GA selects parents from the population using a probabilistic fitness-weighted method, which, while allowing every solution a chance of being selected to be a parent, incorporates a bias so that the fitter the string, the more likely is its selection.

A new population is created by combining parts of the selected parents at a randomly chosen point. In this way, a new population is made from, on average, the fitter species of the previous parent population and the random exchange of information that results from crossover among parents allows optimisation to continue.

As fitter solutions are chosen preferentially to reproduce, less fit solutions are removed and the population slowly homogenizes. Under the action of only the crossover operator, the quality of the final solution would be limited by the information included in the original population. To allow new information to be bred into the system, a mutation operator occasionally replaces a randomly chosen piece of information with another randomly created piece. Details of the genetic algorithm are available in a number of standard texts [36-38].
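A compact sketch of the fitness-weighted selection, one-point crossover and mutation just described is given below. The binary string representation, population size and bit-counting fitness function are placeholders standing in for the controller encoding used later in the chapter.

```python
import random

def select(pop, fitness):
    """Fitness-weighted (roulette-wheel) choice of a parent."""
    return random.choices(pop, weights=[fitness(s) for s in pop], k=1)[0]

def crossover(a, b):
    """Combine parts of two parents at a randomly chosen point."""
    p = random.randrange(1, len(a))
    return a[:p] + b[p:]

def mutate(s, rate=0.01, alphabet="01"):
    """Occasionally replace a gene with a randomly created allele."""
    return "".join(random.choice(alphabet) if random.random() < rate else g
                   for g in s)

# One generation over a toy population; counting '1' bits stands in for
# the controller-quality measure used in the chapter.
pop = ["".join(random.choice("01") for _ in range(20)) for _ in range(30)]
fit = lambda s: s.count("1") + 1
pop = [mutate(crossover(select(pop, fit), select(pop, fit))) for _ in pop]
```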

5.4 Application of Genetic Algorithms to Fuzzy Controllers

Fuzzy controllers are relatively transparent and easy to comprehend compared to more traditional methods of control, relying, as they do, on linguistic rules such as:

IF <temp is HOT> THEN <switch cooler to HIGH>


However, as the complexity of the system for which the controller is intended grows, so the time required to set up and debug the rules and fuzzy sets increases rapidly. The majority of the design time spent in the development of a new controller may be given over to this task. Rules and fuzzy sets are proposed and optimised by the designer of the controller, working with experts in the area. The sets and rules are therefore born out of extant solutions and control methods.

By contrast, in a GA-FLC, the GA is applied to breed simultaneously the rules and fuzzy sets of the controller. Each individual in the GA's population is a candidate controller and contains fuzzy sets and rules required to address aspects of system control. During development, each string is allowed to run the controller for a period of time. The fitness of the string is calculated by comparing the difference between a chosen goal (such as maximising the resin yield) and the output of the controller. Once the entire population has been assessed, the GA operators take hold and a new population is bred. In this way, a solution set will be created which incorporates no user value judgments. In favourable circumstances the time taken for the GA to breed a high quality solution should be much less than the time for the designer and expert to reach a comparable solution.
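The fitness assignment described above, comparing the controller's output against a chosen goal, might be sketched as below; the reciprocal functional form is an assumption, since the chapter does not state the exact formula used.

```python
def fitness(goal, output):
    """Map the deviation from the goal into (0, 1]; a perfect controller
    (output == goal) scores 1.0. The reciprocal form is an assumed
    choice, not the authors' stated function."""
    return 1.0 / (1.0 + abs(goal - output))

print(fitness(300.0, 306.0))  # ~0.143: far from the goal, low fitness
print(fitness(300.0, 300.0))  # 1.0: perfect fitness, cf. Fig. 9
```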

The following sections describe the development and testing of just such a controller, with the aim of producing a device able to control and optimise the resin production process.

Fig. 7. Vessel Temperature Under the Control of the FLC.


6 Program Development

6.1 A Temperature Controller

As a first step, we construct a simple non-adaptive fuzzy controller (FLC) capable of controlling the temperature of a reactor. A resin synthesis controller clearly needs to be able to adjust the reactor to an appropriate temperature, so this provides a useful initial test of the control system. The FLC was given control of an imaginary heating/cooling jacket on a reactor vessel with the goal of adjusting the temperature of the vessel and contents to 300K and bringing the rate of change of temperature to zero. The FLC was allowed to inspect two variables of the system: the temperature, and its rate of change.

The vessel was assumed adiabatic, the heating jacket could instantly change its power output to bring about heating or cooling, and the heating/cooling immediately altered the temperature of the entire volume of liquid; in other words, there was no time lag due to heat transfer.

When simple predetermined rules are used for the controller, in a typical run the FLC adjusts the temperature of the vessel steadily towards 300K until, at temperatures very close to that required, a temperature oscillation sets in (Fig. 7).

Figure 8 shows the heating/cooling coil output, which is steady and negative, as the FLC is trying to cool down the vessel, until oscillations begin.

Fig. 8. Heating/Cooling Coils Output.


The oscillations in both the temperature and the rate of change of temperature are a consequence of the FLC's predefined fuzzy sets lacking the correct degree of fine granularity.

6.2 An Adaptive Temperature Controller

As a first step towards breeding fuzzy rules, the FLC was made adaptive by attaching a Genetic Algorithm so that a more fine-grained collection of fuzzy rules could be evolved. Each string was made up of 25 rules of the type shown earlier. The FLC was allowed to run the reactor for a minute of simulated time, and the fitness of each string was then determined by the temperature and its rate of change at the end of the timestep.

The genetic operators were applied and each string in the new population was allowed to run the reactor for a subsequent minute of simulated time. Now, however, although each string was given a reactor which had the same parameters as before, the temperature and its rate of change were those produced at the end of the previous timestep by the fittest string of the previous generation. In this manner, the population evolves towards the goal using the best information of previous generations. The performance of this model is shown in Fig. 9.

Fig. 9. Evolution of an Adaptive Temperature Controlling FLC.

Figure 9 shows that the best string in the population produces a nearly perfect output (perfect fitness = 1.0) after 12 minutes of simulated time. The GA has, in this simple task, evolved coherent and useful information.


6.3 An Adaptive Chemical Reactor Controller

In even the simplest of chemical systems the endo- or exothermic nature of the chemical reactions may lead to a temperature change, and the GA-FLC needs to be able to cope with ongoing chemical reactions. These may be competing, and some may be autocatalytic. At a particular temperature the enthalpic outputs of two reactions might cancel and the overall temperature remain steady. But if the reactor temperature rises, an autocatalytic exothermic reaction may undergo a rate increase, causing the temperature of the reactor to rise, which will then lead to a further rate increase, and so on.

The sulfuric acid catalyzed hydration of 2,3-epoxy-1-propanol to 1,2,3-propanetriol illustrates the kind of complex behaviour that may arise from quite simple reactions [39]. The behaviour of this reaction system depends on the start-up procedure, improper procedure sometimes resulting in 'runaway', with the temperature rising at a rate of tens of degrees per second. A reactor entering a runaway phase is very dangerous, and has led to devastating accidents during commercial production. Slightly different starting conditions may lead to entirely stable behaviour, or continuing oscillations may develop.

In a move towards handling a reacting system, a model chemical system was introduced, in which a substrate A is converted to product B. Many industrial chemical reactions take place in CSTRs (Continuous Stirred Tank Reactors), in which reactants are fed at a constant rate into the reactor, where reaction occurs, after which they leave as a mix of reacted and unreacted components; this model was used here.
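A minimal sketch of such a CSTR mass balance for the first-order model reaction A → B is given below; the flow rate, volume and rate constant are illustrative numbers, not the values used in the chapter, and the temperature dependence of k is omitted for brevity.

```python
def cstr_deriv(cA, cA_in, q, V, k):
    """Mass balance for A in a CSTR with first-order conversion A -> B:
    dcA/dt = (q / V) * (cA_in - cA) - k * cA."""
    return (q / V) * (cA_in - cA) - k * cA

# Simple Euler march to the steady state (illustrative numbers).
cA, cA_in, q, V, k, h = 0.0, 2.0, 0.5, 10.0, 0.2, 0.1
for _ in range(2000):
    cA += h * cstr_deriv(cA, cA_in, q, V, k)
print(1.0 - cA / cA_in)  # fractional conversion of A at steady state
```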

As before, a GA population was created, in which each string was an entire candidate rule set for the Fuzzy Logic controller that controls the reactor's heating/cooling coils. The previous assumptions of perfect mixing of liquid in the vessel and instantaneous effect of the coils on the reactor were used. The individual rules, however, were different. In this instance, the rules had three input conditions for each output condition. For example:

{ If <temp is a> AND <rate is b> AND

<rate of conversion of A to B is c> THEN <heat at d> }

where a, b, c and d refer to fuzzy sets.

Each string was allowed to run the reactor for a timestep, after which its fitness was assigned. The fitness was determined by twin goals: to run the reactor at a constant temperature, and to select the temperature that maximises conversion of A to B, subject to an upper temperature limit.


The GA rapidly evolved a controller of optimum fitness, leading to the conversion of 80% of A to B on passing through the reactor. At this point the reactor is running at 350K (the upper temperature limit) with the rate of temperature change being effectively 0 K s⁻¹.

The fitness of 0.44 reached by this controller is the maximum possible due to the constraint of operating temperature. If the reaction could be run at infinite temperature, the fractional conversion of A would be one, but these would hardly be realistic operating conditions.

The hybrid GA-FLC is therefore capable of maximising the yield of a simple irreversible first order reaction subject to constraints on temperature and rate of change of temperature. The rapid convergence is promising for the development of control systems for more complex and involved problems.

6.4 Resin Production Control - Phenol Content Optimisation

The chemical reactions involved in the production of a phenolic resin are of course a good deal more complex than those described in the previous section. In this section, a GA-FLC is applied to the control of the resin production process.

Each string in the Genetic Algorithm population now contains candidate fuzzy rules and sets to control the temperature of the reactor, as well as the initial values and parameters of the reaction process, such as initial temperature and concentration of reagents, times and rates of addition of subsequent reagents and so forth.

In a typical commercial resin run, the reactor operators follow a method that has been set by previous experimentation on the smaller lab-scale and also in the plant-scale reactor itself. The reaction run is broken down into steps that are followed to meet batch consistency, safety and quality control specifications.

Actual production runs almost invariably deviate to some extent from the "ideal" production recipe. For instance, the valves used to admit reagents into the reactor have a lower volume limit, and hence the quantity added to the reactor might vary from the ideal by as much as 30 kilograms. The temperature of the reactor may oscillate as the temperature control system over- or under-estimates the cooling required and such variations may have detectable consequences upon the quality of the product.

The number of fuzzy rules was increased to 50 to assist the FLC in controlling this more complex system. The program was first tested on a problem of reduced scale. Instead of running the reactor for 10 hours as required for the entire manufacturing process, a one hour run was allowed. To help the GA to weed out the numerous poor solutions that arise upon initialisation of the population, the program first went through a training stage. In this training stage, the FLC was


allowed to control the temperature of a vessel of liquid under the influence of external heating and cooling for a simulated time of one hour. If the temperature at the end of this period was not between 293K and 373K, this string was discarded. This is an acceptable form of user intervention as the FLC must be able to adjust the temperature if it is to deal with a full-scale working reactor. Once the GA was populated by strings that could control the temperature of a reactor, the chemical model was introduced and the GA-FLC started simulating the production of resin.

In a typical run the fitness of the best string and average fitness of the population show only modest progress as GA generations pass, until mutation forces a rapid improvement in fitness. The whole population then benefits from interbreeding with this string. Nevertheless, the fitness of even the best string is never impressive, since the reaction time has been constrained to one hour: the reaction has not been allowed to run for long enough for the predicted resin composition to approach that of the real resin.

The GA-FLC was then tested to see if it could converge on a simplified solution over the full 10 hour reaction run. The resin produced by the reactor every generation was assessed by comparison with the ideal composition, and the fitness of each string determined from the essential parameters of the resin and its production, for example: reactor temperature, reactor pH, and the concentration of each species.

The GA-FLC again went through a procedure that selected only those strings that could control the temperature of a vessel, and only then allowed the chemical model to become operational. As might be expected, as the number of constraints increases, so the number of generations required to breed high-quality strings also increases. In order to examine the operation of the GA-FLC over longer time runs within a reasonable number of generations, the fitness of each string was initially assigned only on the basis of the phenol criterion.

Trials showed that the GA was able to produce a string that could control the temperature and pH of the reactor within realistic operational limits and also produce a resin of suitable phenol content. This suggests that the GA-FLC might be able to produce a string the output of which would satisfy all the criteria for production of the 'ideal' resin.

6.5 Resin Production Control - Species-by-Species Optimisation

If the fitness function includes the concentrations of all the relevant species in the resin (phenol, mono-, di- and tri-hydroxymethylphenols, and the self-condensed dimers), it is possible that suitable temperature controller fuzzy sets would be evolved indirectly, since fuzzy rules which give rise to temperatures that allow the production of a resin close to the ideal would be favoured over those rules


which did not contribute in this way. The temperature constraint was therefore relaxed, and replaced by an upper limit which, if breached, stopped the reaction immediately to penalise strings in the population that allowed the temperature to rise to unrealistic levels.

Convergence of the GA with this complex fitness function was much slower than when just the ideal phenol concentration was being evolved. To encourage convergence, the fitness function was set up to allow a species-by-species evolution of the resin: first the phenol concentration would be optimised, followed by the hydroxymethylphenols, and so on. In a typical run the ideal phenol concentration was reached quickly, and optimisation of the hydroxymethylphenol concentration followed shortly thereafter. After that, the GA-FLC seemed unable to make further significant progress.

6.6 Multiple Mutation

It became apparent during the trials outlined in section 6.5 that the population was rapidly homogenizing, even with a substantial mutation rate, as a high proportion of strings in the initial population contained fuzzy rules which were unable to control the temperature of the reactor. The algorithm quickly removed these strings from the population, but at the expense of a loss of diversity. To counter this, a pre-GA selection trial was created, in which a test string was initialized and, in a similar manner to that discussed earlier, allowed to control the temperature of a vessel of water, the temperature of which was set to vary in a sinusoidal manner in order to test the string's ability to cope with varying temperatures and rates of change of temperature. This process was repeated for further strings until the GA population was full of strings that could control temperature adequately. From this, it was hypothesized that the diversity in the initial population would survive longer, resulting in the GA-FLC's goals being achieved more rapidly.

It was also evident that the role of mutation needed to be considered further. The mutation operator introduces new information into the population. If a string requires two or more mutations to achieve a significant increase in fitness, the probability that the correct mutations will occur in two separate strings which are then crossed to form the required string is low. Accordingly, the mutation operator was changed to permit multiple mutations to occur in a string.
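The change from single to multiple mutation might be sketched as follows; the rule-set representation (a flat list of integer-coded rule entries) and the number of simultaneous mutations are assumptions made for illustration.

```python
import random

def multi_mutate(rules, n_mutations=3, n_sets=5):
    """Apply several point mutations to one string in a single pass, so
    that fitness gains needing two or more coordinated changes can arise
    without relying on crossover to combine them from separate strings."""
    mutated = list(rules)
    for _ in range(n_mutations):
        i = random.randrange(len(mutated))
        mutated[i] = random.randrange(n_sets)  # new random allele
    return mutated

rule_string = [random.randrange(5) for _ in range(50)]  # 50 coded rules
print(multi_mutate(rule_string))
```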

The immediate result of this is a large increase in noise in the average fitness, but the hurdle of producing quantities of di- and tri-hydroxymethylphenol that would satisfy the fitness function is still not overcome. Table 2 details the composition of the fittest string by generation for a typical run.


Generation     Phenol %   HMP %    DHMP & THMP %   Dimers %
0              50.42      49.58    2.86 × 10⁻³     0
100            28.60      71.40    4.46 × 10⁻³     0
200            45.77      54.23    3.29 × 10⁻³     0
300            44.35      55.65    0               0
400            39.05      60.95    0               0
500            39.05      60.95    0               0
1000           34.48      65.52    4.43 × 10⁻³     0
1500           33.66      66.34    4.49 × 10⁻³     0
2000           33.57      66.42    4.49 × 10⁻³     0
2500           33.57      66.43    4.49 × 10⁻³     0
3000           33.56      66.43    4.49 × 10⁻³     0
Actual ideal   33.10      32.20    23.70           11.10

Table 2. Resin Composition Predicted by the Fittest String, by Generation, During a Typical Run.

Comparison of the composition of the predicted resin with the commercial product shows good agreement for phenol, but a large amount of mono-hydroxymethylphenols which should have reacted to form di- and tri-hydroxymethylphenol. Also, no dimerized product is formed. The program's behaviour was analyzed in more detail, with close attention paid to the point at which the first batch of formaldehyde is added to the reactor, and it was found that the temperature of the reactor rapidly rose to unacceptable levels, forcing the upper temperature constraint to stop further reaction. It seemed that the program was unable to evolve FLC rules to control the temperature at this crucial point in the reaction. Table 3 illustrates the uncontrolled temperature rise within a perfectly insulated reactor with no cooling when producing the commercial resin.


Time/s    Temp/K     Reactor Volume/l   P/moles    MG/moles   HMP/moles   DHMP/moles   THMP/moles
5400      273        5667               65079      47138      0           0            0
5400.1    277.12     5667.38            65079.01   47138.30   7.53        0            0
5400.2    289.86     5667.75            65079.04   47131.78   15.03       6.51         0
5400.3    358.45     5668.13            65079.1    47111.61   22.51       26.69        0
5400.4    18457.77   5668.51            65079.16   47003.16   29.87       135.25       0.01

Table 3. Reaction Runaway in an Insulated Reactor.

As Table 3 shows, unless the temperature controller can anticipate and deal promptly with a large increase in temperature on the addition of formaldehyde, the temperature quickly rises to a point where the reaction is brought to a halt. At this stage, a large quantity of hydroxymethylphenol has formed but virtually no di- and tri-hydroxymethylphenol, as the formaldehyde has not been in solution long enough to produce higher addition products. To allow these products to be made, the controller must be able to restrain these sudden temperature rises, and it appears that the upper temperature constraint, though vital in a real system, has no place in an evolving GA-FLC as it retards development.

6.7 Expansion of the Rule Base

It is essential that the rule base evolves to include an effective temperature controller so that the reactor is always run at a sensible temperature, thereby allowing optimisation of the resin parameters.

By employing commercial parameters for chemical composition, and addition times of reagents, the GA-FLC can evolve rules that force the reactor's temperature to follow the temperature profile that gives rise to resins of commercial composition. Once the rules have been perfected, giving a temperature profile identical to that observed commercially, the resin produced should also be identical. The program can then be allowed to optimise the whole process.


The upper temperature limit was changed so that if the temperature strayed above 1000K or below 273K, the temperature and rate of change of temperature were reset to 330K and 0 K s⁻¹, the averages of a typical commercial run. For each string, the number of times that this limit was breached was recorded and used in the fitness function as a penalty.

The reactor temperature was sampled every 100 s and, at the end of each run, the fitness of each string was penalised by an amount related to the difference between the sampled temperature and the expected temperature. Incorporating these changes, the GA-FLC shows rapid improvements in fitness at an early stage, but then shows no further significant evolution over several hundred generations.
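The resulting penalised fitness can be written compactly as below; the linear, equally weighted form is an assumption, since the text states only that the penalty is 'related to' the temperature deviation and to the number of limit breaches:

$$F' = F - \alpha \sum_{j} \bigl| T_{\mathrm{sampled}}(t_j) - T_{\mathrm{expected}}(t_j) \bigr| - \beta\, n_{\mathrm{breach}}, \qquad t_j = 100j\ \mathrm{s}$$

where $F$ is the unpenalised fitness, $n_{\mathrm{breach}}$ is the number of times the temperature limits were breached, and $\alpha$ and $\beta$ are tuning constants.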

Fuzzy sets per control variable   Number of rules
2                                 16
3                                 81
4                                 256
5                                 625
6                                 1296
7                                 2401

Table 4. The Relationship between the Number of Fuzzy Sets per Control Variable and the Number of Rules.

Analysis of the rule base suggested that there were insufficient rules to run the reaction within the constraints laid down. Therefore, the GA-FLC must be provided with a larger set of rules. The penalty for this increase is twofold: first, a doubling in the number of fuzzy rules will halve the speed of the program, and secondly, the pre-selection stage will take longer, as increasing the number of rules per GA string will increase the chance that a randomly initialised rule set will produce an output that leads to the string being deleted at this stage.

The number of rules directly reflects the granularity of the control solution required. Table 4 illustrates how quickly the number of rules increases as extra granularity is introduced.
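The figures in Table 4 follow a simple combinatorial relationship: with each of the controller's control variables partitioned into $s$ fuzzy sets, a complete rule base needs one rule per combination of sets,

$$n_{\mathrm{rules}} = s^{4}, \qquad \text{e.g. } 2^4 = 16,\ 5^4 = 625.$$

The exponent 4 matches the number of control variables implied by Table 4; doubling the granularity $s$ therefore multiplies the rule count sixteen-fold.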

It should be apparent that, as the number of rules per string is increased, the quality of the initial population is reduced, as there is a greater chance that a string


will contain 'rogue' rules that give rise to unacceptable outcomes. On the other hand, the probability of evolutionary progress increases with the number of rules: as the number of rules per string increases, so evolution takes place more steadily. A 100-rule GA-FLC evolves in more numerous, smaller steps than a 50-rule version, and this reflects both the greater chance of mutation occurring in the 100-rule program and the greater diversity in the population base.

A 100-rule GA-FLC covers a larger area of the control variable surface, but in its early generations it is hardly able to control the exothermicity of the reaction, and so the temperature constraints are breached frequently. The FLC often overshoots, either cooling or heating too rapidly, leading to further breaches of the temperature constraints. Since a 100-rule GA-FLC is more 'interventionist' than a 50-rule system, containing more rules which relate to temperature, the temperature is reset more often for a 100-rule GA-FLC than for a 50-rule GA-FLC, though the positions gradually reverse on further evolution; this trend is observed for 200- and 400-rule GA-FLCs as well.

It is probable that several hundred rules are required to adequately control the reaction. The exothermic polymerisation reaction continually drives the temperature through the limits whilst the substances are reacting, and the evolution of rules which can effectively cope with the considerable energy release during the reaction is slow.

7 Comment

In an ideal GA-FLC, a rule would exist to deal with every eventuality. However, this would lead to the GA-FLC converging on an optimised solution unrealistically slowly. The upper limit at present, 625 rules, allows only five fuzzy sets per control variable, and even this may not be a sufficiently large number.

The quality of the output of the GA-FLC is dependent not only on the size of the rule set, but also on the quality of the information embedded within the model that simulates the resin synthesis. It was noted earlier that the mechanism of the addition of formaldehyde to phenol is still not known, and this provides a crucial limitation to the effectiveness of any simulation of the reaction. Rate equations have been fitted to the data, but a deep understanding of the process is not possible in the absence of a verifiable mechanism. The empirical model built from commercial data used constants that were fitted to the first hour of the reaction. This resulted in rate constants for the phenol-formaldehyde addition reaction being discovered, but improvements in model accuracy would be made if the rate constants for the self-condensation reaction were also approximated in this way.


A detailed simulation of the chemical engineering of the reactor itself is also required; for example, the reactor was assumed to be adiabatic, and the imaginary heating/cooling power of the reactor was greatly 'enhanced' to deal with this. A real reactor, suffering heat loss due to imperfect thermal insulation, is broadly comparable to a perfect reactor with continuous cooling, as used in the calculation. Nevertheless, correct simulation of the chemical engineering of the reactor would result in the GA-FLC more closely replicating reality.

A hybrid Genetic Algorithm-Fuzzy Logic controller is potentially an extremely useful tool for the control and optimisation of resin production processes. A fully operational GA-FLC would have many benefits: the operators of resin processes would have a tool with predictive power, allowing heightened levels of confidence and safety in the process; the process engineer would gain a device that allows the entire process to be streamlined and optimised, resulting in large increases in plant efficiency; and the research chemist would be able to use the GA-FLC to develop new formulations of resin to precise and pre-specified parameters, thereby reducing time spent in the development of new products.

As the criteria by which a resin is judged can be changed at will, an operational GA-FLC will also allow the optimisation of other operational "recipes" for resin presently in production and the development of new resins of specified composition.

This work suggests that limitations in the kinetic model of the chemical reactions present the most fundamental difficulty in the design of a GA-FLC to control the reactor. Given more reliable kinetic data, the evolution of a complete rule set for controlling commercial-scale reactors is entirely feasible.

References

1. von Baeyer, A. Über die Verbindungen der Aldehyde mit den Phenolen. Deutsch. Chem. Gesell. Ber., V, 280-282, 1872.
2. Knop, A. and Pilato, L. Phenolic Resins - Chemistry, Applications and Performance. Springer-Verlag, Berlin, 1985.
3. Walker, J. Formaldehyde. Reinhold, New York, 1964.
4. Bieber, R. and Trumpler, G. Helv. Chim. Acta, 30, 706-33, 1947.
5. Iliceta, A. et al. Gazz. Chim. Ital., 81, 915-32, 1952.
6. Hahnenstein I., Albert M., Hasse H., et al. NMR Spectroscopic and Densimetric Study of Reaction Kinetics of Formaldehyde Polymer Formation in Water, Deuterium Oxide, and Methanol. Ind. Eng. Chem. Res., 34, 440-50, 1995.
7. Malhorta, H.C. and Avinash (Mrs). Kinetics of Alkali-catalyzed Phenol-formaldehyde Reaction. Indian J. Chem., 13, 1159-62, 1975.
8. Freeman, J. and Lewis, C. Alkaline-catalyzed Reaction of Formaldehyde and the Methylols of Phenol: a Kinetic Study. J. Am. Chem. Soc., 76, 2080, 1954.
9. Ishida et al. Computer Simulation of the Reactions of Phenols with Formaldehyde. J. Polym. Sci., 19, 1609, 1981.
10. Zavistas, A. Formaldehyde Equilibria: their Effect on Kinetics of Reaction with Phenol. J. Polym. Sci. (A1), 6, 2540-59, 1968.
11. Astarloa-Aierbe G., Echeverria J.M., Martin M.D., et al. Kinetics of Phenolic Resin Formation by HPLC 2. Barium Hydroxide. Polymer, 39, 3467-72, 1998.
12. Grenier-Loustalot M-F., Larroque S., Grenier P., Leca J-P. and Bedel D. Phenolic Resins 1. Mechanisms and Kinetics of Phenol and the First Polycondensates Towards Formaldehyde in Solution. Polymer, 35, 3046-54, 1994.
13. Grenier-Loustalot M-F., Larroque S., Grande D., Grenier P. and Bedel D. Phenolic Resins 2. Influence of Catalyst Type on Reaction Mechanisms and Kinetics. Polymer, 37, 1363-9, 1996.
14. deJong J. and deJonge J. Rec. Trav. Chim., 72, 497, 1953.
15. Kakiuchi H. Chem. High Polymers (Japan), 8, 33-43, 1951.
16. Imoto E. and Kimura T. J. Chem. Soc. Japan, Ind. Chem. Sect., 53, 9-11, 1950.
17. Megson. Phenolic Resin Chemistry. Butterworth's Scientific Publications, London, 1958.
18. Martin R. The Chemistry of Phenolic Resins. Wiley, New York, 1956.
19. Grenier-Loustalot M-F., Larroque S., Grenier P. and Bedel D. Phenolic Resins 4. Self-condensation of Methylolphenols in Formaldehyde-free Media. Polymer, 37, 955-64, 1996.
20. Jones R. The Condensation of Trimethylolphenol. J. Polym. Sci., 21, 1801, 1983.
21. Sayler R. and Ford G.D. On the Comparison of Numerical Methods for the Integration of Kinetic Equations in Atmospheric Chemistry and Transport Models. Atmos. Environ., 29, 2585-93, 1995.
22. Kim H. et al. On the Numerical Solution of Kinetic Equations for Diffusion-influenced Bimolecular Reactions. J. Chem. Phys., 108, 5861-69, 1998.
23. Takatsu, H., Itoh, T. and Araki, M. Future Needs for Control Theory in Industries - Report and Topics of the Control Technology Survey in Japanese Industry. J. Proc. Cont., 8, 369, 1998.
24. Ordonez R., Zumberge J., Spooner J.T. and Passino K.M. Adaptive Fuzzy Control: Experiments and Comparative Advantages. IEEE Trans. Fuzzy Syst., 5, 167-188, 1997.
25. Russo M. FuGeNeSys - A Fuzzy Genetic Neural System for Fuzzy Modelling. IEEE Trans. Fuzzy Syst., 6, 373-388, 1998.
26. Goonatilake, S. and Khebbal, S. Intelligent Hybrid Systems. Wiley, Chichester, 1995.
27. Malki H. et al. New Design and Stability Analysis of Fuzzy PID Control Systems. IEEE Trans. Fuzzy Syst., 2, 245, 1994.
28. Papadoulis A.V., Tsiligiannis C.A. and Svoronos S.A. A Cautious Self-tuning Controller for Chemical Processes. AIChE J., 33, 401-409, 1987.
29. Karr C. and Gentry E. Fuzzy Control of pH using Genetic Algorithms. IEEE Trans. Fuzzy Syst., 1, 46, 1993.
30. Homaifar A. and McCormick E. Simultaneous Design of Membership Functions and Rule Sets for Fuzzy Controllers Using Genetic Algorithms. IEEE Trans. Fuzzy Syst., 3, 129-139, 1995.
31. Ishibuchi H., Nozaki K., Yamamoto N. and Tanaka H. Selecting Fuzzy If-Then Rules for Classification Problems using Genetic Algorithms. IEEE Trans. Fuzzy Syst., 3, 260, 1995.
32. Chan, V., Lee, V. and Leung, H. Generating Fuzzy Rules for Target Tracking using a Steady-state Genetic Algorithm. IEEE Trans. Fuzzy Syst., 1, 189, 1997.
33. Zadeh, L.A. Fuzzy Sets. Information and Control, 8, 338, 1965.
34. Zadeh, L.A. Fuzzy Logic. IEEE Computer, 83, April 1988.
35. Rouvray, D. (Ed.) Fuzzy Logic in Chemistry. AP Professional, San Diego, 1999.
36. Goldberg D.E. Genetic Algorithms in Search, Optimisation and Machine Learning. Addison-Wesley, Reading, Massachusetts, USA, 1989.
37. Cartwright H.M. Applications of Artificial Intelligence in Chemistry. Oxford University Press, Oxford, 1993.
38. Reeves C.R. Modern Heuristic Techniques. Blackwell Scientific, Oxford, 1993.
39. Vermeulen, D. Transient Behaviour of a Chemical Reaction System in a CSTR. PhD Thesis, Krips Repro Meppel, Amsterdam, 1986.

Page 273: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures*

Anna Maria Bianucci¹, Alessio Micheli², Alessandro Sperduti², and Antonina Starita²

¹ Dipartimento di Scienze Farmaceutiche, Università di Pisa, Via Bonanno 6, 56126 Pisa, Italy

² Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

* Partially supported by MURST grants 9903244848 and MM09308497.

Abstract. We present a novel approach to QSPR (quantitative structure-property relationships) and QSAR (quantitative structure-activity relationships) analysis based on neural networks for structures. We address two quite different chemical applications with the same model, namely the prediction of the boiling point of a class of alkanes and the QSAR of a class of benzodiazepines. The model, Cascade Correlation for structures, is a recursive neural network recently proposed for the processing of structured domains. Through this model we can represent and process the structure of chemical compounds in the form of labeled trees. We report results of preliminary applications showing that the model, adopting this representation of molecular structure, can adaptively capture significant topological aspects and chemical functionalities for each specific class of molecules, just on the basis of the association between molecular morphology and the target property.

1 Introduction

The possibility of relating significant aspects of molecular structure to the behaviour of a selected class of chemical compounds is a major challenge in many fields of research, such as Chemistry, Biochemistry and Pharmaceutical Chemistry. The assessment of such relationships represents the starting point for the prediction of required properties of new molecules. For instance, the ability of a model to predict specific properties of molecules allows researchers to rationally design new compounds while optimizing the use of both human and financial resources. For this reason the achievement of good predictive models constitutes a major task for both basic and applied research.

Many mathematical models have been developed over the years with the aim of analyzing relationships between molecular structures and target properties such as chemical reactivity or biological activity. The earliest methods all imply a non-direct correlation of the molecular structure to the target property. In these models some physico-chemical properties were used as molecular descriptors, so they are better classified as property-property or property-activity relationship models. The major problem in correlating some molecular properties (reflecting different structural aspects of molecules) to other kinds of properties (typically chemical reactivity or biological activity) is the need to find a set of complete and relevant molecular descriptors.

The problem of identifying proper descriptors, which initially led to the use of physico-chemical properties [1-3], has subsequently been faced by using a wide class of numerical descriptors more specifically oriented to the representation of molecular geometry/shape and atom connectivities (topological indices) [4-7]. Although these methods use chemical graphs as versatile vehicles for representing structural information, the chemical graphs need to be encoded into the vectorial (or matricial) form required by the technique used to solve the regression problem. Of course, this encoding process removes some structural information which may be relevant. Moreover, the a priori definition of the encoding process has several other drawbacks. For example, when the encoding is performed through topological indices, these need to be properly designed by an expert via a very expensive trial-and-error approach. The approach therefore needs an expert, who may not be available, may be very expensive, or may even be misleading if the expert knowledge is not correct. Finally, changing the class of chemical compounds under study, or the computational task, means that all the above steps must be performed from scratch. More general vectorial representations of graphs, with unicity properties, may be very difficult to map onto the target values.

A completely different approach is to let the machine learning system process the structured domain directly. While algorithms that manipulate symbolic information are capable of dealing with highly structured data, they are very often unable to deal with noisy and incomplete data. Moreover, they are usually not suited to domains where both categorical (symbolic) and numerical entities coexist and have the same relevance for the solution of the problem.

Neural networks are universally recognized as tools suited for dealing with noisy and incomplete data, especially in contexts where numerical variables play a relevant role in the solution of the problem. In addition, when used for classification and/or prediction tasks, they do not need a formal specification of the problem, requiring just a set of examples sampling the function to be learned. Unfortunately, neural networks are mostly regarded as learning models for domains in which instances are organized into static data structures, like records or fixed-size arrays, and thus they do not seem suited to structured domains. Recurrent neural networks, which generalize feedforward networks to sequences (a particular case of dynamically structured data), are perhaps the best known exception.

In recent years, however, there has been some effort to extend the computational capabilities of neural networks to structured domains. While the earlier approaches were able to deal with some aspects of the processing of structured information, none of them established a practical and efficient way of dealing with it. A more powerful approach, at least for classification and prediction tasks, was proposed in [8] and further extended in [9]. In these works a generalization of recurrent neural networks for processing sequences to the case of directed graphs is presented. The basic idea behind this generalization is the extension of the concept of unfolding from the domain of sequences to the domain of directed ordered graphs (DOGs). We will give more details on these types of neural networks for the class of directed ordered acyclic graphs (DOAGs) in Section 2.2.

The possibility of processing structured information using neural networks is appealing for several reasons. First of all, neural networks are universal approximators; in addition, they are able to learn from a set of examples and very often, by using the correct training methodology, they can reach quite high generalization performance. Finally, as already mentioned, they are able to deal with noisy, incomplete, or even ambiguous data.

All these capabilities are particularly useful for prediction tasks in Chemistry, where data are usually gathered experimentally and the compounds can naturally be represented as labeled graphs. Each node of the graph is an atom or a group of atoms, while edges represent bonds between atoms. Neural networks for the processing of structures thus seem to have the computational capabilities needed to deal with chemical domains. The prediction model can face one fundamental problem in Chemistry: the prediction of the physical-chemical properties, chemical reactivity or biological activity of molecules, leading to Quantitative Structure-Property Relationship (QSPR) or Quantitative Structure-Activity Relationship (QSAR) studies. Recursive neural networks [8] face this problem by simultaneously learning both how to represent and how to classify structured patterns. The specificity of the proposed approach stems from the ability of these networks to automatically encode the structural information depending on the computational problem at hand, i.e., the representation of the molecular structures is not defined a priori, but learned on the basis of the training set. This ability is demonstrated in this paper by the application of Cascade Correlation for structures (CCS) [8] to two radically different QSAR/QSPR problems: the prediction of the non-specific activity (affinity) towards the benzodiazepine/GABAA receptor by a group of benzodiazepines (Bz) [10], and the prediction of the boiling point for a group of acyclic hydrocarbons (alkanes) [11].

It must be stressed that the generality and flexibility of a structured representation allow one to deal with heterogeneous compounds and heterogeneous problems using the same approach. This advantage is not at the expense of predictive accuracy: in fact our results [12,13] compare favorably with the traditional QSAR treatment of benzodiazepines based on equations [10]. The approach is also competitive with results on QSPR problems (such as the prediction of the boiling point of alkanes) where a priori analytical knowledge allows the use of suitable 'ad hoc' representations as input to standard neural networks [11].

Subsequent studies on the internal representations developed by the recursive neural networks (realized by a Cascade Correlation algorithm) applied to QSAR studies of benzodiazepines were conducted using principal component analysis [14]. That study addresses the relationship between the developed neural numerical codes and the qualitative aspects of the QSAR problem. The results show that the recursive neural network is able to discover relevant structural features just on the basis of the associations between the molecular morphology and the target property (affinity). In particular, the characteristics of many substituents affecting the activity of benzodiazepines, already highlighted by previous QSAR studies, were correctly recognized by the model. This is a further step towards the assessment of the model as a new tool for the rational design of new molecules.

The chapter is organized as follows. Section 2 begins with an outline of the traditional QSPR/QSAR approach, followed by the introduction of the new QSPR/QSAR approach based on recursive neural networks. General representational issues for chemical compounds are discussed in Section 3. The first computational task faced in this paper, the prediction of the boiling point of alkanes, including representation rules and experimental results, is presented in Section 4. Similarly, the application to the QSAR problem of predicting the affinity towards the benzodiazepine/GABAA receptor is presented in Section 5, where we also present the study of the internal representations developed by the neural model through Principal Component Analysis. Discussion of the results and conclusions are contained in Sections 6 and 7, respectively.

2 Recursive Neural Networks in QSPR/QSAR

In this section we describe the new QSPR/QSAR approach based on neural networks for the processing of structured data (recursive neural networks). First we briefly review the traditional way of performing QSPR/QSAR studies. Then we suggest how the use of neural networks for the processing of structures may help in reducing the burden of developing and selecting relevant structural features for molecular representation.

Without loss of generality, for the sake of a simpler exposition and because of their relevance, we mainly focus the following explanations and examples on QSAR studies. However, though QSPR deals with general properties instead of activity, the following considerations are valid for both QSPR and QSAR analysis.


2.1 Toward a New QSPR/QSAR Approach

The aim of a QSAR study is to find an appropriate function 𝒯(·) which, given a molecular structure, predicts its biological activity, i.e.:

    Activity = 𝒯(Structure).    (1)

More generally, QSPR assumes that any molecular property, such as a physical-chemical property, can be related to the structure of the compound, i.e.:

    Property = 𝒯(Structure).    (2)

The function 𝒯 is therefore a functional transduction from an input structured domain, where molecules are represented, to an output domain 𝒪, such as the set of real numbers. In equations (1) and (2) the term "structure" stresses the importance of using global information about molecular shape, atom connectivity and chemical functionalities, as understood in QSPR/QSAR studies.

The function 𝒯(·) is a complex object which can be described as the sequential solution of two main problems: i) the representation problem, i.e., how to encode molecules through the extraction and selection of structural features; ii) the mapping problem, i.e., the regression task, usually performed by linear or non-linear regression tools (e.g., equational modeling, or feed-forward neural networks).

According to this view, 𝒯(·) can be decomposed as follows:

    𝒯(·) = g(τ(·))    (3)

where τ(·) is the encoding function from the domain of the chemical structures to the descriptor space, while g is the mapping function from the descriptor space to the biological activity space. This corresponds to the traditional QSPR/QSAR approaches, summarized for QSAR in Fig. 1, where chemical features are represented by a suitable set of numerical descriptors (function τ), which are then used to predict the biological activity (function g). The representational problem is faced by using different approaches, such as the definition and selection of physico-chemical or geometrical and electronic properties, the calculation of topological indices, or an explicit vector-based representation of molecular connectivity (see the examples in Section 4.2). The question mark in Fig. 1 stresses that the number and type of descriptors used to represent the chemical compound depend on the specific QSAR problem at hand. The exact number and type of descriptors used for a specific study are decided by an expert in the field.

[Figure 1: diagram — Molecule → Numerical Descriptors (?) → Activity]

Fig. 1. Outline of the traditional QSAR approach. Structural features of the molecule are represented through different numerical descriptors, which can be obtained by using different approaches; their number and type depend on the QSAR task at hand. The encoding process as a whole defines the function τ. A regression function (g) is then applied to the numerical descriptors to obtain the predicted biological activity.

In more detail, the encoding process requires the solution of two subtasks. The first is to explicitly represent the relevant structural information carried by the molecule, while the second is to codify this structural information into a numerical representation. For example, when considering topological indices, a molecule is first represented by its molecular graph skeleton, and then invariant properties of the molecular graph skeleton are used to define and compute a numerical formula. Thus, the function τ can be understood as the composition

    τ = τ_E ∘ τ_R    (4)

where τ_R extracts a specific structural aspect from the molecule (i.e., the solution to the first subtask), and τ_E computes a numerical value from the structure returned by τ_R (i.e., the solution to the second subtask). Examples of τ_E are the connectivity indices (χ), or the hydrophobic, electronic, polar and steric properties.
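To make this decomposition concrete, the following sketch spells out the pipeline 𝒯(·) = g(τ_E(τ_R(·))) in Python. Every concrete choice in it (the adjacency-list input, the two toy descriptors, the linear g with fixed weights) is an illustrative assumption, not one of the actual descriptor sets or regression models of [1-7]:

    # Hypothetical sketch of the traditional pipeline T(.) = g(tau_E(tau_R(.))).
    def tau_r(molecule):
        # first subtask: extract a structural representation; here we assume
        # the hydrogen-suppressed adjacency list is stored in the input
        return molecule["adjacency"]

    def tau_e(graph):
        # second subtask: fixed, a-priori numerical descriptors; here simply
        # the atom count and the number of branching atoms
        n_atoms = len(graph)
        n_branching = sum(1 for nbrs in graph.values() if len(nbrs) > 2)
        return [n_atoms, n_branching]

    def g(descriptors, weights=(10.0, -5.0), bias=-50.0):
        # regression function; the weights would normally be fitted to data
        return bias + sum(w * d for w, d in zip(weights, descriptors))

    def predict_property(molecule):
        return g(tau_e(tau_r(molecule)))

    propane = {"adjacency": {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2"]}}
    print(predict_property(propane))

In the traditional setting both τ_R and τ_E are frozen by hand in this way; only g is fitted to the data.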

The traditional approaches can be ordered on the basis of their evolution toward ever more direct representations of the molecular structures. Summarizing, we can mention models based on physico-chemical properties [15-18], which may be obtained as combinations of fragment contributions, on topological indices [19,11] or matricial [20] graph representations, and finally a template-based approach [21]. This last model uses a neural network which partially mimics the chemical structures of the analyzed compounds by means of a common molecular template, statically defined for all the compounds.

The mathematical and computational tools used in QSPR/QSAR approaches are quite varied and include equation-based models [1,2] and neural network based models [22-24].

However, in traditional QSPR/QSAR, both τ_R and τ_E are defined a priori, i.e., they do not depend on the regression task. They therefore have to be designed through a very expensive trial-and-error approach in order to adapt them to the regression problem posed by the QSAR study. So, even if the chemical graph is clearly recognized as a flexible vehicle for the rich expression of chemical structural information, the problem of using it in a form directly amenable to QSAR analysis is still open.

In this chapter we propose to realize the τ_E function through an adaptive mapping, thus allowing the automatic generation of numerical descriptors which are specific to the regression task to be solved. This can be done by using recursive neural networks [8], which are able to take as input directly the graph generated by τ_R and to implement both τ_E and g adaptively.

To exemplify the above concepts, Fig. 2 shows the outline of the proposed approach, assuming that a given molecule is represented by τ_R as a labeled tree¹. This tree-structured representation is then processed by a recursive neural network. The output of the recursive neural network constitutes the regression output, while the internal representations of the recursive neural network (i.e., the outputs of the hidden units) constitute the neural implementation of the numerical descriptors returned by τ_E. It must be stressed, at this point, that the recursive neural network does not need a fixed-size numerical vector as input for each input graph, as happens with the standard neural networks typically used in QSAR studies, because it is able to treat variable-size representations of the input graph. Moreover, since the encoding function (τ_E) is learned by the neural network together with the mapping function (g), the resulting numerical code represents the "best" numerical coding of the input graph for the given QSAR task.

We may observe that the main difference between the traditional QSAR scheme shown in Fig. 1 and the proposed new scheme of Fig. 2 is the automatic definition of the τ_E function, obtained by training the recursive neural network on the regression task. This implies that in the new scheme no a priori selection and/or extraction of features or properties by an expert is needed for τ_E.

To fully grasp the mathematical model underpinning recursive neural networks within the context outlined in Fig. 2, it is crucial to understand how the encoding function τ_E is computed for each input graph.

For the sake of exposition, in the following we assume that τ_R returns labeled trees, where the label attached to each node of the tree is a symbol representing, for example, an atom type or a molecular group. Since τ_E will be realized by a recursive neural network, these symbols need to be represented as numerical vectors. For example, a bipolar localist representation can be used to code (and to distinguish among) the types of chemical objects. In a bipolar localist representation each component of the vector is assigned to one entity and is equal to 1 if and only if the representation refers to that entity; otherwise it is set to -1.

¹ The definition of an appropriate function τ_R for the specific sets of molecules studied in this paper is discussed in Section 3.

[Figure 2: diagram — Molecule → Chemical Tree → Numerical Code (outputs of recursive hidden units) → (g) → Activity (output of recursive neural network)]

Fig. 2. The new QSAR scheme using recursive neural networks: the molecule, after a structural coding phase driven by ad hoc rules (τ_R), is directly processed by the recursive neural network through the adaptive encoding function τ_E. The internal representation developed by the recursive neural network is then used by the regression model implemented by the output part of the network (function g) to produce the final prediction (activity).

For example, assuming that the fluorine atom (F) is associated with the i-th component and the chlorine atom (Cl) with the j-th component, the fluorine atom is represented by the vector [-1, ..., -1, 1, -1, ..., -1] with the single 1 in the i-th position, while the chlorine atom is represented by [-1, ..., -1, 1, -1, ..., -1] with the single 1 in the j-th position.
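A minimal sketch of this bipolar localist code, assuming a small six-symbol alphabet, could read:

    # Bipolar localist code: one component per symbol, +1 for the symbol
    # being encoded and -1 everywhere else; the alphabet is an assumption.
    SYMBOLS = ("H", "C", "N", "O", "F", "Cl")

    def bipolar_localist(symbol, symbols=SYMBOLS):
        return [1 if s == symbol else -1 for s in symbols]

    print(bipolar_localist("F"))   # -> [-1, -1, -1, -1, 1, -1]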

The computation of τ_E is a progressive process which starts from the leaves of the input tree and terminates at the root, where a numerical code for the whole tree is generated. Specifically, the coding process starts at the leaf level by producing, step by step, a code for each visited leaf node and by storing these codes as state information for each corresponding leaf. Subsequently, the internal nodes are visited, from the frontier to the top of the tree. For each visited node, its numerical label and the codes already computed for its children (stored in the state) are used to compute the code for the current node. Since this computation is performed in the same way for all the nodes in the tree, the generated codes are all constrained to be of the same size. Finally, the code computed for the root is used as the numerical code for the whole tree. The encoding function τ_E can therefore be seen as a state transition function. Note that for leaf nodes the process starts with a null state, because there is no previous information from descendants.

In Fig. 3 we exemplify the above visit on an input tree whose labels are not explicitly represented. First the leaves (nodes 4, 5, 6 and 7) are visited and the corresponding codes are generated. Then node 3 is visited and a code for it is produced, taking into account its label and the codes generated for its children, i.e., nodes 5, 6, and 7. Next, a code is computed for node 2 using the codes computed for (the subtrees rooted in) nodes 3 and 4, together with the label of node 2. Finally, the root node 1 is visited and its code, corresponding to the code for the whole tree, is generated. The different grey levels used to fill the tree nodes convey information about the time at which the code of each node is used as state information for the current node.

Fig. 3. The coding process: a code is progressively generated for each node by using the codes already produced for its descendants. Nodes colored with different grey levels denote the time at which the code of each node is used as state information for the current node: e.g., the code for node 2 is generated by using the codes generated for nodes 3 and 4 (in addition to the numerical label attached to node 2).

Note that the way the encoding function acts on a specific tree, such as the tree in Fig. 3, is specified in terms of how the encoding function acts on the sub-trees of each node. In this sense the encoding is "recursive". Moreover, the encoding is stationary and causal: stationary means that the computation that produces the code is the same for all the nodes, while causal means that the computation of each code depends only on the current node and the nodes descending from it.
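A compact sketch of this stationary, causal coding process is the post-order visit below; transition stands for one step of the (learned) encoding function τ_E and is left abstract here:

    # Bottom-up (post-order) coding of a labeled tree: the same transition
    # function is applied at every node (stationarity), and each code
    # depends only on the node and its descendants (causality).
    class Node:
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

    def encode(node, transition, k, null_state):
        codes = [encode(child, transition, k, null_state)
                 for child in node.children]
        codes += [null_state] * (k - len(codes))   # void subtrees -> null state
        return transition(node.label, codes)       # code of the whole subtree

Calling encode on the root returns the code for the whole tree, i.e., exactly the value that τ_E assigns to it.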

Concerning the regression function g, it takes as input the code generated by τ_E for the root of each input tree and returns the desired value associated with the tree.

2.2 The Recursive Neural Network Model

At this point we formally provide a proper instantiation of the input and output domains for the encoding and the output functions.

Let the structured input domain for τ_E, denoted by 𝒢, be the set of labeled directed ordered acyclic graphs (DOAGs) produced by applying τ_R to the input dataset of molecules. By a DOAG we mean a DAG in which, for each vertex, a total order is defined on the outgoing edges. Moreover, let us assume that the graphs in 𝒢 have bounded out-degree. Labels are tuples of variables attached to the vertices; let ℝ^n denote the label space. The descriptor (or code) space is chosen as ℝ^m, while the output space, for our purposes, is defined as 𝒪 = ℝ.

Finally, we define the encoding function as

    τ_E : 𝒢 → ℝ^m    (5)

and the output function g as

    g : ℝ^m → 𝒪.    (6)

The use of a stationary and causal model for τ_E allows us to choose a uniform and quite simple neural realization for each step of τ_E, through the definition of a recursive neural network model. In order to process each node, the recursive neural network uses the information available at the current node: i) the numerical label attached to the node; ii) the numerical code for each subgraph of the node (state information).

As a result, if k is the maximum out-degree of the DOAGs in 𝒢, the recursive neural network, for each step of τ_E, takes its input from the space

    ℝ^n × ℝ^m × ··· × ℝ^m    (with ℝ^m repeated k times)

and produces a code in ℝ^m. Let us consider, for example, a recursive neural network with m hidden neurons. Given the currently visited node, the output x ∈ ℝ^m of the hidden neurons (i.e., the code for the current node) is computed as follows:

    x = F( W l + ∑_{j=1}^{k} Ŵ_j x^(j) + θ )    (7)

where l ∈ ℝ^n is the label (external input) attached to the current node, W ∈ ℝ^{m×n} is the weight matrix associated with the label space, Ŵ_j ∈ ℝ^{m×m} is the recursive weight matrix associated with the j-th subgraph code, x^(j) ∈ ℝ^m is the code computed for the j-th subgraph of the current node, θ ∈ ℝ^m is the bias vector, and F(y)_i = f(y_i), where f(·) is a sigmoidal non-linear function.

Specifically, let us consider what happens for a single recursive neuron, i.e., with m = 1. The simplest non-linear neural realization of the recursive model is given by

    x = f( ∑_{i=1}^{n} w_i l_i + ∑_{j=1}^{k} ŵ_j x^(j) + θ )    (8)

where f is a sigmoidal function, the w_i are the weights associated with the label space, the ŵ_j are the weights associated with the subgraph spaces, θ is the bias, l is the current input label, x^(1), ..., x^(k) are the encoded representations of the subgraphs, and x is the encoding of the current structure. A graphical representation of the single recursive neuron is given in Fig. 4.

Using equation (7), the recursive hidden neurons can realize each step of τ_E. Finally, in the simplest case, the output mapping function g(·) is realized by a single standard neuron with m inputs.
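As an illustration, a single step of equation (7) can be written down directly. The sketch below uses random placeholder weights (in the real model W, Ŵ_j and θ are learned) and tanh as the sigmoidal function; it can be plugged in as the transition of the earlier coding sketch:

    import numpy as np

    n, m, k = 3, 2, 3                      # label size, code size, max out-degree
    rng = np.random.default_rng(0)
    W = rng.normal(size=(m, n))            # weights for the label space
    W_hat = rng.normal(size=(k, m, m))     # one recursive matrix per subgraph
    theta = np.zeros(m)                    # bias vector

    def step(label, child_codes):
        # child_codes: k vectors in R^m, null vectors for void subgraphs
        s = W @ label + theta
        for j, x_j in enumerate(child_codes):
            s += W_hat[j] @ x_j
        return np.tanh(s)                  # F applies the sigmoid elementwise

    # e.g. the code of a leaf whose label is H = [1, -1, -1]:
    x_leaf = step(np.array([1.0, -1.0, -1.0]), [np.zeros(m)] * k)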


Fig. 4. A graphical representation of a recursive neuron.

The neural encoding process of an input graph can be represented graphically by replicating the same recursive neurons (through the input graph) and connecting these replicas according to the topology of the input graph. In this way we obtain the so-called encoding network. Examples of encoding networks for m = 1 are shown in Fig. 5. The examples involve two substituents (-CH3 and -CH2-CF3) of the benzodiazepine class of molecules studied in this work. More complete examples, based on the same substituents, are given in Fig. 6, where two neurons are involved (m = 2) and a representation of the numerical vectors with n = 3 for the encoding of the symbols is reported. For the sake of simplicity, the labels shown there represent only the three different atoms involved in these examples (i.e., H is represented by [1, -1, -1], C by [-1, 1, -1], and F by [-1, -1, 1]).

The encoding network is a feedforward network that mimics the topology of the molecular graph; for each input graph a corresponding encoding network is built. There is a correspondence between graph nodes and units of the encoding network; however, the template used to encode the molecular graph is not fixed a priori, as happens in the template-based approach used in [21]. Notice that the weight matrices are shared by the different encoding networks (see Fig. 5), since the same recursive neurons are used to "visit" the nodes of different input graphs. This is a consequence of the use of a stationary model.

The neural network output for a given molecular graph is obtained by completing the corresponding encoding network with the neural realization of g(·). This completed network is trained on the regression task. Thus, both the weights of the hidden recursive neurons and the weights of the output neuron (realizing g(·)) are trained simultaneously on the training set. As a result of this joint training, the encoding of the molecular graph is adaptive, since it is computed on the basis of the specific regression task.

There are different ways to realize the recursive neural network [8]. In the present work we chose a constructive approach that allows the training algorithm to progressively add hidden recursive neurons during the training phase. The model is a recursive extension of Cascade Correlation based algorithms [25,26]. The resulting network has a hidden layer composed of recursive (hidden) units, which compute the values of τ_E (in ℝ^m) for each input DOAG, as shown in Figs. 5 and 6. The number of hidden units, i.e. the dimension m of the descriptor space, is computed automatically by the training algorithm, thus allowing an adaptive determination of the number and type of (numerical) descriptors needed for a specific QSPR/QSAR task. In the Cascade Correlation for structures (CCS) model, the function g is realized by a single standard linear output neuron. A complete description of the Cascade Correlation for structures algorithm, and a formulation of the learning method and equations, can be found in [27,8].

[Figure 5: chemical trees, recursive neuron and encoding networks for the fragments -CH3 and -CH2-CF3; black squares denote void pointers]

Fig. 5. Examples of encoding networks for the chemical fragments -CH3 and -CH2-CF3 with m = 1. The fragments are assumed to be represented by the chemical trees shown on the left side of the figure. The encoding networks are obtained by replicating (unfolding) the recursive neuron for each node in the chemical trees (as shown by the multiple occurrences of the weights). The black squares represent void pointers, which are encoded as null vectors (in this case, the void pointer is equal to 0). The labels, here represented as symbols, are supposed to be encoded through suitable numerical vectors. The output of each encoding network is the code computed for the corresponding chemical fragment.
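The constructive training loop itself can be summarized as follows. This is a deliberately high-level, pseudocode-like sketch rather than the authors' implementation, and every helper name in it (fit_linear_output, max_abs_error, new_recursive_unit, maximize_correlation) is hypothetical:

    # Constructive Cascade Correlation, schematically: hidden recursive
    # units are added one at a time; each candidate is trained to correlate
    # with the current residual error and then frozen, after which the
    # linear output unit is refitted on the enlarged code.
    def train_ccs(training_set, tolerance):
        hidden_units = []
        output_unit = fit_linear_output(hidden_units, training_set)
        while max_abs_error(output_unit, hidden_units, training_set) > tolerance:
            candidate = new_recursive_unit()
            maximize_correlation(candidate, hidden_units, training_set)
            hidden_units.append(candidate)      # the trained unit is frozen
            output_unit = fit_linear_output(hidden_units, training_set)
        return hidden_units, output_unit

The dimension m of the descriptor space is thus simply the number of iterations the loop performs before the training criterion is met.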

In summary, the hidden layer of a recursive network produces a numerical vectorial code (i.e., its internal representation) that represents the input molecular graph. In terms of QSPR/QSAR studies, we can imagine that each hidden recursive neuron calculates an adaptive topological index on the basis of the information supplied to the model (i.e., the training set). The outputs of the hidden units are arranged into a vector of these topological indices and used as input for a linear regression model realized by the output unit (the function g(·)), as shown in Fig. 2. It is important to stress that these topological indices are automatically developed by the neural network, since they arise from the training process as a function of the relationship between structures and the corresponding values of the target property. For this reason, they are developed independently of the domain knowledge.

[Figure 6: encoding networks with n = 3 and m = 2 for the fragments -CH3 and -CH2-CF3]

Fig. 6. Examples of encoding networks with n = 3 and m = 2 for the chemical fragments -CH3 and -CH2-CF3. The labels of the chemical trees represent the atom types: H is represented by [1, -1, -1], C by [-1, 1, -1] and F by [-1, -1, 1]. Void subgraphs are encoded by the null vector x_0. The output of each encoding network is the code computed for the corresponding chemical fragment (i.e., x_CH3 and x_CH2-CF3, respectively).

The advantage of this new approach is that it allows us to describe and process a molecular graph in a way that considers both the graph topology (connectivity) and the atom types (or the chemical functionalities). The use of a neural network to realize the encoding and regression functions yields a flexible prediction model. However, the use of a "black-box" approach to implement the encoding and the regression functions raises, especially for QSAR, the following issues:

• chemical meaningfulness of the numerical descriptors produced by the recursive neural network;

• relationship between the developed numerical codes and the qualitative aspects of the QSAR problem.

These issues were partially addressed in [14] by studying the internal representations developed by the recursive neural network trained on a specific family of benzodiazepines. Examples of such results are reported in Section 5.4.

A complete answer to these issues would allow the extraction of the knowledge learned by the neural network, laying the basis for a full understanding of the model by human experts and therefore permitting the assessment of the model as a new tool for the rational design of new molecules.

3 Representational Issues

The model presented here requires a specific type of representation of the molecular structure. The choice of the representation defines the function τ_R introduced in Fig. 2. Since the functions τ_E and g are automatically developed by the model, in the new QSPR/QSAR scheme the specification of τ_R is the only part available for the designer's tuning.

Molecular structural formulas have already been treated in the literature as mathematical objects (graphs), according to chemical graph theory. In our case, a representation of molecular structures in terms of DOAGs is required. The candidate representation should contain detailed information about the shape of the compound, the atom types, the bond multiplicity and the chemical functionalities, and it should retain a good similarity with the representations usually adopted in Chemistry.

When the molecular structure is represented as a DOAG, the main representational problems encountered are: (i) how to represent cycles, (ii) how to give a direction to the edges, and (iii) how to define a total order over the edges.

An appropriate description of the molecular structures analyzed in this work is based on a labeled tree representation.

For alkanes, where each carbon-hydrogen group can correspond to a node of the tree, the root of the tree can be determined as the first carbon-hydrogen group according to the IUPAC nomenclature system; cycles are absent, and the total order over the edges can be based on the size of the sub-compounds.

In the case of benzodiazepines, the major atom group that repeats unchanged throughout the class of analyzed compounds (the common template) constitutes the root of the tree². When other repeating atom groups exist in all the analyzed molecules, single atoms belonging to these groups do not need to be explicitly represented. Each atom that needs to be explicitly represented, and each repeating atom group, corresponds to a node of the tree. Each bond that needs to be explicitly represented corresponds to an edge. A label is associated with each node. Here these labels are just used to discriminate among different atoms (or atom groups) and do not contain any physico-chemical information. The use of DOAGs for the molecular description implies the loss of only minor structural information. At the present level of development of the model, cycles are usually treated as repeating atom groups, for which a single label is used. When different types of cycles are present at corresponding positions of the molecular structure throughout the class of analyzed compounds, different labels are used to describe them.

The representational scheme described above basically solves all the representational problems (i)-(iii). In fact, with reference to the benzodiazepines data set, concerning the first problem, since cycles mainly constitute a common shared template of the benzodiazepine compounds, it is reasonable to represent them as a single node whose attached label codifies information about their chemical nature³. The second problem is solved by using the major common template as the root of the tree representing a benzodiazepine molecule. Finally, the total order over the edges follows a set of rules mainly based on the size of the molecular fragments.

The rules that define the function τ_R according to the above ideas will be specified in the sections devoted to the two different tasks (alkanes, Section 4.2, and benzodiazepines, Section 5.2).

² An alternative representation, which the model is able to deal with, would be to explicitly represent each atom in the major atom group. However, since this group is repeated in all the compounds, no additional information is conveyed by adopting this representation.

³ We distinguish the different principal heterocycles or cycles that appear as substituents by using different labels.


4 QSPR Analysis of Alkanes

4.1 QSPR Task: Alkanes

To assess the true performance of standard neural networks in QSPR, they are usually tested on well-known physical properties. A typical example is the prediction of the boiling point of alkanes. The prediction task is well characterized for this class of compounds, since the boiling points of hydrocarbons depend upon molecular size and molecular shape, and vary regularly within a series of compounds, which means that there is a clear correlation between molecular shape and boiling point. Moreover, the relatively simple structure of these compounds is amenable to very compact representations such as topological indices and/or vectorial codes, which are capable of retaining the information relevant for the prediction. For these reasons, multilayer feed-forward networks using 'ad hoc' representations yield very good performances.

In order to perform a comparison with our method, we decided to use as reference point the work described in [11], which uses multilayer feed-forward networks. The data set used in [11] comprises all 150 alkanes with up to 10 carbon atoms. Cherqaoui et al. use a vectorial code representation of alkanes obtained by encoding the chemical graph (a tree) with suppressed hydrogens through an "N-tuple" code (see Fig. 7). Each component of the vectorial code, which in this case is of dimension 10, represents the number of carbon bonds of each carbon atom. The last components are filled with zeros when the number of atoms of the compound is less than 10. The uniqueness of the code is guaranteed by keeping a lexicographic order.
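As an illustration, the code can be derived as follows. The atom ordering used below (main chain first, then substituents in position order) is our assumption for this example; [11] fixes the ordering through the lexicographic criterion just mentioned:

    # N-tuple code of an alkane from its hydrogen-suppressed graph: the
    # degree of each carbon, listed in a canonical order, zero-padded to 10.
    def n_tuple_code(adjacency, atom_order, length=10):
        degrees = [len(adjacency[a]) for a in atom_order]
        return degrees + [0] * (length - len(degrees))

    # 4-ethyl-3-methylheptane: chain c1..c7, methyl m1 on c3, ethyl e1-e2 on c4
    adjacency = {
        "c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2", "c4", "m1"],
        "c4": ["c3", "c5", "e1"], "c5": ["c4", "c6"], "c6": ["c5", "c7"],
        "c7": ["c6"], "m1": ["c3"], "e1": ["c4", "e2"], "e2": ["e1"],
    }
    order = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "m1", "e1", "e2"]
    print(n_tuple_code(adjacency, order))  # -> [1, 2, 3, 3, 2, 2, 1, 1, 2, 1]

The printed tuple reproduces the code 1233221121 shown for this compound in Fig. 7.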

This representation of alkanes is particularly effective for the prediction of the boiling point, since it is well known that the boiling point is strongly correlated with the number of carbon atoms and the branching of the molecular structure. However, the same representation could be useless for a different class of compounds or a different task.

4.2 Representation of Alkanes

We observe that the hydrogen-suppressed graphs of alkane molecules are trees, and they can be represented as ordered rooted trees by the following minimal set of rules:

1. the carbon-hydrogen groups (H, C, CH, CH2, CH3) are associated with graph vertexes, while bonds between carbon atoms are represented by edges;

2. the root of the tree is defined as the first vertex of the main chain (i.e., the longest chain present in the compound) numbered from one end to the other according to IUPAC rules (the direction is chosen so as to assign the lowest possible numbers to side chains, resorting, when needed, to a lexicographic order); moreover, if there are two or more side chains in equivalent positions, instead of using the IUPAC alphabetical order of the radical names, we adopt an order based on the size of the side chains (i.e., the depth of the substructure);

3. the orientation of the edges follows the increasing levels of the tree;

4. the total order on the subtrees of each node is defined according to the depth of the substructure; we impose a total order on the three possible side chains occurring in our data set: methyl < ethyl < isopropyl.

[Figure 7: hydrogen-suppressed chemical graphs of 4-ethyl-3-methylheptane and 3-ethylhexane, with their vectorial codes 1233221121 and 1232211000]

Fig. 7. Example of derivation of the vectorial code (N-tuple) for two alkanes. The vectorial code is obtained starting from a chemical graph in which the hydrogen atoms are "suppressed". The numbers represent the degree of each node in the graph.

Examples of representations of alkanes are shown in Fig. 8. The complete list of the compounds, according to our representation, along with the targets and the output results, is reported in [13].

4.3 Experimental Results (Alkanes)

As the target output for the networks we used the boiling point in Celsius degrees, normalized into the range [-1.64, 1.74]. A bipolar localist representation encoding the atom types was used.

For the sake of comparison, we tested the prediction ability using exactly the same 10-fold cross validation (15 compounds in each fold) used in [11]. Moreover, we repeated the procedure four times. Learning was stopped when the maximum absolute error for a single compound was below 0.08.

[Figure 8: chemical representations and the corresponding tree representations of 3-ethyl-3-methylpentane and 2,2,4-trimethylpentane]

Fig. 8. Examples of representations for alkanes.

The results obtained for the training data are reported in Table 1 and compared with the results obtained by different approaches, i.e., the results obtained by Cherqaoui et al. using 'ad hoc' neural networks, two different equations based on connectivity (χ) topological indices, and multilinear regression over the vectorial code for alkanes. The results obtained on the test set are shown in Table 2 and compared with the MLP results obtained by Cherqaoui et al. For completeness, we report the cumulative results from a set of several trials of our model in row 3 of Table 2. It must be pointed out that the results are computed after removing the methane compound from the test set (for the MLP and CCS in Table 2), since it turns out to be an outlier. In particular, from the point of view of our new approach, which considers the structure of the compounds, methane (CH4) is so structurally small that it does not represent a typical element of the class of alkanes.

Model          #Units   Mean Abs. Error   R         S
CCS (Mean)     110.7    1.98              0.99987   2.51
Best MLP       7        2.22              0.99852   2.64
Top. Index 1   -        -                 0.9916    6.36
Top. Index 2   -        -                 0.9945    5.15
MLR            -        -                 0.9917    6.51

Table 1. Results obtained for alkanes on the training data by Cascade Correlation for structures (CCS), by Cherqaoui et al. using 'ad hoc' neural networks (MLP), by topological indices, and by multilinear regression (MLR). The data are obtained by a 10-fold cross-validation with 15 compounds in each fold. The correlation coefficient (R) and the standard deviation of error (S) are reported.

The results are presented in full, with the residual errors for each compound, in [13]. Examples of training and test curves for two different instances of Cascade Correlation networks trained on the same fold are shown in Fig. 9.

Page 291: [Studies in Fuzziness and Soft Computing] Soft Computing Approaches in Chemistry Volume 120 ||

QSPR/QSAR by Neural Networks for Structures 283

Model      Mean Abs. Error   Max Abs. Error   R        S
Best MLP   3.01              10.42            0.9966   3.49
Best CCS   2.74              13.27            0.9966   3.5
Mean CCS   3.71              30.33            0.9917   5.43

Table 2. Results obtained for alkanes on the test data by Cascade Correlation for structures (CCS) and by 'ad hoc' neural networks (MLP). The data are obtained by a 10-fold cross-validation with 15 compounds in each fold. The last row of the table is computed over four different cross-validation evaluations.

[Figure 9: two plots of mean error versus number of hidden units, for the training and test sets]

Fig. 9. Mean training and test error for two different instances of Recursive Cascade Correlation networks trained on the same fold. The mean error is plotted versus the number of inserted hidden units.

5 QSAR Analysis of Benzodiazepines

5.1 QSAR Task: 1,4-benzodiazepin-2-ones

Because of their strong therapeutic interest [10] and the multiplicity of SAR studies of this class of compounds, benzodiazepines (Bz) were chosen as the starting application domain for QSAR analysis. At this stage, a group of 1,4-benzodiazepin-2-ones, previously studied by Hadjipavlou-Litina and Hansch [10] through traditional QSAR equations, was selected for testing our model, the evaluation of the method being the initial step of its application. The task is the prediction of the non-specific activity (affinity) towards the Bz/GABAA receptor. The affinity can be expressed as the logarithm of the reciprocal of the drug concentration C (mol/liter) able to give a fixed biological response⁴.

The data set analyzed by Hadjipavlou-Litina and Hansch (see Table 2 of [10]) is characterized by a good molecular diversity, and this last requirement makes it particularly significant for QSAR analysis. The total number of molecules analyzed was 77. The complete list of the compounds, the training and test sets used, and the output results are reported in [14].

All the molecules present a common template consisting of the Bz nucleus (in three compounds the A ring of the Bz nucleus consists of a thienyl instead of a phenyl group), and they differ in a variety of substituents at the positions shown on the left side of Fig. 10.

⁴ In order to characterize the fixed response, the drug concentration able to give half of the maximum response (IC50) is commonly used.

5.2 Representation of Benzodiazepines

The labeled tree representation of a Bz is obtained by the following minimal set of rules:

1. the root of the tree represents the Bz nucleus;

2. the root has as many subtrees as there are substituents on the Bz nucleus, sorted according to the order conventionally followed in Chemistry (the standard IUPAC numbering of substituent positions);

3. each explicitly represented atom (or other common atomic group) of a substituent corresponds to a node, and each explicitly represented bond⁵ to an edge; the root of each subtree representing a substituent is the atom directly connected to the common template, and the orientation of the edges follows the increasing levels of the tree;

4. different atoms (or other common atomic groups) are represented by different labels, and a label is associated with each node in the trees;

5. the total order on the subtrees of each node is hierarchically defined according to: i) the subtree's depth, ii) the number of nodes of the subtree, iii) the atomic weight of the subtree's root (a sketch of this ordering is given below).

⁵ The multiplicity of the bond is implicitly encoded in the structure of the subtree.
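Rule 5's three-level comparison can be sketched as a sort key, reusing the Node class from the coding sketch of Section 2.1; the ATOMIC_WEIGHT table below is a truncated, illustrative assumption:

    # Hierarchical total order of rule 5: compare subtrees by depth, then
    # by number of nodes, then by the atomic weight of the subtree's root.
    ATOMIC_WEIGHT = {"H": 1.0, "C": 12.0, "N": 14.0, "O": 16.0,
                     "F": 19.0, "Cl": 35.5, "Br": 79.9, "I": 126.9}

    def depth(node):
        return 1 + max((depth(c) for c in node.children), default=0)

    def n_nodes(node):
        return 1 + sum(n_nodes(c) for c in node.children)

    def order_key(node):
        return (depth(node), n_nodes(node),
                ATOMIC_WEIGHT.get(node.label, 0.0))

    # the children of every node are then sorted with:
    #     node.children.sort(key=order_key)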

In the analyzed data set, different labels are used for the following atoms: C, N, O, F, Cl, Br, I, H. Moreover, we use a distinct label for each of the following atomic groups: bdz (the Bz nucleus), bdztg (the Bz nucleus where the A ring is a thienyl group instead of a phenyl one), and ph, py, cya, naf for fragments of Phenyl, 2-pyridyl, Cyclohexenyl, Cyclohexyl and Naphthyl. For the labels we use a bipolar localist representation, as shown in Section 2.

Examples of representations of benzodiazepines (or substituents) which comply with the above rules are shown in Fig. 10 (compound #60 in Table 5 in the Appendix) and in Fig. 5.

5.3 Experimental Results (Benzodiazepines)

In this section we briefly summarize the experimental results obtained for the QSAR task [13,14].

For the analysis of the data set described in Section 5.1, four different splittings of the data into disjoint training and test sets were used (Data sets I, II, III, and IV, respectively). Specifically, the first test set (5 compounds) was chosen because it contains the same compounds used by Hadjipavlou-Litina and Hansch. The second data set is obtained from Data set I by removing 4 racemic compounds from the training set and one racemic compound from the test set. This allows the experimentation of our approach without the racemic compounds, which are commonly recognized to introduce ambiguous information. The test set of Data set III (5 compounds) was selected because it simultaneously shows a significant molecular diversity and a wide range of affinity values. Furthermore, the included compounds were selected so that substituents already known to increase the affinity at given positions appear in turn in place of H-atoms, which allows the decoupling of the effect of each substituent. Thus, good generalization on this test set means that the network is able to capture the aspects relevant for the prediction. The test set of Data set IV (4 compounds) was chosen randomly, so as to test the sensitivity of the network to different learning conditions. Training set III, with the numbering of the molecules used, is reported in Table 5 in the Appendix.

[Figure 10: the common Bz template with substituent positions R1, R2', R3, R5, R6, R6', R7, R8, R9; the example compound carries R1=CH3, R2'=F, R3=H, R5=Ph, R6=H, R6'=H, R7=NHOH, R8=H, R9=H]

Fig. 10. Example of the representation of a benzodiazepine.

As the target output for the networks we used log(1/C). Six trials were carried out for the simulations involving each of the different training sets. The initial connection weights used in each simulation were set randomly. Learning was stopped when the maximum error for a single compound was below 0.4. This tolerance is well below the minimal tolerance needed for a correct classification of active drugs.

The main statistics computed over all the simulations on the training sets are reported in Table 3. Specifically, the results obtained by Hadjipavlou-Litina and Hansch, as well as the results obtained by the null model, i.e., the model in which the expected mean value of the target is used as the prediction, are reported in the first and second rows, respectively. For each data set, statistics on the number of inserted hidden units are reported for the Cascade Correlation for structures network. The mean absolute error (Mean Abs. Error), the correlation coefficient (R) and the standard deviation of error (S), as defined in regression analysis, are reported in the last three columns. Note that the Mean Abs. Error, R and S for Cascade Correlation for structures are obtained by averaging over the six trials performed; the minimum and maximum values of the mean absolute error over these six trials are reported as well.


The results for the corresponding test sets are reported in Table 4. In the case of small test data sets the correlation coefficient is not meaningful, so we prefer to report the maximum absolute error on the test data (Max Abs. Error), calculated as the average over the six trials, together with the corresponding minimum and maximum values of the maximum absolute error obtained in each trial.

In Figures 11 and 12 we plot the output of the network versus the desired target for data sets I and III. Moreover, for the sake of comparison, in Fig. 11 the output obtained using an equational approach [10] on data set I is reported as well.

Each point referring to the neural network models in the plots represents the average output, together with the deviation range, as computed over the six trials (i.e., the extremes of the deviation range correspond to the minimum and maximum output values computed over the six trials for each compound).

Training Set   Mean #Units (Min-Max)   Mean Abs. Error (Min-Max)   R         S
HLH            -                       0.311                       0.847     0.390
Null model     -                       0.580                       0         0.702
Data set I     29.75 (23-40)           0.090 (0.066-0.114)         0.99979   0.127
Data set II    34.0  (27-38)           0.087 (0.080-0.102)         0.99982   0.117
Data set III   19.7  (18-22)           0.087 (0.072-0.105)         0.99985   0.098
Data set IV    16.5  (13-20)           0.099 (0.078-0.132)         0.99976   0.131

Table 3. Results obtained for benzodiazepines on training data set I by Hadjipavlou-Litina and Hansch (HLH, first row), by a "null model" (second row), and on all the training data sets by Cascade Correlation for structures. The mean absolute error, the correlation coefficient (R) and the standard deviation of error (S) are reported.

Test Set | Data # | Mean Abs. Error (Min-Max) | Mean Max Abs. Error (Min-Max) | S
HLH          | 5 | 1.272               | 1.750               | 1.307
Null model   | 5 | 1.239               | 1.631               | 1.266
Data set I   | 5 | 0.720 (0.611-0.792) | 1.513 (1.106-1.654) | 0.842
Data set II  | 4 | 0.546 (0.444-0.653) | 0.727 (0.523-0.973) | 0.579
Data set III | 5 | 0.255 (0.206-0.325) | 0.606 (0.433-0.712) | 0.329
Data set IV  | 4 | 0.379 (0.279-0.494) | 0.746 (0.695-0.763) | 0.460

Table 4. Results obtained for benzodiazepines on test data set I by Hadjipavlou-Litina and Hansch (HLH, first row), by a "null model" (second row) and on all the test data sets by Cascade Correlation for structures. The mean absolute error, the mean of the maximum of the absolute error, and the standard deviation of error (S) are reported.


Fig. 11. Output of the models proposed by Hadjipavlou-Litina and Hansch (left) and of the Cascade Correlation for structures network (CCS) (right) versus the desired target; both models use the same training and test sets (data set I). Each point in the right plot represents the mean expected output of the Cascade Correlation network, together with the deviation range (minimum and maximum values), as computed over six trials. The tolerance region is shown on the plots.

Fig. 12. Output of the Cascade Correlation network (CCS) versus the desired target for data set III. Each point in the plot represents the mean expected output of the Cascade Correlation network, together with the deviation range (minimum and maximum values), as computed over six trials. Note that the test data are spread across the input range.

Due to the small number of training examples we considered various learning strategies in order to avoid or mitigate the overfitting problem. The adopted strategies are fully described in [13] and [14]. Basically, we control the gains of the sigmoids and the increase of the weight values through an incremental strategy on the number of training epochs for each newly inserted hidden node. The improvement in the learning behavior obtained with these strategies is analyzed in [14].


5.4 Internal Representation Analysis

In order to understand the degree to which the proposed model is able to capture relevant domain knowledge from the training data, we investigated the internal representations, i.e. the outputs of the hidden units, developed by the neural network trained with the selected set of benzodiazepines.

The outputs of the hidden units correspond to the encoding values generated for each compound or molecular fragment in the data set. Some of these fragments exactly correspond to the substituents attached to the main common template; other fragments are parts of the substituents and do not have any chemical meaning.

Since the information about the morphological characteristics of the chemical compounds is directly given as input to the model in the form of labeled trees, it is possible to perform a direct analysis of the numerical codes computed for each compound and its subcomponents.

For this investigation we performed a Principal Component Analysis (PCA) of the internal representations. Due to the relatively large dimensionality of the representational space (typically around 20-30 hidden units are inserted by the training algorithm), we studied 2-D plots of the first two principal components. The aim was to show, as a first approximation, the relative distance and position of internal representations and how they cluster within the representational space of the model. We expect the configurations of the points in the plots to approximately describe the knowledge learned by the neural network from the training data.
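As a minimal sketch of this step, the following Python fragment projects a matrix of hidden-unit activations onto its first two principal components using plain NumPy; the matrix H is a random stand-in for the actual network activations, and all names are illustrative.

```python
import numpy as np

# Hypothetical matrix of internal representations: one row per compound
# or fragment, one column per hidden unit (e.g. the 20-30 units inserted
# by Cascade Correlation); random data stands in for real activations.
rng = np.random.default_rng(0)
H = rng.normal(size=(60, 25))           # 60 items, 25 hidden units

# Centre the data, then obtain the principal components via SVD.
Hc = H - H.mean(axis=0)
U, s, Vt = np.linalg.svd(Hc, full_matrices=False)
scores = Hc @ Vt[:2].T                  # projection on the first two PCs

explained = (s ** 2) / np.sum(s ** 2)
print("variance explained by PC1, PC2:", explained[:2])
print("first item in PC space:", scores[0])
```

Plotting `scores` with the compound numbers as markers would produce 2-D maps of the kind shown in Figs. 13-15.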

From previous SAR studies, some positions of the Bz nucleus are recognized as the ones where substituents play a significant role in determining the biological activity, also in relation to their specific chemical characteristics: positions 1, 7 and 2' ([10] and references therein). Within the class of compounds analyzed, the above mentioned positions appear to be widely sampled.

In brief, the most important characteristics required for substituents at position 1 concern lipophilicity and steric hindrance, while those required for substituents at positions 7 and 2' (or 2' and 6') mostly concern the electronic effect. Lipophilicity (π = logP) and the electronic effect of the substituents (Hammett σ constant) constitute the most popular physico-chemical descriptors employed in the traditional equation based Hansch approach [1,2]. Substitutions at positions 6, 8, and 9 are known, instead, to decrease the affinity.

What we were interested in finding, through the analysis of the first two principal components, was the presence of clusters possibly containing molecules grouped according to a classification amenable to the two above mentioned descriptors. As a first approach we reduced the relevant molecular descriptors to very simple entities, in order to make the analysis as clear as possible. From this perspective we collected into a unique class the lipophilicity (π) and steric hindrance descriptors, and only considered an on-off classification (molecules with a hydrogen atom or molecules with substituents, mostly


lipophilic, at position 1). In the more detailed analysis reported in [14] we reduced the scale of the substituent effect values (Hammett σ scale) to a few sub-classes corresponding to the effect that the substituents produce on well known chemical reactions (electrophilic substitution in aromatic compounds). But for the results reported here we again considered an on-off classification, i.e. presence or absence of halogen atoms (F, Cl, Br, I). In fact, halogen atoms strongly affect the σ values.

We then focused our interest on the analysis of the molecules on the basis of the substituents at position 2', or 2' and 6'. We considered three cases: (i) molecules with only one halogen substituent (at position 2'), (ii) molecules with two halogen substituents (at the symmetrical 2' and 6' positions) and (iii) molecules bearing no halogen substituents at these positions.

The principal components of the internal representations developed by the Cascade Correlation for structures network (outputs of the recursive hidden neurons) were analyzed for all six experiments on data set III mentioned in Section 5.3.

A representative plot of the first two principal components is shown in Fig. 13. It shows the biologically active molecules analyzed (compounds associated with a target) and the relevant molecular fragments. Examples involving more experimental trials are described in [14].

Analysis of the plot shows that molecules and fragments are clustered on the basis of both morphological differences and specific chemical features that cannot be inferred directly from the observation of the molecular graph, but only from the association of molecular structures and targets.

The plot (see Fig. 13) appears to be split into two big clusters: all the substituents and molecular fragments approximately fall into its triangular upper right side, while all compounds to which a target is associated (molecules) approximately fall into its triangular lower left side.

The group containing compounds associated with a target is divided, in turn, into two sub-groups, highlighted in the plot shown in Fig. 13 by contour lines. On the left side we find all the molecules bearing a methyl substituent or other alkyl groups at position 1 of the Bz nucleus (the alkyl groups may be substituted in turn and may show bigger steric hindrance and/or different chemical features). In a central region of the plot we find all the molecules that bear no substituents at position 1. The small sub-group on the right side of the plot contains compounds characterized by a thienyl group, instead of the phenyl one, as the A ring of the Bz nucleus.

Both of the biggest clusters contain molecules divided in turn into smaller homogeneous sub-clusters on the basis of the presence of substituents at the other significant positions of the Bz nucleus previously mentioned.

In Fig. 14 we observe that each of the two big clusters identified in the previous plots is sub-clustered on the basis of the kind of atom or atomic group present at position 7. Compounds characterized by the presence of a halogen atom at position 7 are marked by little boxes, while little crosses are used to mark the remaining compounds. The sub-groups so identified


Fig. 13. Principal component analysis of the training compounds used in experiment I, derived from the 28 output values of the hidden neurons. Compounds characterized by R1 = H (left side of the plot) and compounds bearing a substituent at position 1 (lower side of the plot) are grouped by contour lines. The circled sub-cluster on the right side includes compounds where the A ring of the Bz nucleus is a thienyl group instead of a phenyl one. See Table 5 in the Appendix for compound numbering.

only partially overlap; mostly, it is possible to find regions of the plot where molecules characterized by one or the other kind of substituent prevail.

The plot shown in Fig. 15 allows us to focus the analysis on the presence and the type of substituent at positions 2' and 2'-6': once again quite homogeneous sub-groups were found. The sub-groups appear to overlap only slightly. Compounds characterized by the presence of only one halogen at position 2' are marked by little boxes, and compounds characterized by the simultaneous presence of halogens at positions 2' and 6' are marked by a cross within little boxes.

The analysis of positions 6, 8, and 9, shows sub-groups still characterized by a certain degree of homogeneity, as reported in [14].

It is noticeable that the differences among analogous plots showing the results obtained from distinct experiments (corresponding to different realizations of the model) only consist of rotations and/or translations of the clusters with respect to each other, i.e. the molecules are still homogeneously clustered on the basis of the substituent effects. For details see [14].


Fig. 14. An expanded view of the circled areas in Fig. 13. Compounds characterized by R7 = halogen are marked by little boxes; compounds where R7 is not a halogen are marked by times signs. Compounds bearing a halogen atom at position 7 appear to be located at the (left) lower side of each group.

6 Discussion

Regarding the evaluation of the performance of the proposed model on the benzodiazepines, the comparison with the results obtained by the traditional equational treatment shows a strong improvement in the fitting of the molecules included both in the training set and in the test set. The experimental results thus suggest a significant improvement over traditional QSAR techniques. Good results were obtained also for Data set III, where the most poorly predicted compound is the one bearing hydrogen atoms in place of substituents which play an important role in determining affinity. Finally, the soundness of the proposed model was confirmed by the experimental results obtained for Data set IV: the only compound which showed the maximum variance through the trials contains a naphthyl group as C ring, which never occurs in the training set. This explains the high variance observed in the prediction.

The ability of recursive neural networks to automatically discover useful numerical representations of the input structures at the hidden layer is the key feature of the adaptive solution to the QSAR task. By analyzing these representations through Principal Component Analysis, as expected, we found that the global distribution of molecules and fragments in the plots of the two


Fig. 15. An expanded view of the circled areas of the plot in Fig. 13. Compounds characterized by R2' = halogen are marked by boxes; compounds bearing halogen atoms both at positions 2' and 6' are marked by plus signs in boxes, and compounds where R2' and R6' are not a halogen are marked by times signs. Compounds bearing halogen atoms at positions 2', or 2' and 6', appear to be located at the (left) upper side of each group.

first principal components reflects the expected capability of the model in detecting homogeneous structural features that can be directly observed on the basis of the molecular morphology. However, the most remarkable aspect is that the distribution also reflects its ability in detecting similar characteristics of the substituents not directly related to the molecular morphology, such as the electronic effects produced by halogen atoms. It has to be recalled here that the halogen atoms are represented, and distinguished from each other, only by four different labels, which do not contain any evident information regarding their very homogeneous electronic properties.

The behavior of the model for the prediction of the boiling point of alkanes demonstrates its ability to be competitive with 'ad hoc' techniques. In fact, the obtained results compare favorably with the approach proposed by Cherqaoui et al., bearing in mind that their vectorial representation of alkanes retains the structural information which is known to be relevant to the prediction of the boiling point.

We would like to stress that the experimental results seem to confirm that our approach allows prediction, without substantial modifications, both


for QSAR and QSPR tasks, obtaining competitive or even better results than traditional approaches.

7 Conclusions

We have demonstrated that the application of neural networks for structures to QSAR/QSPR tasks allows the treatment of different computational tasks using the same basic representation of chemical compounds, obtaining improved prediction results with respect to traditional equational approaches for QSAR, and results competitive with 'ad hoc' designed representations and MLP networks in QSPR. It must be stressed that for QSAR no physico-chemical descriptors were used by our model; it is nevertheless still possible to use them by inserting them into the representation of the compounds.

The main advantage of the proposed approach with respect to topological indexes is that in our case no a priori definition of structural features is required. Specifically, since the learning phase involves both the encoding and the regression process, the numerical encodings of the chemical structures devised by the encoding network are optimized with respect to the prediction task. Of course, this is not the case for topological indexes, which need to be devised and optimized through a trial and error procedure by experts in the fields of application. Moreover, in our approach it is possible to store in the label attached to each node information at different levels of abstraction, such as atom types or functional groups, allowing a flexible treatment of different aspects of the chemical functionality.

The capability of the model to extract structural features which are significant for the target correlation is shown by the PCA of the internal representations. In this regard, the analysis of the principal components shows that the neural network used here for QSAR studies is capable of capturing, in most cases, the physico-chemical meaning of the above mentioned substituents even when the use of different labels does not allow a direct grouping of substituents into chemically homogeneous classes. Globally, we can observe that the characteristics of many substituents affecting the activity of benzodiazepines, already highlighted by previous QSAR studies, were correctly recognized by the model, i.e. the numerical code developed by the recursive neural network is effectively related to the qualitative aspects of the QSAR problem.

Concerning the comparison with approaches based on feedforward networks, the main advantage resides in the fact that the encoding of chemical structures does not depend on a fixed vectorial or template based representation. In fact, due to the dynamical nature of the computational model, our approach is able to adapt the encoding process to the specific morphology of each single compound.

Moreover, the generality of the compound representations used by our approach allows the simultaneous treatment of chemically heterogeneous compounds.


Finally, our approach must be regarded as a major step towards a fully structural representation and treatment of chemical compounds using neural networks.

References

1. C. Hansch, P.P. Maloney, T. Fujita, and R.M. Muir. Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194:178-180, 1962.

2. C. Hansch and T. Fujita. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc., 86:1616-1626, 1964.

3. S.M. Free Jr. and J.W. Wilson. A mathematical contribution to structure-activity studies. J. Med. Chem., 7:395-399, 1964.

4. L. H. Hall and L. B. Kier. Reviews in Computational Chemistry, chapter 9, The Molecular Connectivity Chi Indexes and Kappa Shape Indexes in Structure-Property Modeling, pp 367-422. VCH Publishers, Inc.: New York, 1991.

5. D. H. Rouvray. Should we have designs on topological indices? In R. B. King, editor, Chemical Applications of Topology and Graph Theory, pp 159-177. Elsevier Science Publishing Company, 1983.

6. V. R. Magnuson, D. K. Harris, and S. C. Basak. Topological indices based on neighborhood symmetry: Chemical and biological application. In R. B. King, editor, Chemical Applications of Topology and Graph Theory, pp 178-191. Elsevier Science Publishing Company, 1983.

7. M. Barysz, G. Jashari, R. S. Lall, V. K. Srivastava, and N. Trinajstic. On the distance matrix of molecules containing heteroatoms. In R. B. King, editor, Chemical Applications of Topology and Graph Theory, pp 222-230. Elsevier Science Publishing Company, 1983.

8. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Trans. on Neural Networks, 8(3):714-735, 1997.

9. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Trans. on Neural Networks, 9:768-785, 1998.

10. D. Hadjipavlou-Litina and C. Hansch. Quantitative structure-activity relationships of the benzodiazepines. A review and reevaluation. Chemical Reviews, 94(6):1483-1505, 1994.

11. D. Cherqaoui and D. Villemin. Use of neural network to determine the boiling point of alkanes. J. Chem. Soc. Faraday Trans., 90(1):97-102, 1994.

12. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Quantitative structure-activity relationships of benzodiazepines by recursive cascade correlation. In IEEE International Joint Conference on Neural Networks, pp 117-122, 1998.

13. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Application of cascade correlation networks for structures to chemistry. Journal of Applied Intelligence, 12:117-147, 2000.

14. A. Micheli, A. Sperduti, A. Starita, and A.M. Bianucci. Analysis of the internal representations developed by neural networks for structures applied to quantitative structure-activity relationship studies of benzodiazepines. Journal of Chemical Information and Computer Sciences, 41(1):202-218, January 2001.


15. T. Aoyama, Y. Suzuki, and H. Ichikawa. Neural networks applied to quantitative structure-activity relationships. J. Med. Chem., 33:2583-2590, 1990.

16. Ajay. A unified framework for using neural networks to build QSARs. J. Med. Chem., 36:3565-3571, 1993.

17. K. L. Peterson. Quantitative structure-activity relationships in carboquinones and benzodiazepines using counter-propagation neural networks. J. Chem. Inf. Comput. Sci., 35(5):896-904, 1995.

18. A. F. Duprat, T. Huynh, and G. Dreyfus. Towards a principled methodology for neural network design and performance evaluation in QSAR; application to the prediction of LogP. J. Chem. Inf. Comput. Sci., pp 854-866, 1998.

19. Shuhui Liu, Ruisheng Zhang, Mancang Liu, and Zhide Hu. Neural networks-topological indices approach to the prediction of properties of alkene. J. Chem. Inf. Comput. Sci., 37:1146-1151, 1997.

20. D. W. Elrod, G. M. Maggiora, and R. G. Trenary. Application of neural networks in chemistry. 1. Prediction of electrophilic aromatic substitution reactions. J. Chem. Inf. Comput. Sci., 30:447-484, 1990.

21. V. Kvasnicka and J. Pospichal. Application of neural networks in chemistry. Prediction of product distribution of nitration in a series of monosubstituted benzenes. J. Mol. Struct. (Theochem), 235:227-242, 1991.

22. James Devillers, editor. Neural Networks in QSAR and Drug Design. Academic Press, London, 1996.

23. J. Zupan and J. Gasteiger. Neural Networks for Chemists: an introduction. VCH Publishers, NY(USA), 1993.

24. J. A. Burns and G. M. Whitesides. Feed-forward neural networks in chemistry: Mathematical system for classification and pattern recognition. Chemical Reviews, 93(8):2583-2601, 1993.

25. S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pp 524-532. San Mateo, CA: Morgan Kaufmann, 1990.

26. S. E. Fahlman. The recurrent cascade-correlation architecture. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pp 190-196, San Mateo, CA, 1991. Morgan Kaufmann Publishers.

27. A. Sperduti, D. Majidi, and A. Starita. Extended cascade-correlation for syntactic and structural pattern recognition. In Petra Perner, Patrick Wang, and Azriel Rosenfeld, editors, Advances in Structural and Syntactical Pattern Recognition, volume 1121 of Lecture Notes in Computer Science, pp 90-99. Springer-Verlag, Berlin, 1996.

A Appendix

In the following, the training set used for the benzodiazepines in data set III is reported. We report in the tables the numbers associated with the compounds (not their fragments) as used in Fig. 13, Fig. 14, and Fig. 15.

Note that the C ring, located at position 5, is a phenyl group in all the analyzed compounds except in compounds 47, 108, 109, 111 and 113, where it is replaced by 2-pyridyl, cyclohexenyl, cyclohexenyl, cyclohexyl and naphthyl, respectively (marked by * in Table 5).


Table 5. Training Data Set III

# | * | R1 | R3/R6 | R7 | R8/R9 | R2' | R6' | Log 1/C

[Table body: for each numbered compound, the substituents at the positions above and the corresponding Log 1/C value (spanning approximately 6.2-8.9); the individual entries are not legible in this reproduction.]


Hybrid modeling of kinetics for methanol synthesis

Primoz Potocnik1, Igor Grabec1, Marko Setinc2, and Janez Levec2

1 University of Ljubljana, Faculty of Mechanical Engineering, Laboratory of Technical Physics, Askerceva 6, SI-1000 Ljubljana, Slovenia

2 National Institute of Chemistry, Laboratory for Catalysis and Chemical Reaction Engineering, Hajdrihova 19, SI-1000 Ljubljana, Slovenia

Abstract. Neural network based experimental and hybrid approaches to modeling of processes are presented. A hybrid model, combining a parametric model with a radial basis function network, is proposed. The parametric model is used for principal modeling of the process and the radial basis function network is applied for nonlinear error correction. Experimental modeling can be improved by selecting as model inputs only variables with high predictive importance. Two feature selection methods are presented: analysis of mutual information, and genetic algorithm based feature selection. The proposed methods are applied to modeling of the liquid-phase methanol synthesis. Analytical, experimental and hybrid approaches are applied, and the results demonstrate that a hybrid modeling approach, exploiting available analytical knowledge and experimental data, can considerably outperform a purely analytical approach.

1 Introduction

A conventional approach to modeling of chemical processes is mainly analytical. Such an approach is based on a priori knowledge of the properties of the underlying chemical process. In the absence of adequate knowledge, an experimental approach can be applied. Neural networks are a promising tool for experimental modeling, optimization and control of biochemical processes [1]. The analytical and experimental modeling approaches can also be combined in a hybrid manner, thus exploiting analytical knowledge and experimental information obtained by measurement. Several possibilities of combining neural networks with a priori knowledge have been suggested elsewhere [2].

The objective of this chapter is to present a parallel hybrid structure, combining a parametric model with a radial basis function network, and to apply the strategy to modeling of methanol synthesis kinetics. Large quantities of methanol are produced all around the world, so even a slight optimization of the process can result in a great financial benefit. Consequently, it is very important to accurately predict the production rate of the methanol synthesis. In this chapter, a conventional analytical modeling approach is extended in a hybrid manner, which considerably improves the prediction accuracy.


The chapter is organized as follows: an introduction to neural networks is presented in section 2, with a description of the radial basis function network and the generalized regression neural network. Section 3 describes a hybrid modeling approach with a parallel neural-parametric structure which combines an analytical model and a radial basis function neural network. Since the measured process variables vary in their predictive importance, a selection of informative variables is preferable. Two feature selection methods are described in section 4, namely genetic algorithm based feature selection and mutual information based feature selection. In section 5, the proposed methods are applied to modeling of methanol synthesis kinetics. Analytical, experimental and hybrid modeling approaches are compared in order to construct a model for the prediction of the methanol production rate. Conclusions are summarized in section 6, and appendix A gives a description of the analytical model (Lee et al. [3]) of methanol synthesis kinetics.

2 Neural networks

A Neural Network (NN) can be described as an interconnected assembly of simple processing elements, or neurons, whose functionality resembles that of a biological neuron. The processing ability of the network is stored in the inter-unit connection strengths, or weights, determined by a process of learning from a set of training patterns. An overview of neural networks can be found in [4,5]. Neural networks have a universal approximation ability [6,7] and are therefore suitable for experimental modeling of nonlinear systems. The construction of a neural network is based on learning, where the signals from the environment are used to set the weights of the network. Generalization ability is achieved by a properly designed network.

2.1 Radial basis function network

A radial basis function network (RBFN) is used for nonlinear error correction in the proposed hybrid modeling approach due to its local approximation properties. The RBFN, as proposed by Moody and Darken [8], is composed of neurons with radial basis activation functions. The structure of the RBFN is shown in Fig. 1, and the RBFN predictor is defined by:

\hat{y}(t) = \sum_{k=1}^{K} w_k \, g(\mathbf{x}, \mathbf{q}_k, \sigma),    (1)

where the predictor's output \hat{y}(t) is composed as a sum of the weighted activations g(·) of K neurons. A Gaussian function can be applied as the activation function of the neurons:

g(\mathbf{x}, \mathbf{q}_k, \sigma) = \exp\left( -\frac{D(\mathbf{x}, \mathbf{q}_k)^2}{2\sigma^2} \right),    (2)


Fig. 1. The structure of a radial basis function network: the inputs x1, x2, ..., xD feed neurons centered at stored prototypes; the radial activations are combined through the output weights to form the outputs.

with D(·) denoting the Euclidean distance between the input vector x and the memorized center q_k of the k-th neuron. σ denotes the width of a basis function and should be set properly in order to obtain suitable generalization ability. Fig. 2 shows function approximation with an RBFN, based on a noisy data set. Prediction results for three networks with different smoothing parameters σ are displayed.

Fig. 2. Function approximation with RBFN, based on noisy data, using smoothing parameters σ = {0.01, 1, 100}


Over-fitting occurs for σ = 0.01: the samples from the data set are well approximated, but the mapping lacks generalization ability. A proper value of σ = 1 yields a suitable function approximation with good generalization ability. Intensive smoothing with σ = 100 gives a poor fit to the original function. The average distance between the samples can be used as an initial estimate of σ, and the proper value can be obtained by using the cross-validation technique.
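The effect of the smoothing parameter can be reproduced in a few lines of Python. The sketch below fits the output weights of Eqs. (1)-(2) by least squares, which is one common way to train an RBFN with fixed centers; the data, the choice of every sample as a center, and the σ values are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def gaussian(x, q, sigma):
    # Eq. (2): radial basis activation around center q
    return np.exp(-((x - q) ** 2) / (2.0 * sigma ** 2))

def fit_rbfn(x, y, centers, sigma):
    # Design matrix of activations (Eq. (1)); output weights by least squares
    G = gaussian(x[:, None], centers[None, :], sigma)
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return w

def predict_rbfn(x, centers, w, sigma):
    return gaussian(x[:, None], centers[None, :], sigma) @ w

# Noisy samples of an underlying function (synthetic stand-in)
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

centers = x.copy()                       # every sample as a center
for sigma in (0.01, 0.3, 100.0):         # over-fit, suitable, over-smoothed
    w = fit_rbfn(x, y, centers, sigma)
    rms = np.sqrt(np.mean((predict_rbfn(x, centers, w, sigma) - y) ** 2))
    print(f"sigma={sigma:6.2f}  training RMS={rms:.3f}")
```

Note that the tiny σ gives a near-zero training RMS because it interpolates the noise; only validation on held-out data, as described next, reveals that this fit generalizes poorly.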

2.2 Generalized regression neural network

The generalized regression neural network (GRNN) [9], or conditional average estimator (CAE) [10], originates from the theory of statistical estimation and is based on nonparametric regression of the variable y on the independent variable x. The best predicted value for y is defined by its conditional expectation:

\hat{y}(\mathbf{x}) = \frac{\int y \, f(\mathbf{x}, y) \, dy}{\int f(\mathbf{x}, y) \, dy}.    (3)

The joint density function f(x, y) can be approximated by a multivariate Parzen kernel estimator, where the kernel can be expressed by the Gaussian function given in Eq. (2). Thus, the fundamental equation for the GRNN is obtained:

\hat{y}(\mathbf{x}) = \frac{\sum_{k=1}^{K} y_k \, g(\mathbf{x}, \mathbf{x}_k, \sigma)}{\sum_{n=1}^{K} g(\mathbf{x}, \mathbf{x}_n, \sigma)}.    (4)

The GRNN is a robust function approximator which extrapolates as a nearest-neighbor predictor. As in the RBFN, the kernel width σ acts as a smoothing parameter.
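A direct transcription of Eq. (4) into Python might look as follows; the synthetic data and the value of σ are illustrative, and the function is only a sketch of the GRNN idea, not the authors' implementation.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    """GRNN prediction, Eq. (4): a Gaussian-kernel-weighted
    average of the stored target values."""
    # Squared Euclidean distances between queries and stored samples
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian kernel, Eq. (2)
    return (K @ y_train) / K.sum(axis=1)

# Illustrative data (stand-in for measured process variables)
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.05 * rng.normal(size=50)

X_new = np.array([[0.2, -0.4], [0.9, 0.9]])
print(grnn_predict(X, y, X_new, sigma=0.3))
```

Because the kernels decay with distance, a query far from all stored samples is dominated by its nearest neighbors, which is the extrapolation behavior mentioned above.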

2.3 Cross-validation

The prediction accuracy of the model can be expressed by a root-mean-square error which measures the averaged error between the predicted values \hat{y} and the measured data y:

\mathrm{RMS} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2}.    (5)

When estimating the quality of the model, the generalization ability plays an important role besides the prediction accuracy. One is interested not only in how accurately the model approximates the learning data, but also in how well the model generalizes to new data. Cross-validation [11] is a method for estimating the model's quality and comprises the following steps:


1. The data set is split into several subsets.

2. One subset is removed from the data as a test set, and the other subsets are used to train the model. The model is then applied to predict on the test set.

3. The procedure is repeated on all subsets of the data and finally the prediction errors obtained on the test sets are averaged in order to estimate the generalization ability of the model.

If each test subset contains only one sample, the method is called leave-one-out cross-validation.
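A compact sketch of this procedure, computing the RMS error of Eq. (5) over the held-out folds, is given below. The linear least-squares model and the synthetic data are illustrative stand-ins; setting n_folds equal to the number of samples gives leave-one-out cross-validation.

```python
import numpy as np

def cross_validate(X, y, fit, predict, n_folds=20):
    """k-fold cross-validation: each fold is held out once, the model is
    trained on the rest, and the squared errors on the held-out data are
    pooled into one RMS estimate of the generalization error (Eq. (5))."""
    idx = np.arange(len(y))
    sq_errors = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        sq_errors.extend((predict(model, X[test]) - y[test]) ** 2)
    return np.sqrt(np.mean(sq_errors))

# Example with a simple linear least-squares model (hypothetical data,
# sized like the 106-sample database described later in the chapter)
rng = np.random.default_rng(3)
X = rng.normal(size=(106, 5))
y = X @ np.array([0.5, -1.0, 0.2, 0.0, 0.8]) + 0.1 * rng.normal(size=106)

fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
predict = lambda w, Xq: Xq @ w
print("cross-validated RMS:", cross_validate(X, y, fit, predict))
```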

3 Hybrid modeling

The aim of hybrid modeling is to integrate several types of knowledge into a unified approach. The integration of a priori analytical knowledge with neural networks into a hybrid structure is presented in this section. The proposed parallel parametric-neural structure is shown in Fig. 3. The hybrid model consists of an analytical model and a radial basis function network, connected in parallel.

Fig. 3. Hybrid parallel parametric-RBFN model, composed of an analytical (parametric) model and a radial basis function network whose outputs are summed to give the prediction y.

The motivation to combine parametric and neural models comes from the different properties of the two models. While neural networks have good approximation and interpolation properties, they show limited extrapolation capacity. On the other hand, parametric analytical models often show more robust extrapolation behavior but are not sufficiently accurate to describe the variations found in the measured data.

The parametric model is responsible for the principal modeling of the process and for extrapolation. When high quality a priori knowledge is available, the parametric model can be represented by an elaborate analytical model. In the absence of adequate analytical knowledge, linear regression can be used to build the parametric model. A neural network is connected in parallel to the parametric model and is used for nonlinear error correction. The objective of using a neural network based error corrector is to enhance the modeling accuracy in the domain where enough measured data exist. Outside this domain, the


neural network's contribution should be negligible in order to leave the extrapolation to the parametric model. For this purpose, a radial basis function neural network is suitable because of its local approximation properties. The hybrid modeling procedure consists of the following steps (a code sketch follows the list):

1. Formation of the parametric model by means of analytical modeling.

2. Calculation of the residual errors obtained by prediction with the parametric model. The residuals are partially caused by noise contained in the measured data, and partially by nonlinear effects not captured by the analytical model.

3. Training a radial basis function network to model the residual errors of the analytical model. By properly configuring the structure of the network, overfitting to noisy data can be avoided and only the residual nonlinear effects are incorporated.
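The three steps can be sketched in Python as follows. The coarse parametric model is a deliberately crude stand-in (in the chapter it would be the analytical kinetic model), and the RBFN is trained on its residuals with fixed centers and least-squares output weights; all names and data are illustrative.

```python
import numpy as np

def fit_hybrid(X, y, parametric_predict, sigma):
    """Hybrid parallel model: a given parametric model plus an RBFN
    trained on its residual errors (steps 1-3 above)."""
    residuals = y - parametric_predict(X)                # step 2
    centers = X.copy()                                   # one basis per sample
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(G, residuals, rcond=None)    # step 3

    def predict(Xq):
        d2q = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        correction = np.exp(-d2q / (2 * sigma ** 2)) @ w
        return parametric_predict(Xq) + correction       # parallel combination
    return predict

# Illustrative use: a coarse parametric model that misses a nonlinearity
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(80, 2))
y = 2 * X[:, 0] + np.sin(6 * X[:, 1])                    # "true" process
coarse = lambda Xq: 2 * Xq[:, 0]                         # parametric part only
hybrid = fit_hybrid(X, y, coarse, sigma=0.2)
print("parametric RMS:", np.sqrt(np.mean((coarse(X) - y) ** 2)))
print("hybrid RMS:    ", np.sqrt(np.mean((hybrid(X) - y) ** 2)))
```

Because the Gaussian bases are local, the correction term vanishes far from the training samples and the prediction falls back to the parametric model, as required for extrapolation.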

4 Feature selection

The measured process variables are usually not equally relevant for the prediction of the process states. For the modeling procedure it is preferable to use a small set of input variables, because many variables increase the complexity of the model. It is therefore important to select the most informative variables and to reject signals which have no predictive importance. Two feature selection methods are described in this section, namely genetic algorithm based feature selection and mutual information based feature selection. First, a brief introduction to genetic algorithms is given.

4.1 Genetic algorithms

Genetic algorithms (GA) are exploratory search and optimization techniques based on the principles of natural evolution and population genetics. The basic concepts were proposed by Holland [12], and a comprehensive overview can be found in [13,14]. The ability of GA to search efficiently in large spaces makes them, compared with more conventional optimization algorithms, robust with respect to the complexity of the optimization problem. A flow diagram of the GA procedure is shown in Fig. 4. The procedure of problem solving by genetic algorithms includes the following steps:

1. encoding solutions;
2. specifying the fitness function;
3. defining the genetic operators;
4. iterating the simulated evolution.


Single solutions for solving the problem are encoded in chromosome-like data structures (e.g. arrays, trees or lists) which belong to a certain alphabet, such as binary, real, or any other. The set of possible solutions, or individuals, is called a population. The GA evaluates the performance of the individuals by using a fitness function which indicates how good an individual is in terms of solving the problem. The specification of the fitness function has an important influence on the performance of the genetic algorithm and on the solution of the problem.

The GA evaluates a population of solutions and then generates a new population for the next step of the iterated evolution. The new population is created by applying the genetic operators:

Selection enables reproduction of the individuals with higher fitness values. The original fitness value is modified by the process of ranking to prevent a premature convergence of the population. Fitter individuals get a higher probability to mate, and their genetic material is exploited.

Crossover exchanges genetic material between the selected individuals. Crossover consists of merging the chromosomes of two individuals (parents) to obtain two new individuals (children). This reordering of the genetic material includes the effects of both exploration and exploitation.

Mutation introduces new genetic variations by randomly changing the individuals. This operator provides additional exploration of the search space.

Fig. 4. Flow diagram of a genetic algorithm

After the specification of the encoding scheme, the fitness function and the genetic operators, the evolution is simulated. Through successive iterations, new generations of individuals are created until the termination condition is met, which stops the evolution. The termination condition can be defined as a maximum number of generations, or as a convergence measure.

4.2 Genetic algorithm based feature selection

Genetic algorithms can be used as a search method in model based feature selection [15]. Model based feature selection methods train and apply


an experimental model to evaluate the predictive importance of a selected set of variables. A search method is needed to browse through various combinations of input variables. The method is suitable for problems with many input variables, nonlinear relations between variables, or when only a limited number of samples is available.

A mask S_x is introduced to represent the selected inputs. The mask is composed of a binary string, with single bits determining inclusion ("1") or exclusion ("0") of the corresponding input variable. An example of a binary mask operating on a set of 10 input variables is shown in Fig. 5.

Input variables: X = {x1, x2, ..., x10}

Mask: S_x = | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |

Selected inputs: X_s = {x2, x5, x6, x9}

Fig. 5. The application of a binary mask S_x for the selection of a subset of variables. The initial set of input variables X is mapped to a subset X_s which contains only the selected variables.

The subset of input variables X_s is evaluated by the objective function f_o(X_s), which returns a scalar value indicating the predictive importance of the evaluated subset. The goal is to find a small subset of informative input variables which renders possible the construction of a reliable model M with good generalization properties. The objective function is defined by

f_o(X_s) = g(M \mid X_s) \, h(L_1, \alpha)    (6)

where the functions g(·) and h(·) have the following meaning:

g – the generalization error of the model M built with the subset of input variables X_s. The generalization error is estimated by a cross-validation procedure.

h – a penalty function limiting the number of included variables L_1. The α-value in the interval [0,1] determines the percentage by which the generalization error g is increased if all inputs are selected. An example of a penalty function h(L_1, α) is described by the equation:

h(L_1, \alpha) = 1 + \alpha \, \frac{L_1}{L},    (7)

where L denotes the number of all input variables and L_1 the number of selected input variables. Fig. 6 shows the penalty function for the α-values α = 0.1 and α = 0.5.


Fig. 6. The penalty function h(L_1, α) is used to increase the return value of the objective function (Eq. 6) if many input variables are selected. The penalty curves for the α-values α = 0.1 and α = 0.5 with respect to the number of selected variables L_1 are shown.

The selection of inputs by the genetic algorithm consists of the following steps (a sketch of the whole loop follows the list):

1. An initial population of masks {S_x(n), n = 1, 2, ..., N_p} is generated. N_p denotes the size of the population.

2. The initial selectors are evaluated by the objective function f_o (Eq. 6), where the model is built and validated by the cross-validation method.

3. Numeric optimization of the initial binary masks is performed by the genetic algorithm. The binary mask which achieves the minimum of the objective function f_o determines the most informative subset of input variables.
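A self-contained sketch of the whole loop is given below: binary masks are evolved by selection, one-point crossover and bit-flip mutation, and each mask is scored by the cross-validated RMS of a simple linear model multiplied by the penalty of Eq. (7). The linear model, the synthetic data and all function names are illustrative stand-ins for the networks and measurements used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(5)

def cv_rms(X, y, mask, n_folds=5):
    """Cross-validated RMS of a linear model on the masked inputs."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return np.inf
    idx = np.arange(len(y))
    errs = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        A = np.c_[X[train][:, cols], np.ones(train.size)]
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[X[test][:, cols], np.ones(test.size)] @ w
        errs.extend((pred - y[test]) ** 2)
    return np.sqrt(np.mean(errs))

def fitness(X, y, mask, alpha=0.1):
    # Objective of Eq. (6) with the penalty of Eq. (7)
    return cv_rms(X, y, mask) * (1.0 + alpha * mask.sum() / mask.size)

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05):
    L = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, L))
    for _ in range(generations):
        scores = np.array([fitness(X, y, m) for m in pop])
        pop = pop[np.argsort(scores)]           # rank the individuals
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.choice(len(parents), 2, replace=False)
            cut = rng.integers(1, L)            # one-point crossover
            child = np.r_[parents[i][:cut], parents[j][cut:]]
            flip = rng.random(L) < p_mut        # bit-flip mutation
            child[flip] ^= 1
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(X, y, m) for m in pop])
    return pop[np.argmin(scores)]

# Hypothetical data: only the first three inputs carry information
X = rng.normal(size=(106, 5))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.3 * X[:, 2] + 0.1 * rng.normal(size=106)
print("selected mask:", ga_select(X, y))
```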

4.3 Mutual information based feature selection

Mutual information is a measure of the connection between input and output variables which describes the amount of uncertainty in the system output Y that is resolved by knowing the input X. Mutual information is not limited to linear relations and is model independent. It is expressed by:

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)},    (8)

where x and y denote discrete values of the random variables X and Y. The probability and joint probability functions are denoted by p(x) and p(x, y). Statistically independent variables X and Y result in I(X; Y) = 0. Variables which are more connected have a higher value of I(X; Y). Consequently, mutual information can be used to estimate the predictive importance of input variables. Inputs which have higher mutual information with the outputs presumably also have higher predictive importance.
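For continuous measurements such as the process variables considered here, Eq. (8) can be estimated by discretizing the data into histogram bins. The sketch below does this with NumPy; the bin count and the synthetic data are illustrative choices, not the chapter's estimator settings.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of Eq. (8) for two continuous variables."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                    # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = pxy > 0                             # skip empty bins (log 0 terms)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

# Illustrative check: a dependent pair scores well above an independent one
rng = np.random.default_rng(6)
t = rng.normal(size=2000)
print("dependent:  ", mutual_information(t, t + 0.3 * rng.normal(size=2000)))
print("independent:", mutual_information(t, rng.normal(size=2000)))
```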


5 Modeling of methanol synthesis kinetics

5.1 Introduction

Methanol (CH3OH) synthesis is one of the oldest industrial processes, and methanol itself is one of the largest commodity chemicals produced in the world. Research and development interests in this area have been stimulated particularly by recent concern over increasing energy costs and the future availability of petroleum feedstock. Furthermore, the demand for a clean burning fuel has been increasing from the global point of view of environmental aspects. Methanol can be synthesized from simple molecules such as CO, H2, H2O, and CO2, which, since CO2 is a waste product in energy production, is a particularly attractive option. Besides fuel, methanol has also been used in a variety of applications as a feedstock for other chemicals (formaldehyde, acetic acid, ...) and for other direct uses as a solvent, antifreeze, inhibitor, or substrate.

5.2 Problem definition

As large quantities of methanol are produced around the world, even a slight optimization of the process can result in a great financial benefit. It is therefore of great importance to model the rate of methanol synthesis in the most accurate way. Consequently, our goal is to construct a model for the prediction of the production rate r_CH3OH based on the following input variables:

• temperature T,
• partial pressures of the reactants P_H2, P_CO, P_CO2,
• methanol partial pressure P_CH3OH.

5.3 Solution approach

Several analytical models have been reported [3,16-18] for the prediction of the methanol production rate r_CH3OH. In our approach, we try to improve on the analytical modeling by using hybrid modeling, combining the analytical model with neural networks as experimental modeling structures. In order to construct an accurate model for the prediction of the methanol production rate r_CH3OH, we investigate the following steps:

• The predictive importance of the measured process variables (T, P_H2, P_CO, P_CO2, P_CH3OH) for the prediction of the production rate (r_CH3OH) is estimated by analysis of mutual information.

• The prediction of r_CH3OH by the analytical model of Lee et al. [3] is calculated for comparison with the experimental and hybrid models.

• Several experimental models are constructed for the prediction of r_CH3OH (linear regression, GRNN, linear-RBFN model). Input variables are selected by the GA-based feature selection method.


• A parallel hybrid analytical-RBFN model is designed for the prediction of r_CH3OH.

• The models are validated by using a 20-fold cross-validation procedure, where the quality of the models is described by the root-mean-square (RMS) error measure.

• The results of analytical, experimental and hybrid modeling are compared in order to find the most appropriate modeling approach.

5.4 Experimental setup

Methanol is currently produced indirectly from petroleum, natural gas and coal resources via syngas (H2, CO, CO2 and inert CH4) over a Cu/ZnO/Al2O3 catalyst. It is mainly produced in the gas phase, where only a solid catalyst and gaseous reactants are present. As an alternative to the traditional way of producing methanol, this work presents a three-phase system where the catalyst is suspended in a paraffin oil (slurry) [19], as shown in Fig. 7.

Fig. 7. The liquid-phase methanol synthesis.

A continuously stirred slurry reactor of 300 ml volume, equipped with a temperature and pressure control unit, was fed with the gaseous reactants (H2, CO, CO2) and an inert gas (nitrogen). By using small catalyst particles and proper mixing it was ensured that there were no external or internal mass transfer limitations in the system. The analysis of the reactor effluent components (CO, CO2, H2, H2O and CH3OH) was performed by a gas chromatograph in order to determine the corresponding partial pressure of each


component (P_CO, P_CO2, P_H2, P_CH3OH). The rate of methanol production was calculated by the following equation:

r_{CH3OH} = y_{CH3OH} \, \frac{\phi_v}{w} \, \frac{P}{RT},    (9)

where the symbols denote: y_CH3OH – methanol mole fraction, φ_v – gas flow rate, w – weight of catalyst, P – pressure, R – gas constant, and T – temperature.
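Eq. (9) is a one-line computation once these quantities are measured. The values below are invented for illustration only (they are not the chapter's measurements); with pressure in Pa, flow rate in m^3/h and R in J/(mol K), the rate comes out in mol/(h kg_cat).

```python
R = 8.314          # gas constant [J/(mol K)]

# Illustrative (not measured) operating values:
y_CH3OH = 0.04     # methanol mole fraction in the effluent
phi_v = 0.01       # gas flow rate [m^3/h]
w = 0.05           # weight of catalyst [kg]
P = 5.0e6          # total pressure [Pa]
T = 510.0          # temperature [K]

# Eq. (9): production rate per unit catalyst weight
r_CH3OH = y_CH3OH * (phi_v / w) * P / (R * T)
print(f"r_CH3OH = {r_CH3OH:.1f} mol/(h kg_cat)")
```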

The measurements of T, P_H2, P_CO, P_CO2 and P_CH3OH were recorded together with the calculated production rate r_CH3OH, and a database with 106 samples was created for the subsequent analysis.

5.5 Estimation of predictive importance of input variables

Mutual information between the input variables (T, P_H2, P_CO, P_CO2, P_CH3OH) and the production rate (r_CH3OH) was calculated to estimate the relevant inputs. The results are shown in Fig. 8.

Fig. 8. Mutual information between the input variables (T, P_H2, P_CO, P_CO2, P_CH3OH) and the production rate (r_CH3OH).

The variables P_CH3OH and T are the most related to the production rate r_CH3OH, followed by P_H2 and P_CO. The variable P_CO2 has the lowest mutual information with r_CH3OH. The result indicates the predictive importance of the input variables and can be interpreted as a basic estimate of which variables should be included as inputs of the model. Nevertheless, care should be taken regarding the selection of variables, because some variables with low mutual information can be connected and can together be highly related to the production rate.


5.6 Analytical modeling

The conventional approach to modeling of chemical rate processes is mainly analytical. The strongest argument in favor of searching for the actual mechanism is that if we find one which we think represents what truly occurs, extrapolation to new and more favorable operating conditions can be done much more safely. Such an approach is based on a priori knowledge of the properties of all the underlying chemical processes. Methanol synthesis is a heterogeneous reaction which occurs on the catalyst surface. The mechanism of the adsorption of the reactants, the reaction, and the desorption of the products consists of many reaction steps, and it is usually rather complicated and not well understood. In the presence of a liquid phase, the liquid can also influence the rate of methanol synthesis. With some assumptions and simplifications, the analytical approach ends up with a simple kinetic model that can reasonably predict the reaction rate in the entire experimental range. In our case the rate model was taken from Lee et al. [3] (see appendix A). The model is postulated by assuming the hydrogenation of formate (an intermediate in methanol synthesis) as the rate limiting step in the methanol synthesis. The rate of methanol production thus accounts for the hydrogen concentration driving force, which is given as the difference between the actual hydrogen concentration in the reactor and the equilibrium one. The effect of the other reactant concentrations is expressed by the term representing the equilibrium hydrogen concentration. This means that the concentrations of CO and CO2

only indirectly affect the methanol production rate. All available data were used for the determination of the model parameters. The model validation was not performed in a cross-validation manner. By using the input variables (T, P_H2, P_CO, P_CO2, P_CH3OH), a prediction error of RMS = 3.59 was obtained. A scatter plot of the data predicted by Eq. (11) versus the experimental data is shown in Fig. 9.

5.7 Experimental modeling

The objective of experimental modeling is to construct a predictive model based only on the measured observations of the process, without knowing the actual physical properties and driving mechanisms. Often, names such as "black-box modeling", "empirical modeling" or "nonparametric modeling" are used to denote a similar approach. For the experimental modeling of methanol synthesis kinetics we assume that the production rate r_CH3OH can be expressed as a function F of the measured variables:

r_{CH3OH} = F(T, P_{H2}, P_{CO}, P_{CO2}, P_{CH3OH}).    (10)

The following model structures were used to approximate the function F:

• linear regression;
• generalized regression neural network;


Fig. 9. Scatter plot of predictions for the production rate r_CH3OH with the analytical model of Lee et al. [3]. Predictions are shown in relation to the measured values r_CH3OH [mol/h kg_cat]. Prediction error is RMS = 3.59.

• parallel linear-neural model, combining linear regression with an RBFN neural network.

The modeling procedure was combined with several feature selection methods:

• inclusion of all available measured variables;
• GA based feature selection with α = 0 (no constraint regarding the number of selected features);
• GA based feature selection with α = 0.1 (limiting the number of selected features).

The modeling results are shown in Table 1. The best prediction results were obtained by applying a parallel linear-RBFN model using the following inputs: T, P_H2, P_CO, P_CH3OH. The generalization ability was improved by omitting P_CO2 from the inputs. This is consistent with the estimate obtained by the analysis of mutual information (Fig. 8), where the lowest mutual information was calculated for P_CO2.

5.8 Hybrid modeling

The parallel parametric-RBFN model presented in Fig. 3 was used for hybrid modeling of the process. The analytical model of Lee et al. [3] was used as the parametric model and an RBFN was applied for nonlinear error correction. A prediction error of RMS = 1.80 was obtained by the hybrid model. A scatter plot of the hybrid modeling results is shown in Fig. 10.


Model              Feature selection  Selected variables               RMS
linear regression  all inputs         T, P_H2, P_CO, P_CO2, P_CH3OH    3.72
                   GA (α = 0)         T, P_H2, P_CO, P_CH3OH           3.51
                   GA (α = 0.1)       T, P_CH3OH                       3.76
GRNN               all inputs         T, P_H2, P_CO, P_CO2, P_CH3OH    2.56
                   GA (α = 0)         T, P_H2, P_CH3OH                 2.48
                   GA (α = 0.1)       T, P_H2, P_CH3OH                 2.48
linear-RBFN        all inputs         T, P_H2, P_CO, P_CO2, P_CH3OH    2.25
                   GA (α = 0)         T, P_H2, P_CO, P_CH3OH           2.05
                   GA (α = 0.1)       T, P_H2, P_CH3OH                 2.14

Table 1. Prediction results for the production rate r_CH3OH with a linear regression, a generalized regression neural network, and a linear-neural model. RMS prediction errors, obtained by cross-validation, are shown for three feature selection approaches.
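The RMS values in Table 1 are cross-validated. The chapter does not state the fold count, so the k-fold scheme below is an assumption, and `fit`/`predict` are hypothetical callables standing in for whichever model is evaluated:

```python
import numpy as np

def cross_val_rms(fit, predict, X, y, k=5, seed=0):
    """k-fold cross-validated RMS error.
    `fit(X, y)` returns a model; `predict(model, X)` returns predictions."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append((predict(model, X[test]) - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(errors))))
```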


Fig. 10. Scatter plot of predictions for the production rate r_CH3OH [mol/(h kg_cat)] with the hybrid model (analytical model of Lee et al. [3] + RBFN). Predicted values are shown against measured values. Prediction error: RMS = 1.80.

5.9 Results

Modeling results for the analytical, experimental and hybrid models are summarized in Table 2. Selected inputs and corresponding cross-validation prediction errors (RMS) are shown for each model structure.

Feature selection by the genetic algorithm reveals the following informative variables: T, P_H2, P_CO and P_CH3OH. Good prediction results can be obtained even without using P_CO. By using only the selected variables, the generalization ability of the models can be improved.


Model              Selected input variables         RMS prediction error
analytical         T, P_H2, P_CO, P_CO2, P_CH3OH    3.59
linear regression  T, P_H2, P_CO, P_CH3OH           3.51
GRNN               T, P_H2, P_CH3OH                 2.48
linear + RBFN      T, P_H2, P_CO, P_CH3OH           2.05
hybrid             T, P_H2, P_CO, P_CH3OH           1.80

Table 2. Modeling results for the prediction of the production rate r_CH3OH with analytical, experimental and hybrid model structures. Selected inputs and corresponding prediction errors (RMS) are shown.

The selected variables also correspond to the selection by mutual information (Fig. 8); both feature selection methods therefore confirm the predictive importance of the same variables. However, the result is a numerical one, so the low predictive importance of P_CO does not imply that it is insignificant for the analytical modeling. A histogram-based sketch of such a mutual information estimate is given below.
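Mutual information between a candidate input and the production rate can be estimated from a two-dimensional histogram. The sketch below is one standard estimator; both the estimator and the bin count are assumptions, since the chapter does not restate its estimator here:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                         # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)      # marginal of X
    py = pxy.sum(axis=0, keepdims=True)      # marginal of Y
    nz = pxy > 0                             # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```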

Prediction results for r_CH3OH by linear regression are similar to the analytical modeling results. A considerable improvement in prediction is obtained by using a generalized regression neural network. The best results are obtained by the hybrid modeling approach, which combines a priori knowledge (the analytical model of Lee et al. [3]) with a radial basis function network.

6 Conclusions

Whenever a reactive process is considered, the reaction kinetics is very important. The intrinsic kinetics should be based on the mechanistic steps, but must also properly account for the rates of the chemisorption and desorption steps. However, obtaining such kinetic information from mechanistic understanding would be very difficult, if not impossible. Therefore, for engineering design, much simpler correlations are developed by means of statistical and experimental analysis. With sufficient experimental data available, one can use an experimental or a hybrid modeling approach and thus compensate for the lack of knowledge required by the analytical approach.

A hybrid modeling approach, combining parametric modeling with neural networks, was presented in this chapter. The hybrid model structure consists of a parametric model connected in parallel with a radial basis function network. The parametric model forms the basis of the hybrid model and is responsible for the principal modeling of the process and for extrapolation. The neural network is applied as a nonlinear error corrector which improves the modeling accuracy in the domain where measured data exist.

Experimental modeling can be considerably improved by selecting as model inputs only those variables with high predictive importance. By rejecting non-informative variables, the complexity of the model is reduced and its generalization ability often improved. Therefore, two feature selection methods were presented: analysis of mutual information, and genetic-algorithm-based feature selection.

The proposed modeling methods were applied to the prediction of the production rate of liquid-phase methanol synthesis. The analytical modeling was compared to the experimental and hybrid approaches. Superior prediction results were obtained by the hybrid analytical-RBFN model, which reduces the prediction error by 50% with respect to the analytical approach. Such an improvement can be very useful for the design of commercial reactors, pilot plant studies, process improvement, and process optimization.

A Analytical model of methanol synthesis kinetics

The analytical model for the kinetics of methanol synthesis, as proposed by Lee et al. [3], is defined by the driving-force equation

$$ r_{CH_3OH} = k \, \left( C_{H_2} - C_{H_2,eq} \right) \qquad (11) $$

where k is calculated as

$$ k = 2.627 \cdot 10^{14} \exp\!\left( -\frac{112.{\sim}{\sim} \cdot 10^{3}}{R\,T} \right). \qquad (12) $$

The equilibrium constants are obtained from the following relations:

$$ K_p = \left[ \frac{P_{CH_3OH}}{P_{H_2}^{2}\, P_{CO}} \right]_{eq} \qquad (13) $$

$$ K_w = \left[ \frac{P_{CO}\, P_{H_2O}}{P_{CO_2}\, P_{H_2}} \right]_{eq} \qquad (14) $$

where K_p and K_w are taken from [20]:

$$ K_p = 10^{\,(5139.0/T - 12.621)} \qquad (15) $$

$$ K_w = 10^{\,(-2073.0/T + 2.029)} \qquad (16) $$

The equilibrium partial pressures in Eqs. (13)-(14) are defined by:

$$ P_{CH_3OH,eq} = P_{CH_3OH} + x \qquad (17) $$
$$ P_{H_2,eq} = P_{H_2} - 2x - y \qquad (18) $$
$$ P_{H_2O,eq} = y \qquad (19) $$
$$ P_{CO,eq} = P_{CO} - x + y \qquad (20) $$
$$ P_{CO_2,eq} = P_{CO_2} - y \qquad (21) $$

The equilibrium value P_{H_2,eq} can be calculated from Eqs. (13)-(21). Concentrations in Eq. (11) are related to pressure by Henry's law

$$ c = \frac{P}{H} \qquad (22) $$

with H denoting Henry's constant

$$ H = 78.192 \exp\!\left( \frac{2854.0}{R\,T} \right). \qquad (23) $$
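Numerically, the equations chain together as follows: given T and the measured partial pressures, Eqs. (13)-(21) are solved for the extents x and y, Eq. (22) converts pressures to concentrations, and Eq. (11) yields the rate. The sketch below illustrates this; it is not the authors' code. The driving-force form of Eq. (11) is reconstructed from the description in the text, the activation energy is a placeholder for the partly illegible value in Eq. (12), and units must be taken from the original study:

```python
import numpy as np
from scipy.optimize import fsolve

R = 8.314  # J/(mol K)

def rate_lee(T, P):
    """Evaluate the Lee et al. [3] rate via Eqs. (11)-(23).
    P is a dict of measured partial pressures; units as in the original study."""
    E_A = 112.0e3                           # J/mol, placeholder (Eq. 12 is partly illegible)
    k = 2.627e14 * np.exp(-E_A / (R * T))   # Eq. (12), Arrhenius form assumed
    Kp = 10 ** (5139.0 / T - 12.621)        # Eq. (15)
    Kw = 10 ** (-2073.0 / T + 2.029)        # Eq. (16)

    def equilibrium(v):
        x, y = v                            # reaction extents, Eqs. (17)-(21)
        p_meoh = P["CH3OH"] + x
        p_h2 = P["H2"] - 2 * x - y
        p_h2o = y
        p_co = P["CO"] - x + y
        p_co2 = P["CO2"] - y
        return [Kp * p_h2 ** 2 * p_co - p_meoh,      # Eq. (13)
                Kw * p_co2 * p_h2 - p_co * p_h2o]    # Eq. (14)

    x, y = fsolve(equilibrium, [0.1, 0.1])
    p_h2_eq = P["H2"] - 2 * x - y                    # Eq. (18)
    H = 78.192 * np.exp(2854.0 / (R * T))            # Eq. (23), Henry's constant
    c_h2, c_h2_eq = P["H2"] / H, p_h2_eq / H         # Eq. (22)
    return k * (c_h2 - c_h2_eq)                      # Eq. (11), driving-force form
```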

Acknowledgments


The authors would like to acknowledge the Ministry of Science and Technology, Slovenia, for its financial support.

References

1. D. R. Baughman and Y. A. Liu, Neural Networks in Bioprocessing and Chemical Engineering. San Diego: Academic Press, 1995.

2. M. Agarwal, "Combining neural and conventional paradigms for modelling, prediction and control," International Journal of Systems Science, vol. 28, no. 1, pp. 65-81, 1997.

3. S. Lee, J. B. Berty, H. L. Green, S. Desirazu, M. Ko, V. Parameswaran, and A. Sawant, "Thermodynamics, kinetics, and thermal stability of liquid phase methanol synthesis," in Ninth Annual EPRI Contractor's Conference on Coal Liquefaction, (Palo Alto), pp. 15-28, Electric Power Research Institute, EPRI AP-3825-SR, 1985.

4. C. M. Bishop, "Neural networks and their applications," Review of Scientific Instruments, vol. 65, no. 6, pp. 1803-1832, 1994.

5. S. Haykin, Neural Networks: A Comprehensive Foundation. New Jersey: Prentice Hall, 2nd ed., 1999.

6. G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.

7. F. Girosi and T. Poggio, "Networks and the best approximation property," Biological Cybernetics, vol. 63, pp. 169-176, 1990.

8. J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.

9. D. F. Specht, "A generalized regression neural network," IEEE Transactions on Neural Networks, vol. 2, pp. 568-576, 1991.

10. I. Grabec and W. Sachse, Synergetics of Measurements, Prediction and Control. Heidelberg, Germany: Springer Verlag, Series in Synergetics, 1997.

11. T. Masters, Advanced Algorithms for Neural Networks: A C++ Sourcebook. New York: John Wiley & Sons, 1995.

12. J. H. Holland, Adaptation in Natural and Artificial Systems. Cambridge: MIT Press, 1975.

13. L. Davis, ed., Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991.

14. D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.

15. P. Potocnik and I. Grabec, "Neural-genetic system for modeling of antibiotic fermentation process," in Proceedings of the International ICSC Symposium on Engineering of Intelligent Systems EIS'98, Volume 2, (Tenerife, Spain), pp. 307-313, 1998.


16. G. H. Graaf, J. G. M. Winkelman, E. J. Stamhuis, and A. A. C. M. Beenackers, "Kinetics of the three phase methanol synthesis," Chemical Engineering Science, vol. 43, pp. 2161-2168, 1988.

17. S. Ledakowicz, M. Stelmachowski, and A. Chauk, "Methanol synthesis in bubble column slurry reactors," Chemical Engineering and Processing, vol. 31, pp. 213-219, 1992.

18. K. M. Vanden Bussche and G. F. Froment, "A steady-state kinetic model for methanol synthesis and the water gas shift reaction on a commercial Cu/ZnO/Al2O3 catalyst," Journal of Catalysis, vol. 161, pp. 1-10, 1996.

19. M. Setinc and J. Levec, "On the kinetics of liquid-phase methanol synthesis over commercial Cu/ZnO/Al2O3 catalyst," Chemical Engineering Science, vol. 54, no. 15/16, pp. 3577-3586, 1999.

20. G. H. Graaf, P. J. J. M. Sijtsema, E. J. Stamhuis, and G. E. H. Joosten, "Chemical equilibria in methanol synthesis," Chemical Engineering Science, vol. 41, no. 11, pp. 2883-2890, 1986.


About the Editors

Dr. Les M. Sztandera is an Associate Professor of Computer Science and Head of the Computer Science Program at Philadelphia University, Philadelphia, Pennsylvania, U.S.A.

He has been involved in soft computing teaching and research since 1987. Dr. Sztandera has 11 years of full-time university teaching experience, and is a recipient of a Teaching Excellence Award. He developed a sequence of soft computing courses coupled with laboratory assignments in which students work with real-life problems, such as detecting an industrial pollutant, predicting strength and density of materials, designing a medical expert system, simulating protective systems in complex power generating units, detecting carcinogenic dyes, or designing new drugs.

Complementary to his teaching efforts, Dr. Sztandera has been involved in a variety of research activities. These have resulted in numerous research grants from the Department of Commerce, National Textile Center, National Science Foundation, Ohio Supercomputer Center, Pittsburgh Supercomputer Center, and American Heart Association, amounting to over $1,000,000 in research funding. These research activities have also resulted in 25 journal publications and 40 conference presentations.

Dr. Sztandera received his Ph.D. degree from the Department of Electrical Engineering and Computer Science, University of Toledo, Ohio, U.S.A., with a dissertation on Fuzzy Sets in Self-Generating Neural Network Architectures. He earned his M.Sc. degree from the Department of Computer Science and Engineering, University of Missouri, Missouri, U.S.A., with a thesis on Spatial Relations Among Fuzzy Subsets of an Image, and a Diploma in English from University of Cambridge, Cambridge, England.

Dr. Sztandera is a member of professional organizations in the U.S. and Canada: the North American Fuzzy Information Processing Society, the Association for Computing Machinery, and the Canadian Society for Fuzzy Information and Neural Systems. His scientific and scholarly research contributions to fuzzy set theory are internationally recognized. He proposed, designed, and implemented fuzzy neural trees. For this and other contributions to fuzzy sets and systems theory, he was included in the Encyclopedia of Computer Science and Technology, 1999 Edition. Dr. Sztandera is also listed in the Marquis Who's Who in the World, Who's Who in Science and Engineering, Who's Who in America, and Who's Who in the East.


Dr. Hugh Cartwright is a member of the Physical and Theoretical Chemistry Laboratory, which is a division of the Chemistry Department at Oxford University, England.

He has been actively involved in both teaching and research for thirty years, working most recently on the application of Artificial Intelligence and related methods to the solution of scientific problems. These studies have included the use of Genetic Algorithms, Neural Networks, Kohonen Networks, Data Mining techniques, and Pheromone Trail (Ant System) algorithms to problems such as the dispersal of airborne pollution, optimization of organic synthetic routes, industrial process control, drug design, "reverse engineering" of archeological discoveries, bacterial growth, spectral interpretation, bioinformatics, and the assessment of data from drug testing.

Dr. Cartwright is a computational chemist who received a PhD in Chemistry from the University of East Anglia in England in 1972, studying the use of molecular orbital theory in the description of color centers. After a short period at University College, Cork, Eire, working in Prof. Brian Hathaway's group, he moved to the Chemistry Department at the University of Victoria, Canada, where his research interests diversified to include high resolution spectroscopy, chemical education and Artificial Intelligence.

In 1984 he returned to the U.K. to take up a position in Oxford University's Chemistry Department, where he is Laboratory Officer, and Lecturer at St Anne's and Oriel Colleges. His research interests now lie principally in the study of how intelligent methods can be used in science, and this work has led to publications, lectures and conference papers covering a wide area.


List of Contributors

Zulfiqur Ali, School of Science and Technology, University of Teesside, Middlesbrough, England

Anna Maria Bianucci, Dipartimento di Scienze Farmaceutiche, Università di Pisa, Via Bonanno 6, 56126 Pisa, Italy

Charles Bock, School of Science and Health, Philadelphia University, Philadelphia, PA 19144, USA

Hugh M. Cartwright, Physical and Theoretical Chemistry Laboratory, University of Oxford, Oxford, England

Ashish Garg, School of Textiles and Materials Technology, Philadelphia University, Philadelphia, PA 19144, USA

Valerie J. Gillet, Department of Information Studies, University of Sheffield, Western Bank, Sheffield, S10 2TN

Igor Grabec, Faculty of Mechanical Engineering, Laboratory of Technical Physics, University of Ljubljana, Ljubljana, Slovenia

Taizo Hanai, Department of Biotechnology, Nagoya University, Nagoya 464-8603, Japan

Hiroyuki Honda, Department of Biotechnology, Nagoya University, Nagoya 464-8603, Japan

David Issott, Physical and Theoretical Chemistry Laboratory, University of Oxford, Oxford, England

Roy L. Johnston, School of Chemistry, University of Birmingham, Birmingham, UK

Takeshi Kobayashi, Department of Biotechnology, Nagoya University, Nagoya 464-8603, Japan

Janez Levec, Laboratory for Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia

Werner Lottermoser, Institute of Mineralogy, University of Salzburg, Salzburg, Austria

Alessio Micheli, Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

W. T. O'Hare, School of Science and Technology, University of Teesside, Middlesbrough, England

Ketan Patel, Physical and Theoretical Chemistry Laboratory, University of Oxford, Oxford, England

Andrew Porter, Physical and Theoretical Chemistry Laboratory, University of Oxford, Oxford, England


Primoz Potocnik, Faculty of Mechanical Engineering, Laboratory of Technical Physics, University of Ljubljana, Ljubljana, Slovenia

Christopher Roberts, School of Chemistry, University of Birmingham, Birmingham, UK

Thomas Schell, Institute of Mineralogy, University of Salzburg, Salzburg, Austria

S. M. Scott, School of Science and Technology, University of Teesside, Middlesbrough, England

Marko Setinc, Laboratory for Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia

Alessandro Sperduti, Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

Antonina Starita, Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

Konrad Steiner, Institute of Mineralogy, University of Salzburg, Salzburg, Austria

Les M. Sztandera, Computer Science Department, Philadelphia University, Philadelphia, PA 19144, U.S.A.

Mendel Trachtman, School of Science and Health, Philadelphia University, Philadelphia, PA 19144, USA

Janardhan Veiga, Philadelphia University, Philadelphia, PA 19144, USA