

Journal of Analytical and Applied Pyrolysis 50 (1999) 47–62

A genetic algorithm for pattern recognition analysis of pyrolysis gas chromatographic data

Barry K. Lavine a,*, Anthony Moores a, Lisa K. Helfend b

a Department of Chemistry, Box 5810, Clarkson University, Potsdam, NY 13699-5810, USA
b Department of Chemistry, University of California at Santa Cruz, Santa Cruz, CA 95064, USA

Received 22 September 1998; accepted 14 December 1998

Abstract

The development of a genetic algorithm (GA) for pattern recognition analysis of pyrolysis gas chromatographic data is reported. The GA selects features that optimize the separation of the classes in a plot of the two largest principal components (PCs) of the data. Because the largest PCs capture the bulk of the variance in the data, the peaks chosen by the GA convey information primarily about differences between the classes in the data set. Hence, the principal component analysis routine embedded in the fitness function of the GA acts as an information filter, significantly reducing the size of the search space, since it restricts the search to feature sets whose PC plots show clustering on the basis of class. In addition, the algorithm can focus on those classes and/or samples that are difficult to classify as it trains using a form of boosting. Samples that consistently classify correctly are not as heavily weighted in the analysis as samples that are difficult to classify. Over time, the algorithm learns its optimal parameters in a manner similar to a neural network. The proposed algorithm integrates aspects of artificial intelligence and evolutionary computations to yield a ‘smart’ one-pass procedure for pattern recognition. The efficacy and efficiency of the pattern recognition GA are demonstrated using a data set consisting of 133 pyrochromatograms of cultured skin fibroblasts obtained from 24 obligate cystic fibrosis homozygotes and from 22 normal controls. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Genetic algorithms; Pattern recognition techniques; Pyrolysis gas liquid chromatography; Principal component analysis; Evolutionary computations; Feature selection

* Corresponding author. Tel.: +1-315-2682389; fax: +1-315-2686610.

0165-2370/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0165-2370(99)00002-9


1. Introduction

Pyrolysis gas liquid chromatography has proven to be a valuable technique for obtaining chemical fingerprints of complex samples such as microorganisms, polymers, or human cell lines [1–8]. To use these fingerprints or pyrochromatograms to discriminate between different sample types, e.g. various strains of microorganisms or copolymer types, it is necessary to examine the variability of large numbers of pyrochromatograms. Pattern recognition techniques are ideally suited for this task, since they can display variability between a large number of chromatograms and show major clustering trends in large chromatographic data sets. Pattern recognition techniques that have been applied to pyrochromatographic (PyGC) data sets include cluster analysis [9], SIMCA [10–13], nonlinear mapping [14], and K-NN [15]. Data reduction, i.e. selection of peaks or groups of peaks related to a given property, is a major goal in many pattern recognition studies involving PyGC data.

In this paper, we report on the development of a genetic algorithm (GA) for pattern recognition analysis of PyGC data. The GA selects features that optimize the separation of the classes in a plot of the two largest principal components (PCs) of the data. Because the two largest PCs capture the bulk of the variance in the data, the peaks chosen by the GA will contain information primarily about differences between the classes in the data set if the PC plot shows large differences between the classes in the data set. (Since variance is synonymous with information, feature selection via this criterion may cause information about class differences to emerge as the dominant source of variation in the data.) Hence, the principal component analysis routine in the fitness function of the GA acts as an information filter, significantly reducing the size of the search space, since it restricts the search to feature sets whose PC plots show clustering on the basis of class. (If a plot of the two largest principal components for a set of features yields well separated classes, one can only conclude that the bulk of the variance encoded by the set of GC peaks is about discrimination. Such features usually produce a good classifier.) In addition, the algorithm focuses on those classes and/or samples that are difficult to classify as it trains using a form of boosting [16]. Samples that consistently classify correctly are not as heavily weighted in the analysis as samples that are difficult to classify. Over time, the algorithm learns its optimal parameters in a manner similar to a perceptron [17]. The proposed algorithm integrates aspects of artificial intelligence and evolutionary computations to yield a ‘smart’ one-pass procedure for pattern recognition.

2. Experimental

The PyGC data set consisted of 133 pyrochromatograms of cultured skin fibroblasts obtained from 24 obligate CF homozygotes and from 22 normal controls (see Table 1). Although the primary clinical manifestations in cystic fibrosis (CF) patients appear centered in the pancreas and in the mucus-producing cells, a number of studies have indicated that biochemical abnormalities are also present in skin fibroblasts [18]. Skin fibroblasts are also an appealing model system for their ease of culture and their insensitivity to the transient metabolic status of the donor.

Table 1
Human skin fibroblast samples

Sample    Type    Passage  Source  Gender  Age
GM142     CF      7        H       M       14
GM668     CF      7        H       M       10
GM768     CF      7        H       M       13
GM770     CF      6        H       M       19
GM997     CF      12       H       M       10
GM998     CF      12       H       M       4
GM999     CF      18       H       F       13
GM1348    CF      7        H       F       18
GM1707    CF      14       H       F       9
GM3466    CF      6        H       M       3
RONROB    CF      2        E       M       23
LM        CF      5        E       F       25
JJ        CF      2        E       M       12
DANL      CF      2        E       M       22
KIML      CF      2        E       F       22
DHOLD     CF      2        E       M       28
TCAL      CF      2        E       M       23
TEDDY     CF      3        E       M       17
MCAFEE    CF      2        E       M       19
LGALE     CF      2        E       F       14
MGLISS    CF      2        E       M       18
ABELT     CF      2        E       F       11
GK        CF      3        E       M       25
MS        CF      7        E       M       27
GM497     Normal  12       H       M       4
GM408     Normal  13       H       F       5
GM2987    Normal  6        H       M       19
GM3651    Normal  6        H       F       25
GM1706    Normal  10       H       F       82
GM2938A   Normal  7        H       M       3
GM970     Normal  8        H       M       3 days
GM499     Normal  11       H       M       8
GM500     Normal  12       H       M       10
GM316A    Normal  13       H       M       12
GM1651    Normal  11       H       F       13
GM2674    Normal  11       H       F       29
GM2623    Normal  8        H       F       61
GM2185    Normal  10       H       M       36
GM2036    Normal  14       H       F       11
ASH       Normal  2        E       M       30
LIP       Normal  2        E       F       20
MN        Normal  5        E       F       45
JNEW      Normal  2        E       F       28
EP        Normal  2        E       M       40
JP        Normal  2        E       M       50
JMS       Normal  2        E       M       30

Skin fibroblasts were grown from live human volunteers, from refrigerated subjects, or from the Human Genetic Mutant Cell Repository (Institute for Medical Research, Camden, NJ). The cells were cultured in a modified Eagle’s minimum essential medium. Batches of the growth medium were prepared upon demand from a stock solution of modified minimum essential medium. The established cell lines were serially passaged until sufficient material was available for several PyGC experiments. Standard precautionary measures were taken to monitor cell viability and the absence of contamination.

Pyrolysis of the fibroblast sample was carried out in two stages using a CDS Analytical Pyroprobe (Oxford, PA) interfaced to a 5830A Hewlett-Packard (Palo Alto, CA) gas chromatograph. The fibroblast sample was pyrolyzed at 400°C and then again at 700°C. Only the PyGCs from the 700°C run were used for the pattern recognition study. The two-stage pyrolysis procedure yielded reproducible gas chromatograms because water was eliminated from the sample as a result of the 400°C run. A typical PyGC from a 700°C run is shown in Fig. 1. The volatile products were separated on Carbowax 20M capillary columns (25 m in length × 0.21 mm). The columns were temperature-programmed from 45 to 250°C at 10°C/min, and the run-time was about 60 min per sample.

To ensure reproducibility of the pyrochromatograms, the platinum pyro-probe was periodically calibrated using salts of known melting temperature. A small quantity of the salt was deposited in the center of the quartz tube and mounted in the pyro-probe coil, which was clamped at the focus of a viewing magnifier. The temperature setting of the pyro-probe controller was slowly increased from an initial value slightly below the salt’s melting point. Repeated 10 s pyrolyses and incremental increases in the temperature setting were carried out until the salt crystals were observed to melt. Three replicate tests were performed using KHSO4 (197°C), NaI (651°C), and NaBr (730°C). The values obtained from these calibrations were used to appropriately modify the pyro-probe temperature settings for the fibroblast sample pyrolyses.

3. Data preparation

For each subject, duplicate, triplicate or quadruplicate PyGCs were taken. The 73 CF and 60 normal PyGCs were standardized using a peak-matching program written in FORTRAN IV for a Prime 750 super-minicomputer. The program is similar to one written by Morgan [19]. It is divided into sections that scale, rubber-band, and train the chromatographic data for final matching. The scaling and rubber-banding module is based on a set of prominent peaks called reference or marker peaks, which are present in all the PyGCs and defined by the user. Each PyGC is divided into intervals defined by marker peaks. Retention time offsets are computed for the marker peaks in each individual PyGC. In other words, the difference in retention time between the marker peaks in the reference chromatogram (which is selected by the user) and a chromatogram’s own marker peaks is computed. These offsets are then used to adjust the retention times of each set of reference peaks to force time base consistency with the marker peaks of the reference chromatogram. Finally, all the peaks that lie between the reference markers are adjusted by linear interpolation, which is a valid assumption in the case of temperature programming. Data preceding the first marker peak and following the last marker peak are ignored: the early information is obscured by the solvent peak, and trailing peaks beyond the last reference peak are poorly resolved.

Fig. 1. Pyrochromatogram at 700°C of a CF and normal human skin fibroblast sample.

Once the retention times for all peaks are adjusted, the areas are then adjusted using a scaling factor based on the sum of the areas of the reference peaks. (In other words, the area of each peak is divided by the sum of the areas of the reference peaks.) The peak matching routine also computes tolerance windows for the retention time differences. The peaks are matched with those in the reference chromatogram, provided that the difference in the adjusted retention time for a given pair of peaks falls within the specified tolerance window. To compute the tolerance window, a peak that is present in all the PyGCs, and is readily identifiable by the user, but not labeled as a reference peak, was followed. The number of occurrences of the peak at various retention times was plotted. The width of the retention time interval for the peak was used to define the tolerance window. For each interval of the PyGC, a tolerance window was computed which was 0.2 minutes on average.
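The rubber-banding, area scaling, and window matching steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the FORTRAN IV program used in the study: the function names, the marker positions, and the use of a single symmetric `tolerance` (maximum absolute retention-time difference) are all hypothetical.

```python
import numpy as np

def align_retention_times(peak_times, sample_markers, reference_markers):
    """Map peak retention times onto the reference time base.

    Marker peaks land exactly on their reference positions; peaks between two
    markers are moved by linear interpolation. Peaks before the first or after
    the last marker are dropped, as described in the text.
    """
    peak_times = np.asarray(peak_times, dtype=float)
    sample_markers = np.asarray(sample_markers, dtype=float)
    reference_markers = np.asarray(reference_markers, dtype=float)
    inside = (peak_times >= sample_markers[0]) & (peak_times <= sample_markers[-1])
    # np.interp is piecewise linear through the (sample marker, reference marker) pairs.
    adjusted = np.interp(peak_times[inside], sample_markers, reference_markers)
    return adjusted, inside

def scale_areas(areas, marker_index):
    """Divide every peak area by the summed area of the marker peaks."""
    areas = np.asarray(areas, dtype=float)
    return areas / areas[marker_index].sum()

def match_peaks(adjusted_times, reference_times, tolerance):
    """For each reference window, return the index of the matched peak or -1."""
    adjusted_times = np.asarray(adjusted_times, dtype=float)
    matches = []
    for t_ref in reference_times:
        diffs = np.abs(adjusted_times - t_ref)
        j = int(np.argmin(diffs))
        matches.append(j if diffs[j] <= tolerance else -1)
    return matches

# Hypothetical retention times in minutes (not values from the study).
adjusted, kept = align_retention_times(
    [2.0, 5.0, 10.0, 22.5, 31.0],
    sample_markers=[5.0, 15.0, 30.0],
    reference_markers=[5.2, 15.0, 29.4],
)
scaled = scale_areas([10.0, 20.0, 30.0, 40.0], marker_index=[0, 3])
matches = match_peaks(adjusted, reference_times=[5.2, 10.0, 18.0], tolerance=0.15)
```

A reference window with no peak inside the tolerance is reported as -1, which corresponds to a peak that is absent from that chromatogram.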

The peak matching software yielded a final cumulative reference file containing 84 standardized retention time windows, though not all peaks were present in all chromatograms. Hence, for pattern recognition analysis, each gas chromatogram was initially represented as an 84 dimensional vector x = (x1, x2, x3, …, xj, …, x84), where xj is the area of the jth peak. The set of data, 133 PyGCs (73 CF and 60 normal) of 84 peaks each, was normalized and auto-scaled to ensure that each chromatogram and feature had equal weight in the analysis. The data were analyzed using a genetic algorithm developed for pattern recognition, which was implemented in MATLAB (MathWorks, Inc). All calculations referred to in this study were performed on a Micron Pentium 133 MHz personal computer equipped with 64 megabytes of EDO RAM.
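A minimal sketch of the two preprocessing steps follows. The text does not say which normalization was used, so scaling each chromatogram to unit total area is an assumption made here for illustration; auto-scaling is taken in its usual sense of zero mean and unit variance per feature. The data are synthetic stand-ins, not the study's peak table.

```python
import numpy as np

def normalize_rows(X):
    """Scale each chromatogram (row) so that its peak areas sum to 1 (assumed form)."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def autoscale(X):
    """Center each feature (column) to zero mean and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by zero
    return (X - X.mean(axis=0)) / std

rng = np.random.default_rng(0)
X = rng.random((133, 84)) + 0.1      # synthetic stand-in for the 133 x 84 peak table
Z = autoscale(normalize_rows(X))
```

Row normalization gives every chromatogram the same total weight, and auto-scaling gives every peak the same variance, which is the "equal weight" property the paragraph refers to.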

4. Genetic algorithm

Genetic algorithms were developed by Holland [20] as part of a study on adaptive processes. They are based on the principles of natural evolution and selection. The procedure builds a population of binary strings, each of which represents a possible solution. Fit solutions are allowed to live and breed. A block diagram of the genetic algorithm (GA) developed for pattern recognition is shown in Fig. 2. The various components of the algorithm are described below.


Fig. 2. Block diagram of the genetic algorithm developed for pattern recognition analysis.

4.1. Population

Selected feature subsets are coded as binary strings called chromosomes. Each chromosome describes a unique set of features. A particular feature is present in a chromosome or binary string only if the corresponding bit in the string is set to 1. The length of each chromosome is equal to the number of features in the data set. (In this study, the length of each chromosome is 84.) The number of chromosomes in the initial population is f. (Usually, f is set to 100.) The chromosomes or binary strings comprising the initial population (i.e. the population at generation 0) are generated at random to minimize potential bias.
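As an illustration, a random initial population of 100 chromosomes of length 84 could be generated as below. The per-bit probability `p_on` is a hypothetical parameter: the text only says the strings are random, and 0.1 is chosen here because it gives roughly eight features per string. The guard against empty feature sets is also an added assumption.

```python
import numpy as np

def initial_population(pop_size, n_features, p_on, rng):
    """Random binary chromosomes; bit j = 1 means feature j is in the subset."""
    pop = rng.random((pop_size, n_features)) < p_on
    # Guard (an assumption): make sure no chromosome encodes an empty feature set.
    for row in pop:
        if not row.any():
            row[rng.integers(n_features)] = True
    return pop.astype(np.uint8)

rng = np.random.default_rng(1)
population = initial_population(pop_size=100, n_features=84, p_on=0.1, rng=rng)
```

Lowering `p_on` yields sparser starting subsets, which is the knob that matters when the density of the initial feature subsets is varied.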

4.2. Fitness function

With each generation, the GA computes class and sample weights. These weights are an integral part of the fitness function, which is based on scoring the features according to their ability to optimize the separation of the different sample-types in a plot of the two largest principal components of the data. The principal component plot functions as an embedded information filter. Because a good PC plot is only generated by features whose variance is primarily about differences between the classes, the principal component plots limit the GA search to these types of feature sets, thereby reducing the search space. The fitness function of the GA has both normalization and scoring components.

Normalization is used to adjust sample weights (SW) and class weights (CW) to preserve the following property: the sum of the class weights is equal to 100, and the sum of the sample weights in a class is always equal to the class weight. This facilitates the tracking and scoring of the chromosomes between generations. The normalization functions are given in Eqs. (1) and (2). CW(c) is the weight of class c, and SWc(s) is the weight of sample s in class c. Prior to the first generation, the user initializes the class weights, with the sample weights being uniformly distributed in a class.

$$CW(c) = 100\,\frac{CW(c)}{\sum_{c} CW(c)} \tag{1}$$

$$SW_c(s) = CW(c)\,\frac{SW_c(s)}{\sum_{s \in c} SW_c(s)} \tag{2}$$
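The two normalization steps can be sketched as follows. The function name and the dictionary/array layout are illustrative choices rather than the authors' MATLAB code; the example uses two equally weighted classes of ten and 20 samples.

```python
import numpy as np

def normalize_weights(class_weights, sample_weights, labels):
    """Rescale class weights to sum to 100 and sample weights to sum to CW(c) per class."""
    labels = np.asarray(labels)
    total = float(sum(class_weights.values()))
    cw = {c: 100.0 * w / total for c, w in class_weights.items()}   # Eq. (1)
    sw = np.asarray(sample_weights, dtype=float).copy()
    for c, w in cw.items():
        members = labels == c
        sw[members] = w * sw[members] / sw[members].sum()           # Eq. (2)
    return cw, sw

labels = np.array([1] * 10 + [2] * 20)
cw, sw = normalize_weights({1: 1.0, 2: 1.0}, np.ones(30), labels)
```

With uniform starting weights, each class receives a weight of 50, each class 1 sample a weight of 5, and each class 2 sample a weight of 2.5.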

Scoring is performed on each principal component plot, which is generated for each chromosome after the subset of features coded in the chromosome has been extracted. A principal component plot is scored using k-nearest neighbor (K-NN). For a given sample point, Euclidean distances are computed between it and every other point in the principal component plot. These distances are arranged from smallest to largest, and a poll is taken of the point’s k-nearest neighbors. For the most rigorous classification, k equals the number of samples in the class to which the point belongs. The sample hit count SHC(s), or the number of like nearest neighbors, satisfies 0 ≤ SHC(s) ≤ Kc. The fitness is computed using Eq. (3).

$$\sum_{c} \sum_{s \in c} \frac{1}{K_c} \times SHC(s) \times SW(s) \tag{3}$$
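A rough sketch of the scoring step is given below: extract the features coded in a chromosome, project onto the two largest principal components, count like-class points among the K_c nearest neighbors, and sum the weighted hit fractions as in Eq. (3). Several details are assumptions, since the text does not specify them: the PCA is done by SVD of the mean-centered subset, a point is not counted as its own neighbor, and the data are synthetic.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Scores on the largest principal components of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def sample_hit_counts(scores, labels):
    """SHC(s): like-class points among the K_c nearest neighbors of each sample."""
    labels = np.asarray(labels)
    d = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # assumption: a point is not its own neighbor
    shc = np.empty(len(labels))
    k = np.empty(len(labels))
    for i, c in enumerate(labels):
        k[i] = np.sum(labels == c)           # K_c = size of the sample's own class
        nearest = np.argsort(d[i])[: int(k[i])]
        shc[i] = np.sum(labels[nearest] == c)
    return shc, k

def fitness(X, chromosome, labels, sample_weights):
    """Eq. (3) for one chromosome (which must select at least one feature)."""
    subset = X[:, np.asarray(chromosome, dtype=bool)]
    shc, k = sample_hit_counts(pc_scores(subset), labels)
    return float(np.sum(shc / k * sample_weights))

rng = np.random.default_rng(0)
labels = np.array([1] * 10 + [2] * 20)
X = rng.normal(size=(30, 5))
X[labels == 2, :2] += 50.0                   # only features 0 and 1 separate the classes
weights = np.where(labels == 1, 5.0, 2.5)
good = fitness(X, [1, 1, 0, 0, 0], labels, weights)
poor = fitness(X, [0, 0, 1, 1, 1], labels, weights)
```

Because self-matches are excluded in this sketch, a sample can collect at most K_c − 1 hits, so the best attainable score for this toy data set is 92.5 rather than 100; counting each point as its own neighbor would remove that gap.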

The mean sample hit rate (SHR) for each sample over the entire population of chromosomes or feature subsets is computed via Eq. (4), and is used to drive the boosting routine, where f is the size of the population.

$$SHR(s) = \frac{1}{f} \sum_{i=1}^{f} \frac{SHC_i(s)}{K_c} \tag{4}$$
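Eq. (4) amounts to averaging the hit fractions over the population. A small sketch, with an illustrative array layout (one row of hit counts per chromosome) that is not taken from the original code:

```python
import numpy as np

def sample_hit_rates(shc_matrix, k_per_sample):
    """Eq. (4): mean of SHC_i(s) / K_c over the f chromosomes (rows of shc_matrix)."""
    shc_matrix = np.asarray(shc_matrix, dtype=float)
    return (shc_matrix / np.asarray(k_per_sample, dtype=float)).mean(axis=0)

# Two chromosomes, two samples; the samples belong to classes of size 10 and 20.
shr = sample_hit_rates([[7, 20], [9, 10]], k_per_sample=[10, 20])
```

Samples with a low hit rate are the ones the boosting routine later weights more heavily.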

To understand scoring, consider a data set with two classes, which have been assigned equal weights. Class 1 has ten samples, and class 2 has 20 samples. For uniformly distributed sample weights, class 1 samples will have a weight of 5 and class 2 samples will have a weight of 2.5, since each class has a weight of 50 and the sample weights in each class are uniformly distributed. Suppose a sample in class 1 has, as its nearest neighbors, seven class 1 samples in a principal component plot developed from a particular feature subset. Hence, SHC(s)/Kc = 7/10, and the contribution of this sample to the fitness function for the particular feature subset equals 0.7 × 5 or 3.5. Multiplying SHC/Kc by SW(s) for each sample and summing up the corresponding product for the 30 samples in the data set yields the value of the fitness function for this particular set of features.
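One consequence of the normalization is worth writing out. Assuming the weights are non-negative and have been normalized by Eqs. (1) and (2), and using SHC(s) ≤ K_c, the fitness of any feature subset is bounded above by 100:

$$\sum_{c}\sum_{s \in c} \frac{SHC(s)}{K_c}\,SW(s) \;\le\; \sum_{c}\sum_{s \in c} SW(s) \;=\; \sum_{c} CW(c) \;=\; 100$$

In the two-class example, the class 1 sample with seven like neighbors contributes $\frac{7}{10} \times 5 = 3.5$ of this total, against a maximum possible contribution of 5.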

4.3. Reproduction

Selection, crossover, and mutation operators are applied to the chromosomes. Fit strings are retained and selected for breeding, a process called selection, which is the first step toward population reorganization. The fit feature subsets are then broken up, swapped, and recombined, creating new feature subsets, which are introduced into the population of potential solutions. This process is called crossover. In this study, the selection and crossover operators are implemented by ordering the population of strings, i.e. potential solutions, from best to worst, while simultaneously generating a copy of the same population and randomizing the order of the strings in this copy with respect to their fitness. A fraction of the population is then selected as per the selection pressure, which is set at 0.5. The top half of the ordered population is mated with strings from the top half of the random population, guaranteeing that the best 50% are selected for reproduction, while every string in the randomized copy has a uniform chance of being selected (see Fig. 3). (This is due to the randomized selection criterion imposed on strings from this population.) If a purely biased selection criterion were used to select strings, only a small region of the search space would be explored. Within a few generations, the population would consist of only copies of the best strings in the initial population.
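The pairing scheme can be sketched as below: the best strings come from the fitness-ordered list, their mates from the top of a shuffled copy. The function name and return format are hypothetical, and the sketch does nothing to stop a string from being paired with itself, a case the text does not address.

```python
import numpy as np

def mating_pairs(fitness_values, selection_pressure, rng):
    """Pair the fittest strings with strings drawn from a shuffled copy of the population."""
    fitness_values = np.asarray(fitness_values, dtype=float)
    n_pairs = int(len(fitness_values) * selection_pressure)
    ranked = np.argsort(-fitness_values)               # indices ordered best to worst
    shuffled = rng.permutation(len(fitness_values))    # order independent of fitness
    return list(zip(ranked[:n_pairs].tolist(), shuffled[:n_pairs].tolist()))

rng = np.random.default_rng(2)
pairs = mating_pairs([0.1, 0.9, 0.5, 0.7], selection_pressure=0.5, rng=rng)
```

The first member of every pair is guaranteed to be from the best half, while the second member can be any string, which keeps weaker strings in play.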

For each pair of strings selected for mating, two new strings are generated using a variation of three-point crossover (see Fig. 4). As in the case of simple three-point crossover, the length of each new string is the same as the dimensionality of the data. Unlike simple three-point crossover, our crossover operator is not compelled to preserve order among the exchanged string fragments, which safeguards against the loss of information or features in the population. It will become less likely for the population variability to fall below a critical value due to the additional degree of freedom provided by the reordering. Furthermore, our variation of three-point crossover may be useful in searching for good string arrangements. For example, if the current population has bad ordering, where features with a high synergism are spaced at great distances, simple crossover would probably destroy potentially important allele packets. On the other hand, there is a chance to obtain good allele ordering by using a crossover operator with a reordering algorithm embedded in it.
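One possible reading of this operator is sketched below; the exact procedure is only described in Fig. 4, so the details are assumptions. Both parents are cut at the same three random points, the children take alternating fragments as in simple three-point crossover, and the four fragments of each child are then rejoined in a random order.

```python
import numpy as np

def crossover_with_mixing(parent_a, parent_b, rng):
    """Three-point crossover followed by random reordering of the four fragments."""
    parent_a = np.asarray(parent_a)
    parent_b = np.asarray(parent_b)
    n = len(parent_a)
    cuts = np.sort(rng.choice(np.arange(1, n), size=3, replace=False))
    frags_a = np.split(parent_a, cuts)
    frags_b = np.split(parent_b, cuts)
    # Simple three-point exchange: children alternate fragments from each parent.
    child_1 = [frags_a[0], frags_b[1], frags_a[2], frags_b[3]]
    child_2 = [frags_b[0], frags_a[1], frags_b[2], frags_a[3]]
    # Mixing: rejoin the four fragments of each child in one of 4! possible orders.
    child_1 = np.concatenate([child_1[i] for i in rng.permutation(4)])
    child_2 = np.concatenate([child_2[i] for i in rng.permutation(4)])
    return child_1, child_2

rng = np.random.default_rng(3)
a = rng.integers(0, 2, size=84)
b = rng.integers(0, 2, size=84)
c1, c2 = crossover_with_mixing(a, b, rng)
```

Reordering keeps the string length fixed but moves set bits to new positions, so a child can select features that neither parent contained, which is one way to read the claim about preserving population variability.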

In the last step of reproduction, a mutation operator is applied to the new strings. The mutation probability of the operator is usually set at 0.01, so 1% of the feature subsets are selected at random for mutation. A chromosome marked for mutation has a single random bit flipped, which allows the GA to explore other regions of the parameter space. If the GA finds a better point, the genes from this point can invade the population, with the optimization continuing in a new direction.
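A short sketch of this mutation step, with illustrative names: each chromosome is marked independently with probability 0.01, and a marked chromosome has exactly one randomly chosen bit flipped.

```python
import numpy as np

def mutate(population, mutation_probability, rng):
    """Flip one random bit in each chromosome selected for mutation."""
    population = np.array(population, dtype=np.uint8)   # work on a copy
    chosen = np.flatnonzero(rng.random(len(population)) < mutation_probability)
    for i in chosen:
        j = rng.integers(population.shape[1])
        population[i, j] ^= 1                            # 0 -> 1 or 1 -> 0
    return population, chosen

rng = np.random.default_rng(4)
children = np.zeros((1000, 84), dtype=np.uint8)
mutated, chosen = mutate(children, mutation_probability=0.01, rng=rng)
```

Selecting chromosomes independently means that about 1% are mutated on average, not exactly 1% in every generation.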

Fig. 3. The population of strings is ordered from best to worst based on fitness, while simultaneously a copy of the strings from this same population has been randomized. The top half of the ordered population is mated with strings from the top half of the random population. (F is the fitness; SP is the selection pressure).


Fig. 4. Three-point crossover with mixing. Instead of swapping alleles and simultaneously preserving their position, four chromosome fragments are distributed and recombined at random with four factorial unique possibilities.

The resulting population of strings, both the parents and children, is sorted by fitness, and the top f strings are retained for the next generation. Because the selection criterion used for reproduction exhibits bias for the higher-ranking strings, the new population is expected to perform better on average than its predecessor. The aforementioned reproductive operators, however, also assure a significant degree of diversity in the population, since the crossover points and reordering of exchanged string fragments of each chromosome pair are selected at random.
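The survivor step can be sketched as a simple merge, sort, and truncate. Here `fitness_fn` is a placeholder for whatever scoring function is in use; the toy example scores a chromosome by its number of set bits purely for illustration.

```python
import numpy as np

def next_generation(parents, children, fitness_fn, pop_size):
    """Pool parents and children, then keep the pop_size fittest strings."""
    pool = np.vstack([parents, children])
    scores = np.array([fitness_fn(c) for c in pool])
    keep = np.argsort(-scores, kind="stable")[:pop_size]
    return pool[keep], scores[keep]

parents = np.array([[0, 0, 0], [1, 1, 1]])
children = np.array([[1, 0, 0], [1, 1, 0]])
survivors, survivor_scores = next_generation(parents, children, lambda c: c.sum(), pop_size=2)
```

Because parents compete with their children, the best score in the population can never decrease from one generation to the next as long as the fitness function itself is unchanged.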

4.4. Adjusting internal parameters

The GA is able to concentrate its efforts on classes and samples that are more difficult to classify by boosting their weights (see Fig. 5). There are two stages in boosting. In the first or learning stage, class weights are adjusted relative to each other in order to achieve an optimal configuration. The class hit-rate or CHR (see Eq. (5)), which is the average of the mean sample hit-rate for the samples in the class, is computed. Those classes with lower class hit-rates will be weighted more heavily than classes that score well. The change in the class weights, which is computed using a perceptron (see Eq. (6)), is monitored throughout the run. (P is the momentum parameter for the perceptron and is assigned a value by the user.)


If the average change in the weights is greater than some tolerance, the GA is said to be learning its optimal class weights. Once the tolerance is reached, the class weights are fixed and the sample weights in each class become uniformly distributed according to the class weight. This initiates the second stage. The momentum, which controls the rate at which the sample and class weights are changed (see Eqs. (6) and (7)), is initially assigned a value of 0.8 while the GA is learning, but P is adjusted to 0.4 once the class weights become fixed. These values have been chosen in part because they facilitate learning by the GA but do not cause a particular sample or class to dominate the calculation, which would result in the other samples or classes not contributing to the fitness function.

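$$CHR_g(c) = \mathrm{AVG}\big(SHR_g(s) : \forall s \in c\big) \tag{5}$$

$$CW_{g+1}(c) = CW_g(c) + P\,\big(1 - CHR_g(c)\big) \tag{6}$$

$$SW_{g+1}(s) = SW_g(s) + P\,\big(1 - SHR_g(s)\big) \tag{7}$$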
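Eqs. (5) to (7) translate into a few lines of code. The sketch below assumes, as one reading of the two-stage description, that only class weights change during the learning stage and only sample weights change afterwards; the `learning` flag and the function name are illustrative, and the renormalization of Eqs. (1) and (2) that would follow is left out.

```python
import numpy as np

def boost_weights(class_weights, sample_weights, shr, labels, momentum, learning):
    """Apply Eq. (6) in the learning stage, otherwise Eq. (7)."""
    labels = np.asarray(labels)
    shr = np.asarray(shr, dtype=float)
    cw = dict(class_weights)
    sw = np.asarray(sample_weights, dtype=float).copy()
    if learning:
        for c in cw:
            chr_c = shr[labels == c].mean()             # Eq. (5)
            cw[c] = cw[c] + momentum * (1.0 - chr_c)    # Eq. (6)
    else:
        sw = sw + momentum * (1.0 - shr)                # Eq. (7)
    return cw, sw

labels = [1, 1, 2, 2]
shr = [1.0, 0.5, 0.25, 0.25]
cw_new, _ = boost_weights({1: 50.0, 2: 50.0}, [25.0] * 4, shr, labels, momentum=0.8, learning=True)
_, sw_new = boost_weights({1: 50.0, 2: 50.0}, [25.0] * 4, shr, labels, momentum=0.4, learning=False)
```

In this toy case class 2 has the lower hit rate, so its weight grows faster than that of class 1, and the perfectly classified first sample keeps its weight unchanged.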

4.5. End criterion

During each generation, class and sample weights are updated using the class and sample hit-rates from the previous generation. (g+1 is the current generation, whereas g is the previous generation.) The aforementioned procedure, which involves evaluation, reproduction, and boosting of the potential solutions, is repeated until a specified number of generations are executed or a feasible solution is found.
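The overall control flow can be summarized as a loop with two stopping conditions. This is a schematic only: `evaluate`, `reproduce`, `boost`, and `is_feasible` are hypothetical callables standing in for the components described in the preceding subsections, and the toy arguments below exist just to exercise the loop.

```python
def run_ga(population, evaluate, reproduce, boost, is_feasible, max_generations):
    """Repeat evaluation, reproduction, and boosting until a stop condition is met."""
    for generation in range(max_generations):
        scores = evaluate(population)
        if is_feasible(population, scores):
            return population, generation       # feasible solution found
        population = reproduce(population, scores)
        boost(population)                       # update class and sample weights
    return population, max_generations          # generation budget exhausted

# Toy stand-ins: "chromosomes" are integers and the score is the value itself.
final_population, stopped_at = run_ga(
    population=[0],
    evaluate=lambda pop: list(pop),
    reproduce=lambda pop, scores: [x + 1 for x in pop],
    boost=lambda pop: None,
    is_feasible=lambda pop, scores: max(scores) >= 3,
    max_generations=100,
)
```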

Fig. 5. Block diagram of the boosting algorithm used to adjust the weights of difficult classes and/or samples.


Fig. 6. A plot of the two largest principal components of the 84 GC peaks obtained from the 133 peak matched PyGCs. 1, Normal; and 2, CF.

5. Results and discussion

The first step in the study was to apply principal component analysis [21] to the data. Principal component analysis is a powerful method for uncovering hidden relationships in multivariate data sets. Using this procedure is analogous to finding a new coordinate system that is better at conveying information present in the data than axes defined by the original measurement variables. The new coordinate system is linked to variation in the data. The basis vectors of the new coordinate system are the principal components of the original data. Each principal component is a linear combination of the original measurement variables. Often, only two or three principal components are necessary to explain all of the information present in a data set that has a large number of interrelated measurement variables. Hence, using principal component analysis, the dimensionality of a data set can be reduced, while simultaneously retaining the information present in the data.
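As a concrete illustration of these statements, principal components can be obtained from the singular value decomposition of the mean-centered data. The sketch below is a generic PCA, not the routine used in the study, and it runs on synthetic data built from two underlying factors so that two components carry essentially all of the variance.

```python
import numpy as np

def pca(X, n_components=2):
    """Return scores, loadings, and the fraction of variance explained by each PC."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = U[:, :n_components] * s[:n_components]   # coordinates in the new axes
    loadings = Vt[:n_components]                      # each PC as a linear combination
    return scores, loadings, explained

rng = np.random.default_rng(0)
factors = rng.normal(size=(20, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing             # six interrelated variables driven by two factors
scores, loadings, explained = pca(X)
```

Each row of `loadings` holds the coefficients of one principal component in terms of the original variables, and `explained[:2].sum()` is the quantity reported as cumulative variance for a two-component map.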

Fig. 6 shows a principal component map of the 84 GC peaks obtained from the 133 PyGCs. The map of the two largest PCs of the data explains 72% of the total cumulative variance. Each PyGC is represented as a point in the principal component map. CF homozygote chromatograms are not well separated from the PyGCs of the presumed normals in the map. The overlap of the two classes in the map is not surprising, since it has been reported that PyGCs of human cell lines from diseased and normal donors exhibit an overall qualitative and quantitative similarity [22,23].

A genetic algorithm for pattern recognition was used in this study to uncover features characteristic of the PyGC profile of each class. The genetic algorithm identified features by sampling key feature subsets, scoring their principal component plots, and tracking those samples or classes that were most difficult to classify. The boosting routine used this information to steer the population to an optimal solution. After 100 generations, the genetic algorithm identified ten standardized retention time windows (features 1, 12, 15, 27, 39, 41, 47, 67, 73, and 78), whose principal component map (see Fig. 7) shows clustering of the PyGCs according to the diseased state of the sample. (k was set equal to the number of PyGCs in each class.)

The number of features in each feature subset of the initial population can be a critical parameter. If the feature sets are initially sparse, the probability of including features that are neither good nor bad is low, since the fitness function does not provide additional points for adding them. On the other hand, the probability of removing these features from less sparse feature subsets is also low, since there is no advantage in deleting them. For data sets with a large number of good features, it is probably best not to employ sparse feature subsets in the initial population. Otherwise, it may take thousands of generations to ensure the inclusion of all good features in the solution.

Fig. 7. A principal component map of the 10 GC peaks identified by the genetic algorithm. 1, Normal; and 2, CF. The first principal component explains 40% of the total cumulative variance of the ten GC peaks, and the second principal component explains 25% of the total cumulative variance.

Fig. 8. A map defined by peaks 1 and 15 of the 133 peak matched PyGCs. 1, Normal; and 2, CF.

The feature subsets in the initial population contained on average only eight or nine features. To verify that all ten gas chromatographic peaks identified by the GA were crucial for the classification, we reran the GA using sparser feature subsets in the initial population. For the initial population, the number of features in each subset was reduced on average to five. Rerunning the GA produced an interesting result. Only two peaks (1 and 15) were necessary to correctly classify all of the pyrochromatograms in the data set (see Fig. 8). These two peaks were members of the ten GC peak set uncovered by the GA in a previous run. Since the largest principal component of the ten peaks did not convey information about differences between CF and normal cell lines, one must conclude that the bulk of the variance of the other eight GC peaks is not about CF versus normal. In these situations, information about the desired effect loads on a smaller principal component [24]. This conclusion is reinforced when one examines a principal component map of the two largest principal components of the eight GC peaks, which shows substantively less separation between CF and normals on the second principal component. Clearly, the other eight GC peaks are not necessary to classify the pyrochromatograms in the data set but had been retained by the GA in a previous run because there was no advantage in discarding them.
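The situation in which class information loads on a smaller principal component can be reproduced with synthetic data. The following Python sketch is an illustration under stated assumptions, not an analysis of the CF data: the data are simulated, autoscaling before PCA is assumed, and the helper names `pca_scores` and `class_gap` are hypothetical. Eight simulated peaks share a large source of variance unrelated to class, and two peaks carry the class difference.

```python
import numpy as np

def pca_scores(X):
    """Autoscale X, then return (scores, explained variance ratios) via SVD."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                 # sample coordinates on each PC
    evr = s**2 / np.sum(s**2)      # fraction of total variance per PC
    return scores, evr

def class_gap(scores, labels):
    """Absolute difference of class mean scores in pooled-SD units, per PC."""
    a, b = scores[labels == 0], scores[labels == 1]
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)               # 50 simulated samples per class
y = np.where(labels == 0, -1.0, 1.0)
z = rng.normal(size=100)                     # variation unrelated to class

X = np.empty((100, 10))
X[:, :8] = z[:, None] + 0.5 * rng.normal(size=(100, 8))   # eight "other" peaks
X[:, 8:] = 2.0 * y[:, None] + rng.normal(size=(100, 2))   # two class-related peaks

scores, evr = pca_scores(X)
gap = class_gap(scores, labels)
print("explained variance, PC1 and PC2:", evr[:2])
print("class gap, PC1 and PC2:", gap[:2])
```

In this toy setting the first principal component is dominated by the eight class-unrelated peaks, and the class difference appears mainly on the second component, which mirrors the argument made above for the ten-peak map.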

The linear dependence of peak 1 and peak 15 for some CF samples (see Fig. 8) cannot be attributed to the normalization procedure because of the large number of measurement variables in this data set. Evidently, the pyrolysis products responsible for peaks 1 and 15 are related in some CF samples.

The principal component maps generated by the genetic algorithm indicate differences between CF and normal cell lines. Furthermore, the normals appear to yield a homogeneous cluster, whereas a sub-division is evident in the CF cell lines. Interestingly enough, pyrolysis mass spectrometry (PyMS) was used to characterize the same cell lines [25], and similar results were obtained. Evidently, pyrolysis gas chromatography can yield results for chemical fingerprinting problems that are comparable to PyMS when suitable data analysis techniques are employed to analyze the PyGC data.

6. Conclusions

The results of this study demonstrate that a properly configured genetic algorithm can identify markers or peaks in a pyrochromatogram characteristic of sample type. This is possible because of the unique fitness function and powerful boosting routines used by the GA, which ensure both its efficacy and efficiency. The method of feature selection described in this study has several advantages over conventional methods such as variance or Fisher weights [26,27]. First, it is multivariate. Second, the GA considers many points in the search space simultaneously, and therefore has a reduced chance of converging to a local optimum. Third, the GA makes no assumptions about the geometry of the search space. Fourth, the pattern recognition GA is a multicategory classifier. That is, it has been successfully applied to classification problems involving three, four, five, six, or more classes without loss in efficiency or efficacy [28–30]. Clearly, the computational environment offered by the GA can be readily adjusted to match a particular application, which would explain the popularity of evolutionary computations in data analysis problems involving PyMS data [31–33].
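The first advantage, that the method is multivariate, can be illustrated by contrasting it with a univariate weight. The Python sketch below uses one common two-class form of the Fisher weight, (m1 - m2)^2 / (s1^2 + s2^2); this form is an assumption here and may differ in detail from the definitions in [26,27]. The data are simulated and the function name `fisher_weight` is hypothetical.

```python
import numpy as np

def fisher_weight(x, labels):
    """Two-class Fisher-type weight: squared mean difference over summed variances."""
    a, b = x[labels == 0], x[labels == 1]
    return (a.mean() - b.mean())**2 / (a.var(ddof=1) + b.var(ddof=1))

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 100)          # 100 simulated samples per class
y = np.where(labels == 0, -1.0, 1.0)
t = rng.normal(size=200)                 # large shared variation, unrelated to class

# Two simulated peaks: the class effect is small relative to the shared variation.
x1 = t + 0.3 * y + 0.1 * rng.normal(size=200)
x2 = t - 0.3 * y + 0.1 * rng.normal(size=200)

w1 = fisher_weight(x1, labels)
w2 = fisher_weight(x2, labels)
w_joint = fisher_weight(x1 - x2, labels)  # a direction that uses both peaks

print("Fisher weight, peak 1 alone:", w1)
print("Fisher weight, peak 2 alone:", w2)
print("Fisher weight, peak 1 minus peak 2:", w_joint)
```

Each simulated peak receives a low weight when scored alone, yet the two together separate the classes well. A univariate ranking would tend to discard such a pair, whereas a method that evaluates feature subsets jointly can retain it.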

References

[1] P.B. Smith, A.J. Pasztor, M.L. McKelvy, D.M. Meunier, S.W. Froelicher, F.C.Y. Wang, Anal. Chem. 69 (1997) 95R.
[2] S.L. Morgan, B.E. Watt, R.C. Galipo, Characterization of microorganisms by pyrolysis GC, pyrolysis GC/MS, and pyrolysis MS, in: T. Wampler (Ed.), Applied Pyrolysis Handbook, Plenum Press, NY, 1995.
[3] E. Reiner, F. Bayer, J. Chromatogr. Sci. 10 (1978) 623.
[4] T.A. Roy, Ann. Lett. B11 (1978) 175.
[5] E. Reiner, J. Hicks, Chromatographia 5 (1972) 525.
[6] C.S. Gutteridge, J.R. Norris, Appl. Environ. Microbiol. 40 (1980) 462.
[7] G.L. French, C.S. Gutteridge, I. Phillips, J. Appl. Bacteriol. 49 (1980) 505.
[8] G. Blomquist, E. Johansson, B. Soderstrom, S. Wold, J. Anal. Appl. Pyrolysis 1 (1979) 53.
[9] S.L. Morgan, C.Al. Jacques, Anal. Chem. 54 (1982) 741.
[10] S. Wold, J. Anal. Appl. Pyrolysis 1 (1979) 67.
[11] B. Soderstrom, S. Wold, G. Blomquist, J. Gen. Microbiol. 128 (1982) 1773.
[12] G. Blomquist, E. Johansson, B. Soderstrom, S. Wold, J. Chromatogr. 173 (1979) 7.
[13] G. Blomquist, E. Johansson, B. Soderstrom, S. Wold, J. Chromatogr. 173 (1979) 19.
[14] W. Ehuis, P.G. Kistemaker, H.L.C. Meuzelaar, J. Anal. Appl. Pyrol. 1 (1977) 151.
[15] E. Kulik, M. Kaljurnd, M. Koel, J. Chromatogr. 112 (1975) 297.
[16] Y. Fruend, Inf. Comput. 121 (1996) 256.
[17] P.D. Wasserman, Neural Computing, Van Nostrand Reinhold, NY, 1989.
[18] Y. Ben-Yoseph, C.L. DeFranco, H.L. Nadler, Biochim. Biophys. Acta 718 (1982) 172.
[19] R. Sahota, S.L. Morgan, Anal. Chem. 65 (1993) 70.
[20] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[21] I.T. Jolliffe, Principal Component Analysis, Springer Verlag, NY, 1986.
[22] J.A. Pino, Pyrochromatography of Human Skin Fibroblasts: Normal Subjects vs Cystic Fibrosis Heterozygotes, PhD Thesis, Cornell University, Ithaca, NY, 1984.
[23] P.C. Jurs, B.K. Lavine, T.R. Stouch, NBS J. Res. (December), (1985) 543.
[24] B.K. Lavine, H.T. Mayfield, P.R. Kroman, A. Faruque, Anal. Chem. 67 (1995) 3846–3852.
[25] L.K. Helfend, Pyrolysis/Gas Chromatography of Human Skin Fibroblasts: Cystic Fibrosis vs. Normal, PhD Thesis, University of California, Santa Cruz, 1983, Appendix G.
[26] B.R. Kowalski, Comput. Chem. Biochem. Res. 2 (1974) 1.
[27] R.A. Fisher, Ann. Eugen. (London) 7 (1936) 179.
[28] B.K. Lavine, A.J. Moores, Buletin Kimia 12 (2) (1997) 73–86.
[29] B.K. Lavine, A.J. Moores, H.T. Mayfield, A. Faruque, Anal. Lett. 31 (1998) 2805.
[30] B.K. Lavine, A.J. Moores, H.T. Mayfield, A. Faruque, Micro. J. 61 (1999) 69.
[31] D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland, D.B. Kell, Anal. Chim. Acta 348 (1997) 71–86.
[32] R.J. Gilbert, R. Goodacre, A.M. Woodward, D.B. Kell, Anal. Chem. 69 (1997) 4381.
[33] J. Taylor, R. Goodacre, W.G. Wade, J.J. Rowland, D.B. Kell, FEMS Microbiol. Lett. 160 (1998) 237.
