Transcript
Page 1: Screening Alternative Degreasing Solvents Using Multivariate Analysis

Screening Alternative Degreasing Solvents Using Multivariate Analysis

C. TREVIZO,† D. DANIEL,‡ AND N. NIRMALAKHANDAN*,†

Civil, Agricultural, and Geological Engineering Department and University Statistics Center, New Mexico State University, Las Cruces, New Mexico 88003

Multivariate analysis was used to explore physicochemical properties of organic chemicals that would characterize and identify degreasing solvents. The exploratory techniques used in this study include cluster analysis, discriminant function analysis, and canonical discriminant analysis. Out of a compilation of 16 physicochemical properties evaluated, aqueous solubility, Henry’s constant, and surface tension were identified as relevant properties that could effectively screen degreasing solvents from among 30 chemicals of similar chemical classes. The suitability of these three properties and the multivariate techniques used in classifying degreasing solvents were demonstrated on an external testing set of 10 solvent- and nonsolvent-type chemicals. On the basis of the results of these studies, canonical discriminant analysis is recommended as a potential tool for screening purposes. The cluster analysis procedure was informative for explorative purposes; the discriminant function analysis procedure was not efficient in separating solvents from others.

Introduction

Solvents are a class of chemicals that can dissolve specific components or break down certain chemicals in a complex mixture into more elementary forms. Because of this property, solvents have been used widely in various applications ranging from cleaning, degreasing, coating, painting, and extracting to chemical processing, manufacturing, and equipment maintenance (1, 2). In addition to their direct use in the industry, numerous commercial formulations and products containing solvents are used on a daily basis in the domestic, commercial, institutional, and military sectors. Common specific uses of solvents include mobilization of solids; preparation of reactants; application of particles onto a surface for coating; extraction of oil, flavors, and fragrances; thinners for paints, oils, and ink; adhesive for plastics; cleaning printed circuit boards and machine parts; dry cleaning of garments; decaffeinating coffee; etc. (3).

Over 30 different synthetic organic chemicals have been used as degreasing solvents. It is estimated that the annual use of the five most commonly used solvents [viz., trichloroethylene (TCE), tetrachloroethylene (PCE), methylene chloride, 1,1,1-trichloroethane (TCA), and trichlorotrifluoroethane (CFC 113)] in the United States is around 800 000 t (4). Such large usage as well as improper storage and disposal of spent solvents over the past decades have resulted in their release into the environment, contaminating soils, groundwater, and the atmosphere.

Because of their toxic, persistent, and recalcitrant nature, environmental contamination by degreasing solvents has emerged as one of the serious problems in the industrialized world. Recent studies have confirmed that many of the current solvents are hazardous to humans and harmful to the environment, causing (or suspected to cause) cancer, smog formation, ozone depletion, etc. As such, many of the common solvents are now targets of public concern and regulatory control. The Environmental Protection Agency (EPA) has included over 20 solvents in its list of 127 priority pollutants. The Clean Air Act Amendments of 1990 have listed several solvents as hazardous air pollutants (HAPs). The emissions of the most common solvents (viz., methylene chloride, PCE, TCE, TCA, carbon tetrachloride, and chloroform) are now regulated by 40 CFR, Parts 9 and 63, under the Toxic Release Inventory (TRI) program, whereby industries are now required to report to the EPA on their production and transfers.

In an effort to minimize the release and environmental impacts of solvents, industries are being forced to adopt process modifications, recycling, and reuse of solvents on one hand and to develop environment-friendly substitute solvents on the other (2). In seeking substitutes or designing new ones, it is important to identify or develop solvents that have the desired degreasing characteristics and, at the same time, are nontoxic, are readily biodegradable, and pose minimal threat to the environment. Evaluation of solvents that are in current use in terms of their physical and chemical properties is the first step to characterize the desired features of a good solvent and to effectively develop a “greener” and efficient substitute solvent.

Selection of substitute solvents is not a straightforward task because no single physicochemical property relates to solvent characteristics. The search for alternate solvents has been “characterized as Edisonian” because of the trial and error nature of the experimental evaluation of numerous potential alternatives (5). While acknowledging this process to be a significant technical challenge, Zhao and Cabezas (2) have identified the following three steps in developing substitute solvents:

step 1: to determine the substitute candidates or the replacement formulations;

step 2: to do performance and evaluation tests; and

step 3: to do the full scale test.

The first step has been recognized as the most important and most difficult one. Efforts of previous workers in fulfilling the first step have been classified into three categories by Zhao and Cabezas (2): (i) screening of available solvent databases for single chemical substitutes; (ii) using computer-based molecular designing tools to develop new chemicals with the desired properties; and (iii) designing mixtures of available chemicals to achieve desired properties. Several special purpose computer software tools have been developed and are being applied for this purpose (2, 6). The first approach of screening databases is a simpler approach and can also enhance the effectiveness of the other two methods. Irrespective of the approach, identification of desired properties for a given application is a prerequisite in seeking substitute chemicals. One of the objectives of this study was to identify physicochemical properties of good solvents.

* Corresponding author fax: (505)646-6049; e-mail: [email protected].

† Civil, Agricultural, and Geological Engineering Department.

‡ University Statistics Center.

Environ. Sci. Technol. 2000, 34, 2587-2595

10.1021/es9912832 CCC: $19.00 2000 American Chemical Society. Published on Web 05/16/2000

VOL. 34, NO. 12, 2000 / ENVIRONMENTAL SCIENCE & TECHNOLOGY 2587

The second objective of this study was to develop a screening process based on statistical multivariate analysis of physicochemical properties of good solvents. The screening of substitute solvents remains a subjective process, depending on the application and the experience of end-users. Two of the commonly employed methods are the weighted-sum evaluation method and the pass/fail screening method. In the first method, quantifiable screening criteria weighted by appropriate weighting factors are summed up and compared for the alternatives. The criteria used are indirect measures of the overall effectiveness of the solvent. Some examples of criteria are reductions in raw material input, waste quantity, operational hazards, costs, etc. (4).
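The weighted-sum evaluation can be sketched in a few lines of Python. This is only an illustration: the criterion names follow the examples in the text, but the weights, the 1-5 scores, and the two candidate names are hypothetical and do not come from the study.

```python
# Hypothetical weighting factors for four criteria (illustration only).
WEIGHTS = {
    "raw_material_reduction": 0.3,
    "waste_reduction": 0.3,
    "hazard_reduction": 0.2,
    "cost_reduction": 0.2,
}

# Hypothetical scores on a 1-5 scale for two made-up alternatives.
ALTERNATIVES = {
    "solvent_A": {"raw_material_reduction": 4, "waste_reduction": 2,
                  "hazard_reduction": 5, "cost_reduction": 1},
    "solvent_B": {"raw_material_reduction": 3, "waste_reduction": 4,
                  "hazard_reduction": 2, "cost_reduction": 4},
}


def weighted_sum(scores, weights=WEIGHTS):
    """Sum of the criterion scores, each multiplied by its weighting factor."""
    return sum(weights[name] * scores[name] for name in weights)


# Rank the alternatives from highest to lowest weighted sum.
ranked = sorted(ALTERNATIVES,
                key=lambda name: weighted_sum(ALTERNATIVES[name]),
                reverse=True)
print(ranked)
```

The ranking depends entirely on the chosen weights, which is one reason the text describes this kind of screening as subjective.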

The second method involves a step-by-step evaluation of the alternatives against yes/no or pass/fail type of criteria. Those that satisfy all the criteria are then selected for further testing. Examples of criteria might be as follows: is flash point less than or greater than 140 °C, is dielectric strength less than or greater than 20 kV, etc. (7). Proposed solvents that pass the necessary criteria are then evaluated further under field conditions.
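A pass/fail screen of this kind might be sketched as follows. The two thresholds echo the examples in the text (140 °C and 20 kV), but the direction of each test, the field names, and the candidate data are assumptions made purely for illustration.

```python
# Each criterion is a (label, test) pair. The thresholds come from the
# examples in the text; the "greater than" direction is an ASSUMPTION.
CRITERIA = [
    ("flash point above 140 C", lambda s: s["flash_point_c"] > 140),
    ("dielectric strength above 20 kV", lambda s: s["dielectric_kv"] > 20),
]

# Hypothetical candidates with made-up property values.
CANDIDATES = {
    "candidate_1": {"flash_point_c": 150, "dielectric_kv": 25},
    "candidate_2": {"flash_point_c": 60, "dielectric_kv": 30},
}


def screen(candidate):
    """Return (passed, failed_labels) for one candidate."""
    failed = [label for label, test in CRITERIA if not test(candidate)]
    return (not failed), failed


# Only candidates that satisfy every criterion go on to further testing.
shortlist = [name for name, props in CANDIDATES.items() if screen(props)[0]]
print(shortlist)
```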

An expert system software named SAGE is now available to aid in the screening process (http://clean.RTI.org/sol_alt). Users can run SAGE online over the Internet or download it to run on desktop computers to identify possible alternate solvents. This software first prompts the user to specify the material, nature, and shape of the part or surface to be cleaned; the contaminants to be cleaned; the degree of cleaning expected; the process configuration; etc. It then recommends a list of possible alternate solvents and processes that best satisfy the input data.

The ultimate aim of this study was to develop and validate an alternate screening process to aid the substitute solvent search process. A statistical exploratory approach involving multivariate analysis procedures is adopted in this study. The following procedures are used: cluster analysis, discriminant function analysis, and canonical discriminant analysis.

Materials and Methods

A training data set of 45 common solvent and nonsolvent chemicals was initially compiled as the starting point for this study. The following physicochemical properties for these chemicals were compiled from handbooks (e.g., refs 8-11) and literature (e.g., refs 12 and 13): boiling point (BoilPt), melting point (MeltPt), molecular weight (MW), octanol/water partition coefficient [log(P)], water solubility [log(S)], vapor pressure (VP), Henry’s law constant [log(HC)], surface tension (ST), solubility parameter (SolP), autoignition temperature (AT), excess molar refraction (R), solute dipolarity (π), effective hydrogen-bond acidity (β), effective hydrogen-bond basicity (α), and the characteristic volume of McGowan (MolarV). The significance of each of these parameters has been discussed elsewhere (e.g., refs 2, 10, and 13). In addition, calculated values of zero-order and first-order simple and valence molecular connectivity indexes (⁰χ, ⁰χᵛ, and ¹χᵛ) were also adopted as additional properties (14).

From the initial 45 chemicals identified, only 30 chemicals could be evaluated in this study as a training set because not all of the 18 physicochemical properties were available for the others. The solubility parameter and autoignition temperature could not be found for several of the remaining 30 chemicals in the training set, so these parameters were discarded as variables in the analyses. Each of the remaining 30 chemicals was then identified as a “good” solvent or a nonsolvent based upon recommendations in solvent handbooks and usage in industry. The final training set thus consisted of a total of 30 chemicals, classified into 22 good solvents and 8 nonsolvents, each having 16 physicochemical properties.

Additionally, a testing data set of 10 chemicals consisting of solvent and nonsolvent types was assembled to test any screening processes developed from the multivariate analysis procedures evaluated in this study. Because of the difficulty in identifying solvents and nonsolvents for which the physicochemical properties examined in this study were readily available, preliminary cluster analyses were performed prior to forming the testing set. These cluster analyses identified water solubility, Henry’s law constant, surface tension, and the zero-order valence molecular connectivity index as physicochemical properties that were likely to be useful in the evaluation process. Thus, the 10 chemicals in the testing data set were selected based upon the availability of these four physicochemical properties, while also striving to obtain testing chemicals that greatly varied in their ability to act as a solvent (e.g., propane is an extreme that must be classified as a nonsolvent by any reasonable method). Table 1 lists the 30 training set chemicals along with the 10 testing set chemicals.

To evaluate the screening method developed from cluster analysis, the 10 testing chemicals were added to the training set of 30 chemicals, and the cluster analysis process was repeated, noting the placement of the test chemicals in the dendrogram relative to the good solvents and the nonsolvents of the training set. Evaluation of the method based on discriminant analysis was straightforward, simply giving a predicted classification of each testing set chemical accompanied by a (posterior) probability associated with the classification. A canonical discriminant analysis technique gave a distance measure for each of the testing chemicals, which was then compared to the distance measures of the good solvents and nonsolvents from the training set.

Cluster analysis, its accompanying graphs, and discriminant function analysis were conducted in JMP (SAS Institute Inc.). All other graphs and canonical discriminant analysis were run in SAS (SAS Institute Inc.). All computations and graphs in both JMP and SAS were carried out on a 233-MHz Apple Macintosh G3-based computer.

Cluster Analysis. Hierarchical cluster analysis is a common, multivariate pattern recognition technique used to group observations together according to their proximity to one another in the multidimensional space defined by the variables being studied (15). A cluster is defined to be either a single point or multiple points grouped together because of their relative closeness. To determine the closeness of two clusters, one must define a multidimensional measure of distance between the clusters and also the point of reference in each cluster between which the distance is measured. In most studies, though not all, the Euclidean distance based on standardized variables is used as a distance measure. The points of reference, between which the distance of two clusters is measured, define the cluster analysis “method”. The centroid method measures the distances between the means of each cluster. The nearest-neighbor method measures distances between two observations, one from each cluster, that are closer than any other such pair. This study investigated use of the centroid method but ultimately found the nearest-neighbor method with a Euclidean distance measure to be more effective.
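As a rough illustration of the nearest-neighbor (single-linkage) method on standardized variables, the sketch below clusters five chemicals using their log(S) and ST values from Table 1. It is a plain-Python toy, not the JMP procedure used in the study; the function names and the choice of five chemicals are assumptions made here.

```python
import math
from statistics import mean, stdev

# (log(S), ST) values copied from Table 1 for five training chemicals.
DATA = {
    "ethanol": (5.77, 22.39),
    "methanol": (6.06, 22.50),
    "benzene": (3.25, 28.88),
    "toluene": (2.73, 28.52),
    "propane": (1.79, 7.02),
}


def standardize(rows):
    """Convert each column to z-scores (mean 0, sample std 1)."""
    stats = [(mean(col), stdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


def single_linkage(names, points):
    """Merge the two closest clusters until one remains.

    Cluster distance is the smallest Euclidean distance between any two
    members, one from each cluster. Returns the merge list
    (distance, members_a, members_b), i.e. an amalgamation schedule.
    """
    clusters = [[i] for i in range(len(names))]
    schedule = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        schedule.append((d, [names[i] for i in clusters[a]],
                         [names[i] for i in clusters[b]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return schedule


schedule = single_linkage(list(DATA), standardize(list(DATA.values())))
for distance, left, right in schedule:
    print(f"{distance:.3f}  {left} + {right}")
```

In this small subset the two alcohols merge first and propane joins last, which is the kind of late, isolated merge the study reports for nonsolvents.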

The results of the cluster analysis are displayed in two graphs: a dendrogram and an amalgamation schedule. The dendrogram is a tree diagram connecting all the observations, which are listed to the left, and illustrates the relationship between the clusters that are formed. Clusters that are connected by lower branches on the tree are closer than clusters that are connected by higher branches. The amalgamation schedule is a line chart whose vertexes are horizontally aligned with the dendrogram’s connecting branches, having one vertex associated with each merging of two clusters. The vertical distance between two vertexes indicates the distance between the two clusters that are merged by the corresponding branch. A plateau across vertexes indicates strong similarity among observations, while a sudden change in height indicates a large difference between the adjacent clusters.

In this study, hierarchical cluster analysis was used iteratively on the training data set. Analyses began by using the entire set of the physicochemical variables discussed above to cluster the chemicals from the training set. Numerous subsets of the physicochemical variables were then tried, including all pairs and triplets, until a minimal subset of variables was found that clustered “good” solvents apart from nonsolvents.
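To give a sense of the size of this search, the sketch below enumerates every pair and triplet of the 16 retained variables. The short variable labels (for example `Hacid`, `Hbase`, `chi0v`) are names chosen here for illustration, and the judgment of whether a given subset clusters the solvents well is left out because the study made it by inspecting dendrograms.

```python
from itertools import combinations

# Illustrative labels for the 16 variables kept after SolP and AT were dropped.
VARIABLES = [
    "BoilPt", "MeltPt", "MW", "logP", "logS", "VP", "logHC", "ST",
    "R", "pi", "Hacid", "Hbase", "MolarV", "chi0", "chi0v", "chi1v",
]


def candidate_subsets(variables, sizes=(2, 3)):
    """Yield every subset of the requested sizes, smallest subsets first."""
    for k in sizes:
        yield from combinations(variables, k)


subsets = list(candidate_subsets(VARIABLES))
n_pairs = sum(1 for s in subsets if len(s) == 2)
n_triplets = sum(1 for s in subsets if len(s) == 3)
print(n_pairs, n_triplets, len(subsets))
```

With 16 variables there are 120 pairs and 560 triplets, so an exhaustive look at small subsets is feasible even when each clustering is judged by eye.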

Discriminant Function Analysis. Discriminant function analysis is a multivariate analysis technique used for classification of each observation into one of multiple subpopulations based upon its location in the multidimensional space defined by the variables in the data set (15). A training set, with observations already correctly classified into populations and having standardized variables, is used to develop “discriminant functions” that provide rules for classifying other observations not in the training set. In situations where there are two populations to be discriminated between (such as the case in this study, where chemicals are classified as good solvents and nonsolvents), discriminant function analysis determines a single axis in the multidimensional space along which the greatest Euclidean distance is measured between the two population means relative to the variability of the observations within each of the two populations.
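For two populations and two variables, such an axis can be computed by hand. The sketch below uses the standard Fisher construction (pooled within-group covariance inverse times the difference of group means) on a small, hand-picked subset of Table 1 rows with unstandardized log(S) and ST values. It is an assumption-laden toy and does not reproduce the JMP analysis or its results.

```python
from statistics import mean

# (log(S), ST) from Table 1: ethanol, benzene, acetone, chloroform.
GOOD = [(5.77, 22.39), (3.25, 28.88), (6.00, 23.04), (3.90, 26.67)]
# 1,2-dichloroethane, 1,1,2-trichloroethane, cyclohexanone (hand-picked).
NON = [(3.93, 32.57), (3.65, 35.37), (4.36, 35.19)]


def centroid(rows):
    return [mean(col) for col in zip(*rows)]


def scatter(rows, center):
    """2x2 sums of squares and cross products about `center`."""
    return [[sum((r[i] - center[i]) * (r[j] - center[j]) for r in rows)
             for j in range(2)] for i in range(2)]


def fisher_direction(group_a, group_b):
    """Pooled-covariance inverse times (mean_a - mean_b), two variables."""
    ma, mb = centroid(group_a), centroid(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    dof = len(group_a) + len(group_b) - 2
    w = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det], [-w[1][0] / det, w[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def score(point, direction):
    """Discriminant score: position of the point along the axis."""
    return point[0] * direction[0] + point[1] * direction[1]


direction = fisher_direction(GOOD, NON)
# With equal priors the cutoff sits midway between the two group mean scores.
cutoff = (score(centroid(GOOD), direction) + score(centroid(NON), direction)) / 2
print(direction, cutoff)
```

Scores above the cutoff fall on the good-solvent side. The clean split seen here reflects the hand-picked rows and should not be read as evidence about the full training set.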

The discriminant function is just one of several important components in classification using discriminant function analysis. The discriminant function yields a “discriminant score” for each observation, which is the location of the observation along the discriminant axis. An observation with a discriminant score below a determined cutoff value is classified into one population, while a discriminant score above the cutoff value will classify an observation in the other population. The cutoff value is determined so that classification coincides with the population having the highest “posterior probability”, that is, the estimated probability that an observation is from a specific population given its observed discriminant score. Aside from determining the classification of an observation, posterior probabilities are informative because they estimate the confidence in the classification of the observation into a particular population.
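One common way to turn a discriminant score into a posterior probability is sketched below. It assumes a two-group linear discriminant with normally distributed groups sharing one covariance matrix, with the score scaled so that the score minus the midpoint cutoff equals the log-likelihood ratio. Under those assumptions the posterior is a logistic function of the score. The function name and the prior are illustrative, and JMP's exact computation may differ.

```python
import math


def posterior_good(score, cutoff, prior_good=0.5):
    """Posterior probability of the 'good solvent' group.

    Assumes (score - cutoff) is the log-likelihood ratio of the two groups,
    as it is for a linear discriminant with a shared covariance matrix.
    """
    log_odds = (score - cutoff) + math.log(prior_good / (1.0 - prior_good))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical scores relative to a cutoff of 0.
for s in (-2.0, -0.2, 0.0, 0.2, 2.0):
    print(s, round(posterior_good(s, 0.0), 3))
```

Scores close to the cutoff give posteriors near 0.5, which corresponds to a weakly supported classification.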

Validation of the developed classification rules is often accomplished in two manners. Misclassification rates are reported for the training set, indicating the proportion of observations from each population that were misclassified by the classification rules. Also, the classification rules developed using the training set can be applied to a testing data set whose correct classifications are known, and misclassification rates again reported. In this study, both types of validation were examined. Discriminant function analysis was performed, and corresponding misclassification rates were examined, using the full complement of variables with the training set. However, to examine misclassification rates for the testing data set, the discriminant function analysis was limited to using the variables available in the testing data set as discussed above.

TABLE 1. Training and Test Chemicals and Their Identification Code, Solvent Classification (Good Solvent, Nonsolvent, or Test Chemical), log(S), log(HC), ST, and ⁰χᵛ Values

data set   no.  chemical                 class  log(S)  log(HC)  ST      ⁰χᵛ
training   1    butanol                  good   4.87    -5.06    25.67   3.56
training   2    ethanol                  good   5.77    -5.20    22.39   2.15
training   3    2-propanol               good   6.00    -4.91    22.40   3.02
training   4    methanol                 good   6.06    -3.87    22.50   1.44
training   5    benzene                  good   3.25    -2.27    28.88   3.46
training   6    toluene                  good   2.73    -2.23    28.52   4.38
training   7    1,2-xylene               good   2.24    -2.29    30.31   5.30
training   8    carbon tetrachloride     good   2.91    -1.52    27.65   5.03
training   9    chlorobenzene            good   2.59    -2.34    32.93   4.68
training   10   1,1,1-trichloroethane    good   3.18    -2.10    25.14   4.90
training   11   methylene chloride       good   4.29    -2.61    28.77   2.97
training   12   perchloroethylene        good   2.17    -1.57    31.65   5.53
training   13   trichloroethylene        good   3.04    -1.99    32.00   4.47
training   14   ethyl acetate            good   4.81    -3.92    24.00   4.02
training   15   acetone                  good   6.00    -4.18    23.04   2.90
training   16   methyl ethyl ketone      good   5.38    -4.98    23.96   3.61
training   17   methyl isobutyl ketone   good   4.31    -4.03    24.74   5.19
training   18   chloroform               good   3.90    -2.36    26.67   3.97
training   19   1,1-dichloroethane       good   3.70    -2.23    24.66   3.84
training   20   1,2-dichloroethane       non    3.93    -3.01    32.57   3.68
training   21   1,1,2-trichloroethane    non    3.65    -2.92    35.37   4.68
training   22   cyclohexane              non    1.74    -0.71    26.43   4.24
training   23   acetic acid              non    6.78    -7.00    27.59   2.35
training   24   n-butyl acetate          good   3.83    -3.50    25.41   5.43
training   25   cyclohexanone            non    4.36    -4.92    35.19   4.52
training   26   diethylamine             good   5.89    -4.59    22.39   3.91
training   27   triethylamine            non    4.74    -3.86    20.72   5.56
training   28   propanol                 good   5.34    -5.16    23.71   2.86
training   29   methyl chloride          non    3.77    -2.08    15.19   2.13
training   30   propane                  non    1.79    -0.16    7.02    2.70
testing    31   pentachloroethane        test   2.70    -2.71    34.37   6.74
testing    32   n-pentane                test   1.58    0.10     15.47   4.12
testing    33   n-octanol                test   2.73    -4.60    28.20   4.98
testing    34   n-butylcyclohexane       test   -0.75   0.13     26.51   6.53
testing    35   3-ethylhexane            test   -0.15   0.63     21.08   6.40
testing    36   ethylbenzene             test   2.22    -2.17    28.59   2.78
testing    37   trichlorobenzene         test   1.54    -2.53    44.66   6.84
testing    38   propionic acid           test   6.00    -6.03    26.20   3.36
testing    39   valeric acid             test   4.38    -5.87    26.81   4.47
testing    40   acrylic acid             test   6.00    -6.39    47.13   2.63

Canonical Discriminant Analysis. Canonical discriminant analysis is a multivariate dimension reduction technique (15). This method results in a series of variables (or axes) called “canonical variables”, each being a linear combination of the original variables. The first canonical variable gives the maximum possible separation between the population means (or more precisely, maximizes the between population variability) relative to the within population variability. Each subsequent canonical variable in the series increases this relative separation as much as possible but contributes less than previous canonical variables in the series.
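The first canonical variable can be obtained as the leading eigenvector of W⁻¹B, where W and B are the within-group and between-group scatter matrices. The sketch below does this by hand for two variables, log(S) and ST, using a few Table 1 rows. The assignment of nonsolvents to "high" and "low" groups is an assumption made here for illustration, the direction is left unscaled, and none of this reproduces the SAS analysis of the study.

```python
import math
from statistics import mean

# (log(S), ST) pairs copied from Table 1. The high/low split of the
# nonsolvents is an ASSUMPTION made for this illustration.
GROUPS = {
    "good": [(5.77, 22.39), (3.25, 28.88), (6.00, 23.04), (3.90, 26.67)],
    "high": [(3.93, 32.57), (3.65, 35.37), (4.36, 35.19)],
    "low": [(1.79, 7.02), (3.77, 15.19), (4.74, 20.72)],
}


def centroid(rows):
    return [mean(col) for col in zip(*rows)]


def first_canonical_direction(groups):
    """Leading eigenvector of W^-1 B for two variables (defined up to scale)."""
    grand = centroid([row for rows in groups.values() for row in rows])
    W = [[0.0, 0.0], [0.0, 0.0]]  # within-group scatter
    B = [[0.0, 0.0], [0.0, 0.0]]  # between-group scatter
    for rows in groups.values():
        c = centroid(rows)
        shift = [c[0] - grand[0], c[1] - grand[1]]
        for i in range(2):
            for j in range(2):
                W[i][j] += sum((r[i] - c[i]) * (r[j] - c[j]) for r in rows)
                B[i][j] += len(rows) * shift[i] * shift[j]
    det_w = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    w_inv = [[W[1][1] / det_w, -W[0][1] / det_w],
             [-W[1][0] / det_w, W[0][0] / det_w]]
    M = [[sum(w_inv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    trace = M[0][0] + M[1][1]
    det_m = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # Larger root of the 2x2 characteristic polynomial.
    eigenvalue = (trace + math.sqrt(max(trace * trace - 4.0 * det_m, 0.0))) / 2.0
    vector = [M[0][1], eigenvalue - M[0][0]]
    if abs(vector[0]) + abs(vector[1]) < 1e-12:  # fall back to the second row
        vector = [eigenvalue - M[1][1], M[1][0]]
    if vector[1] < 0:  # fix the arbitrary sign so that ST loads positively
        vector = [-vector[0], -vector[1]]
    return vector, eigenvalue


def can1(point, direction):
    return point[0] * direction[0] + point[1] * direction[1]


direction, ratio = first_canonical_direction(GROUPS)
group_means = {name: can1(centroid(rows), direction)
               for name, rows in GROUPS.items()}
print(direction, ratio, group_means)
```

The eigenvalue is the ratio of between-group to within-group variability along the axis, which is the quantity the first canonical variable maximizes.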

Canonical discriminant analysis differs from discriminant function analysis in several fundamental ways. First, being a dimension reduction technique, it does not directly yield classification rules. Second, discriminant function analysis uses each of the axes it yields to iteratively separate a population from another population or from a group of other populations, while each axis (or canonical variable) in canonical discriminant analysis contributes to the separation of all populations. And finally, for m populations, discriminant function analysis develops and requires m - 1 axes, whereas canonical discriminant analysis allows the user to select and utilize as many of the canonical variables (or axes) as desired, up to a maximum of m - 1, until the desired level of separation is obtained.

This study used canonical discriminant analysis to develop a single axis that would yield maximum separation between the good solvent and the nonsolvent populations. The implementation was refined by further classifying the nonsolvents into two groups based on the direction in which they lay relative to the good solvents. The entire complement of physicochemical variables was used to study the training data set, but examination of the testing data set was limited to the variables defined in the testing data set as discussed previously.

Results and Discussion

Initial Investigation. Initial investigations examined two-dimensional scatterplots and sample correlations of all 16 variables for the 30 training set chemicals. Both the correlation coefficients and the scatterplots revealed strong pairwise relationships among several groups. Table 2 displays sets of variables among which the sample correlation coefficients were at least 0.7, 0.8, and 0.9, respectively. Particularly strong correlations existed among the sets [log(S), log(HC), log(P)] and (MolarV, ⁰χ, ⁰χᵛ, ¹χᵛ), where each pair in the set had a correlation coefficient greater than 0.9. These strong relationships reduce the need to have several of the variables from each set in analyses and give possible insight into results of some analyses such as the canonical discriminant analysis. Also of interest is the correlation between the solubility parameter (SolP) and the solvatochromic parameter, R, of 0.74, suggesting that SolP may not have contributed substantially beyond the contribution of R had it been usable in the analyses.
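A correlation screen of this kind can be sketched as below, using three columns of Table 1 for eight training chemicals. The sketch assumes that the thresholds apply to the magnitude of the correlation, since log(S) and log(HC) move in opposite directions. The subset of chemicals and the function names are choices made here, so the values will not match those computed in the study on all 30 chemicals.

```python
import math
from itertools import combinations

# Columns copied from Table 1 for eight training chemicals (ethanol, methanol,
# benzene, toluene, carbon tetrachloride, acetone, cyclohexane, acetic acid).
COLUMNS = {
    "log(S)": [5.77, 6.06, 3.25, 2.73, 2.91, 6.00, 1.74, 6.78],
    "log(HC)": [-5.20, -3.87, -2.27, -2.23, -1.52, -4.18, -0.71, -7.00],
    "ST": [22.39, 22.50, 28.88, 28.52, 27.65, 23.04, 26.43, 27.59],
}


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold):
    """Variable pairs whose correlation magnitude reaches the threshold."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) >= threshold]


for level in (0.7, 0.8, 0.9):
    print(level, correlated_pairs(COLUMNS, level))
```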

Cluster Analysis. Cluster analysis was performed on both the training data set and the combined (training and testing) data set. The initial strategy of the cluster analysis was 2-fold. The first objective was to determine a minimal subset of the 16 variables that separates the “good” solvents from the nonsolvents through exploratory use (i.e., repeatedly adding or eliminating one variable at a time until reasonable clusterings were obtained). That is, we sought a small subset of variables that was sufficient to cluster the good solvents together but away from the nonsolvents. The second objective was to see if the chemicals from the testing data set could be appropriately clustered with the good solvents and nonsolvents using the minimal set. The purpose in doing this was to determine how few variables might be needed to separate the good solvents from the nonsolvents. Knowing this could potentially better focus the direction of future investigations on screening methodologies as well as simplify implementation of these techniques in practice.

In pursuing the first objective, both centroid and nearest-neighbor methods were employed, but the centroid method was not as fruitful, and the nearest-neighbor method was ultimately adopted. The first objective led to two sets of variables: [log(S), ST, ⁰χᵛ] and [log(S), ST]. The clusterings resulting from the two variable sets were somewhat different. Perhaps the most notable difference was that the variable set containing ⁰χᵛ clustered 1,2-dichloroethane among the good solvents, while the variable set without ⁰χᵛ clustered it among the nonsolvents. Unfortunately, inclusion of the test chemicals caused clustering of the training chemicals to rearrange substantially when ⁰χᵛ was used, including the placement of three good solvents among the nonsolvents.

Inclusion of the test chemicals did not substantially alter clustering of the training chemicals when ⁰χᵛ was not a factor, and this led to a preference for the [log(S), ST] variable set. Figure 1 shows the dendrogram and associated amalgamation schedule graph (bottom of figure) for the training data set using the [log(S), ST] variable set. The dendrogram shows two distinct clusters among the good solvents (distinguished by distinct symbols preceding the chemical names: a circle for the first cluster and a star for the second cluster). It also shows that the nonsolvents were the last chemicals to be clustered (distinguished by a square preceding the chemical name), indicating that they are the most isolated chemicals in the data set, even from each other. The cluster analysis for the combined data set of 40 chemicals, based on log(S) and ST (Figure 2), placed valeric acid, n-octanol, ethylbenzene, and pentachloroethane among the good solvents. The first two seem to be misclassified while the other two are appropriately placed. Propionic acid, n-pentane, n-butylcyclohexane, 3-ethylhexane, trichlorobenzene, and acrylic acid were all correctly placed among the nonsolvents.

TABLE 2. Groups of Variables with High Correlations among the 30 Training Variables

correlation ≥ 0.70     correlation ≥ 0.80     correlation ≥ 0.90
¹χᵛ, log(P), MW        ⁰χᵛ, MW                log(S), log(HC), log(P)
R, ST                  β, log(HC)             MolarV, ⁰χ, ⁰χᵛ, ¹χᵛ
BoilPt, ST
β, log(P)
β, log(S)

Discriminant Function Analysis. Discriminant function analysis was performed on the training data set and the combined data set. The objective in both cases was to evaluate the potential for discriminant function analysis to appropriately identify chemicals as good solvents or nonsolvents. While the training data set has no test data to evaluate, discriminant function analysis reports misclassifications that would occur for the training set chemicals using the rules it develops. Examination of the training data set allows investigating the use of the 16 variables, as opposed to the combined data set which limits investigation to four variables. Discriminant function analysis on the training data set with all 16 variables resulted in three misclassifications of good solvents (though one of these was 1,2-dichloroethane) and one misclassification of a nonsolvent, giving a total misclassification rate of 13.3%. Using only the variables [log(S), ST] resulted in 8/4 misclassifications (solvents as nonsolvents/nonsolvents as solvents), which was better than [log(S), ST, ⁰χᵛ] with 8/5 misclassifications but not quite as good as [log(S), ST, log(HC)] with 7/3 misclassifications. Treating 1,2-dichloroethane as a solvent generally decreased the number of misclassifications by one or two, except when all 16 variables were used, where it increased misclassifications by two.
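The misclassification counts quoted above translate into total rates by simple arithmetic over the 30 training chemicals. The short sketch below does that calculation. The dictionary labels are shorthand chosen here, and the counts are exactly those reported in the text.

```python
N_TRAINING = 30

# (good solvents misclassified, nonsolvents misclassified), as reported above.
REPORTED = {
    "all 16 variables": (3, 1),
    "log(S), ST": (8, 4),
    "log(S), ST, chi0v": (8, 5),
    "log(S), ST, log(HC)": (7, 3),
}


def total_rate(missed_good, missed_non, n=N_TRAINING):
    """Fraction of all training chemicals that were misclassified."""
    return (missed_good + missed_non) / n


rates = {name: total_rate(*counts) for name, counts in REPORTED.items()}
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {100 * rate:.1f}%")
```

The first entry reproduces the 13.3% stated in the text (4 of 30). The three reduced variable sets give total rates between one-third and just under one-half.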

Table 3 shows the classification results of the discriminant function analysis using [log(S), ST] with the combined data set. The predicted classification of the 10 test chemicals disagrees with the cluster analysis for four chemicals. Among the 30 training chemicals, 21 had posterior probabilities in the range of 0.5 ± 0.1, and 28 had posterior probabilities in the range of 0.5 ± 0.2. Hence, few chemicals were strongly classified, indicating a lack of certainty associated with the decision rules developed by the discriminant function analysis. Using other combinations of log(S), ST, log(HC), and ⁰χᵛ did not substantially change these results. The posterior probabilities for the training set with all 16 variables were more extreme, typically occurring below 0.1 or above 0.9. This indicates more certainty in the decision rules developed by the discriminant function analysis. Ultimately, discriminant function analysis proved to be disappointing in its potential to classify solvents and nonsolvents using the four variables available in the testing data but showed some potential for situations where more of the 16 variables from the training data set are available.

Canonical Discriminant Analysis. Further investigation of the four predictor variables available in the testing data set [log(S), ST, log(HC), ⁰χᵛ] through two- and three-dimensional plots prompted the idea of using either principal component analysis or canonical discriminant analysis for classifying solvents and nonsolvents. Using the training set, each possible subset of three variables from the original four variables [log(S), ST, log(HC), ⁰χᵛ] was interactively examined in rotating three-dimensional plots. The good solvent observations and the nonsolvent observations were distinguished by different symbols, and patterns distinguishing the groups from one another were sought. A two-dimensional scatterplot of ST versus log(S) (Figure 3) illustrates the dominant and pertinent information found in this exploration. Here, the good solvents form along a path running between two subgroups of the nonsolvents. This pattern prompted the use of canonical discriminant analysis.

FIGURE 1. Dendrogram from cluster analysis with training cases only using nearest neighbor method with variables log(S) and ST.

A similar strategy using principal component analysis could also be developed. One strategy would determine primary axes along which the good solvent observations have the greatest variability. Another axis could then be calculated that runs perpendicular to the primary axes and simultaneously minimizes the distances from the nonsolvent observations to the axis. However, this is a more convoluted approach. In many cases, these approaches may yield similar results. However, a strategy using principal component analysis does not have the direct objective of separating the groups as canonical discriminant analysis does (excepting the use of principal component analysis on the group means, weighted by the number of observations in the groups, which is equivalent to canonical discriminant analysis).

In order for canonical discriminant analysis to be effective in this situation, it was necessary to reclassify the nonsolvents into two groups (those lying above the path of the good solvents and those lying below the path) for a total of three groups (good solvents, high nonsolvents, and low nonsolvents). The idea was to determine an axis in the four-dimensional space of the original variables that gives the best single measure of separation among the three groups. The first canonical variable from a canonical discriminant analysis is defined on such an axis. Figure 4 shows a plot of the first canonical variable (CAN1) for the combined data set (note that the data are randomly spread in the horizontal direction for better visibility; there is no variable on the bottom axis). This plot shows even better separation than the plot of ST versus log(S) in Figure 3, yet it involves only a single (transformed) variable.

FIGURE 2. Dendrogram from cluster analysis including testing cases using nearest neighbor method with variables log(S) and ST.
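To make the construction of such an axis concrete, the following is a minimal sketch of a canonical discriminant analysis written with NumPy. It is not the software used in the study. The data are synthetic stand-ins, the function name `canonical_discriminant` is hypothetical, and the scaling conventions (pooled within-group variance of one for each canonical variable; standardized coefficients equal to raw coefficients times the total-sample standard deviations) are assumptions modeled on common statistical packages.

```python
import numpy as np

def canonical_discriminant(X, groups):
    """Return raw and total-sample standardized canonical coefficients."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n, p = X.shape
    labels = np.unique(groups)
    grand_mean = X.mean(axis=0)

    W = np.zeros((p, p))  # pooled within-group SSCP matrix
    B = np.zeros((p, p))  # between-group SSCP matrix
    for g in labels:
        Xg = X[groups == g]
        dev = Xg - Xg.mean(axis=0)
        W += dev.T @ dev
        d = Xg.mean(axis=0) - grand_mean
        B += len(Xg) * np.outer(d, d)

    # Solve B a = lambda W a by whitening with the Cholesky factor of W.
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    eigvals, eigvecs = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(eigvals)[::-1]  # largest separation first
    eigvals = eigvals[order]
    # Scale so each canonical variable has pooled within-group variance 1.
    raw = (L_inv.T @ eigvecs[:, order]) * np.sqrt(n - len(labels))

    # Assumed convention: standardized = raw * total-sample standard deviation.
    total_sd = X.std(axis=0, ddof=1)
    standardized = raw * total_sd[:, None]
    return raw, standardized, eigvals, grand_mean, total_sd

# Synthetic stand-in data: three groups in four dimensions (NOT the paper's data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [3.0, 1.0, 0.0, 0.0],
                    [6.0, 0.0, 1.0, 0.0]])
X = np.vstack([c + rng.normal(size=(20, 4)) for c in centers])
groups = np.repeat(["low", "good", "high"], 20)

raw, standardized, eigvals, grand_mean, total_sd = canonical_discriminant(X, groups)

# CAN1 scores computed two equivalent ways: from raw coefficients on centered
# data, and from standardized coefficients on total-sample z-scores.
can1_raw = (X - grand_mean) @ raw[:, 0]
can1_std = ((X - grand_mean) / total_sd) @ standardized[:, 0]
```

With three groups, at most two canonical variables (CAN1 and CAN2) carry between-group information, which is why only these two appear in the analysis.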

Placing the test data on this axis defined by CAN1 using the four variables from the combined data set gives a visual indication of how these data would be classified (Figure 5). Note that the test chemicals are not used in the canonical discriminant analysis itself. Rather, the value of CAN1 is calculated for test chemicals using the total sample standardized canonical coefficients derived from the canonical discriminant analysis performed on the training chemicals only. Table 4 displays, in descending order, the value of CAN1 for all chemicals in the combined data set. This table indicates that acrylic acid and trichlorobenzene lie among the high nonsolvents; n-butylcyclohexane, 3-ethylhexane, and n-pentane lie among the low nonsolvents; valeric acid, pentachloroethane, n-octanol, and ethylbenzene lie among the good solvents; and propionic acid lies between the high nonsolvents and the good solvents. This is in good agreement with the classification by the cluster analysis procedure shown in Figure 2. Often, including the second canonical variable (CAN2) will further separate the groupings, but as Figure 6 shows, addition of this variable does not help separate the good solvents and the nonsolvents.
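The placement of the test chemicals can be reproduced from the CAN1 values in Table 4. The sketch below is illustrative only: the paper reads the placements off the table and the plot and does not define numerical cutoffs, so the interval rule in the hypothetical function `place` (comparing each test value with the CAN1 range spanned by each training group) is an assumption made here.

```python
# CAN1 values of the training chemicals, transcribed from Table 4 by class.
train = {
    "high": [3.13461569, 2.91174684, 2.09923253],
    "good": [1.66534459, 1.05130216, 0.96966402, 0.73851830, 0.71605466,
             0.67491067, 0.60000411, 0.52866831, 0.30567651, 0.29132138,
             0.18538081, 0.15295675, 0.13626953, 0.09920000, -0.08020000,
             -0.08170000, -0.13187222, -0.21006047, -0.27406502, -0.39981566,
             -0.77878854, -0.86084612, -1.10036275],
    "low": [-1.38645320, -1.44790864, -3.03262471, -6.47613626],
}

# CAN1 values of the ten test chemicals, transcribed from Table 4.
external_can1 = {
    "acrylic acid": 7.61790960,
    "trichlorobenzene": 3.62024414,
    "propionic acid": 1.75530683,
    "valeric acid": 1.39453531,
    "pentachloroethane": 1.15143481,
    "n-octanol": 0.83893781,
    "ethylbenzene": 0.21502539,
    "n-butylcyclohexane": -2.64809756,
    "3-ethylhexane": -4.22102278,
    "n-pentane": -4.72104029,
}

def place(can1):
    """Locate a CAN1 value relative to the training-group ranges (assumed rule)."""
    if can1 >= min(train["high"]):
        return "high nonsolvent"
    if can1 > max(train["good"]):
        return "between high nonsolvents and good solvents"
    if can1 >= min(train["good"]):
        return "good solvent"
    if can1 > max(train["low"]):
        return "between good solvents and low nonsolvents"
    return "low nonsolvent"

placements = {name: place(value) for name, value in external_can1.items()}

for name, label in placements.items():
    print(f"{name}: {label}")
```

Under this rule the ten placements match those stated in the text, including propionic acid, which falls in the gap between the good solvents and the high nonsolvents.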

Figure 7 shows a plot of the first canonical variable using all 16 variables from the training data set (again, there is no variable on the bottom axis). Comparing this plot with the one in Figure 4 [which is based on the combined data set having only four variables: log(S), ST, log(HC), ⁰χᵛ], one can see that the separation between the good solvents and the nonsolvents is even greater with all 16 variables than when only [log(S), ST, log(HC), ⁰χᵛ] are used. While there is no test data set to evaluate that has all 16 variables, these results suggest even greater potential for separating out the test chemicals if the variables absent from the combined data set were available.

TABLE 3. Discriminant Function Analysis Predicted Solvent Status of the Training and Testing Set Chemicals and Their Posterior Probability of Good Solvent Classification Based on the Variables log(S) and Surface Tension^a

chemical                solvent status  predicted solvent status  posterior prob of solvent classification
butanol                 good            good                      0.544634
ethanol                 good            good                      0.554575
2-propanol              good            good                      0.566480
methanol                good            good                      0.570646
benzene                 good            non                       0.496453
toluene                 good            non                       0.465404
1,2-xylene              good            non                       0.460015
carbon tetrachloride    good            non                       0.465013
chlorobenzene           good            good                      0.507542
1,1,1-trichloroethane   good            non                       0.451004
methylene chloride      good            good                      0.549188
perchloroethylene       good            non                       0.471352
trichloroethylene       good            good                      0.520518
ethyl acetate           good            good                      0.522876
acetone                 good            good                      0.573521
methyl ethyl ketone     good            good                      0.551923
methyl isobutyl ketone  good            good                      0.505171
chloroform              good            good                      0.505487
1,1-dichloroethane      good            non                       0.472552
1,2-dichloroethane      non             good                      0.572728
1,1,2-trichloroethane   non             good                      0.589128
cyclohexane             non             non                       0.392095
acetic acid             non             good                      0.659895
n-butyl acetate         good            non                       0.487709
cyclohexanone           non             good                      0.622482
diethylamine            good            good                      0.560738
triethylamine           non             non                       0.482454
propanol                good            good                      0.547084
methyl chloride         non             non                       0.372771
propane                 non             non                       0.214293
pentachloroethane       test            good                      0.529391
n-pentane               test            non                       0.276135
n-octanol               test            non                       0.461832
n-butylcyclohexane      test            non                       0.278218
3-ethylhexane           test            non                       0.255003
ethylbenzene            test            non                       0.439882
trichlorobenzene        test            good                      0.583710
propionic acid          test            good                      0.607795
valeric acid            test            good                      0.531997
acrylic acid            test            good                      0.798574

^a 1,2-Dichloroethane is treated as a nonsolvent.

FIGURE 3. Plot of surface tension vs log(S) (training set only).

FIGURE 4. Plot of CAN1, no test group, using four variables from the combined set.

The total sample standardized canonical coefficients for the canonical discriminant analysis using [log(S), ST, log(HC), ⁰χᵛ] (Table 5) show that log(HC) and ST contribute the most in the construction of CAN1, implying that these two variables are the most useful in separating the solvent and nonsolvent groups. Table 6 gives the total sample standardized canonical coefficients for the canonical discriminant analysis using all 16 variables. These coefficients imply that BoilPt, ⁰χ, and ⁰χᵛ contribute the most in separating the groups. However, because of the numerous strong correlations among many of the variables in the data set, these variables may not be the only ones capable of producing good separation between the groups. For example, removal of ⁰χ does not dramatically impact the results of the canonical discriminant analysis, nor does even the removal of ⁰χ, ⁰χᵛ, and ¹χᵛ. Because of the strong linear relationship between these variables and MolarV, MolarV and other less correlated variables are able to largely compensate for the contributions of the removed variables.

FIGURE 5. Plot of CAN1, with test group, using the four variables from the combined set.

FIGURE 6. Plot of CAN1 vs CAN2, no test group, using the four variables from the combined set.

TABLE 4. 30 Training Chemicals and 10 Test Chemicals in Descending Order by Value of First Canonical Variable Based on log(S), log(HC), ST, and ⁰χᵛ

name                    class  CAN1
acrylic acid            test   7.61790960
trichlorobenzene        test   3.62024414
cyclohexanone           high   3.13461569
acetic acid             high   2.91174684
1,1,2-trichloroethane   high   2.09923253
propionic acid          test   1.75530683
1,2-dichloroethane      good   1.66534459
valeric acid            test   1.39453531
pentachloroethane       test   1.15143481
chlorobenzene           good   1.05130216
butanol                 good   0.96966402
n-octanol               test   0.83893781
trichloroethylene       good   0.73851830
propanol                good   0.71605466
methylene chloride      good   0.67491067
ethanol                 good   0.60000411
methyl ethyl ketone     good   0.52866831
benzene                 good   0.30567651
2-propanol              good   0.29132138
ethylbenzene            test   0.21502539
methanol                good   0.18538081
1,2-xylene              good   0.15295675
acetone                 good   0.13626953
perchloroethylene       good   0.09920000
toluene                 good   -0.08020000
diethylamine            good   -0.08170000
ethyl acetate           good   -0.13187222
methyl isobutyl ketone  good   -0.21006047
chloroform              good   -0.27406502
n-butyl acetate         good   -0.39981566
carbon tetrachloride    good   -0.77878854
1,1-dichloroethane      good   -0.86084612
1,1,1-trichloroethane   good   -1.10036275
triethylamine           low    -1.38645320
cyclohexane             low    -1.44790864
n-butylcyclohexane      test   -2.64809756
methyl chloride         low    -3.03262471
3-ethylhexane           test   -4.22102278
n-pentane               test   -4.72104029
propane                 low    -6.47613626

TABLE 5. Total Sample Standardized Canonical Coefficients for the Canonical Discriminant Analysis Using log(S), log(HC), ST, and ⁰χᵛ

variable  CAN1
log(S)    0.163886
log(HC)   -0.748385
ST        1.518834
⁰χᵛ       -0.255353
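The reading of Table 5 can be illustrated by ranking the coefficients by magnitude and by writing CAN1 as a weighted sum. This is a sketch under stated assumptions: the label `chi0v` is an ASCII stand-in for ⁰χᵛ, the hypothetical function `can1_from_z` expects each input already standardized by the training-set total-sample mean and standard deviation (values not reported in this section), and the z-scores in the usage line are made up for illustration.

```python
# Total sample standardized canonical coefficients for CAN1, from Table 5.
# "chi0v" is an ASCII stand-in (chosen for this sketch) for the fourth variable.
coefficients = {
    "log(S)": 0.163886,
    "log(HC)": -0.748385,
    "ST": 1.518834,
    "chi0v": -0.255353,
}

# Rank variables by the absolute size of their coefficient.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

def can1_from_z(z):
    """CAN1 as a weighted sum of standardized inputs (assumed convention)."""
    return sum(coefficients[name] * z[name] for name in coefficients)

# Hypothetical z-scores, for illustration only (not data from the paper).
example = can1_from_z({"log(S)": 0.0, "log(HC)": 0.0, "ST": 1.0, "chi0v": 0.0})
print(ranked, example)
```

ST and log(HC) head the ranking, in line with the text. As the paragraph above cautions, coefficient magnitudes should be interpreted with care when the variables are strongly correlated.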

Comparison of Multivariate Methods. Of the several techniques considered in this paper, canonical discriminant analysis appears to hold the most promise for screening potential solvents. Canonical discriminant analysis was able to separate the solvent and nonsolvent groups well using just the four variables available in the testing data set, yet appears to have even greater ability to separate these groups when more variables are available. Additionally, the implication that the first canonical variable has for a chemical's potential as a solvent is easily discernible when displayed in either a table or a graph.

A strategy using principal component analysis was initially considered as an alternative to canonical discriminant analysis, but it lacked the direct objective of separating the solvent and nonsolvent groups. Cluster analysis may be an informative tool but has no clear indicator of solvent potential associated with it. Discriminant function analysis required too many variables to be generally useful, resulted in numerous misclassifications of the training data set, and lacked the ability to illustrate a chemical's solvent potential in a simple manner. Two-dimensional scatterplots and three-dimensional interactive rotating plots give insight into patterns that might be useful in developing strategies for screening chemicals and offer some understanding of relationships between variables, which is often useful in selecting variables to be used with a particular technique.
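The extent of the misclassification by discriminant function analysis can be tallied directly from Table 3. The sketch below transcribes the actual and predicted status of the 30 training chemicals (based on log(S) and surface tension, with 1,2-dichloroethane treated as a nonsolvent, as in the table footnote) and counts the disagreements; the variable names are chosen here for illustration.

```python
from collections import Counter

# (chemical, actual status, predicted status) for the training set, from Table 3.
training = [
    ("butanol", "good", "good"), ("ethanol", "good", "good"),
    ("2-propanol", "good", "good"), ("methanol", "good", "good"),
    ("benzene", "good", "non"), ("toluene", "good", "non"),
    ("1,2-xylene", "good", "non"), ("carbon tetrachloride", "good", "non"),
    ("chlorobenzene", "good", "good"), ("1,1,1-trichloroethane", "good", "non"),
    ("methylene chloride", "good", "good"), ("perchloroethylene", "good", "non"),
    ("trichloroethylene", "good", "good"), ("ethyl acetate", "good", "good"),
    ("acetone", "good", "good"), ("methyl ethyl ketone", "good", "good"),
    ("methyl isobutyl ketone", "good", "good"), ("chloroform", "good", "good"),
    ("1,1-dichloroethane", "good", "non"), ("1,2-dichloroethane", "non", "good"),
    ("1,1,2-trichloroethane", "non", "good"), ("cyclohexane", "non", "non"),
    ("acetic acid", "non", "good"), ("n-butyl acetate", "good", "non"),
    ("cyclohexanone", "non", "good"), ("diethylamine", "good", "good"),
    ("triethylamine", "non", "non"), ("propanol", "good", "good"),
    ("methyl chloride", "non", "non"), ("propane", "non", "non"),
]

# Confusion counts keyed by (actual, predicted).
counts = Counter((actual, predicted) for _, actual, predicted in training)
misclassified = [name for name, actual, predicted in training if actual != predicted]
error_rate = len(misclassified) / len(training)

print(dict(counts))
print(len(misclassified), "of", len(training), "misclassified")
```

By this count, 12 of the 30 training chemicals (40%) are misclassified: 8 good solvents predicted as nonsolvents and 4 nonsolvents predicted as good solvents.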

Cluster analysis may prove useful in identifying commonalities and differences that exist among various groups of solvents. Understanding these distinctions may help develop future strategies in screening potential solvents. For example, it may be beneficial to classify solvents into two distinct groups for use in strategies based on discriminant function analysis or canonical discriminant analysis. Investigation of other variables may also lead to better screening methodology, but the variables used should be easily obtained, given that many of the chemicals that may be screened will not be well studied. Finally, further evaluation of the methods presented here using other sets of chemicals would add to our understanding of their suitability as screening techniques.

Literature Cited

(1) Billatos, S.; Basaly, N. Green Technology and Design for the Environment; Taylor & Francis: London, 1997.
(2) Zhao, R.; Cabezas, H. Ind. Eng. Chem. Res. 1998, 37 (8), 3268-3280.
(3) Kirschner, E. M. Chem. Eng. News 1994, June, 13-20.
(4) Callahan, M.; Green, B. Hazardous Solvent Source Reduction; McGraw-Hill: New York, 1995.
(5) Allen, D. T. Pollut. Prev. Rev. 1997, Winter, 113-118.
(6) Pretel, J.; Lopez, A.; Bottini, B.; Brignole, A. AIChE J. 1994, 40, 1349-1353.
(7) Callahan, M.; Sciarrotta, T. Pollut. Prev. Rev. 1994, Winter.
(8) Howard, P. Handbook of Environmental Fate and Exposure Data for Organic Chemicals; Lewis Publishers: Chelsea, MI, 1990.
(9) Lide, D. Handbook of Organic Solvents; CRC Press: Boca Raton, FL, 1995.
(10) Smallwood, I. Handbook of Organic Solvent Properties; Arnold: London, 1996.
(11) Yaws, C. Chemical Properties Handbook; CRC Press: Boca Raton, FL, 1999.
(12) Jasper, J. J. Phys. Chem. Ref. Data 1972, 1 (4), 841-1009.
(13) Abraham, M.; Andonian-Haftvan, J.; Whiting, G.; Leo, A.; Taft, R. J. Chem. Soc., Perkin Trans. 2 1994, 1777-1791.
(14) Nirmalakhandan, N.; Speece, R. E. Environ. Sci. Technol. 1988, 22 (6), 606-615.
(15) Johnson, R.; Wichern, D. Applied Multivariate Statistical Analysis; Prentice Hall: New York, 1988.
(16) Hairston, D. Chem. Eng. 1997, February, 55-58.
(17) Meloun, M.; Militky, J.; Forina, M. Chemometrics For Analytical Chemistry Volume 1: PC-aided Statistical Data Analysis; Ellis Horwood: Chichester, 1992.

Received for review November 15, 1999. Revised manuscript received March 29, 2000. Accepted April 4, 2000.

ES9912832

FIGURE 7. Plot of CAN1, no test group, using all 16 variables from the training set.

TABLE 6. Total Sample Standardized Canonical Coefficients for the Canonical Discriminant Analysis Using All 16 Variables from the Training Data Set

variable  CAN1
log(S)    5.2450143
log(HC)   0.3571004
ST        -2.3988698
BoilPt    3.3482704
MeltPt    -0.7266137
MW        -0.3777263
log(P)    -5.9752174
VP        -0.7120605
R         2.0698525
π         -2.1804122
α         -2.9647183
β         -8.0965957
MolarV    2.3238106
⁰χ        1.8892035
⁰χᵛ       -0.4653586
¹χᵛ       -0.2699279
