Geographical classification of crude oils by Kohonen self-organizing maps

Analytica Chimica Acta 556 (2006) 374–382

Ana M. Fonseca a, José L. Biscaya b, João Aires-de-Sousa a,∗, Ana M. Lobo a

a CQFB and REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Monte de Caparica, Portugal
b Instituto Hidrográfico, Rua das Trinas, 49, 1249-093 Lisboa, Portugal

∗ Corresponding author. Tel.: +351 21 2948300; fax: +351 21 2948550. E-mail address: [email protected] (J. Aires-de-Sousa).

Received 7 June 2005; received in revised form 16 September 2005; accepted 28 September 2005
Available online 21 November 2005

Abstract

In the analysis of an environmental disaster caused by spillage of crude oil, limitation of the possible sources to a few geographical origins can help in the identification of the polluting vessel from a group of potential candidates. In this paper we show that Kohonen self-organizing maps (or Kohonen neural networks) can classify samples of crude oils on the basis of gas chromatography–mass spectrometry (GC–MS) descriptors, in terms of geographical origin, with a high degree of accuracy. Two data sets were investigated: one from Instituto Hidrográfico (Lisbon, Portugal) with 188 samples from 20 geographical origins, and another from EUROCRUDE® with 374 samples. After training the Kohonen self-organizing maps with a training set, predictions were obtained for an independent test set. Correct predictions were obtained for 70% and 60% of the test sets for the two studies, respectively. Ensembles of networks were highly interesting for the calculation of a prediction score, which can be used as a measure of the reliability of the prediction. For the samples with high prediction scores, the percentage of correct predictions jumped to 94–96%. The ability of the maps to identify a given origin is very much dependent on the availability of samples from that class in the training set. Equally good predictions were obtained for a small test set of weathered samples. This investigation adds value to the GC–MS descriptors already in use for practical analytical work, suggesting new ways to ferret out useful knowledge from them.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Crude oils; Self-organizing maps; Neural network; GC–MS; Geographical classification; Spills

1. Introduction

Huge amounts of crude oil are permanently transported by vessels, presenting a risk for the marine environment. Spillages of crude oil are particularly harmful since crude oils are generally transported in much larger quantities than refined products. Identification of the source of spillage is crucial for the enforcement of environmental laws, and a challenging analytical problem. In the analysis of an environmental disaster, limitation of the possible sources of spillage to crude oils from a few geographical origins can support the isolation of a vessel from a group of candidates.

For assistance in the identification of unknown samples, and assignment of their geographical origin, several organizations have tested different analytical methods [1] including IR [2–4] and UV [5] spectroscopy, UV fluorescence [6], 13C NMR [7], and chromatography [8]. Beyond the ability to discriminate between different origins, the methods must be based on characteristics that are resistant to environmental conditions, and must be easily standardized into reproducible routines.

The most recognized method for the identification of crude oil samples is gas chromatography (GC), frequently enhanced by sophisticated analytical techniques such as mass spectrometry (MS) [8,9]. The power of current analytical techniques applied to crude oils led to the proposal of a “petroleomics” concept, which envisages “the ultimate characterization of all of the chemical constituents of petroleum, along with their interactions and reactivity” [10].
The GC–MS technique allows a large number of characteristics, the “fingerprints” of crude oils, to be determined and compared. The characteristics determined reflect the chemical composition of a specific crude oil, which results from the composition of the original biomass, the pressure and temperature conditions during its formation, and the history of the reservoir within which it was formed [11]. It is not surprising that these sets of parameters (biomarkers) are generally well correlated with geography [12].

0003-2670/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2005.09.062

Several procedures have been optimized for oil spill identification [9]. The Nordtest method [13] is based on fingerprint matching between the unknown sample and samples from suspect sources. In a first stage, the comparison is made between GC and FID chromatograms. If a high level of similarity is found, further confirmation is investigated by matching of GC–MS data. ASTM International [14] also developed standards for the analysis of petroleum oils recovered from water, which are used for spill source determination. Comparisons are performed with suspect samples, usually based on gas chromatography, fluorescence spectroscopy, and infrared spectroscopy. GC–MS is used as a complementary technique.

Processing of the analytical data is not a simple task, and their full exploration can only be achieved with an appropriate pattern recognition technique. In this paper, we show that Kohonen self-organizing maps (or Kohonen neural networks) can classify samples of crude oils on the basis of GC–MS descriptors, in terms of geographical origin, with a high degree of accuracy. In contrast to oil identification, geographical classification requires a degree of abstraction and generalization. An unknown sample is not matched to a list of samples, but rather analyzed in terms of patterns automatically identified by Kohonen neural networks for (geographical) classes of samples. A self-organizing map (SOM) not only identifies similarities between samples but also reveals regions of similarities (clusters) within the universe of the samples, and assists in the identification of descriptors with discriminating ability. Furthermore it can serve as a preliminary analysis, after which a more specific set of descriptors can be applied. An interesting aspect of these networks is their ability to self-organize, and to automatically classify objects. The training of a Kohonen SOM is an unsupervised process, i.e., the samples (crude oils) are distributed on a Kohonen map exclusively on the basis of similarities between GC and MS descriptors, without using the information about the classes to which they belong. Only at the end of the training are the samples labelled with their known classes to reveal if clustering emerged from the training. The determined GC–MS descriptors can be seen as coordinates in a multidimensional space, and mapping the samples on the surface of a Kohonen SOM can be interpreted as a reduction of the multidimensional space to a 2D space. Compared with other traditional statistical techniques, Kohonen SOMs provide an easy way to visualize and interpret the results. On the basis of these reasons, it was decided to explore their potential for the crude oil problem.

In this work, we studied two databases of samples of crude oils, data set I [12,15] from Instituto Hidrográfico, Lisbon, Portugal, and data set II from EUROCRUDE® [16], and analyzed them using Kohonen SOMs with the aim of assigning their geographical origin, crucial information in the identification of the source of the oil spillage and in the management of environmental disasters. Furthermore, a test was performed with weathered samples to assess the usefulness of the method in real-life situations.

2. Methodology

2.1. Data sets

Data set I, of Instituto Hidrográfico (Lisbon, Portugal), includes a total of 188 samples characterized by 21 descriptors. The descriptors are based on peak areas of GC chromatograms. These descriptors were chosen taking into account some relationships between chemical compounds of oils that are more resistant to the environmental conditions after spillage in the sea. Ratios between the areas of some peaks corresponding to aromatic compounds and hopanes were selected as descriptors. The following descriptors were used (MP: methylphenanthrene; DMP: dimethylphenanthrene):

(1) 18α(H)-22,29,30-Trisnorhopane/17α(H),21β(H)-Hopane
(2) 17α(H)-22,29,30-Trisnorhopane/17α(H),21β(H)-Hopane
(3) 17α(H),18α(H),21β(H)-28,30-Bisnorhopane/17α(H),21β(H)-Hopane
(4) 17α(H),21β(H)-30-Norhopane/17α(H),21β(H)-Hopane
(5) (22S)17α(H),21β(H)-30-Homohopane/17α(H),21β(H)-Hopane
(6) (22R)17α(H),21β(H)-30-Homohopane/17α(H),21β(H)-Hopane
(7) (22S)17α(H),21β(H)-30,31-Bishomohopane/17α(H),21β(H)-Hopane
(8) (22R)17α(H),21β(H)-30,31-Bishomohopane/17α(H),21β(H)-Hopane
(9) 18α(H)-Oleanane/17α(H),21β(H)-Hopane
(10) (3-MP + 2-MP)/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP + 1,6-DMP + 2,9-DMP + 1,7-DMP)
(11) (9-MP + 1-MP)/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP + 1,6-DMP + 2,9-DMP + 1,7-DMP)
(12) 3-MP/(9-MP + 1-MP)
(13) 2-MP/(9-MP + 1-MP)
(14) 1-MP/9-MP
(15) 4-Methyldibenzothiophene/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP + 1,6-DMP + 2,9-DMP + 1,7-DMP)
(16) (3-Methyldibenzothiophene + 2-Methyldibenzothiophene)/(4-Methyldibenzothiophene)
(17) 1-Methyldibenzothiophene/4-Methyldibenzothiophene
(18) (1,6-DMP + 2,9-DMP)/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP)
(19) 1,7-DMP/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP)
(20) (2,6-DMP + 2,7-DMP)/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP)
(21) (2,3-DMP + 1,9-DMP + 1,8-DMP)/(1,3-DMP + 2,10-DMP + 3,9-DMP + 3,10-DMP)

Descriptors 1–9 are ratios of peaks corresponding to hopane compounds, while descriptors 10–21 correspond to methyl polycyclic aromatic compounds.

The geographical origins of the 188 samples are listed in Table 1 together with the partition of the data into training and test sets. This partition was done randomly, but in such a way that the size of the test set is approximately 1/3 of the training set, and the sizes of the classes in the training set are as homogeneous as possible. In a first experiment only data from the nine most populated classes were used (North Sea, Saudi Arabia, Angola, Nigeria, Iraq, Iran, Africa, Russia, Venezuela), giving a training set of 115 samples and a test set of 40 samples. In a second study, the less-populated classes were also included, the Angola class was divided into Angola–Cabinda and Angola–Angola, and the North Sea class was sub-divided into four distinct classes A, B, C, and D on the basis of the content in descriptors 3 and 15, as these are widely recognized important parameters in the characterization of a crude oil from North Sea. This study required a rebalance of the training/test sets in respect to the North Sea class (only three samples were involved). Table 1 relates to this second study.

Table 1
Geographical classes of data set I, and partition between training and test sets

Geographical class    Number of samples
                      Total    Training set    Test set
Africa                    9          7             2
Angola–Angola             8          5             3
Angola–Cabinda           12          9             3
Dubai                     5          3             2
Egypt                     5          3             2
Indonesia                 5          3             2
Iran                     10          8             2
Iraq                     14         10             4
Kuwait                    4          2             2
Libya                     5          3             2
Malaysia                  4          2             2
Mexico                    5          3             2
Nigeria                  17         13             4
North Sea A (a)          25         18             7
North Sea B (b)           7          4             3
North Sea C (c)          13          9             4
North Sea D (d)           1          1             0
Russia                    8          6             2
Saudi Arabia             24         19             5
Venezuela                 7          5             2
Total                   188        133            55

(a) North Sea with low values of descriptors 3 and 15.
(b) North Sea with high values of descriptors 3 and 15.
(c) North Sea with null descriptor 3 and low values for descriptor 15.
(d) North Sea with low values for descriptor 3 and high values for descriptor 15.

For the experiments here described each descriptor was linearly scaled between 0 and 1. Scaling is mandatory in order that all descriptors are equally accounted for by the self-organizing maps. Direct inter-correlations between descriptors had been expressly avoided [15]. In any case, Kohonen SOMs do not require independent variables.
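
The linear scaling of each descriptor to the range 0 to 1 can be illustrated with a min-max transform. This is a minimal sketch, not the authors' code: the function name `scale_descriptors`, the toy values, and the treatment of a constant descriptor are assumptions, and the text does not say whether the minimum and maximum were taken over the training set only or over all samples.

```python
def scale_descriptors(samples):
    """Linearly scale every descriptor (column) to the range [0, 1]."""
    n_desc = len(samples[0])
    lows = [min(row[j] for row in samples) for j in range(n_desc)]
    highs = [max(row[j] for row in samples) for j in range(n_desc)]
    scaled = []
    for row in samples:
        scaled.append([
            # A constant descriptor is mapped to 0.0 (assumption of this
            # sketch; the paper does not discuss this case).
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical toy data: three samples, two descriptors.
raw = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
scaled = scale_descriptors(raw)
```

After scaling, a descriptor with a large numerical range no longer dominates the Euclidean distance used to select the winning neuron.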

Data were also available for 12 samples that were weathered under simulated marine conditions over a period of time up to 12 days [15]. The set was selected to be diverse and representative, and included the following classes: North Sea A (2), Africa (1), Angola–Cabinda (1), Angola–Angola (2), Saudi Arabia (1), Egypt (1), Iran (1), Nigeria (1), Mexico (1), and Russia (1). In the weathering experiments, 5 g of a crude oil sample was added to 1 L of ocean water, stirred for 15 min and exposed to the action of sun, wind, and rain under open-air conditions for 2 days. After this time, 80% of the water was replaced, and the procedure was repeated concerning stirring and environmental exposure. Aliquots were taken and analyzed in the usual way after 2, 6, and 12 days.

Data set II consists of 374 samples of crude oils, each sample represented by the following 56 descriptors defined by EUROCRUDE® [16]:

(1) Heptadecane
(2) Pristane
(3) Octadecane
(4) Phytane
(5) Fluorene
(6) 1,4-Dimethylnaphthalene
(7) 2,3-Dimethylnaphthalene
(8) 2,3,6-Trimethylnaphthalene
(9) 1-Isopropyl-2-Methylnaphthalene
(10) 3-Methylphenanthrene
(11) 2-Methylphenanthrene
(12) Methylanthracene
(13) 9-Methylphenanthrene
(14) 1-Methylphenanthrene
(15) Dibenzothiophene
(16) Dimethylphenanthrene
(17) Trimethylphenanthrene
(18) 2,3,5-Trimethylphenanthrene
(19) 4-Methyldibenzothiophene
(20) 3/2-Methyldibenzothiophene
(21) 1-Methyldibenzothiophene
(22) Dimethyldibenzothiophene (2,6/1,6/1,7/3,6/2,7/2,8/4,6/3,7)
(23) 1-Methyl-7(1-methylethyl)-phenanthrene
(24) Dimethyldibenzothiophene (2,6/1,6/1,7/3,6/2,7/2,8/4,6/3,7)
(25) Dimethyldibenzothiophene (2,6/1,6/1,7/3,6/2,7/2,8/4,6/3,7)
(26) Benzonaphtothiophene
(27) Trimethyldibenzothiophene
(28) Methylpyrene (isomer 1)
(29) Methylpyrene (isomer 2)
(30) (20S)13�,17�-Diacholestane
(31) (20R)13�,17�-Diacholestane
(32) (20R)-14�,17�-Cholestane
(33) 24-Methyl-14�,17�-cholestane
(34) 24-Ethyl-14�,17�-cholestane
(35) (20S)24-Ethyl-13�,17�-diacholestane/14�,17�-Cholestane
(36) (20R)24-Ethyl-13�,17�-diacholestane/24-Methyl-14�,17�-cholestane
(37) 24-Ethyl-14�,17�-cholestane
(38) 24-Ethyl-14�,17�-cholestane
(39) C23-Tricyclic Terpane
(40) C24-Tricyclic Terpane
(41) C25-Tricyclic Terpane
(42) C28-Tricyclic Terpane
(43) C29-Tricyclic Terpane
(44) Triaromatic Sterane
(45–47) Triaromatic Sterane
(48) 18α(H)-22,29,30-Trisnorhopane (C27)
(49) Not assigned
(50) 17α(H)-22,29,30-Trisnorhopane (C27)
(51) 17α(H),18α(H),21β(H)-28,30-Bisnorhopane (C28)
(52) 17α(H),21β(H)-30-Norhopane (C29)
(53) 17α(H),21β(H)-Hopane (C30)
(54) (22S) and (22R)17α(H),21β(H)-30-Homohopane (C31)
(55) (22S) and (22R)17α(H),21β(H)-30-Homohopane (C31)
(56) 18α(H)-Oleanane (C30)

To account for variations in the column behavior, every crude oil sample was calibrated against a North Sea Brent sample in the following way: the Brent sample was analyzed before and after each crude oil sample. Then each descriptor was divided by the average of that descriptor in the two Brent samples. Descriptor 56 is null for most samples, so it was calibrated against descriptor 53 of the Brent sample. As for data set I, the data were randomly partitioned into training and test sets (Table 2), but in a way to homogenize as much as possible the number of data in the training set, and to place approximately one third of the samples of each class in the test set (except for North Sea, which is larger than the other classes). Each descriptor was linearly scaled between 0 and 1.
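
The Brent calibration described above can be sketched as follows. This is an illustrative reading of the text, not the EUROCRUDE® procedure itself: the function name `calibrate` and its parameters are hypothetical, and the sketch assumes that descriptor 56 is divided by the average of descriptor 53 over the two bracketing Brent runs.

```python
def calibrate(sample, brent_before, brent_after, null_idx=55, ref_idx=52):
    """Divide each descriptor by the mean of the two bracketing Brent runs.

    Indices are 0-based, so descriptor 56 is index 55 and descriptor 53 is
    index 52. The descriptor at null_idx is divided by the Brent mean of the
    descriptor at ref_idx instead of its own (near-zero) Brent value.
    """
    brent_mean = [(b + a) / 2.0 for b, a in zip(brent_before, brent_after)]
    calibrated = []
    for j, value in enumerate(sample):
        ref = brent_mean[ref_idx] if j == null_idx else brent_mean[j]
        calibrated.append(value / ref)
    return calibrated


# Hypothetical three-descriptor example: the last descriptor is null in the
# Brent runs, so it is calibrated against the second descriptor.
result = calibrate([4.0, 9.0, 3.0], [1.0, 2.0, 0.0], [3.0, 4.0, 0.0],
                   null_idx=2, ref_idx=1)
```

Bracketing each crude oil run with two Brent runs lets slow drifts in column behavior cancel out in the ratio.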

2.2. Kohonen self-organizing maps [17,18]

A Kohonen self-organizing map (SOM), or Kohonen neural network (NN), is a 2D array of neurons, each neuron containing as many weights as the input descriptors for the objects to be mapped into the network. A SOM is trained to reflect as much as possible the relationships between individual pieces of data. SOMs are able to map multidimensional information onto a surface (the 2D array, the top surface of the parallelepiped), i.e., they reduce multidimensional information to two dimensions, maintaining the topology of the information.


Table 2
Geographical origins of data set II, and partition between training and test sets

Geographical origin    Number of samples
                       Total    Training set    Test set
Africa                    27         18             9
Angola                    13          9             4
Dubai                      8          5             3
Egypt                     10          7             3
England                    6          4             2
Indonesia                  5          3             2
Iran                      17         11             6
Iraq                       8          5             3
Kuwait                     2          1             1
Libya                     15         10             5
Lithuania                  6          4             2
Malaysia                   4          2             2
Mexico                     3          2             1
Middle East               14          9             5
Nigeria                   25         17             8
North Sea                117         28            89
Russia                    20         13             7
Saudi Arabia              30         20            10
USA                       12          8             4
Venezuela                 32         21            11
Total                    374        197           177

Fig. 1. Architecture of a Kohonen self-organizing map or Kohonen neural network.

Fig. 1 shows the architecture of a Kohonen network: each column in the grid represents a neuron, and each box in such a column represents a weight (a number). Each neuron has as many weights as the input descriptors for the objects to be mapped into the network. In our case, the objects are samples of crude oil, and the weights correspond to GC–MS descriptors. The topology of Kohonen networks is toroidal, i.e., the left side continues into the right side and the bottom neurons are considered to be adjacent to those at the top.

Before self-organization (training) starts, the weights take random values. Learning in a Kohonen network consists of adjusting the weights during the training phase, and this is a competitive process: every time an object from the training set is presented to the network, all the neurons compete to be stimulated by the input object, but only one neuron is chosen as the winner, namely the neuron with the most similar weights to the input descriptors. Similarity is here measured by the Euclidean distance between the input vector and the neuron vector: the smaller the distance, the more similar the two are. The network then corrects the weights of the winning neuron so that they become even more similar to the input descriptors. The weights of the neighborhood neurons are also corrected with the same aim, although proportionally less the further they are from the winning neuron. The process is repeated iteratively with all objects of the training set, until a predefined number of cycles (epochs) is reached.
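
The competitive training loop described above can be sketched in plain Python. This is a generic toroidal Kohonen map written for illustration and is not the JATOON implementation: the function names, the linear decay of the learning rate and neighborhood span, and the triangular neighborhood factor are all assumptions of this sketch.

```python
import math
import random


def torus_distance(a, b, size):
    """Grid distance between neurons a and b on a size x size toroidal map."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Wrap around: opposite edges of the map are adjacent.
    return max(min(dx, size - dx), min(dy, size - dy))


def find_winner(weights, x):
    """Return the (row, col) of the neuron closest to x (Euclidean distance)."""
    best, best_d = None, math.inf
    for pos, w in weights.items():
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_d:
            best, best_d = pos, d
    return best


def train_som(data, size=5, epochs=50, rate0=0.5, span0=None, seed=0):
    """Train a toroidal Kohonen map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    n_desc = len(data[0])
    span0 = span0 if span0 is not None else size // 2
    # Weights start from random values.
    weights = {(r, c): [rng.random() for _ in range(n_desc)]
               for r in range(size) for c in range(size)}
    for epoch in range(epochs):
        # Learning rate and neighborhood span shrink linearly (assumption).
        frac = 1.0 - epoch / epochs
        rate = rate0 * frac
        span = span0 * frac
        order = list(data)
        rng.shuffle(order)
        for x in order:
            win = find_winner(weights, x)
            for pos, w in weights.items():
                d = torus_distance(pos, win, size)
                if d > span:
                    continue
                # The correction decreases with distance from the winner.
                factor = rate * (1.0 - d / (span + 1.0))
                for j in range(n_desc):
                    w[j] += factor * (x[j] - w[j])
    return weights


# Hypothetical toy data: four samples with two scaled descriptors each.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, size=5, epochs=50)
```

Because every update moves a weight vector part of the way toward a sample, weights stay inside the [0, 1] range of the scaled descriptors.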

Once the network has been trained, the entire training set is sent again through the network and each neuron is classified according to the class of the objects that were mapped onto it. Empty neurons are classified according to the objects that activate their neighbors. Clustering of objects represented by similar descriptors results from the training algorithm. If the descriptors globally contain relevant information concerning the class, then clustering by class will emerge. In our case, the classes are the geographical origins of the crude oil samples.
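
The labelling step described above can be sketched as follows. The function name `label_neurons`, the use of the eight surrounding neurons as the neighborhood, and the `None` label for neurons with no occupied neighbor are assumptions of this sketch; the exact rule used by the software is not given in the text.

```python
from collections import Counter


def label_neurons(winners, classes, size):
    """Assign a class to every neuron of a size x size toroidal map.

    winners[i] is the (row, col) neuron activated by training sample i and
    classes[i] is the known class of that sample.
    """
    hits = {}
    for pos, cls in zip(winners, classes):
        hits.setdefault(pos, Counter())[cls] += 1
    # Occupied neurons take the majority class of the samples mapped onto them.
    labels = {pos: votes.most_common(1)[0][0] for pos, votes in hits.items()}
    # Empty neurons take the majority class among their 8 toroidal neighbors.
    for r in range(size):
        for c in range(size):
            if (r, c) in hits:
                continue
            votes = Counter()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    near = ((r + dr) % size, (c + dc) % size)
                    if near in hits:
                        votes.update(hits[near])
            labels[(r, c)] = votes.most_common(1)[0][0] if votes else None
    return labels


# Hypothetical example: two class-A samples on one neuron, one class-B sample.
labels = label_neurons([(0, 0), (0, 0), (2, 2)], ["A", "A", "B"], size=4)
```

A test sample is then predicted by finding its winning neuron and reading that neuron's label.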

In this study, Kohonen networks were trained and applied using JATOON software [19]. The number of epochs was always set to 50. The networks were trained with all the samples of the training sets. The size of the network and the initial learning span were optimized so as to obtain the highest number of correct predictions for the training set. Networks of sizes ranging between 18 × 18 and 26 × 26 were tried. The test set was eventually submitted to the trained network, and predictions were obtained.

2.3. Ensembles of networks

For the classification of data sets I and II several Kohonen networks were put together and a decision process was used by vote. The prediction was made assembling the results from all trained networks, i.e., the class with more votes wins. Each prediction is associated with a prediction score, which is the number of votes obtained by the winning class. For example, if four networks out of five classify a sample as A, then the sample is classified as A, and the prediction score is 4.
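
The voting rule and prediction score can be written in a few lines. This is a minimal sketch: the function name `ensemble_predict` is hypothetical, and reporting a tie for first place as undecided is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def ensemble_predict(votes):
    """Combine one predicted class per network into (class, prediction score)."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None, 0
    ranked = counts.most_common()
    best_class, score = ranked[0]
    # A tie for first place is reported as undecided (assumption).
    if len(ranked) > 1 and ranked[1][1] == score:
        return None, score
    return best_class, score


# The example from the text: four networks out of five vote for class A.
prediction = ensemble_predict(["A", "A", "B", "A", "A"])
```

Here `prediction` is `("A", 4)`, matching the example in the text.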

3. Results and discussion

3.1. Classification of data set I

In a first experiment Kohonen NNs were trained with samples of data set I belonging exclusively to the nine most populated classes: A (North Sea), B (Saudi Arabia), C (Angola), D (Nigeria), E (Iraq), F (Iran), G (Africa), H (Russia), I (Venezuela). The 21 descriptors were used and the three networks giving the best clustering for the training set (115 samples) were analyzed. These are networks with 20 × 20 neurons in a toroidal surface. Correct predictions were obtained for 85–92% of the test set. The networks were visually analyzed layer by layer, each layer corresponding to one descriptor, to identify layers with high correlation with classes.

For example, it was found that in the 9th layer, the neurons with high weights correspond to the neurons where most samples of class D were mapped (Fig. 2). So, we concluded that descriptor 9 (proportional to the content in oleanane) is useful for the identification of class D (Nigeria). This is in agreement with the known fact that oleanane is generally present in Nigeria oils [20,21]. But we chose this layer as an example because it allows for further analysis of the data: a few samples from Nigeria also activated neurons in two different regions, which probably means a diversity of origins within Nigeria. The map in Fig. 2 suggests that the lack of oleanane is one of the causes for heterogeneity. In fact the samples mapped in neurons with low 9th descriptor exhibit a low level of oleanane.

Fig. 2. Graphical representation of the weights in the layer corresponding to descriptor 9 (oleanane) after training a network with samples from data set I belonging to nine different geographical origins, and represented by 21 descriptors. Neurons activated by Nigeria samples (D) are highlighted.

The results obtained for all layers are summarized in Table 3. The following descriptors were found to be relevant and associated with some classes: class A: 1, 2, 3, 4, 10, 21; class B: 4, 12, 15, 16; class C: 2, 4, 14, 18; class D: 6, 8, 9; class E: 1, 15, 19; class F: 12; class G: 1; class H: 15; class I: 21. Table 3 shows the resolution power of these descriptors.

Inspection of these results reveals that six descriptors (numbers 5, 7, 11, 13, 17, and 20) could not be used to discriminate between classes. Interestingly, inspection of layer 3 (corresponding to bisnorhopane) showed a correlation between high weights and class A, which is in accordance with the known fact that bisnorhopane is abundant in samples from North Sea [22].

We then excluded the six descriptors that could not discriminate between classes, and again trained networks with the 115 samples of the training set. The 40-sample test set was then submitted to a 20 × 20 Kohonen network that gave 100% correct predictions for the training set; for the test set it gave 38 correct predictions and 2 wrong predictions. The map resulting from the training, with the samples of the test set mapped onto it, is shown in Fig. 3.

The map emerging from the training clearly reveals clustering of the samples according to geographical origin. Here it must be emphasized that the topology of Kohonen networks is toroidal, i.e., the left side continues into the right side and the bottom neurons are considered to be adjacent to those at the top. Typical zones of the map were particularly well defined for samples from North Sea, Iran, Iraq, and Saudi Arabia. Significantly, the Saudi Arabia cluster is adjacent to the Iraq cluster, and the latter to the Iran cluster. Some classes define more than one cluster, indicating probable different geographical origins within the class. This could be confirmed for the Angola class, in which one of the two homogeneous clusters corresponds to samples from Cabinda, but could not be further investigated for other classes due to lack of information about the samples.

Table 3
Qualitative correlations between classes and descriptors for the network trained with samples from data set I belonging to nine different geographical origins, and represented by 21 descriptors

Descriptor  Network 1                         Network 2                  Network 3            Consensus
1           A + C + E + F + H, G              A + E, G                   A + C + E, G         A/E/G
2           A + C + D, B + E + F + H + I      A + C + D, H + I           A + C, B + E         A/C
3           A                                 A                          A                    A
4           A + C, D + F + H, B + G           A + C, F, B + E            A + C, B + D + H, I  A/B/C
5           C + D, B + F + G, I               C, D, B + F + H + G, I     nc                   nc
6           D, B + E + F + H + I              D, E, B                    C + D                D
7           C + D, G                          C + D, A + B + G + I       H                    nc
8           C + D, F + G + H + I              D                          D                    D
9           D                                 D                          B + C, C + D         D
10          B + C + E + G + F, A              B + F + G, A + I           A                    A
11          B + C + D + E + G + I, A          B + C + D + E + G + I, A   nc                   nc
12          B + E + F + H, C + D + I          A + B + E + F + H, D + I   B + C + F, C         B/F
13          D + I, C                          A + B + E + F + G + H, I   nc                   nc
14          B + F + G, D + F + I, A + C + H   B + E, A + C + F + H + I   C                    C
15          C + D, H + I, F + G, B + E        A, H + I, B + E + G        C, H, B + E          B/E/H
16          C + D + E + F + G + H + I, B + G  A, D + E + H + I, B        A, B + G             B
17          B + G + H                         C, B + G                   D, H                 nc
18          C, D                              C, B + E + F + I           C, I                 C
19          B + D + E + F + G + H + I, C      B + D + E + F + H          E, C                 E
20          A + B + E + F + G + H, C + D      A + B + E + F + G + H, I   C                    nc
21          B + C + E + F + G + I, A + H      C + E + F + I, A           I, A                 A/I

When a descriptor can discriminate between more than one group of classes, the groups corresponding to lower values of the descriptor are listed first.
nc: no significant correlation was observed.


Fig. 3. Kohonen self-organizing map (20 × 20) resulting from the training with 115 samples represented by 15 descriptors. Samples belong to data set I and are from nine geographical origins. Each neuron is labelled according to the samples of the training set that were mapped onto it. Characters represent the classes of the test samples, and are placed on the neurons activated by them.

Only two predictions for the test set were wrong (highlighted with bold border): a sample of North Sea (a) that was predicted as being Angola and a sample of Nigeria (d) that was predicted as being Iran. However, the wrongly classified sample from North Sea was mapped onto a neuron that is adjacent to the large North Sea region. As mentioned before, the samples from Nigeria in the training set activated neurons in three different regions, which suggests a large diversity of origins within Nigeria, and difficulty of the network in identifying a common pattern. This is reflected in the wrong classification of one sample from Nigeria in the test set.

Once it was proved that Kohonen NNs could classify the samples on the basis of the GC–MS descriptors, we tried to also include the less populated classes (Table 1), and used all the 21 descriptors. New networks were trained, and the five networks that gave the best predictions for the training set were selected. Then these nets were tested with the independent test set, and the results are presented in Table 4.

Table 4. Predictions for the training and test sets of data set I (20 classes) by five individual networks

| Networks (a) | Training set (133 samples): Correct | Wrong | Undecided | Test set (55 samples): Correct | Wrong | Undecided |
|---|---|---|---|---|---|---|
| Net 1 (18 × 18) | 129 | 2 | 2 | 38 | 11 | 6 |
| Net 2 (18 × 18) | 131 | 2 | 0 | 38 | 11 | 6 |
| Net 3 (18 × 18) | 131 | 2 | 0 | 39 | 15 | 1 |
| Net 4 (19 × 19) | 129 | 2 | 2 | 37 | 12 | 6 |
| Net 5 (19 × 19) | 129 | 2 | 2 | 38 | 15 | 2 |
| Ensemble | 131 | 2 | 0 | 39 | 10 | 6 |

(a) In parentheses the size of the network is specified (number of neurons on the map).

Table 5. Predictions for the test set of data set I (20 classes) obtained by the ensemble of five networks

| Geographical class (number of samples) | True positives (a): number of samples | True positives (a): % | False positives (b): number of samples | False positives (b): % |
|---|---|---|---|---|
| Africa (2) | 1 | 50 | 1 | 50 |
| Angola–Angola (3) | 3 | 100 | 1 | 25 |
| Angola–Cabinda (3) | 3 | 100 | 0 | 0 |
| Dubai (2) | 1 | 50 | 0 | 0 |
| Egypt (2) | 0 | 0 | 0 | 0 |
| Indonesia (2) | 0 | 0 | 1 | 100 |
| Iran (2) | 2 | 100 | 2 | 50 |
| Iraq (4) | 4 | 100 | 1 | 20 |
| Kuwait (2) | 0 | 0 | 0 | 0 |
| Libya (2) | 1 | 50 | 0 | 0 |
| Malaysia (2) | 0 | 0 | 0 | 0 |
| Mexico (2) | 1 | 50 | 0 | 0 |
| Nigeria (4) | 3 | 75 | 2 | 40 |
| North Sea A (7) | 5 | 71 | 0 | 0 |
| North Sea B (3) | 3 | 100 | 0 | 0 |
| North Sea C (4) | 3 | 75 | 2 | 40 |
| North Sea D (0) | 0 | 0 | 0 | 0 |
| Russia (2) | 2 | 100 | 0 | 0 |
| Saudi Arabia (5) | 5 | 100 | 0 | 0 |
| Venezuela (2) | 2 | 100 | 0 | 0 |

In the entire test set six samples were undecided.
(a) Correct predictions as class X/number of samples of class X × 100.
(b) Wrong predictions as class X/number of samples predicted as class X × 100.

As expected, the inclusion of smaller classes introduced a higher degree of difficulty. However, ca. 97% of the training set was correctly predicted and ca. 70% of the samples from the test set were correctly predicted by each of the five networks. The predictions from the five networks were combined in an ensemble, each network contributing to the prediction with one vote. A prediction score was defined for each sample as the number of votes obtained by the winning class. The number of correct predictions slightly increased to 39 (71%), and the results are shown in Table 5, in terms of true and false positive predictions for each class. The inclusion of more classes in the training set, with a reduced number of objects, did not deteriorate the ability of the networks to classify samples from the more populated and better-defined classes.
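The voting scheme of the ensemble (one vote per network, prediction score equal to the number of votes for the winning class) can be sketched as follows. How ties and undecided networks were treated is not stated in the text, so this sketch assumes that an undecided network casts no vote and that a tie for first place leaves the sample undecided; the function name is hypothetical.

```python
from collections import Counter


def ensemble_predict(votes):
    """Combine per-network predictions into (winning class, prediction score).

    votes: one class label per network; None marks an undecided network.
    Returns (None, score) when no class wins outright (assumed behaviour).
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None, 0
    ranked = counts.most_common()
    top_class, top_votes = ranked[0]
    # A tie for first place is treated as undecided (assumption).
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, top_votes
    return top_class, top_votes
```

For example, `ensemble_predict(["Iraq", "Iraq", "Iraq", "Iran", None])` gives `("Iraq", 3)`: the sample is assigned to Iraq with a prediction score of 3 out of 5.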

Eight out of twenty geographical families (North Sea A, North Sea B, Saudi Arabia, Angola–Cabinda, Angola–Angola, Iraq, Russia, and Venezuela) have simultaneously more than 70% of true positives and 25% or less of false positives. On the other hand, two geographical families (Africa and Indonesia) exhibited 50% or less of true positives and 50% or more of false positives. We can see that three families were never predicted (Egypt, Malaysia, and Kuwait). These more difficult families have a lower number of samples in the training set.
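The two percentages used in this per-class analysis follow the definitions given in the footnotes of Table 5: true positives are counted relative to the number of samples of the class, and false positives relative to the number of samples predicted as that class. A minimal sketch, with a hypothetical function name and made-up toy labels (classes "X" and "Y"), where `None` stands for an undecided prediction:

```python
from collections import Counter


def class_rates(true_classes, predicted_classes):
    """Return {class: (true positive %, false positive %)}.

    TP% = correct predictions as X / samples of class X * 100
    FP% = wrong predictions as X / samples predicted as X * 100
    Undecided predictions (None) count for no class.
    """
    n_true = Counter(true_classes)
    n_pred = Counter(p for p in predicted_classes if p is not None)
    n_correct = Counter(
        t for t, p in zip(true_classes, predicted_classes) if t == p
    )
    rates = {}
    for cls in set(n_true) | set(n_pred):
        tp = 100.0 * n_correct[cls] / n_true[cls] if n_true[cls] else 0.0
        wrong = n_pred[cls] - n_correct[cls]
        fp = 100.0 * wrong / n_pred[cls] if n_pred[cls] else 0.0
        rates[cls] = (tp, fp)
    return rates


# Toy example: four samples of X all recognised, one Y predicted as X,
# one Y left undecided.
true_y = ["X", "X", "X", "X", "Y", "Y"]
pred_y = ["X", "X", "X", "X", "X", None]
rates = class_rates(true_y, pred_y)
```

Here class X has 100% true positives and 20% false positives (one of the five samples predicted as X is wrong), the same pattern as the Iraq row of Table 5, while class Y is never predicted and gets 0% for both figures.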

Predictions were evaluated according to their prediction scores, and interesting results could be achieved in terms of assessment of reliability (Table 6). A clear general trend is observed: the percentage of correct predictions increases with the prediction score. Although 71% of the samples from the test set were correctly predicted, the percentage increases to 94% for the predictions associated with a score of 5 (64% of the samples). This observation shows how an ensemble of networks can not only improve the predictions, but also associate a degree of reliability with each prediction. This feature is extremely important in the practical use of an automatic classification system.
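The relation between prediction score and reliability can be tabulated with a few lines of code. The sketch below groups decided samples by score and reports the share of correct predictions in each group; the function name is hypothetical, and the input records are rebuilt from the counts reported in Table 6 rather than from the original sample-level data.

```python
from collections import defaultdict


def accuracy_by_score(records):
    """records: iterable of (prediction_score, is_correct) for decided samples.

    Returns {score: (number of samples, number correct, percent correct)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for score, ok in records:
        totals[score][0] += 1
        totals[score][1] += int(ok)
    return {
        score: (n, c, round(100 * c / n))
        for score, (n, c) in sorted(totals.items(), reverse=True)
    }


# Counts taken from Table 6 (data set I, ensemble of five networks).
records = (
    [(5, True)] * 33 + [(5, False)] * 2
    + [(4, True)] * 3 + [(4, False)] * 3
    + [(3, True)] * 3 + [(3, False)] * 5
)
table6 = accuracy_by_score(records)
```

In practice such a table can serve as a lookup: a new prediction with a score of 5 would be trusted far more than one with a score of 3 or 4.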

Table 6. Correct predictions related to the prediction score for data set I

| Prediction score | Number of samples | Correct predictions |
|---|---|---|
| 5 | 35 | 33 (94%) |
| 4 | 6 | 3 (50%) |
| 3 | 8 | 3 (38%) |
| 2 | 0 | – |

Table 7. Predictions for the training and test sets of data set II (20 classes) by 10 individual networks

| Networks (a) | Training set (197 samples): Correct | Wrong | Undecided | Test set (177 samples): Correct | Wrong | Undecided |
|---|---|---|---|---|---|---|
| Net 1 (22 × 22) | 166 | 6 | 25 | 89 | 52 | 36 |
| Net 2 (23 × 23) | 169 | 2 | 26 | 91 | 41 | 45 |
| Net 3 (23 × 23) | 167 | 4 | 26 | 99 | 38 | 40 |
| Net 4 (24 × 24) | 178 | 2 | 17 | 98 | 37 | 42 |
| Net 5 (25 × 25) | 163 | 1 | 33 | 85 | 45 | 47 |
| Net 6 (25 × 25) | 164 | 2 | 31 | 96 | 28 | 53 |
| Net 7 (25 × 25) | 178 | 2 | 17 | 102 | 38 | 37 |
| Net 8 (26 × 26) | 168 | 0 | 29 | 102 | 30 | 45 |
| Net 9 (26 × 26) | 175 | 0 | 22 | 92 | 41 | 44 |
| Net 10 (26 × 26) | 169 | 1 | 27 | 82 | 32 | 63 |
| Ensemble | 184 | 0 | 13 | 106 | 22 | 49 |

(a) In parentheses the size of the network is specified (number of neurons on the map).

The ensemble of networks was finally tested with 12 samples that were exposed to simulated marine conditions (weathering), over 2, 6, and 12 days. This period of 12 days was found to be the most practically useful interval in real life situations, enabling the confident use of the set of descriptors here presented. After 12 days, 9 out of 12 samples were correctly classified, one was undecided and two were wrongly classified. Significantly, the three samples that were not correctly predicted belong to classes with a reduced number of objects in the training set (Mexico and Egypt) or with more than one cluster in the map (Nigeria). For the samples predicted with a prediction score of 4 or 5 the percentage of correct predictions was 89% (8 out of 9). The results for 2 days (eight correct, three undecided, one wrong) and 6 days (nine correct, one undecided, two wrong) were similar, with problems appearing in the same samples. Although a direct correlation is not possible between the duration of weathering in simulated conditions and in the sea, these results indicate that the SOMs can be safely applied to weathered samples. This is made possible by the design of robust GC–MS descriptors, namely the use of ratios between peaks corresponding to certain compounds. In fact, data for weathered samples which still contain n-pentadecane can safely be analyzed using descriptors based on aromatic and sulfur-containing compounds, such as descriptors 10–21. Descriptors based on hopanes, such as descriptors 1–8, can be used in practice only if n-hexadecane is still present [15].
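A descriptor built as a ratio between two peaks can be sketched as below. The actual definitions of descriptors 1–21 are not reproduced in this section, so the peak names and the function are purely hypothetical. The example only illustrates one simple property that makes ratios attractive: a ratio is unchanged when both peaks are scaled by the same factor, which is a strong simplification of what happens during real weathering.

```python
def ratio_descriptor(peaks, numerator, denominator):
    """Ratio of two peak areas, or None if the reference peak is absent.

    peaks: mapping from (hypothetical) compound name to peak area.
    """
    denom = peaks.get(denominator, 0.0)
    if denom <= 0.0:
        return None  # descriptor cannot be computed for this sample
    return peaks.get(numerator, 0.0) / denom


# Hypothetical peak areas for a fresh sample and for the same sample after
# a uniform 50% loss of signal (an idealised stand-in for weathering).
fresh = {"compound_a": 120.0, "compound_b": 40.0}
weathered = {name: 0.5 * area for name, area in fresh.items()}

fresh_ratio = ratio_descriptor(fresh, "compound_a", "compound_b")
weathered_ratio = ratio_descriptor(weathered, "compound_a", "compound_b")
```

Both calls give the same value (3.0), whereas the absolute peak areas differ by a factor of two. Returning `None` when the reference peak is missing mirrors the remark in the text that some descriptors are usable only while a given compound is still present.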

3.2. Classification of data set II

Several Kohonen networks were trained with the training set of data set II (20 geographical families and 56 descriptors), and were applied to the test set afterward. The results obtained for the 10 networks yielding the most accurate predictions for the training set are presented in Table 7.

Table 8. Predictions for the test set of data set II (20 classes) obtained by the ensemble of 10 networks

| Geographical class (number of samples) | True positives (a): number of samples | True positives (a): % | False positives (b): number of samples | False positives (b): % |
|---|---|---|---|---|
| Africa (9) | 7 | 78 | 1 | 13 |
| Angola (4) | 2 | 50 | 2 | 50 |
| Dubai (3) | 2 | 67 | 1 | 33 |
| Egypt (3) | 0 | 0 | 0 | 0 |
| England (2) | 0 | 0 | 0 | 0 |
| Indonesia (2) | 0 | 0 | 0 | 0 |
| Iran (6) | 2 | 33 | 1 | 33 |
| Iraq (3) | 1 | 33 | 1 | 50 |
| Kuwait (1) | 0 | 0 | 0 | 0 |
| Libya (5) | 4 | 80 | 2 | 33 |
| Lithuania (2) | 0 | 0 | 0 | 0 |
| Malaysia (2) | 0 | 0 | 1 | 100 |
| Mexico (1) | 0 | 0 | 0 | 0 |
| Middle East (5) | 2 | 40 | 2 | 50 |
| Nigeria (8) | 7 | 88 | 1 | 13 |
| North Sea (89) | 58 | 65 | 2 | 3 |
| Russia (7) | 3 | 43 | 3 | 50 |
| Saudi Arabia (10) | 6 | 60 | 4 | 40 |
| USA (4) | 1 | 25 | 1 | 50 |
| Venezuela (11) | 11 | 100 | 0 | 0 |

Forty-nine samples (28%) were undecided.
(a) Correct predictions as class X/number of samples of class X × 100.
(b) Wrong predictions as class X/number of samples predicted as class X × 100.

Table 9. Correct predictions related to the prediction score for data set II

| Prediction score | Number of samples | Correct predictions |
|---|---|---|
| 10 | 41 | 40 (98%) |
| 9 | 15 | 14 (93%) |
| 8 | 13 | 9 (69%) |
| 7 | 22 | 17 (77%) |
| 6 | 17 | 12 (71%) |
| 5 | 13 | 9 (69%) |
| 4 | 7 | 5 (71%) |
| 3 | 0 | – |
| 2 | 0 | – |

Correct predictions were obtained for the training and test sets with percentages of 83–90% and 46–58%, respectively. When the 10 networks were combined in an ensemble to give consensus predictions and associated prediction scores, more accurate classifications were obtained for the test set: 106 (60%) correct, 22 (12%) wrong, and 49 (28%) undecided. The results are shown in Table 8, separately for each class. It is noteworthy that the ensemble gave a higher number of correct predictions, and a lower number of wrong predictions, than any of the individual networks.
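The percentages quoted in this paragraph can be checked directly against the counts of Table 7. The short sketch below does the arithmetic; the variable and function names are illustrative, and the numbers are copied from Table 7.

```python
N_TRAIN, N_TEST = 197, 177

# Correct / wrong counts of the 10 individual networks (Table 7).
train_correct = [166, 169, 167, 178, 163, 164, 178, 168, 175, 169]
test_correct = [89, 91, 99, 98, 85, 96, 102, 102, 92, 82]
test_wrong = [52, 41, 38, 37, 45, 28, 38, 30, 41, 32]

# Ensemble results on the test set (Table 7, last row).
ensemble = {"correct": 106, "wrong": 22, "undecided": 49}


def pct(count, total):
    """Percentage rounded to the nearest integer."""
    return round(100 * count / total)


train_range = (pct(min(train_correct), N_TRAIN), pct(max(train_correct), N_TRAIN))
test_range = (pct(min(test_correct), N_TEST), pct(max(test_correct), N_TEST))
ensemble_pct = {name: pct(count, N_TEST) for name, count in ensemble.items()}

# The ensemble is compared with the best individual network on each criterion.
ensemble_beats_all = (
    ensemble["correct"] > max(test_correct)
    and ensemble["wrong"] < min(test_wrong)
)
```

The computed ranges reproduce the 83–90% and 46–58% stated in the text, and the comparison confirms that the ensemble (106 correct, 22 wrong) outperforms the best single network on both counts (102 correct, 28 wrong).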

Now for 6 out of 20 geographic classes the percentage of true positives is ≥65% and simultaneously the percentage of false positives is ≤33%: Africa, Dubai, Libya, Nigeria, North Sea, and Venezuela. Four of these classes additionally had a respectable number of samples correctly predicted in the test set: North Sea, Venezuela, Africa, and Nigeria. Assignment of a sample to one of these classes is expected to be more reliable than to others. North Sea and Venezuela were also classified with high accuracy and low incidence of false positives in data set I, using a different set of descriptors. On the other hand, the set of descriptors in data set I was more successful in identifying Iraq and Angola classes. Table 8 also shows that predictions could not be made for classes with a small number of objects in the training set, e.g., Lithuania, England, Mexico, Indonesia, and Kuwait. Egypt was never predicted in the test set, although there were seven samples in the training set.

Table 9 shows the relationship between the percentage of correct predictions and the prediction scores. Again, the prediction score allows for some inference about the accuracy of predictions. A prediction score ≥9 was obtained for 32% of the samples, and the percentage of correct predictions in this group was 96%. For scores below nine, this percentage is in the range of 69–77%.

Although both sets of descriptors derived from GC–MS data were found useful, the results obtained with descriptors of Instituto Hidrográfico were slightly better than those of EUROCRUDE. In general, the method performed particularly well with large classes, and presented some weakness with classes less populated in the training set, which is not a surprising result. In fact, as expected, for a self-organizing map to define a region that reliably corresponds to a class, the common features of the class must emerge over the singularities of the samples, which requires that the diversity of the class is well covered by the objects of the training set. This however is not a serious problem since the enlargement of the training set can be done incrementally, and thus the method may be gradually improved [23].

In conclusion, Kohonen SOMs could accurately classify crude oil samples in terms of geographical origin, particularly in the case of classes well represented in the training set. The predictions were improved with ensembles of SOMs that also associated a measure of reliability to each prediction. For data set I, the SOMs could be interpreted in order to identify descriptors with discriminating ability within the universe of provided classes, such as 28,30-bisnorhopane for North Sea and 1,6-DMP + 2,9-DMP for Angola (Table 3). These models were robust enough to accommodate, by retraining, new data from new geographical classes, and to make accurate classification of weathered samples.

Acknowledgements

We are thankful to Instituto Hidrográfico (Lisbon, Portugal) for the help provided in using the crudes' analytical database, and to Faculdade de Ciências e Tecnologia of Universidade Nova de Lisboa for the computing facilities.

References

[1] N.B. Vogt, C.E. Sjoegren, Anal. Chim. Acta 222 (1989) 135.
[2] F.K. Kawahara, J. Chromatogr. Sci. 10 (1972) 629.
[3] F.K. Kawahara, J.F. Santner, E.C. Julian, Anal. Chem. 46 (1974) 266.
[4] For an investigation of neural networks for the geographical classification of bitumens on the basis of near-infrared spectroscopy see: M. Blanco, S. Maspoch, I. Villarroya, X. Peralta, J.M. Gonzalez, J. Torres, Appl. Spectrosc. 55 (2001) 834.
[5] E.M. Levy, Water Res. 6 (1972) 57.
[6] E.S.V. Vleet, Mar. Technol. Soc. J. 18 (1984) 11.
[7] O.M. Kvalheim, D.W. Aksnes, T. Brekke, M.O. Eide, E. Sletten, N. Telnaes, Anal. Chem. 57 (1985) 2858.
[8] S.D. Killops, J.W. Readman, Org. Geochem. 8 (1985) 247.
[9] For a review on oil spill identification see: Z.D. Wang, M. Fingas, J. Chromatogr. A 774 (1997) 51.
[10] A.G. Rodgers, R.P. Marshall, Accounts Chem. Res. 37 (2004) 53.
[11] B. Tissot, D. Welte, Petroleum Formation and Occurrence, Springer-Verlag, 1978.
[12] J.L. Biscaya, H.C. Neves, Anais do Instituto Hidrografico 13 (1992) 49.
[13] Nordtest Method, NT Chem 001, NORDTEST, 2nd ed., Espoo, Finland, 1991.
[14] ASTM International. http://www.astm.org.
[15] J.L. Biscaya, Ph.D. Thesis 1997, Universidade Nova de Lisboa.
[16] EUROCRUDE®, European Crude Oil Identification System – Final Report, 1995.
[17] J. Gasteiger, J. Zupan, Angew. Chem. Int. Ed. Engl. 32 (1993) 503.
[18] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, 2nd ed., Wiley-VCH, Weinheim, 1999.
[19] J. Aires-de-Sousa, Chemometr. Intell. Lab. Syst. 61 (2002) 167.
[20] C.M. Ekweozor, J.I. Okogun, D.E.U. Ekong, J.R. Maxwell, Chem. Geol. 27 (1979) 29.
[21] O.T. Udo, C.M. Erweozor, Energy Fuels 4 (1990) 248.
[22] N. Telnaes, B. Dahl, Org. Geochem. 10 (1986) 425.
[23] Y. Binev, M. Corvo, J. Aires-de-Sousa, J. Chem. Inf. Comp. Sci. 44 (2004) 946.