evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf ·...

11
Evolutionary association rules for total ozone content modeling from satellite observations M. Martínez-Ballesteros a , S. Salcedo-Sanz b, , J.C. Riquelme a , C. Casanova-Mateo c , J.L. Camacho d a Department Languages and Information Systems, Universidad de Sevilla, Seville, Spain b Department of Signal Theory and Communications, Universidad de Alcalá, Madrid, Spain c Department of Applied Physics, Universidad de Valladolid, Valladolid, Spain d Meteorological State Agency of Spain (AEMET), Madrid, Spain abstract article info Article history: Received 30 March 2011 Received in revised form 5 September 2011 Accepted 23 September 2011 Available online 1 October 2011 Keywords: Association rules Evolutionary algorithms Total Ozone Content (TOC) Satellite data In this paper we propose an evolutionary method of association rules discovery (EQAR, Evolutionary Quan- titative Association Rules) that extends a recently published algorithm by the authors and we describe its ap- plication to a problem of Total Ozone Content (TOC) modeling in the Iberian Peninsula. We use TOC data from the Total Ozone Mapping Spectrometer (TOMS) on board the NASA Nimbus-7 satellite measured at three lo- cations (Lisbon, Madrid and Murcia) of the Iberian Peninsula. As prediction variables for the association rules we consider several meteorological variables, such as Outgoing Long-wave Radiation (OLR), Temperature at 50 hPa level, Tropopause height, and wind vertical velocity component at 200 hPa. We show that the best as- sociation rules obtained by EQAR are able to accurate modeling the TOC data in the three locations consid- ered, providing results which agree to previous works in the literature. © 2011 Elsevier B.V. All rights reserved. 1. Introduction Modeling ozone series from satellite observations, past data and its relationship with meteorological variables is an important topic quite often tackled in the literature [111]. The interest in modeling ozone series started on the early 70's, when changes into the strato- spheric ozone were claimed to be caused by catalytic reactions in the stratosphere that originated losses in the total amount of ozone [12,13]. More specically, other studies on this topic focused on the role of chlorine [14] and the CFCs [15] in ozone losses at the strato- sphere. Those theories were conrmed by the observation of a sharp decrease in the stratospheric ozone levels over Antarctica at the start of the southern spring season in the middle 80's over several polar bases of this continent [16]. A wide review on concepts and history of ozone depletion can be found in [17,18]. In recent years, ozone variation has been related to climate change, so ozone modeling has become an important indicator of deep changes in the atmosphere. That is why very different ap- proaches can be found in the modeling of ozone series in recent bib- liography. Specically, a large amount of works dealing with Total Ozone Content (TOC) of the atmosphere have been published in the last few years, since it seems that variations in these TOC series are a more complete indicator of climate change than only stratospheric ozone series. Thus, there are important works devoted to comparison of different satellite and terrestrial measurements of TOC over differ- ent sites [1921]. The inuence of aerosols in total ozone measures is analyzed in [22], where ground and satellite measures are considered. Studies on rare events related to ozone content are studied as well in the literature [23], this includes cases located at the Iberian Peninsula [24]. Also, the modeling of TOC variability has been previously stud- ied, treating different aspects such as its relationship with atmospher- ic circulation and dynamics or with greenhouse gasses [1,8,25,26] In this paper, we present an analysis of TOC series modeling in the Iberian Peninsula using Association Rules (ARs) obtained by an evolu- tionary algorithm. The discovery of ARs is a non-supervised learning and descriptive tool, which explains or summarizes the data, i.e., ARs are used to explore the properties of the data, instead of predict- ing the class of new data [27]. The aim of ARs mining is discover the presence of pairs (attributevalue), which appear in a dataset with certain frequency, in order to obtain rules that show the existing relationships among the attributes. There exist many algorithms for obtaining ARs from a dataset, such as AIS [28], Apriori [29], and SETM [30]. However, many of these tools that work in continuous domains just discretize the attributes by using a specic strategy and deal with these attributes as if they were discrete, which may lead to poor results in real continuous datasets. Another important class of techniques for ARs discovery is based on evolutionary algo- rithms (EAs), which have been extensively used for the optimization and adjustment of models in data mining tasks. EAs are search algo- rithms which generate solutions for optimization problems using Chemometrics and Intelligent Laboratory Systems 109 (2011) 217227 Corresponding author at: Department of Signal Theory and Communications, Uni- versidad de Alcalá, 28871 Alcalá de Henares, Madrid, Spain. Tel.: + 34 91 885 6731; fax: +34 91 885 6699. E-mail address: [email protected] (S. Salcedo-Sanz). 0169-7439/$ see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2011.09.011 Contents lists available at SciVerse ScienceDirect Chemometrics and Intelligent Laboratory Systems journal homepage: www.elsevier.com/locate/chemolab

Upload: others

Post on 14-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

Contents lists available at SciVerse ScienceDirect

Chemometrics and Intelligent Laboratory Systems

j ourna l homepage: www.e lsev ie r .com/ locate /chemolab

Evolutionary association rules for total ozone content modeling fromsatellite observations

M. Martínez-Ballesteros a, S. Salcedo-Sanz b,⁎, J.C. Riquelme a, C. Casanova-Mateo c, J.L. Camacho d

a Department Languages and Information Systems, Universidad de Sevilla, Seville, Spainb Department of Signal Theory and Communications, Universidad de Alcalá, Madrid, Spainc Department of Applied Physics, Universidad de Valladolid, Valladolid, Spaind Meteorological State Agency of Spain (AEMET), Madrid, Spain

⁎ Corresponding author at: Department of Signal Theversidad de Alcalá, 28871 Alcalá de Henares, Madrid, Spa+34 91 885 6699.

E-mail address: [email protected] (S. Salcedo-S

0169-7439/$ – see front matter © 2011 Elsevier B.V. Alldoi:10.1016/j.chemolab.2011.09.011

a b s t r a c t

a r t i c l e i n f o

Article history:Received 30 March 2011Received in revised form 5 September 2011Accepted 23 September 2011Available online 1 October 2011

Keywords:Association rulesEvolutionary algorithmsTotal Ozone Content (TOC)Satellite data

In this paper we propose an evolutionary method of association rules discovery (EQAR, Evolutionary Quan-titative Association Rules) that extends a recently published algorithm by the authors and we describe its ap-plication to a problem of Total Ozone Content (TOC) modeling in the Iberian Peninsula. We use TOC data fromthe Total Ozone Mapping Spectrometer (TOMS) on board the NASA Nimbus-7 satellite measured at three lo-cations (Lisbon, Madrid and Murcia) of the Iberian Peninsula. As prediction variables for the association ruleswe consider several meteorological variables, such as Outgoing Long-wave Radiation (OLR), Temperature at50 hPa level, Tropopause height, and wind vertical velocity component at 200 hPa. We show that the best as-sociation rules obtained by EQAR are able to accurate modeling the TOC data in the three locations consid-ered, providing results which agree to previous works in the literature.

ory and Communications, Uni-in. Tel.: +34 91 885 6731; fax:

anz).

rights reserved.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Modeling ozone series from satellite observations, past data andits relationship with meteorological variables is an important topicquite often tackled in the literature [1–11]. The interest in modelingozone series started on the early 70's, when changes into the strato-spheric ozone were claimed to be caused by catalytic reactions inthe stratosphere that originated losses in the total amount of ozone[12,13]. More specifically, other studies on this topic focused on therole of chlorine [14] and the CFCs [15] in ozone losses at the strato-sphere. Those theories were confirmed by the observation of asharp decrease in the stratospheric ozone levels over Antarctica atthe start of the southern spring season in the middle 80's over severalpolar bases of this continent [16]. A wide review on concepts andhistory of ozone depletion can be found in [17,18].

In recent years, ozone variation has been related to climatechange, so ozone modeling has become an important indicator ofdeep changes in the atmosphere. That is why very different ap-proaches can be found in the modeling of ozone series in recent bib-liography. Specifically, a large amount of works dealing with TotalOzone Content (TOC) of the atmosphere have been published in thelast few years, since it seems that variations in these TOC series are

a more complete indicator of climate change than only stratosphericozone series. Thus, there are important works devoted to comparisonof different satellite and terrestrial measurements of TOC over differ-ent sites [19–21]. The influence of aerosols in total ozone measures isanalyzed in [22], where ground and satellite measures are considered.Studies on rare events related to ozone content are studied as well inthe literature [23], this includes cases located at the Iberian Peninsula[24]. Also, the modeling of TOC variability has been previously stud-ied, treating different aspects such as its relationship with atmospher-ic circulation and dynamics or with greenhouse gasses [1,8,25,26]

In this paper, we present an analysis of TOC series modeling in theIberian Peninsula using Association Rules (ARs) obtained by an evolu-tionary algorithm. The discovery of ARs is a non-supervised learningand descriptive tool, which explains or summarizes the data, i.e.,ARs are used to explore the properties of the data, instead of predict-ing the class of new data [27]. The aim of ARs mining is discover thepresence of pairs (attribute–value), which appear in a dataset withcertain frequency, in order to obtain rules that show the existingrelationships among the attributes. There exist many algorithms forobtaining ARs from a dataset, such as AIS [28], Apriori [29], andSETM [30]. However, many of these tools that work in continuousdomains just discretize the attributes by using a specific strategyand deal with these attributes as if they were discrete, which maylead to poor results in real continuous datasets. Another importantclass of techniques for ARs discovery is based on evolutionary algo-rithms (EAs), which have been extensively used for the optimizationand adjustment of models in data mining tasks. EAs are search algo-rithms which generate solutions for optimization problems using

Page 2: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 1Meteorological input variables and associated grid size. “IP” stands for IberianPeninsula.

Area Met. variable Variable name Grid coordinates

IP 50 hPa temperature t50 35–42.5 N, 12.5 W–5EIP Outgoing Longwave Radiation OLR 35–42.5 N, 12.5 W–5EIP Omega at 200 hPa ω200 35–42.5 N, 12.5 W–0EIP Tropopause height TPG 35–42.5 N, 12.5 W–5ELisbon Tropopause height TPW 35–42.5 N, 12.5 W–5 WMadrid Tropopause height TPC 35–42.5 N, 7.5 W–0 WMurcia Tropopause height TPE 35–42.5 N, 2.5 W–5E

218 M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

techniques inspired by natural evolution [31]. They are implementedas a computer simulation in which a population of abstract represen-tations (chromosomes) of candidate solutions (individuals) for an op-timization problem evolves toward better solutions. EAs can be usedto discover ARs, since they offer a set of advantages for knowledge ex-traction and specifically for rule induction processes. In this work, theevolutionary algorithm proposed in [32] has been extended andcalled EQAR (Evolutionary Quantitative Association Rules). EQAR isapplied to the ARs extraction to explain TOC data. The new featuresadded improve the AR mining task and result in the TOC modelingin the Iberian Peninsula. We show that the best rules obtained bythe EQAR approach are able to accurately model the TOC data in thethree locations considered, providing results which agree to previousworks in the literature.

The structure of the rest of the paper is as follows: next sectionpresents the available TOC data and input meteorological variablescollected for this study, the description of measurement locationand their characteristics. In this section we also detail the predictionvariables used in the paper. Section 3 describes the main characteris-tics of the evolutionary algorithm used to obtain the associated rules.Section 4 presents the main results obtained using the associativerules obtained in the explanation of the TOC series in the three loca-tions considered within the Iberian Peninsula. Section 6 closes thepaper giving some final conclusions.

2. TOC data over the Iberian Peninsula and prediction variables

Monthly mean satellite measurements of TOMS (Total OzoneMapping Spectrometer, on board the NASA Nimbus-7 satellite[33,34]) data for the period 1979–1993 have been used in thisstudy. In addition, a group of several meteorological variables hasbeen selected as input (prediction) variables. Specifically: tropopauseheight (hPa), TP, outgoing longwave radiation (Wm−2), OLR, temper-ature at 50 hPa (K), t50 and air vertical velocity at 200 hPa (hPa/s),a200. All these variables have been obtained with a spatial resolutionof 2.5 degree latitude×2.5 degree longitude from NCEP/NCAR reana-lysis [35,36]. These four meteorological variables have been selectedbecause all of them have a close relation to TOC concentration:

1. Temperature at 50 hPa (t50): Many studies have shown that mapsof total ozone and 50 hPa temperature look very similar, reflectinga very close coupling between them [8,37]. These studies highlightthe fact that, as a rule of the thumb, a 10 Dobson Units (DU)change in total ozone corresponds to a 1 K change of 50 hPa tem-perature. Consequently, this meteorological variable should becorrelated with TOC values.

2. Tropopause height (TP): The tropopause is a transition layer be-tween the troposphere and the stratosphere. It is not uniformlythick, and it is not continuous from the equator to the poles. Aswell, tropopause separates the well-mixed ozone poor tropo-sphere and the stratified ozone rich and well mixed stratosphere.This fact gives the key to use the tropopause as a proxy to analyzeTOC values. According to [8], in a tropospheric high pressure sys-tem, sinking air in the troposphere leads to an adiabatic warming,causing tropopause and low stratosphere air to rise. As a conse-quence of these vertical movements, the lower stratosphere coolsadiabatically and ozone-poor air moves up, decreasing totalozone. The opposite occurs in tropospheric low pressure systems.Thus, it can be said that high tropopause values are correlatedwith low total ozone and a low tropopause values with high totalozone [7].As has been shown in [38], the selected definition of the tropo-pause, thermal or dynamical, is not critical. Therefore we have de-cided to use the thermal one, following the standard criterion ofthe World Meteorological Organization (WMO) to define thermaltropopause: the lowest level at which the lapse rate decreases to

2 K ⋅km−1 or less, provided also the average lapse rate betweenthis level and all higher levels within 2 km does not exceed2 K ⋅km−1 [39]. To determine each thermal tropopause fromNCAR reanalysis data, we have used the methodology proposedby [40] using European Centre for Medium-Range Weather Fore-casts (ECMWF) reanalysis (ERA) data.

3. Vertical wind velocity (ω200): As stated before, vertical move-ments through the tropopause bring ozone-poor air into thestratosphere, attenuating ozone-layer. Conversely, descending airfrom the upper layers of the stratosphere bring ozone-rich airinto the ozone-layer, increasing the density of this layer. In[41,42] the authors have proposed a phenomenological model toexplain this idea (another discussion about this model can befound in [43]). Thus, in order to deepen in the correlation betweenvertical movements and variations in total ozone, ω (total time de-rivative of the pressure -isobaric coordinate system-) can be usedfor this purpose. ω negative values will indicate ascending move-ments, whereas positive omega values will indicate descendingmovements.

4. Outgoing longwave radiation (OLR): Among other gasses, ozone isone of the most important absorbers in the atmosphere. The ozonemolecule has a relatively strong rotation spectrum. The threefundamental ozone vibrational bands occur at wavelengths of9.066, 14.27, and 9.597 μm, respectively. The very strong9.597 μm and moderately strong 9.066 μm fundamentals combineto make the well-known 9.6 μm band of ozone [44]. Because this9.6 μm band is a portion of the infrared region of the electromag-netic spectrum, a direct relationship exists between ozone andthe OLR [45] and can be used to characterize TOC.

Our study is focused in three locations of the Iberian Peninsula:Lisbon (38.70 N, 9.10 W), Madrid (40.40 N, 3.70 W) and Murcia(38.00 N, 1.10 W). The four meteorological variables have been calcu-lated using a spatial grid covering the three locations. In addition,having into account the strong correlation between tropopauseheight and TOC, we have decided to calculate this meteorological var-iable with a customized grid for each of the three locations, i.e., TPvariable is divided into four different variables depending on each lo-cation (the global TP for the Iberian Peninsula (TPG) and the TP vari-able calculated with a grid centered at each location (TPW, TPC andTPE for Lisbon, Madrid and Murcia, respectively)). Table 1 summarizesthe grids used in this study.

3. Methods

In this section we introduce the main AR concepts necessary tofollow the rest of the paper, and also the evolutionary algorithm pro-posed in this work in order to look for ARs.

3.1. Association rules

The massive use of computational processing techniques has rev-olutionized the scientific research due to the high volume of data

Page 3: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 2Illustrative dataset.

Instance F1 F2 F3

t1 35 183 88t2 42 154 47t3 37 186 93t4 30 199 112t5 33 173 83t6 24 178 75t7 63 177 91t8 22 167 60

219M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

which can be obtained. Data mining is one of the most used instru-mental tool for discovering knowledge from transactions. In thefield of data mining, the learning of ARs is a popular and well-known research method for discovering interesting relations amongvariables in large databases [29,46].

Formally, ARs were first defined by Agrawal et al. in [28] asfollows. Let I={i1, i2,…, in} be a set of n items and D={t1, t2,…, tN} aset of N transactions, where each tj contains a subset of items. Thus,a rule can be defined as X⇒Y, where X,Y I and X∩Y=∅. Finally, Xand Y are called antecedent (or left side of the rule) and consequent(or right side of the rule), respectively.

When the domain is continuous, the ARs are known as Quantita-tive Association Rules (QAR). In this context, let F={F1,…,Fn} be aset of features, with values in ℝ. Let A and C be two disjunct subsetsof F, that is, A⊂F, C⊂F, and A∩C=∅. A QAR is a rule X⇒Y, inwhich features in A belong to the antecedent X, and features in C be-long to the consequent Y, such that X and Y are formed by a conjunc-tion of multiple boolean expressions of the form Fi∈ [v1,v2]. Theconsequent Y is usually a single expression. In this proposal, QAR isused, since the domain variable (TOC) is a continuous one.

3.2. Quality measures for association rules

The following paragraphs detail the most popular quality mea-sures used to evaluate an AR. Note that it is very important to havea measure of the quality of a given rule in order to select the bestset of rules. In the ARs mining process, probability-based measuresthat evaluate the generality and reliability of ARs have been selected.In particular, the support measure is used to represent the generalityof the rule and the confidence, the lift and the leverage are normallyused to represent the reliability of the rule [47,48]. The formal defini-tions of these variables are the following:

• Support(X) [48]: The support of an itemset X is defined as the ratioof instances in the dataset that satisfy X. Usually, the support of X isnamed as the probability of X.

sup Xð Þ ¼ P Xð Þ ¼ n Xð ÞN

: ð1Þ

where n(X) is the number of occurrences of the itemset X in thedataset, and N is the number of instances forming such dataset.

• Support(X⇒Y) [48]: The support of the rule X⇒Y is the percentageof instances in the dataset that satisfy X and Y simultaneously.

sup X⇒Yð Þ ¼ P Y∩Xð Þ ¼ n XYð ÞN

: ð2Þ

where n(XY) is the number of instances that satisfy the conditionsfor the antecedent X and consequent Y simultaneously.

• Confidence(X⇒Y) [48]: The confidence is the probability that in-stances satisfying X, also satisfy Y. In other words, it is the supportof the rule divided by the support of the antecedent.

conf X⇒Yð Þ ¼ P Xð jYÞ ¼ sup X⇒Yð Þsup Xð Þ ð3Þ

• Lift(X⇒Y) [49]: Lift or interest is defined as how many times moreoften X and Y are together in the dataset than expected, assumingthat the presence of X and Y in instances is statistically indepen-dent. Lifts greater than one involve statistical dependence in simul-taneous occurrence of X and Y, in other words, the rule providessuccessful information about X and Y occurring together in thedataset.

lift X⇒Yð Þ ¼ sup X⇒Yð Þsup Xð Þsup Yð Þ ¼

conf X⇒Yð Þsup Yð Þ ð4Þ

• Leverage(X⇒Y) [50]: Leverage measures the proportion of addi-tional cases covered by both X and Y above those expected if Xand Y were independent of each other. Leverage takes values inside[−1, 1]. Values equal or under value 0, indicate a strong indepen-dence between antecedent and consequent. On the other handvalues near 1 are expected for an important association rule. Valuesabove 0 are desirable. In addition, leverage is a lower bound for sup-port, and therefore, optimizing only the leverage guarantees a cer-tain minimum support (contrary to optimizing only theconfidence or only the lift).

lev X⇒Yð Þ ¼ sup X⇒Yð Þ−sup Xð Þsup Yð Þ ð5Þ

• Accuracy(X⇒Y) [48]: Accuracy measures the degree of veracitythus, the degree of fit (matching) between the obtained valuesand the actual data. An accuracy of 100% means that the measuredvalues are exactly the same as the given values. In the field of min-ing association rules, accuracy measures the sum of the percentageof instances in the dataset that satisfy the antecedent and the con-sequent and the percentage of instances in the dataset that do notsatisfy neither the antecedent nor the consequent. Accuracy takesvalues inside [0, 1] and values near 1 are expected for a rule withhigh quality and veracity.

Acc X⇒Yð Þ ¼ sup X⇒Yð Þ þ sup ÄX Ä⇒Y� �

ð6Þ

where ¬ means negation, therefore sup(¬X⇒¬Y) is the percentageof instances in the dataset that do not satisfy X and Ysimultaneously.

In most cases, it is enough to focus on a combination of support,confidence, and either lift or leverage to obtain a good measure ofthe rule “quality”. However, how good a rule is for modeling a datasetin terms of usefulness and actionability is a subjective concept, anddepends on the particular domain and the business objectives.

For a better understanding of these quality measures, we give asmall example, by using a dataset comprising eight instances andthree features are shown in Table 2. Consider then two examplerules, henceforth called Rule (7) and Rule (8), respectively:

F1∈ 32;35½ �∧F2∈ 179;188½ �⇒F3∈ 84;94½ � ð7Þ

F1∈ 32;35½ �∧F2∈ 179;188½ �⇒F3∈ 46;94½ � ð8Þ

In Rule (7), the support of the antecedent is 12.5%, since oneinstance, t1, simultaneously satisfy that F1 and F2 belong to the inter-vals [32,35] and [179,188], respectively (one instance out of eight, sup(X)=0.125). As for the support of the consequent, sup(Y)=0.375 be-cause instances t1, t3 and t7 satisfy that F3∈ [84,94]. Regarding theconfidence, only one instance t1 satisfies all the three features (F1and F2 in the antecedent, and F3 in the consequent) appearing in therule; in other words, sup(X⇒Y)=0.125. Therefore, conf(X⇒Y)=0.125/0.125=1, that is, the rule has a confidence of 100%. The lift

Page 4: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

l2 u 2l1 u 1

t 2t 1

ln un...

t n...

Fig. 1. Representation of an individual of the evolutionary algorithm's population.

179 18832 35

11

84 94

2

Fig. 2. Representation of Rule (6) example.

220 M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

is lift(X⇒Y)=0.125/(0.125⋅0.375)=2.66, the leverage is lev(X⇒Y)=0.125−(0.125⋅0.375)=0.078 and the accuracy is acc(X⇒Y)=0.125+0.625=0.75, since sup(X⇒Y)=0.125, sup(¬X⇒¬Y)=0.625, sup(X)=0.125 and sup(Y)=0.375, as discussed before.

In Rule (8), the support of the antecedent is the same as in Rule(7), i.e. 12.5%, since one instance, t1, simultaneously satisfy that F1and F2 belong to the intervals [32,35] and [179,188]. However, thesupport of the consequent is sup(Y)=0.875 because all instances ex-cept t4 satisfy that F3∈ [46,94]. The confidence in this rule is the samethat in Rule (7), only one instance satisfies all the three featuresappearing in the rule, i.e., sup(X⇒Y)=0.125. Therefore, the confi-dence value of this rule is also 100%. Regarding the lift or interest, lift(X⇒Y)=0.125/(0.125⋅0.875)=1.14, the leverage is lev(X⇒Y)=0.125−(0.125⋅0.875)=0.016 and the accuracy is acc(X⇒Y)=0.125+0.125=0.25, since sup(X⇒Y)=0.125, sup(¬X⇒¬Y)=0.125, sup(X)=0.125 and sup(Y)=0.875, as discussed before.

Note that confidence does not take into account the support of therule consequent, because the confidence is the same in the two Rules(7) and (8). The lift of the rule should be considered to solve thisdrawback. Lift or interest measures the degree of dependence be-tween the antecedent and the consequent. The lift of Rule (7) andRule (8) is 2.66 and 1.14 respectively. Here, lift of Rule (7) is largerthan the lift of Rule (8), which corresponds to our intuition that thefirst rule is more interesting than the second one. Regarding thevalues of accuracy and leverage are also higher for the Rule (7).Therefore, we can conclude that first rule has better quality, accuracy,interest and strong dependency between the antecedent and conse-quent than the second one even if they have the same confidence.

3.3. EQAR: an effective evolutionary algorithm for AR searching

As has been previously mentioned, EAs have been quite used todiscover ARs, since they offer several advantages for knowledge ex-traction and specifically for rule induction processes. In [51] the au-thors proposed an EA to obtain numeric ARs, dividing the process intwo phases. Another EA was used in [52] to obtain QAR where theconfidence was optimized in the fitness function. In [53] a multi-objective pareto-based EA was presented in which the fitness func-tion was composed by four different objectives. A study of three evo-lutionary ARs extraction methods was presented in [54] to show theireffectiveness for mining ARs in quantitative datasets. Other EAs thatuse a weighted scheme for the fitness function which involved sever-al evaluation measures of rules were presented in [55] and [32]. Themain motivation of these works was to develop an algorithm able tofind QAR over datasets with continuous attributes without a previousdiscretization in the process. In fact, in this paper, we use the basicscheme algorithm proposed in [32] and we extend this approach,henceforth called EQAR (Evolutionary Quantitative AssociationRules), with new features in order to improve the ARs mining task.The results were obtained by EQAR in our problem of TOC modelingin the Iberian Peninsula.

EQAR follows the general scheme of the CHC binary-coded evolu-tionary algorithm proposed by Eshelman in 1991 [56]. The originalCHC presents an elitist strategy for selecting the population that willmake up the next generation and includes strong diversity in the evo-lutionary process through mechanisms of incest prevention and aspecific operator of crossover called Half Uniform (HUX). Further-more, the population is re-initialized when its diversity is poor.However EQAR adopts a more conservative re-initialization strategyand a less disruptive crossover operator than the HUX crossoverprocedure.

The search of the most appropriate intervals is carried out bymeans of EQAR and the intervals are adjusted to find ARs with highquality. Each individual constitutes a rule in the population. Eachgene of an individual represents the limits of the intervals and thetype of each attribute to indicate whether it belongs to the

antecedent, consequent or not belonging to the rule. Thus, the repre-sentation of an individual consists in two data structures as shown inFig. 1. The upper structure includes all the attributes of the database,where lj is the lower limit of the range and uj is the upper limit. Thebottom structure indicates the membership of an attribute to therule represented by an individual. The type of each attribute tj, canhave three values: 0 when the attribute does not belong to the rule,1 if it belongs to the antecedent of the rule and 2 when it belongs tothe consequent part.

An illustrative example of Rule (7) is depicted in Fig. 2. In particu-lar, the rule F1∈ [32,35]∧F2∈ [179,188]⇒F3∈ [84,94] is repre-sented. Note that attributes F1 and F2 appear in the antecedent andF3 in the consequent. Therefore t1= t2=1 and t3=2.

The individuals of the population are subjected to an evolutionaryprocess in which both crossover operator with incest prevention andre-initialization of the population are applied. At the end of this pro-cess, the fittest individual is designated as the best rule. Moreover, thefitness function has been provided with a set of parameters so thatthe user can drive the search process depending on the desiredrules. The proposed algorithm is based on the Iterative Rule Learning(IRL) [57]. The punishment of the covered instances allows the subse-quent rules found by EQAR to try to cover those instances that werestill not covered. General scheme of the IRL is shown in Fig. 4.

In addition to the features of the algorithm described above, newones have been added in order to improve the performance and thequality of the rules obtained in this specific problem of TOC analysis.The generation of the initial population for each evolutionary processhas been modified to help the examples that are covered by a fewrules, and also the fitness function has been expanded. These newfunctionalities of EQAR are detailed in the following subsections.

3.3.1. Generation of the initial populationThe generation of the initial population is carried out at the begin-

ning of each evolutionary process. It must be noted that the genera-tion of the rules in EQAR is different to the algorithm proposed in[32], in which the process for generating the initial population wascarried out in such a way that at least one randomly chosen sampleor instance of the dataset was covered. However, in EQAR the samplesof the dataset are not randomly selected but they are selected basedon their level of hierarchy. The hierarchy is organized according tothe number of rules which cover a sample. Thus, the records are

Page 5: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

169.54 176.4631.35 34.65

01

76.36 89.64

2

Fig. 3. Representation of an individual example of generation of initial population.

221M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

sorted by the number of rules that are covered and the samples cov-ered by few rules have a higher priority.

A sample is selected according to the inverse of the number ofrules which cover such sample. Intuitively, the process is similar toroulette selection method where the parents are selected dependingon their fitness. In the roulette selection method, a sample is repre-sented by a portion of roulette inversely proportional to the numberof rules that cover such sample. Thus, the samples covered by a fewrules have a greater portion of the roulette and, therefore, they willbe more likely selected. In the first evolutionary process, all sampleshave the same probability to be selected.

This process for generating the initial population can be describedby means of a pseudo-code, as follows.

(1) For all instances of the database the cumulative sum, totalSum,of the inverse of the number of rules that cover every instanceis calculated.

(2) A random number R between 0 and totalSum is generated.(3) For each instance of database, if the totalSum is greater or equal

than R, then the current example is selected.

Constraints to generate individuals are given by the followingsettings:

• number of attributes that belong to rule represented by anindividual.

• number of attributes in the antecedents and consequents.

Data se

EvolutioAlgorith

Nº RulesMax_Rul

END

no

yes

punishment

Fig. 4. Scheme of the Iterative

• structure of the rule (attributes fixed or not fixed in consequent).

For a better understanding of the generation of initial population,we describe one example of generation of an individual following theTable 2. For each iteration one instance is selected based on their levelof hierarchy. In this case, we have assumed that the algorithm isstarting, that is, the first evolutionary iteration of the process, andall instances have the same probability to be selected.

Assuming that the instance t5 of Table 2 is randomly selected, thevalues of each attribute are: F1=33, F2=173 and F3=83. In order togenerate an individual (one rule), we have to randomly select thenumber of attributes appearing in the rule and the type and intervalof each attribute. We have supposed that the number of attributeschosen is 2 and the type for each attribute is 1 for F1, 0 for F2 and 2for F3. Then, we have to select a random number between 0 and amaximum amplitude (10%) for generating the intervals for eachattribute. The value obtained is added and subtracted to the valuecorresponding for each attribute to the instance selected (t5) inTable 2. For example: 5% for F1, 2% for F2 and 8% for F3. Therefore,the intervals of each attribute are [33±(0.05⋅33)] for F1, [173±(0.02⋅173)] for F2 and [83m(0.08 ⋅83)].

The individual generated is shown in Fig. 3 and the rule obtainedis represented as follows:

F1∈ 31:35;34:65½ �⇒F3∈ 76:36;89:64½ � ð9Þ

3.3.2. Fitness functions proposedThe fitness of each individual in the evolutionary algorithm allows

determining which are the best candidates to remain in subsequentgenerations. In order to make this decision, its calculation involvesseveral measures that provide information about the rules. In thiswork, two fitness functions have been designed to maximize differentobjectives depending on the desired rules. Both are formed by thecombination of different measures of association rules but theirgoals are different.

t

narym

Rule

>es

The bestIndividual

Rule Learning algorithm.

Page 6: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 3Ozone quartiles (DU) for each location.

Quartile Lisbon Madrid Murcia

1° [255.7, 291.2] [253.9, 291.1] [259.92, 293.03]2° [291.2, 326.7] [291.1, 328.35] [293.03, 326.15]3° [326.7, 362.2] [328.35, 365.6] [326.15, 359.26]4° [362.2, 397.7] [365.6, 402.8] [359.26, 392.38]

222 M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

As first fitness function of guide the evolutionary search, we pro-pose the following:

f ið Þ ¼ ws⋅supþwc⋅conf−wr⋅recov−wa⋅ampl ð10Þ

where sup is the support of the rule, conf is the confidence of the rule,recov is the number of recovered instances (it is used to indicatewhen a sample has already been covered by a previous rule, thus,rules covering different regions of search of space are preferred),ampl is the average size of intervals of the attributes belong to therule and ws, wc, wr and wa are weights in order to drive the processof rules searching. Note that this function takes into considerationthe support and the confidence of the rule. This function is usedwhen QAR with high support and confidence is desired. High valuesof ws imply that more samples from the database are covered andhigh values of wc imply rules with greater reliability, that is, ruleswith fewer errors.

Nevertheless, only the support is usually not enough to calculatethe fitness, because the algorithm would try to enlarge the amplitudeof the intervals until the whole domain of each attribute would becompleted to get a great support. For this reason, this fitness functionincludes a measure to limit the growth of the intervals during the

Table 4Association rules for TOC concentration at Lisbon. The “Code” of the rules describes the locatLisbon, medium TOC rule 2, and Lh1 stands for Lisbon, high TOC rule 1, and so on.

Code Rules TOC (DU)

Lm1 TPG[12.5,12.1] & t50[215.9,216.9] [332.4350.5]Lm2 TPW[11.5,11.2] & t50[214.6,215.7] & OLR[237.5,256.6] [353.2367.4]Lh1 TPG[11.4,10.6] & t50[212.7,216.8] & OLR[224.4,239.8] [349.0387.8]Lh2 t50[215.1,217.3] & OLR[231.5,257.1] & ω200[−5,8] [349.6,387.8]Lv1 t50[215.6,217.7] & OLR[231.5,245.4] [369.5,397.7]

Table 5Association rules for TOC concentration at Madrid. The “Code” of the rules describes the locafor Madrid, medium TOC rule 2, Mh1 stands for Madrid, high TOC rule 1, and so on.

Code Rules TOC (DU)

Mm1 TPG[14.1,12.5] [285.6,327.8]Mm2 TPG[12.4,11.9] & OLR[262.9,270.3] & t50[215.4,217.2] [329,347.5]Mh1 TPC[12.0,11.3] & t50[215.1,216.4] [334.1, 361]Mh2 TPC[11.5,10.8] & t50[214.4,216.1] [356.3,392.5]Mv1 TPG[11.2,10.6] & OLR[224.4,245.4] & t50[214.5,216.8] [368.1,402.8]

Table 6Association rules for TOC concentration at Murcia. The “Code” of the rules describes the locafor Murcia, medium TOC rule 2, Uh1 stands for Murcia, high TOC rule 1, and so on.

Code Rules TOC (DU)

Um1 TPE[11.5,10.7] and t50[212.5,213.4] and OLR[211.5,243.6] [341.9,359.7]Um2 TPG[11.8,11.4] and TPE[12.1,10.6] and OLR[250.6,256.6] [343.0,354.5]Uh1 TPG[11.2,10.6] and t50[214.6,216.5] and OLR[214.9,217.7] [343.7,387.5]Uv1 TPE[11.2,10.8] and t50[214.5,217.7] [359.3,392.4]

evolutionary process. In addition, this function is able to find rulesthat cover different regions of the search space because it also in-cludes a measure to negatively affect an instance that has alreadybeen covered by a previous rule.

However, this function is not entirely appropriate in some situa-tions because the confidence has some drawbacks. Specifically, confi-dence does not take into account the support of the rule consequenthence it is not able to detect negative dependencies between items.

For this reason, a new fitness function has been proposed as alter-native to the support and confidence measures. The second fitnessfunction to be maximized used by EQAR is given by the followingexpression:

f ið Þ ¼ wi⋅lift þwl⋅lev−wr⋅recov ð11Þ

where lift is the lift or interest of the rule and lev is the leverage of therule and wi, wl and wr are weights in order to drive the process ofsearch of rules.

This function considers lift and leverage measures instead of sup-port and confidence measures. This function is used when QARs witha high lift and high leverage are desired. High values of wi ensure adegree of dependence between antecedent and consequent. Thehigher this value, the more likely that the existence of antecedentand consequent together in an instance is not just a random occur-rence, but because there is some relationship or dependency betweenthem. High values of wl guarantee a certain minimum support. Thus,for leverage, values above 0 are desirable, whereas for lift, we wantto see values greater than 1. Note that leverage and lift measure sim-ilar things, except that leverage measures the proportion of additionalcases covered by both antecedent and consequent above thoseexpected if antecedent and consequent were independent of each

ion and TOC concentration, i.e., Lm1 stands for Lisbon, medium TOC rule 1, Lm2 stands for

Scores Fitness

Sup(%) Conf(%) Ampl(%) Lift Lev Acc(%)

5.2 100 10.0 6.0 0.04 88.5 Eq. (11)3.5 100 9.2 14.4 0.03 96.6 Eq. (11)

10.4 81.8 21.3 4.3 0.08 89.1 Eq. (10)10.4 62.1 22.0 3.4 0.07 85.8 Eq. (10)4.0 87.5 14.8 12.6 0.04 96.5 Eq. (11)

tion and TOC concentration, i.e.,Mm1 stands for Madrid, medium TOC rule 1,Mm2 stands

Scores Fitness

Sup(%) Conf(%) Ampl(%) Lift Lev Acc(%)

23.1 97.6 37.1 1.9 0.11 71.1 Eq. (10)5.8 100 11.2 5.8 0.05 88.4 Eq. (11)5.8 100 16.1 4.6 0.05 83.8 Eq. (11)6.9 100 20.6 6.2 0.06 90.8 Eq. (11)6.4 91.7 16.9 11.3 0.06 97.7 Eq. (11)

tion and TOC concentration, i.e., Um1 stands for Murcia, medium TOC rule 1, Um2 stands

Scores Fitness

Sup(%) Conf(%) Ampl(%) Lift Lev Acc(%)

4.6 100 15.7 6.9 0.04 90.1 Eq. (11)3.5 100 11.0 9.6 0.03 93.1 Eq. (11)8.7 100 22.6 4.7 0.07 87.4 Eq. (11)8.1 100 19.7 7.2 0.07 94.2 Eq. (11)

Page 7: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 7TOC variation obtained with different association rules for different meteorologicalvariables at Lisbon.

Met. variable Ratio Rule code

Lm1 Lm2 Lh1 Lh2 Lv1

With TPG DU/km 22.5 5.9 19.6 13 17.1With TPW DU/km 15.3 13.9 14.2 8.5 18With t50 DU/K 6.4 6.6 6.2 6.4 4.3

Table 8TOC variation obtained with different association rules for different meteorologicalvariables at Madrid.

Met. variable Ratio Rule code

Mm1 Mm2 Mh1 Mh2 Mv1

With TPG DU/km 13.7 13.8 23.0 21.4 10.8With TPC DU/km 13.4 9.3 22.2 21.4 11.1With t50 DU/K 9.0 9.1 5.7 6.6 6.0

Table 9TOC variation obtained with different association rules for different meteorologicalvariables at Murcia.

Met. variable Ratio Rule code

Um1 Um2 Uh1 Uv1

With TPG DU/km 9.6 8.7 14 14.7With TPE DU/km 9.3 25.1 13.7 18.7With t50 DU/K 9.4 6.6 6.9 6.3

223M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

other. Leverage is also included because lift is susceptible to noise insmall databases. Rare itemsets with low probability that per chanceoccur a few times (or only once) together can produce enormouslift values. In this function, the amplitude of intervals is not included

Fig. 5. Location of the different O3 observing stations considered in the validation pro

because leverage is inversely proportional to the size of the intervals.If leverage is maximized, we ensure that the intervals of attributes donot extend to the whole domain. This function also includes a mea-sure to negatively affect an instance that has already been coveredby a previous rule in order to find rules that cover different regionsof the search space.

In conclusion, the first fitness function corresponding to Eq. (10)should be used when rules covering many examples with a high de-gree of reliability are desired without interesting the degree of depen-dence between antecedent and consequent of the rule. Highconfidence and high support could imply interdependence betweenantecedent and consequent. While the second fitness function corre-sponding to Eq. (11) should be used when a rule with a high degree ofdependence between antecedent and consequent is desired regard-less of the number of instances covered by the rule. High lift andhigh leverage could imply low support.

4. Experimental results

In order to apply the methodology describe above, we have divid-ed TOC values of each location (Lisbon, Madrid and Murcia) into fourequal groups or quartiles, each representing a fourth of each TOC dataset, i.e., first quartile [0%, 25%], second [26%,50%], third [51%, 75%] andfourth [76%, 100%]. Following this idea, ozone quartiles for each loca-tion can be calculated for the period of study considered in this work(1979–1993), as shown in Table 3. Once TOC values have been divid-ed into the four quartiles, the following criteria to set different Ozoneconcentrations have been used:

• Medium ozone concentration: Rules which ozone values belong tothird quartile or a lower quartile.

• High ozone concentration: Rules which ozone values belong tothird and fourth quartiles.

• Very high ozone concentration: Rules which ozone values belongonly to fourth quartile.

cess of the results: 1. Madrid, 2. Arenosillo, 3. Lisbon, 4. Montlouis and 5. Murcia.

Page 8: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 10Accuracy of association rules for TOC concentration trained in Lisbon data and tested inLisbon, Murcia, Madrid, Arenosillo and Montlouis data separately.

Rule codeLisbon

Accuracy (%)

Lisbon Murcia Madrid Arenosillo Montlouis

Lm1 88.4 86.7 87.9 87.3 83.8Lm2 96.5 90.2 90.8 91.3 90.2Lh1 89.0 88.4 86.1 88.4 80.9Lh2 82.1 81.5 78.0 81.5 76.3Lv1 95.4 96.5 94.8 95.4 89.0

Table 11Accuracy of association rules for TOC concentration trained in Murcia and tested in Lis-bon, Murcia, Madrid, Arenosillo and Montlouis data separately.

RulecodeMurcia

Accuracy (%)

Lisbon Murcia Madrid Arenosillo Montlouis

Um1 85.5 90.2 86.7 87.3 83.2Um2 88.4 93.1 91.9 91.3 89.0Uh1 83.8 87.3 83.2 89.6 73.4Uv1 89.0 94.2 89.6 92.5 85.0

224 M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

Thus it is possible to calculate association rules for each locationand TOC range (medium, high and very high) and the consideredinput meteorological variables (given in Table 1).

Applying the EQAR described in Section 3.3, association rules forthe TOC1 and the considered input meteorological variables havebeen calculated. EQAR has been executed five times for the two fit-ness function represented by the Eqs. (10) and (11) considering dif-ferent TOC concentration (medium, high and very high) in eachdataset. The best QAR obtained, thus, the rules with support greaterthan 3% and accuracy greater than 70% have been examined by thegroup of expert authors in meteorological data. Tables 4 to 6 showthe results obtained for the three locations, where we have displayedthe rules selected by the expert authors from the best ones accordingto their meteorological relevance. The column Scores describes thevalues obtained for the different interestingness measures used toqualify the QAR (support, confidence, amplitude, lift, leverage and ac-curacy). The column Fitness indicates the number of equation used asfitness function that has been optimized to obtain each QARrespectively.

It can be shown that most of the QARs provide in these tables wereobtained by the second fitness function (Eq. (11)) which shows thatthe enhancement carried out in EQAR adding a new fitness functionto evaluate the individuals in the population provides better andmore relevant rules in terms of TOC concentration. Therefore the re-sults obtained by the second fitness function improve the resultsobtained by the first fitness function (Eq. (10)). This enhancementis due to the first fitness function that only optimizes confidenceand support, while the second fitness function optimizes the interestof the rules and the degree of dependence among the attributes be-longing to the antecedent and the consequent (TOC concentrationin this paper).

It can be appreciated that the scores of the quality measures of theQARs are very good in terms of the confidence and accuracy. Most of

Fig. 6. Accuracy of association rules for TOC concentration trained in Lisbon data a

them reaches values very close to 100% also the lift and leveragevalues are greater than 1 and 0 respectively, therefore, the rulesobtained present have high accuracy, reliability, and strong depen-dence among the attributes belonging to the antecedent and theconsequent.

It is interesting to observe that all the considered variables formpart of the association rules obtained, with some interesting peculiar-ities: variable ω200 is used to explain high and very high TOC concen-tration in Lisbon, but it does not appear in Murcia nor Madrid. Also inthe Murcia case the confidence score is always 100. Note also that, ingeneral, the confidence score is better for Madrid and Murcia than forLisbon.

In order to discuss the physical correctness of the obtained associ-ation rules, we will do a comparison of these rules with results in pre-vious studies. Several previous works have shown the quasi-linearrelation that exists between TOC and the meteorological variablestropopause height and temperature at 50 hPa [8,37]. Note that thisquasi-linear relationship cannot be found for OLR and ω200. Thus,in order to analyze how good association rules obtained are, ratiosDU/km and DU/K have been calculated to be compared against theresults obtained by other authors using similar data sources. InTables 7–9, ratios (in absolute value) for the different tropopauseheights (DU/km) and the temperature at 50 hPa (t50) (DU/K) areshowed. In each table we show values of TOC variation for two TPvariables (the global (TPG) and the TP variable calculated with a gridcentered in the point (TPW, TPC and TPE) for Lisbon, Madrid andMurcia, respectively).

In general, values for the four tropopause heights considered andtemperature at 50 hPa (t50) agree with results in different previousstudies. In the case of the tropopause height in [7,37] it is shownthat TOC values change approximately between 8 and 25 DU per1 km increase in tropopause height. Note that the results obtainedwith the proposed association rules also follow these range of TOC

nd tested in Lisbon, Murcia, Madrid, Arenosillo and Montlouis data separately.

Page 9: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Fig. 7. Accuracy of association rules for TOC concentration trained in Murcia data and tested in Lisbon, Murcia, Madrid, Arenosillo and Montlouis data separately.

Table 12Accuracy of association rules for TOC concentration trained in Madrid data and testedin Lisbon, Murcia, Madrid, Arenosillo and Montlouis data separately.

Rule codeMadrid

Accuracy (%)

Lisbon Murcia Madrid Arenosillo Montlouis

Mm1 65.9 67.6 71.1 64.2 72.3Mm2 80.9 81.5 83.8 85.0 81.5Mh1 86.7 85.0 88.4 86.7 86.7Mh2 90.2 91.9 90.8 90.2 87.9Mv1 94.2 96.0 97.1 95.4 93.6

225M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

variation against all the TP variables considered at each location, andfor all the rules (medium, high and very high TOC) considered. Theonly result out of this range in our rules is for Lisbon, medium TOCconcentration (rule Lm2), in which a value of 5.9 DU/km is found forTOC variation against variable TPG. For the case of TOC variationwith variable t50 the variation ratio obtained in other works such as[8,37] is that 10 DU change in TOC corresponds to a roughly 1 Kchange of t50. However, in other studies it is shown that these valuescan be quite affected by atmospheric variability, El Niño Southern Os-cillation (ENSO), Quasi-Bienial Oscillation (QBO) [58], and values ofTOC variation with t50 of 6, 12 or even 16 DU/K can be found at midlatitudes. Our results show that these thumb rule of 10 DU/K is verywell fulfilled in the TOC variation against t50 in Madrid (mediumTOC concentration) and in Murcia, mainly in the rule Um1. The restof the cases are not far away from these rule, showing a TOC variation

Fig. 8. Accuracy of association rules for TOC concentration trained in Madrid and

with t50 between 6 and 7 DU/K which also agrees with valuesobtained in other works.

5. Validation of the obtained results

This section describes the tests carried out to validate the resultsobtained by EQAR in the previous section. In order to confirm thatour model has no risk of over-fitting, the rules obtained by EQARhave been tested with six different datasets evaluating the accuracyof the rules. First, the rules obtained for each considered location(Lisbon, Madrid and Murcia) have been tested in the datasetscorresponding to five locations (Lisbon, Madrid, Murcia, Arenosilloand Montlouis) separately (Fig. 5 shows these locations, the 3 previ-ously considered and Montlouis and Arenosillo, newly added forthis validation study). In addition, the rules have been tested in adataset containing the TOC values of Arenosillo and Montlouiswhich consist of 346 instances in total.

Table 10 and Fig. 6 describe the accuracy values corresponding tothe rules obtained in the dataset of Lisbon as training data for eachlevel of TOC concentration (medium, high and very high) in the fivelocations as test data separately. Similarly, Table 11 and Fig. 7 showthe accuracy values obtained for the rules belonging to the datasetof Murcia. Finally, Table 12 and Fig. 8 indicate the accuracy valuesobtained for the rules corresponding to the dataset of Madrid. It canbe observed that the accuracy obtained for each rule discoveredwith the training datasets (Lisbon, Murcia and Madrid) are quitesimilar when they are applied to other dataset used as test data(Lisbon, Murcia, Madrid, Arenosillo and Montlouis). As can be seen,even in some cases there is greater value of accuracy on the test

tested in Lisbon, Murcia, Madrid, Arenosillo and Montlouis data separately.

Page 10: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

Table 13Accuracy of association rules trained in Lisbon, Murcia and Madrid and tested in a data-set containing all TOC concentration of Arenosillo and Montlouis.

Rule code Accuracy(%)

Rule code Accuracy(%)

Rule code Accuracy(%)

Medium High Very high

Lm1 85.5 Lh1 84.7 Lv1 92.2Lm2 90.8 Lh2 78.9Um1 85.3 Uh1 81.5 Uv1 88.7Um1 90.2Mm1 68.2 Mh1 83.2 Mv1 94.5Mm2 86.7 Mh2 89

Fig. 10. Accuracy of association rules for high TOC concentration trained in Lisbon,Murcia and Madrid and tested in a dataset containing all location (Lisbon, Murcia.Madrid, Arenosillo and Montlouis).

Fig. 11. Accuracy of association rules for very high TOC concentration trained in Lisbon,Murcia andMadrid and tested in a dataset containing all location (Lisbon,Murcia. Madrid,Arenosillo and Montlouis).

226 M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

data with respect to training data. Therefore the accuracy obtained isstable, since there are no distinct differences among the datasets,which indicates that there is not over-fitting among the rules learnedand datasets used as training data.

Accuracy values of QAR tested in a dataset containing the TOCconcentration of Arenosillo and Montlouis have been displayed inTable 13. Rules for each location used as training data have beenseparated by level of TOC concentration and accuracy values areshown graphically in Figs. 9 to 11. Fig. 9 represents the rulesdiscovered in Lisbon, Murcia and Madrid for medium TOC concentra-tion and their accuracy values obtained in the test dataset. Similarly,Fig. 10 describes the rules discovered in Lisbon, Murcia and Madridfor high TOC concentration and their accuracy values obtained inthe test dataset. Finally, Fig. 11 shows the rules discovered in Lisbon,Murcia and Madrid for very high TOC concentration and their accura-cy values obtained in the test dataset. These tests prove that the rulesobtained separately for a particular location are valid for locations an-alyzed together. The results show accuracy rates above 80% except inone case (Mm1), and over 90% in many cases.

After this validation study, we can conclude that there is no over-fitting among rules obtained and the dataset used as training data andwe can confirm that the EQAR approach has been really good in termsof the quality of the QAR found because the accuracy values are veryhigh (exceeding 80%) and are very similar in all datasets used as testdata.

6. Conclusions

In this paper we have described the application of the EQAR algo-rithm (Evolutionary Quantitative Association Rules) to a problemTotal Ozone Content (TOC) modeling in the Iberian Peninsula. Differentimprovements in the initial population generation and fitness functionhave been incorporated to EQAR in order to improve its performance inthis problem of TOC modeling. Experimental results have been carriedout in TOC data from the Total Ozone Mapping Spectrometer (TOMS)

Fig. 9. Accuracy of association rules for medium TOC concentration trained in Lisbon, MurcArenosillo and Montlouis).

measuring at three locations (Lisbon,Madrid andMurcia) of the IberianPeninsula. As prediction variables for the association rules we have con-sidered several meteorological variables, such as Outgoing Long-waveRadiation (OLR), Temperature at 50 hPa level, Tropopause height, andwind vertical velocity component at 200 hPa. The results obtainedwith the EQAR approach have been really good in terms of the qualityof the association rules found. Also, the analysis of these rules agreeswith the results obtained in other works dealing with TOC modeling,so we can conclude that the use of association rules in TOC modelingcould be an interesting analysismethod for the future in this and similarproblems.

ia and Madrid and tested in a dataset containing all location (Lisbon, Murcia. Madrid,

Page 11: Evolutionary association rules for total ozone content ...riquelme/publicaciones/chemo.pdf · Evolutionary association rules for total ozone content modeling from satellite observations

227M. Martínez-Ballesteros et al. / Chemometrics and Intelligent Laboratory Systems 109 (2011) 217–227

References

[1] S. Bronnimann, J. Luterbacher, C. Schmutz, H. Wanner, J. Staehelin, Variability oftotal ozone at Arosa, Switzerland, since 1931 related to atmospheric circulationindices, Geophysical Research Letters 27 (15) (2000) 2213–2216.

[2] B. Massart, O.M. Kvalheim, L. Stige, R. Aasheim, Ozone forecasting from meteoro-logical variables: part I. Predictive models by moving window and partial leastsquares regression, Chemometrics and Intelligent Laboratory Systems 42 (1–2)(1998) 179–190.

[3] B. Massart, O.M. Kvalheim, L. Stige, R. Aasheim, Ozone forecasting frommeteorolog-ical variables: part II. Daily maximum ground-level ozone concentration from localweather forecasts, Chemometrics and Intelligent Laboratory Systems 42 (1–2)(1998) 191–197.

[4] X. Jin, J. Li, C.C. Schmidt, T.J. Schmit, J. Li, Retrieval of total column ozone fromimagers onboard geostationary satellites, IEEE Transactions on Geoscience andRemote Sensing 46 (2) (2008) 479–488.

[5] M. Palacios, F. Kirchner, A. Martilli, A. Clappier, F. Martín, M.E. Rodríguez, Summerozone episodes in the Greater Madrid area. Analyzing the ozone response to abate-ment strategies by modelling, Atmospheric Environment 36 (2002) 5323–5333.

[6] J.W. Krzyscin, J.L. Borkowski, Variability of the total ozone trend over Europe forthe period 1950–2004 derived from reconstructed data, Atmospheric Chemistryand Physics 8 (2008) 2847–2857.

[7] W. Steinbrecht, U. Köhler, K.P. Hoinka, Correlation between tropopause heightand total ozone: implication for long-term trends, Journal of GeophysicalResearch 103 (1998) 19 183–19 192.

[8] W. Steinbrecht, B. Hassler, H. Claude, P. Winkler, R.S. Stolarski, Global distributionof total ozone and lower stratospheric temperature variations, AtmosphericChemistry and Physics 3 (2003) 1421–1438.

[9] M.A. Barrero, J.O. Grimalt, L. Cantón, Prediction of daily ozone concentrationmaxima in the urban atmosphere, Chemometrics and Intelligent LaboratorySystems 80 (1) (2006) 67–76.

[10] G. Christakos, A. Kolovos, M.L. Serre, F. Vukovich, Total ozone mapping by inte-grating databases from remote sensing instruments and empirical models, IEEETransactions on Geoscience and Remote Sensing 42 (5) (2004) 991–1008.

[11] M. Felipe-Sotelo, L. Gustems, I. Hernández, M. Terrado, R. Tauler, Investigation ofgeographical and temporal distribution of tropospheric ozone in Catalonia(North-East Spain) during the period 2000–2004 using multivariate data analysismethods, Atmospheric Environment 40 (2004) 7421–7436.

[12] P.J. Crutzen, The influence of nitrogen oxide on the atmospheric ozone content,Quarterly Journal of the Royal Meteorological Society 96 (1970) 320–327.

[13] P.J. Crutzen, Ozone production rates in an oxygen–hydrogen nitrogen oxideatmosphere, Journal of Geophysical Research 76 (1971) 7311–7327.

[14] R.S. Stolarski, R.J. Cicerone, Stratospheric chlorine: a possible sink for ozone,Canadian Journal of Chemistry 52 (1974) 1610–1615.

[15] M.J. Molina, F.S. Rowland, Stratospheric sink for chlorofluoromethanes: chlorineatom catalyzed destruction of ozone, Nature 249 (1974) 810–812.

[16] J.C. Farman, B.G. Gardiner, J.D. Shanklin, Large losses of total ozone in Antarcticareveal seasonal ClOx/NOx interaction, Nature 315 (1985) 207–210.

[17] S. Solomon, Stratospheric ozone depletion: a review of concepts and history,Reviews of Geophysics 37 (3) (1999) 275–316.

[18] United Nations Environment Programme, Environmental Effects Assessment Panel,Environmental effects of ozone depletion and its interactions with climate change:progress report 2005, Photochemical & Photobiological Sciences 5 (13) (2006).

[19] K. Bramstedt, J. Gleason, D. Loyola, W. Thomas, A. Bracher, M. Weber, J.P. Burrows,Comparison of total ozone from the satellite instruments GOME and TOMS withmeasurements from the Dobson network 1996–2000, Atmospheric Chemistryand Physics 3 (2003) 1409–1419.

[20] A.A. Silva, A quarter century of TOMS total column ozone measurements overBrazil, Journal of Atmospheric and Solar-Terrestrial Physics 69 (12) (2007)1447–1458.

[21] V. Savastiouk, C.T. McElroy, Brewer spectrophotometer total ozone measure-ments made during the 1998 middle atmosphere nitrogen trend assessment(MANTRA) Campaign, Atmosphere-Ocean 43 (4) (2005) 315–324.

[22] K.M. Latha, K.V. Badarinath, Impact of aerosols on total columnar ozone measure-ments a case study using satellite and ground-based instruments, AtmosphericResearch 66 (4) (2003) 307–313.

[23] Z. Xiangdong, Z. Xiuji, T. Jie, Q. Yu, C. Chuenyu, A meteorological analysis on a lowtropospheric ozone event over Xining, North Western China on 26–27 July 1996,Atmospheric Environment 38 (2) (2004) 261–271.

[24] A. Pérez, I. Aguirre de Cárcer, F. Jaque, Low ozone event atMadrid inNovember 1996,Journal of Atmospheric and Solar-Terrestrial Physics 64 (3) (2002) 283–289.

[25] C. Appenzeller, A.K. Weiss, J. Staehelin, North Atlantic oscillation modulates totalozone winter trends, Geophysical Research Letters 27 (8) (2000) 1131–1134.

[26] T.G. Shepherd, A.I. Jonsson, On the attribution of stratospheric ozone and temper-ature changes to changes in ozone-depleting substances and well-mixed green-house gases, Atmospheric Chemistry and Physics 8 (2008) 1435–1444.

[27] J. Han,M. Kamber, DataMining: Concepts and Techniques, Morgan Kaufmann, 2006.[28] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of

items in large databases, Proceedings of the 1993 ACM SIGMOD InternationalConference on Management of Data, 1993, pp. 207–216.

[29] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large data-bases, Proceedings of the International Conference on Very Large Databases,1994, pp. 478–499.

[30] M. Houtsma, A. Swami, Set-Oriented Mining for Association Rules in RelationalDatabases, IEEE Computer Society, 1995, pp. 25–33.

[31] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer-Verlag,2003.

[32] M. Martínez-Ballesteros, F. Martínez-Álvarez, A. Troncoso, J.C. Riquelme, Mining quan-titative association rules based on evoluationary computation and its application to at-mospheric pollution, Integrated Computer-Aided Engineering 17 (3) (2010) 227–242.

[33] R.D. McPeters, P.K. Bhartia, A.J. Krueger, J.R. Herman, B.M. Schlesinger, C.G.Wellemeyer, C.J. Seftor, G. Jaross, S.L. Taylor, T. Swissler, O. Torres, G. Labow, W.Byerly, R.P. Cebula, Nimbus-7 Total Ozone Mapping Spectrometer (TOMS) DataProducts User's Guide, NASA Reference Publication, 1996.

[34] M. Anton, J.M. Vilaplana, M. Kroon, A. Serrano, M. Parias, M.L. Cancillo, B.A. de laMorena, The empirically corrected EP-TOMS total ozone data against Brewermeasurements at El Arenosillo (Southwestern Spain), IEEE Transactions onGeoscience and Remote Sensing 48 (7) (2010) 3039–3045.

[35] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S.Saha, G. White, J. Woollen, Y. Zhu, A. Leetmaa, R. Reynolds, M. Chelliah, W.Ebisuzaki, W. Higgins, J. Janowiak, K.C. Mo, C. Ropelewski, J. Wang, R. Jenne, D.Joseph, The NCEP/NCAR reanalysis 40-year project, Bulletin of the AmericanMeteorological Society 77 (1996) 437–471.

[36] B. Liebmann, C.A. Smith, Description of a complete (Interpolated) outgoing long-wave radiation dataset, Bulletin of the American Meteorological Society 77(1996) 1275–1277.

[37] C. Varotsos, C. Cartalis, A. Vlamakis, C. Tzanis, I. Keramitsoglou, The long-termcoupling between column ozone and tropopause properties, Journal of Climate17 (2004) 3843–3854.

[38] K. Hoinka, H. Claude, U. Kohler, On the correlation between tropopause pressureand ozone above central Europe, Geophysical Research Letters 23 (1996)1753–1756.

[39] World Meteorological Organization, Meteorology — a three dimensional science,World Meteorological Organization Bulletin 6 (1957) 134–138.

[40] G. Zangl, K. Hoinka, The tropopause in the polar regions, Journal of Climate 14(14) (2001) 3117–3139.

[41] A. Rabbe, S.H. Larse, Ozone variations in the northern-hemisphere due to dynam-ic processes in the atmosphere, Journal of Atmospheric and Solar-TerrestrialPhysics 54 (9) (1992) 1107–1112.

[42] A. Rabbe, S.H. Larse, ‘On the low ozone values over Scandianvia during the winterof 1991–1992’, Journal of Atmospheric and Solar-Terrestrial Physics 57 (4)(1995) 367–373.

[43] K. Henriksen, V. Roldugin, Total ozone variations and meteorological processes,Atmospheric Ozone Dynamics, in: Costas Varotsos (Ed.), Series I: Global Environ-mental Change, vol. 53, Springer Verlag, 1997.

[44] P.W. Menzel, Applications with meteorological satellites, World MeteorologicalOrganization (WMO), Technical Document No. 1078, 2001.

[45] V. Williams, R. Toumi, The correlation between tropical total ozone and outgoinglong-wave radiation, Quarterly Journal of the Royal Meteorological Society 127(2001) 989–1003.

[46] B. Alatas, E. Akin, An efficient genetic algorithm for automated mining of bothpositive and negative quantitative association rules, Soft Computing 10 (3)(2006) 230–237.

[47] O. Berzal, I. Blanco, D. Sánchez, M. Vila, Ios press measuring the accuracy and in-terest of association rules: a new framework, , 2001.

[48] L. Geng, H. Hamilton, Interestingness measures for data mining: a survey, ACMComputing Surveys 38 (3) (2006) 9.

[49] S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing associa-tion rules to correlations, Proceedings of the 1997 ACM SIGMOD InternationalConference on Management of Data, vol. 26, no. 2, 1997, pp. 265–276.

[50] G. Piatetsky-Shapiro, Discovery, analysis and presentation of strong rules, Knowl-edge Discovery in Databases, 1991, pp. 229–248.

[51] J. Mata, J.L. Álvarez, J.C. Riquelme, Discovering numeric association rules via evo-lutionary algorithm, Lecture Notes in Artificial Intelligence 2336 (2002) 40–51.

[52] X. Yan, C. Zhang, S. Zhang, Genetic algorithm-based strategy for identifying asso-ciation rules without specifying actual minimum support, Expert Systems withApplications: An International Journal 36 (2) (2009) 3066–3076.

[53] B. Alatas, E. Akin, A. Karci, MODENAR: multi-objective differential evolution algo-rithm for mining numeric association rules, Applied Soft Computing 8 (1) (2008)646–656.

[54] J. Alcala-Fdez, N. Flugy-Pape, A. Bonarini, F. Herrera, Analysis of the effectivenessof the genetic algorithms based on extraction of association rules, FundamentaInformaticae 98 (1) (2010) 1001–1014.

[55] M. Martínez-Ballesteros, F. Martínez-Álvarez, A. Troncoso, J. Riquelme, Quantita-tive association rules applied to climatological time series forecasting, IntelligentData Engineering and Automated Learning — IDEAL 2009, ser. Lecture Notes inComputer Science, vol. 5788, 2009, pp. 284–291.

[56] L. Eshelman, The CHC Adaptative Search Algorithm: How to Have Safe Searchwhen Engaging in Nontraditional Genetic Recombination, Morgan Kaufmann,1991.

[57] G. Venturini, SIA: a supervised inductive algorithm with genetic search forlearning attribute based concepts, Proceedings of the European Conference onMachine Learning, 1993, pp. 280–296.

[58] W.J. Randel, J.B. Cobb, Coherent variation of monthly mean total ozone and lowerstratospheric temperature, Journal of Geophysical Research 99 (1994)5433–5447.