Meteorological data analysis using self-organizing maps

Tatiana Tambouratzis (1,*) and George Tambouratzis (2,†)

1 Department of Industrial Management & Technology, University of Piraeus, 107 Deligiorgi St., Piraeus 185 34, Greece
2 Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, Paradissos Amaroussiou 151 25, Athens, Greece

This paper is dedicated to the memory of our beloved father Dr. Professor Demetrius G. Tambouratzis, who lost the fight against Amyotrophic Lateral Sclerosis (Lou Gehrig's Disease, ALS) on June 14, 2004.

* Author to whom all correspondence should be addressed: tatiana@unipi.gr; tatianatambouratzis@gmail.com.
† e-mail: giorg t@ilsp.gr.

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 23, 735-759 (2008). © 2008 Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/int.20294

A data analysis task is described, which is focused on the clustering of high-dimensional meteorological data collected long term (more than 43 years) at 128 weather stations in Greece. The proposed hybrid method combines (a) the assignment of the stations to two-dimensional grids of nodes via self-organizing maps (SOMs) of various sizes and (b) statistical clustering of the SOM nodes. The areas resulting from the clustering have well-defined meteorological profiles; they are also described by distinct combinations of morphological and geographical characteristics, indicating that morphology and geographical location largely affect the meteorological measurements. The most salient data parameters per area, as well as over the entire map, are determined, whereby the parameters and parameter ranges that shape the various meteorological profiles are exposed. The classification of stations with missing and noise-contaminated meteorological measurements into their expected areas demonstrates the prediction capability and robustness of the proposed hybrid method.

1. INTRODUCTION

The analysis and clustering of high-dimensional data (i.e., data characterized by a large number of parameters) aim both at exposing the natural groups that exist in the dataset and at extracting the salient information that is inherent in the data. Analysis and clustering are affected by such factors as:

- Data parameter selection. Different parameter sets may generate distinct classification results. On the one hand, the repetition of parameters, or the occurrence of highly dependent parameters in the parameter set, may increase their saliency disproportionally over that of the other parameters in the set. On the other hand, the elimination of repeated or dependent parameters may distort the original parameter set and produce counterintuitive clustering and classification results.

- Uniform normalization. Depending on the distribution of the parameter values, uniform normalization of all parameters may place more emphasis on parameters with smaller ranges than on those with larger ranges. In such cases, the distance relations between the unscaled and the normalized data may be significantly modified,(a) whereby considerably different analysis and clustering results are obtained when working with the unscaled and the normalized datasets (a minimal numerical sketch follows this list).

(a) Distance modification becomes especially apparent for high-dimensional data, where it is unlikely that the ranges of all parameters are comparable.
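To make the normalization point concrete, here is a minimal sketch (not taken from the paper; the station vectors are invented and only three parameters are used) showing how min-max normalization can reverse nearest-neighbour relations when one parameter, such as precipitation, has a much wider range than the others.

```python
import numpy as np

# Illustrative (made-up) station vectors:
# [yearly mean temperature (°C), yearly mean cloud cover (1/8ths),
#  yearly mean precipitation (mm)].  Precipitation has by far the widest range.
stations = np.array([
    [15.0, 2.0, 400.0],   # station 0
    [18.0, 4.0, 410.0],   # station 1: similar rainfall, otherwise different climate
    [15.1, 2.1, 900.0],   # station 2: similar climate, much wetter
])

def min_max_normalize(data):
    """Uniformly rescale every parameter (column) to the [0, 1] interval."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo)

def nearest_to_first(data):
    """Index of the station closest (Euclidean distance) to station 0."""
    dists = np.linalg.norm(data - data[0], axis=1)
    return int(dists[1:].argmin()) + 1

print(nearest_to_first(stations))                      # 1: raw precipitation dominates
print(nearest_to_first(min_max_normalize(stations)))   # 2: all parameters weigh equally
```

On the raw values the precipitation difference dominates the distances, whereas after uniform normalization all three parameters contribute comparably, which is exactly the effect discussed in the bullet above.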
The task studied in this piece of research involves the analysis of (a) the original 28-dimensional and (b) the reduced 20-dimensional(b) meteorological data collected long term (over a period of 43 years) at 130 weather stations in Greece. A hybrid method has been employed: following the assignment of the stations to two-dimensional self-organizing maps (SOM)1 of various sizes and the selection of the maps most capable of preserving the topology of the dataset, statistical-based clustering2,3 has been employed for partitioning the SOM nodes into clusters.4,5 The hybrid method has been found effective at clustering the stations and partitioning Greece into areas such that stations in the same area have similar meteorological profiles, whereas stations classified in different areas have distinct meteorological profiles. Especially when working with the original 28-dimensional data, the areas are described by distinct morphological and geographical characteristics, thus indicating that morphology and geographical location largely affect the meteorological measurements.

(b) The reduced dataset has been created after the elimination of the highly dependent parameters.

The most salient data parameters for classification have been uncovered by determining the parameters whose values vary in accordance with the emergent order of the SOM; concurrently, the most salient parameters per area have been established by their ability, in terms of parameter values, to distinguish the area of interest from the other areas. The effects of parameter selection, uniform normalization, and map size have been investigated and evaluated. The successful classification of stations with missing and noise-contaminated meteorological measurements into their expected areas demonstrates the prediction capability and robustness of the proposed hybrid method.

This paper is organized as follows: the principles and properties of the SOM are presented in Section 2; the data employed for the analysis task are described in Section 3; the results generated during the analysis of the dataset are given in Section 4; Section 5 concludes the paper.

2. THE SELF-ORGANIZING MAP

2.1. SOM Structure

The processes of competition, self-organization, and emerging order abound in biological neural networks: competitive (winner-take-all) groups of neurons become organized during the presentation of stimuli in such a manner that neighboring neurons are sensitized to similar stimuli, while increasingly remote neurons respond to increasingly different stimuli. Two of the most well-known examples of emerging order and self-organization via neuron competition are the motor and sensory homunculi.6 These constitute topographically mapped human bodies whose head, torso, limbs, and fingers are projected to the tiniest detail on the motor and sensory cortex; the projection is not proportional to the actual size of the body part but to the precision with which it must be controlled. Stimulation of a given body part is transferred as activation of the corresponding part of the sensory homunculus, whereas activation of a given part of the motor homunculus is transformed into motion of the corresponding body part.

Inspired by such biological neural networks, the SOM1 consists of a set of nodes organized into a regular (one- or, most frequently, two-dimensional) structure. The SOM self-organization process is based on an unsupervised adaptation law that generates a global ordering of the input patterns through competition between the SOM nodes. Prior to training, the codebook vectors (weights) of the nodes are initialized either randomly or with predefined values that introduce a partial order in the map. During training, each input pattern is normalized and, subsequently, presented to the SOM. The winner node (the node whose codebook vector best matches the input pattern) and its neighboring nodes are subjected to the following adaptation rule:

    m_i(t + 1) = m_i(t) + a(t) [x(t) - m_i(t)],   i ∈ N_c(t)
    m_i(t + 1) = m_i(t),                          i ∉ N_c(t)        (1)

where m_i(t) is the codebook vector of node i at time t, a(t) the gain factor at time t, x(t) the input pattern at time t, and N_c(t) a function denoting the nodes that belong to the neighborhood of the winner node at time t. Both N_c(t) and a(t) decrease as training progresses, in effect dividing the SOM training into two phases:

- Rough training. A gradual ordering of the nodes is achieved, with each neighborhood comprising several nodes.

- Fine-tuning. The codebook vectors become fine-tuned to their optimal values, with each neighborhood comprising a very small number of nodes.

The shrinking neighborhoods promote emergent order, with neighboring nodes learning to respond alike to each input pattern and increasingly distant nodes learning to respond to progressively dissimilar input patterns. In fact, following a sufficient amount of training, the codebook vectors of neighboring nodes are similar, whereas those of increasingly distant nodes are gradually more dissimilar.
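As an illustration of this two-phase training, the following sketch implements the adaptation rule of equation (1) on a rectangular grid with a shrinking neighborhood and a decaying gain factor. The grid size, the linear decay schedules, the random initialization, and the iteration count are illustrative assumptions rather than the settings used in the paper, and the input data are assumed to have been normalized beforehand.

```python
import numpy as np

def train_som(data, rows=10, cols=10, n_iter=5000, a0=0.5, seed=0):
    """Train a 2-D SOM with the update rule of equation (1).

    Nodes inside the (shrinking) neighbourhood N_c(t) of the winner are moved
    towards the input pattern; all other nodes are left unchanged.
    """
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    codebook = rng.random((rows, cols, dim))           # random initialization
    # Grid coordinates of every node, used to decide neighbourhood membership.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)

    for t in range(n_iter):
        frac = t / n_iter
        a = a0 * (1.0 - frac)                          # decaying gain a(t)
        radius = max(rows, cols) / 2 * (1.0 - frac)    # shrinking N_c(t)

        x = data[rng.integers(len(data))]              # x(t): random training pattern
        # Winner node: the codebook vector closest to the input pattern.
        dists = np.linalg.norm(codebook - x, axis=-1)
        winner = np.unravel_index(dists.argmin(), dists.shape)

        # Nodes whose grid distance to the winner lies within the radius.
        in_hood = np.linalg.norm(grid - np.array(winner), axis=-1) <= radius
        # Equation (1): m_i(t+1) = m_i(t) + a(t) [x(t) - m_i(t)] for i in N_c(t).
        codebook[in_hood] += a * (x - codebook[in_hood])

    return codebook
```

With the 128 training stations stacked as rows of a normalized data matrix, train_som would return the grid of codebook vectors on which the U-matrix and the clustering described in Section 2.2 operate.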
2.2. SOM Properties

An important property of the SOM is its topology preservation capability, which is derived from the ordering of the codebook vectors during training (Section 2.1). Topology preservation is responsible for the pattern classification property of the SOM: the map can be partitioned into classes of nodes, where each class is characterized by specific combinations of parameter values/ranges and defines a distinct subset of the dataset. Following class creation, novel input patterns are classified according to the corresponding winner node: the class to which the winner node is assigned constitutes the class to which the input pattern belongs.

Extensive studies of the SOM have resulted in various metrics and similarity criteria that express the successful formation of topology-preserving mappings.7,8 The unified distance matrix (U-matrix) constitutes the standard way of visualizing the distances between neighboring nodes and partitioning the map into classes:1,9 a small value in the U-matrix denotes a small distance between neighboring codebook vectors and thus supports the placement of the corresponding nodes in the same class; by contrast, a large value in the U-matrix denotes a large distance between neighboring codebook vectors and thus suggests a class border between the corresponding nodes.
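A simple sketch of how such a matrix can be computed from the trained codebook is given below; here each node is assigned the mean distance to its 4-connected grid neighbours, which is one common U-matrix variant and not necessarily the exact formulation used by the authors.

```python
import numpy as np

def u_matrix(codebook):
    """Mean distance from each node's codebook vector to those of its
    4-connected grid neighbours; large values suggest class borders."""
    rows, cols, _ = codebook.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            u[r, c] = np.mean([np.linalg.norm(codebook[r, c] - codebook[nr, nc])
                               for nr, nc in neighbours])
    return u
```

High-valued ridges in the resulting matrix then mark the class borders mentioned above, while low-valued plateaus group nodes into the same class.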
Existing methodologies for class creation include:

- Observation. Each class comprises nodes with sufficiently similar codebook vectors (sufficiently small U-matrix values).

- Hybrid statistical clustering.5 Following inspection of the U-matrix, the nodes are grouped into classes via hierarchical k-means clustering.3

- Three-step hybrid clustering.4 Initially, an oversized SOM is trained with the data and the resulting codebook vectors are clustered into n classes via the k-nearest neighbor method.3 Subsequently, a SOM with n nodes is trained with the data; each of the resulting n classes comprises all the patterns of the dataset that are assigned to the same node.

Another appealing (though largely unexplored) property of the SOM is its ability to indicate the important parameters of the dataset. This property has been investigated here, resulting in:

a. The most salient parameters for shaping the SOM. A parameter that is ordered in the trained SOM is assumed to influence the emergent order and self-organization of the map, a fact that is indicative of its saliency.1 By contrast, a parameter that is not ordered in the trained SOM points toward a lack of significance in shaping the map.

b. The parameters that characterize the different clusters. Parameters whose values, for a given class, are distinct from the corresponding values for the other classes are assumed to be important for classification, especially for pinpointing the differences between classes.

3. THE METEOROLOGICAL DATA

The National Meteorological Service (EMY) of Greece maintains a network of 130 stations covering the Greek territory (shown in Figure 1). The objective of this network is to study the weather patterns of different locations and provide the foundation for weather forecasts. The released data,10 which has been employed for the present analysis and clustering task, includes the 28 parameters described in Table I; each parameter comprises a single value equaling the numeric average over 43 years of collection (from 1955 to 1997). Owing to extensive averaging (over the measuring interval to either daily, monthly, or yearly averages and, subsequently, to the pooled averages over the 43 years), the dataset is assumed to be practically noise-free.

[Figure 1. Geographical map of Greece with the locations of the 130 EMY stations; the circles demarcate the two stations with missing parameter values.]

Two of the 130 stations (stations 679 and 739 in Figure 1) have some of their parameter values missing (two and four values, respectively). Together with the novel data that have been generated with missing and noise-contaminated parameter values, these stations have been retained for testing the SOM; the data from the other 128 stations has been employed exclusively for training.
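The paper does not spell out, in the portion reproduced here, how a pattern with missing parameter values is matched to the trained map; a common approach, shown below purely as an assumed sketch, is to restrict the winner-node search to the parameters that are actually present. The name node_to_area stands for whatever node-to-cluster assignment the clustering step has produced.

```python
import numpy as np

def classify_station(pattern, codebook, node_to_area):
    """Assign a (possibly incomplete) station vector to an area.

    pattern      : 1-D array with np.nan marking missing parameters
    codebook     : (rows, cols, dim) array of trained codebook vectors
    node_to_area : (rows, cols) array mapping each SOM node to its area/cluster
    """
    observed = ~np.isnan(pattern)                  # mask of available parameters
    # Distance to every node, computed only over the observed parameters.
    diff = codebook[..., observed] - pattern[observed]
    dists = np.linalg.norm(diff, axis=-1)
    winner = np.unravel_index(dists.argmin(), dists.shape)
    return node_to_area[winner]
```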
Parameter selection has been performed in such a manner as to:

- eliminate highly correlated (i.e., repeated or dependent) parameters of the original parameter set that collectively describe the same natural phenomenon. Elimination reverses the apparent increase in saliency of such natural phenomena over others that are described by a single parameter;

- unless highly correlated with other parameters, retain at least one parameter from each natural phenomenon. Meteorological pa...

Table I. The 28 data parameters measured at the 130 EMY stations, accompanied by their ranges prior to normalization. The highlighted rows denote the repeated/dependent parameters that have been eliminated in the creation of the reduced dataset.

 #   Meteorological parameter                                                          Range
 1   Yearly mean(ambient temperature) (°C)                                               8.6
 2   Yearly average of daily max(ambient temperature) (°C)                               7.4
 3   Yearly average of daily min(ambient temperature) (°C)                              13.1
 4   Yearly max(ambient temperature) (°C)                                               13
 5   Yearly min(ambient temperature) (°C)                                               24.6
 6   Yearly average of monthly max(ambient temperature) (°C)                             7
 7   Yearly average of monthly min(ambient temperature) (°C)                            14.7
 8   Yearly mean(relative humidity) (%)                                                 21
 9   Yearly mean(cloud cover) (1/8th)                                                    2.6
10   Yearly mean(precipitation) (mm)                                                  1520.2
11   Yearly max(precipitation) (mm)                                                    251.8
12   Yearly mean(wind speed) (km/h)                                                     13.2
13   Yearly number of days with cloud cover in [0, 1.5]/8ths                           179.3
14   Yearly number of days with cloud cover in [1.6, 6.4]/8ths                         139.1
15   Yearly number of days with cloud cover in [6.5, 8]/8ths                           102.5
16   Yearly number of days with showers                                                107
17   Yearly number of days with rain                                                    90.4
18   Yearly number of days with snow                                                    27.1
19   Yearly number of days with thunderstorms                                           56.7
20   Yearly number of days with hail                                                     8.4
21   Yearly number of days with snow-covered ground                                     36.7
22   Yearly number of days with fog                                                     50.1
23   Yearly number of days with dew                                                    137.8
24   Yearly number of days with rime                                                    84.7
25   Yearly number of days with partial ground frost (min(ambient temperature) ≤ 0)    117.4
26   Yearly number of days with total ground frost (max(ambient temperature) ≤ 0)       12.2
27   Yearly number of days with max(wind speed) ≥ 6 Bf                                 124.7
28   Yearly number of days with max(wind speed) ≥ 8 Bf                                  26.1
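The precise criterion by which the repeated/dependent parameters of Table I were identified is not stated above; as a rough, purely illustrative sketch of the kind of correlation-based screening described (the 0.9 threshold and the greedy keep-the-first-parameter strategy are assumptions, not the authors' procedure), such a reduction could look as follows.

```python
import numpy as np

def drop_dependent_parameters(data, threshold=0.9):
    """Greedily keep the first of any group of highly correlated columns.

    data      : (n_stations, n_parameters) array, e.g. 130 x 28
    threshold : absolute Pearson correlation above which a parameter is
                treated as repeated/dependent and dropped
    Returns the column indices of the retained parameters.
    """
    corr = np.abs(np.corrcoef(data, rowvar=False))   # parameter-by-parameter matrix
    kept = []
    for j in range(data.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
        # otherwise parameter j duplicates an already-kept one and is dropped
    return kept
```

Applied to the 28-parameter data, such a screening would yield a reduced parameter set analogous to the 20-dimensional dataset, although the selection reported in the paper also takes the underlying natural phenomena into account.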