Meteorological Data Analysis Using Self-Organizing Maps
Tatiana Tambouratzis¹, George Tambouratzis²
¹ Department of Industrial Management & Technology, University of Piraeus, 107 Deligiorgi St., Piraeus 185 34, Greece
² Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, Paradissos Amaroussiou 151 25, Athens, Greece
A data analysis task is described, which is focused on the clustering of high-dimensional meteorological data collected long term (more than 43 years) at 128 weather stations in Greece. The proposed hybrid method combines (a) the assignment of the stations to two-dimensional grids of nodes via self-organizing maps (SOMs) of various sizes and (b) statistical clustering of the SOM nodes. The areas resulting from clustering have well-defined meteorological profiles; they are also described by distinct combinations of morphological and geographical characteristics, indicating that morphology and geographical location largely affect the meteorological measurements. The most salient data parameters per area as well as over the entire map are determined, whereby the parameters and parameter ranges that shape the various meteorological profiles are exposed. The classification of stations with missing and noise-contaminated meteorological measurements into their expected areas demonstrates the prediction capability and robustness of the proposed hybrid method. © 2008 Wiley Periodicals, Inc.
This paper is dedicated to the memory of our beloved father, Dr. Professor Demetrius G. Tambouratzis, who lost the fight against Amyotrophic Lateral Sclerosis (ALS, Lou Gehrig's Disease) on June 14, 2004.
Author to whom all correspondence should be addressed: firstname.lastname@example.org; email@example.com.
e-mail: giorg firstname.lastname@example.org.
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 23, 735–759 (2008). © 2008 Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/int.20294
1. INTRODUCTION
The analysis and clustering of high-dimensional data (i.e., data characterized by a large number of parameters) aim both at exposing the natural groups that exist in the dataset and at extracting the salient information that is inherent in the data. Analysis and clustering are affected by such factors as
Data parameter selection. Different parameter sets may generate distinct classification results. On the one hand, the repetition of parameters or the occurrence of highly dependent parameters in the parameter set may increase their saliency disproportionately over that of the other parameters in the parameter set. On the other hand, the elimination of repeated or dependent parameters may distort the original parameter set and produce counterintuitive clustering and classification results.
Uniform normalization. Depending on the distribution of the parameter values, uniform normalization of all parameters may place more emphasis on parameters with smaller ranges over those with larger ranges. In such cases, the distance relations between unscaled and normalized data may be significantly modified,ᵃ whereby considerably different analysis and clustering results are obtained when working with the unscaled and normalized datasets.
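The effect of uniform normalization on distance relations can be seen in a small numerical sketch. The four stations and their two parameters below are invented purely for illustration (they are not taken from the dataset), and per-parameter min-max scaling to [0, 1] is assumed as the form of uniform normalization.

```python
import numpy as np

def min_max_normalize(data):
    """Scale every column (parameter) of `data` independently to [0, 1]."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (data - lo) / span

def nearest_neighbor(data, index):
    """Return the index of the row closest (Euclidean) to row `index`."""
    dists = np.linalg.norm(data - data[index], axis=1)
    dists[index] = np.inf  # exclude the row itself
    return int(np.argmin(dists))

# Hypothetical stations (illustrative values only):
# column 0 = annual rainfall in mm (large range),
# column 1 = mean temperature in deg C (small range).
stations = np.array([
    [400.0, 10.0],   # station 0
    [420.0, 20.0],   # station 1
    [450.0, 10.5],   # station 2
    [800.0, 12.0],   # station 3
])

raw_nn = nearest_neighbor(stations, 0)
scaled_nn = nearest_neighbor(min_max_normalize(stations), 0)
print("nearest to station 0, unscaled:  ", raw_nn)
print("nearest to station 0, normalized:", scaled_nn)
```

On the unscaled data the large-range rainfall column dominates the distance, whereas after normalization the small-range temperature column carries comparable weight, so the nearest neighbor of station 0 changes.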
The task studied in this work involves the analysis of (a) the original 28-dimensional and (b) the reduced 20-dimensionalᵇ meteorological data collected long term (over a period of 43 years) at 130 weather stations in Greece. A hybrid method has been employed: following the assignment of the stations to two-dimensional self-organizing maps (SOMs)¹ of various sizes and the selection of the maps most capable of preserving the topology of the dataset, statistical clustering²,³ has been used to partition the SOM nodes into clusters.⁴,⁵ The hybrid method has been found effective at clustering the stations and partitioning Greece into areas such that stations in the same area have similar meteorological profiles, whereas stations classified in different areas have distinct meteorological profiles. Especially when working with the original 28-dimensional data, the areas are described by distinct morphological and geographical characteristics, thus indicating that morphology and geographical location largely affect the meteorological measurements.
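As a rough illustration of the second stage of the hybrid method, the sketch below partitions SOM codebook vectors into clusters and then places each station in the area of its winner node. The text above does not name the statistical clustering procedure, so plain k-means is used here purely as a stand-in; the function names, the toy codebook, the toy stations, and the choice k = 2 are all assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (a stand-in for the statistical clustering step)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every point to every center, shape (n_points, k).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters unchanged
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def station_areas(stations, codebook, node_labels):
    """Assign each station the cluster (area) of its winner node."""
    d = np.linalg.norm(stations[:, None, :] - codebook[None, :, :], axis=2)
    return node_labels[d.argmin(axis=1)]

# Toy codebook of an already trained SOM (hypothetical values).
codebook = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],
])
node_labels = kmeans(codebook, k=2)

# Two hypothetical stations, one near each group of nodes.
stations = np.array([[0.05, 0.05], [0.95, 0.95]])
areas = station_areas(stations, codebook, node_labels)
print("node clusters:", node_labels)
print("station areas:", areas)
```

Clustering the nodes rather than the stations themselves means that the number of areas can be chosen after the map has been trained, without retraining the SOM.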
The most salient data parameters for classification have been uncovered by determining the parameters whose values vary in accordance with the emergent order of the SOM; concurrently, the most salient parameters per area have been established by their ability, in terms of parameter values, to distinguish the area of interest from the other areas. The effects of parameter selection, uniform normalization, and map size have been investigated and evaluated. The successful classification of stations with missing and noise-contaminated meteorological measurements into their expected areas demonstrates the prediction capability and robustness of the proposed hybrid method.
This paper is organized as follows: the principles and properties of the SOM are presented in Section 2; the data employed for the analysis task are described in Section 3; the results generated during the analysis of the dataset are given in Section 4; Section 5 concludes the paper.
2. THE SELF-ORGANIZING MAP
2.1. SOM Structure
The processes of competition, self-organization, and emerging order abound in biological neural networks: competitive (winner-take-all) groups of neurons become organized during the presentation of stimuli in such a manner that neighboring neurons are sensitized to similar stimuli while increasingly remote neurons respond to increasingly different stimuli. Two of the most well-known examples of emerging order and self-organization via neuron competition are the motor and sensory homunculi.⁶ These constitute topographically mapped human bodies whose head, torso, limbs, and fingers are projected to the tiniest detail on the motor and sensory cortex; projection is not proportional to the actual size of the body part but to the precision with which it must be controlled. Stimulation of a given body part is transferred as activation of the corresponding part of the sensory homunculus, whereas activation of a given part of the motor homunculus is transformed into motion of the corresponding body part.
ᵃ Distance modification becomes especially apparent for high-dimensional data, where it is unlikely that the ranges of all parameters are comparable.
ᵇ The reduced dataset has been created after the elimination of the highly dependent parameters.
Inspired by such biological neural networks, the SOM¹ consists of a set of nodes organized into a regular (one- or, most frequently, two-dimensional) structure. The SOM self-organization process is based on an unsupervised adaptation law that generates a global ordering of the input patterns through competition between the SOM nodes. Prior to training, the codebook vectors (weights) of the nodes are initialized either randomly or with predefined values that introduce a partial order in the map. During training, each input pattern is normalized and, subsequently, presented to the SOM. The winner node (the node whose codebook vector best matches the input pattern) and its neighboring nodes are subjected to the following adaptation rule:
\[
\begin{aligned}
m_i(t+1) &= m_i(t) + a(t)\,[x(t) - m_i(t)], &\quad i &\in N_c(t) \\
m_i(t+1) &= m_i(t), &\quad i &\notin N_c(t)
\end{aligned}
\tag{1}
\]
where m_i(t) is the codebook vector of node i at time t, a(t) the gain factor at time t, x(t) the input pattern at time t, and N_c(t) a function denoting the nodes that belong to the neighborhood of the winner node at time t. Both N_c(t) and a(t) decrease as training progresses, in effect dividing the SOM training into two phases:
Rough training. A gradual ordering of the nodes is achieved, with each neighborhood comprising several nodes.
Fine-tuning. The codebook vectors become fine-tuned to their optimal values, with each neighborhood comprising a very small number of nodes.
The shrinking neighborhoods promote emergent order, with neighboring nodes learning to respond alike to each input pattern and increasingly distant nodes learning to respond to progressively dissimilar input patterns. In fact, following a sufficient amount of training, the codebook vectors of neighboring nodes are similar, whereas those of increasingly distant nodes are gradually more dissimilar.
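The adaptation rule of Equation (1), together with the shrinking neighborhood and gain factor, can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the linear decay of a(t) and of the neighborhood radius, the circular neighborhood on a rectangular grid, the random initialization, and all names and default values are assumptions, and the input patterns are taken to be already normalized to [0, 1].

```python
import numpy as np

def train_som(data, rows, cols, n_steps=2000, a0=0.5, radius0=None, seed=0):
    """Train a rows x cols SOM on `data` (n_patterns x n_params) using Eq. (1)."""
    rng = np.random.default_rng(seed)
    n_params = data.shape[1]
    # Random initialization of the codebook vectors m_i(0).
    codebook = rng.random((rows * cols, n_params))
    # Grid position (row, column) of every node, used to define N_c(t).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0

    for t in range(n_steps):
        frac = t / n_steps
        a_t = a0 * (1.0 - frac)                      # gain factor a(t), assumed linear decay
        radius_t = max(radius0 * (1.0 - frac), 0.5)  # neighborhood radius, assumed linear decay
        x = data[rng.integers(len(data))]            # input pattern x(t)

        # Winner node: codebook vector that best matches x(t).
        winner = np.argmin(np.linalg.norm(codebook - x, axis=1))
        # N_c(t): nodes within radius_t of the winner on the grid.
        in_nc = np.linalg.norm(grid - grid[winner], axis=1) <= radius_t

        # Eq. (1): nodes in N_c(t) move toward x(t); all others stay unchanged.
        codebook[in_nc] += a_t * (x - codebook[in_nc])
    return codebook, grid

# Usage on synthetic, already normalized data (illustrative only).
demo_data = np.random.default_rng(1).random((200, 3))
demo_codebook, demo_grid = train_som(demo_data, rows=4, cols=4, n_steps=500)
print(demo_codebook.shape)
```

With these schedules the early steps, where the radius covers several nodes, correspond to rough training, and the late steps, where the radius has shrunk so that only the winner is updated, correspond to fine-tuning.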
2.2. SOM Properties
An important property of the SOM is its topology preservation capability, which is derived from the ordering of the codebook vectors during training (Section 2.1). Topology preservation is responsible for the pattern classification property of the SOM: the map can be partitioned into classes of nodes, where each class is characterized by specific combinations of parameter values/ranges and defines a distinct subset of the dataset. Following class creation, novel input patterns are classified according to the corresponding winner node: the class to which the winner node is assigned constitutes the class to which the input pattern belongs.
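The classification of a novel input pattern through its winner node can be sketched as follows. The codebook vectors, the class labels "A" and "B", and the function names are hypothetical and serve only to illustrate the lookup; Euclidean distance is assumed as the matching criterion.

```python
import numpy as np

def winner_node(codebook, x):
    """Index of the node whose codebook vector is closest (Euclidean) to x."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def classify(codebook, node_class, x):
    """Class of a novel pattern x = class assigned to its winner node."""
    return node_class[winner_node(codebook, x)]

# Hypothetical trained map with four nodes, already partitioned into two classes.
codebook = np.array([
    [0.1, 0.1],
    [0.2, 0.1],
    [0.8, 0.9],
    [0.9, 0.9],
])
node_class = ["A", "A", "B", "B"]

label_1 = classify(codebook, node_class, np.array([0.75, 0.80]))
label_2 = classify(codebook, node_class, np.array([0.12, 0.20]))
print(label_1, label_2)
```

Because the class is read off the winner node, no retraining or reclustering is needed when a new pattern arrives.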
Extensive studies of the SOM have resulted in various metrics and similarity criteria that express the successful formation of topology-preserving mappings.⁷,⁸ The unified distance matrix (U-matrix) constitutes the standard way of visualizing the distances between neighboring nodes and partitioning the map into classes:¹,⁹ a small value in the U-matrix denotes a small distance between neighboring codebook vectors and thus supports the placement of the corresponding nodes in the same class; by contrast, a large value in the