
Identification of source components in multivariate time series by state space modelling

Andreas Galka 1,2,*, Kin Foon Kevin Wong 3, Ulrich Stephani 4, Hiltrud Muhle 4 and Tohru Ozaki 1

1 Institute of Statistical Mathematics (ISM), Minami-Azabu 4-6-7, Tokyo 106-8569, Japan
2 Institute of Experimental and Applied Physics, University of Kiel, 24098 Kiel, Germany
3 Graduate University of Advanced Studies, Minami-Azabu 4-6-7, Tokyo 106-8569, Japan
4 Clinic for Neuropediatrics, University of Kiel, Schwanenweg 20, 24105 Kiel, Germany

ISM Research Memorandum No. 981 (2006)

Abstract

In this paper we study the application of classical methods for dynamical modelling of time series to the task of decomposing multivariate time series into approximately independent source components, a task that has traditionally been addressed by Factor Analysis (FA) and more recently by Independent Component Analysis (ICA). Based on maximum-likelihood fitting of linear state space models we develop a new framework for this task, within which many of the limitations of standard ICA algorithms can be relieved. Through comparison of likelihood, or, more precisely, of the Akaike Information Criterion, it is demonstrated that dynamical modelling provides a considerably better description of given data than FA and non-dynamical ICA. The comparison is applied to both simulated and real-world time series, the latter being given by an electrocardiogram and an electroencephalogram.

Keywords: time series analysis, Kalman filtering, multivariate autoregressive modelling, independent component analysis, whitening, innovation approach

1 Introduction

In many fields of contemporary scientific research multivariate data sets are recorded in large quantities, either from experiments or from field observations; such data will typically require further processing, in order to reduce their size and dimensionality, to extract relevant information and to characterise and classify the underlying physical systems in a meaningful way. If the data consist of multivariate time series, dynamical relationships within the system can be investigated; the ideal goal would consist of identifying the true differential equations governing the dynamics, but this goal may be infeasible for many of the highly complicated systems studied in disciplines such as biology, medicine or econometrics. Still, data from such systems may be accessible to less ambitious tasks, such as filtering, prediction and control, as they arise in many application fields [8], and numerous methods for such purposes have been developed.

Recently, considerable attention has been devoted to a particular filtering task, sometimes known as Blind Signal Separation (BSS), which is based on the assumption that a set of mutually independent source components exists, from which the data was generated by some mixing process [20]. A number of algorithms for estimating these source components, as well as the parameters of the mixing process, from given data have been developed during the last decade; these algorithms are now generally subsumed under the denotation of Independent Component Analysis (ICA) [10, 12, 13, 14, 20, 24, 29, 34]. In ICA two properties of the source components (i.e. the independent components) are assumed: mutual independence and non-Gaussianity (at most one Gaussian component is permitted); furthermore, in most (but not all) cases it is assumed that the number of source components is not larger than the dimension of the data, and that measurement noise is negligible. Many ICA algorithms proposed so far are instantaneous, i.e. they do not take the temporal ordering of the data into account, as is also the case with the well-known method of Principal Component Analysis (PCA).

A more traditional approach to the analysis of multivariate time series consists of trying to identify an explicit dynamical model, defined by its ability to perform optimal prediction of the given data within a certain prespecified model class, either directly for the data or via a state space approach [18]; such a model would, except for a set of model parameters and initial conditions, require either no input at all (in the case of a deterministic model) or featureless white Gaussian noise (in the case of a stochastic model). In general, the deterministic part of the model may include nonlinearities, but in this paper we will not employ nonlinear modelling; instead we will focus our attention on the class of linear stochastic models, represented by Multivariate Autoregressive (MAR) models and their generalisation, Multivariate Autoregressive Moving-Average (MARMA) models; for the case of negligible measurement noise, the latter class is equivalent to linear state space models [1, 5].

Autoregressive modelling aims at capturing the dynamics within the data in a predictive model, thereby taking into account temporal ordering and also the direction of time, while most ICA methods tend to ignore dynamical aspects and provide a mainly descriptive approach to the data; note that time reversibility holds only for univariate linear autoregressive (AR) models, while it is lost not only for nonlinear AR models, but also for multivariate linear AR models. Through this dynamical aspect a close relationship exists between modelling by MAR models and by differential equations, in particular stochastic differential equations [22].

Recently there has been growing interest in developing ICA algorithms which take temporal correlations into account, i.e. which are dynamical instead of instantaneous. Several authors have employed lagged covariance matrices for this purpose [30, 43] or worked directly with time-delay embedding vectors [24, 34], while some have proposed to model the sources by autoregressive models [7]. A particularly interesting approach has been provided by Cheung & Xu [10], who model temporal correlations by autoregressive processes in the space of sources (i.e. state space), allow for time-dependent noise variance by using GARCH (generalised autoregressive conditional heteroscedasticity), project the innovations remaining from this modelling step into observation space, and then decompose them by applying an instantaneous ICA algorithm. Thereby the decomposition task is split into two steps which need to be iterated. This contribution establishes a first link between ICA and state space modelling, and in this paper we intend to proceed further in this direction.

The purpose of this paper is twofold. First, we intend to discuss at a general level the relationship between ICA, (linear) MAR modelling and state space modelling, and to develop an approach for employing the latter two techniques for the same purpose as standard ICA algorithms; in other words, we will present a fully dynamical approach to ICA, which is based exclusively on maximum-likelihood estimation of state space models, without the need to additionally apply instantaneous ICA to the innovations. Second, we provide, to a limited extent, a comparison of time series modelling by various ICA algorithms with modelling by MAR models and state space models, when applied to simulated and real multivariate time series. This comparison will be neither comprehensive, nor do we intend to compress the results into universal claims of superiority or inferiority among these various approaches; rather, we will provide a detailed discussion of the various available measures and criteria of performance of different methods, such as likelihood, residual mutual information, or computational time demand.

Our reformulation of MAR models follows and extends in some parts earlier work, such as Principal Oscillation Pattern analysis [32, 31], Singular Spectrum Analysis [9, 35] and the Dynamic Linear Model approach of West and coworkers [36, 37, 38]; while the latter authors have explored Bayesian extensions, here we prefer to focus on developing a general methodology for state space modelling in a maximum-likelihood framework.

In the ICA literature, simulations are commonly employed for assessing the performance of algorithms; if the true sources of a given BSS problem are known, quantitative measures of performance can easily be defined, but in the case of real-world data, when even the validity of the basic assumptions underlying ICA is questionable, it is much harder to define such measures. In this paper we will demonstrate that likelihood provides a natural measure of performance which can be used for comparing different models for a given multivariate time series. More precisely, for the comparison of non-nested models a corrected estimator of the likelihood needs to be employed, which is provided by the Akaike Information Criterion (AIC). Regrettably, many popular ICA algorithms are nonparametric algorithms, and for these it is difficult or impossible to compute a likelihood. In this case, the value of the residual mutual information of the set of estimated source components may be used as a measure of performance; but here the problem arises that the independence assumption itself may be inappropriate.

It will be shown that MAR models provide a more general description of multivariate time series than most ICA approaches, and therefore they are also suitable for situations where the assumption of independence is inappropriate. The transition from MAR models to general state space models represents a step of further generalisation; this can be shown by considering the class of MARMA models, i.e. MAR models which are driven by correlated noise. While methods for fitting univariate ARMA models are well established [8], the multivariate case poses particular problems which have prevented widespread application so far. Due to the equivalence between MARMA models and linear state space models, the latter class provides an alternative approach to MARMA modelling and hence a further generalisation of MAR modelling.

The structure of this paper is as follows. In Section 2 we will briefly introduce observation models and source components, and we will classify methods for the estimation of source components into four classes. In Section 3 we will review a small selection of ICA algorithms; we will also include the time-honoured method of Factor Analysis. In Section 4 we will discuss MAR modelling and its generalisation to state space modelling. Section 5 provides a comparison of ICA and MAR methods. In Section 6 we will apply the methods discussed so far to three data sets, one simulated and two from real applications. In Section 7 discussion and conclusions are given.

2 Source components in time series

2.1 Observation models

Assume that the data is denoted by x(t) = (x_1(t), ..., x_N(t))^⊤, t = 1, ..., T, where N denotes the number of channels and T the number of time points at which the data was sampled. In the case of assuming instantaneous mixing, the observation model is defined by a mixing equation

$$x(t) = C\, s(t) + \varepsilon(t) \;\Leftrightarrow\; x_i(t) = \sum_{j=1}^{M} C_{ij}\, s_j(t) + \varepsilon_i(t)\,, \qquad i = 1, \ldots, N \qquad (1)$$

where s(t) = (s_1(t), ..., s_M(t))^⊤, t = 1, ..., T, denotes the unobserved source components (also termed "latent components"), M the number of source components, C = (C_ij) the N × M mixing matrix and ε_i(t) observational noise. Let the covariance matrices of the data, the sources and the observation noise be denoted by Σ_x, Σ_s and Σ_ε, respectively.

In general, the task of estimating both C and s(t) from x(t) represents an underdetermined, and therefore ill-posed, problem, especially in the case M = N, which is assumed frequently in the application of ICA algorithms; even if M is reduced sufficiently to remove the problem of underdetermination, the result will remain ambiguous with respect to rescaling and reordering of the sources s(t).
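As a minimal illustration of the mixing model of Eq. (1), the following Python sketch generates synthetic observations; the dimensions, the mixing matrix and the noise level are arbitrary illustrative choices, not values used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

T, M, N = 1000, 3, 4                       # time points, sources, channels (illustrative)
s = rng.standard_normal((M, T))            # placeholder sources, one row per component
C = rng.standard_normal((N, M))            # arbitrary N x M mixing matrix
eps = 0.1 * rng.standard_normal((N, T))    # observational noise

x = C @ s + eps                            # Eq. (1): x(t) = C s(t) + eps(t), columns indexed by t
```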

We mention that an apparently more general model has been defined, known as convolutive mixing, where the mixing equation is given by [34, 12]

$$x(t) = \sum_{\tau=1}^{p} C(\tau)\, s(t-\tau) + \varepsilon(t) \;\Leftrightarrow\; x_i(t) = \sum_{j=1}^{M} \sum_{\tau=1}^{p} C_{ij}(\tau)\, s_j(t-\tau) + \varepsilon_i(t) \qquad (2)$$

where p is a lag order. However, it can easily be shown that this model corresponds to the combination of instantaneous mixing with linear MAR modelling of the sources; therefore it does not represent a truly different model.

2.2 Classification of algorithms for estimating source components

Approaches to reconstructing the unobserved sources s(t) can be roughly classified according to two criteria: whether or not they assume Gaussian distributions, i.e. use second-order statistics only, in contrast to higher-order statistics; and whether or not they take into account temporal correlations in the data (a dynamic approach, in contrast to instantaneous approaches, which make use only of the distribution of the data, but not of its time ordering). This gives us four main classes:

• Gaussian instantaneous case: This case has a long tradition in multivariate statistics. Two related approaches belong here, Principal Component Analysis (PCA) and Factor Analysis (FA); however, sometimes PCA is regarded as merely a preprocessing step for FA. In FA, the sources s_i(t) are known as factors and the coefficients C_ij as loadings.

• Non-Gaussian instantaneous case: Most approaches to Independent Component Analysis (ICA) which have been developed during the last two decades belong here.

• Gaussian dynamic case: Some methods developed by the ICA community belong here, but also the largely independently developed generalisation of Factor Analysis known as Dynamic Factor Analysis (DFA); the method based on MAR modelling, to be developed in this paper, also falls into this class.

• Non-Gaussian dynamic case: A number of more recent ICA methods fall into this most general class, such as the "MILCA-delay" method of Stögbauer et al. [34]. Time series modelling by nonlinear models in a state space would also fall into this class, but such models have rarely been discussed in this context so far.


While a full review of this extensive field of methods and concepts is beyond the scope of this paper, we will now briefly summarise some core ideas and definitions, as far as is necessary for our purposes.

3 Factor analysis and ICA

In this section we will briefly review classical Factor Analysis and three examples of Independent Component Analysis, and we will show how these methods relate to the problem of estimating source components; furthermore, expressions for the log-likelihood and its corrected version, the Akaike Information Criterion, will be given where possible.

3.1 Factor analysis

Unlike PCA, Factor Analysis is based explicitly on the model of Eq. (1), such that estimates for the parameters C and Σ_ε are required. In order to reduce the number of unknown parameters it is common to impose the following constraints [17]:

$$\Sigma_s = I_M\,, \qquad \Sigma_\varepsilon = \mathrm{diag}(\sigma_{ii}^2)\,, \qquad (3)$$

i.e., Σ_s is an identity matrix (corresponding to uncorrelated, standardised sources), and Σ_ε is a diagonal matrix (corresponding to uncorrelated noise). Parameters may be estimated by the maximum-likelihood method; the log-likelihood for this problem is given by [23, 17]

$$\log L(C, \Sigma_\varepsilon) = -\frac{1}{2}\, T \left[ \log |\Sigma_x| + \mathrm{tr}\!\left(\Sigma_x^{-1} S_x\right) + N \log(2\pi) \right], \qquad (4)$$

where, from Eq. (1),
$$\Sigma_x = C C^\top + \Sigma_\varepsilon\,, \qquad (5)$$

and Sx denotes the maximum-likelihood estimator of Σx:

Sx � 1

T

¸t

�xptq � x

��xptq � x

�:, x � 1

T

¸t

xptq . (6)

The set of model parameters to be estimated from given data (which may or may not be a time series) is given by ϑ = (C, Σ_ε); since Σ_ε is a diagonal matrix, it contributes only N parameters. An efficient iterative algorithm for estimating these parameters by maximising the log-likelihood has been given by Jöreskog [23].

In practical applications of FA one usually aims at choosing the dimension of the source space M considerably smaller than the dimension of the data space N; while this helps to define a unique solution, it is not strictly necessary, and the iterative method of Jöreskog can also be applied to the underdetermined case M = N, similarly to the case of ridge regression. In both cases it is well known that further rotations in source space may be necessary in order to identify factors which are well interpretable.

For the purpose of comparing FA with other methods for decomposing given time series data into source components, the log-likelihood can be employed; but in order to avoid overfitting and to reward parsimonious models, a correction needs to be applied to the log-likelihood, which is provided by the Akaike Information Criterion (AIC) [2]. Let N_par denote the number of model parameters (i.e., the dimension of ϑ); then the AIC is given by

$$\mathrm{AIC} = -2 \log L + 2 N_{\mathrm{par}}\,. \qquad (7)$$


For FA, the number of model parameters is given by [4]
$$N_{\mathrm{par}} = N(M+1) - \frac{1}{2}\, M(M-1)\,, \qquad (8)$$
where the second term results from the constraint Σ_s = I_M.
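For concreteness, a minimal Python sketch of the AIC computation for FA according to Eqs. (4)-(8) follows; the function name is our own, and the loadings C and noise variances σ²_ii are assumed to have been estimated beforehand, e.g. by the iterative algorithm of Jöreskog [23]:

```python
import numpy as np

def fa_aic(X, C, sigma2):
    """AIC of a factor-analysis model, following Eqs. (4)-(8).

    X      : (N, T) data matrix
    C      : (N, M) loading matrix
    sigma2 : (N,)   diagonal of the observation-noise covariance
    """
    N, T = X.shape
    M = C.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    S_x = (Xc @ Xc.T) / T                       # Eq. (6): ML estimator of Sigma_x
    Sigma_x = C @ C.T + np.diag(sigma2)         # Eq. (5)
    sign, logdet = np.linalg.slogdet(Sigma_x)
    logL = -0.5 * T * (logdet + np.trace(np.linalg.solve(Sigma_x, S_x))
                       + N * np.log(2 * np.pi))            # Eq. (4)
    n_par = N * (M + 1) - M * (M - 1) // 2      # Eq. (8)
    return -2 * logL + 2 * n_par                # Eq. (7)
```

For the simulation of Sec. 6.1 (N = M = 4) this parameter count gives N_par = 14, in agreement with the value quoted there.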

3.2 Independent Component Analysis based on likelihood

Independent Component Analysis (ICA) is sometimes introduced as "non-Gaussian factor analysis" [20], and indeed both methods, ICA and FA, start from similar assumptions. In ICA, the model for data generation is again given by Eq. (1), but the assumption of uncorrelated sources, Σ_s = I_M, is replaced by the stronger assumption of independence of the sources, i.e. the requirement that the mutual information of the sources vanishes, or at least is minimal; with regard to the second assumption of FA, in most ICA algorithms observational noise is entirely neglected, Σ_ε = 0, or it is recommended to remove it by some preprocessing step prior to the application of ICA algorithms [20].

Among the multitude of ICA algorithms which have been developed, in this subsection we focus on a particular class which is defined by a maximum-likelihood criterion; this will enable us to perform quantitative comparisons with the results of other methods of analysis. It can be shown that maximisation of likelihood corresponds to minimisation of mutual information; therefore some authors have employed the maximum-likelihood method for the task of ICA. In this class of algorithms a non-Gaussian probability density p(s; ϑ) needs to be chosen, and its parameters ϑ estimated from the data by the maximum-likelihood method. Various densities have been proposed; particularly interesting is the density family given by [41, 11]

$$p(s; \alpha, \sigma) = \frac{\alpha}{2\,\ell\, \Gamma(1/\alpha)} \exp\left( -\left| \frac{s}{\ell} \right|^{\alpha} \right), \qquad \ell := \sqrt{2}\,\sigma\,, \qquad (9)$$

since it contains for α = 2 the Gaussian case, while 0 < α < 2 corresponds to super-Gaussian (leptokurtic) densities (with positive kurtosis), and α > 2 corresponds to sub-Gaussian (platykurtic) densities (with negative kurtosis); for α → ∞ the uniform density is approached. For α = 2 the parameter σ corresponds to the standard deviation, and also for other values it determines the moments.

In a full maximum-likelihood implementation, for M sources the NM elements of C and the M density parameters α_i (where sources are labelled by i) have to be estimated; depending on the value of α_i, different sources may be found to have super-Gaussian, (nearly) Gaussian or sub-Gaussian density. The log-likelihood for this model is given by

$$\log L(C, \alpha) = T \left[ \log \left| \det(C^{-1}) \right| - \sum_i n(\alpha_i, \sigma_i) \right] - \sum_t \sum_i \frac{1}{\ell_i^{\alpha_i}} \left| \sum_j (C^{-1})_{ij}\, x_j(t) \right|^{\alpha_i}, \qquad (10)$$

where α := (α_1, ..., α_M), ℓ_i := √2 σ_i, and n(α_i, σ_i) = log σ_i + log Γ(1/α_i) + (3/2) log 2 − log α_i; the scale parameters follow from

$$\ell_i^{\alpha_i} = \frac{\alpha_i}{T} \sum_t \left| \sum_j (C^{-1})_{ij}\, x_j(t) \right|^{\alpha_i}. \qquad (11)$$

The set of model parameters is given by ϑ = (C, α), and the number of parameters is
$$N_{\mathrm{par}} = (N+1)\, M\,; \qquad (12)$$


by using this number the corresponding AIC can be computed according to Eq. (7). It is quite common in applications of ICA methods to choose M = N, although M < N is sometimes also chosen; in this paper we will only consider cases where M = N.

Note that fitting the model based on Eqs. (9) and (10) by direct numerical minimisation of AIC may be time-consuming, especially for large N and T; therefore iterative learning algorithms have been proposed for the purpose of parameter estimation [41]; these can be expected to provide only sub-optimal models. In this paper we are not concerned with issues of efficient implementation, but with comparison of performance, and we therefore pay the price of the computational expense of numerical optimisation. The resulting ICA algorithm shall be denoted by "nG-ICA".
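To make the nG-ICA criterion concrete, the following Python sketch evaluates the log-likelihood of Eq. (10) with the scale parameters set to their maximum-likelihood values, Eq. (11); the function is our own illustration and assumes the parametrisation of Eq. (9) as given above:

```python
import numpy as np
from scipy.special import gammaln

def ng_ica_loglik(X, C, alpha):
    """Log-likelihood of the generalised-Gaussian ICA model, Eqs. (9)-(11).

    X : (N, T) data; C : (N, N) mixing matrix; alpha : (N,) shape parameters.
    The scale of each source is set to its ML value, Eq. (11).
    """
    N, T = X.shape
    W = np.linalg.inv(C)                     # unmixing matrix C^{-1}
    S = W @ X                                # estimated sources s_i(t)
    sign, logdetW = np.linalg.slogdet(W)
    logL = T * logdetW
    for i in range(N):
        a = alpha[i]
        ell_a = (a / T) * np.sum(np.abs(S[i]) ** a)   # Eq. (11): ell_i^alpha_i
        ell = ell_a ** (1.0 / a)
        # sum_t log p(s_i(t)); at the ML scale the exponent term sums to T/a
        logL += T * (np.log(a) - np.log(2 * ell) - gammaln(1.0 / a)) - T / a
    return logL
```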

3.3 Independent Component Analysis based on nonparametric estimation

Most currently available ICA algorithms are not based on the maximum-likelihood principle, as discussed in the previous subsection, but employ some other approach to identifying the source components (i.e., the "independent components"). As a well-known example we mention the FastICA algorithm of Hyvärinen [19], which aims at identifying maximally non-Gaussian projections of the data by maximising their negentropy; it is a major advantage of FastICA that this maximisation can be implemented by a fast and robust approximation of Newton's method for optimisation.

Another example is given by the "Mutual Information Least-dependent Component Analysis" (MILCA) algorithm, recently introduced by Stögbauer et al. [34], which, according to simulations reported by its originators, in many cases seems to outperform most of the previously introduced ICA algorithms, albeit at the cost of considerably increased computational time consumption. The MILCA algorithm is based on explicit minimisation of a sophisticated nonparametric estimator of mutual information.

Within the ICA framework, minimisation of mutual information has been shown to be equivalent to maximisation of log-likelihood [20]; however, estimators of mutual information which do not explicitly specify a model for the probability density of the variables in question are unsuitable for providing estimates of likelihood that could be compared with the likelihood obtained from parametric models. Nevertheless, relative changes of estimates of mutual information, comparing the set of estimated sources to the original data, can be expected to be useful quantities.

The authors of Ref. [34] also consider the case of convolutive mixing (see Sec. 2) and suggest estimating the sources by an ansatz involving time delays:

$$s_j(t) = \sum_{i=1}^{N} \sum_{k=1}^{q} w_{ji}(k)\, x_i(t-k)\,, \qquad j = 1, \ldots, Nq\,, \qquad (13)$$

where q is an integer parameter giving the maximum delay. This equation might be interpreted as an approximate inversion of Eq. (2), but Stögbauer et al. provide some cautionary discussion of this interpretation. They also extend the MILCA algorithm to this situation, and the resulting algorithm shall be denoted as "MILCA-delay". It is an interesting feature of Eq. (13) that the number of retrieved sources s_j(t) is M = Nq, and therefore indeed larger than the dimension of data space, M > N, in contrast to the previously discussed FA and instantaneous ICA methods. However, if convolutive mixing is assumed, this set of sources should be expected to display a redundancy corresponding to the time-delay structure of Eq. (2); consequently Stögbauer et al. apply a "heuristic" averaging procedure in order to reduce the number of sources back to at most M = N. A small sketch of the underlying delay embedding is given below.
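The construction of Eq. (13) amounts to applying instantaneous unmixing to a time-delay embedding of the data; a minimal Python sketch of the embedding step (our own illustrative helper, not part of the MILCA implementation):

```python
import numpy as np

def delay_embed(X, q):
    """Stack q delayed copies of the N-channel data, as used in Eq. (13).

    X : (N, T) data; returns an (N*q, T-q) matrix whose column at time t
    holds x(t-1), ..., x(t-q), so that linear unmixing of these rows
    yields up to M = N*q delayed-mixture sources.
    """
    N, T = X.shape
    rows = [X[:, q - k:T - k] for k in range(1, q + 1)]   # lags k = 1..q
    return np.vstack(rows)
```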


With respect to the classification given in Sec. 2.2, both FastICA and MILCA are non-Gaussian instantaneous algorithms, while MILCA-delay is a non-Gaussian dynamic algorithm.

4 MAR and State Space Modelling

4.1 MAR modelling

Given a multivariate time series, one of the oldest, and well time-tested, approaches of analysis consists of fitting a linear multivariate autoregressive (MAR) model [16], as defined by

$$x(t) = \sum_{\tau=1}^{\tau_m} \Phi(\tau)\, x(t-\tau) + \eta(t) \;\Leftrightarrow\; x_j(t) = \sum_{\tau=1}^{\tau_m} \sum_{k=1}^{N} \phi_{jk}(\tau)\, x_k(t-\tau) + \eta_j(t)\,, \qquad (14)$$

where j = 1, ..., N. Φ(τ) denotes a set of N × N parameter matrices, specifying the influence which any of the N components of the observed series x(t) exerts on any other component with respect to a delay time τ, where τ = 1, ..., τ_m. η(t) denotes an N-dimensional noise term, with N × N covariance matrix Σ_η; correlations between the components of x(t) which occur without time delay (i.e., τ = 0) are represented by off-diagonal elements of Σ_η. The complete set of parameters for this model is given by ϑ = (Φ(1), ..., Φ(τ_m), Σ_η); since Σ_η is a symmetric matrix, only N(N−1)/2 of its elements need to be included.

A fast and convenient method for recursively fitting linear MAR models to time series data has been developed by Levinson [27] and later by Whittle [39]; another method was recently proposed by Neumaier & Schneider [31]. Levinson's and Whittle's method is based on the set of autocovariance matrices

$$S_x(\tau) = \frac{1}{T} \sum_{t} \big(x(t) - \bar{x}\big)\big(x(t-\tau) - \bar{x}\big)^\top, \qquad \tau = 1, \ldots, \tau_m\,. \qquad (15)$$

After fitting the model, an estimate of Σ_η can also be obtained by
$$S_\eta = S_x(0) - \sum_{\tau=1}^{\tau_m} \Phi(\tau)\, S_x^\top(\tau)\,; \qquad (16)$$

note that S_x = S_x(0), as given in Eq. (6), is symmetric, while S_x(τ) for τ > 0 is usually non-symmetric. Based on this estimate, the corresponding log-likelihood follows as

$$\log L\big(\Phi(1), \ldots, \Phi(\tau_m), \Sigma_\eta\big) = -\frac{1}{2}\, (T - \tau_m) \left[ \log |S_\eta| + N \big(1 + \log(2\pi)\big) \right], \qquad (17)$$

where |·| denotes the determinant of a matrix. The number of parameters for MAR models is given by
$$N_{\mathrm{par}} = N^2 \tau_m + \frac{1}{2}\, N(N-1)\,, \qquad (18)$$

and again by using this number the AIC can be computed according to Eq. (7).

Note that in Eq. (17) T has been replaced by (T − τ_m), since in the MAR model the likelihood of the data is estimated via predictions, but for the first τ_m time points no predictions are possible; therefore the likelihood of these time points is omitted. Since the likelihood, being an estimate of an entropy, is an extensive quantity, this loss of data needs to be corrected for when comparing the likelihood or the AIC of different models; for sufficiently long time series, however, the effect may be neglected.
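As an illustration of Eqs. (14), (17) and (18), the following Python sketch fits a MAR model and evaluates its AIC; ordinary least squares is used here as a simple stand-in for the Levinson-Whittle recursion and yields comparable estimates for long stationary series:

```python
import numpy as np

def fit_mar(X, tau_m):
    """Least-squares fit of the MAR model Eq. (14) and its AIC.

    X : (N, T) data. Returns (Phi, S_eta, aic), with Phi of shape
    (tau_m, N, N) holding Phi(1), ..., Phi(tau_m).
    """
    N, T = X.shape
    Y = X[:, tau_m:]                                        # targets x(t), t > tau_m
    Z = np.vstack([X[:, tau_m - k:T - k] for k in range(1, tau_m + 1)])
    A = np.linalg.solve(Z @ Z.T, Z @ Y.T).T                 # (N, N*tau_m) coefficients
    Phi = A.reshape(N, tau_m, N).transpose(1, 0, 2)
    resid = Y - A @ Z                                       # prediction errors eta(t)
    S_eta = (resid @ resid.T) / (T - tau_m)
    sign, logdet = np.linalg.slogdet(S_eta)
    logL = -0.5 * (T - tau_m) * (logdet + N * (1 + np.log(2 * np.pi)))  # Eq. (17)
    n_par = N * N * tau_m + N * (N - 1) // 2                # Eq. (18)
    return Phi, S_eta, -2 * logL + 2 * n_par                # Eq. (7)
```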


4.2 Transition to state-space modelling

The linear MAR model discussed in the previous subsection provides a statistical model for given data; its relevance for the Blind Signal Separation problem and the identification of source components now needs to be demonstrated.

It is well known that any autoregression of order τ_m can be reformulated as an autoregression of order 1 by employing an augmented state vector which contains time-delayed values [1]:
$$\underbrace{\begin{pmatrix} x(t) \\ x(t-1) \\ \vdots \\ x(t-\tau_m+1) \end{pmatrix}}_{S(t)} = \underbrace{\begin{pmatrix} \Phi(1) & \Phi(2) & \cdots & \Phi(\tau_m-1) & \Phi(\tau_m) \\ I_N & 0_N & \cdots & 0_N & 0_N \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0_N & 0_N & \cdots & I_N & 0_N \end{pmatrix}}_{\Phi} \underbrace{\begin{pmatrix} x(t-1) \\ x(t-2) \\ \vdots \\ x(t-\tau_m) \end{pmatrix}}_{S(t-1)} + \underbrace{\begin{pmatrix} \eta(t) \\ \mathbf{0}_N \\ \vdots \\ \mathbf{0}_N \end{pmatrix}}_{H(t)}, \qquad (19)$$

where I_N, 0_N and $\mathbf{0}_N$ denote the N × N identity matrix, the N × N matrix of zeros and the N-dimensional vector of zeros, respectively. We write this new MAR model as

$$S(t) = \Phi\, S(t-1) + H(t)\,. \qquad (20)$$

The covariance matrix of the augmented noise term H(t) is given by
$$\Sigma_H = \begin{pmatrix} \Sigma_\eta & 0_N & \cdots & 0_N \\ 0_N & 0_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & 0_N \end{pmatrix}; \qquad (21)$$

the relationship between the augmented state S(t) and the observed series is given by
$$x(t) = \big( I_N \;\; 0_N \;\; \cdots \;\; 0_N \big)\, S(t) =: G\, S(t)\,, \qquad (22)$$

where there are (τ_m − 1) zero matrices 0_N in G; this equation can formally be regarded as an observation equation which does not contain an observational noise term (i.e., Σ_ε = 0_N), as is assumed frequently in ICA modelling; but in a general linear state space model we should allow for a non-zero observational noise term:

$$x(t) = G\, S(t) + \varepsilon(t)\,. \qquad (23)$$

With the state transition equation Eq. (20) and the observation equation Eq. (23) we have formally arrived at a linear state space model (to be denoted by "linSS"), which (for Σ_ε = 0_N) is still equivalent to the original MAR model, Eq. (14). The set of model parameters of this model is given by ϑ = (Φ, G, Σ_H, Σ_ε). But the class of linear state space models is much more general than the class of MAR models, since in general the parameter matrices collected in ϑ can have any shape, not just the particular shape compatible with MAR models.
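A minimal Python sketch of the construction of Eqs. (19)-(22); the function name and array layout are our own illustrative choices:

```python
import numpy as np

def mar_to_statespace(Phi, Sigma_eta):
    """Augmented state-space form, Eqs. (19)-(22), of a MAR model.

    Phi : (tau_m, N, N) MAR coefficient matrices; Sigma_eta : (N, N).
    Returns the (N*tau_m, N*tau_m) transition matrix F, the (N, N*tau_m)
    observation matrix G and the dynamical noise covariance Sigma_H.
    """
    tau_m, N, _ = Phi.shape
    d = N * tau_m
    F = np.zeros((d, d))
    F[:N, :] = np.hstack(list(Phi))            # top block row: Phi(1) ... Phi(tau_m)
    F[N:, :-N] = np.eye(d - N)                 # identity blocks on the sub-diagonal
    G = np.zeros((N, d))
    G[:, :N] = np.eye(N)                       # Eq. (22): observe the first block
    Sigma_H = np.zeros((d, d))
    Sigma_H[:N, :N] = Sigma_eta                # Eq. (21)
    return F, G, Sigma_H
```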

The log-likelihood of the general linSS model is given by
$$\log L(\vartheta) = -\frac{1}{2} \left[ (T - \tau_m) \log |\Sigma_x| + \sum_{t=\tau_m+1}^{T} \big(x(t) - x(t|t-1)\big)^\top \Sigma_x^{-1} \big(x(t) - x(t|t-1)\big) + (T - \tau_m)\, N \log(2\pi) \right], \qquad (24)$$


where x(t|t−1) denotes the prediction of the data vector at time t, based on all information available at time (t−1). The corresponding number of parameters in the model can be obtained from the dimension of ϑ and used for computing the AIC of this model. Before the data can be predicted (such that the log-likelihood can actually be computed), the unobserved states need to be estimated and predicted; a well-known iterative solution to this task is given by the Kalman filter [15]. The intrinsic relationship between ICA and Kalman filtering was recently also observed by Xu [42]. The state estimates, obtained at each time point conditional on the previous data, provide the reconstructed source components; we call them filtered estimates. It is also possible to base the state estimates on the complete available data set, instead of just on the previous data; these estimates are called smoothed estimates. They can be obtained by an extension of the Kalman filter, such as the Rauch-Tung-Striebel two-pass smoother [15].
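For illustration, a compact Python sketch of the Kalman recursion and the resulting innovation-form log-likelihood follows; unlike the steady-state expression of Eq. (24), it updates the innovation covariance at every time step, and the state initialisation is an arbitrary illustrative choice:

```python
import numpy as np

def kalman_loglik(X, F, G, Sigma_H, Sigma_eps):
    """Innovation-form log-likelihood of the linSS model, cf. Eq. (24).

    Standard Kalman filter recursion [15]; X : (N, T) data,
    F, G, Sigma_H, Sigma_eps : state-space model matrices.
    """
    N, T = X.shape
    d = F.shape[0]
    s = np.zeros(d)                 # filtered state estimate
    P = np.eye(d)                   # state covariance (illustrative start)
    logL = 0.0
    for t in range(T):
        s_pred = F @ s              # predictor step
        P_pred = F @ P @ F.T + Sigma_H
        nu = X[:, t] - G @ s_pred   # innovation x(t) - x(t|t-1)
        V = G @ P_pred @ G.T + Sigma_eps        # innovation covariance
        sign, logdet = np.linalg.slogdet(V)
        logL -= 0.5 * (logdet + nu @ np.linalg.solve(V, nu) + N * np.log(2 * np.pi))
        K = P_pred @ G.T @ np.linalg.inv(V)     # Kalman gain
        s = s_pred + K @ nu                     # corrector step
        P = P_pred - K @ G @ P_pred
    return logL
```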

4.3 Transformation of the state of the state-space model

At first sight the MAR model and the corresponding linSS model may seem to have little relevance for the search for source components, but we can easily approach the task of BSS by considering a transformed state space in which the transition matrix Φ is diagonal; this concept has also been employed by West and coworkers in the context of estimating source ("latent") components in time series [36, 37, 38]. By such a transformation every component of the state vector will depend only on its own past, but not on the past of the other components, and therefore such a set of components can be expected to be less dependent. In fact, if the transition matrix Φ and the corresponding dynamical noise covariance matrix Σ_H were simultaneously diagonalised, the resulting components would become perfectly independent, but usually this cannot be accomplished by transformations alone, and the off-diagonal elements of Σ_H will remain a source of correlation between the components. We shall return to this point later.

The theory of canonical models, as discussed by various authors in various contexts [32, 33, 37, 31], forms the theoretical basis for the decomposition approach which will now be discussed. Diagonalising the (Nτ_m) × (Nτ_m) transition matrix Φ will produce a set of d_r real eigenvalues λ_i and another set of d_c pairs of complex conjugate eigenvalues (ψ_j, ψ_j^*), where Nτ_m = d_r + 2d_c; this diagonalisation corresponds to a particular linear transformation of state space.

We reorder the state space dimensions such that in the new state vector all dimensions corresponding to real eigenvalues come first, followed by the pairs of dimensions corresponding to complex eigenvalues; furthermore, real-eigenvalue dimensions shall be ordered according to the size of the eigenvalues, and complex-eigenvalue dimensions according to the size of the phase of ψ_j. Note that every real coefficient in a diagonalised state space transition matrix corresponds to a univariate autoregressive process of first order, AR(1), for the corresponding dimension of the state vector.

In order to remove any complex numbers from the transition matrix, a further linear transformation needs to be applied to each pair of complex conjugate eigenvalues (ψ_j, ψ_j^*). Let
$$\begin{pmatrix} \psi_j & 0 \\ 0 & \psi_j^* \end{pmatrix}$$
be the sub-matrix on the diagonal of the transformed transition matrix Φ containing such a pair; then a possible transformed shape of this sub-matrix would be the companion form
$$\begin{pmatrix} \psi_1(j) & \psi_2(j) \\ 1 & 0 \end{pmatrix},$$
where ψ_1(j) and ψ_2(j) are real parameters. However, both theoretical considerations and practical experience with state space modelling indicate that preferably the transpose of the companion form,
$$\begin{pmatrix} \psi_1(j) & 1 \\ \psi_2(j) & 0 \end{pmatrix},$$
should be employed [1]. The main difference between these two representations lies in the fact that in the second form both state dimensions can be modelled as being driven by dynamical noise terms, whereas in the first form no driving noise term is possible for the second state dimension.

Nevertheless, both forms can easily be interpreted as state space representations of a univariate AR(2) process with parameters ψ_1 and ψ_2. Each AR(2) process corresponds to one particular frequency and one particular damping coefficient; by sorting complex-eigenvalue dimensions according to phase we have sorted them according to these frequencies. While the d_r AR(1) components (resulting from real eigenvalues) represent stochastically driven relaxators, the d_c AR(2) components represent stochastically driven oscillators; such a decomposition of a given multivariate time series forms the core element of methods like Principal Oscillation Pattern (POP) analysis [32, 31] or modal analysis [33].
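A small Python sketch of this eigenvalue bookkeeping follows; it is our own illustrative helper, extracting the AR(1) coefficients and, for each complex pair, the transposed-companion-form parameters, which satisfy ψ_1(j) = 2 Re ψ_j and ψ_2(j) = −|ψ_j|²:

```python
import numpy as np

def canonical_components(F, tol=1e-10):
    """Sort eigenvalues of the transition matrix into AR(1) relaxators
    and AR(2) oscillators, as described in Sec. 4.3 (illustrative sketch).

    Returns (lambdas, ar2): lambdas are the d_r real eigenvalues, and
    ar2 is a list of (psi1, psi2, phase) tuples, one per complex pair.
    """
    ev = np.linalg.eigvals(F)
    lambdas = np.sort(ev[np.abs(ev.imag) < tol].real)      # AR(1) components
    pairs = ev[ev.imag > tol]                              # one of each conjugate pair
    ar2 = [(2 * p.real, -abs(p) ** 2, np.angle(p)) for p in pairs]
    ar2.sort(key=lambda c: c[2])                           # order by phase/frequency
    return lambdas, ar2
```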

Let the linear transformation comprising the diagonalisation and the subsequent steps of reordering and transformation to transposed companion form be denoted by D; then the transformations for the state, the transition matrix, the dynamical noise covariance matrix and the observation matrix are given by:
$$S(t) \;\to\; D\, S(t) \qquad (25)$$
$$\Phi \;\to\; D\, \Phi\, D^{-1} \qquad (26)$$
$$\Sigma_H \;\to\; D\, \Sigma_H\, D^\top \qquad (27)$$
$$G \;\to\; G\, D^{-1}\,. \qquad (28)$$

The dimension of the state vector S(t) is given by M = Nτ_m; however, since each AR(2) component contributes two dimensions to the state, which can be expected to be closely correlated, the effective state dimension should rather be M = Nτ_m − d_c; at least this is the number of components into which the data is decomposed by this approach. Still, we will typically have M > N, as in the above-mentioned case of MILCA-delay, and also in the case of Singular Spectrum Analysis [9, 35] (where N = 1); in all of these cases the reason is that by using time delays the number of components that can be reconstructed is larger than the dimension of data space. This point requires some further discussion, which will be given in the next section.

It is important to note that if these source components, corresponding to AR(1) and AR(2) processes, are estimated by a Kalman filter, they will, depending on the properties of the data, typically contain much richer dynamics than pure AR(1) or AR(2) processes would be able to produce. This convenient capacity results from the two-stage structure of the Kalman filter, consisting of predictor and corrector [15]. Second, in the transformed observation matrix G there will typically be nonzero elements for both state dimensions belonging to each AR(2) component; this situation may be interpreted by saying that these components actually correspond to ARMA(2,1) processes (where ARMA stands for autoregressive moving-average) instead of pure AR(2) processes [38].

The fitting of a linSS model to given data then proceeds by maximising the log-likelihood, as given by Eq. (24), or preferably by minimising the corresponding AIC, by a suitable numerical optimisation routine [28]. This could be done even without estimating and transforming a MAR model beforehand, but we have found that usually such a model provides an excellent initial point for the optimisation. If we aim at independent components, the off-diagonal elements of Σ_H should be kept zero during this optimisation (except for those pairs of elements referring to the two state dimensions within each AR(2) component).


5 Comparison of linSS with ICA

We have noted that in a linSS model which results from a MAR model by diagonalisation of Φ, the state dimension M will typically be larger than the data dimension N; in the ICA literature this case, known as the case of "overcomplete bases", is regarded as difficult [20].

In linSS models, by contrast, the case M > N does not by itself pose a difficult situation, and there is no need to artificially reduce the number of source components to M = N, as is done in the case of MILCA-delay. It can be said that in linSS models the two dimensions N and M are essentially independent of each other; therefore it is also possible to reconstruct several components from even just one univariate observed time series (N = 1). This has been done in econometrics under the title of seasonal adjustment for a long time, and it is also the typical case for Singular Spectrum Analysis [9, 35]; in psychometrics the corresponding generalisation of classical factor analysis (FA) into dynamic factor analysis (DFA) was introduced in the 1980s, after the need for a generalisation of FA to include "memory" had been expressed as early as 1963 [6]. Of course, depending on the properties of the true sources, the correct reconstruction of all sources may still fail, and this remark applies to each of the three possible cases, M > N, M = N and M < N.

The advantage of MAR/linSS over most ICA algorithms is the existence of an explicit parametric dynamical model which describes the mutual dependencies of the state components over time; through these dependencies information about not directly observed components is propagated into observed components, and from there into the observations, and this information can be exploited for reconstructing the non-observed components, regardless of whether the number of state components M is smaller than, equal to or larger than the data dimension N. This useful property of multivariate dynamics remains valid for linear and nonlinear, deterministic and stochastic dynamics.

It is possible to devise dynamical systems for which this mechanism fails, e.g. cases in which the state space can be decomposed into two non-interacting subspaces, only one of which is observed; such systems are called non-observable [25]. Several numerical methods are available for investigating whether a given system (i.e., a given pair of Φ and G) is observable. It turns out that also in the case M > N observability is given in most situations.

While ICA corresponds to the case of both Φ and Σ_H being diagonal, this constraint can be discarded in linSS modelling, which therefore represents a wider class of models than ICA. It remains possible to transform any linSS model into a model where either Φ or Σ_H is diagonal, corresponding to the cases where all dependencies are instantaneous (described by Σ_H) or delayed (described by Φ). The first of these cases has been employed in this paper, while the second would offer advantages for the identification of causality patterns within a dynamical system; we will not discuss it further in this paper.

6 Application to time series

In this section we will apply the decomposition algorithm proposed in this paper to one simulated and several real-world time series, and we will compare the results with those obtained by the other algorithms briefly reviewed above. The performance of the algorithms will be compared by the following measures:

• The mutual information of the components, using the estimator of Kraskov et al. [26], which is based on kth-nearest neighbours (using k = 6 in all cases).


• The value of AIC (only for algorithms producing a likelihood).

• The Amari performance index (API), a measure of the closeness of the true mixing matrix (i.e. observation matrix) C and its estimate Ĉ, taking into account the intrinsic ambiguities of source component estimation, as mentioned above. This measure can be evaluated only for simulations; a short sketch of its computation is given after this list. The API is defined as [13]

$$\mathrm{API} = \frac{1}{2M} \sum_{i,j=1}^{M} \left( \frac{|K_{ij}|}{\max_k |K_{ik}|} + \frac{|K_{ij}|}{\max_k |K_{kj}|} \right) - 1\,, \qquad (29)$$

where K_ij = (Ĉ⁻¹ C)_ij; the API vanishes if Ĉ deviates from C only in scaling and reordering.

• Finally, the computational time consumption. Since this measure depends on the efficiency of the implementation of the algorithms, as well as on the hardware, it serves only for relative comparison. The values reported refer to a MATLAB implementation running on a 3.19 GHz Pentium PC; implementations of MILCA and MILCA-delay in the C language were made available by the authors of Ref. [34].
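As announced above, a minimal Python sketch of the API of Eq. (29); the function and variable names are our own illustrative choices:

```python
import numpy as np

def amari_index(C_true, C_est):
    """Amari performance index, Eq. (29); it vanishes iff C_est equals
    C_true up to rescaling and reordering of the columns."""
    K = np.abs(np.linalg.inv(C_est) @ C_true)          # |K| with K = C_est^{-1} C_true
    M = K.shape[0]
    rows = (K / K.max(axis=1, keepdims=True)).sum()    # terms normalised per row
    cols = (K / K.max(axis=0, keepdims=True)).sum()    # terms normalised per column
    return (rows + cols) / (2 * M) - 1.0
```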

6.1 Simulation case

6.1.1 Design of the simulation

There is a custom in the ICA literature to employ mainly simulated data for the purpose of evaluating the performance of ICA algorithms; if such data is generated by the model of Eq. (1), this approach has the advantage that the true sources are indeed known, such that the performance of the algorithms can easily be assessed. However, in many cases the data is simulated from quite simple and artificial processes or distributions, creating situations which differ very much from the situation of analysing real-world data, which hardly ever follows simple models. The relevance of performance results obtained from such simulations to the application to real-world data is therefore uncertain.

As a compromise, here we will first assume that the model of Eq. (1) is valid, but in order to make the situation more realistic, we will use real-world data as true sources. We choose to generate artificial data from M = 4 different sources, as given by:

• one trace of human EEG; healthy awake 10-year-old subject with eyes closed, channel O2 (occipital cortex) versus average reference of all 20 EEG channels.

• one trace of rat EEG; healthy adult rat, frontal cortex versus cerebellar reference electrode.

• part of an audio signal from the voice of a male speaker.

• velocity recording of a chaotic mode from a hydrodynamic experiment (Taylor-Couette system).

Actual physical sampling rates are different for each source data set, but this does not matter for the purpose of a simulation. T = 1000 points are chosen from each data set and standardised to zero mean and unit variance; then they are mixed by an arbitrarily chosen nonsingular 4 × 4 matrix. The true sources and the mixtures are shown in Fig. 1 (top panels). It can be seen that the sources themselves represent complicated signals with broad spectral content and a considerable degree of nonstationarity (especially in the case of the audio signal); therefore their reconstruction from mixtures poses a challenging problem.


6.1.2 Results of analysis

Results are shown in Fig. 1 and summarised in Table 1. The methods compared are FA, FastICA, nG-ICA, MILCA, MILCA-delay (with different embedding dimensions q), MAR modelling (with different model orders τ_m) and linSS models derived from MAR models (either without or with additional optimisation). The table also gives the mutual information for the true sources and for the raw data, as well as, for parametric models, the number of model parameters N_par. For the linSS models, the mutual information MI is given for both filtered and smoothed estimates, and in the figure smoothed estimates are shown.

Note that the number of source components M resulting from MAR modelling depends on the model order τ_m and will generally be larger than N = 4; for τ_m = 7 (the model order at which the AIC assumes its minimum) the algorithm finds M = 16 components, 4 of which are AR(1) and 12 of which are AR(2), and the values for AIC and N_par given in the table refer to this full set of components. For the purpose of comparison with other ICA methods, only 4 of these components should be retained; we decide to keep only those components which display most structure in their power spectrum, where "structure" is quantified by the negentropy of the distribution of the power spectrum, as obtained from the Discrete Fourier Transform (i.e. the periodogram) [21]; other criteria may be useful as well, or the components may be selected by subjective visual inspection. The values for MI and API given in the table refer to this reduced set of components. It is also this reduced set which is used for the linSS model, either directly or as the initial point for the numerical optimisation. Estimated source components from FastICA, MAR(τ_m = 7), linSS and optimised linSS are shown in Fig. 1 (middle and lower panels).

The MAR model with τ_m = 7 cannot produce predictions for the first 7 data vectors, and therefore the corresponding contribution to the likelihood is missing; to compensate for this effect, all other reported values of likelihood were also calculated omitting the contributions from the first 7 data vectors. The same correction will also be applied to the real-world data examples presented below.

From Table 1 it can be seen that several methods achieve components which have as low a mutual information as the true sources; the estimates are even somewhat lower (i.e. the algorithms try to remove even the coincidental correlations between the sources). This result indicates that these methods have essentially identified the sources correctly (as can also be seen from the sources shown in Fig. 1), although in detail there may still be considerable deviations, corresponding to non-zero values of the API. With respect to the API, the best result seems to be provided by FastICA, which at the same time is also the fastest among this set of algorithms (keeping the promise of its name); however, it has to be noted that the API is not a very precise measure of performance, i.e. its distribution has to be expected to be rather broad.

Within the subset of algorithms based on parametric models we can compare the likelihood, or preferably the AIC, as a measure of performance. Factor analysis provides an upper bound at AIC = 9780.94 for the Gaussian instantaneous case; it can be seen from Table 1 that by dropping the Gaussian assumption in favour of a more general class of distributions (as employed by nG-ICA) only a small improvement of the AIC can be achieved, which is further reduced by the penalty for the increased number of parameters (which increases from 14 to 20). We remark that nG-ICA nevertheless provides satisfying estimates of the true sources (result not shown), and even the components resulting from FA resemble the true sources approximately (result not shown).

MAR modelling yields AIC values more than 10000 units smaller (i.e. better) than the instantaneous methods FA and nG-ICA (note that since AIC is a logarithmic quantity, a value of zero has no special meaning, and, depending on the length and typical scaling of the data, AIC may easily cross from positive to negative values; only differences of AIC are relevant). For the minimum-AIC model order τ_m = 7 we find very small values for the mutual information (in fact the smallest value obtained by any of the compared algorithms), and still the analysis requires very little computational time; but from Fig. 1 it can be seen that the estimated sources are not as similar to the true sources as in the case of FastICA (or MILCA and MILCA-delay; results not shown). This discrepancy can be partly explained by the fact that only M = 4 components out of a total of 16 components provided by the MAR decomposition have been retained; nevertheless, it is remarkable that these 4 components, as selected by the Fourier negentropy criterion, succeed in identifying those components which correspond approximately to the true sources.

Note that the time of computation given in the tables for the non-optimised linSS model refers to one application of the Kalman filter, assuming that the model is already given (by the MAR model), while for the optimised linSS model it refers to the time consumption of the numerical optimisation procedure; once the optimised model is given, the Kalman filter can always be applied with the much lower time consumption of the non-optimised model; in particular, it may be applied out-of-sample without refitting the model.

We could try to collect the sets of MAR components belonging to the same source by clustering methods, e.g. as employed by Stögbauer et al. [34]; alternatively we can refine our model by fully entering the field of state space modelling, as described above. For this purpose only the 4 components displayed in the middle right panel of Fig. 1 are retained, and those off-diagonal elements of Σ_H linking different components are set to zero; then the components are re-estimated by Kalman filtering. It is not surprising that these modifications deteriorate the model performance: as shown in the table, the non-optimised linSS model has an AIC value several thousand units larger than the original MAR model (even worse than the nG-ICA model); on the other hand, it represents a much more parsimonious description of the data, requiring only 42 instead of the 118 parameters of the MAR model. The performance of this linSS model can be improved by numerical optimisation, which finally, as shown in Table 1, yields an AIC much better than that of the original MAR model.

The source components obtained from the non-optimised and optimised linSS models are shown in the lower panels of Fig. 1; note that the order of the components has changed during the optimisation process and that certain components have been swapped between the groups of AR(1) and AR(2) components; this is frequently observed with this modelling approach. As can be seen in the figure, the shapes of the components now correspond very closely to the true sources. The residual mutual information has now assumed a value much smaller than in the non-optimised case.

6.2 Real-world data: ECG recording

6.2.1 Description of data set

We will now study the decomposition of a data set representing the electrocardiogram (ECG) of a pregnant woman. The data was sampled from N = 8 electrodes placed at thorax and abdomen, at a sampling rate of 500 Hz; the length of the time series is T = 2500 points. The data is shown in Fig. 2 (top left panel); the sharp spikes corresponding to the mother's heartbeat are clearly visible, and the heartbeats of the fetus, having much smaller amplitude but higher frequency, can also be seen. Furthermore, in channel D a very slow oscillation appears, probably due to breathing. This data set has been studied in the context of Independent Component Analysis also by other authors [29, 34]; it is suitable as a benchmark data set since it represents a case where we can expect the basic assumption underlying ICA, the assumption of a linear instantaneous mixture of independent sources, to be approximately correct.


6.2.2 Results of analysis

We apply the same decomposition algorithms as before and employ the same set of measures for evaluation and comparison (except the API, which is defined only for simulations); the results are shown in Fig. 2 and summarised in Table 2.

We find that for this case MILCA and nG-ICA provide the sources with the smallest residual mutual information; the sources estimated by MILCA are shown in Fig. 2 (middle left panel). It can be seen that sources 2 and 8 represent the mother, while 5 and 6 are dominated by the fetus; this result reproduces Fig. 9 of Ref. [34]. The FastICA decomposition consists of very similar components, as can be seen from Fig. 2 (top right panel); again FastICA proves to be indeed very fast, while the time consumption of MILCA is considerable (using the implementation provided by the authors of Ref. [34]). For this data set nG-ICA also consumes a long computation time, because we have carried out a full optimisation of the 72 model parameters; note that in [41] and [11] approximate optimisations based on iterative learning rules were proposed.

In terms of AIC, again MAR achieves a value several tens of thousands of units smaller than the non-dynamic methods FA and nG-ICA. At the minimum-AIC model order τ_m = 10 the algorithm finds M = 44 components, 8 of which are AR(1) and 36 of which are AR(2); those M = 8 components having the largest Fourier negentropy are displayed in Fig. 2 (middle right panel). While the breathing signal is well separated among these components, there is no clear separation of mother and fetus heartbeat; only after these 8 components are taken into state space and further optimised can a much better decomposition be obtained (Fig. 2, bottom right panel), within which there are two components for the mother and one for the fetus. By the transition from MAR to linSS the model becomes much more parsimonious, the number of model parameters being reduced from 668 to 132. The residual mutual information is reduced from 2.653 to 1.187 (filter) or 1.301 (smoother), which is smaller than for FastICA, but not as small as for MILCA; on the other hand, according to visual impression the decomposition is better for optimised linSS than for MILCA, with hardly any traces of the heartbeat spikes in the first 5 components.

The price for this improved decomposition is the very high time consumption of the numerical optimisation; for higher data and state space dimensions this problem soon renders the algorithm impracticable. This remains true even though our implementation was not tuned for speed and efficiency, so that certain improvements could certainly be achieved. We note that MILCA-delay also suffers from very high time consumption.

From Table 2 it can be seen that the AIC of the optimised linSS model is still larger than that of the corresponding MAR model; various possible reasons could be considered for this effect. It could reflect an insufficient solution in the 132-dimensional parameter space, thus requiring still more optimisation effort, or it could reflect an insufficiency of the model itself. We have constrained the number of components to 8 and applied the additional constraint of independent components, and these two constraints together might be too restrictive for this data set; the MAR model has 44 components available, all of which furthermore are correlated through the off-diagonal terms of the noise covariance matrix. Keeping a larger number of components or introducing instantaneous correlations between certain components would certainly improve the AIC further, but here we will not spend more time on this data set, since we intended merely to demonstrate that very good decompositions can be obtained for it, even if the model still leaves room for improvement.


6.3 Real-world data: EEG recording

6.3.1 Description of data set

As third and last example we will study the decomposition of a data set representing the electroencephalogram (EEG) of a 10-year-old male patient suffering from absence epilepsy; the data set represents 4 electrodes (T4, T5, T6, A1 versus the average of F3, F4) with a sampling rate of 256 Hz, and the length of the time series is T = 1250 points. The data is shown in Fig. 3 (top left panel); it can be seen that all 4 electrodes are dominated by a coherent oscillation with slow waves and sharp spikes, a typical pattern for absence seizures. Furthermore, some artefacts and contaminations can be seen, such as 50 Hz power hum noise (especially in T5) and a movement artefact in the last quarter of the data (which briefly brings A1 to a constant value at maximally negative amplitude). Closer examination reveals that the power hum noise contributes sharp lines not only at 50 Hz, but also at 74 Hz, 100 Hz and 106 Hz. The mixing of physiologically meaningful components and artefactual components found in this data set can be regarded as typical of real-world clinical data.
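The hum-noise lines can be confirmed directly from the power spectrum of each channel; a sketch using SciPy's Welch estimator (the channel data below is a random placeholder):

    import numpy as np
    from scipy.signal import welch

    fs = 256.0                                           # sampling rate in Hz
    x = np.random.default_rng(2).standard_normal(1250)   # placeholder for channel T5
    f, pxx = welch(x, fs=fs, nperseg=512)
    # for the real data, sharp peaks appear near 50, 74, 100 and 106 Hz
    line_freqs = f[pxx > 10 * np.median(pxx)]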

6.3.2 Results of analysis

We apply the same decomposition algorithms as before and employ the same set of measures for evaluation and comparison (except API); the results are shown in Fig. 3 and summarised in Table 3.

As with the ECG data set, we find again that MILCA and nG-ICA provide the sources with smallest residual mutual information; the sources estimated by MILCA are shown in Fig. 3 (middle left panel). It can be seen that the second source represents the main seizure activity, the third contains most of the hum noise activity and the fourth contains a dipole-like structure corresponding to the movement artefact. While this decomposition is potentially useful, it suffers from poor discrimination of spectral features: strong low-frequency activity remains in the third source, in addition to the hum noise signal, and spectral analysis of the other sources reveals the presence of residual hum noise activity. The FastICA decomposition (top right panel) suffers from the same problems.

Again MAR achieves an AIC value much smaller than the non-dynamical methods FA and nG-ICA. At the minimum-AIC model order τm = 43 the algorithm finds M = 89 components, 6 of which are AR(1) and 83 of which are AR(2); those M = 8 components having largest Fourier negentropy are displayed in Fig. 3 (middle right panel). It can be seen that the MAR model automatically identifies the four hum noise lines (at 50 Hz, 74 Hz, 100 Hz and 106 Hz); if we were to choose only M = 4 components out of the 89 available components, we would choose only these artefactual components, and for this reason the number of sources was increased to M = 8 (which corresponds to the "overcomplete basis" case). Nevertheless, by the transition from MAR to linSS the model becomes much more parsimonious, the number of model parameters being reduced from 1926 to 94; the corresponding sources are shown in the lower panels of Fig. 3. The residual mutual information is reduced by the transition, but its values cannot be directly compared to the corresponding values for FastICA and MILCA, since they refer to a larger number of sources; furthermore, the four hum noise sources are highly correlated and therefore tend to raise the residual mutual information of this set of sources.
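The Fourier negentropy used for ranking the components can be implemented, in the spirit of the spectral entropy of Ref. [21], as the maximal minus the actual entropy of the normalised power spectrum; the following sketch follows that recipe (whether it matches the definition given earlier in this paper in every detail is not guaranteed):

    import numpy as np

    def fourier_negentropy(x):
        # High for components with line spectra, low for broadband noise.
        p = np.abs(np.fft.rfft(x)) ** 2
        p = p / p.sum()                       # normalised power spectrum
        nz = p > 0
        h = -np.sum(p[nz] * np.log(p[nz]))    # spectral entropy
        return np.log(p.size) - h             # negentropy

    # keep the M = 8 components with largest Fourier negentropy:
    # idx = np.argsort([fourier_negentropy(c) for c in components])[::-1][:8]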

The remaining sources show components related mainly to the seizure or to the movement artefact; note that there is a tendency to split the seizure activity into an almost sinusoidal component representing the slow waves and a component representing the spikes. Furthermore, it is remarkable that through optimisation the hum noise components become almost perfect sine waves, as they should be; this behaviour results from the corresponding complex eigenvalues moving onto the unit circle.
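This behaviour is easy to verify in isolation: an AR(2) component with eigenvalues ρ exp(±iω) obeys x(t) = 2ρ cos(ω) x(t-1) − ρ² x(t-2) plus driving noise, and for ρ = 1 its noise-free part is an undamped sinusoid. A minimal sketch:

    import numpy as np

    fs, freq = 256.0, 50.0                     # sampling rate, hum frequency
    rho, w = 1.0, 2 * np.pi * freq / fs        # modulus 1: eigenvalues on unit circle
    a1, a2 = 2 * rho * np.cos(w), -rho ** 2    # AR(2) coefficients
    x = np.zeros(1250)
    x[1] = 1.0                                 # nonzero initial condition
    for t in range(2, x.size):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2]   # without driving noise: a pure sine wave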

As can be seen from Table 3, in this case the optimisation also succeeds in improving the AIC of the linSS model beyond the lowest value attainable with MAR models; but again the total time consumption of the optimisation is considerable.

7 Conclusion

In this paper we have proposed a method for decomposing multivariate time series into source components by dynamical modelling, and we have compared this method with previously introduced methods, such as Factor Analysis (FA) and Independent Component Analysis (ICA). We have put particular emphasis on the concepts of dynamical modelling and of maximisation of likelihood. Dynamical modelling represents the attempt to reconstruct the causal structure of the system underlying the data, corresponding to a description by a set of differential equations, whereby the information contained in the data can be exploited in a much better way than by "instantaneous" models (like FA or non-dynamical ICA), which entirely ignore the temporal order of the time series data. By comparing the likelihood, or preferably the Akaike Information Criterion (AIC), the performance of different models for a given time series can be quantitatively evaluated and compared; the important point here is that AIC is an estimator of the distance (in the sense of a Kullback-Leibler distance, but with an additional interpretation in terms of probability theory [3]) between the probability distribution provided by the model and the unknown, and in many cases inaccessible, true distribution.
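In formula form, for a model with maximised likelihood L(θ̂) and k free parameters, the criterion reads

    \mathrm{AIC} = -2 \log L(\hat{\theta}) + 2k ,

and, up to an additive constant that does not depend on the model, it estimates twice the expected Kullback-Leibler distance from the true distribution to the fitted predictive distribution [2, 3].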

Most algorithms for ICA do not provide a value for the likelihood or the AIC, an exception being the nG-ICA algorithm, for which we could explicitly show that the improvement of AIC over FA which can be obtained by introducing non-Gaussian distributions is small compared to the improvement which can be obtained by introducing dynamical modelling.

Multivariate autoregressive (MAR) models and state space models represent a classical approach to dynamical modelling of time series; in this paper we have discussed the close relation between these model classes, with a particular view to their relation to ICA. We have demonstrated that the fundamental assumption of ICA, the independence of sources, has a natural interpretation within the framework of state space modelling, such that it becomes possible to perform ICA by these well-established methods from classical time series analysis. While the interpretation of MAR models in terms of "principal oscillations" [32] is well known, surprisingly its relevance for the task of ICA seems not to have been recognised so far.

Many algorithms for ICA aim at directly minimising the residual mutual information of the sources; in some applications this approach may be useful, but as the main criterion for data modelling the criterion of minimum mutual information is problematic, since it may easily expend too much effort on removing coincidental correlations. This behaviour could be seen in the EEG data example presented in this paper: all tested ICA methods failed to isolate the sinusoidal power supply components, because there were other decompositions having lower mutual information. Given that sinusoidal components are extremely well defined and the need for their removal from data is a common situation in biomedical time series analysis, this result represents a serious deficiency of the tested ICA algorithms. We have observed that many decompositions produced by ICA algorithms suffered from such neglect of spectral-domain features. In contrast, within linear state space models spectral features can be easily represented, while at the same time the reconstruction of source components with broad spectral content remains possible, as has been shown by the simulation example presented in this paper.


In the simulation example the standard ICA algorithms and the MAR and linSS decompositions have demonstrated comparable performance, indicating that standard ICA will provide acceptable results if its assumptions are known to be correct. In such situations the much larger time consumption of linSS model fitting may indeed be a reason to favour ICA over linSS modelling. This remark applies in particular to high-dimensional data; sometimes ICA is applied to time series with N > 100, while so far full linSS model fitting is practicable only for small dimension, say N ≤ 8. On the other hand, pure MAR fitting is very fast and may in many cases also provide useful decompositions; as an example, it can be seen from Fig. 3, middle right panel, that the MAR model is also able to isolate the sinusoidal components.

If the task were simply the removal of sinusoidal components then, depending on the intended further analysis of the data, there may be no need for the time-consuming full linSS model fit, and simpler methods may be chosen. We have discussed this example in order to present, across a number of situations, a very general modelling framework which offers very good performance and is based on solid mathematical foundations. It is clear that for practical work compromises may be needed which do not always follow the theoretical preferences; in many applications it will be more convenient to remove contaminations from the data by preprocessing steps before proceeding to the main analysis. Nevertheless, the concept of a unified approach to time series analysis, which treats all phenomena in the data on an equal level and does not distinguish between preprocessing and main analysis, remains considerably attractive from both theoretical and applied viewpoints.

We see the main advantage of the framework based on linSS modelling in its much higher flexibility, as compared to standard ICA algorithms. Prior knowledge about certain model parameters or about the dynamics of certain components can be incorporated easily, possibly also including nonlinear elements of the dynamics or of the observation process. As an example of prior knowledge of parameters, the frequencies of sinusoidal components may be well known a priori, such that they need not be estimated numerically, and the same holds for the corresponding moduli, which can be fixed to unity. Known or suspected deviations from the independence of components can be incorporated, and, unlike with ICA, the presence of observation noise can easily be treated; also, there is no need for the sources to have non-Gaussian distributions.
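Fixing a sinusoidal component's frequency and modulus amounts to hard-coding its 2×2 block of the state transition matrix; a sketch (the block-diagonal placement within the full transition matrix is schematic):

    import numpy as np

    def sinusoid_block(freq_hz, fs, modulus=1.0):
        # 2x2 state-transition block of a sinusoidal component; modulus 1.0
        # places its complex eigenvalues exactly on the unit circle.
        w = 2 * np.pi * freq_hz / fs
        return modulus * np.array([[np.cos(w), -np.sin(w)],
                                   [np.sin(w),  np.cos(w)]])

    # e.g. fix the four hum-noise components of the EEG example a priori:
    blocks = [sinusoid_block(f, fs=256.0) for f in (50.0, 74.0, 100.0, 106.0)]
    # these blocks would then be excluded from the numerical optimisation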

Furthermore, the "overcomplete bases" case does not in itself pose a particular problem; MAR modelling typically provides more components than data dimensions, and linSS models with M > N can also be estimated, provided the model is observable. For example, in the simulation example presented in this paper it should in principle be possible to reconstruct all M = 4 sources from a subset of the N = 4 mixtures; when using only 3 of the mixtures, we have so far been able to reconstruct only one of the sources correctly, but this may be due to the time series being rather short and the sources being quite complicated and nonstationary; with longer time series we expect that this task would become achievable.
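Observability of a linSS model with more state components than data channels can be checked numerically; a sketch of the standard rank test (matrix names are generic):

    import numpy as np

    def is_observable(A, C):
        # Rank test on the observability matrix [C; CA; CA^2; ...; CA^(m-1)]
        # of the model x(t+1) = A x(t), y(t) = C x(t).
        m = A.shape[0]
        blocks = [C]
        for _ in range(m - 1):
            blocks.append(blocks[-1] @ A)
        return np.linalg.matrix_rank(np.vstack(blocks)) == m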

A number of extensions are possible, such as introducing separate dynamics for the covariance of the driving noise (corresponding to "stochastic volatility" models) by GARCH modelling of the noise; for the univariate case this possibility has recently been explored in Ref. [40], while a different approach to employing GARCH modelling of noise in the context of ICA was proposed by Cheung & Xu [10]. Another extension with considerable potential for the neurosciences is the inclusion of an external input into the state space dynamics, such as an additional time series (or a set of time series) describing external influences on the subject or patient; this could be information on sensory input, such as light or sound, or on the presence or absence of stimuli or tasks, as is the common situation in research on evoked potentials. The evoked potential would then be retrieved as the impulse response of the dynamics in state space; components known to be unrelated to the stimulus can be kept isolated from the stimulus. We intend to pursue these ideas in future work.
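In state space form this extension replaces the homogeneous state dynamics by x(t+1) = A x(t) + B u(t) + η(t), where u(t) collects the stimulus time series; a minimal sketch of the deterministic part (all matrices are placeholders):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, k = 8, 4, 1                      # state, observation and input dimensions
    A = 0.9 * np.eye(m)                    # placeholder state transition matrix
    B = rng.standard_normal((m, k))        # coupling of the external input
    C = rng.standard_normal((n, m))        # observation matrix
    x = np.zeros(m)
    for t in range(1250):
        u = np.array([1.0]) if t == 100 else np.array([0.0])  # one stimulus at t = 100
        x = A @ x + B @ u                  # the evoked potential is the impulse response
        y = C @ x                          # noise-free predicted observation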


Acknowledgements

This work was supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through project GA 673/1-1 and by the Japanese Society for the Promotion of Science (JSPS) through fellowship ID No. P 03059 and grant KIBAN B No. 173000922301. The authors are grateful to the authors of Ref. [34] for making their MILCA software available on the Internet. Data for the simulation study was kindly provided by G. van Luijetelaar and J. Welting (rat EEG) and by G. Pfister and J. Abshagen (Taylor-Couette); the human voice audio data was taken from http://www.jokes.thefunnybone.com/waves ("parental guidance is suggested"), as also in Ref. [34]. The ECG data set was made available to the public as part of the DAISY database (http://www.esat.kuleuven.ac.be/sista/daisy).

References

[1] H. Akaike. Markovian representation of stochastic processes and its application to the analysis of autoregressive moving average processes. Ann. Inst. Stat. Math., 26:363–387, 1974.

[2] H. Akaike. A new look at the statistical model identification. IEEE Trans. Autom. Contr., 19:716–723, 1974.

[3] H. Akaike. Prediction and entropy. In A. C. Atkinson and S. E. Fienberg, editors, A celebration of statistics, pages 1–24. Springer, Berlin, Heidelberg, New York, 1985.

[4] H. Akaike. Factor analysis and AIC. Psychometrika, 52:317–332, 1987.

[5] H. Akaike and G. Kitagawa, editors. The practice of time series analysis. Springer, Berlin, Heidelberg, New York, 1999.

[6] T. W. Anderson. The use of Factor Analysis in the statistical analysis of multiple time series. Psychometrika, 28:1–25, 1963.

[7] A. K. Barros and A. Cichocki. Extraction of specific signals with temporal structure. Neural Computation, 13:1995–2000, 2001.

[8] G. E. P. Box and G. M. Jenkins. Time series analysis, forecasting and control. Holden-Day, San Francisco, 2nd edition, 1976.

[9] D. S. Broomhead and G. P. King. Extracting qualitative dynamics from experimental data. Physica D, 20:217–236, 1986.

[10] Y. M. Cheung and L. Xu. Dual multivariate auto-regressive modeling in state space for temporal signal separation. IEEE Trans. Syst. Man Cyb., 33:386–398, 2003.

[11] S. Choi, A. Cichocki, and S. Amari. Flexible independent component analysis. J. VLSI Signal Processing, 26:25–38, 2000.

[12] S. Choi, A. Cichocki, H. Park, and S. Lee. Blind source separation and independent component analysis: A review. Neural Inf. Proc. Lett. Rev., 6:1–57, 2005.

[13] A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing. Wiley, New York, 2002.


[14] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.

[15] M. S. Grewal and A. P. Andrews. Kalman Filtering: Theory and Practice Using MATLAB. Wiley-Interscience, New York, 2001.

[16] E. J. Hannan. Multiple Time Series. Wiley, New York, 1970.

[17] H. H. Harman. Modern Factor Analysis. University of Chicago Press, Chicago, 3rd edition, 1976.

[18] A. Harvey, S. J. Koopman, and N. Shephard, editors. State space and unobserved component models. Cambridge University Press, Cambridge, 2004.

[19] A. Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks, 10:626–634, 1999.

[20] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.

[21] T. Inouye, K. Shinosaki, H. Sakamoto, S. Toi, S. Ukai, A. Iyama, Y. Katsuda, and M. Hirano. Quantification of EEG irregularity by use of the entropy of the power spectrum. Electroenc. Clin. Neurophys., 79:204–210, 1991.

[22] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, San Diego, 1970.

[23] K. G. Joreskog. Some contributions to maximum likelihood Factor Analysis. Psychometrika, 32:443–482, 1967.

[24] A. Jung and A. Kaiser. Considering temporal structures in Independent Component Analysis. In Proc. 4th Int. Symp. ICA BSS (ICA 2003), Apr. 2003, Nara (Japan), pages 95–100, 2003.

[25] Th. Kailath. Linear Systems. Information and System Sciences Series. Prentice-Hall, Englewood Cliffs, 1980.

[26] A. Kraskov, H. Stogbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, 2004.

[27] N. Levinson. The Wiener RMS error criterion in filter design and prediction. J. Math. Phys., 25:261–278, 1947.

[28] R. K. Mehra. Identification of stochastic linear systems using Kalman filter representation. AIAA Journal, 9:28–31, 1971.

[29] F. Meinecke, A. Ziehe, M. Kawanabe, and K.-R. Muller. A resampling approach to estimate the stability of one- or multidimensional independent components. IEEE Trans. Biomed. Eng., 49:1514–1525, 2002.

[30] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72:3634–3637, 1994.

[31] A. Neumaier and T. Schneider. Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans. Math. Softw., 27:27–57, 2001.


[32] C. Penland. Random forcing and forecasting using principal oscillation pattern analysis. Monthly Weather Review, 117:2165–2185, 1989.

[33] Y. L. Pi and N. C. Mickleborough. Modal identification of vibrating structures using ARMA model. J. Engineer. Mechanics, 115:2232–2250, 1989.

[34] H. Stogbauer, A. Kraskov, S. A. Astakhov, and P. Grassberger. Least-dependent-component analysis based on mutual information. Phys. Rev. E, 70:066123, 2004.

[35] R. Vautard and M. Ghil. Singular spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series. Physica D, 35:395–424, 1989.

[36] M. West. Time series decomposition. Biometrika, 84:489–494, 1997.

[37] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, Berlin, Heidelberg, New York, 2nd edition, 1997.

[38] M. West, R. Prado, and A. D. Krystal. Evaluation and comparison of EEG traces: Latent structure in nonstationary time series. J. Amer. Stat. Assoc., 94:1083–1095, 1999.

[39] P. Whittle. On the fitting of multivariate autoregressions, and the approximate canonical factorization of a spectral density matrix. Biometrika, 50:129–134, 1963.

[40] K. F. K. Wong, A. Galka, O. Yamashita, and T. Ozaki. Modelling non-stationary variance in EEG time series by state space GARCH model. Computers Biol. Med., in press, 2006.

[41] H. Wu and J. Principe. Generalized anti-Hebbian learning for source separation. In Proceedings of ICASSP'99, pages 1073–1076, Piscataway, NJ, 1999. IEEE Press.

[42] L. Xu. Temporal BYY learning for state space approach, hidden Markov model and blind source separation. IEEE Trans. Signal Process., 48:2132–2144, 2000.

[43] A. Ziehe and K.-R. Muller. TDSEP – an efficient algorithm for blind separation using time structure. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proceedings 8th Int. Conf. Artificial Neural Networks (ICANN'98), pages 675–680. Springer, Berlin, 1998.


Table Captions

Table 1: Results of the simulation study, using different decomposition methods. M: number of components, MI: mutual information, Npar: number of model parameters, AIC: Akaike Information Criterion, API: Amari Performance Index, time/s: time required for computation. FA: Factor Analysis, FastICA: Hyvarinen's ICA algorithm, nG-ICA: non-Gaussian ICA algorithm, MILCA: mutual-information least-dependent component analysis, MAR: multivariate autoregressive modelling, linSS: linear state space modelling. τm denotes the model order of MAR models; in the case of linSS models it denotes the model order of the MAR model which was used for constructing the state space. For details see text.

Table 2: Results of the ECG data analysis, using different decomposition methods. For abbreviations, see the caption of Table 1. Note that the FA decomposition refers to only 7 sources, since the algorithm failed to produce more sources.

Table 3: Results of the EEG data analysis, using different decomposition methods. For abbreviations, see the caption of Table 1.


Tables

Method                          M          MI     Npar  AIC       API    time/s
--------------------------------------------------------------------------------
(true sources)                             0.576
(raw data)                                 1.831
FA                              4          0.645  14    9780.94   0.580  0.82
nG-ICA                          4          0.533  20    9446.62   0.475  94.44
FastICA                         4          0.548  -     -         0.267  0.44
MILCA                           4          0.567  -     -         0.447  26.02
MILCA-delay, q = 2              4          0.541  -     -         0.623  146.19
MILCA-delay, q = 3              4          0.544  -     -         0.624  257.43
MILCA-delay, q = 7              4          0.547  -     -         0.592  709.03
MAR, τm = 2                     4 (of 7)   1.075  38    -1221.60  1.006  0.88
MAR, τm = 3                     4 (of 8)   0.720  54    -1910.23  0.493  1.32
MAR, τm = 7                     4 (of 16)  0.507  118   -2082.21  0.616  1.16
linSS, filter, τm = 7           4          0.875  42    9477.15   0.616  0.21
linSS, smoother, τm = 7                    0.872
linSS, filter, τm = 7, opt.     4          0.558  42    -2212.61  0.446  4767.04
linSS, smoother, τm = 7, opt.              0.581

Table 1:

Method                          M          MI     Npar  AIC        time/s
--------------------------------------------------------------------------
(raw data)                                 6.437
FA                              7          1.503  44    14624.41   1.30
nG-ICA                          8          1.076  72    4377.89    23743.52
FastICA                         8          1.442  -     -          0.88
MILCA                           8          1.037  -     -          2152.76
MILCA-delay, q = 2              8          1.190  -     -          6853.45
MILCA-delay, q = 3              8          1.224  -     -          18244.60
MILCA-delay, q = 10             8          1.172  -     -          75097.78
MAR, τm = 2                     8 (of 12)  2.227  156   -27333.05  5.37
MAR, τm = 3                     8 (of 16)  2.236  220   -28445.05  5.62
MAR, τm = 10                    8 (of 44)  2.653  668   -29928.38  6.53
linSS, filter, τm = 10          8          2.570  132   469374.02  0.76
linSS, smoother, τm = 10                   2.866
linSS, filter, τm = 10, opt.    8          1.187  132   -24951.43  94677.52
linSS, smoother, τm = 10, opt.             1.301

Table 2:


Method                          M          MI     Npar  AIC        time/s
--------------------------------------------------------------------------
(raw data)                                 3.269
FA                              4          0.970  14    8612.88    6.52
nG-ICA                          4          0.895  20    7945.75    2685.83
FastICA                         4          0.977  -     -          0.55
MILCA                           4          0.679  -     -          87.25
MILCA-delay, q = 2              4          0.677  -     -          164.65
MILCA-delay, q = 3              4          0.674  -     -          286.18
MILCA-delay, q = 10             4          0.704  -     -          1069.32
MAR, τm = 2                     6 (of 6)   1.720  38    -8891.13   3.14
MAR, τm = 3                     6 (of 6)   0.989  54    -8905.76   2.77
MAR, τm = 43                    8 (of 89)  2.037  1926  -10982.66  7.60
linSS, filter, τm = 43          8          1.094  94    72844.49   0.37
linSS, smoother, τm = 43                   1.089
linSS, filter, τm = 43, opt.    8          1.592  94    -11294.52  64192.50
linSS, smoother, τm = 43, opt.             1.678

Table 3:

Figure Captions

Fig. 1: True sources: human EEG, rat EEG, audio signal and fluid velocity signal (top left); simulated data, resulting from mixing the sources (top right); sources reconstructed by FastICA (middle left), by MAR (τm = 7) (middle right), by linSS (bottom left) and by optimised linSS (bottom right). True physical units for amplitude and time are different for each source. For details see text.

Fig. 2: ECG data (top left); sources reconstructed by FastICA (top right), by MILCA (middle left), by MAR (τm = 10) (middle right), by linSS (bottom left) and by optimised linSS (bottom right).

Fig. 3: EEG data (top left); sources reconstructed by FastICA (top right), by MILCA (middle left), by MAR (τm = 43) (middle right), by linSS (bottom left) and by optimised linSS (bottom right).


Figures

Figure 1:


Figure 2:


Figure 3:
