
Comparing Metrics to Evaluate Performance of Regression Methods for Decoding of Neural Signals

Martin Spüler1, Andrea Sarasola-Sanz2, Niels Birbaumer2,3, Wolfgang Rosenstiel1, Ander Ramos-Murguialday2,4

Abstract— The use of regression methods for decoding of neural signals has become popular, with its main applications in the field of Brain-Machine Interfaces (BMIs) for control of prosthetic devices or in the area of Brain-Computer Interfaces (BCIs) for cursor control. When new methods for decoding are being developed, or the parameters of existing methods are to be optimized to increase performance, a metric is needed that gives an accurate estimate of the prediction error. In this paper, we evaluate different performance metrics regarding their robustness for assessing prediction errors. Using simulated data, we show that different kinds of prediction error (noise, scaling error, bias) have different effects on the different metrics, and we evaluate which metrics are best for assessing the overall prediction error, as well as the individual types of error. Based on the obtained results, we conclude that the most commonly used metrics, the correlation coefficient (CC) and the normalized root-mean-squared error (NRMSE), are well suited for the evaluation of cross-validated results, but should not be used as the sole criterion for cross-subject or cross-session evaluations.

I. INTRODUCTION

A Brain-Machine Interface (BMI) or Brain-Computer Interface (BCI) is a device that translates neural signals into control signals to drive an external device or a computer. For the decoding of neural signals, machine learning methods are used, which can be grouped into two areas: classification methods and regression methods. Classification methods deliver a discrete output (like a yes/no response) and are often used in BCIs for communication purposes. Regression methods, on the other hand, deliver a continuous output (like movement velocity) and are the focus of this paper. While the control of a robotic arm [1] is the most prominent use, other applications include the control of prosthetic devices [2] or a computer cursor [3]. There are also examples from related fields that use regression methods to decode human movement trajectories from neural signals [4], or use brain signals to predict electrical stimulation parameters [5] or to estimate a user's mental workload [6].

When it comes to developing and improving methods for decoding of neural signals, estimating the performance of the prediction model is crucial to the whole process.

1MS and WR are with the Computer Science Department, University of Tübingen, Tübingen, Germany

2ASS, NB and ARM are with the Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany

3NB is also with the Ospedale San Camillo, IRCCS, Venice, Italy

4ARM is also with TECNALIA, Health Technologies Department, San Sebastian, Spain

For classification methods, there are established performance metrics [7] and studies that compare those metrics for their use in Brain-Computer Interfaces [8].

When using regression methods for decoding neural signals, different performance metrics are being used, with the correlation coefficient (CC) and the root mean squared error (RMSE), or its normalized version (NRMSE), being the most frequent ones. While it is good scientific practice to state multiple performance metrics in a publication, one needs to decide on a single metric when it comes to automatic parameter optimization (e.g., in a grid search). As those metrics capture different properties of the prediction performance, it is unclear which one is best suited overall.

A. Desired properties of a good performance metric

When trying to find a performance metric that is best suited overall, we first have to define which properties are desirable. To do so, we have to look at the most common factors that lead to bad prediction performance. The most important factor is noise: there is noise in the recorded neural signals, and other possible causes like ambiguous data, which lead to a noisy prediction. When the prediction is evaluated using cross-validation, noise is arguably the biggest source of prediction error. However, when comparing a prediction model across sessions or across subjects, the so-called non-stationarity of the data becomes a big issue. Non-stationarity describes the fact that the probability distribution of the data changes over time (e.g., fatigue during the experiment changes the neural signals) and differs between subjects. In a cross-validation, training and testing data are both drawn from the whole dataset, which means that they have roughly the same probability distribution. When using a cross-session or cross-subject evaluation, the probability distributions of the training and testing data differ. Although there are methods to alleviate the problem of non-stationarity [9], it can be a large issue in cross-subject and cross-session evaluation and can lead to prediction bias and scaling errors.

Therefore, if we want to evaluate a prediction model across subjects or sessions, the performance metric should not only be able to capture noise, but should also work reliably when the results are affected by a prediction bias or a scaling error. Further, the metric should be invariant to the total scaling of the data (but not to the scaling error), which makes it easier to compare results between different datasets. To be able to compare different regression results (e.g., to decide on the optimal parameters or the best regression model), the performance metric should relate to the prediction error in a monotonic, ideally linear, fashion.

While there are also application-centered metrics (like time-to-reach-target), we ignore those metrics in the course of this paper, since they do not allow a comparison of methods across different applications.

In this work, we tested different performance metrics with respect to how well they capture different error properties (noise, bias, and scaling) and which metric delivers the most robust results overall.

II. METHODS

In the following, we describe what performance metrics we used, how we generated simulated data, and how we evaluated the robustness of the metrics.

A. Performance metrics

Using the example of predicting a movement trajectory from brain signals, $y = (y_1, \ldots, y_n)$ denotes the actual movement trajectory for $n$ time points, and $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)$ the corresponding predicted trajectory. Based on this example, we tested the following performance metrics:

• Correlation coefficient (CC): Pearson's correlation coefficient, computed as

$$\mathrm{CC}(y, \hat{y}) = \frac{\sum_{i=1}^{n}(y_i - m)(\hat{y}_i - \hat{m})}{\sqrt{\sum_{i=1}^{n}(y_i - m)^2 \cdot \sum_{i=1}^{n}(\hat{y}_i - \hat{m})^2}} \qquad (1)$$

with $m$ and $\hat{m}$ being the mean of $y$ and $\hat{y}$, respectively. In some publications the squared correlation coefficient is used, but due to the monotonic relationship between CC and its squared version, there is no difference in robustness between the two.

• Normalized root mean squared error (NRMSE):

$$\mathrm{NRMSE}(y, \hat{y}) = \frac{\sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}}{y_{\max} - y_{\min}} \qquad (2)$$

As the RMSE depends on the total scaling of the dataset, the normalized version (NRMSE) should be used to allow a comparison of results between datasets.

• Signal-to-Noise Ratio (SNR): as there are different definitions of the SNR, we used the following one:

$$\mathrm{SNR}(y, \hat{y}) = \frac{\mathrm{var}(y - \hat{y})}{\mathrm{var}(y)} \qquad (3)$$

• Coefficient of determination (COD): there also exist different definitions of the coefficient of determination, one of them being the squared correlation. In this paper we used the COD defined by

$$\mathrm{COD}(y, \hat{y}) = \frac{\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{\sum_{t=1}^{n}(y_t - \mathrm{mean}(y))^2} \qquad (4)$$

• Global deviation (GD): defined as the squared average difference

$$\mathrm{GD}(y, \hat{y}) = \left(\frac{\sum_{t=1}^{n}(y_t - \hat{y}_t)}{n}\right)^2 \qquad (5)$$
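As a concrete illustration, the five metrics above can be computed with a few lines of NumPy. This is a minimal sketch; the function names and the NumPy-based implementation are ours and not part of the original paper:

```python
import numpy as np

def cc(y, y_pred):
    """Pearson's correlation coefficient between actual and predicted trajectory (Eq. 1)."""
    return np.corrcoef(y, y_pred)[0, 1]

def nrmse(y, y_pred):
    """Root mean squared error, normalized by the range of the actual trajectory (Eq. 2)."""
    rmse = np.sqrt(np.mean((y - y_pred) ** 2))
    return rmse / (np.max(y) - np.min(y))

def snr(y, y_pred):
    """Ratio of the variance of the residual to the variance of the actual trajectory (Eq. 3)."""
    return np.var(y - y_pred) / np.var(y)

def cod(y, y_pred):
    """Coefficient of determination as defined in Eq. 4: residual sum of squares over total sum of squares."""
    return np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

def gd(y, y_pred):
    """Global deviation: squared mean difference between actual and predicted trajectory (Eq. 5)."""
    return np.mean(y - y_pred) ** 2
```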

Fig. 1. Each of the three subplots shows an example of a simulated trajectory prediction, with the actual trajectory (red) and the predicted trajectory (blue). The first run (top) shows a larger amount of noise with no scaling error and no bias. The second run (middle) shows a run with a prediction bias. The third run (bottom) shows a run with a scaling error.

As will be shown later, the performance metrics capture different properties of the prediction error, which is why we also tested different combinations of the metrics to find one metric (or combination of metrics) that is best suited. Due to the limited space, we only present results for those combinations of metrics that yielded meaningful results.
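The paper does not spell out how the metrics were combined; judging from the labels used in the result tables (CC-NRMSE, CC/NRMSE, CC+SNR), the following sketch assumes simple arithmetic combinations of the functions defined above. This reading is an assumption, not the authors' stated definition:

```python
def cc_minus_nrmse(y, y_pred):
    # Assumed reading of "CC-NRMSE": difference of the two metrics.
    return cc(y, y_pred) - nrmse(y, y_pred)

def cc_over_nrmse(y, y_pred):
    # Assumed reading of "CC/NRMSE": ratio of the two metrics.
    return cc(y, y_pred) / nrmse(y, y_pred)

def cc_plus_snr(y, y_pred):
    # Assumed reading of "CC+SNR": sum of the two metrics.
    return cc(y, y_pred) + snr(y, y_pred)
```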

B. Simulation and evaluation procedure

To test the different performance metrics, we used the example of predicting a movement trajectory and generated an artificial one-dimensional trajectory $y$, which consists of a repeated sinusoidal movement with a length of $10^5$ samples. The maximum amplitude of $y$ was set such that the variance of $y$ is 1. Based on $y$ we generated $\hat{y}$, the predicted movement trajectory. To vary the effects of noise, bias and scaling errors, we introduced the factor $e_n$ to specify the amount of noise in the prediction, $e_b$ to vary the prediction bias, and $e_s$ to control the scaling error. $N_{\sigma(0,1)}$ denotes a noise vector of the same length as the trajectory. As we assume that noise in neural recordings is Gaussian, each of its values is drawn from a normal distribution with mean 0 and variance 1. The predicted trajectory is then generated by

$$\hat{y} = e_s \cdot (e_n \cdot N_{\sigma(0,1)} + e_b + y) \qquad (6)$$

For the evaluation of the different metrics, we performed multiple runs. In each run, the error factors were chosen randomly from predefined intervals. The interval $e_n \in [1, 4]$ was chosen since these values resulted in a CC between 0.2 and 0.8, which are approximately the lowest and highest values published for trajectory prediction based on various brain signals. The interval $e_b \in [0, 0.5]$ was chosen since the minimum value of 0 is expected for cross-validated results, while the maximum value stems from personal experience with cross-subject or cross-session validated data. For $e_s \in [0.5, 1.5]$, the interval was chosen since scaling errors of up to 50% were observed in EEG-based cross-subject workload prediction [6].
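A minimal sketch of this data generation, assuming a sine wave rescaled to unit variance as the ground-truth trajectory (its period is not specified in the paper and is chosen arbitrarily here), using Eq. 6 and the error-factor intervals given above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10 ** 5
t = np.arange(n_samples)
y = np.sin(2 * np.pi * t / 1000)   # repeated sinusoidal movement (period chosen for illustration)
y = y / np.std(y)                  # rescale so that var(y) = 1

def simulate_prediction(y, e_n, e_b, e_s, rng):
    """Generate a predicted trajectory according to Eq. 6: y_hat = e_s * (e_n * N + e_b + y)."""
    noise = rng.standard_normal(len(y))   # N_sigma(0,1): Gaussian noise, mean 0, variance 1
    return e_s * (e_n * noise + e_b + y)

# Error factors drawn from the intervals used in the paper
e_n = rng.uniform(1.0, 4.0)   # amount of noise
e_b = rng.uniform(0.0, 0.5)   # prediction bias
e_s = rng.uniform(0.5, 1.5)   # scaling error
y_pred = simulate_prediction(y, e_n, e_b, e_s, rng)
```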

Exemplary trajectories, as well as the results of simulated trajectory predictions with different amounts of noise, bias and scaling error, are shown in Figure 1.

To evaluate how well the different metrics are able to capture the three error factors individually, we performed 1000 runs for each of the error factors, in which only that error factor was randomly chosen from its predefined interval, while the other factors were set to default values ($e_n = 0.1$, $e_b = 0$, $e_s = 1$), so that the respective types of error have no (or only an insignificant) effect. Since a robust metric should have a monotonic relation to the amount of error, and this relation should ideally be linear, we calculated Pearson's correlation coefficient to assess how well a metric is able to reflect the individual types of error.

To assess how well the metrics are able to capture the individual types of error when all three types occur simultaneously (e.g., in a cross-subject or cross-session prediction), we also performed 1000 runs with all error factors being randomly chosen from their predefined intervals, and again used Pearson's correlation coefficient to assess how robustly the metrics reflect the amount of each individual error, as well as the overall error.
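Roughly, this evaluation loop can be sketched as follows, reusing the helper functions from the earlier sketches. Two details are assumptions on our part, since the paper does not spell them out: the amount of scaling error is measured here as |e_s − 1|, and the absolute value of Pearson's correlation is reported, since some metrics decrease while the error increases:

```python
import numpy as np
from scipy.stats import pearsonr

metrics = {"CC": cc, "NRMSE": nrmse, "SNR": snr, "COD": cod, "GD": gd}
n_runs = 1000

# Scenario with all three error factors varied simultaneously (cross-subject/cross-session case)
factors = np.column_stack([
    rng.uniform(1.0, 4.0, n_runs),   # e_n
    rng.uniform(0.0, 0.5, n_runs),   # e_b
    rng.uniform(0.5, 1.5, n_runs),   # e_s
])

scores = {name: np.empty(n_runs) for name in metrics}
for i, (e_n, e_b, e_s) in enumerate(factors):
    y_pred = simulate_prediction(y, e_n, e_b, e_s, rng)
    for name, fn in metrics.items():
        scores[name][i] = fn(y, y_pred)

# Robustness: |Pearson correlation| between each metric and the amount of each error type
for name in metrics:
    r_noise = abs(pearsonr(factors[:, 0], scores[name])[0])
    r_bias = abs(pearsonr(factors[:, 1], scores[name])[0])
    r_scale = abs(pearsonr(np.abs(factors[:, 2] - 1.0), scores[name])[0])
    print(f"{name:8s}  noise={r_noise:.2f}  bias={r_bias:.2f}  scaling={r_scale:.2f}")
```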

III. RESULTS

The results of the simulation in which each error factor was varied individually are shown in Table I. As expected by definition, CC is invariant to bias and scaling errors, but captures noise well. NRMSE, as well as the combination CC-NRMSE, has a Pearson correlation near 1, meaning that both capture all kinds of errors very well if only one error factor is present in the data. Worth mentioning is the COD, which captures noise and bias errors well, although it does not work as well for scaling errors (r = 0.65).

TABLE I
Pearson's correlation between the performance metrics and the amount of error, when the prediction error is only influenced by one factor (either noise, bias, or scaling). Results with values near one (≥ 0.95) are highlighted in bold.

            Noise   Bias   Scaling
CC          0.98    0.03   0.07
NRMSE       1.00    1.00   1.00
SNR         0.96    0.03   0.65
COD         0.96    0.97   0.65
GD          0.26    0.97   0.04
CC-NRMSE    1.00    0.99   1.00
CC/NRMSE    0.91    0.69   0.52
CC+SNR      0.48    0.03   0.65

Since it is unrealistic that only a bias or only a scaling error would occur during neural signal decoding, and rather all error factors will be present to varying degrees, we also performed simulations with all factors being varied simultaneously. The results of this simulation can be found in Table II. It can be seen that CC still captures noise robustly, remaining invariant to bias and scaling errors. However, the results for NRMSE and CC-NRMSE change drastically when all error factors are present simultaneously in the data, so that both methods can still be used as an indication of the amount of noise in the prediction, but fail to assess a bias or scaling error.

TABLE II
Correlation between the performance metrics and the amount of error, separated by each factor, as well as the average over all factors. For these results the prediction error was affected by all factors simultaneously (noise, bias, and scaling). Best results are marked bold.

            Noise   Bias   Scaling   Mean
CC          0.98    0.02   0.06      0.35
NRMSE       0.74    0.02   0.09      0.28
SNR         0.85    0.01   0.32      0.39
COD         0.84    0.07   0.33      0.41
GD          0.00    0.76   0.07      0.28
CC-NRMSE    0.87    0.02   0.09      0.33
CC/NRMSE    0.83    0.03   0.00      0.29
CC+SNR      0.09    0.02   0.55      0.22

Averaged over all three types of error, COD performs best in terms of estimating the overall prediction error, but it still fails to capture bias and has problems capturing scaling errors. Due to its invariance to bias and scaling, CC is best suited if only the amount of noise in a prediction is to be estimated. GD performs best for capturing a prediction bias, while CC+SNR is the best method for estimating a scaling error.

To illustrate how the metrics relate to the amount of noise, Figure 2 shows the results of the simulation run in which all three error factors were varied simultaneously. Due to space constraints, only the scatter plots for the amount of noise are shown, and not those for the amount of bias and scaling error.

IV. DISCUSSION

In this study, we used the example of trajectory prediction to investigate the reliability of several performance metrics for assessing the prediction error. To this end, we performed simulations in which a trajectory is predicted, with the prediction being affected by three different kinds of error to a varying degree.

When looking at predictions based on real neural signals, we do not know how noisy the prediction is or how much it is affected by bias or scaling errors; therefore, we need performance metrics to estimate the amount of prediction error. The simulation, on the other hand, allows us to generate predictions for which the amount of noise, bias and scaling error is exactly known, so that the performance metrics can be evaluated with respect to how well they assess these errors.

By using simulations, we could show which performance metrics are sensitive to which kind of error.



Fig. 2. Results of the simulation in which the prediction is affected by all error types. Each scatter plot shows, for a different metric, how well that metric captures the amount of prediction noise when a random amount of prediction bias and scaling error is simultaneously present. Each circle represents one run with a random amount of noise, bias and scaling error.

To understand which performance metric is best suited to assess the overall prediction error, we have to consider two different scenarios:

For cross-validated results, in which the data is similarly distributed in the training and test sets, bias and scaling errors are not to be expected, and a noisy prediction will be the main cause of error. Based on our results, the most popular metrics, CC and NRMSE, give a reliable estimate of the amount of noise in the prediction and can therefore both be used as reliable performance metrics. The same holds for SNR, COD and CC-NRMSE.

However, if the results of a regression are to be evaluated in a cross-subject or cross-session manner, the data distribution may differ between training and test sets, since the data were obtained from different subjects or sessions, which can lead to a prediction bias or a scaling error. When all three factors (noise, bias, scaling) affect the prediction error simultaneously, the results are different. Due to CC being invariant to bias and scaling, it is the best method to assess noise effects in such a scenario, but because of this invariance it completely fails to capture possible errors arising from a prediction bias or scaling error. To assess the bias in the prediction, GD is the method of choice, since it is the only method that allows a reasonable estimate of the prediction bias. When it comes to assessing the scaling error, CC+SNR is the method that works best. While these three methods (CC, GD, CC+SNR) can each assess one type of error well, there is no method that performs well on all types of error. On average, the coefficient of determination (COD) is the method with the overall best results, but although it is better than most methods, it does not capture bias or scaling errors satisfactorily.

A. Conclusion

In conclusion, it seems that the most popular metrics, CC and NRMSE, can reliably be used for the evaluation of cross-validated results, but are not recommended as the sole criterion in a cross-subject or cross-session evaluation. As CC does not take prediction bias and scaling errors into account, using only this metric could lead to wrong decisions when comparing prediction performance in those scenarios.

For cross-subject or cross-session evaluations, we recommend looking at multiple metrics (CC, GD, CC+SNR) to obtain a reliable assessment of the prediction performance. If only one performance metric can be used (e.g., for parameter optimization in a grid search), none of the tested methods works satisfactorily for all types of error, but the coefficient of determination (COD) is the best choice, because it delivers the best results on average.

ACKNOWLEDGMENT

This study was funded by the Baden-Württemberg Stiftung (GRUENS), the Volkswagen Stiftung, the WissenschaftsCampus Tübingen, the Deutsche Forschungsgemeinschaft (DFG, Grant RO 1030/15-1, KOMEG), the Indian-European collaborative research and technological development projects (INDIGO-DTB2-051) and the Natural Science Foundation of China (NSFC 31450110072). Andrea Sarasola-Sanz is supported by a La Caixa-DAAD scholarship.

REFERENCES

[1] L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt et al., "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm," Nature, vol. 485, no. 7398, pp. 372–375, 2012.

[2] K. Ganguly and J. M. Carmena, "Emergence of a stable cortical map for neuroprosthetic control," PLoS Biology, vol. 7, no. 7, p. e1000153, 2009.

[3] S.-P. Kim, J. D. Simeral, L. R. Hochberg, J. P. Donoghue, and M. J. Black, "Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia," Journal of Neural Engineering, vol. 5, no. 4, p. 455, 2008.

[4] M. Spüler, W. Rosenstiel, and M. Bogdan, "Predicting wrist movement trajectory from ipsilesional ECoG in chronic stroke patients," in Proceedings of the 2nd International Congress on Neurotechnology, Electronics and Informatics (NEUROTECHNIX 2014), Oct. 2014, pp. 38–45.

[5] A. Walter, G. Naros, M. Spüler, A. Gharabaghi, W. Rosenstiel, and M. Bogdan, "Decoding stimulation intensity from evoked ECoG activity," Neurocomputing, vol. 141, pp. 46–53, 2014.

[6] C. Walter, P. Wolter, W. Rosenstiel, M. Bogdan, and M. Spüler, "Towards cross-subject workload prediction," in Proceedings of the 6th International Brain-Computer Interface Conference, Graz, Austria, Sep. 2014.

[7] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009.

[8] M. Billinger, I. Daly, V. Kaiser, J. Jin, B. Z. Allison, G. R. Müller-Putz, and C. Brunner, "Is it significant? Guidelines for reporting BCI performance," in Towards Practical Brain-Computer Interfaces. Springer, 2013, pp. 333–354.

[9] M. Spüler, W. Rosenstiel, and M. Bogdan, "Principal component based covariate shift adaption to reduce non-stationarity in a MEG-based brain-computer interface," EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp. 1–7, 2012.
