
Universitat Politècnica de Catalunya
Master in Innovation and Research in Informatics

Data Mining and Business Intelligence
Report

Machine Learning
Detecting early stage Parkinson Disease according to Patient’s voice features

Submitted by
Francesc Busquet
Alexandra Yamaui

June 24, 2017

Contents

Introduction and Problem Understanding

Data
    Data pre-processing
    Data visualization and transformation
    Feature Selection
    Clustering and Profiling

Modelling
    Model Tuning and Validation

Conclusions and Future Work

Introduction and Problem Understanding

Parkinson’s disease (PD) is a long-term degenerative disorder of the central nervous system. It belongs to a group of conditions called motor system disorders, which result from the loss of dopamine-producing cells. The causes of PD are generally unknown, although it is believed to involve genetic and environmental factors [1].

The main symptoms of PD are tremor (involuntary shakiness), rigidity (stiffness), bradykinesia, hypokinesia or akinesia (i.e. movement difficulty) and postural instability (problems with balance). Other symptoms are also common in people who suffer from PD, such as depression, dementia, sleep disturbances and smelling difficulties [2].

PD is a chronic and progressive disease, i.e. it persists over a long period and its symptoms grow worse over time. There is currently no cure for PD [1], although a variety of medications can significantly relieve its symptoms. As mentioned before, PD is related to a loss of dopamine-producing cells, hence the treatment strategy is to increase dopamine signaling.

Additionally, current diagnosis of PD is limited to identifying the disease by the presence of motor symptoms [3], when it has already progressed to an advanced stage in which those symptoms are clearly evident. However, a growing amount of data from the medical literature indicates the existence of several cases of improvement, which may be associated with early therapeutic intervention. The most palpable benefits of early intervention are a reduction in symptoms, principally dyskinesia, and the delay of levodopa initiation [4].

Moreover, recent studies present strong evidence of a relation between degrading voice performance and PD progression [5], linking hypophonia and dysphonia with PD [6]. Furthermore, vocal impairment could be one of the earliest indicators of PD [7]. Additionally, several studies have shown that age and gender are strong factors contributing to the disease [8].

There are around 300,000 patients with PD in Spain. The disease has an important impact on the quality of life of patients and their families, and it generates costs of more than $17,000 per patient every year, being substantially dependent on clinical intervention [9].

Typically, the progression of PD symptoms is tracked using the Unified Parkinson’s Disease Rating Scale (UPDRS). UPDRS requires the patient’s presence in the clinic and a time-consuming examination carried out by trained medical staff. Hence it is costly and usually done when PD has progressed to an advanced stage [10]. For this reason, our goal in this project is to predict the value of UPDRS from voice performance and other attributes such as age and gender.

This document is divided into 3 main sections. In the first one, we describe our data while performing a data exploration process: we pre-process the data and perform feature extraction, as well as visualizations and clustering for a clearer understanding. In section two, we use Artificial Neural Networks (ANN) and Support Vector Machines (SVM) to predict UPDRS. Finally, in the third and final section we summarize the main results and considerations of our analysis, pointing out any limitations found in our work.


Data

In this project we have used the Parkinsons Telemonitoring Data Set from the UCI Machine Learning Repository. This dataset is composed of a range of biomedical measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device, developed by the Intel Corporation, to record speech signals for remote symptom progression monitoring in the patients’ homes [10]. There is a total of 5,875 voice recordings from these patients.

The dataset contains 22 variables, 16 of which refer to voice features:

1. Subject#: Integer that uniquely identifies each patient.
2. Age: Integer that identifies the age of each patient.
3. Sex: Binary variable that identifies the gender of the patient, being ’0’ for male and ’1’ for female.
4. Test time: Integer referring to the number of days since recruitment into the trial.
5. Motor UPDRS: Clinician’s motor UPDRS score, linearly interpolated.
6. Total UPDRS: Clinician’s total UPDRS score, linearly interpolated.
7. Jitter Percentage: measure of variation in fundamental frequency.
8. Jitter (Absolute): measure of variation in fundamental frequency.
9. Jitter (RAP): measure of variation in fundamental frequency.
10. Jitter (PPQ5): measure of variation in fundamental frequency.
11. Jitter (DDP): measure of variation in fundamental frequency.
12. Shimmer: measure of variation in amplitude.
13. Shimmer(dB): measure of variation in amplitude.
14. Shimmer:APQ3: measure of variation in amplitude.
15. Shimmer:APQ5: measure of variation in amplitude.
16. Shimmer:APQ11: measure of variation in amplitude.
17. Shimmer:DDA: measure of variation in amplitude.
18. NHR: Noise-to-Harmonics Ratio, measure of the ratio of noise to tonal components in the voice.
19. HNR: Harmonics-to-Noise Ratio, measure of the ratio of noise to tonal components in the voice.
20. RPDE: Recurrence Period Density Entropy, a nonlinear dynamical complexity measure to detect general voice disorders.
21. DFA: Detrended Fluctuation Analysis, a scaling analysis method used to estimate long-range power-law correlation exponents in noisy signals.
22. PPE: Pitch Period Entropy, a nonlinear measure of fundamental frequency variation (dysphonia).

Since UPDRS is designed to monitor PD, it clearly relates to the patient’s Parkinson level. Therefore, the UPDRS variable is a natural candidate for the response variable.

The UPDRS score is composed of 4 different parts referring to mentation, behavior and mood; activities of daily living; motor examination; and complications of therapy. The total UPDRS score spans from 0 to 176, where 0 denotes perfectly healthy individuals and 176 total disability. In contrast, motor UPDRS is a subset of total UPDRS and ranges from 0 to 108, where 0 denotes symptom free and 108 severe motor impairment. Speech appears explicitly in the UPDRS in Part 2 (Activities of Daily Living) and Part 3 (Motor Examination).

Data pre-processing

This section covers the first step of the data analysis, where we summarize the main characteristics of our dataset and apply some transformations to improve its quality. Our aim is to remove irrelevant, redundant, noisy and unreliable data to improve the representativeness and quality of the dataset. In this subsection we focus on missing values and outliers.

Analyzing the missingness pattern in our data, we found no explicit missing values. Likewise, analyzing the minimum and maximum of each variable, we did not observe any strange values that could be interpreted as missing.

Next, we detect the outliers in our data. To do so, we use an anomaly detection method called Local Outlier Factor (LOF), which measures the local deviation of a given data point with respect to its neighbours (in this case we consider 5 neighbours, i.e. k = 5). For graphical visualization, we plot the local outlier scores, highlighting the points whose scores are far higher than the others with a different color and a size proportional to their score.

Figure 1: Local Outlier Factor Plot

Observing the plot above, we can notice several points which stand out from the rest. The extreme outliers belong to patients with IDs 7, 8, 13, 14, 18, 22, 35, 36 and 39. However, looking at those points individually, they seem to add important information to our analysis rather than being anomalies. Hence, we keep them in our analysis.
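To make the LOF score concrete, the following is a minimal sketch of the computation with k = 5, written in plain Python. It is not the implementation used in this report; the toy points are hypothetical, and ties in the k-nearest-neighbour search are broken arbitrarily for simplicity.

```python
import math

def local_outlier_factor(points, k=5):
    """Return the LOF score of every point (scores near 1 indicate inliers)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours = []   # indices of the k nearest neighbours of each point
    k_distance = []   # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbours[i]) / k
        lrd.append(1.0 / mean_reach if mean_reach > 0 else float("inf"))

    # LOF: mean ratio between the neighbours' densities and the point's own.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical toy data: a regular 4x4 grid plus one isolated point.
points = [(x, y) for x in range(4) for y in range(4)] + [(10.0, 10.0)]
scores = local_outlier_factor(points, k=5)
most_anomalous = scores.index(max(scores))  # index of the isolated point
```

Points inside the grid obtain scores close to 1, while the isolated point obtains a much larger score, which is the behaviour exploited when highlighting points in the LOF plot.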


Data visualization and transformation

Now we represent graphically the distribution of the different variables in our dataset with a histogram of each of them:

Figure 2: Variables distribution

We can see from the histograms that our dataset contains a high proportion of people between 60 and 75 years old and a high proportion of male individuals. Furthermore, Total_UPDRS looks approximately normally distributed, with most patients in the range of 25 to 35 of the UPDRS score. In the same way, HNR, RPDE and PPE are also approximately normally distributed; however, the rest of the variables are strongly skewed. The variables related to Jitter and Shimmer, as well as NHR, are concentrated close to zero. Hence, a transformation of those variables is needed, and we apply a logarithmic transformation to them. Below we can see again the representation of the different variables, showing how the transformations change the shape of some distributions:


Figure 3: Variables distribution with variables related to Jitter, Shimmer and NHR transformed

Now we can see that the variables to which we applied the logarithmic transformation look better: their distributions are closer to normal. We can now proceed with the analysis of our data.
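The effect of the logarithmic transformation can be checked numerically with the skewness of a variable before and after the transformation. The sketch below assumes a base-10 logarithm (the base that appears later in the fitted model) and uses hypothetical values concentrated near zero, in the spirit of the Jitter measures; they are not taken from the dataset.

```python
import math

def skewness(values):
    """Third standardized moment (population form); 0 for a symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Hypothetical right-skewed measurements concentrated close to zero.
raw = [0.002, 0.003, 0.003, 0.004, 0.005, 0.006, 0.008, 0.012, 0.020, 0.060]
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)        # strongly positive: long right tail
skew_logged = skewness(logged)  # much closer to zero after the transformation
```

The transformation requires strictly positive values, which holds for these measures since they are ratios and variation magnitudes.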

Feature Selection

After the data pre-processing, we reduce the dimensionality of our data by combining redundant and correlated variables.

From the description of the variables alone we can infer that many of them give similar information about the same concept, even though each of them aims to extract a different characteristic of the speech signal. Since they seem to explain more or less the same information, there is a good chance that they are highly correlated. To test this hypothesis, we compute the correlations between the different variables [11]. We use the Pearson correlation matrix (left) to find linear relationships and the Spearman correlation matrix (right) to find monotonic relationships:


Figure 4: Correlation Matrix Plot of the different variables

As we can see in both plots, our intuition was accurate: the correlation coefficients are close to 1 among the variables related to Jitter and among the variables involving Shimmer. Additionally, we can see a strong negative correlation between HNR and NHR, since they are inverse measures of each other. In the same way, Total and Motor UPDRS are also highly correlated (with a correlation coefficient close to 1). This is because, as we previously said, Motor UPDRS is a subset of the tests included in Total UPDRS. Hence, they carry largely the same information.
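The difference between the two coefficients can be illustrated with a small sketch: Spearman is simply the Pearson coefficient computed on ranks, so it reaches 1 for any increasing relationship, linear or not. The data below are hypothetical and the functions are a plain-Python illustration, not the routines used to produce the correlation plots.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(v) for v in x]
r_pearson = pearson(x, y)     # below 1: the relationship is not linear
r_spearman = spearman(x, y)   # equal to 1 up to rounding: perfectly monotonic
```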

Performing a Principal Component Analysis (PCA) (shown below), we can see that Total UPDRS explains a slightly higher amount of inertia than Motor UPDRS; thus, we select Total UPDRS as our dependent variable.

Figure 5: Variables Factor Map (Principal Component Analysis)


From the above, we can state that we are in the presence of multicollinearity: there is an excessive correlation among explanatory variables. This phenomenon usually complicates the identification of an optimal set of explanatory variables in a statistical model. Because many of the variables measure similar concepts, we build a linear combination of them, avoiding redundancy while retaining most of the information.

To do that, we perform a PCA for each group of correlated variables (the Jitter and Shimmer groups). Then, we build a linear combination of the variables most correlated with the first significant dimension (which explains more than 90% of the total variability). In this way, we take into account the effect of all Shimmer and Jitter variables without repeating information.
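One way to read this step is as projecting each standardized group onto its first principal component. The sketch below follows that reading, which is an assumption on our part about how the combined variables could be built; it uses power iteration on the correlation matrix, and the three columns are hypothetical stand-ins for a group of Jitter measures.

```python
import math

def standardize(column):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]

def first_component(columns, iterations=200):
    """Loadings, explained-variance share and scores of the first component."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the group of variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    # Power iteration converges to the dominant eigenvector.
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    eigenvalue = sum(w[i] * sum(corr[i][j] * w[j] for j in range(p))
                     for i in range(p))
    # Combined variable: one score per observation.
    scores = [sum(w[i] * z[i][r] for i in range(p)) for r in range(n)]
    return w, eigenvalue / p, scores

# Hypothetical group of three highly correlated measures.
jitter_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
jitter_b = [0.11, 0.19, 0.32, 0.41, 0.48, 0.61]
jitter_c = [0.30, 0.62, 0.88, 1.21, 1.52, 1.79]
loadings, explained, jitter_comb = first_component([jitter_a, jitter_b, jitter_c])
```

When the variables in a group are almost collinear, as here, the explained share is close to 1 and the loadings are nearly equal, so the combined variable behaves like an average of the standardized measures.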

Clustering and Profiling

For a better understanding of our data, we perform a clustering analysis to see how the individuals group together. For that, we generate a new factorial space with the new combined variables. For the sake of interpretability of the extracted factors, we perform a varimax-rotated PCA [12] over the new Jitter and Shimmer combinations and the remaining voice features (HNR, RPDE, DFA, PPE). We set as supplementary the variables age, gender, test_time, subject and total UPDRS, which we categorize, trying to get homogeneous splits. The Total UPDRS variable is categorized only for the PCA analysis.

The categorization of the total UPDRS variable results in two categories, (0,30] and (30,60], to which we refer as low and high levels, respectively. This categorization is based on a publication of the Food and Drug Administration (FDA) committee on “Peripheral and Central Nervous System Drugs”, which establishes that there is no dose-response at 36 weeks in patients with total UPDRS greater than 30.

From the PCA we identify the first 4 dimensions as the significant ones (they explain 92% of the variability). Dimension 1 refers to features related to Jitter, PPE and NHR; Dimension 2 to DFA; Dimension 3 to RPDE; and Dimension 4 to the Shimmer variables.

To generate the clusters, we use the k-means algorithm, taking as input the coordinates of the individuals on the dimensions obtained from the varimax-rotated PCA of the voice features.

However, to use k-means we need to select the initial number of classes. To do that, we use an empirical criterion proposed by Husson et al. (2010), based on the between-cluster inertia gain of the hierarchical clustering, which suggests 5 initial classes as optimal for our data.
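As an illustration of the k-means step, here is a minimal sketch of Lloyd's algorithm started from given initial centers. The two-dimensional points and the choice of initial centers are hypothetical; in the analysis the inputs would be the coordinates on the 4 retained dimensions and the number of classes suggested by the hierarchical clustering.

```python
import math

def kmeans(points, centers, iterations=100):
    """Lloyd's algorithm from given initial centers; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well separated groups of individuals.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

Starting k-means from centers derived from a hierarchical partition, rather than from random points, makes the result reproducible and usually only refines the initial partition.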

Once we have defined the number of classes, we plot them in the principal component space, obtaining the following three-dimensional plot, where the individuals are colored according to their class (we plot 3 of the 4 dimensions for visualization purposes):


Figure 6: Clusters shown in 3 Dimensions

As we can see, the plot shows clusters that are very close to each other but distinct. Now we profile the different clusters by examining the characteristics of each one and pointing out their significant differences. Our clusters can be described as follows:

1. Senior patients with high UPDRS (C1): corresponds to people between 60 and 70 years old with high total UPDRS. This class contains 41% women and 59% men. These patients have high values of dysphonia (PPE), Jitter and NHR, and low values of HNR.

2. Young patients with high HNR (C2): corresponds to young people (36-60 years old), with both low and high total UPDRS, high HNR and low RPDE and NHR. This class contains 46% women and 54% men.

3. Male senior patients with high RPDE and HNR (C3): patients from 60 to 85 years old, mostly men (74.5%), with high general voice disorders (RPDE) and HNR, and low values of dysphonia (PPE) and Jitter.

4. High UPDRS patients (C4): patients between 60 and 85 years old, with high total UPDRS and low DFA.

5. Male young and senior patients with low UPDRS (C5): mostly male patients (81.6%), between 30-60 and 70-85 years old, with low values of total UPDRS. This group has high values of DFA.


Modelling

Model Tuning and Validation

For the modelling we use three methods: first, a linear regression model; second, an Artificial Neural Network (ANN), specifically a neural network with a single hidden layer; and third, a Support Vector Machine (SVM) for regression. These allow us to capture both linear and more complex relationships in the data.

For ANN and SVM we need to choose the right parameters. For ANN, we need to determine the optimal number of neurons in the hidden layer and the weight decay. For SVM, we need to choose the type of kernel, the regularization parameter C (cost) and the tolerance for errors (epsilon). To determine the optimal parameters we use cross-validation. However, because we are dealing with patients (every observation belongs to one patient) and the built-in cross-validation methods do not differentiate between them, we need to implement it ourselves. We split the data into 10 homogeneous folds and, in every iteration, take 9 folds to train the model and 1 to validate it. Then, we compute the average Root Mean Squared Error (RMSE), varying the parameters in each iteration.

For the neural network, we use 8 different numbers of neurons in the hidden layer: 1, 5, 10, 15, 25, 35, 40, 50, and 12 different weight decays: 0, 0.01, 0.1, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.85, 1. We choose the combination of parameters that leads to the lowest error.

For the SVM we use 8 different values for the regularization parameter C: 1, 2, 4, 6, 10, 15, 25, 30; three types of kernel functions: linear, radial and polynomial; and 6 different values for epsilon: 0.00001, 0.0001, 0.001, 0.01, 0.01, 0.1. As with the ANN, we choose the combination of parameters that leads to the lowest error.
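The size of the two search grids follows directly from the lists above. A short sketch that enumerates them is given below; the variable names are ours, and the epsilon list is copied exactly as written in the text, including the repeated 0.01.

```python
from itertools import product

# ANN grid: hidden-layer sizes and weight decays listed in the text.
ann_sizes = [1, 5, 10, 15, 25, 35, 40, 50]
ann_decays = [0, 0.01, 0.1, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.85, 1]
ann_grid = list(product(ann_sizes, ann_decays))

# SVM grid: kernels, costs and epsilons listed in the text.
svm_kernels = ["linear", "radial", "polynomial"]
svm_costs = [1, 2, 4, 6, 10, 15, 25, 30]
svm_epsilons = [0.00001, 0.0001, 0.001, 0.01, 0.01, 0.1]  # as listed, 0.01 appears twice
svm_grid = list(product(svm_kernels, svm_costs, svm_epsilons))

print(len(ann_grid), len(svm_grid))  # 96 144
```

With 10 folds, this amounts to 960 ANN fits and 1,440 SVM fits, which is why the cost of a single fit matters when choosing the grid resolution.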

The pseudo-code of the cross-validation algorithm is presented below:

    RMSEMatrix <- initialize empty matrix

    for fold in folds {
        for parameters in list_of_parameters {
            ValidationData <- data belonging to the current fold
            TrainingData   <- data belonging to the other folds
            fit <- fit the specified model with the selected parameters on TrainingData
            validate the model (fit) on ValidationData
            RMSEMatrix[parameters, fold] <- RMSE on ValidationData
        }
    }

    return(RMSEMatrix)
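The key point of the procedure is that folds must respect patients. A minimal Python sketch of one possible way to do this is shown below, assuming that folds are formed by assigning whole patients to folds so that no patient contributes to both training and validation data. The function names, the toy rows and the mean-predictor baseline are hypothetical and only serve to make the example runnable.

```python
import math
import random

def patient_folds(subject_ids, n_folds=10, seed=0):
    """Map each patient id to one fold, so a patient never spans two folds."""
    patients = sorted(set(subject_ids))
    random.Random(seed).shuffle(patients)
    return {pid: i % n_folds for i, pid in enumerate(patients)}

def rmse(actual, predicted):
    """Root Mean Squared Error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def cross_validate(rows, fit, predict, n_folds=10):
    """rows: (subject_id, features, target) tuples; returns the mean fold RMSE."""
    fold_of = patient_folds([r[0] for r in rows], n_folds)
    errors = []
    for fold in range(n_folds):
        train = [r for r in rows if fold_of[r[0]] != fold]
        valid = [r for r in rows if fold_of[r[0]] == fold]
        if not valid:
            continue  # can only happen with fewer patients than folds
        model = fit(train)
        predictions = [predict(model, r[1]) for r in valid]
        errors.append(rmse([r[2] for r in valid], predictions))
    return sum(errors) / len(errors)

# Hypothetical baseline that always predicts the training mean.
def fit_mean(train):
    return sum(r[2] for r in train) / len(train)

def predict_mean(model, features):
    return model

# Hypothetical toy data: 42 patients with 3 recordings each.
rows = [(pid, None, 20.0 + pid % 7) for pid in range(1, 43) for _ in range(3)]
mean_rmse = cross_validate(rows, fit_mean, predict_mean)
```

Splitting at the recording level instead would place recordings of the same patient on both sides of the split, which tends to give optimistic error estimates because recordings of one patient resemble each other.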

After running the cross-validation for the ANN, we obtain that the optimal parameters are 40 neurons in the hidden layer and a decay of 0.01. The graphical representation is presented below:

Figure 7: RMSE vs decay with different number of neurons

We can see that beyond 40 neurons the decrease of the RMSE is not really significant. Therefore, choosing a more complex model would not bring any benefit. Regarding the weight decay, we did a sensitivity analysis around the value obtained (0.01), from which we finally found that the optimal one is 0.005. Even so, the differences in RMSE among the different parameters are really small.

Additionally, we also ran the cross-validation for a neural network with size 0, i.e. a linear neural network; the results from this network were worse than those with larger sizes, so we did not display it in the plot. For the SVM, the lowest error was obtained using a linear kernel and a regularization parameter (C) equal to 15. We also tried different values of epsilon, but variations in this parameter did not affect the RMSE, so we kept the default value (epsilon = 0.1). After a sensitivity analysis around the C parameter, we determined that the optimal value was 14. The plot of the RMSE against C is presented below.


Figure 8: RMSE vs regularization parameter C

Lastly, we developed a linear regression model with the variables obtained from the pre-processing, using the total UPDRS as response variable and the rest as explanatory variables.

Finally, we ran the optimal models in the test set, where we obtained the following results:

Table 1: RMSE for SVM, ANN and Linear Regression

Model               RMSE
SVM                 14.82364
ANN                 37.52655
Linear Regression   14.56033

We can see that Linear Regression and SVM performed a lot better than ANN. This seems to indicate that our data actually follow a linear pattern; therefore SVM, using a linear kernel, and Linear Regression were able to fit better. Even so, for interpretability reasons (and because its RMSE is slightly better) we pick the linear regression model as our optimal one. Our linear model is the following:

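$$
\begin{aligned}
\text{total\_UPDRS} ={}& 40.139 + 0.3663 \times \text{age} - 2.6302 \times \text{sex} + 0.0128 \times \text{test\_time} \\
&- 4.2815 \times \log_{10}(\text{NHR}) + 16.9934 \times \text{RPDE} - 41.9428 \times \text{DFA} - 0.6267 \times \text{HNR} \\
&- 1.9648 \times \text{PPE} - 1.2121 \times \text{ShimmerComb} + 1.6558 \times \text{JitterComb}
\end{aligned}
$$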

As we can see, each additional year of age increases the total UPDRS score by 0.3663 units, while females have on average a total UPDRS score 2.63 units lower than males with the same values for the rest of the variables. Looking at the effect of the voice features on the total UPDRS score, we can see a positive relation between the total UPDRS score and RPDE (each additional unit increases total UPDRS by 17 units) and JitterComb (each additional unit increases total UPDRS by 1.66 units).

In contrast, other features have a negative relation with the total UPDRS score: log10(NHR) (a one percent increase in NHR is associated with a decrease of about 0.0185 units in total UPDRS, since 4.2815 × log10(1.01) ≈ 0.0185), DFA (each additional unit decreases total UPDRS by 41.94 units), PPE (each additional unit decreases total UPDRS by 1.96 units), HNR (each additional unit decreases total UPDRS by 0.627 units) and ShimmerComb (each additional unit decreases total UPDRS by 1.21 units). Remember that all those variations are expressed ceteris paribus, in other words supposing that we only change the value of one variable while all the others remain constant.
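The ceteris paribus readings can be verified by evaluating the fitted equation directly. The sketch below encodes the reported coefficients; the function names and the example patient are hypothetical, and the scale of ShimmerComb and JitterComb is not specified in the text, so the values used for them are purely illustrative.

```python
import math

# Coefficients copied from the fitted linear model reported in the text.
COEFFICIENTS = {
    "intercept": 40.139, "age": 0.3663, "sex": -2.6302, "test_time": 0.0128,
    "log10_nhr": -4.2815, "rpde": 16.9934, "dfa": -41.9428, "hnr": -0.6267,
    "ppe": -1.9648, "shimmer_comb": -1.2121, "jitter_comb": 1.6558,
}

def predict_total_updrs(age, sex, test_time, nhr, rpde, dfa, hnr, ppe,
                        shimmer_comb, jitter_comb):
    """Evaluate the linear model; sex is 0 for male and 1 for female."""
    c = COEFFICIENTS
    return (c["intercept"] + c["age"] * age + c["sex"] * sex
            + c["test_time"] * test_time + c["log10_nhr"] * math.log10(nhr)
            + c["rpde"] * rpde + c["dfa"] * dfa + c["hnr"] * hnr
            + c["ppe"] * ppe + c["shimmer_comb"] * shimmer_comb
            + c["jitter_comb"] * jitter_comb)

def nhr_effect(relative_increase):
    """Change in the prediction when NHR is multiplied by (1 + relative_increase)."""
    return COEFFICIENTS["log10_nhr"] * math.log10(1.0 + relative_increase)

# Hypothetical patient; the values are illustrative, not taken from the dataset.
patient = dict(age=65, sex=0, test_time=90, nhr=0.02, rpde=0.5, dfa=0.65,
               hnr=22.0, ppe=0.2, shimmer_comb=0.0, jitter_comb=0.0)
female = dict(patient, sex=1)

sex_gap = predict_total_updrs(**female) - predict_total_updrs(**patient)
one_percent = nhr_effect(0.01)  # about -0.0185 units for a 1% increase in NHR
tenfold = nhr_effect(9.0)       # a tenfold increase shifts the score by the full coefficient
```

Because NHR enters through a base-10 logarithm, its coefficient is the change in total UPDRS for a tenfold increase in NHR, and smaller relative changes scale with log10 of the ratio.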

Conclusions and Future Work

As we saw in this project, the detection of early stages of Parkinson’s disease is crucial, since early treatment can improve the individual’s quality of life and significantly slow the progression of the disease. We saw that different voice features are related to the Parkinson’s disease score. Hence, those features can be used to estimate the degree of Parkinson’s disease of different patients, making it possible to detect the ones that are in early stages. As we saw in the data exploration, the patients were clustered in a coherent way using voice features.

From the results obtained with the three kinds of algorithms, we saw that our data can be fit with simpler models; specifically, the linear model had the best result of all three (RMSE of 14.56), followed by the SVM using a linear kernel (14.82). From this, we can conclude that we found no evidence of complex (non-linear) relationships in our data, which is why linear models outperformed more complex ones. Additionally, the Artificial Neural Network seems to perform badly due to its inability to reach a good local optimum or because of a bad weight initialization, since a neural network with size equal to 0 should give results comparable with the linear regression. This hypothesis is reinforced by the fact that we tried a different package for neural networks with a different optimization function, and it showed improvements in some cases.

It is worth mentioning that, even though linear regression performed significantly better than Neural Networks, the RMSE obtained shows that our model is unacceptable, all the more so in the context in which we are working, where we are dealing with medical cases of a serious disease. In this kind of scenario, a good prediction could avoid the deterioration of somebody’s life. However, more data is needed. The data set analyzed contained information on only 42 different patients; that is, the low amount of independent data constrained the predictive power of the models. Therefore, it is highly likely that a greater amount of independent data would allow us to reach a higher accuracy.

Additionally, as we saw in the data pre-processing, the collection of many variables related to Jitter and Shimmer is unnecessary in terms of prediction, since all of them are highly correlated. A single indicator of each of them would be enough to perform the study. For this reason, we suggest that in future work the data set should be extended to contain a higher number of patients, in order to consider multiple profiles and increase the quality of the data, increasing the ability to predict the PD level.


Bibliography

[1] NINDS: Parkinson’s disease information page.

[2] Joseph Jankovic. Parkinson’s disease: clinical features and diagnosis. Journal of Neurology, Neurosurgery & Psychiatry, 79(4):368–376, 2008.

[3] Fernando L Pagan. Improving outcomes through early diagnosis of Parkinson’s disease. The American Journal of Managed Care, 18(7 Suppl):S176–82, 2012.

[4] Daniel L Murman. Early treatment of Parkinson’s disease: opportunities for managed care. The American Journal of Managed Care, 18(7 Suppl):S183–8, 2012.

[5] Max A Little, Patrick E McSharry, Eric J Hunter, Jennifer Spielman, Lorraine O Ramig, et al. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. IEEE Transactions on Biomedical Engineering, 56(4):1015–1022, 2009.

[6] Rhonda J Holmes, Jennifer M Oates, Debbie J Phyland, and Andrew J Hughes. Voice characteristics in the progression of Parkinson’s disease. International Journal of Language & Communication Disorders, 35(3):407–418, 2000.

[7] Christopher G Goetz, Glenn T Stebbins, David Wolff, William DeLeeuw, Helen Bronte-Stewart, Rodger Elble, Mark Hallett, John Nutt, Lorraine Ramig, Terence Sanger, et al. Testing objective measures of motor impairment in early Parkinson’s disease: Feasibility study of an at-home testing device. Movement Disorders, 24(4):551–556, 2009.

[8] Konrad Szewczyk-Krolikowski, Paul Tomlinson, Kannan Nithi, Richard Wade-Martins, Kevin Talbot, Yoav Ben-Shlomo, and Michele TM Hu. The influence of age and gender on motor and non-motor features of early Parkinson’s disease: initial findings from the Oxford Parkinson Disease Center (OPDC) discovery cohort. Parkinsonism & Related Disorders, 20(1):99–105, 2014.

[9] R García-Ramos, E López Valdés, L Ballesteros, S Jesús, and P Mir. The social impact of Parkinson’s disease in Spain: Report by the Spanish Foundation for the Brain. Neurología (English Edition), 31(6):401–413, 2016.

[10] Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4):884–893, 2010.

[11] Mark A Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

[12] Yijuan Lu, Ira Cohen, Xiang Sean Zhou, and Qi Tian. Feature selection using principal feature analysis. In Proceedings of the 15th ACM International Conference on Multimedia, pages 301–304. ACM, 2007.

[13] Jeshua Sargetis. Evaluation of particulate metal and noise exposures at a foundry and recommended control strategies. 2016.

[14] “Parkinsons Telemonitoring Data Set”. UCI Machine Learning Repository.
