
INCORPORATING DOMAIN KNOWLEDGE OF GENERALIZED TONIC-CLONIC

SEIZURES INTO SEIZURE ONSET DETECTOR ON AN APPLE WATCH USING

NEURAL NETWORK

by

Michael Chan

A thesis submitted to Johns Hopkins University in conformity with the requirements for the

degree of Master of Science in Engineering

Baltimore, Maryland

May, 2019

© Michael Chan 2019

All rights reserved


ABSTRACT

Generalized tonic-clonic seizures (GTCS) are the most severe type of seizure and can lead to sudden unexpected death in epilepsy (SUDEP). To address this problem, we previously built a seizure onset detector on the Apple Watch to serve as an alerting system for timely intervention. That detector achieved a sensitivity of 95%, 2 false positives per day, and a 60-second detection latency. To further optimize the detector's development process and performance, we constructed a machine learning model pipeline, explored different feature sets, and incorporated domain knowledge into neural network models. We determined that spectral features outperformed handcrafted features. We also designed neural networks based on domain knowledge by modeling time and/or frequency dynamics. In particular, an F-LSTM/T-LSTM neural network reached a targeted performance of 100% sensitivity, 0.41 false positives per day, and a mean detection latency of 6.87 seconds. In the future, we will package our software, explore the use of other modules on the Apple Watch to maximize its utility, and design a personalized detector once we have a larger dataset.

Thesis Committee Chair: Dr. Nathan E. Crone

Thesis Project Adviser and Thesis Committee Member: Dr. Shinji Watanabe

Thesis Committee Member: Dr. Sridevi V. Sarma


PREFACE

I would first like to thank my thesis advisor, Dr. Nathan Crone of the Department of Neurology at Johns Hopkins Medicine, for his mentorship and support. Dr. Crone has always been eager to discuss research-related problems with me. I am also grateful for the opportunity to explore potential research directions for the EpiWatch project.

I would also like to thank my thesis committee members, Dr. Shinji Watanabe and Dr. Sridevi V. Sarma. I have learned many technical skills and much domain know-how from Dr. Watanabe, and I enjoyed every conversation we had because they helped incubate new ideas for this thesis. Dr. Sarma provided the guidance necessary to complete this thesis.

Finally, I'd like to thank my colleagues in Dr. Crone's lab (Maxwell Collard, Samyak Shah, Daniel Candrea, Tessy Thomas, Qinwan Rabbani, Anna Korzeniewska, Christopher Coogan, Alexander Weiss, Yujing Wang, and Griffin Milsap) for the stimulating conversations we had during lunchtime.


DEDICATION

I would like to dedicate this thesis to my family and friends. Thank you for having confidence in me and encouraging me to pursue my passion. I also dedicate this to all my teachers and professors, who have been the reason for my passion and love for science.

Finally, I would like to express my profound gratitude to my parents and friends. I have adopted their dedication, motivation, curiosity, and critical thinking, which made me the mature and independent researcher I am today. For the amazing people I will meet, for the cool things I will build, for the dreams that will come true, and for the moments I will look back on, I never stop.


TABLE OF CONTENTS

Chapter 1: Introduction
1.1. Epilepsy, seizure, SUDEP
1.2. Intro to seizure onset detector
1.3. Current achievements
1.4. Thesis Overview

Chapter 2: EpiWatch Development Environment
2.1. Previous work
2.2. Data preprocessing pipeline
2.2.1. Stage 1 - data ingestion
2.2.2. Stage 2 - data preparation
2.2.3. Stage 3 - feature extraction
2.2.4. Stage 4 - model training and optimization
2.2.5. Stage 5 - model conversion
2.3. Summary

Chapter 3: Feature exploration
3.1. Introduction to different feature extraction methods
3.2. Handcrafted features
3.3. Spectral features
3.4. LSTM neural network architecture
3.5. Performance: sensitivity, specificity, latency, and computation time
3.6. Summary

Chapter 4: Incorporating domain knowledge into neural network architecture
4.1. Introduction
4.2. Domain knowledge: chirp in motion signal in GTC seizure
4.3. Advanced seizure onset detector using neural network: LSTM that captures time and frequency dynamics
4.4. Training recipe and K-fold cross validation
4.5. Performance: sensitivity, specificity, latency, computation cost
4.6. Summary

Chapter 5: Conclusion and future directions
5.1. Future directions
5.1.1. Build Python pip packages and unit tests for maintainability
5.1.2. The use of the microphone: detector targeted for ictal cry
5.1.3. The use of the speaker: vocal instruction for timely intervention
5.1.4. Evaluation of the duration of postictal generalized EEG suppression for better SUDEP risk assessment
5.1.5. Personalized detector
5.2. Conclusion


LIST OF TABLES

Chapter 1:
1.1. Performance metrics for seizure onset detector

Chapter 2:
2.1. Description of the notebook pipeline that trains and optimizes an iForest detector

Chapter 3:
3.1. Definition of TP, FP, FN, and TN
3.2. Classification performances of handcrafted features and spectral features are compared

Chapter 4:
4.1. Observations on the characteristics of the chirp
4.2. Feature extraction constraints based on prior knowledge about the chirp
4.3. Performance for different LSTM approaches


LIST OF FIGURES

Chapter 2:
2.1. The proposed machine learning pipeline
2.2. Data preprocessing visualization
2.3. Data is segmented using a user-defined epoch size and overlap
2.4. Flow chart of CoreML model conversion
2.5. Cyclical machine learning pipeline
2.6. A sample snippet of the pipeline shell script with the 5-stage scheme

Chapter 3:
3.1. Zero-crossing points
3.2. Raw data and handcrafted features
3.3. Spectral features distribution
3.4. Sample seizure and non-seizure data in the time and frequency domains
3.5. Distributions of the spectral features before standardization of seizure and non-seizure data
3.6. Spectrogram constructed from spectral feature vectors (1~20 Hz)
3.7. Frequency features have a lot of information
3.8. The LSTM NN architectures for both handcrafted features and spectral features
3.9. Latency definition
3.10. The ROC curves and latency curves for both handcrafted and spectral feature sets are compared
3.11. Outputs of both detectors on all 15 seizure recordings stacked together

Chapter 4:
4.1. TF-LSTM NN architecture
4.2. F-LSTM/T-LSTM NN architecture
4.3. The ROC curves and latency curves for both handcrafted and spectral feature sets
4.4. LSTM NN output for all four LSTM models


Chapter 1: Introduction

1.1. Epilepsy, SUDEP, GTC seizure

Of the 3 million Americans with active epilepsy, approximately 1 million continue to have uncontrolled seizures even with ongoing medical therapy (Tian, 2018). While epilepsies, the seizure types experienced by patients, and the impacts of individuals' seizures on quality of life are incredibly diverse, generalized tonic-clonic convulsions are a particular focus of study. These seizures, especially those that are prolonged or violent, carry a risk of major complications including sudden unexpected death in epilepsy (SUDEP), which occurs in approximately 1 in 100 patients with uncontrolled GTCS per year (Ficker et al., 1998). While uncontrolled GTCS are particularly correlated with SUDEP (Ficker & Ficker, 2000), recent evidence indicates that timely intervention during the course of, and immediately following, a convulsion can substantially reduce the risk of SUDEP and other complications (Ryvlin et al., 2013). Therefore, a great deal of recent interest has focused on developing means of detecting GTCS in real time and alerting caregivers or emergency services to these detected events (Tzallas et al., 2012).

A GTCS progresses through a series of stages, including the aura, tonic, clonic, and postictal stages. In the aura stage, patients may experience hallucinations, confusion, numbness, and distorted emotions. In the tonic stage, patients may exhibit an epileptic cry (ictal cry), a stiffened body, incontinence, and an arched back. In the clonic stage, there is a distinctive rhythmic motion, frothy saliva, and eye blinking. In the postictal stage, patients have weak limbs, exhaustion, and sleepiness. SUDEP normally occurs after the clonic stage, when patients are in or past the postictal stage, and breathing difficulty is often observed before death. A study in 1996 showed that 59% of patients developed apnea and 35% developed oxygen desaturation in the range of 55% to 83% during seizures, suggesting that seizures can lead to life-threatening events such as cardiac and respiratory dysfunction (Nashef et al., 1996).


We believe that a warning device triggered by the characteristic movement occurring during the tonic-clonic stages (especially the clonic stage) is most helpful for preventing SUDEP, because it allows caregivers to intervene in time. A detector needs to reach near-perfect sensitivity to fully prevent SUDEP. While 100% sensitivity is relatively easy to attain, it usually comes at the cost of a higher false positive rate (FPR) and unacceptably long latency. The detector also needs a low FPR to be useful to patients: frequent false alarms can greatly reduce device adherence and thereby defeat its purpose. We envisioned a detector for nocturnal monitoring at home, so family members can reach patients in time at night. We also expected usage in ambulatory settings, where patients are active and sometimes outdoors, giving patients peace of mind when they are alone. In both cases, the latency of the detector needs to be less than 60 seconds.

                        Sensitivity   FPR             Latency
Minimum requirement     > 95%         < 0.2 per day   < 60 seconds

Table 1.1. Performance metrics for seizure onset detector.

1.2. Intro to seizure onset detector

To reduce SUDEP, many teams have invested in designing the hardware components of seizure onset detectors for research and commercialization. Different sensors have been investigated, and each has its strengths and weaknesses. Electroencephalography (EEG), usually paired with video recording (vEEG), is the most accurate method for seizure labeling since it targets the source of seizures: the brain. It is often used as the gold standard for epileptologists to label seizure stages. However, it limits mobility, and portable EEG is uncomfortable to wear (Bell, Walczak, Shin, & Radtke, 1997); patients usually feel uneasy wearing it in public. On the other hand, in-bed sensor strips are the most comfortable alternative with the lowest sense of presence. They can extract heart rate and respiratory rate with piezoelectric sensor strips, and they do not limit patients' mobility because they are placed not on the patient but on top of the mattress pad. Nevertheless, their inability to support ambulatory monitoring prevents them from becoming a universal solution for seizure onset detection. Other electrophysiological signals can provide important insights, such as disruptions in the nervous system. Since 70% of patients have cardiovascular dysfunction during seizures, electrocardiography (ECG) can be used to extract heart rate and advanced cardiovascular information (Vandecasteele et al., 2017). Electromyography (EMG) performs well, since limb movements caused by GTCS are distinctive from other movements while consistent among themselves (Hamer, 2003). Nevertheless, the common ECG and EMG electrodes, Ag/AgCl electrodes, can cause skin irritation, so they are not an ideal solution for long-term monitoring (longer than 12 hours). An ideal sensor should be portable, noninvasive, and comfortable.

Wrist-worn devices have emerged in particular force, owing to their unobtrusiveness, ease of use, and ability to house several noninvasive sensors. While the majority of these devices use accelerometry as their primary modality, several other classes of sensors have also successfully informed seizure detection, including heart rate derived from photoplethysmography (PPG) data (Vandecasteele et al., 2017), sympathetic response derived from electrodermal activity (EDA) (Poh et al., 2012), and motion features extracted from accelerometer (ACC) data (Beniczky, Polster, Kjaer, & Hjalgrim, 2013).

1.3. Current achievements

Here we introduce and test the performance of a seizure detection and alerting system, EpiWatch, developed on a commercially available consumer wearable device, the Apple Watch. Motivated by recent evidence suggesting that the biosensor readings measured by the Apple Watch are highly accurate (El-Amrawy & Nounou, 2015), we developed a detection algorithm that fuses the movement data provided by the accelerometer with heart rate measurements computed from the Apple Watch's built-in PPG sensor. As GTCS are relatively rare events, we utilized a novelty detection strategy called an Isolation Forest (iForest), which is well tailored to imbalanced classification problems. We built a candidate detector using a dataset collected from a combination of inpatients in an epilepsy monitoring unit (EMU) and ambulatory controls without epilepsy. In a prospective "evaluation" dataset, we measured performance sufficient to potentially provide real-world utility as a seizure detector. Currently, we have 9 seizures labeled by vEEG, totaling about 2 hours of recording; control (non-seizure) data total about 4000 hours.

Seizure detection is performed by evaluating continuous "epochs," each consisting of 1 second of data. Handcrafted features extracted from the data in each epoch are used as the input of the detector to better represent the data. The detector, an iForest, acts as an anomaly detector that determines whether an epoch represents normal or abnormal behavior. It is trained on nearly 4000 hours of non-epileptic data. Intuitively, any anomalous data (such as an epileptic convulsion) can trigger the iForest and lead to the firing of the detector. Since each epoch is 1 second long, the iForest produces one detection per second. Although this method yields high sensitivity in our preliminary testing, it can also trigger many false positives during usual activities not related to epilepsy, such as tooth brushing or exercise. Therefore, we introduced a subsequent mechanism to reduce false positives: evidence accumulation. Evidence accumulation acts as a smoothing filter. Only when the output of the evidence accumulation stage exceeds a threshold does the detection system respond and follow with intervention logic (confirming the user's consciousness and alerting caregivers).
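The two-stage scheme above (a per-epoch anomaly detector followed by evidence accumulation) can be sketched as follows. This is a minimal illustration, not the production detector: the synthetic feature values, the 10-epoch window, and the threshold of 8 are assumed stand-ins.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Train the anomaly detector on "normal" (non-seizure) feature vectors only.
normal_features = rng.normal(0.0, 1.0, size=(5000, 9))  # 9 features per 1 s epoch
iforest = IsolationForest(n_estimators=100, random_state=0).fit(normal_features)

# Simulate a stream: 60 s of normal activity followed by 30 s of anomalous epochs.
stream = np.vstack([rng.normal(0.0, 1.0, size=(60, 9)),
                    rng.normal(6.0, 1.0, size=(30, 9))])

# The iForest emits one binary decision per 1 s epoch (+1 normal, -1 anomaly).
anomalies = (iforest.predict(stream) == -1).astype(int)

# Evidence accumulation: a moving sum over the last 10 epochs acts as a
# smoothing filter; the system only fires once enough evidence accumulates.
WINDOW, THRESHOLD = 10, 8
evidence = np.convolve(anomalies, np.ones(WINDOW), mode="same")
alarm = evidence >= THRESHOLD
```

Isolated per-epoch false positives never reach the threshold, while a sustained convulsion-like anomaly does, which is exactly why the mechanism trades latency for specificity.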

Although the success of EpiWatch has been promising, we believe we have not yet maximized the use of the collected data and of the resources the Apple Watch offers developers. Since the development of the EpiWatch seizure detector had been in the feasibility stage, it was important to establish whether we could build a functioning seizure detector on the Apple Watch at all. The underlying difficulties include intact retrieval of sensor data, battery consumption, and CPU usage limitations. The strategy taken was to build a minimum viable product (MVP). Through this process, we learned that data retrieval can occasionally fail on earlier versions (Watch Series 2 and 3) but has not failed on the latest version (Watch Series 4). Battery consumption has not been a problem since Series 3, largely due to the more efficient CPU (Apple S3 System in Package) that Apple developed. We also learned that we have CPU headroom to spare (the current build uses only 8%), which opens opportunities to research advanced algorithms such as neural networks. Note that the choices of feature extraction and detection methods are constrained by computational resources and the size of our dataset. Although this method performs well in sensitivity and slightly improves specificity, the false positive rate is still unacceptable (2 per day). While evidence accumulation improves specificity, it also significantly worsens latency (it takes at least 10 "anomaly" epochs, or 10 seconds, for the detector to trigger). It is merely a crude way to capture time dynamics: it relies on the fact that GTCS usually last more than 10 seconds and does not learn the characteristics of GTCS as time progresses. In addition, the iForest can only output binary results, which makes its predictions imprecise. Now that our current build has answered the feasibility question, we should ask ourselves the next question: can we build a high-performing seizure detector on the Apple Watch?

1.4. Thesis Overview

To move from an MVP to a minimum marketable product (MMP), it is important to first construct a machine learning model pipeline for efficient collaboration. Since the hardware and the supported APIs have also improved significantly (from Watch Series 1 to Watch Series 4), we are ready to make some changes. It is evident from the above that the current detector should be replaced with a more advanced one. To build a more advanced algorithm, the model needs to learn the structure in the data (e.g., dynamics in frequency and time), and it should abandon evidence accumulation, as better methods to reduce FPR should exist. As a result, there is a need to critically evaluate the feasibility and quality of different feature sets, assess the performance of the mathematical models corresponding to each feature set, and select the best model accordingly.


This thesis is organized as follows. In Chapter 2, a well-structured pipeline for machine learning model production is discussed. To advance the algorithm, our MMP will focus on reaching the high-performance metrics found in the literature (sensitivity of 90%, FPR below 0.5 per day, and latency under 60 seconds) by optimizing feature extraction and classification separately. In Chapter 3, two feature extraction methods, handcrafted features and spectral features, are tested and compared. In Chapter 4, neural network models that use the spectral feature set and incorporate domain knowledge of GTCS are discussed. Four variants of long short-term memory (LSTM) neural networks are considered; most come from the literature, and one is proposed by the author. The four models are tested and compared based on their performance and computational cost. Finally, future research directions are proposed and explained in detail in Chapter 5.


Chapter 2: EpiWatch Development Environment

2.1. Previous work

EpiWatch algorithm development was started by a student researcher who developed a software pipeline that produces a GTCS onset detector. Specifically, this pipeline performs preprocessing, feature extraction, model training and optimization, and model serialization for mobile deployment, all in Jupyter notebooks, an open-source web application for creating documents that contain code, visualizations, and text. A table that describes and evaluates this pipeline is provided below:

Notebook               | Purpose                                                       | Exports                                                    | What I learned
notebook 1: setup      | data retrieval from the backend, filtering, plotting          | filter parameters                                          | plots are well labeled and organized; the code can be reused
notebook 2: seizures   | inspect seizure data                                          | seizure descriptors, exported back into the local database | labels (fineStart and fineEnd) are logged by our epileptologists
notebook 3: pull       | show statistics of the data: number of records by user and total data duration (~4000 hrs) | intermediate data, saved on the local machine | pulling data is time intensive; this pipeline mitigates it efficiently
notebook 4: preprocess | chunk, uniformize, and filter data                            | preprocessed seizure data                                  | seizure and non-seizure data are exported separately
notebook 5: extract    | extract all features                                          | feature vectors                                            | uses parallelization heavily (faster)
notebook 6: select     | select features (9 of them)                                   | selected feature names                                     | features are selected based on ease of implementation and interpretability
notebook 7: train      | z-score plot; train and evaluate an iForest detector (using CV to further optimize); ROC plot | detector parameters as JSON | detector parameters are optimized in this notebook; much reusable code

Table 2.1. Description of the notebook pipeline that trains and optimizes an iForest detector.

These notebooks are great in that the code is clean and well documented. One can easily learn the pipeline by executing all the notebooks, and the plots generated are well labeled, easy to interpret, and provide insight into seizure data for new developers. Nevertheless, there are a few shortcomings. First, data are exported to folders without traceable names; for example, data exported in notebook 3 are called "intermediates," which is not self-explanatory. Second, the notebooks cannot be run in the background or from the terminal with a single command, which makes certain time-consuming tasks infeasible. This matters because new seizure data is generated constantly and our dataset is still expanding; we would like a system that automatically retrains the detector frequently. The notebooks are also inherently hard to maintain, because one must open a notebook and run all of its cells to debug it. Further, version control tools such as git do not diff Jupyter notebooks meaningfully, which hinders collaboration. The current pipeline is sufficient for research purposes, but it needs to be reformed to production level.

2.2. Data preprocessing pipeline

To reform the pipeline, we first studied how other fields have done it. For researchers in speech recognition, Kaldi is a widely adopted toolkit. Its main purpose is to encourage open-source collaboration by organizing speech recognition processing steps into "stages" so researchers don't have to reinvent the wheel. Stages are an important feature of Kaldi because they define a clear boundary for each processing step. For example, a researcher interested in how a new feature extraction method affects a speech recognition task can easily replace Kaldi's feature extraction stage with their own while reusing the rest of the stages, producing a speech recognition system with minimal development time. Similarly, our pipeline adopts the stage scheme: in the script, developers can replace any of the stages with their own recipe as long as the input and output data formats match those of the original stage.

In the pipeline we designed, there are 5 stages, each of which can be run with a single shell script in the operating system background, and unit tests are provided for maintainability. Each stage can have its own variants; for example, stage 3 (feature extraction) can have the following variants: handcrafted features, spectral features, convolutional autoencoder features, etc.
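A stage-gated runner of this kind can be sketched in Python as follows; the five stage names mirror the pipeline described here, while the registry mechanics and the stage bodies are illustrative stand-ins.

```python
# Minimal sketch of a Kaldi-style stage-gated pipeline runner. The five
# stage names mirror the pipeline in the text; the bodies are stand-ins.
from typing import Callable, Dict

STAGES: Dict[int, str] = {
    1: "data_ingestion",
    2: "data_preparation",
    3: "feature_extraction",
    4: "model_training",
    5: "model_conversion",
}

def run_pipeline(registry: Dict[str, Callable[[dict], dict]],
                 start_stage: int = 1, stop_stage: int = 5) -> dict:
    """Run stages start_stage..stop_stage, threading a shared context dict.

    Any stage can be swapped for a variant as long as it accepts and
    returns the same context format, mirroring Kaldi's stage contract.
    """
    ctx: dict = {}
    for n in range(start_stage, stop_stage + 1):
        ctx = registry[STAGES[n]](ctx)
    return ctx

# Example variant registry: each "stage" just records that it ran.
registry = {name: (lambda ctx, name=name: {**ctx, name: "done"})
            for name in STAGES.values()}

# Resume from stage 3 onward, e.g. to try a new feature extraction variant.
result = run_pipeline(registry, start_stage=3, stop_stage=5)
```

The start/stop stage arguments are what make iteration cheap: a developer experimenting with a new feature set re-runs only stages 3 through 5.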

Figure 2.1. The proposed machine learning pipeline.

2.2.1. Stage 1 - data ingestion

In this stage, raw data is first extracted from backend (MongoDB and AWS) using metadata in a JSON

file. Raw data has problems such as uneven sampling of both accelerometer and heart rate data, data gaps,

and data overlap. These problems can be mitigated by interpolation and data slicing. The processed data is

then filtered with a bandpass filter (0.2~20 Hz) to remove DC drifting and high frequency noise.
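A minimal sketch of this ingestion step, assuming a hypothetical 50 Hz target sampling rate; the 0.2~20 Hz band is from the text, while the filter order and design are illustrative choices.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import butter, filtfilt

FS = 50.0  # assumed target sampling rate; not stated in the text

def ingest(t, x, fs=FS, band=(0.2, 20.0), order=4):
    """Resample unevenly sampled data onto a uniform grid, then bandpass it.

    Interpolation mitigates uneven sampling and small gaps; the 0.2-20 Hz
    bandpass removes DC drift and high-frequency noise (band per the text,
    filter order/design are illustrative).
    """
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs)
    x_uniform = interp1d(t, x, kind="linear")(t_uniform)
    b, a = butter(order, band, btype="bandpass", fs=fs)
    return t_uniform, filtfilt(b, a, x_uniform)  # zero-phase filtering

# Unevenly sampled test signal: a 1 Hz sine riding on a DC offset of 2.
t = np.sort(np.random.default_rng(1).uniform(0, 30, 3000))
x = 2.0 + np.sin(2 * np.pi * 1.0 * t)
t_u, x_f = ingest(t, x)
```

Zero-phase filtering (filtfilt) is used in the sketch so the filter does not shift the seizure onset in time, which matters when latency is a reported metric.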

10

Figure 2.2. Data preprocessing visualization. The first plot shows the raw data retrieved from the backend with no preprocessing; the second plot shows the interpolated and filtered data. Purple specifies the segment of a seizure occurrence.

2.2.2. Stage 2 - data preparation

In this stage, data is segmented into epochs of equal length with user defined overlaps. Each epoch will be

assigned a seizure number (i_seizure), specifying the unique recording it belongs to. This is essential to

enable k-fold cross validation testing to ensure generalizability. The epochs will be stored in a matrix with

a dimension of (Nepochs, epoch length, # of channels), where the number of channels in this matrix is 4,

including accelerometer x-axis, y-axis, z-axis and heart rate channel. In addition, a vector of classes label

(binary, sz or non-sz) for each epoch is stored, with a dimension of (Nepochs, 1), and a vector of i_seizure

label (discrete integers) for each epoch is also stored, with a dimension of (Nepochs, 1).
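A minimal sketch of this segmentation step; the epoch-labeling rule used here (an epoch is "sz" if any sample inside it is a seizure sample) is an assumption for illustration.

```python
import numpy as np

def make_epochs(data, labels_per_sample, i_seizure_id, epoch_len, overlap):
    """Segment a (Nsamples, 4) recording into overlapping epochs.

    Returns:
      epochs:    (Nepochs, epoch_len, 4)  - acc x/y/z + heart rate
      y:         (Nepochs, 1)             - binary sz / non-sz label
      i_seizure: (Nepochs, 1)             - recording id for group-wise CV
    Labeling rule (assumed): an epoch is "sz" if any sample in it is seizure.
    """
    step = epoch_len - overlap
    starts = range(0, data.shape[0] - epoch_len + 1, step)
    epochs = np.stack([data[s:s + epoch_len] for s in starts])
    y = np.array([[labels_per_sample[s:s + epoch_len].any()] for s in starts],
                 dtype=int)
    i_seizure = np.full((len(epochs), 1), i_seizure_id, dtype=int)
    return epochs, y, i_seizure

# 10 s of 4-channel data at 50 Hz; seizure marked in the last 4 s.
fs = 50
data = np.random.default_rng(2).normal(size=(500, 4))
sample_labels = np.zeros(500, dtype=bool)
sample_labels[300:] = True
epochs, y, sid = make_epochs(data, sample_labels, i_seizure_id=7,
                             epoch_len=fs, overlap=fs // 2)
```

With a 1-second epoch and 50% overlap, the 10-second recording yields 19 epochs of shape (50, 4), matching the (Nepochs, epoch length, channels) layout described above.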


Figure 2.3. Data is segmented using a user defined epoch size and overlap.

2.2.3. Stage 3 - feature extraction

This stage performs feature extraction. The three feature extraction methods implemented are handcrafted features, convolutional autoencoder features, and spectral features; their specifics will not be discussed here. Since we plan to employ machine learning algorithms that capture time dynamics, we expand the extracted features into sequences of features. As a result, we have three more feature sets: handcrafted feature sequences, convolutional autoencoder feature sequences, and spectral feature sequences. A train-validation split for K-fold cross validation is also performed here. This stage is similar to a dimensionality reduction step, so the exported data is very similar to that of stage 2 (with lower dimension).
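The train-validation split must keep all epochs of one recording on the same side, which is what the i_seizure vector from stage 2 enables. A minimal sketch of such a group-wise K-fold split (the grouping contract is from the text; the fold assignment is illustrative):

```python
import numpy as np

def group_kfold_indices(i_seizure, k):
    """Yield (train_idx, val_idx) pairs for k-fold CV grouped by recording.

    All epochs sharing an i_seizure value land in the same fold, so the
    validation set never contains epochs from a recording seen in training.
    """
    groups = np.unique(i_seizure)
    folds = np.array_split(groups, k)
    for fold_groups in folds:
        val_mask = np.isin(i_seizure, fold_groups)
        yield np.where(~val_mask)[0], np.where(val_mask)[0]

# 6 recordings, 10 epochs each, 3 folds.
i_seizure = np.repeat(np.arange(6), 10)
splits = list(group_kfold_indices(i_seizure, k=3))
```

Splitting at the recording level rather than the epoch level prevents near-duplicate overlapping epochs of the same seizure from leaking between train and validation sets.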

2.2.4. Stage 4 - model training and optimization

In stage 4, different models are designed, trained, and tested. These models include iForest, SVM, and

LSTM models for handcrafted features, an LSTM model for convolutional autoencoder features, and

different variants of LSTM models for spectral features. It is important to train a model without

overfitting or underfitting, but such conditions cannot be predicted prior to training. Therefore, model

performance metrics (sensitivity, false positive rate, and latency) are recorded during training to estimate

12

bias and variance of the model trained. Our goal is to train a model that yields a low bias and a low

variance.
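How these metrics might be computed from per-epoch detector output is sketched below; the event-level bookkeeping (a single contiguous seizure, latency measured to the first alarm) is a simplifying assumption for illustration.

```python
import numpy as np

def event_metrics(alarm, seizure_mask, fs_epochs=1.0):
    """Event-level metrics from per-epoch detector output.

    alarm, seizure_mask: boolean arrays, one entry per epoch.
    fs_epochs: epochs per second (1.0 for the 1 s epochs in the text).
    Returns (sensitivity, false_positives_per_day, latency_seconds):
      - sensitivity: 1.0 if at least one alarm falls inside the seizure
      - FP/day: alarms raised outside the seizure, scaled to 24 hours
      - latency: time from seizure onset to first in-seizure alarm
    Bookkeeping is simplified to a single contiguous seizure segment.
    """
    onset = int(np.argmax(seizure_mask))          # first seizure epoch
    hits = alarm & seizure_mask
    sensitivity = 1.0 if hits.any() else 0.0
    fp = int((alarm & ~seizure_mask).sum())
    duration_s = len(alarm) / fs_epochs
    fp_per_day = fp / duration_s * 86400
    latency = (int(np.argmax(hits)) - onset) / fs_epochs if hits.any() else np.nan
    return sensitivity, fp_per_day, latency

# 1 epoch per second; seizure spans epochs 100-159; the alarm fires at 107.
n = 3600
seizure = np.zeros(n, bool); seizure[100:160] = True
alarm = np.zeros(n, bool); alarm[107] = True
sens, fpd, lat = event_metrics(alarm, seizure)
```

In this example the single alarm 7 epochs after onset yields 100% sensitivity, 0 false positives per day, and a 7-second latency.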

2.2.5. Stage 5 - model conversion

This stage is the last stage of the pipeline. It is relatively simple since the CoreML API provides most of

the software abstractions necessary for machine learning model deployment. CoreML is Apple’s machine

learning framework that is optimized for on-device performance for both iOS and Watch OS systems,

which minimizes memory footprint and lower power consumption. It allows app to provide machine

learning prediction (whether a recording is a seizure or not) even when there is no network connection. In

this stage, CoreML models are generated using the models trained in Python framework, and CoreML

models can be directly imported into the EpiWatch app.

Figure 2.4. Flow chart of CoreML model conversion

2.3. Summary

We’ve identified the problems of existing pipeline and proposed a reformed pipeline. The new pipeline

allows us to expand collections of feature set and models quickly and efficiently. Exploration of feature

extraction methods, model implementation, validation, and fine-tuning are all made automatic and

therefore, less time-consuming. For these reasons, this pipeline strengthens many aspects of EpiWatch

including work described in the following section. Future direction for this pipeline is to make it data

flow cyclical rather than one-way (see Figure 2.5) This change will iterate every step to continuously

improve the accuracy and generalizability of the model. A sample shell script of the pipeline is shown in

Figure 2.6.

Figure 2.5. Cyclical machine learning pipeline.

Figure 2.6. A sample snippet of the pipeline shell script with the 5-stage scheme.

Chapter 3: Feature exploration

3.1. Introduction to different feature extraction methods

To build a more advanced algorithm, it is important to first re-examine our feature choices. Both accelerometer and heart rate signals contain useful information. Tachycardia is a common symptom of many types of seizures, including GTC seizures (Zijlmans, Flanagan, & Gotman, 2002). The information hidden in heart rate data is relatively straightforward; the heart rate features used in our current detector are the heart rate amplitude and its derivative. Although advanced heart rate features such as heart rate variability (HRV) could potentially contribute to the discriminating power of the detector, the sampling rate of our heart rate sensor is insufficient to support such analysis. As a result, we only explore feature extraction methods for accelerometer data.

Accelerometer data, on the other hand, contains more information, and many feature extraction methods for it have been explored in the past. They can be divided into six groups: time domain features, frequency domain features, continuous wavelet transform features, wavelet packet transform features, recurrence plot analysis, and entropy features (Van de Vel et al., 2016). Researchers usually extract as many features as possible when there is no computational limitation, and then select features using methods such as filters and wrappers to optimize performance for their task (Chandrashekar & Sahin, 2014). Since we do face a computational limitation (the restriction Apple imposes on CPU usage by Apple Watch apps), we selected features for our current detector that are sufficiently representative while computationally cheap. These features form the first feature set and the baseline for our analysis; we call them the "handcrafted features." In our preliminary testing with this first feature set, we found that there was still computational margin to spare. Therefore, we decided to explore a different feature set that is computationally more expensive but offers better insight into the structure of the data, in the hope that it would improve detector performance. In GTCS, the rhythmic movement has characteristics in time, frequency, and amplitude: it gets slower but stronger over time. An ideal feature set to capture this characteristic is therefore the set of frequency components of the accelerometer signals, extracted through the Discrete Fourier Transform (DFT). We call these features the "spectral features."

To decide which feature set offers better detector performance, we did not use popular feature selection methods (e.g., univariate selection, feature importance, correlation with target) that are independent of the classification algorithm. This is because, for feature sets that do not themselves capture time dependencies, we believed the structure in the data can only be extracted by a model that takes time dependencies into account. Therefore, we directly compared the performance of the two feature sets using the same classification algorithm and treated the winning feature set as the optimal one. In this chapter, we tested both feature sets with bidirectional LSTM neural networks with two layers of 64 hidden units, one dropout layer, and one ReLU activation layer, followed by linear layers and a log-softmax layer to produce probabilities for the two classes (seizure or non-seizure). The only difference between the networks is the input dimension. We evaluated performance according to sensitivity, specificity, latency, and computation time.

3.2. Handcrafted features

We investigated a large number of potential features derived from the 3-axis ACC and HR measurements, including time- and frequency-domain features, as well as information-theoretic measures. Each feature was evaluated using a custom-built visualization suite, which allowed manual inspection of each feature during observed seizures and non-seizure epochs. Given the computational constraints imposed on apps running natively within the Apple Watch operating system, as well as the risk of overfitting with a larger feature set, we opted for a small set of features that are easy to compute yet interpretable and still appeared to provide meaningful information about seizures. There are nine of them:


Feature 1. ACC norm, the L2-norm of the acceleration vector at each epoch, giving a picture of general

movement activity.

feature_1 = \sum_{axis=0}^{2} \sum_{n=1}^{N} x_{axis,n}^2

Feature 2, 3, 4. x-, y-, and z-axis zero-crossing rates (ZCR), the rate at which the accelerometer signal crosses zero in each axis. The hypothesis underlying these features' inclusion was that convulsive seizures entail a particular pattern of evolution in the rhythmicity of movements; hence, the dynamics of the instantaneous frequency of the accelerometry might lend specificity to the detector. The zero-crossing rate was determined per epoch, with a threshold to exclude spurious crossings generated by noise.

feature_{2,3,4} = \text{number of sign changes in an epoch}

Figure 3.1. There are 6 zero-crossing points.

Feature 5, 6, 7. x-, y-, and z-axis ZCR derivatives, the derivative of the ZCR in each axis of the accelerometer signal. These features were estimated using moving average convergence/divergence (MACD), a technical indicator from quantitative finance (Yazdi & Lashkari, n.d.), by taking the difference between two exponential smoothings of the zero-crossing rate with different time constants.

EMA_\alpha[t] = \alpha \cdot ZCR[t] + (1-\alpha) \cdot EMA_\alpha[t-1]

feature_{5,6,7} = \text{ZCR derivative} = MACD[t] = EMA_{\alpha=0.05}\,(\text{fast})[t] - EMA_{\alpha=0.005}\,(\text{slow})[t]

where EMA is the exponential moving average, α is the smoothing factor, and ZCR is the zero-crossing rate. A higher α captures short-term trends, whereas a lower α captures long-term trends.

Feature 8. Current HR. We used nearest-neighbor interpolation because the heart rate sampling frequency was lower than the rate of the detector's seizure-likelihood estimates.

feature_8 = HR[t]

Feature 9. HR derivative, estimated from consecutive HR estimates provided by Apple's proprietary API within a 150-second window, as the difference of windowed medians:

feature_9 = \text{HR derivative}(t) = \text{median}(HR[(t-30) \text{ to } t \text{ sec}]) - \text{median}(HR[(t-150) \text{ to } (t-30) \text{ sec}])

As a result of the feature extraction step, a feature matrix of dimension (N_epochs, 9) and a label vector of dimension (N_epochs, 1) are generated. Figure 3.2 shows the data and features of a GTC seizure recording. The features deviate from baseline during the seizure, suggesting that they are good predictors for seizure labeling.
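The nine handcrafted features above can be sketched in a few lines of NumPy. The snippet below is an illustrative reimplementation, not the EpiWatch production code; the noise dead-band threshold and the EMA initialization are assumptions.

```python
import numpy as np

def acc_norm(epoch):
    """Feature 1: sum of squared acceleration over all three axes in an epoch.

    epoch: array of shape (3, N) -- x, y, z accelerometer samples.
    """
    return float(np.sum(epoch ** 2))

def zero_crossing_rate(signal, noise_floor=0.01):
    """Features 2-4: number of sign changes in an epoch, with a small
    amplitude dead band to suppress spurious crossings caused by noise."""
    s = np.where(np.abs(signal) < noise_floor, 0.0, signal)
    signs = np.sign(s)
    signs = signs[signs != 0]          # ignore samples inside the dead band
    return int(np.sum(signs[1:] != signs[:-1]))

def macd(zcr_series, fast_alpha=0.05, slow_alpha=0.005):
    """Features 5-7: MACD-style ZCR derivative = fast EMA minus slow EMA."""
    def ema(x, alpha):
        out = np.empty(len(x), dtype=float)
        out[0] = x[0]                  # assumed initialization
        for t in range(1, len(x)):
            out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
        return out
    return ema(zcr_series, fast_alpha) - ema(zcr_series, slow_alpha)
```

Features 8 and 9 are simple window operations over the interpolated HR series and are omitted for brevity.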


Figure 3.2. The top nine signals are the extracted features. The bottom four signals are the raw signals (accelerometer and HR). The purple region indicates the occurrence of a GTC seizure labeled using vEEG.

We further looked at the mean and variance of each feature for seizure and non-seizure movement. This shows how the data distributions differ between the two classes, which is a good indicator of the separability of the data using these nine features. The means and variances of features extracted from seizure data are usually higher than those extracted from non-seizure data (Figure 3.3). Although a significant difference in the means is desirable, the difference in variances is generally not: the higher variances of the seizure data mean that its distribution spreads over a wider range. In other words, the seizure features can behave very differently, and sometimes they may even look like non-seizure data (near the end of a GTC seizure, when the rhythmic movement is wearing off). Note that the data has been standardized (zero mean and unit variance) prior to this analysis.



Figure 3.3. The means (top) and variances (bottom) of each feature are shown (no standardization).

3.3. Spectral features

Spectral features are extracted by applying the Discrete Fourier Transform (DFT) to a 5-second epoch. The DFT of a signal x may be defined by:

feature_k = \sum_{n=0}^{N-1} x[n] \exp\left\{-2\pi i \frac{kn}{N}\right\}, \quad k = 0, \dots, N-1

Here, feature_k is the energy of the k-th frequency component of the signal, N is the number of samples in the epoch, x[n] is the signal at time step n, and k is the index of the extracted frequency component. Since the frequency resolution is 0.2 Hz and the Nyquist frequency is 25 Hz (the sampling rate is 50 Hz), the spectral features have a dimension of 125. In Figure 3.4, we can see strong responses in the range of 2-4 Hz. Non-seizure spectral features, on the other hand, have their highest response near DC, mainly due to baseline drift of the accelerometer.
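The spectral feature extraction can be sketched with NumPy's FFT. This is an illustrative reimplementation under the parameters stated in the text (50 Hz sampling, 5-second epoch, and the 0.2-20 Hz band that yields the 100-dimensional input used later in Section 3.4); the exact band edges used on the device are an assumption.

```python
import numpy as np

FS = 50                # accelerometer sampling rate (Hz)
EPOCH_SEC = 5          # epoch length (s)
N = FS * EPOCH_SEC     # 250 samples -> 0.2 Hz frequency resolution

def spectral_features(epoch, f_lo=0.2, f_hi=20.0):
    """Band-limited magnitude spectrum of one accelerometer axis."""
    spectrum = np.abs(np.fft.rfft(epoch, n=N))   # bins 0..125 cover 0..25 Hz
    freqs = np.fft.rfftfreq(N, d=1.0 / FS)       # 0.2 Hz per bin
    band = (freqs >= f_lo) & (freqs <= f_hi)     # keep the convulsion band
    return freqs[band], spectrum[band]

# A 3 Hz sinusoid (a crude stand-in for clonic movement) should peak at 3 Hz.
t = np.arange(N) / FS
freqs, feats = spectral_features(np.sin(2 * np.pi * 3.0 * t))
```

Applying this per axis and per epoch yields the spectral feature matrix used in the remainder of the chapter.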

Figure 3.4. Left: raw accelerometer and HR from seizure and non-seizure recordings (5-second epoch). Right: FFT (spectral features) of the two epochs from 0 to 25 Hz.

Human movements have a dominant frequency component ranging between 0.48 and 2.47 Hz with a mean of 1 Hz (Mann, Wernere, & Palmer, 1989). Seizure rhythmic movements usually have a mean frequency around 3 Hz based on our preliminary analysis. The means of spectral features extracted from seizure data are usually higher than those extracted from non-seizure data, while their variances are similar. From Figure 3.5, the two classes of data have good separability even without considering time dependencies. For this reason, spectral features appear to be a better feature set than handcrafted features for capturing the chirp.


Figure 3.5. Distributions of the spectral features of seizure and non-seizure data before standardization.

In nature, birds produce chirps whose pitch shifts from a higher to a lower frequency; on a spectrogram, this looks like a backward slash (Figure 3.6). We observed a similar characteristic in GTCS. GTCS start with the tonic phase, in which the patient's limbs are extended and stiffened (Lugaresi & Cirignotta, 1981). In the clonic phase, the patient's limbs begin rhythmic movements that decrease in frequency and increase in amplitude. Since the rhythmic movements are not purely sinusoidal, we can observe harmonics. This unique localization of the characteristics in time and frequency makes spectral features an ideal candidate for a special neural network architecture: the time and frequency LSTM. In Chapter 4, we will further examine its ability to extract time and frequency dependencies.


Figure 3.6. Spectrogram constructed from spectral feature vectors (1~20 Hz). The white lines in each recording mark the start and end of the seizure, respectively. In both recordings, the chirp can be observed during the course of the seizure. In i_seizure 4, a harmonic response can also be observed.

We can look at the spectrogram in a different way by plotting the spectral feature vector at each time step as a separate trace, as shown in Figure 3.7. The y-axis is amplitude and the x-axis is frequency. The top-most trace is the spectral feature vector at the first time step; the bottom-most trace is the vector at the last time step. We can observe that the seizure onset activation happens near 4 Hz and gradually evolves toward a lower frequency near 2 Hz. Interestingly, spectral components above the onset frequency do not show this frequency shift. This is reasonable, as they likely come from data unrelated to the characteristic convulsion.
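The chirp dynamics described above can be reproduced on synthetic data. The snippet below generates a toy 4-to-2 Hz sweep with growing amplitude (an assumed stand-in for real clonic movement, not patient data) and checks that, epoch over epoch, the dominant frequency falls while the signal energy rises.

```python
import numpy as np

FS, EPOCH = 50, 5 * 50           # 50 Hz sampling, 5-second epochs (250 samples)
t = np.arange(15 * FS) / FS      # 15 s of signal

# Linear 4 -> 2 Hz sweep with amplitude growing over time: a toy model of
# the clonic-phase "chirp" (movement gets slower but stronger).
f0, f1 = 4.0, 2.0
phase = 2 * np.pi * (f0 * t + (f1 - f0) / (2 * t[-1]) * t ** 2)
x = (1.0 + t / t[-1]) * np.sin(phase)

def dominant_freq(epoch):
    """Dominant frequency of one epoch, ignoring the DC bin."""
    mag = np.abs(np.fft.rfft(epoch))
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / FS)
    return freqs[1 + np.argmax(mag[1:])]

epochs = [x[i * EPOCH:(i + 1) * EPOCH] for i in range(3)]
peak_freqs = [dominant_freq(e) for e in epochs]          # should decrease
energies = [float(np.sum(e ** 2)) for e in epochs]       # should increase
```

The same per-epoch analysis applied to real accelerometry produces the backward-slash pattern of Figure 3.6.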


Figure 3.7. Spectral feature vectors carry rich information. The y-axis is amplitude; the x-axis is frequency. The top-most trace represents the earliest spectral feature vector, and the bottom-most trace represents the most recent one. As time progresses (from top to bottom), frequency gradually decreases while amplitude gradually increases. This is consistent with what we observed in the video recordings.

3.4. LSTM neural network architecture

To extract the time dependencies of the handcrafted features and the spectral features, we designed a bidirectional LSTM neural network (NN) for each (Figure 3.8). The LSTM NN for handcrafted features has an input dimension of 9; the LSTM NN for spectral features has an input dimension of 100. Note that we had 125 spectral features given the predetermined epoch size and sampling rate, but a bandpass filter that removes signals in irrelevant frequency bands (0-0.2 Hz and 20-25 Hz) allowed us to discard the redundant spectral features. Both the handcrafted features and the spectral features are standardized (zero mean and unit variance) before being fed into their LSTM NNs. The LSTM NN for handcrafted features has 2 hidden layers with 32 hidden nodes each, a dropout layer to prevent overfitting, and 2 linear layers followed by a softmax layer to produce a probability output for each time step. As a result, this LSTM NN outputs a sequence of probabilities for the two classes. Cross entropy is used to compute the loss. During training, the network is unrolled 10 time steps for backpropagation.
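The output stage described here, per-time-step softmax probabilities trained with cross entropy, can be sketched in NumPy. The LSTM body is omitted and the logits below are made up for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sequence_cross_entropy(logits, labels):
    """Mean cross entropy over a sequence of per-time-step predictions.

    logits: (T, 2) raw scores for (non-seizure, seizure) at each step.
    labels: (T,)   0/1 ground-truth label for each step.
    """
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())

# Two confident, correct time steps -> loss near zero.
logits = np.array([[3.0, -3.0],    # step 1: non-seizure
                   [-3.0, 3.0]])   # step 2: seizure
labels = np.array([0, 1])
loss = sequence_cross_entropy(logits, labels)
```

In the actual networks, these operations are applied to the sequence of linear-layer outputs at every unrolled time step.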


The LSTM NN for spectral features is slightly different. It also has 2 hidden LSTM layers with 32 hidden nodes each, a dropout layer to prevent overfitting, and 2 linear layers followed by a softmax layer to produce a probability output for each time step. However, since there are three accelerometer channels, all three are fed through the same LSTM layers. The output of the LSTM layers is concatenated with the heart rate features (two-dimensional: current heart rate and its derivative). The concatenated information is then fed into the linear layers and the softmax layer to generate a sequence of probability outputs. Again, cross entropy is used to compute the loss, and the network is unrolled 10 time steps for backpropagation. The performance of both models is shown and compared in the next section.

Figure 3.8. The LSTM NN architectures for the handcrafted features and the spectral features. The number of hidden nodes and number of layers are the same in their LSTM layers.


3.5. Performance: sensitivity, specificity, latency, and computation time

In this section, we compare classification performance using the two feature sets. For our application, we are most concerned with true positive rate (TPR), false positive rate (FPR), and latency. Although evaluating these metrics per recording better reflects real-world performance, evaluating them per epoch provides better insight into model performance. TPR is also known as sensitivity; it is the proportion of real positives that are correctly classified (see Table 3.1 for the definitions of true positives and false positives). For our application, a higher TPR is preferred because it is important to capture all GTCS to prevent SUDEP. FPR measures how frequently false alarms will be triggered; lowering FPR while maintaining the same TPR gives patients peace of mind. Latency is measured as the difference in time between convulsion onset and detection time (Figure 3.9); minimizing latency ensures timely notification of the caregivers. TPR and FPR are defined as follows:

TPR\ (\text{sensitivity}) = \frac{TP}{TP+FN} \qquad FPR\ (1-\text{specificity}) = \frac{FP}{TN+FP}

                    predicted positive     predicted negative
real positive       true positive (TP)     false negative (FN)
real negative       false positive (FP)    true negative (TN)

Table 3.1. Definitions of TP, FP, FN, and TN.

latency = \text{detection time} - \text{onset time}

Figure 3.9. Latency is defined as the time difference between detection time and seizure onset time.
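The per-epoch metrics and the latency definition above translate directly into code. A minimal sketch follows; the real evaluation additionally aggregates over recordings and cross-validation folds.

```python
def epoch_metrics(y_true, y_pred):
    """Per-epoch TPR and FPR from binary label/prediction sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (tn + fp) if tn + fp else 0.0
    return tpr, fpr

def detection_latency(probs, onset_idx, threshold=0.5, step_sec=1.0):
    """Seconds from seizure onset to the first prediction above threshold;
    None if the seizure is missed entirely."""
    for i in range(onset_idx, len(probs)):
        if probs[i] >= threshold:
            return (i - onset_idx) * step_sec
    return None

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
tpr, fpr = epoch_metrics(y_true, y_pred)
```

With one prediction per second, `step_sec=1.0` converts index differences straight into seconds of latency.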


From Table 3.2 and Figure 3.10, we can see that spectral features outperform handcrafted features in TPR, FPR, and latency. The AUROC, being closer to 1, also suggests that spectral features provide better separability for the LSTM NN. Although the LSTM NN for spectral features has many more trainable parameters and almost double the CPU computation time, the computational cost is insignificant when deployed on the Apple Watch.

                                      Handcrafted features    Spectral features
validation TPR                        0.77688                 0.83653
validation FPR                        0.08450                 0.00112
AUROC                                 0.9221                  0.94980
validation mean detection latency
  (threshold = 0.5)                   3.2 sec                 7.0 sec
trainable parameters                  102                     4432
prediction rate                       1 prediction/sec        1 prediction/sec
average CPU time per epoch            13.796 µs               78.302 µs

Table 3.2. Classification performance of handcrafted features and spectral features.


Figure 3.10. The ROC curves and latency curves for both handcrafted and spectral feature set are

compared.

From Figure 3.11, we can see that the detector output for handcrafted features is very noisy. Although all seizures can be captured in a timely manner, many false alarms would be triggered, making the detector less useful. The detector output for spectral features, on the other hand, is much less noisy: if the threshold is set correctly, all seizures can be captured while no false alarms are triggered.

Figure 3.11. Outputs of both detectors on all 15 seizure data stacked together. Top graph: LSTM NN

output for handcrafted features. Bottom graph: LSTM NN output for spectral features.


Note that the network hyperparameters were not extensively tuned, so each neural network could achieve better performance. We treat this analysis as a feasibility study to quickly assess the representational power of the feature sets.

3.6. Summary

In this chapter, we introduced two feature sets, described methods to extract them from raw data, and evaluated their performance under the same classification algorithm. The performance of the spectral feature set is superior to that of the handcrafted feature set, and it also represents the chirp better visually. In Chapter 4, we will use the spectral features and aim to design architectures that extract the time and frequency information in them.


Chapter 4: Incorporating domain knowledge into neural network architecture

4.1. Introduction

In Chapter 3, we determined experimentally that spectral features better capture the characteristics of an epileptic convulsion. These characteristics, a rhythmic movement that decreases in frequency and increases in amplitude over time, are analogous to the chirp made by birds. We further explored the use of spectral features by designing neural network architectures that can learn the time and frequency dynamics of the seizure convulsion. The best approach is to use a neural network to model both time and frequency dynamics.

In this chapter, we introduce three new LSTM architectures that capture time and/or frequency dependencies, along with a T-LSTM architecture that serves as a baseline for comparison. The first is the frequency LSTM (F-LSTM), proposed by the author to validate that frequency modeling augments the information provided by the T-LSTM. The second is the time-frequency LSTM (TF-LSTM), proposed for a large vocabulary continuous speech recognition (LVCSR) task (Li, Mohamed, Zweig, & Gong, 2015; Sainath & Li, 2016). The third is the frequency LSTM/time LSTM (F-LSTM/T-LSTM), which consists of two parallel architectures: a T-LSTM and an F-LSTM. The intuition behind the latter two architectures will be discussed. Finally, the performance of these detectors is evaluated and the best model determined based on the performance metrics.

4.2. How to incorporate domain knowledge: chirp in motion signal in GTCS

GTCS are known to have a characteristic rhythmic movement. The movement starts in the tonic stage, progresses through the clonic stage of a GTCS, and gradually dissipates over time. Seizure initiation begins with high-frequency bursts of action potentials from a hypersynchronous neuronal population in the brain. This discharging activity can spread within the brain when there is sufficient activation to recruit surrounding neurons (Bromfield, Cavazos, & Sirven, 2006). In GTCS, activity propagates to the sensorimotor cortex, starting (usually in the proximal arms) with sudden sustained tonic posturing and stiffening, which can be followed by vigorous involuntary limb movement (Bass et al., 1995). This movement pattern begins at high frequency with low amplitude and gradually evolves to a pattern with lower frequency and higher amplitude. Although the exact amplitude and frequency ranges vary among patients, this type of dynamics holds for most GTCS.

Before designing the architecture, we needed to determine whether we can model such characteristics given the capability of our hardware. Crude preliminary testing showed that the rate of frequency change is about 0.1~0.3 Hz/sec. Since we aimed to make a prediction every second, we picked 0.2 Hz/bin as the frequency resolution of our spectral features in order to capture this pattern. The constraint on epoch size (number of samples) given this frequency resolution can be determined as follows:

\text{frequency resolution} = \frac{\text{sampling frequency}}{\text{number of samples}}

0.2\ \text{Hz} = \frac{50\ \text{Hz}}{\text{number of samples}} \implies \text{number of samples} = 250

Chirp frequency range in a GTC seizure    Duration of chirp    Rate of frequency change
2-4 Hz                                    10-20 sec            0.13 Hz/sec

Table 4.1. Observations on the characteristics of the chirp.

Prediction rate    Frequency resolution    Epoch size             Epoch overlap
1 pred/sec         0.2 Hz/bin              250 samples (5 sec)    80%

Table 4.2. Feature extraction constraints based on prior knowledge about the chirp.

This means that each column in the spectrogram needs to cover at least 250 samples (5 seconds). The prediction rate is set at once per second because timely detection is our goal. Frequency resolution, epoch size, and overlap are determined based on these observations and the predetermined prediction rate. Since most chirps last 10~20 seconds, we determined that 10 time steps (14 seconds) are sufficient to capture the dynamics, since only part of the chirp is needed to make the data discriminative enough. The spectral features extracted under these criteria were the inputs for the various LSTMs to be evaluated.
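The constraints in Tables 4.1 and 4.2 are mutually consistent, as a quick calculation shows. This is a sketch of the arithmetic only, not part of the detector itself.

```python
FS = 50                # accelerometer sampling rate (Hz)
FREQ_RES = 0.2         # target frequency resolution (Hz/bin)
PRED_PER_SEC = 1       # detector predictions per second

# frequency resolution = FS / N  =>  N = FS / frequency resolution
epoch_samples = round(FS / FREQ_RES)      # 250 samples
epoch_sec = epoch_samples / FS            # 5-second epoch

# One prediction per second means the window advances by FS samples,
# so consecutive epochs share (250 - 50) / 250 = 80% of their samples.
hop_samples = FS // PRED_PER_SEC
overlap = (epoch_samples - hop_samples) / epoch_samples
```

Note also that 10 windows spaced 1 s apart span 5 + 9 = 14 seconds, which is the figure given for 10 time steps.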

4.3. Advanced seizure onset detector using neural networks: LSTMs that capture time and frequency dynamics

LSTMs are known to be capable of learning long-term relationships. An epileptic convulsion has a characteristic pattern in the time domain as well as in the frequency domain, as shown in the last chapter. These observations motivated us to use LSTMs to capture both time and frequency dynamics. Two papers (Li et al., 2015; Sainath & Li, 2016) have shown the time-frequency LSTM to perform better than the convolutional LSTM (CLDNN). A CLDNN can achieve local translation invariance through local filters and pooling, which is ideal for many image recognition tasks because image patterns are usually repetitive and do not appear at a specific location. In our task, however, we need to determine the onset of an epileptic convulsion, and the features that represent the convulsion have a very specific location (around 3 Hz). The TF-LSTM allows us to learn the characteristic time-frequency pattern rather than similar patterns shifted in the frequency domain (e.g., a 10-to-8 Hz chirp).

We will not describe the implementations of the T-LSTM and F-LSTM separately because they are covered in the implementation of the F-LSTM/T-LSTM. In the TF-LSTM NN (Figure 4.1), we first divide the 100-dimensional spectral feature vector at each time step into 5 segments without overlap, each containing 20 successive spectral feature dimensions (0~4, 4~8, 8~12, 12~16, and 16~20 Hz). Each feature segment becomes the input at one frequency step of the F-LSTM. The bidirectional F-LSTM contains two hidden layers, each with 64 hidden units per direction. It is followed by a dropout layer, and the outputs for all three axes are summed; the T-LSTM then takes the summation as its input at each time step. The bidirectional T-LSTM also contains two hidden layers, each with 64 hidden units per direction, followed by a dropout layer. The output at each time step is passed to a perceptron layer for final classification.
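The frequency segmentation feeding the F-LSTM front end is a simple reshape. A sketch follows, with the segment boundaries as stated in the text; the array contents here are placeholders, not real spectra.

```python
import numpy as np

N_FREQ, N_SEG = 100, 5          # 100 bins covering 0.2-20 Hz, 5 frequency steps
SEG = N_FREQ // N_SEG           # 20 successive bins (a 4 Hz band) per segment

spec_vec = np.arange(N_FREQ, dtype=float)    # one time step's spectral vector
segments = spec_vec.reshape(N_SEG, SEG)      # row j = input at frequency step j

# The same reshape applies per time step of a (time, frequency) matrix:
T = 10
spectrogram = np.zeros((T, N_FREQ))
freq_steps = spectrogram.reshape(T, N_SEG, SEG)   # (time, freq step, segment)
```

Each row of `segments` is then consumed as one step of the F-LSTM recurrence, exactly as each time step is consumed by the T-LSTM.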

Figure 4.1. The TF-LSTM NN consists of two successive LSTM NNs: an F-LSTM first models frequency dynamics, and a subsequent T-LSTM models time dynamics.

In the F-LSTM/T-LSTM NN (Figure 4.2), we run an F-LSTM and a T-LSTM in parallel. The F-LSTM takes an input matrix of dimension 100 (frequency channels) by 10 (time steps); the input at each frequency step is a time series signal (of size 10). The F-LSTM contains two hidden layers, each with 5 hidden units per direction. It is followed by a dropout layer, and the output is summed over all three axes. The T-LSTM takes the same input matrix but treats the data at each time step as an input (of size 100). The T-LSTM contains two hidden layers, each with 64 hidden units per direction. It is followed by a dropout layer, and the output is summed over all three axes. At each time step, the heart rate features (heart rate amplitude and its derivative) are concatenated with the output. The output of the F-LSTM and the concatenated output of the T-LSTM are then stacked together and passed to a perceptron layer for final classification.

Figure 4.2. The F-LSTM/T-LSTM NN consists of two parallel architectures. The F-LSTM models frequency dynamics and the T-LSTM models time dynamics. Their outputs are stacked and passed to perceptron layers for final classification.

4.4. Training recipe and K-fold cross validation

Training of the LSTM NNs suffered from a small dataset (we only have 15 seizures). When we divide the data into training and validation sets, the two sets can have very different distributions, easily leading to overfitting. To mitigate this issue, we employed several regularization techniques such as dropout, batch normalization, and minibatch training (batch size = 64). All four models generally share the following architecture: one LSTM layer with dropout. The stacked information for all steps (in either time or frequency) is then fed into the linear layers and the softmax layer to generate a sequence of probability outputs. Again, cross entropy is used to compute the loss, and the network is unrolled 10 time steps or 5 frequency steps for backpropagation for 5 iterations. Note that dropout regularization is only active during training and is deactivated when evaluating on the validation set; as a result, validation error can appear lower than training error. To better diagnose the generalizability of the model, we evaluated the training error again in evaluation mode (no dropout) and found that it is similar to the validation error.

We also employed 5-fold cross-validation to evaluate the models on the limited dataset. We split the dataset into 5 groups of similar total length. Each group was in turn treated as the validation set while the remaining groups formed the training set. The performance metrics were recorded and are shown in the next section.
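The recording-level 5-fold split can be sketched as a greedy assignment that balances total recording length across folds. This is an illustrative sketch; the exact grouping procedure used in the study may differ.

```python
def make_folds(durations, k=5):
    """Greedily assign recordings to k folds, balancing total duration.

    durations: list of recording lengths (seconds).
    Returns a list of k lists of recording indices.
    """
    folds = [[] for _ in range(k)]
    totals = [0.0] * k
    # Longest-first greedy assignment keeps the fold lengths similar.
    for idx in sorted(range(len(durations)), key=lambda i: -durations[i]):
        j = totals.index(min(totals))
        folds[j].append(idx)
        totals[j] += durations[idx]
    return folds

def cv_splits(durations, k=5):
    """Yield (train_indices, val_indices) pairs, one per fold."""
    folds = make_folds(durations, k)
    for j in range(k):
        val = folds[j]
        train = [i for jj, f in enumerate(folds) if jj != j for i in f]
        yield train, val
```

Splitting at the recording level (rather than the epoch level) prevents epochs from the same seizure leaking between training and validation sets.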

4.5. Performance: sensitivity, specificity, latency, computation cost

The performance of all four architectures is listed in Table 4.3. The validation TPR, FPR, AUROC, and detection latency were broadly similar among the four LSTM variants. Notably, the T-LSTM NN, the only architecture that did not model frequency dependencies, yielded the worst overall performance. We therefore hypothesized that capturing frequency dynamics boosts the performance of the detector. Although the T-LSTM achieves good performance with low model capacity (the fewest trainable parameters) and the fastest CPU processing time, we aimed for the best-performing model regardless of computational complexity, as long as it does not hinder detection latency when deployed on the Apple Watch. For the F-LSTM/T-LSTM NN, all seizures were detected (TPR = 1 at the patient recording level). Its mean FPR is equivalent to 0.41 false alarms per day when the ratio of seizure to non-seizure activity is considered, and its mean latency is 6.87 seconds.
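How a per-prediction FPR converts into false alarms per day depends on how consecutive positive predictions are debounced into alarms, and the app's exact alarm logic is not shown here. The formula below is a rough illustrative estimate only; `alarm_refractory_sec` is a hypothetical parameter, not a documented EpiWatch setting.

```python
def false_alarms_per_day(fpr, preds_per_sec=1.0, alarm_refractory_sec=600):
    """Rough estimate of daily false alarms from a per-prediction FPR.

    Assumes at most one alarm per refractory window, so the rate is capped
    at 86400 / alarm_refractory_sec alarms per day.
    """
    false_preds_per_day = fpr * preds_per_sec * 86400  # raw false positives
    max_alarms_per_day = 86400 / alarm_refractory_sec  # debouncing cap
    return min(false_preds_per_day, max_alarms_per_day)
```

Under such a scheme, very small per-prediction FPRs (as in Table 4.3) translate into well under one alarm per day, consistent in spirit with the 0.41/day figure.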


                              T-LSTM           F-LSTM           TF-LSTM          F-LSTM/T-LSTM
validation TPR                0.7493±0.0892    0.7872±0.0911    0.7400±0.0542    0.7756±0.0775
validation FPR                0.0042±0.0042    0.0048±0.0038    0.0031±0.0011    0.0057±0.0046
AUROC                         0.9688±0.0220    0.9564±0.0297    0.9777±0.0102    0.9569±0.0267
validation detection latency
  (threshold = 0.5, sec)      9.5333±1.7333    6.9333±4.4791    9.9333±1.8785    6.8667±4.6072
trainable parameters          4432             325886           5418             85702
prediction rate               1 pred/sec       1 pred/sec       1 pred/sec       1 pred/sec
average CPU time per epoch    70.81 µs         283.19 µs        207.57 µs        296.08 µs

Table 4.3. Performance of the different LSTM approaches. The bolded cells in the original indicate the best performance among the four models.

The ROC curves and latency curves show similar performance for all models (Figure 4.3). To further examine how each model performs, a fraction (5 seizures) of the model outputs is shown in Figure 4.4. The T-LSTM NN is the most susceptible to false alarms, while the F-LSTM NN output looks closest to the ground truth labels.


Figure 4.3. The ROC curves and latency curves of the four LSTM models are compared.

Figure 4.4. LSTM NN outputs for all four LSTM models. Five seizure recordings were stacked together. The purple regions indicate the occurrence of seizures. All model outputs range from 0 to 1.

4.6. Summary

In this chapter, the domain knowledge of GTC seizures was discussed. We presented different LSTM approaches (T-LSTM, F-LSTM, TF-LSTM, F-LSTM/T-LSTM) to capture the dynamics in time and/or frequency of seizure data. We found that the F-LSTM/T-LSTM NN offered the best performance in terms of TPR, FPR, AUROC, and latency. Computationally, the F-LSTM/T-LSTM NN is not the most efficient, but the device should be able to handle its CPU usage for the same reason discussed in Chapter 3.


Chapter 5: Conclusion and future directions

5.1. Future direction

The future directions for this research project span open-source collaboration, the adoption of new sensors, and algorithm development. Open-source collaboration is highly encouraged since both the scale of the EpiWatch research and the team are growing. A multi-center study for this project has just started, which will invite more participants to contribute to the seizure data collection. As seizure data accumulate, advanced algorithms that are only feasible with large datasets can be developed. If the preprocessing steps can be standardized, a lot of overhead work can be avoided. Secondly, we are interested in exploring the use of every available sensor on the Apple Watch. The microphone is of particular interest, as patients experiencing GTCS can make a very characteristic sound: the ictal cry. The snoring sound they make once they enter the post-ictal suppression stage is also a very characteristic pattern, which can help extract the advanced information discussed in a later section. The speaker, on the other hand, can be an effective tool to improve the efficacy of intervention. We envision EpiWatch becoming the most recommended seizure detector. Not only do we want to assist early intervention, we also want to assess the severity of a patient's epilepsy. This can be done by evaluating the duration of post-ictal suppression of GTCS, as it has been shown to be an important risk factor for SUDEP. Finally, we believe detector performance can be further improved by personalizing the algorithm for each user type.

5.1.1. Build Python pip packages and unit tests for maintainability

In the EpiWatch team, we have tried to standardize frequently used code. However, this has been done by sharing the Python files that contain the useful functions. These functions are still fluid and subject to change, so there is a need for a better methodology to develop and maintain them. Therefore, we are interested in developing a Python package for the EpiWatch algorithms; packaging Python code can be done easily through pip. Unit tests, meanwhile, are important as they help developers verify the behavior of their code without diving into it. They make the coding process more robust, facilitate changes, and simplify integration. They also provide documentation, which was largely missing during the development of the isolation forest detector. It would be ideal to start this process before launching the EpiWatch app with the seizure onset detector.
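As a concrete illustration, a unit test for a shared helper might look like the following, using Python's built-in unittest framework. The helper `acc_magnitude` is hypothetical, not an actual EpiWatch function.

```python
import unittest

def acc_magnitude(x, y, z):
    """Hypothetical shared helper: Euclidean norm of one accelerometer sample."""
    return (x * x + y * y + z * z) ** 0.5

class TestAccMagnitude(unittest.TestCase):
    def test_unit_and_pythagorean_vectors(self):
        self.assertAlmostEqual(acc_magnitude(1, 0, 0), 1.0)
        self.assertAlmostEqual(acc_magnitude(0, 3, 4), 5.0)

    def test_zero_vector(self):
        self.assertEqual(acc_magnitude(0, 0, 0), 0.0)
```

Run with `python -m unittest` (or pytest); once the helpers live in a package with a `setup.py`/`pyproject.toml`, they can be pip-installed and the tests run automatically in CI before each release.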

5.1.2. The use of microphone: detector targeted for ictal cry

The ictal cry is defined as a prolonged tonic expiratory laryngeal vocalization, or a deep guttural clonic vocalization. It is strongly associated with epileptic GTCS and warrants inquiry when taking the history from witnesses of a patient's seizure (Elzawahry, Do, Lin, & Benbadis, 2010). The audio component of the ictal cry is also very characteristic and similar across patients, making it a potential marker for seizure onset detection. Utilizing the microphone in the EpiWatch app is expected to increase sensitivity and lower the FPR and latency of the detector.

5.1.3. The use of speaker: vocal instruction for timely intervention

When GTCS occur, patients can experience complications such as falling, suffocation, and loss of consciousness, which can endanger their lives in certain situations (e.g., GTCS while driving, swimming, etc.). In these cases, caregivers may not reach the patients in time, since incidents do not necessarily occur close to hospitals. It is therefore important for people near the patient to provide appropriate intervention. Simple interventions instructed through spoken prompts can be very effective at preventing serious complications such as drowning in a swimming pool. The use of the speaker in EpiWatch is therefore of great interest.

5.1.4. Evaluation of the duration of postictal generalized EEG suppression for better SUDEP risk

assessment

Postictal generalized EEG suppression (PGES) is a condition that follows the tonic-clonic phase of a convulsive seizure. PGES occurs more frequently after convulsive seizures arising from sleep, and it has thus become an important biomarker for SUDEP, since nocturnal convulsive seizures increase SUDEP risk (Lamberts et al., 2013). A study has shown that a PGES duration of greater than 20 s following GTCS can be a risk factor for SUDEP (Lhatoo et al., 2010). If EpiWatch can quantify PGES duration, it can serve as a risk assessment tool in addition to its alerting capability, which would help improve epilepsy treatment strategy.
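A toy sketch of how such a duration might be quantified: starting from seizure offset, count consecutive windows whose signal power stays below a threshold. The threshold, window size, and single-channel signal here are illustrative assumptions, not a validated PGES measure.

```python
# Toy sketch: estimating suppression duration after seizure offset by
# thresholding windowed signal power. Threshold and window size are
# illustrative assumptions, not clinically validated parameters.
import numpy as np

def suppression_duration(signal, fs, threshold, window_s=1.0):
    """Seconds of consecutive low-power windows from the start of `signal`."""
    win = int(fs * window_s)
    n_windows = len(signal) // win
    duration = 0.0
    for i in range(n_windows):
        power = np.mean(signal[i * win : (i + 1) * win] ** 2)
        if power >= threshold:
            break
        duration += window_s
    return duration

# Example: 25 s of near-flat signal followed by activity, at fs = 50 Hz.
fs = 50
quiet = 0.01 * np.ones(25 * fs)
active = np.sin(np.linspace(0, 100, 10 * fs))
dur = suppression_duration(np.concatenate([quiet, active]), fs, threshold=0.01)
print(dur)  # 25.0, which would exceed the 20 s risk cutoff noted above
```

In a real deployment the output would be compared against the 20 s cutoff reported by Lhatoo et al. (2010) to flag elevated SUDEP risk.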

5.1.5. Personalized detector

The current seizure onset detector is trained on data from multiple patients and is designed to generalize to GTCS from new patients. However, GTCS can have slightly different characteristics between patients; for example, the amplitude of a GTCS for an adult can be larger than that for a child. Although the frequency and time dynamics remain the same, the activation of the neurons in the LSTM NN can differ, making the detector less sensitive and slower to respond. We therefore believe a personalized detector can further improve performance on individuals. This task is very similar to speaker adaptation using neural networks. Specifically, it is essentially multi-task learning (a type of transfer learning), where the source and target domains are different but the source and target tasks are the same (Pan & Yang, 2010). Most machine learning methods work well under the assumption that training and testing data are drawn from the same feature space with the same distribution, and most models need to be retrained when the distribution changes. To avoid such overhead, transfer learning can be employed for efficient domain adaptation. The detector is first trained on GTCS from multiple patients, then adapted to a specific patient by continuing the training process on that patient's GTCS. Such a network converges much faster and does not overfit, compared to training only on GTCS from that specific patient.
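The adaptation scheme above can be sketched with a deliberately tiny model: pre-train on pooled "multi-patient" data, then continue training (fine-tune) on one patient's data. A NumPy logistic classifier stands in for the LSTM detector, and the synthetic data for GTCS features; both are illustrative assumptions.

```python
# Minimal sketch of the adaptation scheme: train on pooled multi-patient
# data, then fine-tune on one patient. The logistic model and synthetic
# data are toy stand-ins for the LSTM detector and real GTCS features.
import numpy as np

def train(X, y, w=None, lr=0.1, steps=200):
    """Logistic regression by gradient descent; pass `w` to fine-tune."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
# Pooled "multi-patient" data, then one patient whose features are scaled
# differently (larger amplitude), mimicking inter-patient variability.
X_pool = rng.standard_normal((200, 4))
y_pool = (X_pool[:, 0] > 0).astype(float)
X_pat = 3.0 * rng.standard_normal((20, 4))
y_pat = (X_pat[:, 0] > 0).astype(float)

w = train(X_pool, y_pool)                # generic detector
w = train(X_pat, y_pat, w=w, steps=50)   # patient-specific fine-tuning
acc = np.mean(((X_pat @ w) > 0) == (y_pat > 0.5))
print(f"patient accuracy after adaptation: {acc:.2f}")
```

The key design point mirrors the text: fine-tuning starts from the pooled-data weights rather than from scratch, so only a few patient-specific seizures are needed to adapt.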

5.2. Conclusion

This thesis has summarized the work done by the author for the EpiWatch project. It began with codebase refactoring, in the form of a pipeline similar to Kaldi. The pipeline encourages collaboration and improves maintainability. Within the framework of this pipeline, research studies such as feature set exploration and machine learning model architecture design can be performed efficiently. The two feature sets explored, handcrafted features and spectral features, were fed into LSTM NNs of similar architectures for comparison; the LSTM NN using spectral features outperformed the one using handcrafted features on all three metrics employed by the author. A more sophisticated LSTM NN was then designed and validated based on the domain knowledge of GTCS. Different LSTM variants were implemented, and their results were compared using the same metrics. The F-LSTM/T-LSTM architecture the author proposed helps model both time and frequency dependencies, and therefore yields better performance. In the future, we will continue to solidify our codebase, explore the use of different sensors on the Apple Watch, assess the duration of postictal suppression, and personalize the detector for better performance.

REFERENCES

Bass, N., Wyllie, E., Comair, Y., Kotagal, P., Ruggieri, P., & Holthausen, H. (1995). Supplementary

sensorimotor area seizures in children and adolescents. The Journal of Pediatrics, 126(4), 537–544.

https://doi.org/10.1016/S0022-3476(95)70346-2

Beniczky, S., Polster, T., Kjaer, T. W., & Hjalgrim, H. (2013). Detection of generalized tonic-clonic

seizures by a wireless wrist accelerometer: A prospective, multicenter study. Epilepsia, 54(4), e58–e61.

https://doi.org/10.1111/epi.12120

Bromfield, E. B., Cavazos, J. E., & Sirven, J. I. (2006). Basic Mechanisms Underlying Seizures and

Epilepsy. Retrieved from https://www.ncbi.nlm.nih.gov/books/NBK2510/

Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical

Engineering, 40(1), 16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024

El-Amrawy, F., & Nounou, M. I. (2015). Are Currently Available Wearable Devices for Activity

Tracking and Heart Rate Monitoring Accurate, Precise, and Medically Beneficial? Healthcare

Informatics Research, 21(4), 315–320. https://doi.org/10.4258/hir.2015.21.4.315

Elzawahry, H., Do, C. S., Lin, K., & Benbadis, S. R. (2010). The diagnostic utility of the ictal cry.

Epilepsy & Behavior: E&B, 18(3), 306–307. https://doi.org/10.1016/j.yebeh.2010.04.041

Ficker, D. M., So, E. L., Shen, W. K., Annegers, J. F., O’Brien, P. C., Cascino, G. D., & Belau, P. G.

(1998). Population-based study of the incidence of sudden unexplained death in epilepsy. Neurology,

51(5), 1270–1274. https://doi.org/10.1212/WNL.51.5.1270

Ficker, D. M. (2000). Sudden unexplained death and injury in epilepsy. Epilepsia, 41(s2), S7–S12. https://doi.org/10.1111/j.1528-1157.2000.tb01519.x

Lamberts, R. J., Gaitatzis, A., Sander, J. W., Elger, C. E., Surges, R., & Thijs, R. D. (2013). Postictal

generalized EEG suppression. Neurology, 81(14), 1252–1256.

https://doi.org/10.1212/WNL.0b013e3182a6cbeb

Lhatoo, S. D., Faulkner, H. J., Dembny, K., Trippick, K., Johnson, C., & Bird, J. M. (2010). An

electroclinical case-control study of sudden unexpected death in epilepsy. Annals of Neurology, 68(6),

787–796. https://doi.org/10.1002/ana.22101

Li, J., Mohamed, A., Zweig, G., & Gong, Y. (2015). LSTM time and frequency recurrence for automatic

speech recognition. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU),

187–191. https://doi.org/10.1109/ASRU.2015.7404793

Lugaresi, E., & Cirignotta, F. (1981). Hypnogenic Paroxysmal Dystonia: Epileptic Seizure or a New

Syndrome? Sleep, 4(2), 129–138. https://doi.org/10.1093/sleep/4.2.129

Mann, K. A., Werner, F. W., & Palmer, A. K. (1989). Frequency spectrum analysis of wrist motion for

activities of daily living. Journal of Orthopaedic Research, 7(2), 304–306.

https://doi.org/10.1002/jor.1100070219

Nashef, L., Walker, F., Allen, P., Sander, J. W., Shorvon, S. D., & Fish, D. R. (1996). Apnoea and

bradycardia during epileptic seizures: relation to sudden death in epilepsy. Journal of Neurology,

Neurosurgery, and Psychiatry, 60(3), 297–300.

Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and

Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191

Poh, M.-Z., Loddenkemper, T., Reinsberger, C., Swenson, N. C., Goyal, S., Madsen, J. R., & Picard, R.

W. (2012). Autonomic changes with seizures correlate with postictal EEG suppression. Neurology,

78(23), 1868–1876. https://doi.org/10.1212/WNL.0b013e318258f7f1

Ryvlin, P., Nashef, L., Lhatoo, S. D., Bateman, L. M., Bird, J., Bleasel, A., … Tomson, T. (2013).

Incidence and mechanisms of cardiorespiratory arrests in epilepsy monitoring units (MORTEMUS): a

retrospective study. Lancet Neurol, 12(10), 966–977. https://doi.org/10.1016/S1474-4422(13)70214-X

Sainath, T. N., & Li, B. (2016). Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks. Proceedings of Interspeech 2016, 813–817. https://doi.org/10.21437/Interspeech.2016-84

Tian, N. (2018). Active Epilepsy and Seizure Control in Adults — United States, 2013 and 2015. MMWR.

Morbidity and Mortality Weekly Report, 67. https://doi.org/10.15585/mmwr.mm6715a1

Tzallas, A. T., Tsipouras, M. G., Tsalikakis, D. G., Karvounis, E. C., Astrakas, L., Konitsiotis, S., &

Tzaphlidou, M. (2012). Automated Epileptic Seizure Detection Methods: A Review Study. Epilepsy -

Histological, Electroencephalographic and Psychological Aspects. https://doi.org/10.5772/31597

Van de Vel, A., Milosevic, M., Bonroy, B., Cuppens, K., Lagae, L., Vanrumste, B., … Ceulemans, B.

(2016). Long-term accelerometry-triggered video monitoring and detection of tonic–clonic and clonic

seizures in a home environment: Pilot study. Epilepsy & Behavior Case Reports, 5(Supplement C), 66–

71. https://doi.org/10.1016/j.ebcr.2016.03.005

Vandecasteele, K., De Cooman, T., Gu, Y., Cleeren, E., Claes, K., Paesschen, W. V., … Hunyadi, B.

(2017). Automated Epileptic Seizure Detection Based on Wearable ECG and PPG in a Hospital

Environment. Sensors, 17(10), 2338. https://doi.org/10.3390/s17102338

Yazdi, S., & Lashkari, Z. H. (n.d.). Technical analysis of Forex by MACD Indicator.

Zijlmans, M., Flanagan, D., & Gotman, J. (2002). Heart Rate Changes and ECG Abnormalities During

Epileptic Seizures: Prevalence and Definition of an Objective Clinical Sign. Epilepsia, 43(8), 847–854.

https://doi.org/10.1046/j.1528-1157.2002.37801.x

BIOGRAPHY

Michael Chan graduated from the University of California, Los Angeles, with a Bachelor of Science degree in Bioengineering. He conducted research under Dr. Wentai Liu's guidance in the Biomimetic Research Lab, where he built his favorite medical device: a noninvasive patch that captures both PPG and ECG from the chest and streams them to mobile devices. Upon graduation, he joined the Apple Inc. Health Technologies team as a biomedical R&D intern for 10 months, where he helped assess feasibility and conduct studies for health hardware and applications of the Apple Watch. He then entered Johns Hopkins University as a Biomedical Engineering MSE student and has been a graduate researcher under Dr. Nathan Crone's guidance. His primary role in this research has been to help advance the EpiWatch project by applying various machine learning techniques, including neural networks, to the seizure onset detector. He is still not satisfied with his accomplishments and knowledge in the field; to reach his goal of becoming a bioengineer who can solve biomedical problems in the best possible way, he is resolved to continue his academic journey through a PhD.