
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2016

Identifying dyslectic gaze pattern

Comparison of methods for identifying dyslectic readers based on eye movement patterns

JOAKIM LUSTIG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Master’s Thesis at CSC

Supervisors: Jens Lagergren & Hung-Son Lee

Examiner: Anders Lansner

Abstract

Dyslexia affects between 5% and 17% of all school children, making it the most common learning disability. It has been found to severely affect learning ability in school subjects and to limit the choice of further education and occupation. Since research has shown that early intervention and support can mitigate the negative effects of dyslexia, it is crucial that the diagnosis of dyslexia is easily available and aimed at the right children. To make sure that children who experience reading problems and could potentially be dyslectic are investigated for dyslexia, an easily accessible, systematic and unbiased screening method would be helpful. This thesis therefore investigates the use of machine learning methods to analyze eye movement patterns for dyslexia classification.

The results showed that it was possible to separate dyslectic from non-dyslectic readers with 83% accuracy using non-sequential, feature-based machine learning methods. Equally good results at lower sample frequencies indicate that consumer-grade eye trackers can be used for this purpose. Furthermore, a sequential approach using Recurrent Neural Networks was investigated, reaching an accuracy of 78%.

The thesis is intended as an introduction to which methods could be viable for identifying dyslexia and as an inspiration for researchers aiming to do larger studies in the area.

Referat

Identification of dyslectic gaze patterns

Dyslexia affects 5-17% of all school children, which makes it the most widespread learning disability. It has been shown that dyslexia has a negative impact on performance in primary school subjects as well as on further education and work. Since research indicates that early measures and support can alleviate the effects of dyslexia, it is of great importance that diagnosis is accessible and directed at the right children. To ensure that diagnosis meets these requirements, an accessible, systematic and objective screening method would be helpful. This report therefore examines the analysis of eye movement patterns with machine learning methods in order to identify dyslexia.

The results show that it is possible to separate dyslectics from non-dyslectics with 83% accuracy by applying non-sequential feature-based methods. The same result was also achieved with a lower sample frequency for the eye movements, which indicates that consumer-grade eye trackers can be used. Furthermore, a sequential approach using recurrent neural networks was examined, which reached 78% accuracy.

This degree project is intended to serve as an introduction to methods for identifying dyslexia and as an inspiration for researchers who intend to carry out larger studies in the area.

Acknowledgements

First and foremost I want to thank all the participants in the study. Without their contribution I would probably still be spamming emails trying to get hold of a data set. So thank you, and I hope you enjoyed your free ticket to the movies!

I would also like to thank Björn Thuresson, manager of the VIC studio, who was kind enough to accommodate my study in his studio even though it had nothing to do with him. I am grateful to Tobii for accepting me as a thesis worker and providing me with eye tracking equipment. Furthermore I would like to thank my supervisors Jens Lagergren and Hung-Son Lee for their valuable feedback, as well as my examiner Anders Lansner. Lastly, special thanks go to Mattias Benfatto, a researcher at Optolexia, for sharing some of his knowledge on how to identify dyslexia based on eye movements.

Contents

1 Introduction
1.1 Context
1.2 Problem statement
1.3 Limitations

2 Background
2.1 Eye movements
2.2 Eye movements during reading
2.3 Eye movements and dyslexia

3 Data
3.1 Data Collection
3.2 Feature Selection
3.3 Sequence Preprocessing

4 Method
4.1 Feature Approach
4.1.1 Support Vector Machines
4.1.2 Feed-Forward Neural Network
4.2 Sequence Approach
4.2.1 Recurrent Neural Networks
4.3 Naive Approach
4.4 Evaluation
4.5 Lowering the sample frequency

5 Results
5.1 Support Vector Machine
5.2 Feed-Forward Neural Network
5.3 Recurrent Neural Network
5.4 Naive approach
5.5 Predictions by test subject
5.6 Lowering the sample frequency

6 Discussion
6.1 Methods
6.2 Data
6.3 Future Research
6.4 Final remarks

Bibliography

Chapter 1

Introduction

This chapter gives a brief context to dyslexia, explains why it is an important topic and defines the aim and limitations of the thesis.

1.1 Context

The prevalence of dyslexia is estimated to range between 5% and 17% among school children in the United States, and dyslexia affects 80% of all children identified as having learning disabilities. This makes it the most common learning disability [29]. It has been shown that dyslexia severely affects learning ability in school subjects, limiting the choice of further education and occupation [12]. People with dyslexia also run a higher risk of being afflicted with psychological illness [32]. Although it is believed that dyslexia cannot be completely overcome, research has shown that early intervention and support can mitigate the negative effects [30]. This means that if the diagnosis of dyslexia could be made more available, enabling treatment at an earlier stage and for a larger share of dyslectics, the negative effects of dyslexia would be reduced. For this purpose a pre-screening tool that helps determine which children need to be examined would be helpful. If there were a reliable way to identify dyslexia based on analysis of eye movements, this could be done with eye tracking equipment.

1.2 Problem statement

This thesis investigates the use of machine learning methods to classify dyslexia based on eye movement patterns during reading. It compares two approaches, one using features extracted from eye movement data and the other using sequences of eye movement data.


1.3 Limitations

This thesis does not investigate the underlying cause of dyslexia and makes no claim as to whether dyslexia is an oculomotor or a cognitive disability. Eye movements can be relevant as a discriminatory factor regardless of whether they are a symptom or a cause of the disability.

The only kind of dyslexia studied in this report is developmental dyslexia; we do not consider dyslexia acquired through physical or psychological trauma.

The texts read by the test subjects were all in Swedish; results may differ for other languages.

There is a difference in eye movement pattern between reading aloud and silent reading [21]. All mentions of reading in the thesis refer to silent reading.

The results may also differ for younger age groups, since they have not learnt to compensate as much for their dyslexia.

Chapter 2

Background

This chapter gives an overview of the research done within the area of eye movements as well as the connection between eye movements and dyslexia.

2.1 Eye movements

Eye movements are typically divided into five main types: fixations, saccades, pursuit, vestibular and vergence.

Fixations occur when the gaze is kept in the same place. However, the eyes never stop moving entirely, since that would lead to a loss of output from the cones and rods. The duration of a fixation varies, but it generally lasts between 200 and 300 ms. The small movements that occur during fixations are divided into three subtypes: nystagmus, drift and microsaccades. Nystagmus consists of small, constant movements around the fixation point. These are believed to mitigate the loss of output from the cones and rods that occurs when they are subjected to the same input for too long. Drift is caused by the inability of the oculomotor system to keep the eyes completely still. Microsaccades compensate for the movements caused by nystagmus and drift in order to keep the fixation point.

Saccades are rapid eye movements, with velocities of up to 500 degrees per second, that are used to move from one fixation point to another. During a saccade the perception of visual input is reduced, but the reduction is not noticeable since the brain compensates for it. It is however important to note that it is only during fixations that objects are clearly visible.

Pursuit is an eye movement used to keep the gaze on a moving object. Vestibular eye movements compensate for changes of perspective due to, for example, head movements. Vergence is a movement of the eyes in horizontally opposite directions, which is used to focus on very near objects.

Since pursuit, vestibular and vergence movements are normally not used during reading, they will not be discussed further in this thesis [25].


2.2 Eye movements during reading

As stated above, the two types of eye movements used during reading are fixations and saccades. Saccades can be further divided into backward saccades, when a reader goes back to re-read earlier text, and forward saccades, which are used to progress through a text. Backward saccades are usually referred to as regressions.

[Figure: the sentence "Always remember, your focus determines your reality." with a fixation, a saccade and a regression marked.]

Figure 2.1. Visualization of fixation, saccade and regression during reading. In reality there are also many smaller eye movements occurring.

To understand why the eye movement pattern during reading is structured in this way, it is important to understand how eyesight functions. As mentioned in the previous section, the perception of visual input is drastically reduced when the eye is moving between positions, and it is only during a fixation that detailed visual information can be collected. Furthermore, the area of high acuity during a fixation is smaller than it may seem. This is because only a small area of the eye has a high enough density of cones to process the visual input with high acuity. This area is called the fovea and only takes input from the central 2° of the visual field. The area around the foveal area, up to 5°, is called the parafoveal area of vision, where the acuity is much worse. Beyond the parafoveal area is the peripheral vision, where the ability to distinguish detail, color and shape is even lower. Most of the information that can be extracted from the peripheral vision consists of movements [17].

Figure 2.2. Visualization of foveal, parafoveal and peripheral vision during reading [17].


2.3 Eye movements and dyslexia

Starting as early as 1958, studies have indicated that there is a difference in eye movement pattern during reading between dyslectic and non-dyslectic readers [31]. While some argue that there is no correlation between eye movements and dyslexia [3], most researchers in the area seem to believe that there is a difference. Some researchers have even argued that analysis of eye movements is currently the best way to measure word recognition, which is a major part of what dyslectics struggle with [28]. One study has shown that dyslexia can be classified correctly in about 80% of cases based on features extracted from eye tracking data. The study used Support Vector Machines to separate the classes based on the features reading time, mean of fixation lengths and age of the participants [26].

To separate dyslectics from non-dyslectics based on eye movement data, two approaches have been considered. One is the previously tested feature-based approach, where reliable methods for binary classification can be used effectively. The other approach is to analyze the sequences with methods suited to sequential data, such as Recurrent Neural Networks or Hidden Markov Models. In this thesis, we evaluate methods using both approaches and their potential to correctly identify dyslexia based on eye movement patterns during reading.

To choose the best possible features for the feature-based approach, a pre-study of the research on eye movements and dyslexia was conducted. Its aim was to find out how dyslectic eye movements differ from non-dyslectic ones during reading.

The features found in earlier research to separate dyslectics from non-dyslectics, as examined during the pre-study, are listed below.

• Number and length of forward saccades [1, 7, 11]

• Fixation duration [1, 7, 11]

• Number of backward saccades [24]

• Fixation stability [10]

• Saccadic amplitude/velocity [11]

In the next chapter these features will be compared and explained in detail.

Chapter 3

Data

This chapter presents how the data was collected, as well as how it was pre-processed to suit the methods used in the thesis.

3.1 Data Collection

The collected data consisted of eye-movement recordings during reading from 21 test subjects, all of whom were students at KTH.

The recordings were made with the Tobii TX300 eye tracker, a top-of-the-line device well suited for research purposes. The eye tracker has a sample rate of 300 Hz, which means that it records 300 data points per second. Each data point consists of x- and y-values in a Cartesian coordinate system representing the screen on which the text was presented to the subjects, and each sequence of data points is a reading of one entire text by one test subject.

The material that was read consisted of eight texts taken from the Swedish version of the PISA reading tests for high school students between 2000 and 2009. Between the texts, the test subjects were asked to answer comprehension questions taken from the same PISA test. The comprehension questions were included to make sure that the test subjects really read the text and did not only skim it. The texts were of different lengths; the longest was about 11 times as long as the shortest. Each reading session took about 45 minutes in total.

Some of the readings resulted in poor-quality data with few gaze samples collected. Therefore each reading of a text with more than 50% gaze loss was discarded from the data set used in the study. The high gaze loss for some subjects was most likely due to insufficient instructions on how to use the eye tracker properly, since in those cases large chunks of data were missing rather than scattered individual data points. The specific source of loss is hard to trace, since the tests were unsupervised during the reading time to limit the impact on reading behavior.

The final data set consisted of readings from 18 test subjects, 126 reading samples and 4.6 million data points. Among the 18 test subjects, 9 were dyslectic and 9 were non-dyslectic.

3.2 Feature Selection

The aim of the feature selection is to choose the features that best separate dyslectics from non-dyslectics with regard to their eye movement patterns when reading. To determine which features are significant, a survey of earlier research on eye movements and dyslexia was done; it was presented in Chapter 2.

To calculate these features for each text, a method was needed to classify each data point as saccade, fixation or neither. This is commonly known as a fixation filter. In this study the fixations and saccades were filtered out with an I-VT filter, which separates saccades from fixations based on velocity [22]. Backward and forward saccades were distinguished by defining a forward saccade as a saccade whose x-value was larger than that of the preceding fixation, and a backward saccade as a saccade whose x-value was smaller than that of the preceding fixation.

After classifying each data point as belonging to a fixation, a saccade or neither, the features were calculated as follows.

• (MFL) Mean of fixation lengths: Sum of the durations of all fixations divided by the number of fixations.

• (FF) Fixation frequency: Total number of fixations divided by the total number of data points for the reading.

• (MFS) Mean of forward saccade lengths: Sum of the distances of all forward saccades divided by the number of forward saccades.

• (FSF) Forward saccade frequency: Total number of forward saccades divided by the total number of data points for the reading.

• (FS) Fixation stability: Mean distance of movement within a fixation.

• (SA) Saccadic amplitude: Mean speed (distance divided by time taken) of the saccades.

• (RF) Regression/Backward saccade frequency: Total number of backward saccades divided by the total number of data points for the reading.
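The feature definitions above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the velocity threshold, the use of screen units per second, the function names, the reading of fixation stability as path length within a fixation, and the simplification that every sample is either a fixation or a saccade sample (gaze loss is ignored) are all hypothetical choices.

```python
import math

def ivt_is_saccade(xs, ys, rate_hz=300.0, threshold=1000.0):
    """I-VT: flag a sample as saccade if the speed since the previous sample exceeds threshold."""
    flags = [False]  # the first sample has no predecessor, so it is treated as fixation
    for i in range(1, len(xs)):
        speed = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1]) * rate_hz
        flags.append(speed > threshold)
    return flags

def runs(flags, value):
    """Return (start, end) index pairs, end exclusive, of consecutive samples equal to value."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag == value and start is None:
            start = i
        elif flag != value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out

def mean(values):
    return sum(values) / len(values) if values else 0.0

def reading_features(xs, ys, rate_hz=300.0, threshold=1000.0):
    """Compute the seven features for one reading (one text read by one subject)."""
    if not xs:
        raise ValueError("a reading needs at least one gaze sample")
    flags = ivt_is_saccade(xs, ys, rate_hz, threshold)
    fixations, saccades = runs(flags, False), runs(flags, True)
    n = len(xs)
    forward_lengths, backward_count, speeds = [], 0, []
    for start, end in saccades:
        # Displacement from the last fixation sample before the saccade to its last sample.
        dx = xs[end - 1] - xs[start - 1]
        dy = ys[end - 1] - ys[start - 1]
        distance = math.hypot(dx, dy)
        speeds.append(distance * rate_hz / (end - start))
        if dx > 0:
            forward_lengths.append(distance)
        elif dx < 0:
            backward_count += 1
    # Path length travelled inside each fixation (one possible reading of "fixation stability").
    paths = [
        sum(math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) for i in range(start, end - 1))
        for start, end in fixations
    ]
    return {
        "MFL": mean([(end - start) / rate_hz for start, end in fixations]),
        "FF": len(fixations) / n,
        "MFS": mean(forward_lengths),
        "FSF": len(forward_lengths) / n,
        "FS": mean(paths),
        "SA": mean(speeds),
        "RF": backward_count / n,
    }

# Synthetic reading: four fixations, two forward jumps and one regression.
xs = [100.0] * 60 + [200.0] * 60 + [300.0] * 60 + [150.0] * 60
ys = [0.0] * len(xs)
features = reading_features(xs, ys)
```

Because the frequencies are normalized by the number of data points, they are comparable between texts of different lengths, which matters here since the longest text was about 11 times as long as the shortest.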

Each data point $X_i$ was scaled to be in the range 0 to 1, as follows,

$$Z_i = \frac{X_i - \min(X)}{\max(X) - \min(X)}$$

where $X_i$ is the original value and $Z_i$ is the scaled value.
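As a small illustration, the scaling can be written as follows. The function name and the handling of a constant feature (all values mapped to 0) are assumptions made for this sketch.

```python
def min_max_scale(values):
    """Scale a list of feature values linearly into the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2.0, 4.0, 6.0])
```

Applying the function to each feature separately puts all features on the same scale, so that no single feature dominates a distance-based classifier merely because of its units.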


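Feature correlation

| Feature | MFL | FF | MFS | FSF | FS | SA | RF |
|---|---|---|---|---|---|---|---|
| MFL | 1.00 | -0.35 | -0.06 | -0.39 | 0.16 | -0.11 | -0.15 |
| FF | -0.35 | 1.00 | 0.50 | 0.71 | 0.11 | 0.64 | 0.76 |
| MFS | -0.06 | 0.50 | 1.00 | -0.08 | 0.31 | 0.84 | 0.78 |
| FSF | -0.39 | 0.72 | -0.08 | 1.00 | -0.02 | 0.33 | 0.10 |
| FS | 0.16 | 0.11 | 0.31 | -0.02 | 1.00 | 0.34 | 0.16 |
| SA | -0.11 | 0.64 | 0.84 | 0.33 | 0.34 | 1.00 | 0.60 |
| RF | -0.15 | 0.76 | 0.78 | 0.10 | 0.16 | 0.60 | 1.00 |
| Class | 0.31 | -0.42 | -0.51 | -0.35 | -0.22 | -0.63 | -0.28 |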

Figure 3.1. Feature to feature and feature to class correlation.

One way to reason about which features to select is to choose the features with the highest correlation to the class but the lowest correlation to each other. For example, as shown in Figure 3.1, the feature most correlated with the class is SA, which is also highly correlated with MFS, FF and RF. This means that SA is a good choice, but that adding MFS, FF or RF to it would contribute little new information. Instead we should pair SA with a feature that is only weakly correlated with it and at the same time has a high class correlation, for example MFL. Another option would be to choose MFS, which also has a high correlation to the class, as the main feature and pair it with MFL and FSF. This gives an understanding of why some features might work well together and others not.
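This line of reasoning can be turned into a simple greedy heuristic. The sketch below is an illustration only and not a procedure described in the thesis: the scoring rule (absolute class correlation minus the largest absolute correlation with an already chosen feature) and the function names are assumptions. The numbers are taken from Figure 3.1, using the upper triangle of the table where the two halves differ slightly.

```python
CLASS_CORR = {"MFL": 0.31, "FF": -0.42, "MFS": -0.51, "FSF": -0.35,
              "FS": -0.22, "SA": -0.63, "RF": -0.28}

PAIR_CORR = {
    ("MFL", "FF"): -0.35, ("MFL", "MFS"): -0.06, ("MFL", "FSF"): -0.39,
    ("MFL", "FS"): 0.16, ("MFL", "SA"): -0.11, ("MFL", "RF"): -0.15,
    ("FF", "MFS"): 0.50, ("FF", "FSF"): 0.71, ("FF", "FS"): 0.11,
    ("FF", "SA"): 0.64, ("FF", "RF"): 0.76,
    ("MFS", "FSF"): -0.08, ("MFS", "FS"): 0.31, ("MFS", "SA"): 0.84,
    ("MFS", "RF"): 0.78,
    ("FSF", "FS"): -0.02, ("FSF", "SA"): 0.33, ("FSF", "RF"): 0.10,
    ("FS", "SA"): 0.34, ("FS", "RF"): 0.16,
    ("SA", "RF"): 0.60,
}

def pair_corr(a, b):
    """Look up the correlation between two features regardless of key order."""
    return PAIR_CORR.get((a, b), PAIR_CORR.get((b, a)))

def greedy_select(k):
    """Pick k features: strong class correlation, weak correlation to those already chosen."""
    chosen = [max(CLASS_CORR, key=lambda f: abs(CLASS_CORR[f]))]
    while len(chosen) < min(k, len(CLASS_CORR)):
        candidates = [f for f in CLASS_CORR if f not in chosen]

        def score(f):
            redundancy = max(abs(pair_corr(f, c)) for c in chosen)
            return abs(CLASS_CORR[f]) - redundancy

        chosen.append(max(candidates, key=score))
    return chosen

selected = greedy_select(2)
```

With these numbers the heuristic starts from SA and then adds MFL, which matches the pairing suggested in the text. Other scoring rules would be equally defensible, and with only 18 subjects any such ranking should be treated as indicative.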

3.3 Sequence Preprocessing

Another way to represent the data is as the sequence of gaze points collected by the eye tracker. To make sure the methods focus on the right kind of patterns, preprocessing of the data is helpful.

The main problem with the raw sequential data addressed by the preprocessing was that the data set consisted of readings of several texts and pages of text, so the absolute positions of the gaze points are not comparable between different texts. In mathematical terms, the data can be seen as generated by a non-stationary stochastic process, whose joint probability distribution changes over time. Therefore the delta values, i.e. the positional differences between consecutive gaze points, were used instead of the absolute positions, since the delta values can be considered to be generated by a stationary stochastic process.

Furthermore, the y-values of the coordinates were of little importance to the reading pattern, since they mostly indicate when the reader switches row. The y-values may still carry some significance, but including them gives a high risk of overfitting: the more input the model is given, the easier it is to fit this particular data set rather than produce a generalizable model.


The values were then normalized to have zero mean across the whole data set as follows,

$$Z_i = X_i - \bar{X}$$

where $X_i$ is the original value, $\bar{X}$ is the mean of the data set and $Z_i$ is the normalized value.

Lastly the sequences were divided into batches (shorter sequences) of 2000 data points each. At the sampling rate of 300 Hz each batch contains about 6.7 seconds of reading. This was done partly because it improves the training of the model, both by providing more samples and by limiting the length of the sequential pattern being modeled, and partly because longer sequences did not fit in the memory of the graphics card. The choice of sequence length is supported by the claim in the original paper on Long Short-Term Memory that the method can learn patterns with time lags in excess of 1000 steps [15]. However, it assumes that the patterns that separate the classes do not span a time lag longer than 6.7 seconds.

The preprocessed sequences therefore consisted of normalized delta values of x-coordinates of the gaze points in batches of 2000 data points.

Figure 3.2. Sequence of x-coordinates (left) and delta values of x-coordinates (right) from one reading.
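The preprocessing steps can be summarized in a short sketch. It is an illustration under assumptions rather than the code used in the thesis: the function name is hypothetical, and discarding an incomplete batch at the end of each reading is a choice made here because the text does not say how leftovers were handled.

```python
def preprocess(readings, batch_len=2000):
    """Turn raw x-coordinate sequences into zero-mean delta sequences cut into batches.

    readings: list of lists of x-coordinates, one list per reading.
    """
    # Step 1: delta values between consecutive gaze points.
    deltas = [[b - a for a, b in zip(r, r[1:])] for r in readings]
    total = sum(len(d) for d in deltas)
    if total == 0:
        return []
    # Step 2: subtract the mean taken over the whole data set.
    mean = sum(sum(d) for d in deltas) / total
    # Step 3: cut each reading into batches of batch_len values (incomplete tail dropped).
    batches = []
    for d in deltas:
        centered = [v - mean for v in d]
        for start in range(0, len(centered) - batch_len + 1, batch_len):
            batches.append(centered[start:start + batch_len])
    return batches

# Duration covered by one batch at the sampling rate of the eye tracker.
seconds_per_batch = 2000 / 300  # about 6.7 seconds
```

Each batch inherits the class label of the subject who produced the reading, so a single reading contributes several training samples.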

Chapter 4

Method

Methods for classifying sequential data can roughly be divided into two approaches: one is to separate the classes based on features extracted from the sequences, and the other is to let the method analyze the sequences directly to find the patterns on which to separate the classes. Methods following both approaches are evaluated in this thesis. They are also compared to a naive method of classification. In this chapter the methods are described in more detail.

4.1 Feature Approach

By using features extracted from the sequences, standard binary classification methods can be used to separate the classes. In this thesis two such methods are considered: Support Vector Machines and Feed-Forward Neural Networks.

4.1.1 Support Vector Machines

Support Vector Machines (SVM) are widely used in classification and have proven effective for a wide range of classification problems, for example text classification [18] and face detection [23].

Classification with SVMs is done by non-linearly mapping input vectors into a high-dimensional space and then separating the classes with a linear decision surface in that space. The linear separation is based on the vectors that best represent the difference between the classes, called Support Vectors, which define the decision margin. The predicted class of a new data point is then given by weighting the influence of the Support Vectors by their similarity to the new data point, to make an aggregated judgment of the class of the data point [6].

Suppose we are given a set of feature vectors $X$, where each $X_i$ is associated with a class through a binary variable $Y_i \in \{-1, 1\}$. In the linearly separable case, our goal is then to find weights $W$ and bias $b$ that maximize the margin between the separating plane and the nearest data points, which is done by minimizing

$$\frac{1}{2} W \cdot W$$

under the constraints

$$Y_i \left( W \cdot \theta(X_i) + b \right) \geq 1, \quad \forall i$$

where $\theta(\cdot)$ is the (optional) mapping of the data. Solving this optimization problem gives the values of $W$ and $b$ that make it possible to classify a new data point $X^*$ using the indicator function

$$\mathrm{ind}(X^*) = W \cdot \theta(X^*) + b$$

whose sign gives the predicted class.

In the linearly non-separable case, when the data is noisy, it is common to add slack variables $\varepsilon_i$, weighted by a penalty parameter $C$. The slack variables allow some noisy observations to be wrongly classified, in order to avoid the overfitting that would result from fitting the data perfectly using a more complex kernel. We then minimize

$$\frac{1}{2} W \cdot W + C \sum_i \varepsilon_i$$

under the constraints

$$Y_i \left( W \cdot \theta(X_i) + b \right) \geq 1 - \varepsilon_i, \quad \varepsilon_i \geq 0, \quad \forall i$$

In order to make the computations easier, we rewrite the optimization problem in a dual formulation, which allows for a technique called the kernel trick. This lets us use mappings to a higher-dimensional space without increasing the computational cost excessively. The dual formulation instead minimizes

$$\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j K(X_i, X_j) - \sum_i \alpha_i$$

under the constraints

$$0 \leq \alpha_i \leq C, \quad \forall i, \qquad \sum_i \alpha_i t_i = 0$$

where $t_i$ denotes the class label of $X_i$ (written $Y_i$ above) and $K(\cdot, \cdot)$ is called the kernel function, which computes the similarity measure $\theta(X_i) \cdot \theta(X_j)$. The indicator function is then

$$\mathrm{ind}(X^*) = \sum_i \alpha_i t_i K(X^*, X_i) + b$$

We can then see that the indicator function assigns the class based on the weighted similarity between the new data point and the previous data points. The similarity measure is central to this calculation and is determined by the choice of kernel. Common kernel choices are polynomial, radial basis and sigmoidal [4].

The SVM implementation used in this thesis was from the Python library scikit-learn, which in this context works as a wrapper for the popular C++ library LIBSVM [5].
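To make the role of the kernel concrete, the following sketch evaluates the dual-form indicator function with a radial basis kernel in plain Python. It is an illustration only: the function names, the value of `gamma`, and in particular the hand-picked `alphas` and `bias` are assumptions. In practice these values are the result of solving the optimization problem, which a library such as scikit-learn does internally.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Radial basis kernel: similarity decays with the squared distance between a and b."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def indicator(x_new, support_vectors, labels, alphas, bias=0.0, kernel=rbf_kernel):
    """Weighted sum of similarities between x_new and each support vector, plus the bias."""
    return sum(
        alpha * label * kernel(x_new, sv)
        for alpha, label, sv in zip(alphas, labels, support_vectors)
    ) + bias

def predict(x_new, support_vectors, labels, alphas, bias=0.0):
    """The sign of the indicator function gives the predicted class (+1 or -1)."""
    return 1 if indicator(x_new, support_vectors, labels, alphas, bias) >= 0 else -1

# Two hand-picked support vectors, one per class (hypothetical values).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]

near_positive = predict((1.9, 2.1), support_vectors, labels, alphas)
near_negative = predict((0.1, 0.0), support_vectors, labels, alphas)
```

A new point is pulled towards the class of the support vectors it resembles most, which is the aggregated judgment described at the start of this section.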


Figure 4.1. Example of an SVM with 2 features and a linear kernel. The solid line is the separating hyperplane and the dashed line is the decision margin. The circles around the dots mark the support vectors.

4.1.2 Feed-Forward Neural Network

Artificial Neural Networks have been successfully applied in a wide range of areas, such as image compression [9] and speech recognition [20].

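The method involves small computational units, referred to as (artificial) neurons, which are connected to each other in layers. Each neuron sums its weighted inputs, adds a bias term, and then applies an activation function $\varphi$ to produce an output $y$:

$$y = \varphi\left( \sum_i w_i x_i + b \right)$$

[Figure: an artificial neuron. The inputs $x_1, x_2, \ldots, x_n$ are weighted by $w_1, w_2, \ldots, w_n$, summed ($\Sigma$) and passed through the activation function $\varphi$ to produce the output $y$.]

Figure 4.2. Artificial Neuron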


The activation function maps the output into a bounded range, for example between 0 and 1. Common choices are hyperbolic tangent and sigmoidal functions. The output of each neuron is then sent as input to all neurons in the next layer of the network, which in turn send their output to the following layer. Finally it reaches the output layer, where it is interpreted as the output of the network.

Training of the network is done in a supervised manner by evaluating the difference between the output of the network and the desired output, and updating the weights of the network accordingly. A common way to decide how the weights should be updated is the back-propagation algorithm, in which the aim is to minimize a total error, for example

$$E = \frac{1}{2} \sum_e \sum_j \left( y_{ej} - d_{ej} \right)^2$$

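where $e$ is an index over training examples, $j$ is an index over output units, $y$ is the actual output and $d$ the desired output. We can minimize this expression with an optimization algorithm such as gradient descent, adjusting the weights of the network along the negative gradient of the error in order to find a local minimum. To do that, the partial derivative of $E$ with respect to each weight in the network is needed. These are obtained by propagating the derivatives backwards through the network, starting with the output units (for a single training example, dropping the index $e$),

$$\frac{\partial E}{\partial y_j} = y_j - d_j$$

and then using the chain rule to compute

$$\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \frac{d y_j}{d x_j}$$

where $x_j$ is the total weighted input to neuron $j$. If the activation function of the neurons is sigmoidal,

$$y_j = \frac{1}{1 + e^{-x_j}}$$

we can differentiate it to get

$$\frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \, y_j (1 - y_j)$$

The derivative with respect to the weight $w_{ji}$ on the connection from neuron $i$ to neuron $j$ is

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{ji}} = \frac{\partial E}{\partial x_j} \, y_i$$

and the effect of the output of neuron $i$ on the error through neuron $j$ is

$$\frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial y_i} = \frac{\partial E}{\partial x_j} \, w_{ji}$$

Summing this contribution over all neurons $j$ that receive input from neuron $i$ gives the derivative with respect to the output of neuron $i$ in the preceding layer:

$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}$$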


This procedure can be repeated further and further back in the network. In the simplest version each weight is changed in proportion to the accumulated $\partial E / \partial w$,

$$\Delta w = -\varepsilon \frac{\partial E}{\partial w}$$

where $\varepsilon$ is a scaling factor called the learning rate [27].

In current applications somewhat more complicated variations of this approach are used, mainly to decrease the computational cost, but the fundamental principles are the same.
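The derivation can be followed step by step in a small plain-Python sketch of a network with one hidden layer of sigmoid neurons. It is meant as an illustration of the equations, not as the Keras implementation used in the thesis: the function names, the omission of bias terms and the default learning rate are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, inputs):
    """Propagate inputs through one hidden layer and one output layer of sigmoid neurons."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in w_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, outputs

def error(w_hidden, w_out, inputs, targets):
    """E = 1/2 * sum_j (y_j - d_j)^2 for a single training example."""
    _, outputs = forward(w_hidden, w_out, inputs)
    return 0.5 * sum((y - d) ** 2 for y, d in zip(outputs, targets))

def gradients(w_hidden, w_out, inputs, targets):
    """Back-propagate dE/dx from the output layer to the hidden layer."""
    hidden, outputs = forward(w_hidden, w_out, inputs)
    # Output layer: dE/dx_j = (y_j - d_j) * y_j * (1 - y_j)
    dx_out = [(y - d) * y * (1 - y) for y, d in zip(outputs, targets)]
    # dE/dw_ji = dE/dx_j * y_i, where y_i is the output of hidden neuron i
    grad_out = [[dx * h for h in hidden] for dx in dx_out]
    # Hidden layer: dE/dy_i = sum_j dE/dx_j * w_ji
    dy_hidden = [
        sum(dx_out[j] * w_out[j][i] for j in range(len(w_out)))
        for i in range(len(hidden))
    ]
    dx_hidden = [dy * h * (1 - h) for dy, h in zip(dy_hidden, hidden)]
    grad_hidden = [[dx * v for v in inputs] for dx in dx_hidden]
    return grad_hidden, grad_out

def gradient_step(w_hidden, w_out, inputs, targets, learning_rate=0.1):
    """Apply delta_w = -learning_rate * dE/dw to every weight."""
    grad_hidden, grad_out = gradients(w_hidden, w_out, inputs, targets)
    new_hidden = [[w - learning_rate * g for w, g in zip(wr, gr)]
                  for wr, gr in zip(w_hidden, grad_hidden)]
    new_out = [[w - learning_rate * g for w, g in zip(wr, gr)]
               for wr, gr in zip(w_out, grad_out)]
    return new_hidden, new_out
```

A useful sanity check for any implementation of back-propagation is to compare the computed derivative of a single weight with a finite-difference estimate of the same quantity.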

[Figure: a feed-forward network with an input layer ($I_1, I_2, I_3, \ldots, I_n$), a hidden layer ($H_1, \ldots, H_n$) and an output layer ($O_1, \ldots, O_n$).]

Figure 4.3. Example architecture of a feed-forward neural network.

With sufficiently many hidden units, the architecture presented above can approximate any continuous function to any desired accuracy, making the neural network a universal approximator [16].

The implementation used in this thesis was created with the framework Keras, which is written in Python and uses Theano as backend.


4.2 Sequence Approach

This approach uses the sequential data of the eye movements over time. The advantage compared to the feature approach is that the sequences contain more information, so the method might find patterns that separate the classes but are not captured by the features. The drawback is the same property: with more information, it can be harder to find what is relevant for separating the classes.

4.2.1 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a form of neural network that uses connections between the hidden units associated with different time steps to create an internal memory structure, making it suitable for processing sequences of inputs. To fully understand this part it is recommended to first read about Feed-Forward Neural Networks in section 4.1.2 above.

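[Figure: the network shown at two consecutive time steps, with input $I_t$ and $I_{t+1}$. At each step the input feeds the hidden units $H_1, \ldots, H_n$, which feed the output $O$.]

Figure 4.4. Recurrent Neural Network example with one input and one output node. The figure shows how the output of the hidden units from one time step is used as input to the next time step. This is how the Recurrent Neural Network achieves an internal memory structure.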

The RNN was first introduced in the 1980s and received a lot of scientific attention in the 1990s, when researchers tried to solve the problem known as exploding/vanishing gradients [2]. This problem made it difficult to train RNNs with gradient-based methods (such as backpropagation) on tasks that require the network to learn long-term dependencies.


In 1997 Hochreiter and Schmidhuber proposed an alternative structure for the hidden unit in RNNs, intended to solve the problem of exploding/vanishing gradients. This structure is called Long Short-Term Memory (LSTM) for its ability to learn long-term dependencies [15]. LSTM replaces each hidden unit with a block containing a cell and three gates that control the state of the cell: input, forget and output gates.

This structure gives the cell the ability to forget its state, be written to and be read from at each time step, depending on how open the gates are.

[Figure: an LSTM block with an Input Gate, an Output Gate and a Forget Gate, each receiving the input $x$ and the recurrent input $r$.]

Figure 4.5. Hidden neuron with LSTM structure. The variables: $w$ are weights, $x$ is input from new data, $r$ is recurrent input/past cell state, $y$ is output and $z$ is recurrent output. The nodes: $*$ is a multiplier, $\Sigma$ is a sum over all inputs, $\varphi$ is an activation function. The dashed lines are connections with time lag and the solid lines are normal connections.
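How the gates control the cell can be illustrated with a single scalar LSTM cell in plain Python. The sketch follows a commonly used formulation of LSTM and is not guaranteed to match the exact wiring of Figure 4.5 or of the Keras layer used in the thesis. The function name, the weight layout and the choice of sigmoid gates with a tanh candidate are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, r_prev, c_prev, weights):
    """One time step of a scalar LSTM cell.

    x: new input, r_prev: recurrent input (previous output), c_prev: previous cell state.
    weights: dict mapping "input", "forget", "output" and "candidate" to (w_x, w_r, bias).
    Returns the new output and the new cell state.
    """
    def weighted_sum(name):
        w_x, w_r, bias = weights[name]
        return w_x * x + w_r * r_prev + bias

    input_gate = sigmoid(weighted_sum("input"))     # how much new content is written
    forget_gate = sigmoid(weighted_sum("forget"))   # how much of the old state is kept
    output_gate = sigmoid(weighted_sum("output"))   # how much of the state is read out
    candidate = math.tanh(weighted_sum("candidate"))

    c_new = forget_gate * c_prev + input_gate * candidate
    y_new = output_gate * math.tanh(c_new)
    return y_new, c_new

# Gates set by their biases: keep the old state, ignore the new input, read everything out.
remember = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
            "output": (0.0, 0.0, 20.0), "candidate": (1.0, 0.0, 0.0)}
y, c = lstm_step(1.0, 0.0, 0.7, remember)
```

With the forget gate open and the input gate closed, the cell state passes through the step almost unchanged, which is what allows information to be carried across many time steps.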

In recent years RNN and LSTM have been used in many successful applications, for example in handwriting recognition [13]. Another recent development in the field of machine learning and neural networks is the success of what is referred to as deep learning. Deep learning has yielded impressive results, for example in the fields of speech recognition [14], video analysis [8] and image captioning [19]. What characterizes deep learning is the use of multiple processing layers, composed of linear or non-linear transformations, that correspond to different levels of abstraction. This is usually done by stacking several layers of the network on top of each other, but can also involve combining different methods. However, these models often require large amounts of data to achieve good results, which is not available in this study. Therefore this thesis is mainly concerned with simpler models.

The implementation used in the thesis was, as with the feed-forward neural network, created with Keras. Its Theano backend made it possible to run the computations on the GPU, considerably reducing the training time of the models.

4.3 Naive Approach

For comparison, a baseline using a naive approach to the problem was used. Since reading time is widely believed to separate dyslectics from non-dyslectics, the naive approach is to do the separation based only on the reading time for each text.

This was calculated by first taking the mean of the reading times for each text. Each reading was then assigned to a class depending on whether its reading time was above or below this mean: dyslectic if above and non-dyslectic if below. Each subject was then assigned a class based on the mean of the classes assigned to the readings of that subject.

If the investigated methods do not perform better than the baseline, there is no point in using eye tracking and complex methods to do the classification, since a stopwatch would then perform equally well.
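The baseline procedure described above can be sketched as follows. The data layout, the function names and the 0.5 cut-off used to turn the per-subject mean into a class are assumptions made for illustration. A reading time exactly equal to the text mean is treated as non-dyslectic here, a case the description leaves open.

```python
from collections import defaultdict
from statistics import mean


def naive_baseline(readings):
    """Score each subject by reading time alone.

    readings: list of (subject_id, text_id, reading_time_in_seconds).
    Returns {subject_id: fraction of that subject's readings that were
    slower than the mean reading time of the corresponding text}.
    """
    times_by_text = defaultdict(list)
    for _, text_id, seconds in readings:
        times_by_text[text_id].append(seconds)
    text_mean = {t: mean(v) for t, v in times_by_text.items()}

    votes = defaultdict(list)
    for subject_id, text_id, seconds in readings:
        # 1 = reading classified as dyslectic (slower than the text mean)
        votes[subject_id].append(1 if seconds > text_mean[text_id] else 0)
    return {s: mean(v) for s, v in votes.items()}


def classify_subjects(scores, threshold=0.5):
    """Assumed decision rule: dyslectic if most readings were slow."""
    return {
        s: "dyslectic" if score > threshold else "non-dyslectic"
        for s, score in scores.items()
    }
```

The per-subject score is the same kind of value that a reading-time baseline would report per test subject: 0 when every text was read faster than average and 1 when every text was read slower.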

4.4 Evaluation

Due to the limited amount of data, leave-one-out cross-validation is suitable. In this scheme one sample is withdrawn from the data, the model is trained on the rest of the data, and it is then tested on the withdrawn sample.

To make sure that data created by the same person was not used in both training and testing, all texts read by a test subject were withdrawn instead of only one sample, and the model was tested on each of the texts read by the withdrawn test subject.

This was iterated for all test subjects, and the results from each iteration were aggregated. Each method is therefore evaluated on the average performance of 18 models, one for each test subject. This approach maximizes the data available for training each model.
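The subject-wise variant of leave-one-out described above can be sketched in a few lines. The function `leave_one_subject_out` and the toy threshold classifier are hypothetical stand-ins written for this example. They are not the models used in the thesis, only a way to show how the splits are formed.

```python
def leave_one_subject_out(samples, train_fn, predict_fn):
    """Hold out all readings of one subject at a time.

    samples: list of (subject_id, features, label), one entry per text read.
    train_fn: takes a list of (features, label) and returns a model.
    predict_fn: takes (model, features) and returns a predicted label.
    Yields (held_out_subject, [(true_label, predicted_label), ...]).
    """
    subjects = sorted({subject for subject, _, _ in samples})
    for held_out in subjects:
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        model = train_fn(train)  # one model per withdrawn subject
        yield held_out, [(y, predict_fn(model, x)) for x, y in test]


def train_midpoint(train):
    """Toy classifier: threshold halfway between the two class means."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    return (sum(positives) / len(positives)
            + sum(negatives) / len(negatives)) / 2


def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0
```

Because every reading of the held-out subject is excluded from training, no model ever sees data from the person it is tested on. With 18 subjects this yields 18 models per method.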

4.5 Lowering the sample frequency

The practical use of this method as a pre-screening tool for dyslexia does not depend only on the performance of the classification. It is also a question of resources: the cheaper the eye tracking equipment that can be used, the more likely the method is to be practically viable.

Since one of the main differences between a high-end and a low-end eye tracker is the sample frequency, it is important to determine at what level of detail the relevant patterns in the data occur.

To simulate different sample frequencies, data was uniformly removed from the data set. In addition to the original 300 Hz, the simulated frequencies tested were 200 Hz and 100 Hz, corresponding to 66% and 33% of the original data set.
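One way to remove data uniformly is to keep a fixed number of samples out of every group of consecutive samples. The sketch below does this for a generic list of gaze samples. The exact removal scheme used in the thesis is not stated, so the function `downsample` and its keep-the-first-k rule are assumptions.

```python
def downsample(samples, keep, out_of):
    """Keep the first `keep` samples of every `out_of` consecutive samples.

    For a 300 Hz recording, keep=2, out_of=3 simulates roughly 200 Hz
    and keep=1, out_of=3 simulates 100 Hz.
    """
    if not 0 < keep <= out_of:
        raise ValueError("keep must be between 1 and out_of")
    return [s for i, s in enumerate(samples) if i % out_of < keep]


recording = list(range(9))              # stand-in for nine gaze samples
simulated_200hz = downsample(recording, 2, 3)
simulated_100hz = downsample(recording, 1, 3)
```

Note that dropping every third sample leaves unevenly spaced samples at the simulated 200 Hz, whereas a real 200 Hz tracker samples at even intervals. This is one reason to treat the simulation as an approximation.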

Chapter 5

Results

The results presented in this chapter were, as stated earlier, obtained through leave-one-out cross-validation, where one subject at a time is withdrawn and tested on. Each result is therefore the aggregate of 18 models for each method (one for each test subject).

An overview of the results is presented in the table below, and more details can be found in the following sections.

Accuracy

Metric    SVM  FFNN  LSTM  Naive
Accuracy  83%  83%   78%   78%

Figure 5.1. Accuracy for the different methods. Accuracy in this case means the percentage of test subjects correctly classified as either dyslectic or non-dyslectic.

As seen in the table above, the two feature-based methods achieved equal performance and beat the naive approach, while the sequence-based approach had a slightly lower accuracy, on par with the naive approach. To put these numbers in perspective, with 18 test subjects the difference between an accuracy of 0.78 (14 of 18) and 0.83 (15 of 18) is a difference in prediction of only one test subject.


5.1 Support Vector Machine

The hyperparameters of the SVM were tuned by a broad grid search over different kernels, penalty term sizes and kernel coefficients. The best results, which were equally good, were achieved with two separate feature sets: the first consisting of Saccadic Amplitude (SA) and Mean of Fixation Lengths (MFL), and the second of Mean of Forward Saccade lengths (MFS), Mean of Fixation Lengths (MFL) and Forward Saccade Frequency (FSF). Both sets resulted in exactly the same predictions, although SA and MFL achieved them with less complex models.

Similar and equally good results were achieved with radial basis function, sigmoidal and linear kernels. Using the simplest, linear kernel provided the added benefit that the coefficients and decision boundary can be plotted, since the data is not transformed to a higher dimension. This can be seen in the graph below.

Figure 5.2. Plot of one of the SVM models using a linear kernel and the features Saccadic Amplitude (y-axis) and Mean of Fixation Lengths (x-axis). The solid line is the separating hyperplane and the dashed lines are the decision margin. The circles around the dots mark the support vectors.
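For a linear kernel, the separating hyperplane and margin in such a plot follow directly from the learned coefficients: a point lies on the hyperplane when w·x + b = 0 and on the margin when w·x + b = ±1. The sketch below illustrates this relationship. The weights passed in are hypothetical values chosen for the example, not coefficients from the thesis, and the choice of which side is labelled dyslectic is likewise an assumption.

```python
def decision_value(w, b, x):
    """Signed value of w.x + b for a linear SVM."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def describe(w, b, x):
    """Return (predicted side, whether the point falls inside the margin)."""
    d = decision_value(w, b, x)
    side = "dyslectic" if d > 0 else "non-dyslectic"
    inside_margin = abs(d) < 1.0  # between the two dashed margin lines
    return side, inside_margin


# Hypothetical coefficients for two standardized features
w_example, b_example = (1.0, -1.0), 0.0
far_point = describe(w_example, b_example, (3.0, 0.5))
near_point = describe(w_example, b_example, (1.0, 0.6))
```

Points with an absolute decision value of at most 1 are the ones that become support vectors, which is why they appear circled on or inside the margin in a plot of this kind.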


5.2 Feed-Forward Neural Network

The structure of the simplest Feed-Forward Neural Network that achieved the best results was two-layered (input, one hidden and output layer), with 64 units in the hidden layer and a softmax output layer of two units. The training was done using the stochastic optimizer Adam on a categorical cross-entropy error term. However, similar and equally good results were achieved with other optimizers and more complex network structures.
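The two-unit softmax output and the categorical cross-entropy error term mentioned above can be written out in a few lines. This is a generic illustration of the two formulas using only the standard library, not the Keras code used in the thesis. The small constant `eps` that guards against log(0) is an implementation choice of this example.

```python
import math


def softmax(logits):
    """Turn raw output-unit activations into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Error term for one sample: -sum(target * log(predicted))."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


# Two output units, e.g. (non-dyslectic, dyslectic)
undecided = softmax([0.0, 0.0])                     # [0.5, 0.5]
loss_undecided = categorical_cross_entropy(undecided, [1, 0])
```

An undecided network that outputs 0.5 for both classes has an error of ln 2 ≈ 0.69 per sample, which gives a reference point when reading an error curve for a two-class problem.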

As with the SVM, equally good results were achieved using the two feature sets: the first consisting of Saccadic Amplitude (SA) and Mean of Fixation Lengths (MFL), and the second of Mean of Forward Saccade lengths (MFS), Mean of Fixation Lengths (MFL) and Forward Saccade Frequency (FSF).

Figure 5.3. Error term during optimization of the Feed-Forward Neural Network.

As seen in the plot above, both the training and the validation error converge after about 1000 epochs. It may seem a bit peculiar that the validation error is lower than the training error. This is most likely due to the validation set containing samples that are easier to classify than those in the training set.


5.3 Recurrent Neural Network

The structure of the best performing RNN was a two-layered network with 512 hidden units and two output units with a softmax activation, where the hidden units consisted of LSTM blocks. Deeper networks were tested, but overfitted the data heavily.

The activation functions used were hard sigmoid for the inner activations and the hyperbolic tangent function for the outer.
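As an illustration of the two activation functions named above, the sketch below uses the piecewise-linear hard sigmoid clip(0.2x + 0.5, 0, 1). That this is the exact variant applied in the thesis is an assumption. It is the definition documented for Theano-era Keras backends, and other libraries use slightly different slopes.

```python
import math


def hard_sigmoid(x):
    """Piecewise-linear approximation of the logistic sigmoid.

    Returns 0 for x <= -2.5, 1 for x >= 2.5 and 0.2 * x + 0.5 in between.
    It is cheaper to compute than the exact sigmoid.
    """
    return min(1.0, max(0.0, 0.2 * x + 0.5))


inner_activation = hard_sigmoid   # gate activations, range [0, 1]
outer_activation = math.tanh      # cell input/output activation, range (-1, 1)
```

The inner activation keeps gate values between 0 and 1, so each gate acts as a soft switch, while the outer activation keeps the values written to and read from the cell bounded and centred on zero.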

The optimization was done in the same way as with the feed-forward neural network, using the stochastic optimizer Adam on a categorical cross-entropy error term.

Figure 5.4. Error term during optimization of the RNN.

The plot above shows the change of the error term over 200 epochs of training. We can see that the training becomes unstable and prone to overfitting the training set when it runs for too long.


5.4 Naive Approach

As seen in the plot below, there is a clear correlation between the classes and the reading time for the different texts, although there are some clear outliers.

Figure 5.5. Difference in seconds between the reading time of each reading and the mean of the reading time for each text. Red lines are dyslectic and blue lines are non-dyslectic subjects.


5.5 Predictions by test subject

The predictions by test subject were exactly the same for both feature-based methods. We can also see that the sequence-based method had an output similar to that of the Feed-Forward Neural Network for many of the test subjects.

Predictions by test subject

Group          Subject ID  SVM   FFNN  LSTM  Naive
Dyslectic      1           1.00  0.73  0.62  0.13
Dyslectic      14          1.00  0.87  0.64  1.00
Dyslectic      6           1.00  0.73  0.73  0.13
Dyslectic      47          1.00  0.65  0.28  1.00
Dyslectic      4           1.00  0.85  0.45  1.00
Dyslectic      10          1.00  0.86  0.75  1.00
Dyslectic      11          1.00  0.78  0.76  1.00
Dyslectic      12          1.00  0.84  0.80  1.00
Dyslectic      13          0.00  0.00  0.11  0.14
Non-dyslectic  22          0.29  0.41  0.40  0.00
Non-dyslectic  30          0.00  0.07  0.28  0.00
Non-dyslectic  16          0.00  0.02  0.17  0.00
Non-dyslectic  21          1.00  0.67  0.46  0.71
Non-dyslectic  20          0.25  0.37  0.41  0.00
Non-dyslectic  35          0.00  0.00  0.33  0.00
Non-dyslectic  50          0.00  0.06  0.34  0.00
Non-dyslectic  51          0.00  0.20  0.46  0.25
Non-dyslectic  52          0.88  0.72  0.50  0.00

Figure 5.6. Table of predictions by test subject for the different methods. For the SVM and the Naive method, values between 0 and 1 indicate that some of the texts were predicted as dyslectic and some as non-dyslectic. For the Neural Networks, the values represent the mean of the outputs for each subject. The green and red circles mark correctly and incorrectly classified test subjects respectively.
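The per-subject values in Figure 5.6 can be turned back into the reported accuracies by thresholding each value and counting correct subjects. The sketch below does this for the SVM and Naive columns. The 0.5 threshold is an assumption, the subject IDs are as read from the table, and the LSTM column is left out because one of its values is printed as exactly 0.50, so its class depends on rounding.

```python
# (subject_id, is_dyslectic, SVM value, Naive value) as listed in Figure 5.6
ROWS = [
    (1, True, 1.00, 0.13), (14, True, 1.00, 1.00), (6, True, 1.00, 0.13),
    (47, True, 1.00, 1.00), (4, True, 1.00, 1.00), (10, True, 1.00, 1.00),
    (11, True, 1.00, 1.00), (12, True, 1.00, 1.00), (13, True, 0.00, 0.14),
    (22, False, 0.29, 0.00), (30, False, 0.00, 0.00), (16, False, 0.00, 0.00),
    (21, False, 1.00, 0.71), (20, False, 0.25, 0.00), (35, False, 0.00, 0.00),
    (50, False, 0.00, 0.00), (51, False, 0.00, 0.25), (52, False, 0.88, 0.00),
]
SVM, NAIVE = 2, 3  # column indices into each row


def subject_accuracy(rows, column, threshold=0.5):
    """Count subjects whose thresholded value matches their true group."""
    correct = sum((row[column] > threshold) == row[1] for row in rows)
    return correct, len(rows)


svm_correct, n_subjects = subject_accuracy(ROWS, SVM)
naive_correct, _ = subject_accuracy(ROWS, NAIVE)
```

Under these assumptions the SVM column gives 15 of 18 subjects correct (83%) and the Naive column 14 of 18 (78%), matching the accuracies reported in Figure 5.1.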


5.6 Lowering the sample frequency

The figure below compares the results for simulated sample frequencies of 300 Hz, 200 Hz and 100 Hz, corresponding to 100%, 66% and 33% of the data included.

Accuracy for lower sample frequencies

Data  SVM  FFNN  LSTM  Naive
100%  83%  83%   78%   78%
66%   78%  83%   78%   78%
33%   78%  83%   78%   78%

Figure 5.7. Results for the methods with 100%, 66% and 33% of the data included.

Chapter 6

Discussion

In this chapter the results are analyzed and discussed, first from the viewpoint of the methods and then with regard to the data used in the thesis. This then leads into notes about future research in the area and some final remarks.

6.1 Methods

The equal performance obtained across different methods and hyperparameter choices of the feature approach indicates that the results are at least somewhat generalizable. A sharp optimum for one of the parameter choices or methods would have suggested that the model was overfitted to the data through cherry-picking of hyperparameters, and as we saw in the results this was not the case.

Although the sequential approach did not perform as well as the feature-based one, the margin is only one test subject. The sequential approach is probably more affected by the lack of data: since the data fed to a sequential model contains more information, the model also requires more examples to learn which patterns are important before it can achieve good results on the given task. The overfitting of the more complex models, with high accuracy on training and validation data but low accuracy on test data, suggests that this might be the case. It would therefore be interesting to see larger studies evaluating this method.

Bearing the above in mind, the methods and approaches examined in this thesis may all have potential to be used in this application. Even though none of the methods tested can be dismissed, they would have to be tested on a larger data set to confirm their validity.

6.2 Data

In general the study suffered from a lack of data. Therefore some disclaimers about the data set used are important to highlight.

The first is that the results may differ for younger test subjects. This is because dyslectic readers mitigate the negative effects and improve their reading capability through practice and experience. This study also used test subjects studying at one of the top universities in Sweden, who can reasonably be assumed to have mitigated the negative effects to a larger degree than the population in general.

The second is that this is a small study, in which patterns in the data have a higher probability of occurring by chance. Therefore larger studies of the same nature would be helpful in determining whether this is a viable approach to identifying dyslexia.

Lastly, because only a small number of test subjects was available to this study, a rather long test sequence of 8 texts had to be used to get enough data. The results might be improved if the comparisons within a model were done on only one text. In that kind of study, text-specific patterns, for example in how the specific words and sentences used in the text are processed, could also play an important role in separating the classes.

Lowering the sample frequency showed that some of the models perform equally well using only a third of the data. Whether this would translate into predictions made on data from a lower-end eye tracker is uncertain, however, since more than sample frequency alone separates lower-end from higher-end eye trackers. The results should therefore be interpreted as indicating that lower-frequency data has the potential to be equally good, all else being equal.

6.3 Future Research

To confirm the viability of the methods used in this thesis, a larger study is warranted. The test subjects in the study should preferably be school children, for whom the technology is intended. The study might also benefit from diagnosing all the test subjects, to make sure that no subjects in the non-dyslectic test group are in fact dyslectic. The use of lower-end eye trackers should also be investigated further, to make sure the patterns are distinguishable not only with a lower sample frequency but also with the lower tracking accuracy and other limitations that a lower-end eye tracker usually has.

6.4 Final remarks

This thesis has shown that there might be benefits in applying machine learning methods to identify dyslexia. It has also presented different approaches and methods that could be used.

Regarding the results in general terms, a prediction rate of 83% might be good enough, since the purpose of the method is to serve as a pre-screening tool and not as a replacement for the actual diagnosis.

The results for the lower sample frequency data are also promising, since they do not rule out the possibility of using consumer-grade eye trackers to collect the data.

If a larger study were to confirm the results of this thesis, it would clearly be possible to argue for the use of eye trackers as a mandatory pre-screening tool for school children.

It is my hope that this report can serve as an introduction to what methods could be used for studies of that nature.

Bibliography

[1] D. Adler-Grinberg and L. Stark. Eye movements, scanpaths, and dyslexia. American Journal of Optometry and Physiological Optics, 55(8):557–570, 1978.

[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions, 5(2):157–166, 1994.

[3] B. Brown, G. Haegerstrom-Portnoy, A. J. Adams, C. D. Yingling, D. Galin, J. Herron, and M. Marcus. Predictive eye movements do not discriminate between dyslexic and control children. Neuropsychologia, 21(2):121–128, 1983.

[4] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2):121–167, 1998.

[5] C. C. Chang and C. J. Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 3(2):27, 2011.

[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[7] M. De Luca, E. Di Pace, A. Judica, D. Spinelli, and P. Zoccolotti. Eye movement patterns in linguistic and non-linguistic tasks in developmental surface dyslexia. Neuropsychologia, 37(12):1407–1420, 1999.

[8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[9] R. D. Dony and S. Haykin. Neural network approaches to image compression. Proceedings of the IEEE, 83(2):288–303, 1995.

[10] G. F. Eden, J. F. Stein, H. M. Wood, and F. B. Wood. Differences in eye movements and reading problems in dyslexic and normal children. Vision research, 34(10):1345–1358, 1994.

[11] B. Fischer, M. Biscaldi, and P. Otto. Saccadic eye movements of dyslexic adult subjects. Neuropsychologia, 31(9):887–906, 1993.


[12] A. Fouganthine. Dyslexi genom livet: Ett utvecklingsperspektiv på läs- och skrivsvårigheter. Master's thesis, Stockholms Universitet, 2012.

[13] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

[14] A. Graves, A. R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, pages 6645–6649, 2013.

[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[16] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

[17] H. W. Hunziker. Im Auge des Lesers: foveale und periphere Wahrnehmung - vom Buchstabieren zur Lesefreude (The eye of the reader: foveal and peripheral perception - from letter recognition to the joy of reading). Transmedia Zurich, 2006.

[18] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer Berlin Heidelberg, pages 137–142, 1998.

[19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[20] K. J. Lang, A. H. Waibel, and G. E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural networks, 1(3):23–43, 1990.

[21] A. Levy-Schoen. Flexible and/or rigid control of oculomotor scanning behavior. Eye movements: Cognition and visual perception, pages 299–314, 1981.

[22] A. Olsen. The Tobii I-VT fixation filter. Technical report, Tobii Technology, 2012.

[23] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. Computer vision and pattern recognition. Proceedings, IEEE computer society conference, pages 130–136, 1997.

[24] G. T. Pavlidis. The dyslexics erratic eye movements: Case studies. Dyslexia review, 1:22–28, 1978.

[25] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372–422, 1998.


[26] L. Rello and M. Ballesteros. Detecting readers with dyslexia using machine learning with eye tracking measures. In Proceedings of the 12th Web for All Conference, page 16, 2015.

[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1989.

[28] S. C. Sereno and K. Rayner. Measuring word recognition in reading: eye movements and event-related potentials. Trends in cognitive sciences, 7(11):489–493, 2003.

[29] S. E. Shaywitz. Dyslexia. New England Journal of Medicine, 338(5):307–312, 1998.

[30] M. J. Snowling and C. Hulme. Interventions for children's language and literacy difficulties. International Journal of Language and Communication Disorders, 47(1):27–34, 2012.

[31] M. A. Tinker. Recent studies of eye movements in reading. Psychological Bulletin, 55(4):215, 1958.

[32] A. M. Undheim. A thirteen year follow up study of young Norwegian adults with dyslexia in childhood: reading development and educational levels. Dyslexia, 15(4):291–303, 2009.
