master project(longyu)


University of Sussex

Dissertation

Building, Visualising, and Analysing Phenotypic Canine Disease

Networks: A Gaussian Graphical

Model View

Author:

Long Yu 121880

August 27, 2014



1 Abstract

Networks have proved to be a powerful tool for analysing diseases. Here we build
phenotypic canine disease networks from a manually verified database provided by
the Royal Veterinary College (RVC). The main relation between diseases that we
study is comorbidity. Because networks are an expressive and intuitive way to
represent relationships between objects, we build a network of comorbidities
using a technique called the Gaussian graphical model (GGM). The GGM captures the
correlation between diseases and, with a suitable penalty parameter, the network
keeps the most strongly correlated disease pairs and discards the weakly
correlated ones. To validate the GGM network, we introduce two further networks
based on measures of disease correlation: Relative Risk (RR) and φ-correlation,
which is Pearson's correlation for binary variables. We also carry out an expert
validation with Dr Dan O'Neill from the Royal Veterinary College and find that
the GGM network has the best precision of the three. Moreover, we find that the
Middle Level code "Skin (cutaneous) disorder finding" plays a very important role
in dog diseases: it has the highest prevalence and is more likely to be involved
in comorbidities.

Keywords: Gaussian Graphical Model, Network Science, Visualization


2 Acknowledgements

I would like to thank my supervisor, Dr Novi Quadrianto¹. He gave me clear
guidance, advice on network modelling and information on visualization
throughout my time as his student. Every time we talked about the project, he
brought new ideas and plenty of related material. Thanks to his efforts, I had
the chance to cooperate with the Royal Veterinary College (RVC) and to acquire
the dog disease dataset from them to build the networks.

I would also like to thank Dr Dan O'Neill² from RVC, who provided well-structured
canine disease data, the relevant disease hierarchy and expert validation results
for our networks. At every meeting he offered valuable suggestions from his
perspective.

Lastly, I want to thank Noel Kennedy from RVC for his much larger canine disease
dataset and the new disease structure called the Data Dictionary.

¹ http://www.sussex.ac.uk/profiles/335583
² http://www.rvc.ac.uk/staff/doneill.cfm


Contents

1 Abstract
2 Acknowledgements
3 Introduction
4 Building Canine Disease Networks
4.1 Source Data
4.2 Gaussian Graphical Model
4.3 The Relative Risk and φ-correlation measurement
4.4 Thresholds
5 Visualising Canine Disease Networks
5.1 Gephi
5.2 Generate data for GGM network and visualization
5.3 Generating data for RR and φ-correlation network and their visualization
6 Analysing networks
6.1 Networks validation
6.2 GGM network analysis
6.3 Expert validation of GGM network
6.4 Analysing illness progression on different gender
7 GGM on Large Canine Disease data
7.1 Data Structure
7.2 Data Processing
8 Future work
References
Appendices


3 Introduction

In medicine, many diseases or disorders have no clear boundaries, because one
disease may have multiple causes and can be associated with other diseases. A
disease often has several concurrent diseases, called comorbidities. Comorbidity
is the presence of one or more additional disorders co-occurring with a primary
disorder. Normally, we consider two diseases to be comorbid if they affect the
same individual more often than expected by chance.

A network offers a graph-theoretic framework for representing and exploring the
associations between disorders. Over the past decade, a number of studies have
demonstrated the ability of networks to model and analyse diseases. Goh et al.
(2007) built a human disease network showing that human genetic disorders and the
corresponding disease genes may be related to each other. Lee et al. (2008)
constructed a human disease network in which two diseases are linked if mutated
enzymes associated with them catalyze adjacent metabolic reactions. Rual et al.
(2005) carried out a network study mapping protein-protein interactions
(interactome mapping); such maps have revealed dynamic features of interactome
networks that relate to known biological properties. From a proteomic
perspective, comorbidities arise because disease-associated proteins act on the
same pathway, and Hidalgo et al. (2009) built a disorder network of human
phenotypes based on over 30 million medical records.

Studying comorbidity can help to answer biological and medical questions. For
instance, Schneeweiss et al. (2003) defined and improved the performance of
existing comorbidity scores in predicting mortality in Medicare enrollees. In
this paper, we build a Gaussian graphical model network to represent comorbidity.
The Gaussian graphical model (GGM), also known as a Gaussian concentration graph
or covariance selection model, has recently become a popular tool and is an
effective way to measure the correlation between diseases. It computes all
pairwise correlations and then draws the corresponding graph based on a specific
threshold, which is the main free parameter of the GGM. We build the phenotypic
canine disease network on the GGM so that we have a straightforward way to
interpret and analyse dog comorbidities. Another way to build the comorbidity
network is to measure disease associations directly, which can be done with the
Relative Risk and φ-correlation measurements. Both can be used to measure the
correlation between diseases, and Hidalgo et al. (2009) built their phenotypic
disease network from these two measurements.

To validate the performance of the networks, we introduce an expert validation
with Dr Dan O'Neill from RVC, a companion animal epidemiologist who worked in
general practice for 20 years and ran his own companion animal practice for 12
years. We validate the 10 disorders with the highest prevalence, and the results
show that the GGM network performs best compared with the Relative Risk and
φ-correlation networks. After that, we focus on how illness progresses in the two
genders. Using the Odds Ratio (OR) with a specific threshold, we calculate for
each disease pair how much more likely it is to occur in one gender than in the
other, and hence build the OR network.

4 Building Canine Disease Networks

4.1 Source Data

All the source data were obtained from RVC. The main file is the canine disease
dataset, whose columns include name, gender, clinic id, date of death and 429
different diseases, among other attributes. Each disease refers to a VeNom code³,
which is used in the electronic patient records of referral veterinary hospitals
and in first-opinion veterinary practice management systems. The rows are 3884
dog records, randomly selected from the overall disease dataset. The data were
manually annotated and collected from authorised pet clinics all over England
from 1 September 2009 to the middle of 2013. During this period, any patient
could join or leave at any time. The average period of treatment was 365.3 days,
the maximum was 1275.1 days and the minimum was only 1 day. In total, 378 dogs
died and 3506 did not; 2051 were male, 1817 were female and for 16 the gender was
not recorded. The average weight was 19.8 kg. In addition, 939 dogs did not have
any disorder, which means that nearly a quarter of the data could not be used in
building the comorbidity network. As the dataset is not large (3884 records) and
a number of diseases have quite a low prevalence, the conclusions we draw cannot
be completely convincing from a probabilistic perspective.

The second data file maps each disease to its body location. There are 8
different body parts: Abdomen, Anus/Perineum, Head and neck, Limb, Pelvis, Tail,
Thorax and Vertebral. For visualization, we use this mapping to draw the GGM
network in the shape of a dog and place each disease at its own body location.
Not every disease can be related to a particular body location; the few diseases
that do not belong to any location are placed outside the dog's body in the
network. This makes it intuitive to see where a disorder occurs. The third data
file maps each disease to its Middle Level code, which can be used to categorize
diseases. For example, "Abdominal finding" is the Middle Level code of "Ascites".
In total there are 70 different Middle Level terms. Each term covers several
disorders, and we use the same color for disorders with the same Middle Level
code in all networks.

³ http://www.venomcoding.org/VeNom/Welcome.html

4.2 Gaussian Graphical Model

A graph is a representation of a set of nodes (vertices) in which some pairs of
nodes are connected by links (edges). Here each node represents a canine
disorder, and an edge between two nodes represents the comorbidity between them.
Edges can be directed or undirected; as the main goal is to measure comorbidities
between dog diseases, we choose undirected edges. A graph is normally written as
G = (V, E), where V is the set of vertices and E is the set of edges. To build
the network, the main questions are which nodes should be selected and which
pairs of nodes should be connected by links. To answer them, we apply the
Gaussian graphical model (GGM) technique. It helps us to find the graph
structure: a sparse graph over the disease nodes that represents the conditional
independence properties present in the data.

The training data consist of a matrix of 3884 instances (rows) by 429 diseases
(columns). From the disease (column) perspective, each disease is a vector of
length 3884 whose elements take the value 1 or 0 (affected or not affected). The
GGM computes all pairwise correlations between disease vectors and then draws
the corresponding graph. As the name indicates, the GGM assumes that each
variable follows a Gaussian (Normal) distribution and that all the variables
together follow a multivariate Gaussian distribution. Specifically, the random
vector $X = (X_1, X_2, \ldots, X_p)$ follows a multivariate Gaussian distribution
$N_p(\mu, \Sigma)$, where $p = 429$ and both the mean $\mu$ and the covariance
matrix $\Sigma$ are unknown. The probability density function of the multivariate
Gaussian distribution is

$$f(x_1, \ldots, x_p) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$

We want to estimate the inverse covariance matrix $C = \Sigma^{-1}$, because it
is the key matrix that determines the structure of the graph. In the inverse
covariance matrix (ICM) $C = (C_{ij})$, a zero element $C_{ij} = 0$ indicates
conditional independence between the two random variables $x_i$ and $x_j$ given
all the other variables (diseases). In other words, the edge between diseases
$x_i$ and $x_j$ is absent if and only if $x_i$ and $x_j$ are conditionally
independent. Building the graph is therefore equivalent to estimating the
parameters and identifying the zeros in the ICM. This kind of problem is called
the covariance selection problem (Dempster, 1972).
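The link between zeros in the ICM and missing edges can be illustrated with a small sketch. It is not part of the original analysis: the 3 x 3 covariance matrix below is hypothetical and describes a chain x1 - x2 - x3, and `invert` is a helper written only for this example. The first and third variables are marginally correlated, yet the corresponding ICM entry is exactly zero, so the graph would have no edge between them.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a small square matrix exactly with Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical covariance of three variables forming a chain x1 - x2 - x3.
sigma = [
    [Fraction(3, 4), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(1, 2), Fraction(1), Fraction(1, 2)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)],
]
icm = invert(sigma)

# sigma[0][2] is non-zero (marginal correlation), but icm[0][2] is zero:
# x1 and x3 are conditionally independent given x2, so no edge is drawn.
print(sigma[0][2], icm[0][2])
```

Exact fractions are used here only so that the zero entry is not blurred by rounding; with real data the ICM is estimated and small entries are shrunk to zero by the penalty discussed in this section.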

The standard way to address this problem is greedy stepwise forward selection or
backward deletion. Its weakness is the large computational cost: each single step
has to consider a large number of candidate models [1]. Meinshausen and Bühlmann
(2004) proposed an approach with lower computational complexity that performs
covariance selection through neighbourhood selection for each vertex in the
graph. In all of the methods above, model selection and parameter estimation are
done separately. In this paper, we choose a penalized likelihood method that
performs model selection and parameter estimation together in the Gaussian
graphical model (Yuan and Lin, 2007). An important goal is to make the network
sparser, which means that we want more zero elements to appear in the proper
positions of the ICM. To do this, the GGM applies an L1 penalty term, inspired by
the lasso penalty in linear regression. With the penalty term applied, weak
correlations between diseases are ignored.

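To build the network, all the disease vectors are centered, so that the sample
mean of the data is zero. The samples $X_1, X_2, \ldots, X_n$ are independent and
identically distributed. As the disease vectors follow a multivariate Gaussian
distribution, the log-likelihood of $\mu$ and $C = \Sigma^{-1}$ is, up to a
constant,

$$\frac{n}{2} \ln \det C - \frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T C (X_i - \mu) \tag{1}$$

The MLE of $(\mu, \Sigma)$ is $(\bar{X}, A)$ where, treating each $X_i$ as a
column vector as in (1),

$$A = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T \tag{2}$$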

Thus, the inverse covariance matrix $C$ can be estimated by $A^{-1}$. In general,
we use the sample covariance matrix $S = nA/(n-1)$ and estimate $C$ by $S^{-1}$.
However, the number of parameters to estimate is quite large: the parameters are
the upper (or lower) triangular elements of the ICM, $p(p+1)/2$ in total (92235
for $p = 429$). With so many parameters, $S$ is not a stable estimate of
$\Sigma$. We therefore introduce the lasso penalty term discussed above in order
to make the graph sparse. The problem is now to find the minimizer $(\mu, C)$,
with $C$ a positive definite matrix, of [1]:

$$-\ln \det C + \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^T C (X_i - \mu) \quad \text{s.t.} \quad \sum_{i \neq j} |c_{ij}| \le t \tag{3}$$

Here $t \ge 0$ is a tuning parameter controlling the sparsity. As the disease
data have been centered, and $-\log \det C$ leads to the same minimizer $C$ as
$-\ln \det C$, the problem becomes that of minimizing the following expression,
which has the same form as in e.g. Banerjee et al. (2008) and Friedman et al.
(2008):

$$-\log \det C + \mathrm{tr}(SC) + \lambda \|C\|_1 \tag{4}$$

where tr denotes the trace, $\lambda$ is the tuning parameter and $\|C\|_1$ is
the L1 norm of $C$.

To solve (4), Yuan and Lin (2007) treated the problem as a determinant
maximization (maxdet) problem and solved it with an interior point algorithm,
which is quite time-consuming. In this paper, we choose the algorithm proposed by
Friedman et al. (2008), named the Graphical Lasso. They take the block coordinate
descent approach of Banerjee et al. (2007) as a starting point and then propose a
new algorithm that is extremely simple and faster than the other methods.

We use the KKT conditions to solve (4). As the problem is an unconstrained
optimization, only the stationarity condition is needed, which says that the zero
vector is an element of the subdifferential of the objective. The derivative of
$\log \det C$ with respect to $C$ is $C^{-1}$, as proved in Boyd & Vandenberghe
(2004), page 641. The KKT stationarity condition of the graphical lasso can then
be written as [2]:


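$$-W + S + \lambda \Gamma = 0 \tag{5}$$

where $\Gamma_{ij} \in \partial |C_{ij}|$ is a subgradient of the absolute value,
$\lambda$ is the tuning parameter of (4) and $W = C^{-1}$. We now solve in terms
of $W$. Note that $W_{ii} = S_{ii} + \lambda$, because $C_{ii} > 0$. Partition
$W$ and $S$ as

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}$$

where $W_{11} \in \mathbb{R}^{(p-1) \times (p-1)}$, $w_{12} \in \mathbb{R}^{(p-1) \times 1}$,
$w_{21} \in \mathbb{R}^{1 \times (p-1)}$ and $w_{22} \in \mathbb{R}$. Consider
the 12-block of the KKT conditions [3]:

$$-w_{12} + s_{12} + \lambda \gamma_{12} = 0 \tag{6}$$

From

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix} \begin{pmatrix} C_{11} & c_{12} \\ c_{12}^T & c_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & 1 \end{pmatrix} \tag{7}$$

we get $w_{12} = -W_{11} c_{12} / c_{22}$. Substituting this into (6) gives

$$W_{11} \frac{c_{12}}{c_{22}} + s_{12} + \lambda \gamma_{12} = 0 \tag{8}$$

Letting $x = c_{12} / c_{22}$, we rewrite it as

$$W_{11} x + s_{12} + \lambda \gamma = 0 \tag{9}$$

where $\gamma \in \partial \|x\|_1$. Because $c_{22} > 0$, the entries of $x$
have the same signs as those of $c_{12}$, so $\gamma_{12}$ is also a subgradient
of $\|x\|_1$. Equation (9) is the KKT stationarity condition for

$$\min_x \; \frac{1}{2} x^T W_{11} x + s_{12}^T x + \lambda \|x\|_1 \tag{10}$$

This is a lasso problem, which can be solved quickly by a coordinate descent
algorithm [3]. Once $x$ is known, we have $w_{12} = -W_{11} x$, and $c_{12}$ and
$c_{22}$ can be obtained from (7). We set $w_{21} = w_{12}^T$ and
$c_{21} = c_{12}^T$. In this way the graphical lasso problem is reduced to a
sequence of lasso problems that can be easily solved by many methods.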
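The lasso subproblem (10) can be sketched in a few lines of Python. This is only an illustration of cyclic coordinate descent under the sign convention of (9) and (10), not the implementation used for our networks (which relies on the R glasso library); the function names, the fixed number of sweeps and the 2 x 2 example block are assumptions made for the example. For each coordinate, setting the subgradient of (10) to zero gives the update $x_j = S(-(s_j + \sum_{k \neq j} W_{jk} x_k), \lambda) / W_{jj}$, where $S$ is the soft-thresholding operator.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(W11, s12, lam, sweeps=100):
    """Minimise 0.5 * x'W11x + s12'x + lam * ||x||_1 by cyclic coordinate descent."""
    size = len(s12)
    x = [0.0] * size
    for _ in range(sweeps):
        for j in range(size):
            # Gradient of the smooth part w.r.t. x_j, without the W11[j][j] * x_j term.
            rest = s12[j] + sum(W11[j][k] * x[k] for k in range(size) if k != j)
            x[j] = soft_threshold(-rest, lam) / W11[j][j]
    return x


# Hypothetical 2x2 block: with W11 = I the solution is the soft-thresholded -s12.
x = lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [-1.0, 0.05], lam=0.1)
print(x)
```

The second coordinate is driven exactly to zero because its unpenalized value is smaller than the penalty; this is the mechanism by which weak associations disappear from the ICM.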

4.3 The Relative Risk and φ-correlation measurement

We quantify the strength of a comorbidity through the correlation between the two
diseases. The measurements we choose are the Relative Risk (RR) and the
φ-correlation. Both can quantify disease associations. The RR of a pair of
diseases i and j affecting the same dog is given by

$$RR_{ij} = \frac{C_{ij} N}{P_i P_j}$$

where $C_{ij}$ is the number of dogs affected by both diseases (not to be
confused with the ICM entries of Section 4.2), $N$ is the total number of dogs,
and $P_i$ and $P_j$ are the prevalences of diseases i and j, i.e. the numbers of
dogs affected by each disease. $RR_{ij} > 1$ means that diseases i and j co-occur
more often than expected by chance, while $RR_{ij} < 1$ means that they co-occur
less often than expected by chance.

The φ-correlation, which is Pearson's correlation for binary variables, of two
diseases i and j over the same dogs is defined by

$$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}$$

$\phi_{ij} > 0$ means that comorbidity is more likely than expected by chance,
while $\phi_{ij} < 0$ means that it is less likely than expected by chance. These
two measurements are simple and effective for calculating the similarity of two
diseases: the higher the value, the stronger the correlation between the two
diseases, and vice versa. A main disadvantage of both approaches is that they
have intrinsic biases. The RR overestimates associations involving rare diseases
and underestimates associations between highly prevalent disorders [4]. As an
example of overestimation, if $P_1 = 10^2$ and $P_2 = 10^2$ are rare diseases and
the total number of dogs is $N = 10^7$, then

$$RR = \frac{C_{12} \cdot 10^7}{10^2 \cdot 10^2} = C_{12} \times 10^3$$

Even if $C_{12}$ is a small number, the RR takes quite a high value, which is
clearly an overestimate. The φ-correlation underestimates associations between
diseases with very different prevalences. For instance, assume that two diseases
are maximally correlated, so that the overlap is as large as possible:
$C_{12} = P_2$. Replacing $C_{12}$ with $P_2$, we get

$$\phi = \frac{P_2 (N - P_1)}{\sqrt{P_1 P_2 (N - P_1)(N - P_2)}} = \sqrt{\frac{P_2 (N - P_1)}{P_1 (N - P_2)}}$$

When the prevalences satisfy $P_2 \ll P_1 \ll N$, this is approximately
$\phi \approx \sqrt{P_2 / P_1}$, which is quite a small number and therefore an
underestimate. The two measurements are not totally independent of each other, as
both increase with the number of dogs affected by both diseases.
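The two biases can be checked numerically with a short sketch. The function names and the second set of counts are illustrative assumptions, not values from our dataset; the first call reproduces the worked RR example above.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """RR_ij = C_ij * N / (P_i * P_j)."""
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson's correlation for two binary variables, computed from counts."""
    return (c_ij * n - p_i * p_j) / math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))


# Overestimation by RR: two rare diseases that share a single dog.
rr_rare = relative_risk(c_ij=1, p_i=100, p_j=100, n=10**7)

# Underestimation by phi: maximal overlap (C_12 = P_2) with P_2 << P_1 << N.
phi_skewed = phi_correlation(c_ij=10, p_i=1000, p_j=10, n=10**7)

print(rr_rare, phi_skewed, math.sqrt(10 / 1000))
```

A single shared dog already gives an RR of 1000, whereas a pair with the largest possible overlap reaches a φ of only about 0.1, close to the approximation $\sqrt{P_2 / P_1}$.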


4.4 Thresholds

From a visualization perspective, not every comorbidity needs to be shown. There
is a tradeoff between the number of disease associations and their significance.
For the RR and φ-correlation networks, if we specify a high cutoff, we lose
information from the original data and preserve only a few of the most strongly
correlated comorbidities. The resulting networks are very sparse and most
diseases are completely disconnected. With a low cutoff, the networks become
extremely dense and even accidental co-occurrences in the data are shown, which
makes it hard to analyse the main disease trends. By tuning the threshold, we
hope to find a sparse solution that still adequately explains the data: in this
experiment we want to preserve a large number of nodes but relatively few links.
Preserving most nodes ensures that we do not lose much disease information, and
keeping few links lets us focus on significant comorbidities only. We therefore
plot the number of nodes against the φ-correlation threshold:

Figure 1: φ threshold vs. number of diseases

From the figure above, it can be seen that the number of nodes decreases
dramatically when φ > 0.09. Applying the cutoff φ = 0.09 preserves most of the
diseases while the number of comorbidities decreases from 5989 to 934; the
network thus becomes much sparser and the significant associations are retained.
We are now interested in the statistical significance level of this cutoff. To
check it, we apply a t-test to all the associations, with the null hypothesis
φ = 0. The t value can be calculated with the formula⁴:

$$t = \frac{\phi \sqrt{n - 2}}{\sqrt{1 - \phi^2}}$$

where n is the number of observations. For all of our data we use
$n = \max(P_i, P_j)$, which is the most stringent way in which t can be
calculated given our data. To determine the significance level of t, we consult a
table of t values: for n > 1000, any t ≥ 1.96 is significant at the 5% level and
any t ≥ 2.58 at the 1% level. In our experiment the significance level is 5%, and
the critical value can be calculated with the stats package of SciPy in Python as
stats.t.ppf(1-0.025, n). After testing all the disease pairs, most links reject
the null hypothesis, which means that the threshold φ = 0.09 ensures that most
links are significant at the 5% level.
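As a minimal sketch of this test, the t value can be computed directly from φ and the two prevalences. The function names and the example prevalences are hypothetical, and the fixed critical value 1.96 is the large-n approximation quoted above; for small n the exact critical value from the t distribution is larger.

```python
import math


def phi_t_statistic(phi, n):
    """t = phi * sqrt(n - 2) / sqrt(1 - phi^2) for n observations."""
    return phi * math.sqrt(n - 2) / math.sqrt(1.0 - phi ** 2)


def is_significant(phi, p_i, p_j, critical_value=1.96):
    """Test the null hypothesis phi = 0 using n = max(P_i, P_j), as in the text."""
    n = max(p_i, p_j)
    return phi_t_statistic(phi, n) >= critical_value


# Hypothetical pair: phi exactly at the cutoff, prevalences 1000 and 300.
t_value = phi_t_statistic(0.09, 1000)
significant = is_significant(0.09, 1000, 300)

# The same phi between two much rarer diseases is not significant.
significant_rare = is_significant(0.09, 100, 50)
print(t_value, significant, significant_rare)
```

The example shows why the choice $n = \max(P_i, P_j)$ is stringent: the same φ passes the test for prevalent diseases but fails for rare ones.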

Figure 2: RR threshold vs. number of diseases

For RR, we also plot the number of nodes against the threshold, as shown above.
It can be seen that several thresholds could be selected. Based on the result of
the hypothesis test, and keeping the number of nodes nearly the same as in the
φ-correlation network, we find that RR = 34 is a good choice of threshold. The
number of links falls to 786 and the number of nodes to 368. This time, to
confirm the significance level, we calculate the 95% confidence interval, given
by

$$[RR_{ij} \exp(-1.96\,\sigma_{ij}),\; RR_{ij} \exp(1.96\,\sigma_{ij})]$$

where $\sigma_{ij}$ is

$$\sigma_{ij} = \frac{1}{C_{ij}} + \frac{1}{P_i P_j} - \frac{1}{N} - \frac{1}{N^2}$$

The null hypothesis is RR = 1. We find that for more than half of the links the
95% confidence interval does not include 1, so the null hypothesis is rejected
for them; the threshold RR = 34 therefore ensures that the majority of links are
significant at the 5% level.

⁴ http://barabasilab.neu.edu/projects/hudine/resource/data/data.html
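The interval check can be sketched as follows. The function name and the counts are hypothetical; the expression for $\sigma_{ij}$ follows the formula above, whose minus signs are our reading of the original typeset equation.

```python
import math


def rr_confidence_interval(c_ij, p_i, p_j, n, z=1.96):
    """Return (lower, RR, upper) for the interval RR * exp(-/+ z * sigma)."""
    rr = c_ij * n / (p_i * p_j)
    sigma = 1.0 / c_ij + 1.0 / (p_i * p_j) - 1.0 / n - 1.0 / n ** 2
    return rr * math.exp(-z * sigma), rr, rr * math.exp(z * sigma)


# Hypothetical counts for one disease pair among N = 3884 dogs.
lower, rr, upper = rr_confidence_interval(c_ij=5, p_i=20, p_j=30, n=3884)

# The null hypothesis RR = 1 is rejected when the interval excludes 1.
rejects_null = lower > 1.0 or upper < 1.0
print(lower, rr, upper, rejects_null)
```

Because $1/C_{ij}$ dominates $\sigma_{ij}$, pairs that co-occur in only one or two dogs get wide intervals, which is why not every link above the threshold rejects the null hypothesis.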

In the Gaussian graphical model, the main free parameter is the penalty term. For
the former two measurements the threshold is chosen by hypothesis test; for the
GGM, in order to compare it with the RR and φ-correlation networks, we select the
penalty parameter so as to keep nearly the same numbers of nodes and links. Also,
according to the figure below, the number of nodes drops rapidly once the
threshold exceeds 0.09. Thus, we choose a penalty parameter of 0.09. The
resulting numbers of links and nodes are 869 and 379 respectively.

Figure 3: GGM threshold vs. number of diseases


5 Visualising Canine Disease Networks

5.1 Gephi

The tool we use to build the networks is Gephi⁵. It is free, interactive
visualization software for all kinds of networks and complex systems, including
dynamic and hierarchical graphs. It runs on multiple platforms such as Mac OS X,
Windows and Linux. The two primary source files from which Gephi generates a
network are the nodes file and the links file, both in CSV format. The main
advantages of Gephi are its various layouts and the ease of manipulating nodes
and links, along with their color, size and location settings. Two plugins are
particularly helpful for generating our networks. The first is the GeoLayout
plugin: after installing it, we can fix each node at any location through
longitude and latitude attributes, which act as x and y coordinates. In order to
obtain a clear, good-looking network, we place each disease at the location of
its body part. We also place many isolated nodes, using the same longitude and
latitude attributes, so as to draw the outline of a puppy. The second is the
SigmaExporter plugin, with which Gephi exports the network into a folder
containing HTML, source files and configuration files. The network can then be
viewed in a browser and deployed online. The network can be embedded in a browser
because the export uses Sigma.js⁶, a JavaScript library dedicated to graph
drawing. It makes it easy to publish networks on web pages and to integrate them
into rich web applications. In addition, a short video⁷ shows how to build the
GGM network with Gephi from our dog disease data.

5.2 Generate data for GGM network and visualization

First of all, we draw the outline of a puppy by placing many isolated nodes. This
lets users see the relation between a disease and its body location directly. The
original dog image was downloaded from the deviantART website and is a
black-and-white PNG image of 528x564 pixels. The task now is to convert the puppy
image into many nodes with x and y locations that draw

⁵ http://gephi.github.io/
⁶ http://sigmajs.org/
⁷ https://www.youtube.com/watch?v=syzgKGYYIdU&list=UUcGDb7rt_B4h1EHqRfqPL8w


the outline of a puppy. We use the Matlab Image Processing Toolbox to process the
image. To acquire the x and y coordinates, we regard the 528x564 pixel image as a
matrix and extract the location information directly from the row and column
indices of the image matrix. Another problem is that the image matrix contains
too many elements (528 × 564 = 297792), which is cumbersome for visualization.
So, to get a sparser layout, we apply the mod function to select fewer nodes. The
Matlab code that imports the image as a bitmap and subsamples it is shown below:

% import image
img = imread('puppy.png');
% select the first colour channel
img = img(:,:,1);
% initialize a white image of the same size
new_img = ones(528,564)*255;
num = 0;
for i = 1:size(img,1)
    for j = 1:size(img,2)
        % keep a black pixel only if both indices are multiples of 3
        if img(i,j) == 0
            if mod(i,3)==0 && mod(j,3)==0
                new_img(i,j) = 0;
                num = num+1;
            end
        end
    end
end
imshow(new_img)
% write the image matrix to csv
dlmwrite('puppy.csv',new_img)

To generate the network, we use the graphical lasso algorithm discussed above. It
is already implemented in an R library, so we can use it directly⁸. The main code
is shown below:

# code in code/R/glasso.r
# import glasso library
library(glasso)
all_data <- read.csv("source file", sep = ";")
# calculate covariance matrix
disease_data <- subset(all_data)
variance <- var(disease_data)

cor



not zero, then the 3rd disease and the 4th disease form a comorbidity. After
iterating over every element of the ICM, we collect all the link and node
information, which can be imported into Gephi. In the node file, the additional
attribute is MiddleLevel, which is obtained from the Middle Level mapping file.
Exported with SigmaExporter, the GGM network is shown below (it can also be
viewed online⁹):

Figure 4: GGM network

The network shows each disease together with its comorbidities and Middle Level
code. There are 9 node clusters according to body location. As for interaction,
when you hover over a disease node, its comorbidities are highlighted
automatically. When you click a node, a full description of the disease is shown
on the right-hand side. Taking "Interdigital cyst (dogs)" as an example, when you
click that disease node you see the disease information along with its
comorbidities (see Figure 5 below). You can also zoom in, zoom out and refresh
the GGM network with the three buttons below the network or with the mouse wheel.
To search for a disease, type its name in the input field of the left-hand
toolbar. The size of a node is proportional to the prevalence of the disease, and
nodes of the same color share the same Middle Level code.
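The link-extraction step described earlier in this section (iterating over the ICM and collecting the non-zero entries for Gephi) can be sketched in Python. This is an illustration rather than our actual script: the function names, the tolerance and the 3 x 3 matrix are assumptions, and the only Gephi convention relied on is that an edge table read from CSV uses Source and Target columns.

```python
import csv
import io


def icm_to_edges(icm, names, tol=1e-8):
    """List (source, target) pairs for non-zero off-diagonal ICM entries."""
    edges = []
    for i in range(len(icm)):
        # The ICM is symmetric, so the upper triangle is enough.
        for j in range(i + 1, len(icm)):
            if abs(icm[i][j]) > tol:
                edges.append((names[i], names[j]))
    return edges


def edges_to_csv(edges):
    """Serialise the edges with the Source/Target header that Gephi reads."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical 3-disease ICM: only the first two diseases are linked.
icm = [[2.0, -0.4, 0.0], [-0.4, 1.5, 0.0], [0.0, 0.0, 1.0]]
edges = icm_to_edges(icm, ["Otitis externa", "Pododermatitis", "Colitis"])
edge_csv = edges_to_csv(edges)
print(edge_csv)
```

A node table can be produced in the same way, with an Id column plus the MiddleLevel, longitude and latitude attributes used for coloring and placement.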

⁹ http://smileclinic.alwaysdata.net/long_msc2014/ggm_dog_network/


Figure 5: Interaction example of Interdigital cyst(dogs)

5.3 Generating data for RR and φ-correlation network and their visualization

The way to generate data for the RR and φ-correlation networks is slightly
different from the GGM network (main code in code/network.py). The first task is
to process the dog disease file. As we know, the disease file is an
instances-by-diseases matrix, and what we are interested in is disease
associations. So the goal is to convert the matrix into disease pairs
[(disease1, disease2), (disease2, disease3), (disease3, disease5), . . . ]. To
extract the associations, we regard each dog instance as a vector. For example, a
dog instance vector is [1,0,1,1,0,0,. . . ]; it contains 429 elements, each
standing for a certain disease, where 1 indicates that the disease was detected
and 0 that it was not. We iterate over every dog instance and enumerate the pairs
of diseases it has, so that we acquire all the disease pairs that occur. Next,
identical disease pairs are merged and counted in order to calculate the RR or φ
score. The new format of the disease pairs looks like this:
[((disease1_id, disease2_id), RR/φ score), ((disease2_id, disease3_id), RR/φ
score), . . . ]. By applying the thresholds of RR and φ (34 and 0.09
respectively), the original matrix is reduced to the final comorbidities. The
edge file gains a new attribute called weight, which is the RR/φ score of each
disease pair. In the networks below, the thickness of a link indicates the weight
of the comorbidity. Here are the φ-correlation and RR networks (they can also be
viewed online: RR network¹⁰ and φ-correlation network¹¹).

Figure 6: φ-correlation network

¹⁰ http://smileclinic.alwaysdata.net/long_msc2014/rr_dog_network/
¹¹ http://smileclinic.alwaysdata.net/long_msc2014/phi_dog_network/


Figure 7: RR network
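The pair-extraction and thresholding steps of this section can be sketched as follows. This is not the code in code/network.py: the function names and the toy records are hypothetical, and the sketch uses unordered combinations so that each disease pair is counted once per dog.

```python
from collections import Counter
from itertools import combinations


def count_disease_pairs(records):
    """Count co-occurrences C_ij and prevalences P_i from 0/1 dog records."""
    pair_counts = Counter()
    prevalence = Counter()
    for record in records:
        present = [idx for idx, flag in enumerate(record) if flag == 1]
        prevalence.update(present)
        # combinations() yields each unordered pair once, with i < j.
        pair_counts.update(combinations(present, 2))
    return pair_counts, prevalence


def rr_edges(records, threshold):
    """Return {(i, j): RR_ij} for the pairs whose RR exceeds the threshold."""
    pair_counts, prevalence = count_disease_pairs(records)
    n = len(records)
    scores = {
        pair: count * n / (prevalence[pair[0]] * prevalence[pair[1]])
        for pair, count in pair_counts.items()
    }
    return {pair: rr for pair, rr in scores.items() if rr > threshold}


# Hypothetical toy data: 4 dogs, 3 diseases.
records = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0]]
edges = rr_edges(records, threshold=1.0)
print(edges)
```

The returned score plays the role of the weight attribute in the edge file; replacing the RR expression with the φ formula gives the φ-correlation edges in the same way.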

6 Analysing networks

6.1 Networks validation

In the canine disease dataset, the 3 most prevalent diseases, with their
prevalence, are: Otitis externa - 396, Periodontal disease - 361, Anal sac
impaction - 277. They are extremely common: roughly one dog in ten has Otitis
externa, and nearly as many have Periodontal disease. The figure and table below
show the distribution of disease prevalence. 121 diseases appear only once, which
means that more than a quarter of the 429 diseases are rare in our dataset.


Figure 8: disease prevalence distribution

Prevalence   Count
1            121
3            33
4            20
5            24
6            17
7            11
12           9
15           9
9            8
13           8
8            7
10           7
22           6
17           5
. . .        . . .

To validate the GGM network, we use the RR and φ-correlation networks for
comparison and use the Jaccard index to measure similarity. The Jaccard index is
a statistic used for comparing the similarity of two finite sample sets. It is
the size of the intersection of the two sets divided by the size of their union:

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

We extract all the comorbidities from the three networks and use the Python code
below to calculate the index. The Jaccard index of the GGM and φ-correlation
networks is 0.918085106383, while that of the GGM and RR networks is
0.789189189189. The GGM network is therefore highly similar to the φ-correlation
network, which means that most of their comorbidities overlap. The result also
indicates that the GGM network is a reasonable network, as validated by the
φ-correlation network. The score for the RR network is lower than for the
φ-correlation network. This is due to the bias of the Relative Risk: RR
overestimates associations involving rare diseases, and more than a quarter of
the diseases appear only once, so its disease pairs are much more likely to be
biased. Some other differences among the three networks should also be taken into
consideration. The GGM network assumes that the diseases follow a Normal
distribution, while the other two measures have their own biases. Moreover, the
thresholds we selected are not equivalent across networks, so the numbers of
nodes and links are not exactly the same in the different networks.

#code file: code/analyse_disease_pairs.py
def jaccard_index(set_1, set_2):
    # size of the intersection divided by the size of the union
    intersection_num = len(set_1.intersection(set_2))
    return intersection_num / float(len(set_1) + len(set_2) - intersection_num)
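A usage sketch for the function above follows. Because comorbidity links are undirected, the pair (a, b) from one network must match (b, a) from another; storing each link as a frozenset is one way to guarantee this. The helper `undirected` and the two edge lists are hypothetical.

```python
def jaccard_index(set_1, set_2):
    # size of the intersection divided by the size of the union
    intersection_num = len(set_1.intersection(set_2))
    return intersection_num / float(len(set_1) + len(set_2) - intersection_num)


def undirected(pairs):
    """Store each link as a frozenset so that (a, b) and (b, a) coincide."""
    return {frozenset(pair) for pair in pairs}


# Hypothetical edge lists from two networks.
ggm_links = undirected([("Colitis", "DJD"), ("DJD", "Spondylosis")])
phi_links = undirected([("DJD", "Colitis"), ("Colitis", "Spondylosis")])
similarity = jaccard_index(ggm_links, phi_links)
print(similarity)
```

Here one link is shared out of three distinct links, so the index is 1/3; without the normalisation the shared link would be missed and the index would be 0.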

6.2 GGM network analysis

First, we analyse the Middle Level codes. The 5 most prevalent Middle Level codes
in the GGM network are listed here:

Middle Level code                    Prevalence
Skin (cutaneous) disorder finding    10.82%
Neoplasia                            9.23%
Mass lesion finding                  6.86%
Ophthalmological disorder finding    5.8%
Enteropathy                          5.28%

The table below lists the 8 most heavily connected nodes (diseases) along with
their numbers of comorbidities. Such nodes are normally called the hubs of the
network. Among them, the Middle Level code "Skin (cutaneous) disorder finding"
has the most hubs, namely "Skin (cutaneous) disorder, pigmentary", "Eosinophilic
granuloma" and "Pododermatitis".

Disease name                             Number of comorbidities
Cognitive dysfunction                    19
Skin (cutaneous) disorder, pigmentary    15
Eosinophilic granuloma                   15
Cardiomegaly                             13
Colitis                                  13
Spondylosis                              13
Pododermatitis                           13
DJD                                      13

To see how the Middle Level codes associate with each other, we also map each comorbidity to its pair of Middle Level codes and count the resulting pairs. The table below shows the top 3 Middle Level associations with their number of occurrences. It can be seen that Skin (cutaneous) disorder finding is again the most common one. To sum up, dogs are likely to be affected by disorders under Skin (cutaneous) disorder finding, or by comorbidities that belong to it. Owners should therefore pay particular attention to this kind of disease so as to prevent it.

Middle Level code association                                            Number of occurrences
Skin (cutaneous) disorder finding, Skin (cutaneous) disorder finding     15
Enteropathy, Skin (cutaneous) disorder finding                           11
Ophthalmological disorder finding, Ophthalmological disorder finding     10
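The mapping from comorbidities to Middle Level associations can be sketched as follows. The dictionary `middle_level_of`, the function name and the toy edges are assumptions made for illustration; in particular, the assignment of "Colitis" to "Enteropathy" is hypothetical.

```python
from collections import Counter


def middle_level_associations(edges, middle_level_of):
    """Map each disease pair to an unordered pair of Middle Level codes and count them."""
    counts = Counter()
    for a, b in edges:
        pair = tuple(sorted((middle_level_of[a], middle_level_of[b])))
        counts[pair] += 1
    return counts


# Hypothetical mapping and edges, for illustration only.
middle_level_of = {
    "Pododermatitis": "Skin (cutaneous) disorder finding",
    "Eosinophilic granuloma": "Skin (cutaneous) disorder finding",
    "Colitis": "Enteropathy",
}
edges = [
    ("Pododermatitis", "Eosinophilic granuloma"),
    ("Colitis", "Pododermatitis"),
    ("Eosinophilic granuloma", "Colitis"),
]

associations = middle_level_associations(edges, middle_level_of)
print(associations.most_common())
```

Sorting the two codes makes the pair unordered, so "Enteropathy, Skin" and "Skin, Enteropathy" are counted as the same association.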

Next, we introduce some network properties. All of them can be calculated directly with Gephi or the SNAP library. SNAP, short for Stanford Network Analysis Project (footnote 12), offers a large number of interfaces for network analysis: it manipulates graphs efficiently, calculates structural properties, generates graphs, and supports attributes on nodes and edges. The first property is the node degree distribution:

Figure 9: GGM degree distribution

Number of nodes     Degree
60                  1
45                  2
52                  3
65                  4
37                  5
36                  6
25                  7
17                  8
12                  9
10                  10
6                   11
6                   12
5                   13
2                   15
1                   19

12 http://snap.stanford.edu/snappy/index.html


We plot the degree distribution above, which looks like a power-law distribution. A scale-free network is a network whose degree distribution follows a power law, so we want to check whether the distribution follows a power law and hence whether the network can be classified as scale-free. Mathematically, the power-law distribution is $P(x) \propto x^{-\alpha}$, where $P(x)$ is the number of nodes with degree $x$ and $\alpha$ is a parameter greater than 1. Taking logarithms of both sides gives $\log P(x) = -\alpha \log x + c$, so to simplify the analysis we take the logarithm of the degree distribution and check whether it is a linear function. We also introduce two comparison functions (piecewise linear and quadratic), and the criterion we choose is the Bayesian information criterion (BIC). BIC mainly considers two factors: how well the model fits the data and how many explanatory variables it uses. A good fit means less error, and fewer variables or parameters mean that the model is simpler and more robust against overfitting. Given any two estimated models, the one with the lower BIC value is preferred. The three figures below were calculated and plotted by Dr Novi Quadrianto, and we can see that the linear model is preferred. However, the difference in score between the quadratic and linear models is small (19.55 - 19.30 = 0.25). According to [5], a difference of less than 2 means that the linear model is not preferred over the quadratic one with strong evidence.
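The comparison of a linear and a quadratic fit on the log-log scale can be sketched as below, using the degree distribution table of the GGM network. This is not the procedure used to produce the figures: the least-squares fit via normal equations, the Gaussian-error form of BIC, and the helper names `solve`, `polyfit` and `bic` are all assumptions, so the printed values need not match the scores reported above. The piecewise linear model is omitted.

```python
import math

# Degree distribution of the GGM network: degree -> number of nodes.
degree_counts = {1: 60, 2: 45, 3: 52, 4: 65, 5: 37, 6: 36, 7: 25, 8: 17,
                 9: 12, 10: 10, 11: 6, 12: 6, 13: 5, 15: 2, 19: 1}


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def polyfit(xs, ys, order):
    """Least-squares polynomial fit; coefficients are ordered from low to high power."""
    k = order + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)


def bic(xs, ys, coeffs):
    """Gaussian-error BIC: n * ln(RSS / n) + k * ln(n), with k fitted coefficients."""
    n = len(xs)
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    return n * math.log(rss / n) + len(coeffs) * math.log(n)


log_degree = [math.log(d) for d in degree_counts]
log_count = [math.log(c) for c in degree_counts.values()]

linear = polyfit(log_degree, log_count, 1)
quadratic = polyfit(log_degree, log_count, 2)
print("slope of linear fit:", linear[1])
print("BIC linear:", bic(log_degree, log_count, linear))
print("BIC quadratic:", bic(log_degree, log_count, quadratic))
```

The slope of the linear fit is an estimate of $-\alpha$ under these assumptions, and the model with the lower BIC would be the preferred one.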

Figure 10: linear function fit with BIC


Figure 11: piecewise linear function fit with BIC

Figure 12: quadratic function fit with BIC

Another property is the clustering coefficient, which quantifies how well connected the neighbours of a vertex in a graph are [6]. In other words, it describes how often the nodes adjacent to a given node are also connected to each other. The clustering coefficient of a vertex is the ratio of the number of existing edges connecting the vertex's neighbours to each other to the maximum possible number of such edges. The clustering coefficient of the $i$th node can be calculated as:

$$C_i = \frac{2 e_i}{k_i (k_i - 1)}$$

where $e_i$ is the number of connections between the neighbours of the $i$th node and $k_i$ is its number of neighbours. In the GGM network, the average clustering coefficient of the whole network is $C = \frac{1}{n} \sum_{i=1}^{n} C_i = 0.117641210955$. The clustering coefficient also provides evidence for small-world structure: a network is considered a small-world network if its clustering coefficient is significantly higher than expected by random chance. As this value is not high enough, we cannot conclude that the GGM network is a small-world network.
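The definition of the clustering coefficient can be illustrated with a short sketch. It assumes the graph is stored as a dictionary from each node to the set of its neighbours; the function name and the toy graph are hypothetical, and nodes with fewer than two neighbours are given a coefficient of 0 by convention (tools such as Gephi or SNAP may treat this case differently).

```python
def clustering_coefficients(adjacency):
    """Local clustering coefficient C_i = 2 e_i / (k_i (k_i - 1)) for every node."""
    coefficients = {}
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            coefficients[node] = 0.0  # convention: undefined cases count as 0
            continue
        # e_i: links among the neighbours of this node, each counted once
        e = sum(1 for u in neighbours for v in neighbours
                if u < v and v in adjacency[u])
        coefficients[node] = 2.0 * e / (k * (k - 1))
    return coefficients


# Hypothetical toy graph: a triangle A-B-C plus a pendant node D attached to C.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

local = clustering_coefficients(adjacency)
average = sum(local.values()) / len(local)
print(local, average)
```

In the toy graph, node C has three neighbours but only one link among them (A-B), so its coefficient is 1/3, while A and B each have a fully connected neighbourhood.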

Average path length is another important concept in network topology. It is a measure of the efficiency of information or mass transport on a network, and shows the number of steps it takes to get from one node of the network to another. It is calculated by finding the shortest path between all pairs of nodes, adding up their lengths, and dividing by the total number of pairs. In our GGM network, the average path length is 4.265. Read as disease progression, this suggests that once a dog has one disorder, it would on average pass through about four further comorbidity links before reaching a given target disorder.
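The calculation of the average path length can be sketched with a breadth-first search from every node. The adjacency-list format, the function name and the toy graph are assumptions; unreachable pairs are simply skipped here, which may differ from how Gephi or SNAP handle disconnected components.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered pairs of distinct nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / float(pairs) if pairs else 0.0


# Hypothetical toy graph: a path A - B - C - D.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
apl = average_path_length(adjacency)
print(apl)
```

For the four-node path, the six pairs have distances 1, 2, 3, 1, 2 and 1, giving an average of 10/6.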

6.3 Expert validation of GGM network

Besides validating against the RR and φ-correlation networks through the Jaccard index, we also introduce expert validation, in which the network results are verified by someone with high authority in the area of dog disease. This approach is more authentic and convincing because the judgement is made by a professional. The person we invited to do the expert validation is Dr Dan O'Neill. He is the Dogs Trust companion animal epidemiologist, and his research is mainly in Veterinary Epidemiology and Economics and in Public Health. He ran his own companion animal practice for 12 years before starting a PhD in veterinary epidemiology at the RVC. O'Neill is now a post-doctoral researcher and continues to expand VetCompass to examine health-welfare issues in dogs.

What we want to validate is the precision of the comorbidities. Given a list of the disease associations of all three networks, the expert judges whether each comorbidity is reasonable and labels it as "Expected" or "Unexpected". Two criteria decide how good a result is: one is how many comorbidities the network detects, the other is whether each comorbidity is correct. As the dataset is small, we followed O'Neill's advice and selected the 10 most common disorders, which avoids unreliable results and gives the best chance of finding comorbidities. O'Neill's results are attached in the Appendices in two parts: a general comment and the validation results. From the validation results, we can see that RR has the poorest performance: it detects 9 comorbidities, of which 5 are expected (5/9 ≈ 0.56). It is followed by φ-correlation with a precision of 34/50 = 0.68, and GGM has the best precision with 33/43 ≈ 0.77. Compared with GGM, the φ network detects 7 more comorbidities, but only 1 more of them is expected (34 versus 33) while 6 more are unexpected (16 versus 10). By the criteria described above, we believe the GGM network is the best one, as it provides nearly the same number of expected comorbidities with higher precision.
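The precision comparison can be reproduced from the counts reported above. The dictionary layout and names are only illustrative; "phi" stands for the φ-correlation network.

```python
# (expected, detected) comorbidity counts reported in the expert validation.
validation = {"RR": (5, 9), "phi": (34, 50), "GGM": (33, 43)}


def precision(expected, detected):
    """Fraction of detected comorbidities that the expert labelled as expected."""
    return expected / float(detected) if detected else 0.0


scores = {name: precision(*counts) for name, counts in validation.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print("%s: %.3f" % (name, scores[name]))
```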

Next, we take a look at the mis-detected comorbidities. We can see from O'Neill's validation results that "Vomiting" does not have any comorbidity. It is a very common disorder, and its wide range of triggers may reduce the chance of specific comorbidities with other disorders being identified in these studies. Some disorders, such as "Diarrhoea finding", have "Nasal planum finding" as a comorbidity in both the GGM and φ networks, which does not make clinical sense, so the result is labelled unexpected. However, in the dog disease data, 6 dogs have "Nasal planum finding" and 4 of them also have "Diarrhoea finding", which indicates a strong relationship between the two diseases in this dataset. This kind of error is due to a lack of data: if the two diseases are independent in the real world, they should not co-occur so often in a sufficiently large dataset. In other words, the sample disease distribution (the dataset) should follow the population distribution (the real world) if the sample size is large enough. Based on the results of the expert validation, a possible improvement is to select a set of penalty parameters or thresholds and validate each one with the expert in order to choose the best-performing one. Although this method consumes more human resources, it is a reliable and accurate approach.

6.4 Analysing illness progression on different gender

This time, we want to analyse comorbidities by gender. Gender is an important factor in diagnosing disease. For example, in humans breast cancer is a severe disease that mostly affects women, so a woman who has diseases that are comorbidities of breast cancer should pay more attention to preventing it in advance. For this analysis we calculate the odds ratio (OR), which is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. In statistics, it quantifies how strongly the presence of disorder i is associated with the presence of disorder j in a given population. In this experiment, the two groups are female and male dogs. The expression is shown below:

$$\mathrm{OR}_{ij}(f, m) = \frac{p_{ij}(f)\,(1 - p_{ij}(m))}{p_{ij}(m)\,(1 - p_{ij}(f))}$$

where $i$ and $j$ denote disease $i$ and disease $j$, $f$ and $m$ denote female and male, and $p_{ij}(\cdot)$ is the probability of the comorbidity of diseases $i$ and $j$ within the given gender. If the odds ratio equals 1, the comorbidity is equally likely to occur in females and males. An odds ratio greater than 1 tells us that the comorbidity is more likely to occur in females than in males, and vice versa. In our experiment, we show only significant differences by selecting a threshold of 2. In the OR network, if the OR of female over male is greater than 2, we draw a green link (193 links); if the OR of male over female is greater than 2, we draw a red link (169 links). From the network, we can see that "Vomiting" (15 of 16 links are red, see Figure 14) and "Enteritis" (all 5 links are red) are more likely to occur in males, while "Intertrigo" (all 5 links are green) and "Incontinence - faecal" (all 3 links are green) are more likely to occur in females. Moreover, the comorbidities [Behaviour disorder, Obesity], [Obesity, Urinary incontinence] and [Corneal disorder finding, Anal sac impaction] deserve more attention in females, with the highest OR score of 7.867. [Claw injury (traumatic), Diarrhoea finding] and [Mitral valve disorder, Periodontal disease] deserve attention in males, as they have the two highest OR scores of 10.562 and 8.811 respectively. The OR network is shown below (and can also be seen online, footnote 13):

13 http://smileclinic.alwaysdata.net/long_msc2014/or_analysis/


Figure 13: OR network (green: female, red: male)

Figure 14: Vomiting: male disease with 15 of 16 red links
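The odds ratio and the colouring rule of the OR network can be sketched as follows. The sketch assumes that the comorbidity probability for each gender is estimated as the proportion of dogs of that gender with both diseases, and that this proportion is strictly between 0 and 1; the function names and the counts are hypothetical.

```python
def odds(p):
    """Odds p / (1 - p) of an event with probability p."""
    return p / (1.0 - p)


def odds_ratio(both_female, n_female, both_male, n_male):
    """Female-over-male odds ratio for a disease pair.

    both_*: dogs of that gender with both diseases; n_*: all dogs of that gender.
    """
    p_female = both_female / float(n_female)
    p_male = both_male / float(n_male)
    return odds(p_female) / odds(p_male)


def link_colour(or_value, threshold=2.0):
    """Green if the female-over-male OR exceeds the threshold, red for the reverse."""
    if or_value > threshold:
        return "green"
    if or_value > 0 and 1.0 / or_value > threshold:
        return "red"
    return None  # no marked difference, so no link is drawn


# Hypothetical counts, for illustration only.
value = odds_ratio(both_female=12, n_female=1000, both_male=4, n_male=1000)
print(value, link_colour(value))
```

Dividing the female odds by the male odds is algebraically the same as the product form of the OR expression, and the male-over-female OR is its reciprocal.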


7 GGM on Large Canine Disease data

In this section, we apply the GGM to an inconsistent but much larger dog disease dataset. More data means that we are more likely to avoid accidental events and can be more confident about the result of the GGM. The data was provided by Noel Kennedy from the RVC as part of his work on a veterinary disease classification system. As the data are not as well structured as the original dataset (429 × 3884), we re-structure them in several ways. In the end, we find that the GGM network does not work well on this dataset.

7.1 Data Structure

In the large canine disease dataset, the main disease structure is called the Data Dictionary (DD). The DD groups the VeNom codes, or dog disorder codes, into a hierarchy, where the most specific disease codes are at the leaf level and more general codes are at the higher levels. It is like a graph in which the nodes represent coded findings in the ontology and the edges are directed from more specific codes to more general codes. This represents an "is-a" relationship in the ontology. Two files fully describe the DD relationships: the first maps each DD code to its disease name, and the second maps each child code to its parent codes. One child code can have multiple parent codes. The dog disease file contains two columns, animal id and DD code. There are around 200,000 dogs in the dataset, and one dog can have multiple DD codes (diseases). The dogs are coded at multiple levels of understanding, which means that the DD codes of one dog can contain high-level diseases and leaf diseases at the same time. The data structure is also inconsistent: a dog may be positive for a specific disease while that disease's parent term is negative. What is more, 460 DD codes match the original 429 diseases because some diseases have more than one DD code. For example, "Owner unsure" has two DD codes, 114 and 10. Therefore, we combine the repeated DD codes after obtaining all the disease information from the dog disease file. The table below shows part of the large canine disease file:


Animal Id     DD code
250012        15
250012        34
250012        128
250012        2545
250012        55070
250012        55071
250012        55102
250020        15
. . .         . . .
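Reading such a file into one set of DD codes per dog, and merging repeated DD codes, can be sketched as below. The rows are those shown in the table; the dictionary `canonical_code`, the function name, and the choice of which duplicate code to keep are assumptions for illustration.

```python
from collections import defaultdict

# Rows as in the table above: (animal id, DD code).
rows = [
    (250012, 15), (250012, 34), (250012, 128), (250012, 2545),
    (250012, 55070), (250012, 55071), (250012, 55102), (250020, 15),
]

# Duplicate DD codes that denote the same disease. 114 and 10 are the
# "Owner unsure" example; keeping 10 as the canonical code is arbitrary.
canonical_code = {114: 10}


def codes_by_animal(rows, canonical_code):
    """Group DD codes per dog, merging duplicate codes into one canonical code."""
    grouped = defaultdict(set)
    for animal_id, dd_code in rows:
        grouped[animal_id].add(canonical_code.get(dd_code, dd_code))
    return dict(grouped)


dogs = codes_by_animal(rows, canonical_code)
print(len(dogs[250012]), dogs[250020])
```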

7.2 Data Processing

To compare and validate the performance against the previous networks, we want to map the diseases of the large dog disease file to the original 429 diseases. Given the structure of the Data Dictionary, the most direct way to extract the disease information is to compare each DD code in the file with the DD codes of the 429 diseases and their child DD codes. In other words, for each term (see the table above), we search all the DD codes of the 429 diseases and their child DD codes. If any one matches, we map the term to that disease; otherwise, we discard it. With this search strategy, however, only 38 of the 429 diseases are detected. This is not acceptable, as the goal is to compare the GGM with the previous networks based on all 429 diseases. The reason is that most diseases in the file are at a high level of the DD tree, whereas the 429 diseases are at a lower level, so they cannot match each other in the hierarchy. Another way to process the data is to search each term in the file together with all its child nodes to see which of the 429 diseases are detected. With this method, however, we find that most dogs cover all the 429 diseases, because the DD codes in the file are at quite a high level. If a DD code is the root node of the disease tree, it will cover all the nodes when its child nodes are searched.

After analysing the structure of the DD, we find that the gap between the DD codes in the file and the 429 diseases is only one level. For example, 3007 (not in the 429 diseases) is the DD code of "Diabetes mellitus finding", which is the parent of "Diabetes mellitus" with code 658 (in the 429 diseases). There are several 3007 terms and no 658 term in the dog disease file. So, if we search for 3007 instead of 658, we can detect the disease "Diabetes mellitus" through "Diabetes mellitus finding". The search strategy is therefore to go one level higher for each of the 429 diseases and then check the child codes at that level to see whether the DD code matches. With this strategy, all the 429 diseases can be detected. The penalty parameter we choose this time is 0.01, as it keeps nearly the same number of nodes as the previous GGM network. The figure below shows the network drawn from the large dog disease dataset.

Figure 15: large dataset GGM network

From the figure above, we find that several diseases (nodes) are heavily connected, such as "Splenomegaly" and "Enteropathy", which have 128 and 107 comorbidities respectively. Both of them have multiple DD codes, and their DD codes are root nodes in the hierarchy. Thus, higher-level diseases cause an over-connection problem, while lower-level diseases cause an under-connection problem. To sum up, in the large dataset the result is heavily affected by the structure of the DD, and it is unreasonable to compare diseases at different DD levels.
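The one-level-higher search strategy of Section 7.2 can be sketched as below. This is one possible reading of the strategy rather than the code used in this work: the child-to-parents dictionary, the function names and the code 9999 are hypothetical, while the codes 3007 and 658 are the "Diabetes mellitus" example from the text.

```python
from collections import defaultdict


def build_children(child_to_parents):
    """Invert a child -> parents mapping into parent -> children."""
    children = defaultdict(set)
    for child, parents in child_to_parents.items():
        for parent in parents:
            children[parent].add(child)
    return children


def descendants(code, children):
    """The code itself plus every code below it in the hierarchy."""
    seen, stack = {code}, [code]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def detect_diseases(dog_codes, target_codes, child_to_parents):
    """Target diseases matched by a dog's codes, searching one level above each target."""
    children = build_children(child_to_parents)
    detected = set()
    for target in target_codes:
        search_roots = {target} | set(child_to_parents.get(target, ()))
        matchable = set()
        for root in search_roots:
            matchable |= descendants(root, children)
        if matchable & set(dog_codes):
            detected.add(target)
    return detected


# 3007 ("Diabetes mellitus finding") is the parent of 658 ("Diabetes mellitus");
# code 9999 is a hypothetical unrelated code.
child_to_parents = {658: [3007]}
print(detect_diseases({3007, 9999}, {658}, child_to_parents))
```

Because the search starts one level above each target, sibling codes under the same parent also count as a match, which is consistent with the over-connection seen for diseases whose DD codes sit high in the hierarchy.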

8 Future work

Hidalgo et al. (2009) built a phenotypic network based on human diseases, while our work mainly builds the network for dogs. From a biological perspective, both humans and dogs are animals, so there could be a connection between the canine network and the human network. There is also plenty of research in this area; for example, Poldrack et al. (2003) studied the memory systems of the brain in animals and humans. Zoobiquity [7] is a publication providing many cases of similarity between the human world and the animal world. Its author was inspired by an eye-opening consultation, which revealed that a monkey experienced the same symptoms of heart failure as her human patients. Inspired by this, we suppose that dog comorbidity is similar to human comorbidity. The direct way to validate this is to compare the comorbidities of the same disease in both networks. So, we choose to compare the dog comorbidities with the human phenotypic network built by Hidalgo et al. (2009). However, the human diseases in their work are coded with the ICD-9-CM (footnote 14) medical coding reference, while the animal diseases are coded with the VeNom coding system. The difficulty is that we cannot obtain a precise disease mapping between these two systems, so we compare the comorbidities manually. The disease we select is "Chronic kidney disease", because it has already been studied, along with its comorbidities, by O'Neill et al. [8]. The table below shows the result for "Chronic kidney disease" from the human phenotypic network.

14 http://www.icd9data.com/


Name                                                            ICD9 code    Prevalence    Score
Renal failure unspecified                                       586          0.6869 %      0.141
Nephritis and nephropathy not specified as acute or chronic     583          0.3813 %      0.182
Hypertensive heart and chronic kidney disease, malignant...     404          0.5050 %      0.185
Acute renal failure                                             584          1.7552 %      0.207
Malignant hypertensive renal disease without renal              403          1.4743 %      0.310
hyperosmolality and/or hypernatremia                            276          27.1 %        0.107
Mechanical complications of unspecified cardiac device ...      996          4.4082 %      0.107
Sideroblastic anemia                                            285          14.8 %        0.109
Nephrotic syndrome                                              581          0.1831 %      0.114
Nephroptosis                                                    593          2.8091 %      0.121
Chronic glomerulonephritis                                      582          0.5515 %      0.128
Congestive heart failure unspecified                            428          18.3 %        0.139

Result for "Chronic kidney disease" from O'Neill et al. (2013):

Anaemia

Cardiac disorder

Decreased appetite

Halitosis

Hypertension

Lethargy

Melaena

Pancreatitis

Polyuria/polydipsia

Urinary incontinence

Vomiting

Weight loss

From the table, we find that the human disease "Hypertensive heart and chronic kidney disease, malignant..." can be related to the dog disease "Hypertension". Both "Congestive heart failure unspecified" (human) and "Cardiac disorder" (dog) are diseases of the heart. Most of the comorbidities in humans and dogs are problems related to the kidney. Thus, it can be seen that the human and dog comorbidities are not independent of each other. As a result, if a precise mapping from animal diseases to human diseases can be provided, we may be able to connect and analyse their comorbidities.


References

[1] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

[2] Daniela M Witten, Jerome H Friedman, and Noah Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892-900, 2011.

[3] Rahul Mazumder, Trevor Hastie, et al. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125-2149, 2012.

[4] César A Hidalgo, Nicholas Blumm, Albert-László Barabási, and Nicholas A Christakis. A dynamic network approach for the study of human phenotypes. PLoS Computational Biology, 5(4):e1000353, 2009.

[5] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773-795, 1995.

[6] Sara Nadiv Soffer and Alexei Vázquez. Network clustering coefficient without degree-correlation biases. Physical Review E, 71(5):057101, 2005.

[7] Barbara Natterson Horowitz and Kathryn Bowers. Zoobiquity: What Animals Can Teach Us about Being Human. Random House, 2012.

[8] DG O'Neill, J Elliott, DB Church, PD McGreevy, PC Thomson, and DC Brodbelt. Chronic kidney disease in dogs in UK veterinary practices: prevalence, risk factors, and survival. Journal of Veterinary Internal Medicine, 27(4):814-821, 2013.

[9] Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amélie Dricot, Ning Li, Gabriel F Berriz, Francis D Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173-1178, 2005.

[10] Arthur P Dempster. Covariance selection. Biometrics, pages 157-175, 1972.


[11] D-S Lee, J Park, KA Kay, NA Christakis, ZN Oltvai, and A-L Barabási. The implications of human metabolic network topology for disease comorbidity. Proceedings of the National Academy of Sciences, 105(29):9880-9885, 2008.

[12] Kwang-Il Goh, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685-8690, 2007.

[13] Sebastian Schneeweiss, Philip S Wang, Jerry Avorn, and Robert J Glynn. Improved comorbidity adjustment for predicting mortality in Medicare populations. Health Services Research, 38(4):1103-1120, 2003.

[14] Russell A Poldrack and Mark G Packard. Competition among multiple memory systems: converging evidence from animal and human brain studies. Neuropsychologia, 41(3):245-251, 2003.

[15] Nicolai Meinshausen and Peter Lukas Bühlmann. Consistent neighbourhood selection for sparse high-dimensional graphs with the lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), Zürich, 2004.

Appendices

Expert validation general comment from Dr Dan O'Neill:

Many of the more common disorders in dogs are syndromes in the sense that they represent a spectrum of underlying specific disorders that all share a common presentation pattern. This has the result of making them common as apparently distinctive clinical presentations but may reduce the comorbidity indices with other disorders because of the varying underlying true pathologies. It should be noted that comorbidity studies carried out across all disorders recorded in dogs are subject to the risk of spurious results being identified due to chance. These studies are best suited to hypothesis generation and should be confirmed by later specific confirmatory studies. During the validation process, the expert defined the comorbidity associations as being expected or unexpected based on current veterinary norms. The unexpected results are potential new areas for investigation that offer the opportunity to identify previously unknown associations. While the GGM and Phi results were generally consistent with current veterinary expectation, the RR results seemed to miss some important associations identified by the other two methods. It would appear that RR is a less useful method in this respect. Overall these comorbidity results are highly consistent with conventional veterinary understanding of disease associations. Novel but potentially useful findings include comorbidity between DJD and hypothyroidism, and between periodontal disease and heart disorders.

Validation table:
