master project(longyu)

Upload: jackyu

Post on 02-Jun-2018


  • 8/11/2019 Master Project(LongYu)

    1/38

    University of Sussex

    Dissertation

    Building, Visualising, and Analysing Phenotypic Canine Disease Networks:
    A Gaussian Graphical Model View

    Author:

    Long Yu 121880

    August 27, 2014


    1 Abstract

    The use of networks to analyse diseases has proved to be a powerful tool. Here
    we build phenotypic canine disease networks from a manually verified database
    provided by the Royal Veterinary College. The main relationship between diseases
    that we study is comorbidity. Because networks are an expressive and intuitive
    way to represent relationships between objects, we build a comorbidity network
    using a technique called the Gaussian graphical model (GGM). A GGM captures the
    correlations between diseases and, with a suitable penalty parameter, the network
    retains the most strongly correlated disease pairs and discards the weakly
    correlated ones. To validate the GGM network, we build two further networks based
    on other measures of disease correlation: Relative Risk and φ-correlation, which
    is Pearson's correlation for binary variables. We also carry out an expert
    validation with Dr Dan O'Neill from the Royal Veterinary College and find that the
    GGM network has the best precision of the three. Moreover, we find that the Middle
    Level code "Skin (cutaneous) disorder finding" plays a very important role in dog
    diseases, as it has the highest prevalence and is the most likely to be involved
    in comorbidity.

    Keywords: Gaussian Graphical Model, Network Science, Visualization


    2 Acknowledgements

    I would like to thank my supervisor, Dr Novi Quadrianto¹. He gave me clear
    guidance, advice on network modelling and information on visualization throughout
    my time as his student. Every time we discussed the project, he brought new ideas
    and plenty of related material. Thanks to his efforts, I had the chance to
    cooperate with the Royal Veterinary College (RVC) and to acquire the dog disease
    dataset from them to build the networks.

    I would also like to thank Dr Dan O'Neill² from the RVC, who provided
    well-structured canine disease data, the relevant hierarchy of diseases and the
    expert validation results for our networks. At every meeting he offered valuable
    suggestions from his perspective.

    Lastly, I want to thank Noel Kennedy from the RVC for his much larger canine
    disease dataset and the new disease structure called the Data Dictionary.

    ¹ http://www.sussex.ac.uk/profiles/335583
    ² http://www.rvc.ac.uk/staff/doneill.cfm


    Contents

    1 Abstract
    2 Acknowledgements
    3 Introduction
    4 Building Canines Disease Networks
        4.1 Source Data
        4.2 Gaussian Graphical Model
        4.3 The Relative Risk and φ-correlation measurement
        4.4 Thresholds
    5 Visualising Canines Disease Networks
        5.1 Gephi
        5.2 Generate data for GGM network and visualization
        5.3 Generating data for RR and φ-correlation network and their visualization
    6 Analysing networks
        6.1 Networks validation
        6.2 GGM network analysis
        6.3 Expert validation of GGM network
        6.4 Analysing illness progression on different gender
    7 GGM on Large Canine Disease data
        7.1 Data Structure
        7.2 Data Processing
    8 Future work
    References
    Appendices


    3 Introduction

    In medicine, many diseases or disorders have no clear boundaries, because one
    disease may have multiple causes and can be associated with other diseases. A
    disease often has several concurrent diseases, called comorbidities. Comorbidity
    is the presence of one or more additional disorders co-occurring with a primary
    disorder. Normally, we consider two diseases to be comorbid if they affect the
    same individual more often than expected by chance.

    A network offers a graph-theoretic framework for representing and exploring the
    associations between disorders. During the past decade, a number of studies have
    demonstrated the value of networks for modelling and analysing diseases. Goh et
    al. (2007) built a human disease network showing that human genetic disorders and
    the corresponding disease genes may be related to each other. Lee et al. (2008)
    constructed a human disease network in which two diseases are linked if mutated
    enzymes associated with them catalyze adjacent metabolic reactions. Rual et al.
    (2005) mapped protein-protein interactions (interactome mapping); such maps have
    revealed dynamic features of interactome networks that relate to known biological
    properties. From a proteomic perspective, comorbidities arise because
    disease-associated proteins act on the same pathway. Hidalgo et al. (2009) built a
    disorder network of human phenotypes based on over 30 million medical records.

    Studying comorbidity can help to answer biological and medical questions. For
    instance, Schneeweiss et al. (2003) defined and improved the performance of
    existing comorbidity scores for predicting mortality in Medicare enrollees. In
    this paper, we build a Gaussian graphical model network to represent comorbidity.
    The Gaussian graphical model (GGM), also known as a Gaussian concentration graph
    or covariance selection model, has recently become a popular tool and is an
    effective way to measure the correlation between diseases. It computes all
    pairwise correlations and then draws the corresponding graph based on a specific
    threshold, which is the main free parameter of the GGM. We build the phenotypic
    canine disease network on the GGM so that dog comorbidities can be interpreted
    and analysed in a straightforward way. Another way to build a comorbidity network
    is to measure disease associations directly, which can be done with the Relative
    Risk and φ-correlation measurements. Both can be used to measure the correlation
    of diseases, and Hidalgo et al. (2009) built their phenotypic disease network
    from these two measurements.

    To validate the performance of the networks, we introduce an expert validation
    with Dr Dan O'Neill from the RVC, a companion animal epidemiologist who worked in
    general practice for 20 years and ran his own companion animal practice for 12
    years. We validate the 10 disorders with the highest prevalence, and the results
    show that the GGM network performs best compared with the Relative Risk and
    φ-correlation networks. After that, we focus on how illness spreads in the two
    genders. Using the Odds Ratio (OR) with a specific threshold, we calculate for
    each disease pair how much more likely it is to occur in one gender than in the
    other, and from this we build the OR network.

    4 Building Canines Disease Networks

    4.1 Source Data

    All the source data were obtained from the RVC. The main file is the canine
    disease dataset, whose columns include name, gender, clinic id, date of death and
    429 different diseases. Each disease corresponds to a VeNom code³, a coding system
    used in the electronic patient records of referral veterinary hospitals and in
    first-opinion veterinary practice management systems. The rows are 3884 dog
    records randomly selected from the overall disease dataset. The data were manually
    annotated and were collected from authorised pet clinics all over England from
    1 September 2009 to the middle of 2013. During this period, a patient could enter
    or leave at any time. The average period of treatment was 365.3 days, the maximum
    was 1275.1 days and the minimum was just 1 day. Of the dogs, 378 died and 3506
    did not; 2051 were male, 1817 were female and for 16 the gender was not recorded.
    The average weight was 19.8 kg. In addition, 939 dogs did not have any disorder,
    which means that nearly a quarter of the data could not be used in building the
    comorbidity network. As the dataset is not large (3884 records) and a number of
    diseases have quite a low prevalence, the conclusions we draw cannot be completely
    convincing from a probabilistic perspective.

    ³ http://www.venomcoding.org/VeNom/Welcome.html

    The second data file is the mapping from each disease to its body location.
    There are 8 body parts: Abdomen, Anus/Perineum, Head and neck, Limb, Pelvis,
    Tail, Thorax and Vertebral. For visualization, we use this file to draw a
    dog-shaped layout for the GGM network and place each disease at its own body
    location. Not every disease can be related to a body location; the few diseases
    that do not belong to any body location are placed outside the dog's body in the
    network. This makes it intuitive to see where a disorder occurs. The third data
    file is the mapping from each disease to its Middle Level code, which can be used
    to categorize diseases. For example, "Abdominal finding" is the Middle Level code
    of "Ascites". In total there are 70 different Middle Level terms. Each term covers
    several disorders, and we use the same color for disorders with the same Middle
    Level code in all networks.

    4.2 Gaussian Graphical Model

    A graph is a representation of a set of nodes (vertices) in which some pairs of
    nodes are connected by links (edges). Here each node represents a canine disorder,
    and an edge between two nodes represents comorbidity between them. Edges can be
    directed or undirected; as our main goal is to measure comorbidities between dog
    diseases, we use undirected edges. A graph is usually written as G = (V, E),
    where V is the set of vertices and E the set of edges. To build the network, the
    main questions are which nodes should be selected and which pairs of nodes should
    be connected. To answer them, we apply the Gaussian graphical model (GGM)
    technique. It finds the graph structure: a sparse graph over the disease nodes
    that represents the conditional independence properties present in the data.

    The training data form a matrix of 3884 instances (rows) by 429 diseases
    (columns). From the disease (column) perspective, each disease is a vector of
    length 3884 whose elements take the value 1 or 0 (affected or not affected). The
    GGM computes all pairwise correlations between disease vectors and then draws the
    corresponding graph. As the name indicates, the GGM assumes that each variable
    follows a Gaussian (Normal) distribution and that all the variables together
    follow a multivariate Gaussian distribution. Specifically, the random variables
    $X = (X_1, X_2, \ldots, X_p)$ follow a multivariate Gaussian distribution
    $N_p(\mu, \Sigma)$, where $p$ is 429 and both the mean $\mu$ and the covariance
    matrix $\Sigma$ are unknown. The probability density function of the multivariate
    Gaussian distribution is:

    $$f(x_1, \ldots, x_p) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

    We want to estimate the inverse covariance matrix $C = \Sigma^{-1}$, because it
    is the key matrix that determines the structure of the graph. In the inverse
    covariance matrix (ICM) $C = (C_{ij})$, a zero element $C_{ij} = 0$ indicates
    conditional independence between the two random variables $x_i$ and $x_j$ given
    all the other variables (diseases). In other words, the edge between diseases
    $x_i$ and $x_j$ is absent if and only if $x_i$ and $x_j$ are conditionally
    independent. The task is therefore equivalent to estimating the parameters and
    identifying the zeros in the ICM. This kind of problem is also called the
    covariance selection problem (Dempster, 1972).
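To make the role of zeros in the ICM concrete, the following is a minimal sketch on a made-up three-variable example (not the dog data). The covariance values and the helper name `invert` are assumptions chosen for illustration: variables 1 and 3 are correlated, yet the corresponding entry of the inverse covariance matrix is exactly zero, so no edge would be drawn between them.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a small square matrix exactly with Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot equals 1.
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical covariance matrix of three variables forming a chain 1 - 2 - 3.
sigma = [[Fraction(3, 4), Fraction(1, 2), Fraction(1, 4)],
         [Fraction(1, 2), Fraction(1), Fraction(1, 2)],
         [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]]

C = invert(sigma)  # the inverse covariance matrix (ICM)
for row in C:
    print([str(v) for v in row])
```

Here `sigma[0][2]` is non-zero, so variables 1 and 3 are marginally correlated, while `C[0][2]` is zero: given variable 2, they are conditionally independent. Exact fractions are used only to avoid rounding noise in such a small example.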

    The standard method for this problem is greedy stepwise forward selection or
    backward deletion. Its weakness is the large computational complexity of the
    stepwise procedure: at every step it has to consider a large number of candidate
    models [1]. Meinshausen and Bühlmann (2004) proposed an approach with lower
    computational complexity, which performs covariance selection by neighbourhood
    selection for each vertex in the graph. In all the methods above, model selection
    and parameter estimation are done separately. In this paper, we choose a
    penalized likelihood method that performs model selection and parameter
    estimation together in the Gaussian graphical model (Yuan and Lin, 2007). An
    important goal is to make the network sparser, which means that we want more zero
    elements to appear at the proper positions in the ICM. To do this, the GGM applies
    an L1 penalty term, inspired by the Lasso penalty of linear regression. With the
    penalty term, weak correlations between diseases are ignored.

    To build the network, all the disease vectors are centered, so that the sample
    mean of the data is zero. The samples $X_1, X_2, \ldots, X_n$ are independent and
    identically distributed. As the disease vectors follow a multivariate Gaussian
    distribution, the log-likelihood of $\mu$ and $C = \Sigma^{-1}$ is, up to an
    additive constant,

    $$\frac{n}{2} \ln\det C - \frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T C (X_i - \mu) \tag{1}$$

    The MLE of $(\mu, \Sigma)$ is $(\bar{X}, A)$, where

    $$A = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T \tag{2}$$

    Thus, the inverse covariance matrix $C$ can be estimated by $A^{-1}$. In general,
    we use the sample covariance matrix $S = nA/(n-1)$, and $C$ can be estimated by
    $S^{-1}$. However, the number of parameters to estimate is very large: the
    parameters are the upper (or lower) triangular elements of the ICM, $p(p+1)/2$ in
    total, which is 92,235 for $p = 429$ against only 3884 records. With so many
    parameters, $S$ is not a stable estimate of $\Sigma$. So, as discussed before, we
    introduce the lasso penalty term in order to make the graph sparse. The problem
    is now to find the minimizer $(\mu, C)$, with $C$ a positive definite matrix,
    of [1]:

    $$-\ln\det C + \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^T C (X_i - \mu) \quad \text{s.t.} \quad \sum_{i \neq j} |c_{ij}| \le t \tag{3}$$

    Here $t \ge 0$ is a tuning parameter controlling the sparsity. As the disease
    data have been centered and $-\log\det C$ gives the same result as $-\ln\det C$
    in finding the minimizer $C$, the problem becomes minimizing the following
    expression, which takes the same form as in, e.g., Banerjee et al. (2008) and
    Friedman et al. (2008):

    $$-\log\det C + \mathrm{tr}(SC) + \lambda \|C\|_1 \tag{4}$$

    where tr denotes the trace, $\lambda$ is the tuning parameter and $\|C\|_1$ is
    the L1 norm of $C$. To solve (4), Yuan and Lin (2007) treated it as a determinant
    maximization (maxdet) problem and solved it with an interior point algorithm,
    which is quite time-consuming. In this paper, we use the algorithm proposed by
    Friedman et al. (2008), named Graphical Lasso. They take the block coordinate
    descent approach used in Banerjee et al. (2007) as a starting point and propose a
    new algorithm that is extremely simple and faster than the other methods.

    We use the KKT conditions to solve (4). As the problem is an unconstrained
    optimization, only the stationarity condition is needed: the zero matrix must be
    an element of the subdifferential of the objective. The derivative of
    $\log\det C$ is $C^{-1}$, as shown in Boyd & Vandenberghe (2004), page 641. The
    KKT stationarity condition of the graphical lasso can then be written as [2]:


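    $$-W + S + \lambda\Gamma = 0 \tag{5}$$

    where $\Gamma_{ij} \in \partial|C_{ij}|$ and $W = C^{-1}$. Now we solve in terms
    of $W$. Note that $W_{ii} = S_{ii} + \lambda$ because $C_{ii} > 0$. Partition $W$
    and $S$ as:

    $$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}$$

    where $W_{11} \in \mathbb{R}^{(p-1)\times(p-1)}$, $w_{12} \in \mathbb{R}^{(p-1)\times 1}$,
    $w_{21} \in \mathbb{R}^{1\times(p-1)}$ and $w_{22} \in \mathbb{R}$. Consider the
    12-block of the KKT conditions [3]:

    $$-w_{12} + s_{12} + \lambda\gamma_{12} = 0 \tag{6}$$

    From

    $$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix} \begin{pmatrix} C_{11} & c_{12} \\ c_{12}^T & c_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & 1 \end{pmatrix} \tag{7}$$

    we get $w_{12} = -W_{11} c_{12} / c_{22}$. Substituting this into (6), we get:

    $$W_{11} \frac{c_{12}}{c_{22}} + s_{12} + \lambda\gamma_{12} = 0 \tag{8}$$

    Letting $x = c_{12}/c_{22}$, this can be rewritten as:

    $$W_{11} x + s_{12} + \lambda\rho = 0 \tag{9}$$

    where $\rho \in \partial\|x\|_1$. This looks like the KKT condition for:

    $$\min_x \; \frac{1}{2} x^T W_{11} x + s_{12}^T x + \lambda \|x\|_1 \tag{10}$$

    This is a lasso problem, which can be solved quickly by a coordinate descent
    algorithm [3]. We then have $w_{12} = -W_{11} x$, and $c_{12}$ and $c_{22}$ can
    be recovered from (7). We set $w_{21} = w_{12}^T$ and $c_{21} = c_{12}^T$. Thus,
    we reduce the graphical lasso problem to a sequence of lasso problems that can be
    solved easily by many methods.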
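The inner lasso step can be sketched with plain coordinate descent. This is a minimal illustration, not the glasso implementation: the function names are hypothetical, and the objective is assumed to carry the factor 1/2 on the quadratic term so that its stationarity condition is W11 x + s12 + λρ = 0. Each coordinate is updated in closed form with the soft-thresholding operator while the others are held fixed.

```python
def soft_threshold(value, lam):
    """Soft-thresholding: sign(value) * max(|value| - lam, 0)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_subproblem(W11, s12, lam, sweeps=100, tol=1e-10):
    """Minimise 0.5 * x'W11 x + s12'x + lam * ||x||_1 by coordinate descent.

    W11 is a symmetric positive definite matrix given as a list of rows,
    s12 is a list of the same length, lam >= 0 is the penalty.
    """
    p = len(s12)
    x = [0.0] * p
    for _ in range(sweeps):
        largest_change = 0.0
        for j in range(p):
            # Linear coefficient seen by coordinate j with the others fixed.
            r = s12[j] + sum(W11[j][k] * x[k] for k in range(p) if k != j)
            new_xj = -soft_threshold(r, lam) / W11[j][j]
            largest_change = max(largest_change, abs(new_xj - x[j]))
            x[j] = new_xj
        if largest_change < tol:
            break  # no coordinate moved: converged
    return x


# Small made-up example: the second coordinate is shrunk exactly to zero.
x = lasso_subproblem([[2.0, 1.0], [1.0, 2.0]], [-4.0, -1.0], lam=1.0)
print(x)
```

Coordinates whose linear coefficient stays within [-λ, λ] are set exactly to zero, which is how the penalty produces zeros in the estimated ICM.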

    4.3 The Relative Risk and φ-correlation measurement

    We quantify the strength of comorbidities through the correlation between two
    diseases. The measurements we choose are Relative Risk (RR) and φ-correlation.
    Both of them can quantify disease associations. The RR of a pair of diseases i
    and j affecting the same dog is given by:

    $$RR_{ij} = \frac{C_{ij} N}{P_i P_j}$$

    where $C_{ij}$ is the number of dogs affected by both diseases, $N$ is the total
    number of dogs, and $P_i$ and $P_j$ are the prevalences of diseases i and j, i.e.
    the number of dogs affected by each disease. $RR_{ij} > 1$ means that diseases i
    and j co-occur more often than expected by chance, while $RR_{ij} < 1$ means that
    they co-occur less often than expected by chance.

    The φ-correlation, which is Pearson's correlation for binary variables, of two
    diseases i and j in the same dog is defined by:

    $$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}$$

    $\phi_{ij} > 0$ means that comorbidity is more likely than expected by chance,
    while $\phi_{ij} < 0$ means that it is less likely than expected by chance. These
    two measurements are simple and effective in quantifying the association of two
    diseases. A higher value indicates a stronger correlation between the two
    diseases, and vice versa. The main disadvantage of the two approaches is that
    they have intrinsic biases. RR overestimates associations involving rare diseases
    and underestimates associations between highly prevalent disorders [4]. As an
    example of overestimation, if $P_1 = 10^2$ and $P_2 = 10^2$ are rare diseases and
    the total number of dogs is $N = 10^7$, then
    $RR = \frac{C_{12} \cdot 10^7}{10^2 \cdot 10^2} = C_{12} \times 10^3$. Even if
    $C_{12}$ is a small number, RR takes quite a high value, which is clearly an
    overestimate. The φ-correlation underestimates associations between diseases with
    very different prevalences. For instance, assume the two diseases are maximally
    correlated, so that the overlap is as large as possible: $C_{12} = P_2$.
    Replacing $C_{12}$ with $P_2$, we get:

    $$\phi = \frac{P_2 (N - P_1)}{\sqrt{P_1 P_2 (N - P_1)(N - P_2)}} = \frac{\sqrt{P_2 (N - P_1)}}{\sqrt{P_1 (N - P_2)}}$$

    When the prevalences satisfy $P_2 \ll P_1 \ll N$, this gives
    $\phi \approx \sqrt{P_2 / P_1}$, which is quite a small number and therefore an
    underestimate. The two measurements are not totally independent of each other, as
    both increase with the number of dogs affected by both diseases.
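The two measurements and their biases can be checked numerically. The sketch below simply evaluates the two formulas above; the function names and the example counts are illustrative assumptions, not values from the dog dataset.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """RR_ij = C_ij * N / (P_i * P_j)."""
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """phi_ij = (C_ij*N - P_i*P_j) / sqrt(P_i*P_j*(N - P_i)*(N - P_j))."""
    return (c_ij * n - p_i * p_j) / math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))


# RR bias: two rare diseases (100 cases each in 10**7 dogs) sharing one dog.
rr_rare = relative_risk(1, 100, 100, 10**7)

# phi bias: disease 2 is fully contained in disease 1 (C_12 = P_2),
# but the prevalences differ by a factor of 100.
phi_nested = phi_correlation(10, 1000, 10, 10**6)

print(rr_rare, phi_nested)
```

A single shared dog already gives RR = 1000 in the first case, while in the second case the largest possible overlap still yields φ close to sqrt(10/1000) = 0.1.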


    4.4 Thresholds

    From a visualization perspective, not every comorbidity should be shown. There
    is a trade-off between the number of disease associations and their significance.
    For the RR and φ-correlation networks, a high cutoff loses information from the
    original data and preserves only the few most correlated comorbidities. The
    resulting networks are very sparse and most diseases are completely disconnected.
    With a low cutoff, the networks become extremely dense; even accidental
    co-occurrences in the data are shown, which makes it hard to analyse the main
    trends. By tuning the threshold, we hope to find a sparse solution that still
    adequately explains the data: in this experiment we want to preserve a large
    number of nodes but relatively few links. Keeping most nodes ensures that we do
    not lose much disease information, and keeping few links lets us focus on
    significant comorbidities only. We therefore plot the number of nodes against the
    threshold of the φ-correlation:

    Figure 1: φ threshold vs. number of diseases
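A curve of this kind can be produced by scanning candidate cutoffs and counting what survives. The following is a minimal sketch under assumptions: edges are given as hypothetical (disease, disease, weight) triples, a link is kept when its weight is at least the cutoff, and a node survives if it still has at least one link.

```python
def surviving_counts(weighted_edges, thresholds):
    """For each cutoff, return (cutoff, number of nodes, number of links) kept."""
    counts = []
    for cutoff in thresholds:
        # Keep only the links whose weight reaches the cutoff.
        kept = [(a, b) for a, b, weight in weighted_edges if weight >= cutoff]
        # A node survives if it is an endpoint of at least one kept link.
        nodes = {node for edge in kept for node in edge}
        counts.append((cutoff, len(nodes), len(kept)))
    return counts


# Made-up edges with phi-like weights, for illustration only.
edges = [("A", "B", 0.30), ("B", "C", 0.10), ("C", "D", 0.05)]
for cutoff, n_nodes, n_links in surviving_counts(edges, [0.0, 0.09, 0.2]):
    print(cutoff, n_nodes, n_links)
```

Plotting the node count against the cutoff gives the kind of curve used here to choose a threshold just before the number of nodes starts to fall sharply.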

    From the figure above, it can be seen that when φ > 0.09 the number of nodes
    decreases dramatically. Applying the cutoff φ = 0.09 preserves most of the
    diseases while the number of comorbidities decreases from 5989 to 934; the
    network becomes much sparser and the significant associations are retained. We
    are now interested in the statistical significance of this cutoff. To check it,
    we apply a t-test to all the associations, with the null hypothesis φ = 0. The t
    value can be calculated by this formula⁴:

    $$t = \frac{\phi \sqrt{n - 2}}{\sqrt{1 - \phi^2}}$$

    where n is the number of observations. For all of our data we use
    $n = \max(P_i, P_j)$, which is the most stringent way in which t can be
    calculated given our data. To determine the significance level of t, it is
    necessary to consult a table of t values: for n > 1000, any t ≥ 1.96 is
    significant at the 5% level and any t ≥ 2.58 at the 1% level. In our experiment
    the significance level is 5%, and the critical value can be calculated with the
    stats package of Python (scipy.stats) as stats.t.ppf(1-0.025, n). After testing
    all the disease pairs, most links reject the null hypothesis, which means that
    the threshold φ = 0.09 ensures that most links are significant at the 5% level.
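The test can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the fixed critical value 1.96 is the large-sample value quoted above, used here instead of the exact quantile from scipy so that the example needs only the standard library.

```python
import math


def t_statistic(phi, n):
    """t = phi * sqrt(n - 2) / sqrt(1 - phi**2) for n observations."""
    return phi * math.sqrt(n - 2) / math.sqrt(1 - phi ** 2)


def is_significant(phi, n, critical=1.96):
    """True if the null hypothesis phi = 0 is rejected at the chosen level."""
    return abs(t_statistic(phi, n)) >= critical


# A link exactly at the cutoff phi = 0.09, for two hypothetical values of n.
print(t_statistic(0.09, 1000), is_significant(0.09, 1000))
print(t_statistic(0.09, 400), is_significant(0.09, 400))
```

Because n is taken as max(P_i, P_j), a link at the cutoff passes the test only when at least one of its diseases is sufficiently prevalent; links with larger φ pass with smaller n.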

    Figure 2: RR threshold vs. number of diseases

    ⁴ http://barabasilab.neu.edu/projects/hudine/resource/data/data.html

    For RR, we also plot the number of nodes against the threshold (Figure 2).
    Several thresholds could be selected. Based on the result of the hypothesis test
    and on keeping the number of nodes nearly the same as in the φ-correlation
    network, we find that RR = 34 is a good threshold. The number of links falls to
    786 and the number of nodes to 368. This time, to confirm the significance level,
    we calculate the 95% confidence interval, given by:

    $$[RR_{ij} \exp(-1.96\,\sigma_{ij}),\; RR_{ij} \exp(1.96\,\sigma_{ij})]$$

    where $\sigma_{ij}$ is:

    $$\sigma_{ij} = \frac{1}{C_{ij}} + \frac{1}{P_i P_j} - \frac{1}{N} - \frac{1}{N^2}$$

    The null hypothesis is RR = 1. We find that for more than half of the links the
    95% confidence interval does not include 1, so the null hypothesis is rejected
    for them; the threshold RR = 34 therefore ensures that the majority of links are
    significant at the 5% level.
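The interval check can be written directly from the two formulas above. The sketch below is a minimal illustration with a hypothetical function name and made-up counts; it returns the lower and upper bounds so that one can test whether 1 lies inside the interval.

```python
import math


def rr_confidence_interval(c_ij, p_i, p_j, n, z=1.96):
    """Return (lower, upper) bounds of the RR confidence interval."""
    rr = c_ij * n / (p_i * p_j)
    # sigma_ij as defined in the text.
    sigma = 1 / c_ij + 1 / (p_i * p_j) - 1 / n - 1 / n ** 2
    return rr * math.exp(-z * sigma), rr * math.exp(z * sigma)


# Made-up example: 10 shared dogs, prevalences 50 and 40, 1000 dogs in total.
lower, upper = rr_confidence_interval(10, 50, 40, 1000)
print(lower, upper, lower > 1)  # the link is significant if 1 is excluded
```

A link is counted as significant at the 5% level when the whole interval lies on one side of 1.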

    In the Gaussian graphical model, the main free parameter is the penalty
    parameter λ. For the former two measurements the threshold was chosen by
    hypothesis test; for the GGM, in order to compare it with the RR and
    φ-correlation networks, we select the penalty parameter so as to obtain nearly
    the same numbers of nodes and links. Also, according to the figure below, when
    the penalty is larger than 0.09 the curve drops rapidly. Thus, we choose the
    penalty parameter λ = 0.09. The numbers of links and nodes are 869 and 379
    respectively.

    Figure 3: GGM threshold vs. number of diseases


    5 Visualising Canines Disease Networks

    5.1 Gephi

    The tool we use to build the networks is Gephi⁵, a free, interactive
    visualization platform for all kinds of networks and complex systems, including
    dynamic and hierarchical graphs. It runs on multiple platforms such as Mac OS X,
    Windows and Linux. The two primary source files Gephi needs to generate a network
    are the nodes file and the links file, both in CSV format. The main advantages of
    Gephi are its variety of layouts and the ease of manipulating nodes and links
    along with their color, size and location settings. Two plugins are particularly
    helpful for generating our networks. The first is the GeoLayout plugin: once
    installed, it lets us place nodes at fixed locations using longitude and latitude
    attributes, which act as x and y coordinates. To obtain a clear and good-looking
    network, we place each disease at the location of its body part. We also place
    many isolated nodes, using the same longitude and latitude attributes, so as to
    draw the outline of a puppy. The second is the SigmaExporter plugin, with which
    Gephi exports the network into a folder containing HTML, source files and
    configuration files. The network can then be viewed in a browser and deployed
    online. The network can be embedded in the browser because it uses Sigma.js⁶, a
    JavaScript library dedicated to graph drawing, which makes it easy to publish
    networks on web pages and to integrate them into rich web applications. In
    addition, a short video⁷ shows how to build the GGM network with Gephi on our dog
    disease data.

    ⁵ http://gephi.github.io/
    ⁶ http://sigmajs.org/
    ⁷ https://www.youtube.com/watch?v=syzgKGYYIdU&list=UUcGDb7rt_B4h1EHqRfqPL8w

    5.2 Generate data for GGM network and visualization

    First of all, we draw the outline of a puppy by placing many isolated nodes, so
    that users can directly see the relation between a disease and its body location.
    The original dog image was downloaded from the deviantART website; it is a
    black-and-white PNG image of 528x564 pixels. The task is to convert the puppy
    image into many nodes whose x and y locations draw the outline of a puppy. We use
    the Matlab Image Processing Toolbox to process the image. To acquire the x and y
    coordinates, we regard the 528x564 pixel image as a matrix and take the location
    information directly from the row and column indices of the image matrix. Another
    problem is that the image matrix contains too many elements (528 × 564 = 297792),
    which is cumbersome for visualization. So, to get a sparser layout, we apply the
    mod function to select fewer nodes. The Matlab code for importing the image as a
    bitmap and thinning it is shown below:

    % import image
    img = imread('puppy.png');
    % keep a single channel
    img = img(:,:,1);
    % initialise an all-white image
    new_img = ones(528,564)*255;
    num = 0;
    for i = 1:size(img,1)
        for j = 1:size(img,2)
            % keep a black pixel only on every third row and column
            if img(i,j) == 0
                if mod(i,3)==0 && mod(j,3)==0
                    new_img(i,j) = 0;
                    num = num+1;
                end
            end
        end
    end
    imshow(new_img)
    % write the image matrix to csv
    dlmwrite('puppy.csv', new_img)

    To build the network itself, we use the graphical lasso algorithm discussed
    above. It is already implemented in an R library, so we can use it directly⁸.
    The main code is shown below:

    # code in code/R/glasso.r
    # import glasso library
    library(glasso)
    all_data <- read.csv("source file", sep = ";")
    # calculate covariance matrix
    disease_data <- subset(all_data)
    variance <- var(disease_data)

    cor


    not zero, then the 3rd and 4th diseases form a comorbidity. After iterating over
    every element of the ICM, we collect all the link and node information, which can
    be imported into Gephi. In the node file, the additional attribute is
    MiddleLevel, which is obtained from the Middle Level mapping file. Exported with
    SigmaExporter, the GGM network is shown below (it can also be viewed online⁹):
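The iteration over the ICM described above can be sketched as follows. This is an assumed reconstruction rather than the author's code: the function name, the tolerance and the small example matrix are hypothetical. Only the upper triangle is scanned because the graph is undirected and the matrix is symmetric.

```python
def edges_from_precision(icm, names, tol=1e-8):
    """Collect (disease_a, disease_b, value) for every non-zero off-diagonal entry."""
    edges = []
    p = len(icm)
    for i in range(p):
        for j in range(i + 1, p):  # upper triangle only: one entry per pair
            if abs(icm[i][j]) > tol:
                edges.append((names[i], names[j], icm[i][j]))
    return edges


# Made-up 3 x 3 ICM: A and B are conditionally independent (zero entry).
icm = [[1.0, 0.0, -0.4],
       [0.0, 2.0, 0.3],
       [-0.4, 0.3, 1.5]]
links = edges_from_precision(icm, ["A", "B", "C"])
nodes = sorted({name for a, b, _ in links for name in (a, b)})
print(links, nodes)
```

The resulting link list and node list correspond to the two CSV files that Gephi expects.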

    Figure 4: GGM network

    The network shows each disease along with its comorbidities and Middle Level
    code. There are 9 node clusters according to body location. As for interaction,
    hovering over a disease node automatically highlights its comorbidities. Clicking
    a node shows the full description of the disease on the right-hand side. Taking
    "Interdigital cyst (dogs)" as an example, clicking that node shows the disease
    information along with its comorbidities (see Figure 5 below). You can also zoom
    in, zoom out and refresh the GGM network through the three buttons below, or with
    the mouse wheel. To search for a disease, type its name in the input field of the
    left-hand toolbar. The size of a node is proportional to the prevalence of the
    disease, and nodes of the same color share the same Middle Level code.

    ⁹ http://smileclinic.alwaysdata.net/long_msc2014/ggm_dog_network/


    Figure 5: Interaction example of Interdigital cyst(dogs)

    5.3 Generating data for RR and φ-correlation network and their visualization

    The way to generate data for the RR and φ-correlation networks is slightly
    different from that for the GGM network (main code in code/network.py). The first
    task is to process the dog disease file. As described above, the disease file is
    an instances-by-diseases matrix, and what we are interested in is the disease
    associations. So the goal is to convert the matrix into disease pairs
    [(disease1, disease2), (disease2, disease3), (disease3, disease5), ...]. To
    extract the associations, we regard each dog instance as a vector. For example, a
    dog instance vector is [1,0,1,1,0,0,...]; it contains 429 elements and each
    element stands for a certain disease, with 1 indicating that the disease was
    detected and 0 that it was not. We iterate over every dog instance and enumerate
    the pairs of its diseases, so that we obtain all the disease pairs that occur.
    Next, identical disease pairs are merged and counted in order to calculate the RR
    or φ score. The new disease pair format looks like this:
    [((disease1_id, disease2_id), RR/φ score), ((disease2_id, disease3_id), RR/φ score), ...].
    By applying the thresholds of RR and φ (34 and 0.09 respectively), the original
    matrix is converted to the final comorbidities. The edge file gains a new
    attribute called weight, which is the RR/φ score of each disease pair. In the
    networks below, the thickness of a link indicates the weight of the comorbidity.
    Here are the φ-correlation and RR networks (they can also be viewed online: RR
    network¹⁰ and φ-correlation network¹¹).
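The pair extraction and counting step can be sketched as below. This is not the code from code/network.py, which is not reproduced in the text; the function name and the tiny example matrix are assumptions. It uses unordered combinations so that each disease pair is counted once per dog, and it also collects the prevalence of each disease, which the RR and φ formulas need.

```python
from collections import Counter
from itertools import combinations


def count_disease_pairs(records):
    """records: one 0/1 list per dog. Returns (prevalence, co-occurrence) counters."""
    prevalence = Counter()    # P_i: number of dogs with disease i
    cooccurrence = Counter()  # C_ij: number of dogs with both i and j
    for record in records:
        present = [d for d, flag in enumerate(record) if flag == 1]
        prevalence.update(present)
        # combinations yields each unordered pair once, smaller index first.
        cooccurrence.update(combinations(present, 2))
    return prevalence, cooccurrence


# Three made-up dogs and four diseases.
records = [[1, 0, 1, 1],
           [1, 0, 1, 0],
           [0, 1, 0, 0]]
prevalence, cooccurrence = count_disease_pairs(records)
print(dict(prevalence))
print(dict(cooccurrence))
```

From these counts, the score of each pair follows from the RR and φ formulas of Section 4.3, and pairs below the chosen threshold are dropped before writing the edge file.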

    Figure 6: φ-correlation network

    ¹⁰ http://smileclinic.alwaysdata.net/long_msc2014/rr_dog_network/
    ¹¹ http://smileclinic.alwaysdata.net/long_msc2014/phi_dog_network/


    Figure 7: RR network

    6 Analysing networks

    6.1 Networks validation

    In the canine disease dataset, the 3 most prevalent diseases, with their
    prevalence, are: Otitis externa (396), Periodontal disease (361) and Anal sac
    impaction (277). They are extremely common diseases: Otitis externa and
    Periodontal disease each affect almost 1 in 10 dogs. The figure and table below
    show the distribution of disease prevalence. 121 of the 429 diseases appear only
    once, which means that about a quarter of the diseases are rare in our dataset.


    Figure 8: disease prevalence distribution

    Prevalence (dogs affected)    Number of diseases
    1                             121
    3                             33
    4                             20
    5                             24
    6                             17
    7                             11
    12                            9
    15                            9
    9                             8
    13                            8
    8                             7
    10                            7
    22                            6
    17                            5
    ...                           ...

    To validate the GGM network, we use the RR and φ-correlation networks as comparisons
    and use the Jaccard index to measure their similarity. The Jaccard index is a statistic
    for comparing the similarity of two finite sample sets. It is the size of the
    intersection of the two sets divided by the size of their union:

    $$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

    We extract all the comorbidities from the three networks and use the Python code below
    to calculate the index. The Jaccard index of the GGM and φ networks is 0.918, while that
    of the GGM and RR networks is 0.789. The GGM network is thus highly similar to the φ
    network: most of their comorbidities overlap. This indicates that the GGM network is a
    reasonable network, as validated by the φ network. The score against the RR network is
    lower because of the bias of Relative Risk: RR overestimates associations involving rare
    diseases, and roughly a quarter of the diseases appear only once, so its disease pairs
    are much more likely to be biased. Some other differences among the three networks
    should also be taken into consideration. The GGM network assumes that the disease data
    follow a Normal distribution, while the other two measures have their own biases. In
    addition, the thresholds we select are not equivalent across networks, so the numbers
    of nodes and links are not exactly the same in the different networks.

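    # code file: code/analyse_disease_pairs.py

    def jaccard_index(set_1, set_2):
        # Size of the intersection of the two sets of disease pairs.
        intersection_num = len(set_1.intersection(set_2))
        # |A ∪ B| = |A| + |B| - |A ∩ B|; float() forces true division under Python 2.
        return intersection_num / float(len(set_1) + len(set_2) - intersection_num)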
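
    Because a comorbidity is an undirected link, the pairs (a, b) and (b, a) should count
    as the same element when two networks are compared. The sketch below shows one way to
    guarantee this before computing the index; the helper `normalise` and the toy pairs are
    hypothetical, and returning 0.0 for two empty networks is a convention chosen here.

```python
def normalise(pairs):
    """Turn an iterable of (a, b) tuples into a set of order-free pairs."""
    return {frozenset(pair) for pair in pairs}


def jaccard_index(set_1, set_2):
    """Size of the intersection divided by the size of the union."""
    union_num = len(set_1 | set_2)
    if union_num == 0:
        return 0.0  # convention for two empty networks (an assumption)
    return len(set_1 & set_2) / float(union_num)


# Toy comorbidity lists (hypothetical disease labels)
ggm_pairs = normalise([("A", "B"), ("B", "C"), ("C", "D")])
phi_pairs = normalise([("B", "A"), ("C", "B"), ("D", "E")])
similarity = jaccard_index(ggm_pairs, phi_pairs)
```

    Here two of the four distinct pairs are shared, even though they are written in the
    opposite order in the second list.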

    6.2 GGM network analysis

    First, we analyse the Middle Level codes. The five most prevalent Middle Level codes in
    the GGM network are:

    Middle Level code                     Prevalence
    Skin (cutaneous) disorder finding     10.82%
    Neoplasia                             9.23%
    Mass lesion finding                   6.86%
    Ophthalmological disorder finding     5.8%
    Enteropathy                           5.28%

    The table below lists the eight most heavily connected nodes (diseases), along with
    their number of comorbidities. Such nodes are usually called the hubs of the network.
    Among them, the Middle Level code "Skin (cutaneous) disorder finding" has the most hubs:
    "Skin (cutaneous) disorder, pigmentary", "Eosinophilic granuloma" and "Pododermatitis".

    Disease name                              Number of comorbidities
    Cognitive dysfunction                     19
    Skin (cutaneous) disorder, pigmentary     15
    Eosinophilic granuloma                    15
    Cardiomegaly                              13
    Colitis                                   13
    Spondylosis                               13
    Pododermatitis                            13
    DJD                                       13

    To see how the Middle Level codes associate with each other, we also map each
    comorbidity to the Middle Level codes of its two diseases. After merging identical code
    pairs, the table below shows the three most frequent Middle Level associations with
    their number of occurrences. "Skin (cutaneous) disorder finding" is again the most
    frequent. To sum up, dogs are likely to suffer from disorders under "Skin (cutaneous)
    disorder finding" or from comorbidities that belong to it. Pet owners should therefore
    pay particular attention to this kind of disease in order to prevent it.

    Middle Level code association                                           Number
    Skin (cutaneous) disorder finding, Skin (cutaneous) disorder finding    15
    Enteropathy, Skin (cutaneous) disorder finding                          11
    Ophthalmological disorder finding, Ophthalmological disorder finding    10
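
    The hub count and the Middle Level aggregation described above can be sketched with
    two counters. The edge list below is a toy example and not taken from the dataset;
    "Disease X" and "Disease Y" and their Middle Level codes are placeholders, and only the
    two skin-disorder mappings follow the text.

```python
from collections import Counter

# Toy comorbidity edges (hypothetical, for illustration only)
edges = [
    ("Pododermatitis", "Eosinophilic granuloma"),
    ("Pododermatitis", "Disease X"),
    ("Eosinophilic granuloma", "Disease X"),
    ("Disease X", "Disease Y"),
]

# Disease -> Middle Level code; the placeholder entries are assumptions
middle_level = {
    "Pododermatitis": "Skin (cutaneous) disorder finding",
    "Eosinophilic granuloma": "Skin (cutaneous) disorder finding",
    "Disease X": "Enteropathy",
    "Disease Y": "Neoplasia",
}

# Degree of a node = number of comorbidities it takes part in
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
hubs = degree.most_common(1)

# Map each edge to an order-free pair of Middle Level codes and count them
associations = Counter(
    tuple(sorted((middle_level[a], middle_level[b]))) for a, b in edges
)
```

    Sorting the two codes inside each pair ensures that "Enteropathy, Skin" and
    "Skin, Enteropathy" are merged into one association.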

    Next, we introduce some network properties. All of them can be calculated directly with
    Gephi or the SNAP library. SNAP, short for Stanford Network Analysis Project (footnote
    12), offers a large number of interfaces for network analysis. It efficiently
    manipulates graphs, calculates structural properties, generates graphs, and supports
    attributes on nodes and edges. The first property is the node degree distribution:

    Figure 9: GGM degree distribution

    Number of nodes    Degree
    60                 1
    45                 2
    52                 3
    65                 4
    37                 5
    36                 6
    25                 7
    17                 8
    12                 9
    10                 10
    6                  11
    6                  12
    5                  13
    2                  15
    1                  19

    12 http://snap.stanford.edu/snappy/index.html


    We plot the degree distribution above, which looks like a power-law distribution. A
    scale-free network is a network whose degree distribution follows a power law, so we
    want to check whether our distribution follows a power law and hence whether the network
    can be classified as scale-free. Mathematically, the power-law distribution is
    $P(x) \propto x^{-\alpha}$, where $P(x)$ is the number of nodes with degree $x$ and
    $\alpha$ is a parameter greater than 1. Taking logarithms gives
    $\log P(x) = -\alpha \log x + c$, which is linear in $\log x$; to simplify the analysis,
    we therefore take the logarithm of the degree distribution and check whether it is a
    linear function. We also introduce two comparison functions (piecewise linear and
    quadratic), and the criterion we choose is the Bayesian information criterion (BIC).
    BIC mainly considers two factors: how well the model fits the data and how many
    explanatory variables it uses. A good fit means less error, and fewer variables or
    parameters mean a simpler model that is more robust against overfitting. Given any two
    estimated models, the one with the lower BIC is preferred. The three figures below were
    calculated and plotted by Dr Novi Quadrianto, and we can see that the linear model is
    preferred. However, the difference in score between the quadratic and linear models is
    small (19.55 - 19.30 = 0.25). According to [5], a difference of less than 2 means that
    the evidence for the linear model over the quadratic one is not strong.

    Figure 10: linear function fit with BIC


    Figure 11: piecewise-linear function fit with BIC

    Figure 12: quadratic function fit with BIC
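
    As an illustration of the model comparison, the sketch below fits a linear and a
    quadratic function to the log-log degree distribution from the table in this section
    and scores each with BIC. It is not the code behind Figures 10 to 12: the helper names
    are hypothetical, and it assumes the common least-squares form
    BIC = n ln(RSS/n) + k ln(n), so its values need not reproduce the scores quoted in the
    text.

```python
from math import log

# Degree distribution from the table above (degree, number of nodes)
degrees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 19]
node_counts = [60, 45, 52, 65, 37, 36, 25, 17, 12, 10, 6, 6, 5, 2, 1]
log_x = [log(d) for d in degrees]
log_y = [log(c) for c in node_counts]


def polyfit_rss(x, y, order):
    """Least-squares polynomial fit; returns the residual sum of squares."""
    m = order + 1
    # Normal equations A c = b with A[r][c] = sum x^(r+c), b[r] = sum y x^r
    a = [[sum(xi ** (r + c) for xi in x) for c in range(m)] for r in range(m)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - tail) / a[r][r]
    fitted = [sum(cf * xi ** p for p, cf in enumerate(coeffs)) for xi in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def bic(rss, n_points, n_params):
    """BIC for a least-squares fit with Gaussian errors (up to a constant)."""
    return n_points * log(rss / n_points) + n_params * log(n_points)


n = len(log_x)
rss_linear = polyfit_rss(log_x, log_y, 1)
rss_quadratic = polyfit_rss(log_x, log_y, 2)
bic_linear = bic(rss_linear, n, 2)
bic_quadratic = bic(rss_quadratic, n, 3)
preferred = "linear" if bic_linear < bic_quadratic else "quadratic"
```

    The quadratic model always fits at least as closely as the linear one; BIC decides
    whether that gain is worth the extra parameter.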

    Another property is the clustering coefficient, which quantifies how well connected the
    neighbours of a vertex in a graph are [6]. In other words, it describes how often the
    neighbours of a node are themselves connected. The clustering coefficient of a vertex
    is the ratio of the number of existing edges connecting the vertex's neighbours to each
    other to the maximum possible number of such edges. The clustering coefficient of the
    $i$th node can be calculated as:

    $$C_i = \frac{2 e_i}{k_i (k_i - 1)}$$

    where $e_i$ is the number of connections between the neighbours of the $i$th node and
    $k_i$ is the number of its neighbours. In the GGM network, the average clustering
    coefficient of the whole network is
    $\bar{C} = \frac{1}{n} \sum_{i=1}^{n} C_i = 0.1176$. The clustering coefficient also
    serves as evidence for a small-world network: a network is considered small-world if its
    clustering coefficient is significantly higher than expected by random chance. Judged
    on its own, our value is not high enough for us to conclude that the GGM network is a
    small-world network.
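
    The definition above translates directly into code. The sketch below is a minimal
    pure-Python version on a toy graph, not the Gephi or SNAP computation used for the
    reported value; the adjacency dictionary is hypothetical, and assigning a coefficient
    of 0 to nodes with fewer than two neighbours is a convention assumed here.

```python
def clustering_coefficient(adjacency, node):
    """C_i = 2 * e_i / (k_i * (k_i - 1)) for one node of an undirected graph."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined ratio treated as 0 (an assumption)
    # e_i: edges that connect two neighbours of the node to each other
    e = sum(
        1
        for u in neighbours
        for v in neighbours
        if u < v and v in adjacency[u]
    )
    return 2.0 * e / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    total = sum(clustering_coefficient(adjacency, node) for node in adjacency)
    return total / float(len(adjacency))


# Toy undirected graph: a triangle A-B-C with a pendant node D attached to A
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
c_a = clustering_coefficient(toy_graph, "A")
c_mean = average_clustering(toy_graph)
```

    In the toy graph, node A has three neighbours of which only B and C are linked, so one
    of its three possible neighbour pairs is connected.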

    The average path length is another important concept in network topology. It is a
    measure of the efficiency of information or mass transport on a network, and gives the
    average number of steps it takes to get from one node of the network to another. It is
    calculated by finding the shortest path between every pair of nodes, summing the path
    lengths and dividing by the total number of pairs. In our GGM network, the average path
    length is 4.265. Informally, this suggests that, starting from one disorder, a dog
    would on average pass through about four comorbidity links before reaching a given
    target disorder.
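
    The calculation can be sketched with a breadth-first search from every node. This is
    an illustrative version on a toy graph, not the tool used for the reported value; it
    assumes an unweighted, undirected graph and simply ignores pairs of nodes that are not
    connected.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered node pairs."""
    total_length = 0
    pair_count = 0
    for source in adjacency:
        # Breadth-first search gives shortest distances in an unweighted graph
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total_length += sum(distance.values())
        pair_count += len(distance) - 1  # reachable nodes other than the source
    return total_length / float(pair_count) if pair_count else 0.0


# Toy graph (hypothetical): a simple chain A - B - C - D
chain = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
apl = average_path_length(chain)
```

    Each unordered pair is visited twice, once from each end, which leaves the average
    unchanged.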

    6.3 Expert validation of GGM network

    Besides validating against the RR and φ-correlation networks through the Jaccard index,
    we introduce expert validation, in which the network results are verified by someone
    with high authority in the area of dog disease. This approach is more authentic and
    convincing, as the results are judged by a professional. The person we invited to do
    the expert validation is Dr Dan O'Neill. He is a Dogs Trust companion animal
    epidemiologist whose research is mainly in the areas of Veterinary Epidemiology and
    Economics and Public Health. He ran his own companion animal practice for 12 years and
    started a PhD in veterinary epidemiology at the RVC. O'Neill is now a post-doctoral
    researcher and continues to expand VetCompass to examine health and welfare issues in
    dogs.

    What we want to validate is the precision of the comorbidities. Given a list of the
    disease associations in all three networks, the expert judges whether each comorbidity
    is reasonable and labels it as "Expected" or "Unexpected". Two criteria decide how good
    a result is: how many comorbidities the network detects, and whether the detected
    comorbidities are correct. As the dataset is small, following O'Neill's advice we
    selected the 10 most common disorders, which avoids unreliable estimates and gives the
    best chance of finding comorbidities. The results from O'Neill are attached in the
    Appendices in two parts: a general comment and the validation results. From the
    validation results, we can see that RR has the poorest performance: it detects 9
    comorbidities, of which 5 are expected (5/9). It is followed by φ-correlation with a
    precision of 34/50, and GGM has the best precision with 33/43. Compared with GGM, the φ
    network detects 1 more expected comorbidity but 6 more unexpected ones (16 against 10).
    By the criteria described above, we believe the GGM network is the best one, as it
    provides nearly the same number of expected comorbidities with higher precision.
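
    The comparison above is plain arithmetic on the counts reported by the expert, shown
    here as a short worked example. The dictionary layout and variable names are
    illustrative; the counts are the ones quoted in the text.

```python
# (expected comorbidities, detected comorbidities) per network, from the text
validation = {
    "RR": (5, 9),
    "phi": (34, 50),
    "GGM": (33, 43),
}

# Precision = expected / detected
precision = {
    name: expected / float(detected)
    for name, (expected, detected) in validation.items()
}

# Unexpected comorbidities = detected - expected
unexpected = {
    name: detected - expected
    for name, (expected, detected) in validation.items()
}

best_network = max(precision, key=precision.get)
```

    The precisions come out at roughly 0.56 for RR, 0.68 for φ-correlation and 0.77 for
    GGM.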

    Next, we take a look at the mis-detected comorbidities. O'Neill's validation result
    shows that "Vomiting" does not have any comorbidity. It is a very common disorder, and
    its wide range of triggers may reduce the chance of a specific comorbidity with other
    disorders being identified in these studies. Some disorders, such as "Diarrhoea
    finding", have the comorbidity "Nasal planum finding" in both the GGM and φ networks;
    this does not make clinical sense, so the result is labelled unexpected. However, in
    the dog disease data, 6 dogs are affected by "Nasal planum finding" and 4 of them are
    also affected by "Diarrhoea finding", which indicates a strong relationship between the
    two diseases in the sample. This kind of error is due to a lack of data: if the two
    diseases are independent in the real world, they should not co-occur many times in a
    sufficiently large dataset. In other words, the sample disease distribution (the
    dataset) should follow the population distribution (the real world) if the sample size
    is large enough. Based on the results of the expert validation, a possible improvement
    is to select a set of penalty parameters or thresholds and validate each one with the
    expert in order to select the best-performing one. Although this method consumes more
    human resources, it is a reliable and accurate approach.

    6.4 Analysing illness progression on different gender

    This time, we analyse comorbidities by gender. Gender is an important factor in
    diagnosing disease; for example, breast cancer is a severe disease that mostly affects
    women. A woman affected by diseases that are comorbidities of breast cancer should pay
    more attention to preventing it in advance. For this analysis we calculate the odds
    ratio (OR), the ratio of the odds of an event occurring in one group to the odds of it
    occurring in another group. In statistics, it quantifies how strongly the presence of
    disorder $i$ is associated with the presence of disorder $j$ in a given population. In
    this experiment, the groups are female and male dogs. The expression is shown below:

    $$OR_{ij}(f, m) = \frac{p_{ij}(f) / (1 - p_{ij}(f))}{p_{ij}(m) / (1 - p_{ij}(m))}$$

    where $i$ and $j$ are the two diseases, $f$ and $m$ denote the female and male groups,
    and $p_{ij}(\cdot)$ is the probability of the comorbidity in the given group. If the
    odds ratio equals 1, the comorbidity is equally likely to occur in females and males.
    An odds ratio greater than 1 tells us that the comorbidity is more likely to occur in
    females than in males, and vice versa. In our experiment, we show only substantial
    differences by selecting a threshold of 2. In the OR network, if the OR of female over
    male is greater than 2, we draw a green link (193 links); if the OR of male over female
    is greater than 2, we draw a red link (169 links). From the network, we can see that
    "Vomiting" (15 of 16 links are red, see Figure 14) and "Enteritis" (all 5 links are
    red) are more likely to occur with comorbidities in males, while "Intertrigo" (all 5
    links are green) and "Incontinence - faecal" (all 3 links are green) are more likely to
    do so in females. Moreover, the comorbidities [Behaviour disorder, Obesity], [Obesity,
    Urinary incontinence] and [Corneal disorder finding, Anal sac impaction] deserve more
    attention in females, with the highest OR score of 7.867. [Claw injury (traumatic),
    Diarrhoea finding] and [Mitral valve disorder, Periodontal disease] deserve attention
    in males, as they have the two highest OR scores of 10.562 and 8.811 respectively. The
    OR network is shown below (and can also be viewed online, footnote 13):

    13 http://smileclinic.alwaysdata.net/long_msc2014/or_analysis/


    Figure 13: OR network(green: female, red: male)

    Figure 14: Vomiting: male disease with 15 of 16 red links
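
    The colouring rule of the OR network in this section can be sketched as below. This is
    an illustration under stated assumptions rather than the project's code: the
    comorbidity probability in each group is estimated as the share of dogs in that group
    that have both diseases, the function names are hypothetical, and the counts in the
    example are made up.

```python
def odds(probability):
    """Odds p / (1 - p) of an event with probability p."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_ratio(cooccur_female, n_female, cooccur_male, n_male):
    """Odds of the comorbidity in females divided by the odds in males."""
    p_female = cooccur_female / float(n_female)
    p_male = cooccur_male / float(n_male)
    return odds(p_female) / odds(p_male)


def link_colour(or_female_over_male, threshold=2.0):
    """Green if females exceed the threshold, red if males do, else no link."""
    if or_female_over_male > threshold:
        return "green"
    # Male over female is the reciprocal of female over male
    if or_female_over_male < 1.0 / threshold:
        return "red"
    return None


# Made-up counts: the pair occurs in 10 of 100 females and 4 of 100 males
example_or = odds_ratio(10, 100, 4, 100)
example_colour = link_colour(example_or)
```

    Because the male-over-female ratio is the reciprocal of the female-over-male ratio, a
    single value is enough to decide both colours.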


    7 GGM on Large Canine Disease data

    In this section, we apply the GGM to an inconsistent but much larger dog disease
    dataset. More data means that we are more likely to avoid accidental events and can be
    more confident about the result of the GGM. The data was provided by Noel Kennedy from
    the RVC as part of his work on a veterinary disease classification system. As the data
    is not structured as well as the original dataset (429 × 3884), we re-structure it in
    several ways. In the end, we find that the GGM network on this dataset does not work
    well.

    7.1 Data Structure

    In the large canine disease dataset, the main disease structure is called the Data
    Dictionary (DD). The DD groups the VeNom codes, or dog disorder codes, into a hierarchy
    in which the most specific disease codes are at the leaf level and more general codes
    are at the higher levels. It is like a graph whose nodes represent coded findings in
    the ontology and whose edges are directed from more specific codes to more general
    codes; this represents an "is-a" relationship in the ontology. Two files fully describe
    the DD relationships. The first maps each DD code to its disease name, and the second
    maps each child code to its parent codes. One child code can have multiple parent
    codes. The dog disease file contains two columns, animal id and DD code. There are
    around 200,000 dogs in the dataset, and one dog can have multiple DD codes (diseases).
    The dogs are coded at multiple levels of understanding, which means that the DD codes
    of one dog can contain a high-level disease and a leaf disease at the same time. The
    data structure is also inconsistent: a dog may be positive for a specific disease while
    that disease's parent term is negative. Furthermore, 460 DD codes match the original
    429 diseases, because some diseases have more than one DD code. For example, "Owner
    unsure" has two DD codes, 114 and 10. Therefore, we combine the repeated DD codes after
    obtaining all the disease information from the dog disease file. The table below shows
    part of the large canine disease file:


    Animal Id    DD code
    250012       15
    250012       34
    250012       128
    250012       2545
    250012       55070
    250012       55071
    250012       55102
    250020       15
    ...          ...
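
    Reading this two-column file into one set of DD codes per dog, and merging duplicate
    codes of the same disease, can be sketched as follows. The first rows are taken from
    the table above; the last row and the choice of 114 as the code kept for "Owner unsure"
    are assumptions made only to show the merge.

```python
from collections import defaultdict

# (animal id, DD code) rows; the final row is hypothetical
rows = [
    (250012, 15),
    (250012, 34),
    (250012, 128),
    (250020, 15),
    (250020, 10),
]

# Duplicate DD code -> code kept for that disease ("Owner unsure": 114 and 10)
canonical_code = {10: 114}

codes_by_dog = defaultdict(set)
for animal_id, dd_code in rows:
    # Replace a duplicate code by its canonical one; leave other codes as-is
    codes_by_dog[animal_id].add(canonical_code.get(dd_code, dd_code))
```

    Using a set per dog also removes repeated rows for the same dog and code.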

    7.2 Data Processing

    To compare and validate the performance against the previous networks, we want to map
    the diseases in the large dog disease file to the original 429 diseases. Given the
    structure of the Data Dictionary, the most direct way to extract the disease
    information is to compare each DD code in the file with the DD codes of the 429
    diseases and their child DD codes. In other words, for each term (see the table above),
    we search all the DD codes of the 429 diseases and their child DD codes. If any of them
    matches, we map the term to that disease; otherwise, we discard it. With this search
    strategy, however, only 38 of the 429 diseases are detected. This is not acceptable, as
    the goal is to compare the GGM with the previous networks based on all 429 diseases.
    The reason is that most diseases in the file are at a high level of the DD tree while
    the 429 diseases are at a lower level, so they cannot match each other in the
    hierarchy. Another way to process the data is to search each term in the file together
    with all its child nodes and see which of the 429 diseases are detected. With this
    method, however, we find that most dogs cover all 429 diseases, because the DD codes in
    the file are at quite a high level. In the extreme case where a DD code is the single
    root node of the disease hierarchy, searching its child nodes will cover every node.

    After analysing the structure of the DD, we find that the gap between the DD codes in
    the file and the 429 diseases is only one level. For example, 3007 (not in the 429
    diseases) is the DD code of "Diabetes mellitus finding", which is the parent of
    "Diabetes mellitus" with code 658 (in the 429 diseases). There are several 3007 terms
    and no 658 term in the dog disease file. So, if we search for 3007 instead of 658, we
    can detect the disease "Diabetes mellitus" through "Diabetes mellitus finding". The
    search strategy is therefore to go one level higher from each of the 429 diseases and
    then check their child codes to see whether the DD code matches. With this strategy,
    all 429 diseases can be detected. The penalty parameter we choose this time is 0.01, as
    it keeps nearly the same number of nodes as the previous GGM network. The figure below
    is the network we draw from the large dog disease dataset.

    Figure 15: large dataset GGM network

    From the figure above, we find that several diseases (nodes) are heavily connected,
    such as "Splenomegaly" and "Enteropathy", which have 128 and 107 comorbidities
    respectively. Both of them have multiple DD codes, and their DD codes are root nodes in
    the hierarchy. Thus, higher-level diseases cause an over-connection problem, while
    lower-level diseases cause an under-connection problem. To sum up, in the large dataset
    the result is heavily affected by the structure of the DD, and it is unreasonable to
    compare diseases at different DD levels.
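
    The one-level-up search strategy of Section 7.2 can be sketched as below. This is one
    reading of the description, not the project's implementation: the function names and
    the dictionary layout of the child-to-parent file are hypothetical, and the sketch
    treats a target disease as detected when a dog carries the target code, one of its
    parent codes, or any child code of those parents.

```python
def build_children(child_to_parents):
    """Invert the child -> parents mapping into parent -> children."""
    children = {}
    for child, parents in child_to_parents.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    return children


def detect_diseases(dog_codes, target_codes, child_to_parents):
    """Return the target diseases matched by a dog's DD codes."""
    children = build_children(child_to_parents)
    dog_codes = set(dog_codes)
    detected = set()
    for target in target_codes:
        # Candidate codes: the target, its parents, and the parents' children
        candidates = {target}
        for parent in child_to_parents.get(target, ()):
            candidates.add(parent)
            candidates |= children.get(parent, set())
        if candidates & dog_codes:
            detected.add(target)
    return detected


# Example from the text: 658 (Diabetes mellitus) has parent 3007
child_to_parents = {658: [3007]}
targets = [658]
found = detect_diseases([3007, 15], targets, child_to_parents)
missed = detect_diseases([15], targets, child_to_parents)
```

    Because the candidate set of a target includes every child of its parents, diseases
    that share a high-level parent are detected together, which is consistent with the
    over-connection discussed above.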

    8 Future work

    Hidalgo et al. (2009) presented a phenotypic network based on human diseases, whereas
    our work builds the network for canines. From a biological perspective, both humans and
    canines are animals, so there could be a connection between the canine network and the
    human network. There is plenty of research in this area; for example, Poldrack and
    Packard (2003) studied the memory systems of the brain in animals and humans.
    Zoobiquity [7] is a publication providing many cases of similarity between the human
    world and the animal world. Its author was inspired by an eye-opening consultation,
    which revealed that a monkey experienced the same symptoms of heart failure as her
    human patients. Inspired by this, we suppose that dog comorbidity is similar to human
    comorbidity. The direct way to validate this is to compare the comorbidities of the
    same disease in both networks. So we choose to compare the dog comorbidities with the
    human phenotypic network built by Hidalgo et al. (2009). However, the human diseases in
    their work are coded with the ICD-9-CM (footnote 14) medical coding reference, while
    the animal diseases are coded with the VeNom coding system. The difficulty is that we
    cannot obtain a precise disease mapping between these two systems. Thus, we compare the
    comorbidities manually. The disease we select is "Chronic kidney disease", because it
    has already been studied, along with its comorbidities, by O'Neill et al. [8]. The
    table below shows the result for "Chronic kidney disease" from the human phenotypic
    network.

    14 http://www.icd9data.com/


    Name                                                           ICD9 code    Prevalence    φ score
    Renal failure unspecified                                      586          0.6869 %      0.141
    Nephritis and nephropathy not specified as acute or chronic    583          0.3813 %      0.182
    Hypertensive heart and chronic kidney disease, malignant...    404          0.5050 %      0.185
    Acute renal failure                                            584          1.7552 %      0.207
    Malignant hypertensive renal disease without renal             403          1.4743 %      0.310
    hyperosmolality and/or hypernatremia                           276          27.1 %        0.107
    Mechanical complications of unspecified cardiac device ...     996          4.4082 %      0.107
    Sideroblastic anemia                                           285          14.8 %        0.109
    Nephrotic syndrome                                             581          0.1831 %      0.114
    Nephroptosis                                                   593          2.8091 %      0.121
    Chronic glomerulonephritis                                     582          0.5515 %      0.128
    Congestive heart failure unspecified                           428          18.3 %        0.139

    Result for "Chronic kidney disease" from O'Neill et al. (2013):

    Anaemia

    Cardiac disorder

    Decreased appetite

    Halitosis

    Hypertension

    Lethargy

    Melaena

    Pancreatitis

    Polyuria/polydipsia

    Urinary incontinence

    Vomiting

    Weight loss

    From the table, we find that "Hypertensive heart and chronic kidney disease,
    malignant..." in the human data can be related to "Hypertension" in the dog data. Both
    "Congestive heart failure unspecified" (human) and "Cardiac disorder" (dog) are
    diseases of the heart. Most comorbidities of both human and dog relate to kidney
    problems. Thus, the human and dog comorbidities do not appear to be independent of each
    other. As a result, if a precise mapping from animal diseases to human diseases can be
    provided, we may be able to connect and analyse their comorbidities.


    References

    [1] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical
    model. Biometrika, 94(1):19–35, 2007.

    [2] Daniela M Witten, Jerome H Friedman, and Noah Simon. New insights and faster
    computations for the graphical lasso. Journal of Computational and Graphical
    Statistics, 20(4):892–900, 2011.

    [3] Rahul Mazumder, Trevor Hastie, et al. The graphical lasso: New insights and
    alternatives. Electronic Journal of Statistics, 6:2125–2149, 2012.

    [4] César A Hidalgo, Nicholas Blumm, Albert-László Barabási, and Nicholas A
    Christakis. A dynamic network approach for the study of human phenotypes. PLoS
    Computational Biology, 5(4):e1000353, 2009.

    [5] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American
    Statistical Association, 90(430):773–795, 1995.

    [6] Sara Nadiv Soffer and Alexei Vázquez. Network clustering coefficient without
    degree-correlation biases. Physical Review E, 71(5):057101, 2005.

    [7] Barbara Natterson-Horowitz and Kathryn Bowers. Zoobiquity: What Animals Can Teach
    Us about Being Human. Random House, 2012.

    [8] DG O'Neill, J Elliott, DB Church, PD McGreevy, PC Thomson, and DC Brodbelt.
    Chronic kidney disease in dogs in UK veterinary practices: prevalence, risk factors,
    and survival. Journal of Veterinary Internal Medicine, 27(4):814–821, 2013.

    [9] Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa,
    Amélie Dricot, Ning Li, Gabriel F Berriz, Francis D Gibbons, Matija Dreze, Nono
    Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human protein–protein
    interaction network. Nature, 437(7062):1173–1178, 2005.

    [10] Arthur P Dempster. Covariance selection. Biometrics, pages 157–175, 1972.


    [11] D-S Lee, J Park, KA Kay, NA Christakis, ZN Oltvai, and A-L Barabási. The
    implications of human metabolic network topology for disease comorbidity. Proceedings
    of the National Academy of Sciences, 105(29):9880–9885, 2008.

    [12] Kwang-Il Goh, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and
    Albert-László Barabási. The human disease network. Proceedings of the National Academy
    of Sciences, 104(21):8685–8690, 2007.

    [13] Sebastian Schneeweiss, Philip S Wang, Jerry Avorn, and Robert J Glynn. Improved
    comorbidity adjustment for predicting mortality in Medicare populations. Health
    Services Research, 38(4):1103–1120, 2003.

    [14] Russell A Poldrack and Mark G Packard. Competition among multiple memory systems:
    converging evidence from animal and human brain studies. Neuropsychologia,
    41(3):245–251, 2003.

    [15] Nicolai Meinshausen and Peter Lukas Bühlmann. Consistent neighbourhood selection
    for sparse high-dimensional graphs with the lasso. Seminar für Statistik,
    Eidgenössische Technische Hochschule (ETH), Zürich, 2004.

    Appendices

    Expert validation general comment from Dr Dan O'Neill:

    Many of the more common disorders in dogs are syndromes in the sense that they
    represent a spectrum of underlying specific disorders that all share a common
    presentation pattern. This has the result of making them common as apparently
    distinctive clinical presentations but may reduce the comorbidity indices with other
    disorders because of the varying underlying true pathologies. It should be noted that
    comorbidity studies carried out across all disorders recorded in dogs are subject to
    the risk of spurious results being identified due to chance. These studies are best
    suited to hypothesis generation and should be confirmed by later specific confirmatory
    studies. During the validation process, the expert defined the comorbidity associations
    as being expected or unexpected based on current veterinary norms. The unexpected
    results are potential new areas for investigation that offer the opportunity to
    identify previously unknown associations. While the GGM and Phi results were generally
    consistent with current veterinary expectation, the RR results seemed to miss some
    important associations identified by the other two methods. It would appear that RR is
    a less useful method in this respect. Overall these comorbidity results are highly
    consistent with conventional veterinary understanding of disease associations. Novel
    but potentially useful findings include comorbidity between DJD and hypothyroidism, and
    between periodontal disease and heart disorders.

    Validation table:
