A Pathway and Network Oriented Approach to Enlighten Molecular
Mechanisms of Type 2 Diabetes Using Multiple Association Studies
Burcu Bakir-Gungor1*, Miray Unlu Yazici2, Gokhan Goy1, Mustafa Temiz1
1 Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey.
2 Department of Bioengineering, Abdullah Gul University, Kayseri, Turkey.
* Correspondence:
Burcu Bakir-Gungor, [email protected]

Keywords: Genome-wide association study (GWAS), multiple association studies, single
nucleotide polymorphism (SNP), subnetwork identification, pathway subnetwork, pathway
clustering analysis, normalized mutual information (NMI), type 2 diabetes.
Abstract
Diabetes Mellitus (DM) is a group of metabolic disorders characterized by pancreatic
dysfunction in insulin-producing beta cells and glucagon-secreting alpha cells, and by
hyperglycemia related to insulin resistance or impaired insulin function. Type 2 Diabetes
Mellitus (T2D), which constitutes 90% of diabetes cases, is a complex multifactorial disease.
In the last decade, genome-wide association studies (GWASs) for T2D successfully pinpointed
genetic variants (typically single nucleotide polymorphisms, SNPs) that associate with disease
risk. However, traditional GWASs focus on the ‘tip of the iceberg’ SNPs, and the SNPs with mild
effects are discarded. To diminish the burden of multiple testing in GWAS, researchers have
attempted to evaluate the collective effects of interesting variants. In this regard,
pathway-based analyses of GWAS became popular for discovering novel multi-genic functional
associations. Still, to reveal the unaccounted 85 to 90% of T2D variation, which lies hidden in
GWAS datasets, new post-GWAS strategies need to be developed. In this respect, here we reanalyze
three GWAS meta-analysis datasets for T2D, using the methodology that we have developed to
identify disease-associated pathways by combining nominally significant evidence of genetic
association with the known biochemical pathways, protein-protein interaction (PPI) networks, and
the functional information of selected SNPs. To enlighten the molecular mechanisms underlying
T2D development and progression, we integrated different in-silico approaches that proceed in
top-down and bottom-up manners, and hence present a comprehensive analysis at the protein
subnetwork, pathway, and pathway subnetwork levels. Our network and pathway-oriented approach is
based on both the significance level of an affected pathway and its topological relationship
with its neighbor pathways. Using the mutual information based on the shared genes, the
identified protein subnetworks and the affected pathways of each dataset were compared. While
most of the identified pathways recapitulate the pathophysiology of T2D, our results show that
incorporating SNP functional properties and protein-protein interaction networks into GWAS can
dissect leading molecular pathways, which cannot be picked up using traditional analyses. We
hope to bridge the knowledge gap from sequence to consequence.
bioRxiv preprint doi: https://doi.org/10.1101/2020.01.14.905547; this version posted January 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Post-GWAS Methodology to Enlighten T2D
This is a provisional file, not the final typeset article
1 Introduction
More than 400 million adults struggle with Diabetes Mellitus, and this number is expected to
reach 600 million by 2040 (International Diabetes Federation, 2017). Type 1 and Type 2 Diabetes
Mellitus (T1DM, T2DM) are the two main types of diabetes. In both, the body does not properly
use blood glucose for energy, and together they constitute a worldwide health care problem.
While T1DM is mostly related to pancreatic beta cell damage, T2DM is associated with both beta
cell functionality and insulin resistance (DeFronzo et al., 2015; Zheng et al., 2018). Recently,
with the help of antidiabetic agents, significant progress has been made in maintaining glycemic
control (blood sugar level) in T2D patients. Still, the targeted glycated hemoglobin levels
could not be maintained for 40% of the adults with diabetes in the USA. The decrease in
pancreatic beta cell functionality and the increase in the insulin resistance of T2D patients
over time eventually give rise to an imbalance in the A1C level and to an antidiabetic treatment
gap (Freeman, 2013). This kind of imbalance and dysfunction emerges as a result of complex
interactions among environmental and genetic risk factors. In this respect, the etiology,
driving factors and genetic predispositions responsible for the increased susceptibility to T2D
need to be well understood in order to develop new drugs and treatments for this disorder. In
such complex diseases, investigating different mechanisms of action may benefit therapeutic
approaches. Therefore, post-analysis of high-throughput studies conducted at different molecular
levels and the elucidation of targeted genes and pathways associated with T2D are crucial.
The widespread introduction of large-scale genetic studies has enabled researchers to
investigate the genetic frameworks of complex disorders. During the last decade, genome-wide
association studies (GWASs) have been widely used to identify the risk factors of complex
diseases, to better understand the biological mechanisms of these diseases, and hence to help
the discovery of novel therapeutic targets (Claussnitzer et al., 2020). Although GWASs have led
to a remarkable range of discoveries in human genetics (Visscher et al., 2017), they have some
shortcomings. One important shortcoming stems from testing each marker one at a time for
association with disease. Since these studies evaluate the significance of the variants
individually, they are likely to miss SNPs that contribute little to disease individually but
might be important when acting collectively. Moreover, in traditional GWASs, the functional
effects of significant SNPs, predicted at the splicing, transcriptional, translational, and
post-translational levels, are usually neglected. Although GWASs have identified more than 140
independent loci influencing the risk of T2D (Bonàs-Guarch et al., 2018; Mahajan et al., 2018b,
2018a; Mercader and Florez, 2017; Scott et al., 2017; Xue et al., 2018; Zhao et al., 2017), most
of these loci are driven by common variants, and mechanistic understanding has been achieved for
only a couple of these genes. In this respect, post-GWAS strategies need to be developed to
enlighten the molecular mechanisms underlying T2D development and progression (White et al.,
2019).
Recent studies indicated that methods focusing on pathways rather than individual genes can
detect significant coordinated changes, since genes act synergistically within a biological
pathway (Nguyen et al., 2019). Pathway analysis can in principle improve the power to uncover
genetic factors relevant to disease mechanisms, because identifying the accumulation of small
genetic effects acting in a common pathway is often easier than mapping the individual genes
within the pathway that contribute substantially to disease susceptibility (Kao et al., 2017;
Lamparter et al., 2016; Thrash et al., 2019). The discovery that T2D is genetically
heterogeneous suggested that the genetic defects might converge on common pathways that build up
a similar final phenotype. Besides providing the opportunity to investigate additional therapies
that reverse the effects of a particular genetic defect, these findings may also encourage
scientists to understand the aberrant networks at genetic, cellular and physiological levels and
to devise pharmacological and nonpharmacological intervention strategies.
Inspired by these findings, in this study we reanalyzed three GWAS meta-analysis datasets for
T2D, using the methodology that we have developed to identify disease-associated pathways by
combining nominally significant evidence of genetic association with the known biochemical
pathways, protein-protein interaction (PPI) networks, and the functional information of selected
SNPs (Bakir-Gungor et al., 2014).
2 Materials and Methods
2.1 Datasets
2.1.1 70K for T2D Meta-analysis data (T2D1)
Bonàs-Guarch et al. collected T2D genome-wide association study (GWAS) data, representing
12,931 cases and 57,196 controls of European ancestry, from the EGA and dbGaP databases
(Bonàs-Guarch et al., 2018). In the 70KforT2D meta-analysis data, each dataset was quality
controlled and each cohort was imputed to two reference panels (1000G and UK10K). Variants with
an IMPUTE2 info score ≥ 0.7, a minor allele frequency (MAF) ≥ 0.001, and a Hardy-Weinberg
equilibrium (HWE) p-value in controls > $1 \times 10^{-6}$ were meta-analyzed. For more details
about the quality control procedure and association analysis of the 70KforT2D dataset, please
see (Bonàs-Guarch et al., 2018).
2.1.2 Meta-analysis of DIAGRAM, GERA, UKB GWAS datasets (T2D2)
Xue et al. performed a meta-analysis of GWAS in T2D by gathering the DIAGRAM, GERA and UKB GWAS
datasets (Xue et al., 2018). In total, 62,892 cases and 596,424 controls of European ancestry
were obtained after quality control and imputed to the 1000 Genomes Project reference panel.
Linkage disequilibrium (LD) score regression analysis was also performed. Variants were filtered
for GERA and UKB using an IMPUTE2 info score ≥ 0.3, MAF ≥ 0.01, and an HWE p-value in controls >
$1 \times 10^{-6}$. Further details about the DIAGRAM imputed data in stages 1 and 2, and about
genotyping, quality control and association analysis for each dataset, can be found in (Xue et
al., 2018).
2.1.3 Type 2 Diabetes GWAS Meta-analysis Dataset #3 (T2D3)
Mahajan et al. collected T2D GWAS datasets from 32 studies, including 74,124 cases and 824,006
controls of European ancestry, and aggregated the data after initial analyses (Mahajan et al.,
2018a). Following quality control checks, the studies were imputed using the Haplotype Reference
Consortium reference panel, except for the deCODE GWAS, where a population-specific reference
panel was used for imputation. For detailed information, please refer to (Mahajan et al.,
2018a).
2.1.4 Protein-protein interaction (PPI) dataset
A human protein-protein interaction (PPI) network (interactome data) containing 13,460 proteins
and 141,296 protein-protein interactions was obtained from (Ghiassian et al., 2015) and used in
the subnetwork identification steps of this study.
2.2 Methods
To enlighten the molecular mechanisms underlying T2D development and progression, here we
integrated different in-silico approaches that proceed in top-down and bottom-up manners, as
summarized in Figure 1. By combining nominally significant evidence of genetic association with
the known biochemical pathways, PPI networks, and the functional information of selected SNPs,
our proposed approach identifies disease-associated pathways.
2.2.1 Preprocessing
Association summary statistics for the T2D1, T2D2 and T2D3 datasets were downloaded from each
project’s website. These summary statistics include i) the marker name as chromosome and
position, ii) the effect allele, iii) the non-effect allele, and iv) the p-value of association.
To be able to assess the collective effect of the variants detected in GWAS with mild effects,
all variants were filtered using a p<0.05 cutoff, as suggested in previous studies (Bakir-Gungor
et al., 2013, 2015; Bakir-Gungor and Sezerman, 2011, 2013; Baranzini et al., 2009).
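The filtering step above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the column names and the four example rows are hypothetical, and the real summary-statistics files may use a different layout.

```python
import csv
import io

# Hypothetical excerpt of a summary-statistics file; the column names
# and values are illustrative assumptions, not taken from the datasets.
SUMMARY_STATS = """marker,effect_allele,other_allele,p_value
1:1000,A,G,0.003
1:2000,C,T,0.2
2:1500,G,A,0.049
2:9000,T,C,0.05
"""


def filter_nominal(rows, cutoff=0.05):
    """Keep variants whose association p-value is strictly below the cutoff."""
    return [row for row in rows if float(row["p_value"]) < cutoff]


rows = list(csv.DictReader(io.StringIO(SUMMARY_STATS)))
kept = filter_nominal(rows)
print([row["marker"] for row in kept])
```

A strict inequality is used here to mirror the p<0.05 cutoff stated in the text, so a variant with p exactly equal to 0.05 is dropped.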
2.2.2 Assigning rsIDs to identified SNPs
While the T2D2 dataset provides the rsIDs of the identified SNPs in its summary statistics, the
T2D1 and T2D3 datasets only provide chromosome and position information as the marker name of
each variant and do not provide the associated rsIDs. For these two datasets, the fast and easy
variant annotation protocol introduced by (Yang and Wang, 2015) was used to assign rsIDs to the
identified SNPs, using the hg19 or hg38 reference genome depending on the genomic coordinates
provided in each dataset.
2.2.3 Assessing the Functional Impacts of Genetic Variants
To assess the functional impact of a non-synonymous change on proteins, numerous computational
methods have been developed, as reviewed in (Zeng and Bromberg, 2019). These methods can be
classified as follows: i) methods that score mutations on the basis of biological principles,
and ii) methods that use existing knowledge about the functional effects of mutations in the
form of a training set for supervised machine learning (Carter et al., 2013). Most of these
methods assign a numeric score to the non-synonymous change, indicating the predicted functional
impact of an amino acid substitution. To identify likely functional missense mutations, Douville
et al. developed the Variant Effect Scoring Tool (VEST), which uses Random Forest as a
supervised machine learning algorithm (Douville et al., 2016). Douville et al. represented all
mutations with a set of 86 quantitative features; in their training set, they used missense
variants from the Human Gene Mutation Database as the positive class and common missense
variants detected in the Exome Sequencing Project (ESP) as the negative class (Douville et al.,
2016). Since VEST scores result in 0.9 sensitivity and 0.9 specificity values, these scores are
used to assess the functional impacts of genetic variants in our study.
2.2.4 Assigning SNPs to genes
Several post-GWAS studies map disease-associated SNPs to genes based on physical distance (Segrè
et al., 2010), linkage disequilibrium (LD) (Pers et al., 2015), or a combination of both (Wood
et al., 2014). In this respect, several methods have been proposed to aggregate SNP summary
statistics into gene scores (Li et al., 2011; Liu et al., 2010; Segrè et al., 2010). Most of
these methods first calculate chi-squared values by applying the inverse chi-squared quantile
transformation to the SNP p-values. Secondly, within a window encompassing the gene of interest,
some of these methods focus only on the most significant SNP and assign the
maximum-of-chi-squares as the gene score statistic (Lee et al., 2011; Segrè et al., 2010). Other
methods combine results for all SNPs in the gene region by using the sum-of-chi-squares
statistic (Wang et al., 2011). In order to compute a well-calibrated p-value for the statistic,
correction for gene size and LD structure is also critical. (Lamparter et al., 2016) rigorously
analyzed the effects of using the sum and the maximum of chi-squared statistics, which
correspond to the average and the strongest association signals per gene, respectively.
(Lamparter et al., 2016) proposed a fast and efficient methodology, Pascal, that calculates gene
scores by aggregating SNP p-values from a GWAS meta-analysis while correcting for LD structure.
Pascal requires only SNP-phenotype association summary statistics and does not require
individual genotype data. Hence, we used this tool in our study to map SNPs to genes.
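To make the two gene-level statistics concrete, the sketch below converts SNP p-values to chi-squared values with one degree of freedom and then takes their maximum and their sum. It is not Pascal: it assumes two-sided SNP p-values, and it deliberately omits the LD and gene-size correction that the text describes as critical for obtaining a calibrated gene p-value. The function names are hypothetical.

```python
from statistics import NormalDist


def p_to_chi2(p):
    """Inverse chi-squared (1 d.f.) quantile transform of a p-value in (0, 1).

    For X ~ chi2(1), P(X > x) = 2 * (1 - Phi(sqrt(x))), so x = Phi^-1(1 - p/2)^2.
    """
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def gene_statistics(snp_pvalues):
    """Return the max-of-chi-squares and sum-of-chi-squares for one gene window."""
    chi2_values = [p_to_chi2(p) for p in snp_pvalues]
    return {"max": max(chi2_values), "sum": sum(chi2_values)}


# Illustrative p-values for SNPs falling in one gene window (made-up numbers).
stats = gene_statistics([0.05, 0.5, 0.001])
print(stats)
```

The maximum reflects only the strongest SNP in the window, whereas the sum grows with every SNP in the window, which is why an uncorrected sum favors large genes and genes with many correlated SNPs.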
2.2.5 The Identification of Dysregulated Modules
High-throughput experiments enable us to gain a better understanding of the functions of the
biological molecules in the cell. In addition to the individual activities of these molecules,
their interactions are essential to elucidate molecular mechanisms. In this regard, human
protein-protein interaction (PPI) networks represent the interactions between human proteins. By
analyzing PPI networks, specific sets of proteins (modules) associated with a disease phenotype
can be detected. This idea is exploited in several post-GWAS analyses (Bakir-Gungor et al.,
2013, 2014, 2015; Bakir-Gungor and Sezerman, 2013; Bakir-Gungor and Sezerman, 2011; Chang et
al., 2018). An undirected graph can be defined as G = (V, E), in which the vertices or nodes (V)
represent proteins, the edges (E) represent the physical interactions among proteins, and the
graph (G) represents the PPI network. A group of proteins in a PPI network that work together to
carry out a specific set of functions can be defined as a subnetwork. With the idea of proteins
working as a team, the detection of disease-related protein subnetworks has been widely
investigated. Active subnetwork search algorithms were originally proposed to identify
dysregulated modules in a PPI network using the gene expression values measured in a microarray
study (Ideker et al., 2002). The p-values of the genes, which indicate the significance of each
gene's expression change across conditions, are mapped to the PPI network, and a search
algorithm identifies dysregulated modules. Our group and several others later extended this idea
to post-GWAS analyses, where the SNPs are initially mapped to genes and the p-value of a gene
(genotypic p-value) then indicates the significance of that gene in the genetic association
study. In this study, to detect dysregulated modules, we use the following two approaches, which
proceed in top-down and bottom-up manners.
2.2.5.1 Using Subnetwork Identification Algorithms (Top-down approach)
The methodology proposed by (Ideker et al., 2002) to identify active modules in PPI networks
became a pioneering study in this field. While this method brings together the nodes that are
highly affected by the condition under study, it also gives a chance to the neighbors of these
highly affected nodes, even if they are not highly affected themselves. In this method, a
scoring function is first defined for each subnetwork, and the problem then becomes a search for
the subnetwork that maximizes this score. More specifically, to score a subnetwork, the
genotypic p-value $p_i$ of each gene is converted to a z-score using the equation below, where
$\Phi^{-1}$ denotes the inverse normal cumulative distribution function.
$$z_i = \Phi^{-1}(1 - p_i)$$
The total z-score ($z_A$) of a subnetwork A that includes k genes is calculated as follows:
$$z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$$
This score is then normalized using the following equation, where $\mu_k$ and $\sigma_k$
indicate the mean and standard deviation, respectively; the subnetwork scores are also
calibrated by the Monte Carlo method.
$$s_A = \frac{z_A - \mu_k}{\sigma_k}$$
Once the subnetwork score is defined, a search strategy is needed; the greedy approach, genetic
algorithms, and simulated annealing are popular choices in active subnetwork identification
methodologies. In this study, the greedy approach is used during the search steps of the
algorithm, and the subnetwork score cutoff is chosen as 3, as suggested in the original paper
(Ideker et al., 2002), to select biologically meaningful subnetworks.
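The scoring and search steps of this section can be illustrated with a small sketch. It is a simplified reading, not the implementation used in the study: the toy network and p-values are made up, $\mu_k$ and $\sigma_k$ are assumed to be estimated from randomly sampled gene sets of size k, and the greedy search here grows a module by maximizing $z_A$ rather than the calibrated score. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def z_from_p(p):
    """z_i = Phi^-1(1 - p_i), for p strictly between 0 and 1."""
    return NormalDist().inv_cdf(1.0 - p)


def aggregate_z(genes, z):
    """z_A = (1 / sqrt(k)) * sum of z_i over the k genes in the subnetwork."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))


def calibrated_score(genes, z, n_samples=2000, seed=0):
    """s_A = (z_A - mu_k) / sigma_k, with mu_k and sigma_k estimated by
    Monte Carlo sampling of random gene sets of the same size k (assumption)."""
    rng = random.Random(seed)
    k = len(genes)
    pool = list(z)
    background = [aggregate_z(rng.sample(pool, k), z) for _ in range(n_samples)]
    return (aggregate_z(genes, z) - mean(background)) / stdev(background)


def greedy_grow(seed_gene, adjacency, z):
    """Repeatedly add the neighbor that most increases z_A; stop when none does."""
    module = {seed_gene}
    best = aggregate_z(module, z)
    while True:
        frontier = {n for g in module for n in adjacency[g]} - module
        candidates = [(aggregate_z(module | {n}, z), n) for n in sorted(frontier)]
        if not candidates:
            return module, best
        score, node = max(candidates)
        if score <= best:
            return module, best
        module.add(node)
        best = score


# Toy genotypic p-values and interactions (illustrative only).
pvals = {"A": 0.001, "B": 0.01, "C": 0.6, "D": 0.02, "E": 0.9}
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"], "D": ["B"], "E": ["C"]}
z = {gene: z_from_p(p) for gene, p in pvals.items()}
module, score = greedy_grow("A", adjacency, z)
print(sorted(module), round(score, 3))
```

On a five-gene toy network the Monte Carlo background is too small to be meaningful, so `calibrated_score` is shown only to make the normalization explicit; the cutoff of 3 mentioned above applies to scores calibrated on the full network.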
2.2.5.2 Using Network Propagation (Bottom-up approach)
Based on the idea that disease-related proteins do not concentrate in a specific region of the
network, some studies estimate dysregulated modules from the degree of the affected nodes and
from the edges (protein interactions). (Ghiassian et al., 2015) proposed the DIseAse MOdule
Detection (DIAMOnD) algorithm, which finds dysregulated modules by adding other candidate
proteins around the known disease protein clusters. Based on random walking, a walker starts
from a random seed protein and moves to other nodes along the connections of the network. It is
hypothesized that more frequently visited proteins are closer to the seed proteins (proteins
that are known to be associated with the disease). The probability that a protein with k
interactions has exactly $k_s$ interactions with seed proteins is given by the hypergeometric
distribution as follows:
$$p(k, k_s) = \frac{\binom{s_0}{k_s} \binom{N - s_0}{k - k_s}}{\binom{N}{k}}$$
Here, N denotes the number of proteins and $s_0$ denotes the number of seed proteins associated
with a particular disease. Whether a protein's interactions with the seed proteins could have
arisen by chance is assessed with the p-value in the equation below. In this way, starting from
the seed proteins, other candidate proteins associated with the disease can be identified.
$$p_{value}(k, k_s) = \sum_{k_i = k_s}^{k} p(k, k_i)$$
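The two equations above translate directly into code. The sketch below computes the connectivity p-value for every non-seed node of a toy network and picks the most significant one, which is the kind of ranking an iterative module-expansion procedure would rely on. It is not the DIAMOnD implementation; the graph, the seed set, and the function names are hypothetical.

```python
from math import comb


def link_probability(N, s0, k, ks):
    """p(k, ks): chance that a node with k links has exactly ks links to seeds."""
    return comb(s0, ks) * comb(N - s0, k - ks) / comb(N, k)


def connectivity_pvalue(N, s0, k, ks):
    """Sum of p(k, ki) for ki = ks .. k (terms with ki > s0 are zero)."""
    return sum(link_probability(N, s0, k, ki) for ki in range(ks, k + 1))


def most_significant_candidate(adjacency, seeds):
    """Score every non-seed node and return the one with the smallest p-value."""
    N, s0 = len(adjacency), len(seeds)
    scores = {}
    for node, neighbors in adjacency.items():
        if node in seeds:
            continue
        k = len(neighbors)
        ks = sum(1 for n in neighbors if n in seeds)
        scores[node] = connectivity_pvalue(N, s0, k, ks)
    return min(scores, key=scores.get), scores


# Toy network with two seed proteins (illustrative only).
adjacency = {
    "S1": ["X", "Y"],
    "S2": ["X"],
    "X": ["S1", "S2"],
    "Y": ["S1", "Z"],
    "Z": ["Y"],
}
best, scores = most_significant_candidate(adjacency, seeds={"S1", "S2"})
print(best, scores)
```

Here node X is preferred because both of its two links go to seeds, even though Y has the same degree; the p-value rewards an excess of seed links relative to a node's degree rather than the raw number of links.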
2.2.6 Functional Enrichment
In multifactorial complex disorders, a single factor is unlikely to explain the disease
mechanism. Within this scope, functional enrichment analysis focuses on the interconnection of
terms and functional groups in networks to predict the pathways affected in the disease of
interest. The hypergeometric test and multiple testing correction methods such as Bonferroni and
Benjamini-Hochberg are used for the analyses. The hypergeometric p-value determines the
significance of gene enrichment above a certain threshold from predefined functional terms.
$$P_{value} = \sum_{k=n}^{\min(g,d)} \frac{\binom{g}{k} \binom{f - g}{d - k}}{\binom{f}{d}}$$
Accordingly, the important pathways in the disease, together with the upregulated and
downregulated target genes in each pathway, are predicted and given as output. In this study,
ClueGO (Bindea et al., 2009) is used for enrichment analysis, with KEGG biological pathways as
the reference pathways.
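As a rough illustration of the enrichment test and of the Benjamini-Hochberg correction named above, consider the sketch below. It does not reproduce ClueGO. The source does not define the symbols of the enrichment formula, so the reading used here is an assumption: f is the number of background genes, g the number of genes annotated to the pathway, d the size of the query gene list, and n the observed overlap.

```python
from math import comb


def enrichment_pvalue(f, g, d, n):
    """Hypergeometric tail: probability of an overlap of at least n genes.

    Assumed meaning of the symbols: f background genes, g genes in the
    pathway, d genes in the query list, n genes shared by both.
    """
    tail = sum(comb(g, k) * comb(f - g, d - k) for k in range(n, min(g, d) + 1))
    return tail / comb(f, d)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(enrichment_pvalue(f=10, g=3, d=2, n=1))
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Bonferroni would instead multiply every p-value by the number of tested pathways, which is stricter; Benjamini-Hochberg controls the false discovery rate and is therefore less conservative when many pathways are tested.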
2.2.7 Construction of Pathway Network
Figure 2 summarizes our steps for pathway-pathway biological network generation and pathway
subnetwork identification. In order to establish a pathway network, the relationships between
the genes and 288 KEGG biological pathways first need to be analyzed. This relationship is
revealed by examining whether the gene of interest is found in a specific pathway or not. For
example, if pathway i includes gene j, a value of 1 is assigned to entry (i, j) of the gene-term
matrix; if not, a value of 0 is assigned. Hence, the created gene-term matrix is a binary
matrix, as shown in Figure 2. Secondly, the relationships between pathways need to be analyzed.
For this purpose, the term-term matrix is formed from the previously obtained gene-term matrix,
as illustrated in Figure 2. The Kappa score is used to determine the relationships among the
pathways. The Kappa score for any two pathways A and B is given as follows:
$$G_{A,B} = \frac{CN_{1,1} + CN_{0,0}}{CN_{1,1} + CN_{0,0} + CN_{0,1} + CN_{1,0}}$$
$$C_{A,B} = \frac{(CN_{0,1} + CN_{1,1})(CN_{1,0} + CN_{1,1}) + (CN_{0,0} + CN_{1,0})(CN_{0,0} + CN_{0,1})}{(CN_{1,1} + CN_{0,0} + CN_{0,1} + CN_{1,0})^2}$$
$$K_{A,B} = \frac{G_{A,B} - C_{A,B}}{1 - C_{A,B}}$$
where $G_{A,B}$ represents the observed contingency, $C_{A,B}$ represents the random
contingency, and $K_{A,B}$ represents the Kappa score between pathways A and B. The counters
$CN_{1,1}$, $CN_{0,0}$, $CN_{1,0}$ and $CN_{0,1}$ are calculated as follows. If the gene of
interest is present in both compared pathways, the $CN_{1,1}$ counter is increased by 1. The
values of the other counters are calculated in the same way. The Kappa scores, which express the
relationships between pairs of pathways, were obtained from the observed contingency (G) and
random contingency (C) values and stored in the term-term matrix. By applying a threshold on the
Kappa scores, the human KEGG pathway network is created. The pathway network generation steps
are implemented in Java.
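The Kappa computation can be sketched in a few lines. The authors' implementation is in Java; the Python version below is only an illustration of the three equations, with hypothetical function names, a made-up gene universe, and a threshold value chosen for the example because the text does not state the one used.

```python
from itertools import combinations


def kappa_score(pathway_a, pathway_b, all_genes):
    """Kappa score between two pathways given as sets of gene identifiers."""
    cn11 = cn00 = cn10 = cn01 = 0
    for gene in all_genes:
        in_a, in_b = gene in pathway_a, gene in pathway_b
        if in_a and in_b:
            cn11 += 1
        elif in_a:
            cn10 += 1
        elif in_b:
            cn01 += 1
        else:
            cn00 += 1
    total = cn11 + cn00 + cn01 + cn10
    observed = (cn11 + cn00) / total                      # G_{A,B}
    chance = ((cn01 + cn11) * (cn10 + cn11)
              + (cn00 + cn10) * (cn00 + cn01)) / (total * total)  # C_{A,B}
    if chance == 1.0:
        return 0.0  # degenerate case (e.g. both pathways empty); sketch convention
    return (observed - chance) / (1.0 - chance)           # K_{A,B}


def pathway_network(pathways, all_genes, threshold):
    """Connect two pathways when their Kappa score reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(pathways), 2)
        if kappa_score(pathways[a], pathways[b], all_genes) >= threshold
    ]


# Toy gene universe and pathways (illustrative only).
all_genes = {f"g{i}" for i in range(1, 9)}
pathways = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g2", "g3", "g5"},
    "P3": {"g5", "g6", "g7", "g8"},
}
edges = pathway_network(pathways, all_genes, threshold=0.4)  # threshold is an assumption
print(edges)
```

Because the score subtracts the agreement expected by chance, two pathways that merely share the absence of most genes do not end up connected, which a plain overlap ratio over the full gene universe would not guarantee.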
2.2.8 The Identification of Affected Pathway Subnetworks and Pathway Clusters
To exploit the interrelated structure of the pathways, we propose to apply subnetwork
identification methodologies to the generated pathway networks, so that disease-related affected
pathway subnetworks can be identified. A classical subnetwork identification algorithm requires
two inputs: i) the biological network file, and ii) the significance of the nodes. In the
regular subnetwork identification problem, (i) refers to a PPI network and (ii) refers to the
significance values of the genes obtained in a microarray experiment. Here, for (i), we used the
pathway network generated as described in Section 2.2.7. Regarding (ii), the functional
enrichment step explained in Section 2.2.6 outputs lists of affected pathways with their
p-values, indicating the importance of a pathway for the phenotype under study. Hence, to obtain
the affected pathway subnetworks, we followed the methodology described in Section 2.2.5.1, with
the pathway network taking the place of the protein-protein interaction network and the pathway
significance values taking the place of the protein significance values.
To select biologically meaningful subnetworks among all generated subnetworks, the subnetwork
score cutoff is chosen as 3, as suggested in the original paper (Ideker et al., 2002). If an
identified subnetwork contains more than 50 nodes, this pathway subnetwork is further
sub-divided to find disease-related pathway clusters. At this step, we used a graph theoretic
clustering algorithm, Molecular Complex Detection (MCODE), to discover densely connected pathway
clusters in the T2D affected pathway subnetwork (Bader and Hogue, 2003). In order to locate the
dense regions in a PPI network, MCODE uses vertex weighting by local neighborhood density and
outward traversal from a locally dense seed protein. In our problem setting, the PPI network
corresponds to the generated pathway network, and the proteins correspond to the pathways. The
advantage of MCODE over other graph clustering methods is that it allows i) fine-tuning of
clusters of interest without considering the rest of the network and ii) inspection of cluster
interconnectivity, which is relevant for pathway networks (Bader and Hogue, 2003). It uses four
parameters to find clusters: the cutoff value, the K-core value, and the haircut and fluff
parameters. The cutoff value sets the density of the cluster to be estimated. The K-core
parameter is used to assign weights to the nodes, which MCODE later uses to reduce the running
time complexity. The haircut parameter, which is binary, allows the elimination of nodes
considered to be topologically irrelevant. The fluff parameter allows the user to adjust the
size of the cluster, which is estimated topologically in the default mode (Bader and Hogue,
2003). In our analyses, the default values of these parameters are used. In the last step, the
identified T2D affected pathway subnetworks and pathway clusters are evaluated.
2.2.9 Pathway Scoring Algorithm
Integrating SNPs across genes and pathways in GWASs has the potential to significantly improve
statistical power and to enlighten the relevant biological mechanisms. However, this process is
challenging because genes play multi-functional roles in several biological processes and
because information about all phenotype-process pairs is inadequate. In this regard, Pascal
(Pathway scoring algorithm) is a robust tool that calculates gene and pathway scores from
SNP-phenotype association summary statistics (Lamparter et al., 2016). It does not require
genotype data. Firstly, Pascal calculates gene scores by aggregating SNP p-values from a GWAS
meta-analysis while correcting for LD structure. For the gene scores, its authors compared the
effect of using the sum of chi-squared statistics (average association signal per gene) with the
effect of using the maximum of chi-squared statistics (strongest association signal per gene)
(Lamparter et al., 2016). Secondly, Pascal calculates pathway scores by aggregating the scores
of genes that belong to the same pathway using a modified Fisher method (Lamparter et al.,
2016).
2.2.10 Comparison of the Identified Subnetworks and Pathways from Different Datasets Using 310
Normalized Mutual Information (NMI) 311
In order to evaluate the similarities between two different community detection algorithms, (Xuan 312
Vinh et al., 2010) and (Tripathi et al., 2016) proposed to use Normalized Mutual Information. Let U 313
and V be the sets of subnetworks that are identified using different datasets. Let U= {U1, …., UR} 314
denote the set of R different subnetworks identified using dataset x, and let V= {V1, …., VS} denote 315
the set of S different subnetworks identified using dataset y. The following contingency table (Table 316
1) illustrates the numbers of shared genes between pairs of subnetworks. In other words, nij indicates 317
the number of common genes between subnetworks Ui and Vj. The entropy of communities H(U), 318
H(V) and mutual information I (U, V) are calculated as following. 319
𝐻(𝑈) = − ∑𝑎𝑖
𝑁
𝑅
𝑖=1
(log𝑎𝑖
𝑁) 320
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 15, 2020. ; https://doi.org/10.1101/2020.01.14.905547doi: bioRxiv preprint
Post-GWAS Methodology to Enlighten T2D
PAGE \* \*
MERGEFORMAT 9
𝐻(𝑉) = − ∑𝑏𝑖
𝑁
𝑆
𝑖=1
(log𝑏𝑖
𝑁) 321
𝐼(𝑈, 𝑉) = ∑ 𝑎
𝑅
𝑖=1
∑𝑛𝑖𝑗
𝑁
𝑆
𝑖=1
(log𝑛𝑖𝑗 𝑁⁄
𝑎𝑖𝑏𝑗 𝑁2⁄) 322
𝑁𝑀𝐼𝑆𝑈𝑀 = 2 × 𝐼(𝑈, 𝑉)
𝐻(𝑈) + 𝐻(𝑉) 323
Here, I(U, V) indicates the amount of information shared between the U and V communities. NMISUM
compares the two clusterings on a scale of [0, 1], where a value of 0 indicates no similarity between
them (Vinh et al., 2010).
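The calculation can be illustrated with a minimal Python sketch. It assumes, as is usual for this measure, that a_i and b_j are the row and column sums of the contingency table and that N is the sum of all its entries; these definitions, the function name nmi_sum and the convention 0 log 0 = 0 are assumptions made for illustration and are not taken from the text above.

```python
import math

def nmi_sum(table):
    """NMI_SUM for a contingency table given as a list of rows of counts n_ij."""
    row_sums = [sum(row) for row in table]          # a_i (assumed: row totals)
    col_sums = [sum(col) for col in zip(*table)]    # b_j (assumed: column totals)
    total = sum(row_sums)                           # N   (assumed: grand total)
    if total == 0:
        return 0.0

    def entropy(marginals):
        # Empty rows or columns contribute nothing (0 log 0 is taken as 0).
        return -sum((m / total) * math.log(m / total) for m in marginals if m > 0)

    h_u = entropy(row_sums)
    h_v = entropy(col_sums)

    mutual_info = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:
                joint = n_ij / total
                independent = row_sums[i] * col_sums[j] / total ** 2
                mutual_info += joint * math.log(joint / independent)

    if h_u + h_v == 0:
        # Both sides consist of a single group; the ratio is undefined, return 0.
        return 0.0
    return 2 * mutual_info / (h_u + h_v)
```

With this sketch, a table whose counts lie only on the diagonal gives a score of 1, and a table whose entries factorize into the product of their marginals gives a score of 0.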
3 Results 327
Based on the idea that genes and proteins perform cellular functions in a coordinated fashion,
understanding how proteins cooperate in interaction networks may help to identify candidate
biomarkers. In this study, we proposed an integrative approach that concurrently analyzes multiple
association studies, considers the functional impacts of the associated variants, incorporates the
interaction partners of susceptibility genes, builds a pathway network of functionally enriched
pathways, and finally determines the clusters and subnetworks of affected pathways. The methodology
outlined in Figure 1 is applied to three meta-analyses of GWAS data, which are introduced in
Section 2.1. As summarized in Table 2, the T2D1, T2D2 and T2D3 datasets include 14,683,492,
5,053,015 and 21,635,866 SNPs, respectively. After filtering the three GWAS datasets with a p < 0.05
cutoff, the SNPs with mild effects are retained and the numbers of genetic variants are reduced to
762,111, 557,564 and 1,525,650 for the T2D1, T2D2 and T2D3 datasets, respectively. The chromosomal
position, reference allele and altered allele of each genetic variant are used to assign rsIDs. In this
way, 335,212 and 639,622 rsIDs are assigned for the T2D1 and T2D3 datasets, respectively, as explained
in Section 2.2.2 (reference genome: hg19). The 557,564 rsIDs provided as part of the T2D2 dataset are
used directly in further analyses. In the next step, a functional score is assigned to each SNP using
VEST (Douville et al., 2016), as explained in Section 2.2.3. Weighted p-values (pW) are calculated for
the SNPs by combining the genetic association p-values with the functional scores (FS) as
pW = pGWAS / 10^FS, as proposed by (Saccone et al., 2008). Then, the SNPs are mapped to 15,806,
15,460 and 17,200 genes for the T2D1, T2D2 and T2D3 datasets, respectively. Combined p-values of the
10,298 genes common to the three datasets are calculated using Fisher’s combined test (Fisher, 1934);
this combined dataset is referred to as T2D-combined (T2DC) in the rest of this paper. For the
detection of dysregulated modules, top-down and bottom-up approaches are followed, as explained in
Section 2.2.5.
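The two p-value operations described above, functional weighting and Fisher's combination, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, p-values are assumed to be strictly positive, and the closed-form chi-squared tail is used because Fisher's statistic always has an even number of degrees of freedom (2k for k p-values).

```python
import math

def weighted_p_value(p_gwas, functional_score):
    """Weighted p-value p_W = p_GWAS / 10**FS."""
    return p_gwas / (10 ** functional_score)

def fisher_combined_p(p_values):
    """Combine k independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)**i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term                   # adds (X/2)**i / i!
        term *= half_stat / (i + 1)    # prepares the next term
    return math.exp(-half_stat) * tail

# Hypothetical usage: one gene with a p-value from each of three datasets.
gene_p_values = [0.01, 0.02, 0.03]
combined = fisher_combined_p(gene_p_values)
```

A functional score of 0 leaves the association p-value unchanged, while a score of 1 lowers it by one order of magnitude.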
3.1 Affected subnetworks that are identified using meta GWAS datasets 350
For all datasets, the genes and their significance levels are mapped to the protein-protein interaction
network, and 983, 903, 940 and 813 active protein subnetworks are identified for the T2D1, T2D2, T2D3
and T2DC datasets, respectively. The numbers of genes included in these subnetworks are depicted in
Figure 3A for the 70KforT2D meta-analysis dataset (T2D1), in Figure 3B for the meta-analysis of the
DIAGRAM, GERA and UKB GWAS datasets (T2D2), in Figure 3C for the T2D3 dataset, and in Figure 3D
for the T2DC dataset. While most of the subnetworks include 175-250 genes in the T2D1 and T2D2
datasets, most of the subnetworks detected for the T2DC dataset include 200-250 genes. Around two
thirds of the subnetworks identified for the T2D3 dataset include 150-175 genes. For each identified
subnetwork, functional enrichment analysis is carried out and hence, affected pathways are 359
determined. 360
3.2 Dysregulated modules of T2D that are identified using network propagation 361
Known T2D genes collected in the (Ghiassian et al., 2015) study are used as seed genes to find
dysregulated modules by expanding the known disease gene clusters with other candidate genes. That
study indicated that seed proteins display unusual interaction patterns with one another, which
supports the idea that disease-specific modules do not arise by chance. Connectivity significance
values are calculated for all neighbors of the 73 known T2D-associated seed genes. The node with the
most significant connectivity is then added to the module, and this iteration is repeated until 200
genes, and separately 500 genes, are included in a module. Functional enrichment analysis is then
performed on each of these two dysregulated modules (T2D_D200, T2D_D500).
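The iterative expansion can be sketched as follows. The sketch assumes that connectivity significance is a hypergeometric tail probability (the chance that a node with a given degree has at least the observed number of links to the current module), and that the target size counts the seed genes as well. Both points, and all function and variable names, are assumptions made for illustration; the exact definition used in the study is given by (Ghiassian et al., 2015).

```python
from math import comb

def connectivity_p_value(n_nodes, n_module, degree, links_to_module):
    """P(X >= links_to_module) for X ~ Hypergeometric(n_nodes, n_module, degree)."""
    total = comb(n_nodes, degree)
    upper = min(degree, n_module)
    tail = sum(comb(n_module, x) * comb(n_nodes - n_module, degree - x)
               for x in range(links_to_module, upper + 1))
    return tail / total

def expand_module(adjacency, seeds, target_size):
    """Grow a module from seed genes, one most-significant neighbor at a time.

    adjacency maps each gene to the set of its interaction partners.
    """
    module = set(seeds)
    n_nodes = len(adjacency)
    while len(module) < target_size:
        # Candidates are all neighbors of the current module not yet in it.
        candidates = {nb for gene in module for nb in adjacency[gene]} - module
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic.
        best = min(sorted(candidates),
                   key=lambda g: connectivity_p_value(
                       n_nodes, len(module),
                       len(adjacency[g]), len(adjacency[g] & module)))
        module.add(best)
    return module
```

In this sketch the module, and therefore the p-value of every candidate, is updated after each addition, so a gene that was weakly connected to the original seeds can become significant later in the expansion.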
3.3 Affected Pathways of T2D 370
Based on the observation that genes almost always act cooperatively rather than independently, many
methods have been proposed to identify the biological pathways associated with a particular clinical
condition and thereby facilitate the biological interpretation of high-throughput data. Here, to
characterize this cooperative nature of genes and to elucidate the molecular mechanisms of T2D, we
investigate the affected pathways of T2D and search for potential failures in these wiring diagrams.
3.3.1 Overrepresented Pathways of T2D Dysregulated Modules 377
To detect possible pathogenic pathways related to T2D, the genes listed in each dysregulated
module are compared with the genes included in KEGG pathways, and the proportion of the module
genes over all pathway-associated genes is calculated. Significantly affected KEGG pathways
(pathways with corrected p-values < 0.05) for the dysregulated modules are added to the list of
potentially significant pathways of T2D. Table 3 presents the top 10 affected pathways that are found
to be overrepresented in the dysregulated modules of the T2DC dataset. Five of these pathways are
also identified in all other T2D datasets. These shared pathways are Spliceosome, Focal adhesion,
soluble N-ethylmaleimide-sensitive factor attachment protein receptor (SNARE) interactions in
vesicular transport, transforming growth factor-β (TGF-β) signaling, and ErbB signaling.
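An overrepresentation test of this kind can be sketched with a one-sided hypergeometric test. In the sketch below, the Bonferroni adjustment is only a simple stand-in for the unspecified correction method, and the function names, the background size parameter and the 0.05 default are illustrative assumptions.

```python
from math import comb

def enrichment_p_value(module_genes, pathway_genes, background_size):
    """One-sided hypergeometric p-value for the module/pathway overlap."""
    overlap = len(module_genes & pathway_genes)
    draws, successes = len(module_genes), len(pathway_genes)
    total = comb(background_size, draws)
    tail = sum(comb(successes, x) * comb(background_size - successes, draws - x)
               for x in range(overlap, min(draws, successes) + 1))
    return tail / total

def significant_pathways(module_genes, pathways, background_size, alpha=0.05):
    """Return pathways whose corrected p-value is below alpha.

    pathways maps a pathway name to its set of genes. Bonferroni correction
    is an assumption made here for simplicity.
    """
    raw = {name: enrichment_p_value(module_genes, genes, background_size)
           for name, genes in pathways.items()}
    n_tests = len(raw)
    corrected = {name: min(1.0, p * n_tests) for name, p in raw.items()}
    return {name: p for name, p in corrected.items() if p < alpha}
```

The result is a dictionary of pathway names and corrected p-values, which can be sorted to obtain a ranked list such as a top 10.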
Figures 4A and 4B depict the commonalities among the top 50 and top 100 affected pathways
enriched for the dysregulated modules of the T2D1, T2D2, T2D3 and T2DC datasets, and with the gold
standard T2D pathways. As illustrated in Figure 4A, when the top 50 affected pathways of all four
datasets are overlapped, 24 KEGG pathways are found in common. These pathways
are Valine, leucine and isoleucine degradation, SNARE interactions in vesicular transport,
Cholinergic synapse, TGF-beta signaling pathway, ErbB signaling pathway, Ubiquitin mediated
proteolysis, Focal adhesion, ECM-receptor interaction, Gap junction, Spliceosome, Serotonergic
synapse, Pathways in cancer, Retrograde endocannabinoid signaling, beta-Alanine metabolism,
Neurotrophin signaling pathway, GABAergic synapse, Chemokine signaling pathway, Glioma,
Dopaminergic synapse, Glutamatergic synapse, Endocytosis, GnRH signaling pathway, T cell
receptor signaling pathway, and Fc gamma R-mediated phagocytosis. When these top 50 affected
pathways of the four datasets are compared with the gold standard T2D pathway set (Yoon et al., 2018),
the Valine, leucine and isoleucine degradation pathway is commonly identified (as shown in Figure
4A). The comparison of the top 100 affected pathways of these datasets with the gold standard T2D
pathway set resulted in 8 common KEGG pathways, which are Valine, leucine and isoleucine
degradation, Jak-STAT signaling pathway, Cell cycle, Glycolysis / Gluconeogenesis, Calcium
signaling pathway, Insulin signaling pathway, Fatty acid metabolism, Wnt signaling pathway (as 403
shown in Figure 4B). 404
3.3.2 Enriched Pathways for the Expanded Modules of T2D Seed Genes 405
Overrepresented pathways for the modules expanded from the 73 T2D seed genes to 200 and 500
genes are identified with functional enrichment analysis. As shown in Table 4, the enrichment
of the T2D_D200 and T2D_D500 dysregulated modules resulted in 41 and 84 significant
pathways, respectively.
3.3.3 The Pathways that are Identified Using Pathway Scoring Algorithm on T2D GWAS meta 410
data 411
The pathway scoring algorithm, as explained in Section 2.2.9, is used to find potentially affected
pathways for the T2D1, T2D2 and T2D3 datasets. Firstly, gene and pathway scores are computed from
SNP-phenotype association summary statistics. Secondly, the pathway scores calculated for each
dataset are combined with Fisher’s method; consequently, 38 KEGG and 46 Reactome pathways are
detected for this combined data (T2D_PC).
Table 4 lists the KEGG pathways commonly identified by the T2DC, T2D_D500 and T2D_PC analyses,
which are described in Sections 3.3.1, 3.3.2 and 3.3.3, respectively. The affected pathways that
are highlighted in bold are the gold standard KEGG pathways reported in the (Yoon et al., 2018)
study. The affected pathways that are highlighted in italic are known in the literature to be
related to T2D development mechanisms, as discussed in detail in Section 4. Among the 17 gold
standard KEGG pathways of T2D, the Type II diabetes mellitus, Calcium signaling, Insulin signaling,
Wnt signaling, Adipocytokine signaling, and Jak-STAT signaling pathways are found with our
methodology.
3.4 Shared T2D Subnetworks and Pathways Among Different GWAS meta data 425
3.4.1 Comparative Evaluation of Identified T2D Subnetworks for Each Dataset 426
The identified T2D1, T2D2, T2D3 and T2DC subnetworks (as explained in Section 3.1 and
summarized in Figure 3) are compared in a pairwise manner to assess the information shared among
them. Firstly, for each pair (x, y) of the T2D1, T2D2, T2D3 and T2DC datasets, every identified
subnetwork of the T2Dx dataset is compared with every identified subnetwork of the T2Dy dataset at
the gene level, and a T2Dx/T2Dy contingency table is created, as shown in Table 1. In this
contingency table, each value nij is the number of genes shared between the ith subnetwork of the
T2Dx dataset and the jth subnetwork of the T2Dy dataset. Secondly, based on this table, the entropy
values H(T2Dx) and H(T2Dy) and the mutual information I(T2Dx, T2Dy) are computed for each dataset
pair. Thirdly, the normalized mutual information is calculated as explained in Section 2.2.10. This
procedure is repeated for all pairwise combinations of the T2D datasets, yielding a similarity score
(NMISUM) for every pair of datasets. The heatmaps in Figure 5 illustrate the similarities of the
datasets according to the NMISUM scores. As illustrated in Figure 5A, the subnetwork similarities
among T2D1, T2D2, T2D3 and T2DC fall in the range [0, 0.01]. The highest similarity score, 0.0073, is
obtained for the T2D2-T2D3 dataset pair, and the lowest, 0.0060, for the T2D1-T2DC dataset pair. In
the heatmaps of Figure 5, darker colors indicate higher similarity and lighter colors indicate lower
similarity. The NMISUM scores on the diagonals of the heatmaps are "whitened" for clearer
visibility of the other NMISUM values.
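The first step of this comparison, building one contingency table per dataset pair, can be sketched as follows. The sketch assumes that each dataset is represented as a list of subnetworks and each subnetwork as a set of gene symbols; this representation and the function names are illustrative assumptions.

```python
from itertools import combinations

def contingency_table(subnetworks_x, subnetworks_y):
    """Entry [i][j] is the number of genes shared by the i-th subnetwork of
    dataset x and the j-th subnetwork of dataset y."""
    return [[len(u & v) for v in subnetworks_y] for u in subnetworks_x]

def pairwise_tables(datasets):
    """Build one contingency table for every unordered pair of datasets.

    datasets maps a dataset name to its list of subnetworks (sets of genes).
    """
    return {(x, y): contingency_table(datasets[x], datasets[y])
            for x, y in combinations(sorted(datasets), 2)}
```

For four datasets this produces six tables, one for each unordered pair; the entropy, mutual information and NMISUM values are then computed from each table.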
3.4.2 Comparative Evaluation of Identified T2D Pathways for Each Dataset 444
Shared information among different methodologies (subnetwork identification, as presented in
Section 2.2.5.1, and the bottom-up approach, as presented in Section 2.2.5.2) and different T2D
meta-datasets is also evaluated in terms of the identified T2D pathways. The same functional
enrichment analysis is applied to the subnetworks and the dysregulated modules, as explained in
Section 2.2.6. In addition to the pathways identified for the T2D1, T2D2, T2D3 and T2DC datasets,
the pathways identified from the T2D_D200 and T2D_D500 gene sets are also evaluated here. Firstly,
for each pair (x, y) of T2D1, T2D2, T2D3, T2DC, T2D_D200 and T2D_D500, every identified pathway of
T2Dx is compared with every identified pathway of T2Dy in terms of their common genes, and a
T2Dx/T2Dy contingency table is created, as shown in Table 1. In this contingency table, each value
nij is the number of genes shared between the ith identified pathway of T2Dx and the jth identified
pathway of T2Dy. Secondly, based on this table, the entropy values H(T2Dx) and H(T2Dy) and the
mutual information I(T2Dx, T2Dy) are computed for each pair. Thirdly, the normalized mutual
information is calculated as explained in Section 2.2.10. This procedure is repeated for all pairwise
combinations, yielding a similarity score (NMISUM) for every pair in terms of overrepresented
pathways. Figure 5B illustrates these pathway-level similarities of T2D1, T2D2, T2D3, T2DC,
T2D_D200 and T2D_D500, which fall in the range [0, 0.1]. The maximum NMISUM score, 0.0658, is
obtained for the T2D1-T2D3 pair, and the minimum, 0.016, for the T2DC-T2D_D200 pair.
3.5 Affected Pathway Subnetworks and Pathway Clusters of T2D 463
We hypothesized that, similar to the dysregulated modules of proteins, dysregulated modules of
pathways have a role in disease development mechanisms. To identify the affected pathway
subnetworks of a disease, we proposed the methodology shown in Figure 2. Instead of a PPI
network, this method requires a pathway network as the baseline. Here, we used the 288 human
KEGG pathways as the reference for generating this biological network. To establish the pathway
network, we firstly examined the relationships between the genes and the biological pathways, as
explained in Section 2.2.7. We stored these relationships in a gene-term matrix, a binary matrix
with dimensions 6881 × 288, where 6881 is the number of distinct genes across all pathways and 288
is the number of pathways. Secondly, the relationships between the pathways are analyzed, as
explained in Section 2.2.7. For this purpose, kappa statistics are used, and a term-term matrix (of
size 288 × 288) is formed from the previously obtained gene-term matrix. Thirdly, to identify
interrelated pathways, we experimented with different cutoff values for the kappa score. The sizes of
the networks created with different threshold values are presented in Table 5. Since the node to edge
ratio in the human PPI network is approximately 1 to 10, the kappa score threshold is set to 0.15 in
this study, which yields a human pathway network with 288 pathways (nodes) and 2976 interrelations
(edges), i.e. a node to edge ratio of roughly 1 to 10.
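The construction of the pathway network can be sketched with Cohen's kappa computed on binary gene membership. In the sketch below, each pathway is represented as a set of genes instead of a column of the gene-term matrix, two pathways are linked when their kappa score is at least the threshold, and the function names are hypothetical; these choices are assumptions made for illustration.

```python
from itertools import combinations

def kappa_score(genes_a, genes_b, universe_size):
    """Cohen's kappa between two pathways viewed as binary membership vectors
    over a universe of universe_size genes."""
    both = len(genes_a & genes_b)
    only_a = len(genes_a) - both
    only_b = len(genes_b) - both
    neither = universe_size - both - only_a - only_b
    observed = (both + neither) / universe_size
    expected = (len(genes_a) * len(genes_b)
                + (universe_size - len(genes_a)) * (universe_size - len(genes_b))
                ) / universe_size ** 2
    if expected == 1:
        # Degenerate case (e.g. both pathways empty): agreement is uninformative.
        return 0.0
    return (observed - expected) / (1 - expected)

def pathway_network(pathways, universe_size, threshold=0.15):
    """Return edges (name_a, name_b) for pathway pairs with kappa >= threshold.

    pathways maps a pathway name to its set of genes.
    """
    edges = []
    for name_a, name_b in combinations(sorted(pathways), 2):
        score = kappa_score(pathways[name_a], pathways[name_b], universe_size)
        if score >= threshold:
            edges.append((name_a, name_b))
    return edges
```

Raising the threshold removes edges between weakly overlapping pathways, so the cutoff directly controls the density of the resulting network.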
Active subnetwork identification algorithms require a biological network and the significance values
of its nodes, e.g. the p-values of genes obtained from microarray studies, which indicate how
significantly the expression level of a gene differs between two experimental conditions. Here, our
generated pathway network is used as the biological network, and the corrected hypergeometric test
p-values, which indicate the importance of each pathway for T2D, are used as the significance values
of the nodes. Following the methodology proposed in Figure 2, for each T2D dataset, only one affected
pathway subnetwork exceeded the predefined subnetwork score, as summarized in Table 5. The node and
edge numbers of these pathway subnetworks, given in Table 5, show that the nodes are densely
connected to each
other in the identified pathway subnetworks. Therefore, these four identified pathway subnetworks
(one for each dataset) are further grouped into subcategories, as explained in Section 2.2.8, and
the affected pathway clusters of T2D are obtained for each dataset. As shown in Table 6, for the
T2D1, T2D2, T2D3 and T2DC datasets, 7, 9, 7 and 8 affected pathway clusters are identified,
respectively. The number of nodes (pathways) in each cluster and the score of each pathway cluster
can be found in Table 6. These results show that the initial pathway subnetwork, which is densely
connected and has more than 50 nodes, is divided into smaller disease-related subnetworks, which
supports the effectiveness of the developed method. The highest scoring pathway clusters of the
T2D1, T2D2, T2D3 and T2DC datasets include 38, 34, 35 and 35 pathways, respectively. For each
dataset, the representative networks of the identified pathway clusters are shown in Figure 6. As
shown in Figure 7, 29 of these pathways are commonly identified in the T2D1, T2D2, T2D3 and T2DC
datasets. The details of these commonly identified pathways are given in Table 7.
4 Discussion 504
GWASs of T2D have significantly accelerated the discovery of T2D-associated loci (Adeyemo et al.,
2015; Bonnefond and Froguel, 2015; Liu et al., 2017; Meyre, 2017; Scott et al., 2017). Although the
identified T2D-risk variants, including 243 loci and 403 distinct association signals, show potential
for clinical translation, the genome-wide chip heritability explains only 18% of T2D risk
(Bonàs-Guarch et al., 2018; Mahajan et al., 2018a; Xue et al., 2018). Traditional GWASs focus on the
top-ranked ‘tip of the iceberg’ SNPs and discard all others. Such GWAS approaches can reveal only a
small number of associated functions. Even though GWAS is a compelling method for detecting
disease-associated variants, it does not directly address the biological mechanisms underlying
genetic association signals; hence, novel post-GWAS analysis methodologies are needed (Lin et al.,
2017; Gallagher and Chen-Plotkin, 2018; Erdmann and Zeller, 2019). In this respect, to enlighten the
molecular mechanisms of Type 2 Diabetes development, we proposed a method that performs protein
subnetwork, pathway subnetwork and pathway cluster level analyses of the SNPs that are found to be
mildly associated with T2D in multiple association studies. In other words, to achieve a coherent
understanding of T2D molecular mechanisms, the proposed network- and pathway-based solution jointly
analyzes three meta-analyses of GWAS conducted on T2D.
Our study builds on the interactions of T2D-related proteins, since proteins act as the functional
base units of cells and construct the frameworks of cellular mechanisms. The protein network
structure helps us to gain a collective insight into biological systems. At the mesoscopic level of
these protein networks, active modules are the potential intermediate building blocks between
individual proteins and the global interaction network. Dysregulation of these modules is considered
to have a role in disease development mechanisms. Hence, identifying the dysregulated modules of T2D
helps us to understand the fundamental molecular characteristics of T2D and to discover new
candidate disease genes having a role in the regulation of T2D-related pathways. In this context, for
each analyzed T2D GWAS meta-analysis dataset (the characteristics of which are summarized in
Table 2), 800 to 1000 dysregulated modules, including 150 to 250 genes, are detected using a top-down
approach, as explained in Section 2.2.5.1. As outlined in Figure 1, these modules are functionally
enriched and the pathways that have a potential effect on T2D development are identified. As
presented in Table 3, among the top 10 affected T2D pathways of the T2DC dataset, 5 pathways are
commonly overrepresented in the dysregulated modules of the T2D1, T2D2, T2D3 and T2DC datasets.
These five shared pathways are
Spliceosome, Focal adhesion, SNARE interactions in vesicular transport, TGF-β signaling, and ErbB
signaling. All of these pathways are known to have a role in T2D development mechanisms. The
Spliceosome pathway has a role in the regulation of alternative splicing in insulin resistance,
through aberrantly spliced genes such as ANO1, GCK, SUR1 and VEGF (Costantini et al., 2011; Dlamini
et al., 2017; Schmid et al., 2012). The Focal adhesion pathway is complementary to the regulation of
the insulin signaling pathway: by controlling adipocyte survival, focal adhesion kinases (FAK)
regulate insulin sensitivity (Luk et al., 2017). SNARE proteins contribute to the fusion mechanism of
insulin secretory vesicles (Xiong et al., 2017). Boström et al. demonstrated that the total skeletal
muscle levels of the SNARE protein SNAP23 and the SNARE-related protein Munc18C are higher in
patients with type 2 diabetes, and that these levels are correlated with markers of insulin
resistance (Boström et al., 2010). The TGF-β signaling pathway has a role in inflammation through
cytokines such as interleukins, tumor necrosis factors, chemokines, interferons, and transforming
growth factors (TGF). Insulin enhances TGF-β receptors in fibroblasts and epithelial cells. Herder
et al. documented that high levels of the anti-inflammatory immune mediator TGF-β1 are correlated
with T2D (Herder et al., 2009). The TGF-β signaling pathway is also shown to have a crucial role in
extracellular matrix accumulation in diabetic nephropathy (Kajdaniuk et al., 2013). Akhtar et al.
showed that dysregulation of the epidermal growth factor receptor family (ErbB) triggers vascular
dysfunction stimulated by hyperglycemia in T2D (Akhtar et al., 2015). A dual role of the ErbB
protein family in diabetes-triggered cardiac dysfunction has also been reported (Akhtar and Benter,
2013).
While identifying active subnetworks of T2D, in addition to the top-down approach (as discussed
above), we also applied a bottom-up approach, as explained in Section 2.2.5.2. The overrepresented
pathways of i) the top-down approach (T2DC), ii) the bottom-up approach (T2D_D200, T2D_D500), and
iii) the pathway scoring algorithm (T2D_PC) are comparatively evaluated. Among these pathways, the
Type II diabetes mellitus, Calcium, Insulin, Wnt, Adipocytokine and Jak-STAT signaling pathways
(shown in bold in Table 4) overlap with the gold standard pathways of T2D (Yoon et al., 2018).
Additionally, the pathways that are shown in italic in Table 4 have support from the literature, as
follows. The study conducted by (Berntorp et al., 2013) reported that T2D patients express
antibodies against gonadotropin-releasing hormone (GnRH) in serum. (De Souza et al., 2016) described
T2D as a prognostic and risk factor for pancreatic cancer. (Houtz et al., 2016) reported that
paracrine neurotrophin signaling between the pancreatic vascular system and beta cells, which is
triggered by glucose, has a role in insulin secretion. (Ono et al., 2001) stated that the
phosphatidylinositol signaling system, including the PTEN (phosphatase and tensin homologue deleted
on chromosome 10) and PI3K (phosphoinositide 3-kinase) proteins, regulates glucose homeostasis and
insulin metabolism. In a study performed by (Dissanayake et al., 2018), cadherin-mediated adherens
junction proteins are shown to have a potential regulatory role in the insulin secretion mechanism
by controlling vesicle traffic in the cell. By studying different GWAS meta-analyses, Schierding and
O’Sullivan indicated the spatial connection of the CELSR2–PSRC1 locus with BCAR3, which is part of
the insulin signaling pathway (Schierding and O’Sullivan, 2015). The post-GWAS study conducted by
(Liu et al., 2017) identified T2D risk pathways. Among these, the Type II diabetes mellitus, Calcium
signaling, Pancreatic cancer, MAPK signaling, Chemokine signaling and Tight junction pathways were
also identified in our study (p < 0.05). Another study, performed by (Perry et al., 2009), analyzed
T2D GWAS data and reported that the Wnt signaling, Olfactory transduction, Galactose metabolism,
Pyruvate metabolism, Type II diabetes and TGF-β signaling pathways are associated with T2D. The Wnt
signaling and Type II diabetes pathways overlap with our findings, as shown in Table 4. The analysis
of the T2D WTCCC GWAS dataset by (Zhong et al., 2010) indicated 22 affected pathways in T2D. Among
these, the Tight junction, Phosphatidylinositol signaling system, Pancreatic cancer, Adherens
junction and Calcium signaling pathways are replicated in our study, as shown in Table 4.
Using mutual information based on shared genes, the identified protein subnetworks and the
affected pathways of each dataset were compared. While the NMISUM subnetwork scores range from
0 to 0.01, the NMISUM pathway scores range from 0 to 0.1 (as shown in Figure 5). Hence, we show that
while subnetwork-level analyses increase the degree of irregularity, pathway-level evaluation of
different T2D GWAS meta-data and different methodologies (top-down vs. bottom-up approach)
results in higher levels of conservation and yields more interpretable outcomes.
Although the Type II diabetes mellitus pathway appears only in the lower ranks for the T2D1, T2D2,
T2D3 and T2DC GWAS datasets (as shown in Table 7), incorporating the generated pathway network
information helped us to prioritize this pathway: it is found in the highest scoring pathway cluster
of each dataset. Since pathways are strongly interrelated, our proposed approach created a pathway
network and identified affected pathway subnetworks and pathway clusters using multiple association
studies conducted on T2D. Our approach is based on both the significance level of an affected
pathway and its topological relationship with its neighboring pathways.
In conclusion, the availability of T2D GWAS meta-data and new analytical methods has provided
opportunities to bridge the knowledge gap from sequence to consequence. In this study, the collective
effects of T2D-associated variants are inspected using network- and pathway-based approaches, and
the prominent genetic association signals related to T2D biological mechanisms are revealed. We
presented a comprehensive analysis of three different T2D GWAS meta-data at the protein subnetwork,
pathway, and pathway subnetwork levels. To explore whether our results recapitulate the
pathophysiology of T2D, we performed functional enrichment analysis on the dysregulated modules
of T2D. In addition to analyzing the information shared among different datasets in terms of
subnetworks, we also analyzed the shared information in terms of the identified T2D pathways. The
identified pathway subnetworks, pathway clusters and affected genes within these pathways helped
us to illuminate T2D development mechanisms. We hope that the affected genes and variants within
these identified pathway clusters will help geneticists to generate mechanistic hypotheses, which can
be targeted for large-scale empirical validation through massively parallel reporter assays at the
variant level, and through CRISPR screens in appropriate cellular models and manipulation in in vivo
models at the gene level.
5 Conflict of Interest 612
The authors declare that the research was conducted in the absence of any commercial or financial 613
relationships that could be construed as a potential conflict of interest. 614
6 Author Contributions 615
BBG and MUY conceived the ideas and designed the study. BBG, MUY, GG, MT conducted the 616
experiments and analyzed the results. BBG, MUY, GG, and MT participated in the discussion of the 617
results and writing of the article. All authors read and approved the final version of the manuscript. 618
7 Acknowledgments 619
We would like to thank the anonymous reviewers for their valuable comments and suggestions to
improve the quality of the paper. We are also very grateful to Prof. David Torrents from Barcelona
Supercomputing Center for helping us with the 70KforT2D meta-analysis data. We also would like to
thank Prof. Albert-László Barabási at University of Notre Dame and Dr. Michael Cusick at Center
for Cancer Systems Biology for providing us with the PPI dataset, and Dr. Gabriela Bindea from the
Integrative Cancer Immunology Team of Cordeliers Research Center for her help with the ClueGO tool.
8 Figures 626
Figure 1. Summary of our pathway and network oriented approach to enlighten T2D mechanisms 629
using multiple association studies. 630
Figure 2. Flowchart of pathway network generation and pathway subnetwork identification. 634
Figure 3. Numbers of genes included in the identified (A) 983 subnetworks for T2D1, (B) 903 641
subnetworks for T2D2, (C) 940 subnetworks for T2D3, and (D) 813 subnetworks for T2DC datasets. 642
Figure 4. Commonalities between (A) top 50, and (B) top 100 affected pathways identified from
T2D1, T2D2, T2D3, and T2DC datasets.
Figure 5. Comparison of the information shared among different datasets in terms of (A) identified T2D
subnetworks, and (B) identified pathways, measured via normalized mutual information (NMISUM).
Darker colors indicate higher correlation; lighter colors indicate lower correlation. NMISUM scores on
the diagonal of the heatmap are "whitened" so that the other NMISUM values are easier to see.
Figure 6. The representative networks of the highest scoring pathway clusters of (A) T2D1, (B)
T2D2, (C) T2D3, (D) T2DC datasets, including 38, 34, 35 and 35 pathways, respectively.
Figure 7. Commonalities between the highest scoring pathway clusters of T2D1, T2D2, T2D3, and
T2DC datasets.
9 Tables
Table 1. Contingency table of overlapping genes (n_ij) between subnetworks U_i and V_j, where U and
V indicate the sets of subnetworks identified using datasets X and Y, respectively.
U | V   V1    V2    …    VS    Sum
U1      n11   n12   …    n1S   a1
U2      n21   n22   …    n2S   a2
…       …     …     …    …     …
UR      nR1   nR2   …    nRS   aR
Sum     b1    b2    …    bS    N
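As an illustration of how a contingency table like Table 1 can feed a normalized mutual information score, the following minimal Python sketch computes a sum-normalized NMI, 2·I(U;V) / (H(U) + H(V)), one of the variants discussed by Vinh et al. (2010). The function name nmi_sum and the toy counts are illustrative assumptions, not the authors' implementation; the exact normalization behind NMISUM should be checked against the Methods.

```python
import math


def nmi_sum(contingency):
    """Sum-normalized NMI, 2*I(U;V) / (H(U) + H(V)), from an R x S table of counts n_ij."""
    row_sums = [sum(row) for row in contingency]          # a_i in Table 1
    col_sums = [sum(col) for col in zip(*contingency)]    # b_j in Table 1
    total = sum(row_sums)                                 # N in Table 1
    if total == 0:
        raise ValueError("contingency table has no counts")

    def entropy(margins):
        # Shannon entropy of a marginal distribution (natural log).
        return -sum((m / total) * math.log(m / total) for m in margins if m > 0)

    mutual_info = 0.0
    for i, row in enumerate(contingency):
        for j, n_ij in enumerate(row):
            if n_ij > 0:
                mutual_info += (n_ij / total) * math.log(
                    n_ij * total / (row_sums[i] * col_sums[j])
                )

    denom = entropy(row_sums) + entropy(col_sums)
    # Convention chosen here (an assumption): return 0 when both entropies vanish.
    return 0.0 if denom == 0 else 2.0 * mutual_info / denom


# Toy tables (hypothetical counts): identical groupings vs. unrelated groupings.
identical = [[5, 0], [0, 5]]
unrelated = [[5, 5], [5, 5]]
print(nmi_sum(identical), nmi_sum(unrelated))
```

Identical groupings score close to 1 and unrelated groupings score 0. Because disease subnetworks can share genes, the counts in a real table may not describe a strict partition, so the score should be read as a heuristic similarity in that setting.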
Table 2. Summary of T2D1, T2D2, T2D3, T2DC datasets, and the numbers of identified SNPs,
genes, subnetworks for each dataset.
Dataset  # of Cases  # of Controls  # of SNPs   # of SNPs (p-value < 0.05)  # of rsIDs  # of Genes  # of Subnetworks
T2D1     12,931      57,196         14,683,492  762,111                     335,212     15,806      984
T2D2     62,892      596,424        5,053,015   557,564                     557,564     15,460      904
T2D3     74,124      824,006        21,635,866  1,525,650                   639,622     17,800      941
T2DC     -           -              -           -                           -           10,298      813
Table 3. Top 10 affected T2D pathways of the T2DC dataset. Five of these pathways are commonly
overrepresented in the dysregulated modules of the T2D1, T2D2, T2D3, and T2DC datasets.
                                           p-values                                Rank                    Number of genes         Fraction of identified genes
KEGG term                                  T2DC      T2D1      T2D2      T2D3      T2DC  T2D1  T2D2  T2D3  T2D_Union  in pathways  in associated pathways
Spliceosome                                8.55E-39  3.26E-27  6.95E-30  3.10E-41  1     15    8     5     104        127          0.81
Focal adhesion                             7.03E-38  1.80E-30  3.82E-42  1.97E-54  2     10    1     1     172        200          0.86
SNARE interactions in vesicular transport  1.98E-35  1.37E-37  8.16E-33  5.41E-44  3     3     5     4     34         36           0.94
Valine leucine and isoleucine degradation  5.97E-35  3.26E-43  6.39E-20  3.34E-29  4     1     34    13    41         44           0.93
Purine metabolism                          7.60E-34  5.35E-43  4.92E-12  1.29E-45  5     2     83    3     99         166          0.59
Dopaminergic synapse                       3.26E-33  1.04E-20  9.48E-32  6.80E-34  6     37    7     9     119        130          0.91
TGF-beta signaling pathway                 5.03E-29  8.70E-32  5.61E-34  3.23E-28  7     6     3     15    75         84           0.89
ErbB signaling pathway                     1.59E-28  4.64E-31  1.00E-29  1.46E-37  8     8     9     7     85         87           0.97
Chemokine signaling pathway                5.23E-28  1.47E-21  1.01E-23  2.97E-19  9     33    20    39    163        189          0.86
Glutamatergic synapse                      3.47E-27  1.97E-20  1.94E-29  3.03E-28  10    38    10    14    101        126          0.80
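The last column of Table 3 appears to be the ratio of the two gene counts (T2D_Union divided by the number of genes in the pathway), truncated rather than rounded to two decimals. The following Python sketch checks that reading against the ten rows of the table; the function name and the truncation rule are assumptions inferred from the reported values, not something the text states.

```python
# (KEGG term, genes in T2D_Union, genes in pathway, reported value) copied from Table 3.
TABLE3_ROWS = [
    ("Spliceosome", 104, 127, 0.81),
    ("Focal adhesion", 172, 200, 0.86),
    ("SNARE interactions in vesicular transport", 34, 36, 0.94),
    ("Valine leucine and isoleucine degradation", 41, 44, 0.93),
    ("Purine metabolism", 99, 166, 0.59),
    ("Dopaminergic synapse", 119, 130, 0.91),
    ("TGF-beta signaling pathway", 75, 84, 0.89),
    ("ErbB signaling pathway", 85, 87, 0.97),
    ("Chemokine signaling pathway", 163, 189, 0.86),
    ("Glutamatergic synapse", 101, 126, 0.80),
]


def truncated_fraction(identified, pathway_size):
    """Fraction identified / pathway_size, truncated (not rounded) to two decimals."""
    # Integer arithmetic avoids floating-point surprises before the final division.
    return (identified * 100 // pathway_size) / 100


mismatches = [
    name
    for name, identified, size, reported in TABLE3_ROWS
    if truncated_fraction(identified, size) != reported
]
print("rows that disagree with the truncation reading:", mismatches)
```

Under this reading, Purine metabolism (99 of 166 genes, about 0.596) is reported as 0.59, which is what distinguishes truncation from rounding.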
Table 4. Comparison of the overrepresented pathways of T2D dysregulated modules (T2DC),
expanded modules of T2D seed genes (T2D_D500), and the affected pathways identified using
the pathway scoring algorithm (T2DP).
                                       p-value                       Rank
KEGG term                              T2DP      T2DC      T2D_D500  T2DP  T2DC  T2D_D500
Pathways in cancer                     1.42E-15  2.52E-20  1.86E-33  2     24    79
Focal adhesion                         4.39E-14  7.03E-38  1.48E-33  3     2     80
Type II diabetes mellitus              4.72E-14  1.84E-08  1.81E-10  4     127   43
Prostate cancer                        4.28E-10  1.19E-19  2.94E-29  7     27    73
Calcium signaling pathway              9.66E-10  3.71E-13  2.18E-08  9     77    33
MAPK signaling pathway                 3.48E-08  8.59E-24  5.25E-27  10    14    71
Small cell lung cancer                 7.44E-08  5.10E-10  1.79E-07  11    110   26
Chronic myeloid leukemia               7.78E-08  5.65E-19  1.09E-31  12    33    77
Insulin signaling pathway              2.12E-07  2.67E-14  2.21E-30  13    63    76
Glioma                                 3.01E-07  7.22E-18  6.81E-32  14    36    78
Non-small cell lung cancer             7.16E-07  6.51E-12  3.38E-26  15    87    70
GnRH signaling pathway                 1.93E-06  1.81E-19  8.73E-20  17    29    62
Pancreatic cancer                      2.41E-06  4.22E-15  4.55E-21  18    56    65
Vascular smooth muscle contraction     2.80E-06  1.21E-19  1.41E-05  19    28    19
Leukocyte transendothelial migration   6.45E-06  2.82E-13  2.35E-16  20    76    53
Chemokine signaling pathway            8.94E-06  5.24E-28  1.70E-29  21    9     74
Gap junction                           3.33E-05  1.17E-20  5.05E-08  23    23    31
Tight junction                         9.78E-05  6.68E-14  1.35E-09  25    67    39
Wnt signaling pathway                  1.16E-04  5.63E-22  3.97E-06  26    21    22
Adipocytokine signaling pathway        1.35E-04  5.40E-11  1.35E-05  27    95    20
Acute myeloid leukemia                 1.55E-04  1.08E-13  4.62E-21  29    72    63
Adherens junction                      1.61E-04  2.81E-24  7.02E-24  30    12    67
ErbB signaling pathway                 2.81E-04  1.60E-28  2.74E-54  32    8     83
Phosphatidylinositol signaling system  3.49E-04  1.91E-23  1.05E-02  33    16    2
Neurotrophin signaling pathway         3.91E-04  3.03E-22  2.08E-58  34    20    84
Melanogenesis                          4.38E-04  1.81E-19  1.57E-07  36    30    27
Jak-STAT signaling pathway             4.57E-04  7.54E-14  6.66E-19  37    68    60
Table 6. Identified pathway clusters that are affected in T2D for each dataset.
               T2D1                          T2D2                          T2D3                          T2DC
# of Clusters  7                             9                             7                             8
               # of Nodes  Score of Cluster  # of Nodes  Score of Cluster  # of Nodes  Score of Cluster  # of Nodes  Score of Cluster
               38          32.919            34          30.182            35          31.412            35          31.118
               14          8.462             19          13.111            21          14.3              16          8.8
               9           4.75              15          5.286             11          5.2               16          8.533
               4           3.333             5           5                 5           4.5               11          5
               3           3                 5           4.5               4           4                 5           5
               3           3                 4           4                 4           3.333             8           4.286
               3           3                 3           3                 3           3                 4           3.333
*Cut Off Value: 0.2, Haircut: TRUE, Fluff: FALSE, K-Core: 2
Table 5. Node-edge relationships in the generated pathway networks and affected pathway
subnetworks.
Sizes of the generated pathway networks for different threshold values
Threshold Value ( ≥ )  # of Nodes  # of Edges
0                      288         82944
1.21E-5                288         10904
0.05                   288         6806
0.1                    288         4617
0.15                   288         2976
0.2                    288         1866
0.25                   288         1321
Sizes of the generated highest scoring pathway subnetworks for different T2D datasets
Dataset                # of Nodes  # of Edges
T2D1                   119         1356
T2D2                   134         1383
T2D3                   135         1441
T2DC                   158         1709
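The threshold sweep in the first half of Table 5 can be sketched as follows, assuming the pathway network is derived from a 288 × 288 matrix of pairwise pathway similarity scores and that an edge is kept whenever its score is at least the threshold. The scoring function is not given in this section, so the matrix below is filled with random placeholder values; the function names and the seed are hypothetical, and only the thresholds and the node count come from Table 5.

```python
import random


def count_edges(scores, threshold):
    """Count matrix entries with score >= threshold (ordered pairs, self-pairs included)."""
    return sum(1 for row in scores for score in row if score >= threshold)


def make_placeholder_scores(n, seed=0):
    """Hypothetical symmetric n x n score matrix in [0, 1); stands in for real pathway scores."""
    rng = random.Random(seed)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = rng.random()
            scores[i][j] = value
            scores[j][i] = value
    return scores


N_PATHWAYS = 288
THRESHOLDS = [0, 1.21e-5, 0.05, 0.1, 0.15, 0.2, 0.25]  # threshold values listed in Table 5

scores = make_placeholder_scores(N_PATHWAYS)
edge_counts = [count_edges(scores, t) for t in THRESHOLDS]
for t, n_edges in zip(THRESHOLDS, edge_counts):
    print(f"threshold >= {t}: {N_PATHWAYS} nodes, {n_edges} edges")
```

With a threshold of 0 every entry passes, giving 288 × 288 = 82944, which matches the first row of Table 5 and suggests that the edge count there includes every ordered pair. The counts for the other thresholds depend on the real scores and will not match the placeholder output.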
Table 7. Pathways common to the highest scoring pathway clusters identified for the different T2D
GWAS meta-analysis datasets.
                                           p-values                                Rank
Pathway Name                               T2D1      T2D2      T2D3      T2DC      T2D1  T2D2  T2D3  T2DC
Renal cell carcinoma                       7.12E-15  1.95E-15  7.23E-13  8.14E-15  68    55    90    57
Colorectal cancer                          1.52E-12  7.53E-10  1.82E-14  3.51E-17  97    115   77    41
Hepatitis C                                2.99E-14  1.29E-14  1.35E-18  1.59E-16  77    62    47    43
VEGF signaling pathway                     1.05E-11  1.20E-10  4.18E-12  4.15E-13  104   99    99    78
Toxoplasmosis                              2.38E-12  2.24E-12  1.30E-18  4.39E-13  99    78    48    80
Chagas disease (American trypanosomiasis)  2.10E-18  1.62E-12  3.85E-19  3.57E-15  48    76    42    54
Type II diabetes mellitus                  1.32E-12  2.68E-09  6.18E-19  1.84E-08  96    124   44    127
Chemokine signaling pathway                1.47E-21  1.01E-23  2.97E-19  5.23E-28  33    20    39    9
Progesterone-mediated oocyte maturation    2.67E-16  3.57E-12  4.95E-16  7.25E-18  62    81    68    37
Insulin signaling pathway                  2.16E-16  1.67E-16  2.96E-18  2.67E-14  60    48    49    63
Toll-like receptor signaling pathway       1.70E-29  2.63E-11  3.20E-13  1.27E-14  13    91    85    62
Cholinergic synapse                        6.32E-35  1.17E-25  1.61E-31  4.37E-27  4     16    11    11
Neurotrophin signaling pathway             4.20E-22  3.68E-23  3.03E-31  3.02E-22  30    22    12    20
Fc gamma R-mediated phagocytosis           3.57E-19  2.88E-18  1.01E-19  1.75E-16  44    37    35    47
Osteoclast differentiation                 5.24E-22  1.28E-14  3.60E-19  3.16E-17  31    61    41    40
T cell receptor signaling pathway          3.32E-19  3.69E-21  4.49E-20  2.14E-18  43    32    33    34
Fc epsilon RI signaling pathway            3.75E-18  9.42E-16  5.92E-18  2.33E-23  52    53    52    17
Natural killer cell mediated cytotoxicity  2.61E-13  1.53E-13  2.12E-09  5.47E-12  90    69    131   86
B cell receptor signaling pathway          3.28E-19  3.39E-17  2.41E-14  1.96E-19  42    43    78    31
mTOR signaling pathway                     1.28E-12  4.34E-10  1.72E-08  1.60E-10  95    108   141   102
Non-small cell lung cancer                 7.60E-16  3.04E-11  1.86E-13  6.51E-12  65    92    82    87
ErbB signaling pathway                     4.64E-31  1.09E-29  1.46E-37  1.59E-28  8     9     7     8
Acute myeloid leukemia                     5.42E-14  1.40E-10  1.03E-11  1.08E-13  80    102   105   72
Chronic myeloid leukemia                   7.27E-20  8.58E-17  2.48E-16  5.65E-19  41    45    65    33
Melanoma                                   4.79E-14  8.51E-17  6.46E-15  1.05E-14  78    44    74    59
Prostate cancer                            1.13E-17  1.82E-13  1.12E-12  1.18E-19  53    70    93    27
Glioma                                     3.33E-21  1.67E-16  7.34E-19  7.21E-18  35    47    45    36
Endometrial cancer                         3.47E-16  1.67E-14  4.80E-13  1.62E-16  63    63    88    45
Pancreatic cancer                          6.15E-13  4.15E-14  8.21E-15  4.21E-15  94    65    75    56
REFERENCES
Adeyemo, A. A., Tekola-Ayele, F., Doumatey, A. P., Bentley, A. R., Chen, G., Huang, H., et al.
(2015). Evaluation of Genome Wide Association Study Associated Type 2 Diabetes Susceptibility
Loci in Sub Saharan Africans. Front. Genet. 6, 335. doi:10.3389/fgene.2015.00335.
Akhtar, S., and Benter, I. F. (2013). The role of epidermal growth factor receptor in diabetes-induced
cardiac dysfunction. BioImpacts. doi:10.5681/bi.2013.008.
Akhtar, S., Chandrasekhar, B., Attur, S., Dhaunsi, G. S., Yousif, M. H. M., and Benter, I. F. (2015).
Transactivation of ErbB Family of Receptor Tyrosine Kinases Is Inhibited by Angiotensin-(1-7) via
Its Mas Receptor. PLoS One 10, e0141657. doi:10.1371/journal.pone.0141657.
Bader, G. D., and Hogue, C. W. V. (2003). An automated method for finding molecular complexes in
large protein interaction networks. BMC Bioinformatics 4, 2. doi:10.1186/1471-2105-4-2.
Bakir-Gungor, B., Baykan, B., İseri, S. U., Tuncer, F. N., and Sezerman, O. U. (2013). Identifying
SNP targeted pathways in partial epilepsies with genome-wide association study data. Epilepsy Res.
105, 92–102. doi:10.1016/j.eplepsyres.2013.02.008.
Bakir-Gungor, B., Egemen, E., and Sezerman, O. U. (2014). PANOGA: A web server for
identification of SNP-targeted pathways from genome-wide association study data. Bioinformatics
30, 1287–1289. doi:10.1093/bioinformatics/btt743.
Bakir-Gungor, B., Remmers, E. F., Meguro, A., Mizuki, N., Kastner, D. L., Gul, A., et al. (2015).
Identification of possible pathogenic pathways in Behçet’s disease using genome-wide association
study data from two different populations. Eur. J. Hum. Genet. 23, 678–687.
doi:10.1038/ejhg.2014.158.
Bakir-Gungor, B., and Sezerman, O. U. (2011). A New Methodology to Associate SNPs with Human
Diseases According to Their Pathway Related Context. PLoS One 6, e26277.
doi:10.1371/journal.pone.0026277.
Bakir-Gungor, B., and Sezerman, O. U. (2013). The Identification of Pathway Markers in Intracranial
Aneurysm Using Genome-Wide Association Data from Two Different Populations. PLoS One 8,
e57022. doi:10.1371/journal.pone.0057022.
Baranzini, S. E., Galwey, N. W., Wang, J., Khankhanian, P., Lindberg, R., Pelletier, D., et al. (2009).
Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum.
Mol. Genet. doi:10.1093/hmg/ddp120.
Berntorp, K., Frid, A., Alm, R., Fredrikson, G., Sjöberg, K., and Ohlsson, B. (2013). Antibodies
against gonadotropin-releasing hormone (GnRH) in patients with diabetes mellitus is associated with
lower body weight and autonomic neuropathy. BMC Res. Notes 6, 329.
doi:10.1186/1756-0500-6-329.
Bindea, G., Mlecnik, B., Hackl, H., Charoentong, P., Tosolini, M., Kirilovsky, A., et al. (2009).
ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway
annotation networks. Bioinformatics 25, 1091–1093. doi:10.1093/bioinformatics/btp101.
Bonàs-Guarch, S., Guindo-Martínez, M., Miguel-Escalada, I., Grarup, N., Sebastian, D.,
Rodriguez-Fos, E., et al. (2018). Re-analysis of public genetic data reveals a rare X-chromosomal variant
associated with type 2 diabetes. Nat. Commun. 9, 321. doi:10.1038/s41467-017-02380-9.
Bonnefond, A., and Froguel, P. (2015). Rare and Common Genetic Events in Type 2 Diabetes: What
Should Biologists Know? Cell Metab. 21, 357–368. doi:10.1016/j.cmet.2014.12.020.
Boström, P., Andersson, L., Vind, B., Håversen, L., Rutberg, M., Wickström, Y., et al. (2010). The
SNARE protein SNAP23 and the SNARE-interacting protein Munc18c in human skeletal muscle are
implicated in insulin resistance/type 2 diabetes. Diabetes. doi:10.2337/db09-1503.
Carter, H., Douville, C., Stenson, P. D., Cooper, D. N., and Karchin, R. (2013). Identifying
Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genomics 14, S3.
doi:10.1186/1471-2164-14-S3-S3.
Chang, X., Lima, L. de A., Liu, Y., Li, J., Li, Q., Sleiman, P. M. A., et al. (2018). Common and Rare
Genetic Risk Factors Converge in Protein Interaction Networks Underlying Schizophrenia. Front.
Genet. 9, 434. Available at: https://www.frontiersin.org/article/10.3389/fgene.2018.00434.
Claussnitzer, M., Cho, J. H., Collins, R., Cox, N. J., Dermitzakis, E. T., Hurles, M. E., et al. (2020).
A brief history of human disease genetics. Nature 577, 179–189. doi:10.1038/s41586-019-1879-7.
Costantini, S., Prandini, P., Corradi, M., Pasquali, A., Contreas, G., Pignatti, P. F., et al. (2011). A
novel synonymous substitution in the GCK gene causes aberrant splicing in an Italian patient with
GCK-MODY phenotype. Diabetes Res. Clin. Pract. 92, e23–e26. doi:10.1016/j.diabres.2011.01.014.
De Souza, A., Irfan, K., Masud, F., and Saif, M. W. (2016). Diabetes Type 2 and Pancreatic Cancer:
A History Unfolding. JOP 17, 144–148. Available at:
http://www.ncbi.nlm.nih.gov/pubmed/29568247.
DeFronzo, R. A., Ferrannini, E., Groop, L., Henry, R. R., Herman, W. H., Holst, J. J., et al. (2015).
Type 2 diabetes mellitus. Nat. Rev. Dis. Prim. 1, 15019. doi:10.1038/nrdp.2015.19.
Dissanayake, W. C., Sorrenson, B., and Shepherd, P. R. (2018). The role of adherens junction
proteins in the regulation of insulin secretion. Biosci. Rep. 38. doi:10.1042/BSR20170989.
Dlamini, Z., Mokoena, F., and Hull, R. (2017). Abnormalities in alternative splicing in diabetes:
therapeutic targets. J. Mol. Endocrinol. 59, R93–R107. doi:10.1530/JME-17-0049.
Douville, C., Masica, D. L., Stenson, P. D., Cooper, D. N., Gygax, D. M., Kim, R., et al. (2016).
Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool
(VEST-Indel). Hum. Mutat. 37, 28–35. doi:10.1002/humu.22911.
Erdmann, J., and Zeller, T. eds. (2019). From GWAS Hits to Treatment Targets. Frontiers Media SA.
doi:10.3389/978-2-88945-982-7.
Fisher, R. A. (1934). Statistical methods for research workers.
Freeman, J. S. (2013). Review of Insulin-Dependent and Insulin-Independent Agents for Treating
Patients With Type 2 Diabetes Mellitus and Potential Role for Sodium-Glucose Co-Transporter 2
Inhibitors. Postgrad. Med. 125, 214–226. doi:10.3810/pgm.2013.05.2672.
Gallagher, M. D., and Chen-Plotkin, A. S. (2018). The post-GWAS era: from association to function.
Am. J. Hum. Genet. 102, 717–730.
Ghiassian, S. D., Menche, J., and Barabási, A. L. (2015). A DIseAse MOdule Detection (DIAMOnD)
Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the
Human Interactome. PLoS Comput. Biol. 11, 1–21. doi:10.1371/journal.pcbi.1004120.
Herder, C., Brunner, E. J., Rathmann, W., Strassburger, K., Tabak, A. G., Schloot, N. C., et al.
(2009). Elevated Levels of the Anti-Inflammatory Interleukin-1 Receptor Antagonist Precede the
Onset of Type 2 Diabetes: The Whitehall II Study. Diabetes Care 32, 421–423.
doi:10.2337/dc08-1161.
Houtz, J., Borden, P., Ceasrine, A., Minichiello, L., and Kuruvilla, R. (2016). Neurotrophin Signaling
Is Required for Glucose-Induced Insulin Secretion. Dev. Cell 39, 329–345.
doi:10.1016/j.devcel.2016.10.003.
Ideker, T., Ozier, O., Schwikowski, B., and Siegel, A. F. (2002). Discovering regulatory and
signalling circuits in molecular interaction networks. Bioinformatics 18, 233–240.
doi:10.1093/bioinformatics/18.suppl_1.S233.
International Diabetes Federation (2017). IDF Diabetes Atlas-8th Edition. Available at:
https://diabetesatlas.org/.
Kajdaniuk, D., Marek, B., Borgiel-Marek, H., and Kos-Kudła, B. (2013). Transforming growth factor
beta1 (TGFbeta1) in physiology and pathology. Endokrynol. Pol. 64, 384–396.
doi:10.5603/EP.2013.0022.
Kao, P. Y. P., Leung, K. H., Chan, L. W. C., Yip, S. P., and Yap, M. K. H. (2017). Pathway analysis
of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions.
Biochim. Biophys. Acta - Gen. Subj. 1861, 335–353. doi:10.1016/j.bbagen.2016.11.030.
Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z., and Bergmann, S. (2016). Fast and Rigorous
Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLOS Comput.
Biol. 12, e1004714. Available at: https://doi.org/10.1371/journal.pcbi.1004714.
Lee, I., Blom, U. M., Wang, P. I., Shim, J. E., and Marcotte, E. M. (2011). Prioritizing candidate
disease genes by network-based boosting of genome-wide association data. Genome Res. 21,
1109–1121. doi:10.1101/gr.118992.110.
Li, M.-X., Gui, H.-S., Kwan, J. S. H., and Sham, P. C. (2011). GATES: a rapid and powerful
gene-based association test using extended Simes procedure. Am. J. Hum. Genet. 88, 283–293.
doi:10.1016/j.ajhg.2011.01.019.
Lin, J.-R., Jaroslawicz, D., Cai, Y., Zhang, Q., Wang, Z., and Zhang, Z. D. (2017). PGA:
post-GWAS analysis for disease gene identification. Bioinformatics 34, 1786–1788.
Liu, J. Z., McRae, A. F., Nyholt, D. R., Medland, S. E., Wray, N. R., Brown, K. M., et al. (2010). A
versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet. 87, 139–145.
doi:10.1016/j.ajhg.2010.06.009.
Liu, Y., Zhao, J., Jiang, T., Yu, M., Jiang, G., and Hu, Y. (2017). A pathway analysis of
genome-wide association study highlights novel type 2 diabetes risk pathways. Sci. Rep. 7, 12546.
doi:10.1038/s41598-017-12873-8.
Luk, C. T., Shi, S. Y., Cai, E. P., Sivasubramaniyam, T., Krishnamurthy, M., Brunt, J. J., et al.
(2017). FAK signalling controls insulin sensitivity through regulation of adipocyte survival. Nat.
Commun. 8, 14360. doi:10.1038/ncomms14360.
Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., et al. (2018a).
Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and
islet-specific epigenome maps. Nat. Genet. 50, 1505–1513. doi:10.1038/s41588-018-0241-6.
Mahajan, A., Wessel, J., Willems, S. M., Zhao, W., Robertson, N. R., Chu, A. Y., et al. (2018b).
Refining the accuracy of validated target identification through coding variant fine-mapping in type 2
diabetes. Nat. Genet. doi:10.1038/s41588-018-0084-1.
Mercader, J. M., and Florez, J. C. (2017). The Genetic Basis of Type 2 Diabetes in Hispanics and
Latin Americans: Challenges and Opportunities. Front. Public Heal. 5.
doi:10.3389/fpubh.2017.00329.
Meyre, D. (2017). Give GWAS a Chance. Diabetes 66, 2741–2742. doi:10.2337/dbi17-0026.
Nguyen, T.-M., Shafi, A., Nguyen, T., and Draghici, S. (2019). Identifying significantly impacted
pathways: a comprehensive review and assessment. Genome Biol. 20, 203.
doi:10.1186/s13059-019-1790-4.
Ono, H., Katagiri, H., Funaki, M., Anai, M., Inukai, K., Fukushima, Y., et al. (2001). Regulation of
Phosphoinositide Metabolism, Akt Phosphorylation, and Glucose Transport by PTEN (Phosphatase
and Tensin Homolog Deleted on Chromosome 10) in 3T3-L1 Adipocytes. Mol. Endocrinol. 15,
1411–1422. doi:10.1210/mend.15.8.0684.
Perry, J. R. B., McCarthy, M. I., Hattersley, A. T., Zeggini, E., Weedon, M. N., and Frayling, T. M.
(2009). Interrogating Type 2 Diabetes Genome-Wide Association Data Using a Biological
Pathway-Based Approach. Diabetes 58, 1463–1467. doi:10.2337/db08-1378.
Pers, T. H., Karjalainen, J. M., Chan, Y., Westra, H.-J., Wood, A. R., Yang, J., et al. (2015).
Biological interpretation of genome-wide association studies using predicted gene functions. Nat.
Commun. 6, 5890. doi:10.1038/ncomms6890.
Saccone, S. F., Saccone, N. L., Swan, G. E., Madden, P. A. F., Goate, A. M., Rice, J. P., et al. (2008).
Systematic biological prioritization after a genome-wide association study: an application to nicotine
dependence. Bioinformatics 24, 1805–1811. doi:10.1093/bioinformatics/btn315.
Schierding, W., and O’Sullivan, J. M. (2015). Connecting SNPs in Diabetes: A Spatial Analysis of
Meta-GWAS Loci. Front. Endocrinol. (Lausanne). 6, 102. doi:10.3389/fendo.2015.00102.
Schmid, D., Stolzlechner, M., Sorgner, A., Bentele, C., Assinger, A., Chiba, P., et al. (2012). An
abundant, truncated human sulfonylurea receptor 1 splice variant has prodiabetic properties and
impairs sulfonylurea action. Cell. Mol. Life Sci. 69, 129–148. doi:10.1007/s00018-011-0739-x.
Scott, R. A., Scott, L. J., Mägi, R., Marullo, L., Gaulton, K. J., Kaakinen, M., et al. (2017). An
expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes 66, 2888–2902.
Segrè, A. V., DIAGRAM Consortium, MAGIC investigators, Groop, L., Mootha, V. K., Daly, M. J., et al. (2010).
Common inherited variation in mitochondrial genes is not enriched for associations with type 2
diabetes or related glycemic traits. PLoS Genet. 6, e1001058. doi:10.1371/journal.pgen.1001058.
Thrash, A., Tang, J. D., DeOrnellis, M., Peterson, D. G., and Warburton, M. L. (2019). Pathway
Association Studies Tool. bioRxiv, 691964. doi:10.1101/691964.
Tripathi, S., Moutari, S., Dehmer, M., and Emmert-Streib, F. (2016). Comparison of module
detection algorithms in protein networks and investigation of the biological meaning of predicted
modules. BMC Bioinformatics 17, 129. doi:10.1186/s12859-016-0979-8.
Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings
comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11,
2837–2854.
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., et al. (2017). 10
years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22.
Wang, L., Jia, P., Wolfinger, R. D., Chen, X., Grayson, B. L., Aune, T. M., et al. (2011). An efficient
hierarchical generalized linear mixed model for pathway analysis of genome-wide association
studies. Bioinformatics 27, 686–692. doi:10.1093/bioinformatics/btq728.
White, M. J., Yaspan, B. L., Veatch, O. J., Goddard, P., Risse-Adams, O. S., and Contreras, M. G.
(2019). Strategies for Pathway Analysis Using GWAS and WGS Data. Curr. Protoc. Hum. Genet.
doi:10.1002/cphg.79.
Wood, A. R., Esko, T., Yang, J., Vedantam, S., Pers, T. H., Gustafsson, S., et al. (2014). Defining the
role of common variation in the genomic and biological architecture of adult human height. Nat.
Genet. 46, 1173–1186. doi:10.1038/ng.3097.
Xiong, Q.-Y., Yu, C., Zhang, Y., Ling, L., Wang, L., and Gao, J.-L. (2017). Key proteins involved in
insulin vesicle exocytosis and secretion. Biomed. Reports 6, 134–139. doi:10.3892/br.2017.839.
Xuan Vinh, N., Epps, J., and Bailey, J. (2010). Information Theoretic Measures for Clusterings
Comparison: Variants, Properties, Normalization and Correction for Chance.
Xue, A., Wu, Y., Zhu, Z., Zhang, F., Kemper, K. E., Zheng, Z., et al. (2018). Genome-wide
association analyses identify 143 risk variants and putative regulatory mechanisms for type 2
diabetes. Nat. Commun. 9, 2941. doi:10.1038/s41467-018-04951-w.
Yang, H., and Wang, K. (2015). Genomic variant annotation and prioritization with ANNOVAR and
wANNOVAR. Nat. Protoc. 10, 1556–1566. Available at: https://doi.org/10.1038/nprot.2015.105.
Yoon, S., Nguyen, H. C. T., Yoo, Y. J., Kim, J., Baik, B., Kim, S., et al. (2018). Efficient pathway
enrichment and network analysis of GWAS summary data using GSA-SNP2. Nucleic Acids Res. 46,
e60–e60. doi:10.1093/nar/gky175.
Zeng, Z., and Bromberg, Y. (2019). Predicting Functional Effects of Synonymous Variants: A
Systematic Review and Perspectives. Front. Genet. 10, 914. Available at:
https://www.frontiersin.org/article/10.3389/fgene.2019.00914.
Zhao, W., Rasheed, A., Tikkanen, E., Lee, J. J., Butterworth, A. S., Howson, J. M. M., et al. (2017).
Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with
coronary heart disease. Nat. Genet. doi:10.1038/ng.3943.
Zheng, Y., Ley, S. H., and Hu, F. B. (2018). Global aetiology and epidemiology of type 2 diabetes
mellitus and its complications. Nat. Rev. Endocrinol. 14, 88–98. doi:10.1038/nrendo.2017.151.
Zhong, H., Yang, X., Kaplan, L. M., Molony, C., and Schadt, E. E. (2010). Integrating Pathway
Analysis and Genetics of Gene Expression for Genome-wide Association Studies. Am. J. Hum.
Genet. doi:10.1016/j.ajhg.2010.02.020.