systems level analysis of systemic sclerosis shows a ......systemic sclerosis (ssc) is a rare...
TRANSCRIPT
Systems Level Analysis of Systemic Sclerosis Shows aNetwork of Immune and Profibrotic Pathways Connectedwith Genetic PolymorphismsJ. Matthew Mahoney1¤*, Jaclyn Taroni1, Viktor Martyanov1, Tammara A. Wood1, Casey S. Greene1,
Patricia A. Pioli2, Monique E. Hinchcliff3, Michael L. Whitfield1*
1 Department of Genetics, Geisel School of Medicine at Dartmouth, Hannover, New Hampshire, United States of America, 2 Department of Obstetrics and Gynecology,
Geisel School of Medicine at Dartmouth, Hannover, New Hampshire, United States of America, 3 Department of Medicine, Northwestern University Feinberg School of
Medicine, Chicago, Illinois, United States of America
Abstract
Systemic sclerosis (SSc) is a rare systemic autoimmune disease characterized by skin and organ fibrosis. The pathogenesis ofSSc and its progression are poorly understood. The SSc intrinsic gene expression subsets (inflammatory, fibroproliferative,normal-like, and limited) are observed in multiple clinical cohorts of patients with SSc. Analysis of longitudinal skin biopsiessuggests that a patient’s subset assignment is stable over 6–12 months. Genetically, SSc is multi-factorial with many geneticrisk loci for SSc generally and for specific clinical manifestations. Here we identify the genes consistently associated with theintrinsic subsets across three independent cohorts, show the relationship between these genes using a gene-geneinteraction network, and place the genetic risk loci in the context of the intrinsic subsets. To identify gene expressionmodules common to three independent datasets from three different clinical centers, we developed a consensus clusteringprocedure based on mutual information of partitions, an information theory concept, and performed a meta-analysis ofthese genome-wide gene expression datasets. We created a gene-gene interaction network of the conserved molecularfeatures across the intrinsic subsets and analyzed their connections with SSc-associated genetic polymorphisms. Thenetwork is composed of distinct, but interconnected, components related to interferon activation, M2 macrophages,adaptive immunity, extracellular matrix remodeling, and cell proliferation. The network shows extensive connectionsbetween the inflammatory- and fibroproliferative-specific genes. The network also shows connections between thesesubset-specific genes and 30 SSc-associated polymorphic genes including STAT4, BLK, IRF7, NOTCH4, PLAUR, CSK, IRAK1, andseveral human leukocyte antigen (HLA) genes. Our analyses suggest that the gene expression changes underlying the SScsubsets may be long-lived, but mechanistically interconnected and related to a patients underlying genetic risk.
Citation: Mahoney JM, Taroni J, Martyanov V, Wood TA, Greene CS, et al. (2015) Systems Level Analysis of Systemic Sclerosis Shows a Network of Immune andProfibrotic Pathways Connected with Genetic Polymorphisms. PLoS Comput Biol 11(1): e1004005. doi:10.1371/journal.pcbi.1004005
Editor: Shervin Assassi, University of Texas Health Science Center at Houston, United States of America
Received February 13, 2014; Accepted October 27, 2014; Published January 8, 2015
Copyright: � 2015 Mahoney et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the Scleroderma Research Foundation (MLW, MEH; http://www.srfcure.org), P50AR060780 (MLW), P30AR061271 (MLW),and the Scleroderma Foundation Linda Lee Wells Research Grant (PAP; http://www.scleroderma.org/). JMM was supported by Award Number R25 CA134286-01from the National Cancer Institute (NCI). JT was supported by Award Number T32GM008704 from the National Institute of General Medical Sciences (NIGMS). Thecontent is solely the responsibility of the authors and does not necessarily represent the official views of the NIGMS, NIAMS, NCI or the NIH. The funders had norole in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: I have read the journal’s policy and have the following conflicts: MLW and MEH have filed patents for gene expression biomarkers in SSc.MLW is the Scientific Founder of Celdara Medical LLC.
* Email: [email protected] (JMM); [email protected] (MLW)
¤ Current address: Department of Neurological Sciences, College of Medicine, University of Vermont, Burlington, Vermont, United States of America
Introduction
Genome-scale gene expression profiling of systemic sclerosis
(SSc) skin has identified distinct intrinsic molecular subsets
(inflammatory, fibroproliferative, and normal-like) within the
subset of patients diagnosed with diffuse cutaneous SSc (dSSc)
based upon the extent of skin involvement. These subsets are
identified by an intrinsic gene analysis [1] that shifts the focus to
differences between patients rather than patient biopsies. The
inflammatory subset is characterized by increased expression of
genes associated with inflammation and extracellular matrix
(ECM) deposition, while the fibroproliferative subset is character-
ized by increased expression of genes associated with cell
proliferation [1,2]. Biopsies from patients in the normal-like subset
show gene expression most similar to healthy control skin biopsies.
The presence of three distinct molecular SSc subsets within
patients diagnosed with dSSc underscores the molecular hetero-
geneity of SSc. However, it is unclear whether the subsets
represent distinct diseases with different etiologies or whether they
represent disease progression. To address this question, we
identified the conserved molecular pathways characteristic of each
subset that are reproducible between different datasets from
multiple patient cohorts, and examined the connectivity of these
genes and SSc-associated polymorphic genes in a predicted
functional network.
SSc is a rare disease without validated disease progression
markers and no known cure. SSc affects between 49,000–276,000
Americans; one in three patients dies within 10 years of diagnosis
PLOS Computational Biology | www.ploscompbiol.org 1 January 2015 | Volume 11 | Issue 1 | e1004005
[3]. The rarity of SSc makes this disease an excellent test case for a
genomic meta-analysis to understand disease mechanism. Using
this approach, we have begun to understand the molecular and
clinical complexity of SSc. Our findings may assist in generating
patient-specific therapies [4] and delivering real-time quantitative
feedback regarding therapeutic response during clinical trials
[4,5].
High-throughput gene expression data has demonstrated that
genes that function together are almost always co-expressed and
thus highly correlated with each other [6,7]. Indeed, the expressed
genes in a biopsy form a co-expression network, where genes serve
as nodes and correlations as links between genes. (The language of
networks is technical and beyond the scope of this paper. The
supporting information (S1 Text) contains a glossary of keywords
that are used in a technical sense in the main body of the paper.)
This observation forms part of the basis for ‘‘network medicine’’
[8,9]. The co-expression network contains groups of highly
correlated genes that represent the biological processes at work
in the tissue [6,10]. These groups of genes can be found using a
variety of procedures for co-expression clustering (e.g. [6,10]). All
of these procedures group highly correlated genes together, i.e.
they partition the genome into non-overlapping groups of genes
with similar expression patterns sometimes called modules. The
output of co-expression clustering is a data-driven partition of the
expressed genes in the genome. As a result of technical differences
in data acquisition protocols as well as true biological variation
(patient heterogeneity, treatment), the exact modules identified in
one SSc dataset differ somewhat from those identified in another
dataset, although the same pathways, biological processes, and
many core genes are found in each dataset. We developed a tool to
compare gene co-expression modules derived from multiple
disparate datasets to identify the modules that are reproducibly
expressed in each dataset. We then derived gene sets from the
overlaps between conserved modules across datasets. These
‘‘consensus clusters’’ are the gene clusters that are conserved
across all datasets.
A naive approach to solving this problem is to simply intersect
the ‘‘intrinsic gene lists’’ derived for each of the cohorts [1,4,11].
The methodological issue with this approach is that these lists are
derived under a large multiple hypothesis testing burden, and
although the same biological processes and some genes are found
reproducibly, the gene sets do not exactly recapitulate across data
sets [11]. This simple intersection approach would be much too
conservative and consequently exclude many biologically impor-
tant genes. Our alternative approach is to consider ‘‘modules
first’’. In brief, our goal is to identify the modules that are
conserved across datasets first and then extract the consensus
genes as those that are consistently assigned to those modules. This
transfers the multiple testing burden onto the much smaller list of
modules and allows genes to be included in the consensus even if
they do not achieve extremely high statistical significance in all
datasets simultaneously.
We developed this idea into a novel data mining procedure
called Mutual Information Consensus Clustering (MICC) to
identify conserved gene expression modules across multiple gene
expression datasets. Consensus clustering is a set of techniques
from computer science and bioinformatics that refers to strategies
for extracting robust clusters from an ensemble of partitions.
Typically this is done using a large ensemble of partitions. Early
work focused on weak clustering algorithms and consensus
clustering was used to ‘‘boost’’ the weak partitions into an
aggregate, consensus partition [12]. In bioinformatics, consensus
clustering algorithms have been developed to aggregate ensembles
of partitions that are derived from data resampling [13]. These
techniques have in common that they do not ‘‘trust’’ a particular
partition from one of their clustering algorithms. Here, we use a
strong clustering algorithm called weighted gene co-expression
network analysis (WGCNA). Our ensemble of partitions is the
collection that we obtain from having multiple, clustered datasets
from independent cohorts. While WGCNA extracts meaningful
signals in each data set, the potentially interesting modules in one
dataset are not precisely replicated in all others.
Mutual information [14] provides a rigorous criterion by which
modules from different datasets can be said to have significant
overlap (i.e. are conserved) and allows one to identify when the
available information between two partitions is exhausted. Mutual
information is a sum of positive and negative contributions from
each pair of modules across datasets, and MICC automatically
disregards all overlaps that do not contribute positively to the total
mutual information, thus giving an objective measure of conserved
gene expression that is both comprehensive and parsimonious. As
such, MICC is a metaclustering procedure that ‘‘clusters the
clusters’’ [12], but does not produce a complete partition. Instead,
it generates only a partition of the subset of the genome that has
strongly conserved gene co-expression.
Using previously published gene expression data from skin
biopsies from patients with SSc recruited at three independent
academic centers [1,4,11] and new samples analyzed as part of this
study (Table 1), we identified the consensus clusters that were
present in all datasets. Due to the unbiased nature of high-
throughput screening, these datasets contain information about
SSc-specific biology as well as the general biology of skin. We
showed that MICC yields consensus clusters that are biologically
specific. At the level of the whole transcriptome, we demonstrated
that the consensus clusters are enriched for hubs defined by co-
expression network analysis. We then filtered the consensus
clusters down to those that were intrinsic subset-specific. The
existence of the intrinsic subsets is a robust observation in each of
these studies and the consensus clusters associated with them
provide a rigorous picture of the core gene set underlying the
subsets. The comprehensive and concise annotation of the
conserved differential gene expression that we developed suggests
Author Summary
Systemic sclerosis (SSc) is a rare autoimmune diseasecharacterized by skin thickening (fibrosis) and progressiveorgan failure. Previous studies of SSc skin biopsies haveidentified molecular subsets of SSc based upon geneexpression termed the inflammatory, fibroproliferative,normal-like, and limited intrinsic subsets. These geneexpression signatures are large and although the biolog-ical processes are conserved, the exact list of genes canvary across datasets due to random variation, as well asminor differences in the composition of the study cohorts(e.g. early vs. late disease). We developed a computationaltool to identify the consensus genes underlying thesubsets across heterogeneous data and characterized thebiological role of the consensus genes in SSc in order toobtain a systems level perspective of the SSc subsets. Ouranalysis reveals a complex network of genes connectingtwo of the major SSc intrinsic subsets, inflammatory andfibroproliferative. Many genetic loci associated with SScrisk show connections with the consensus genes of theintrinsic subsets, indicating that differential expression ofgenes defining the subsets may be related to genetic riskfor SSc, thus for the first time placing the genetic riskfactors in the context of, and showing putative relation-ships with, the intrinsic gene expression subsets.
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 2 January 2015 | Volume 11 | Issue 1 | e1004005
that the intrinsic subsets represent pathophysiological states of one
disease. Our major findings include the following: 1. We show that
the subset-specific consensus clusters are part of a gene-gene
network and for the first time to our knowledge, demonstrate
putative connections between the intrinsic gene expression subsets
of SSc and SSc-associated genetic polymorphisms identified by
candidate and genome-wide association studies (GWAS). 2. We
provide additional unbiased data to support the hypothesis that
immune system activation is an early event and plays a central role
in SSc pathogenesis as SSc risk alleles are linked to the immune
system nodes of our network. 3. The consensus gene-gene network
provides insights into genes that may be central to the major
disease processes and identifies genes and pathways that may
connect these major groups of genes. 4. We show a link between
the inflammatory and fibroproliferative patient groups through a
shared TGFb/ECM subnetwork, suggesting a theoretical path by
which these gene expression subsets may be linked. Collectively,
these findings demonstrate that MICC is a powerful tool that
identifies the reproducible signals in gene expression data across
multiple datasets and shows how they may relate to the genetic
polymorphisms associated with SSc.
Results
We analyzed a compendium of three whole transcriptome
datasets from SSc skin biopsies (Milano et al. [1], Pendergrass et al.
[11], and an expanded version of Hinchcliff et al. [4]; see
Materials and Methods). These datasets consist of 70 patients with
dSSc, 10 patients with limited SSc (lSSc), 4 morphea samples, and
26 healthy controls (Table 1). Our aim was a comprehensive
picture of the gene expression abnormalities in SSc skin and we
integrated several publicly available tools with a novel consensus
clustering procedure. As demonstrated in Fig. 1, our analysis
began with gene coexpression clustering (Fig. 1A), followed by a
novel post-processing step called Mutual Information Consensus
Clustering (MICC) that identified conserved gene expression
modules across the three cohorts (Fig. 1B). The outputs from
MICC were consensus clusters, i.e. modules that were conserved
across datasets, which were the objects of further study, including
ontology annotation and functional interaction analysis (Fig. 1C).
To understand the molecular processes at work in SSc skin
biopsies, we constructed data-driven partitions of the expressed
genes across multiple SSc skin gene expression datasets using
weighted gene co-expression network analysis (WGCNA) [10]
(Figs. 1A, 2). Each co-expression cluster, or module, in the
partition corresponds to a collection of correlated molecular
processes present in the SSc tissue at the time of biopsy. To
compare these modules across SSc datasets, we used mutual
information to detect when a module from one dataset is present
in another dataset. The partitions of the genome-wide expression
data vary from one dataset to the next due to clinical heterogeneity
and treatment effects, as well as technical variation in RNA
processing protocols. All samples were analyzed on Agilent DNA
microarrays with the same DNA probes in the same laboratory,
providing consistency of the gene expression data and genes
analyzed.
To identify genes with conserved expression across multiple
datasets, we developed a procedure called Mutual Information
Consensus Clustering (MICC) that detects significant conservation
of a piece of a module and groups these conserved modules into
collections called communities, which are sets of modules with
considerable mutual overlap between datasets. Each community is
associated with a gene set; namely all genes that are annotated to a
module in that community for each dataset. We call these gene sets
consensus clusters. The basis of MICC is the concept of mutual
information from information theory [14]. Specifically, we use
mutual information of partitions (MIP), which is an information
measure specific for partitions. MIP quantifies the amount of
information one partition has about another; i.e. it measures the
correlation of cluster labels across datasets. MICC identifies
consensus clusters using MIP to build a module similarity network
of significant module overlaps, which we call the information
graph (Figs. 1B, 3A). Then MICC algorithmically identifies
communities in that network (Fig. 1B; Fig. 3A, colored nodes).
These communities are collections of modules that have substan-
tial overlap among each other, and they represent nearly all of the
mutual information between the genomic partitions. In this way,
MICC extracts almost all of the available information present in
the separate clusterings of individual datasets and reports the
clusters that are conserved across the three cohorts (see Materials
and Methods for a detailed description).
WGCNA identifies that large numbers of genes in thegenome are deregulated in SSc skin
Weighted gene coexpression network analysis (WGCNA) [10] is
a gene co-expression clustering procedure that automatically
detects the number of modules in a dataset and removes outlier
genes. WGCNA performed on a single SSc skin dataset (Milano et
al. [1], S1 Data file), demonstrates the complexity of comparative
studies across multiple datasets (Fig. 2). The molecular subsets
termed ‘SSc intrinsic subsets’ were first identified by Milano et al.
[1]. The Milano et al. dataset is the best characterized dataset
showing the SSc intrinsic subsets that has been analyzed to date.
These data were clustered into modules by WGCNA, and the
resulting modules were summarized by their first principal
component or module eigengene (Fig. 2). Module eigengenes are
a one-dimensional summary of the gene expression within a
module that captures the bulk of the variance within that module.
To identify those modules that were intrinsic subset-specific, we
performed Kruskal-Wallis tests on the module eigengenes with
groups defined according to intrinsic subsets. Of the 54 total
modules, 23 had a significant p-value for association with the
intrinsic subsets (all p,0.05 after Bonferroni correction; Fig. 2
shows six representative examples). These 23 modules comprise
approximately 40% of the expressed genes in the genome
(Table 2), demonstrating that the intrinsic subsets found in SSc
skin are defined by deregulation of a very large fraction of the
expressed genes in any given cell. Gene expression for six
significant modules along with their corresponding module
eigengenes demonstrates clear association with the intrinsic subsets
(Fig. 2). These modules are enriched for broad functional
categories previously associated with SSc, including chemokine
signaling, NFkB signaling, RAS-RAC signaling in the inflamma-
tory subset, and cell cycle processes in the fibroproliferative subset
[1,4,11].
To identify the core set of genes reproducibly found in each SSc
intrinsic subset, we performed WGCNA on two additional SSc
skin gene expression datasets: Pendergrass et al. [11] and an
expanded version of Hinchcliff et al. [4] (Data files S2–S3). The
Hinchcliff data are from an ongoing clinical trial of mycopheno-
late mofetil (MMF). Preliminary data have been published [4], but
we also analyzed data for an additional 82 unpublished samples
from that trial (the full data available from NCBI GEO at
GSE59787). A summary of the three cohorts, including the
expanded Hinchcliff cohort, is available in Table 1. Each of the
three datasets had approximately 60 modules. The module
eigengenes were tested for association with the intrinsic subsets,
and it was found that, within a dataset, between 18% and 67% of
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 3 January 2015 | Volume 11 | Issue 1 | e1004005
Ta
ble
1.
Co
mp
osi
tio
no
fst
ud
yco
ho
rts.
Da
taS
et
Pe
nd
erg
rass
Mil
an
oH
inch
clif
f
SS
cC
on
tro
lS
Sc
Co
ntr
ol
Mo
rph
ea
SS
cC
on
tro
lM
orp
he
a
All
sub
ject
sN
=2
2N
=9
N=
24
N=
6N
=3
N=
34
N=
11
N=
1
Ag
e*,
me
an(S
D)
46
.1(9
.1)
NA
51
.8(1
0.4
)4
0.2
(10
.6)
50
.7(2
.9)
47
.7(1
1.7
)4
1.2
(11
.8)
47
Sex,
n(%
wo
me
n)
17
(77
.3)
NA
21
(87
.5)
5(8
3.3
)3
(10
0.0
)3
2(9
4.1
)7
(63
.6)
1(1
00
.0)
SSc
pat
ien
tso
nly
N=
22
N=
24
N=
34
SSc
sub
typ
e,
n(%
dif
fuse
)2
2(1
00
.0)
17
(70
.8)
31
(91
.2)
SSc
dis
eas
ed
ura
tio
n*,
me
an(S
D)
18
.4(1
3.1
)m
os
7.7
(7.2
)yr
s5
2.7
(67
.4)
mo
s
SSc
auto
anti
bo
dy,
n(%
po
siti
ve)
Scl-
70
3(1
3.6
)5
(20
.8)
9(2
6.5
)
RN
AP
ol
III3
(13
.6)
NA
9(2
6.5
)
AC
A1
(4.5
)2
(8.3
)2
(5.9
)
SSc
intr
insi
csu
bse
t,n
(%)
Infl
amm
ato
ry9
(40
.9)
5(2
0.8
)1
7(5
0.0
)
Pro
life
rati
ve8
(36
.4)
11
(45
.8)
7(2
0.6
)
No
rmal
-lik
e2
(9.1
)4
(16
.7)
7(2
0.6
)
Lim
ite
dN
A3
(12
.5)
2(5
.9)
*At
bas
eb
iop
sy.
Th
eth
ree
coh
ort
so
fp
atie
nts
wit
hSS
can
alyz
ed
inth
isst
ud
y:M
ilan
oe
tal
.[1
],P
en
de
rgra
sse
tal
.[1
1],
and
ane
xpan
de
dve
rsio
no
fH
inch
clif
fe
tal
.[4
].d
oi:1
0.1
37
1/j
ou
rnal
.pcb
i.10
04
00
5.t
00
1
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 4 January 2015 | Volume 11 | Issue 1 | e1004005
the modules were subset-specific (Table 2). This shows that for
each dataset a substantial fraction of modules was associated with
the intrinsic subsets and, that 4,004–10,373 genes out of the
approximately 19,500 in the human genome were in differentially
regulated modules associated with the intrinsic subsets (Tables 2
and 3).
The gene co-expression modules represent biological processes
that are active in skin and some are reflective of disease
pathogenesis. To determine which processes were conserved
across all three datasets, we constructed the information graph
for the three separate WGCNA partitions of the genome (Fig. 3A).
The information graph is a network where a node in the network
is a module from one dataset, and a link between modules
indicates that the overlap between those modules is significantly
larger than would be expected at random. In other words, an edge
represents conservation of a significant part of a module across two
datasets (see Materials and Methods for a detailed discussion of
module overlap scores). Triangles in the information graph
correspond to a significant three-way overlap of modules or,
equivalently, a module conserved across all three datasets. We
enumerated all triangles in the information graph to identify all
such conserved modules. There were 157 triangles and approx-
imately 2000 genes in their corresponding triple overlaps. Most
(129 out of 178) of the modules across all SSc datasets are present
in at least one triangle, i.e. most co-expression modules had a
significant portion co-expressed in the other datasets (Table 2,
bottom row). This indicates that the WGCNA-derived modules
are reproducible features of SSc gene expression. Nine of the
triangles had all three nodes (modules) significantly associated with
the subsets (five inflammatory, four fibroproliferative; see below
and Fig. 3).
The consensus genes are hubs in the gene-gene co-expression
networks. To see this, we noted that module eigengenes represent
hubs in the gene-gene correlation network [15]. A module
eigengene does not correspond to an actual gene, but rather
represents a theoretical gene that is most central in the module.
Fig. 1. Schematic of the analysis pipeline for integrative analysis of multiple SSc skin datasets. (A) Each microarray dataset (Milano et al.,Pendergrass et al., and Hinchcliff et al.) was independently clustered by WGCNA into gene coexpression modules (colored circles). Each module is aset of genes that was highly correlated within a dataset. (B) Modules were compared across datasets using a novel procedure (MICC) to determinewhich were approximately conserved across all three datasets. The network in (B) is called the information graph and encodes the nontrivial overlapsof modules across datasets. Triangles in this network correspond to approximately conserved modules across all three datasets. Communities in thisnetwork (dotted ovals) represent collections of modules that are conserved together and thus have similar biological function. Note thatcommunities in the network can overlap (e.g. module P1 in the schematic belongs to two communities). (C) Genes derived from the modulecommunities are called consensus genes and were used for downstream bioinformatics analyses including gene ontology enrichment analysis usingthe g:Profiler tool, testing for intrinsic subset-specificity, and functional interaction network analysis using the IMP functional network. Each of thesedownstream analyses is independent and complementary.doi:10.1371/journal.pcbi.1004005.g001
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 5 January 2015 | Volume 11 | Issue 1 | e1004005
Therefore, genes that are highly correlated to their module
eigengene are more central within their module. We calculated the
correlation of each gene to its corresponding module eigengene
(S1 Fig.). The density of these gene-eigengene correlations is
shown for all genes in the genome (blue curve) and for only the
consensus genes (red curve). The consensus genes are significantly
more correlated with their module eigengene than randomly
selected genes are with their module eigengene, indicating that the
consensus genes are significantly enriched for hub genes in their
(dataset-specific) co-expression network. This is a useful positive
control for the MICC method because it shows that the consensus
genes are enriched for ‘‘hubness’’ in the SSc co-expression network
and thus MICC finds genes that have salient network features.
The information graph reveals conserved, subset-specificmolecular modules
While most modules are partially conserved between the three
datasets and many of them are intrinsic subset-specific, not all
intrinsic subset-specific modules are conserved across all datasets
(S4 Data file). To find the conserved, intrinsic subset-specific
modules, we noted that the information graph has groups of
triangles with considerable mutual edge-sharing (Fig. 3A). Many
of the triangles in the information graph overlap and form
communities of triangles (Fig. 3A, S2 Fig.). This was intriguing
because it opened up a broader interpretation of ‘‘consensus
cluster’’. If the information graph had been a disconnected
collection of single triangles, this would have implied that there
was a one-to-one mapping between the modules from different
datasets. Instead, a single module from one dataset gets broken
into pieces in the other datasets. The community structure of the
information graph indicates what we have known from many prior
microarray studies, namely that specific groups of genes are
commonly expressed together and that the aggregate set of genes
underlying these multiple co-expression clusters constitutes the
truly conserved processes in SSc [1,6,16].
We detected communities in the information graph using a
variant of clique percolation [17], a network community detection
procedure that, in this case, explicitly identifies communities of
triangles (S1 Text). Clique percolation identified 26 communities,
13 of which were single, isolated triangles, while the rest were
groups of more than one triangle (Fig. 3A, S2 Fig.).
To derive a gene set associated with a community in the
information graph, we took all modules within the community,
computed their union within datasets, and computed their
intersection across datasets (S3 Fig.). In this way, we captured all
genes whose co-expression was conserved across the three datasets.
(A mathematical description of this procedure is presented in the
Fig. 2. Gene expression modules associated with the intrinsic subsets of SSc. We identified 54 major sets of genes (modules) using WGCNAthat define the spectrum of gene expression in SSc skin using Milano as a test case. The top 6 most significant modules are shown and each shows astatistically significant association with the intrinsic subsets (including the limited subset). Module assignment for each gene is unique. The genesthat compose the subset-specific modules represent more than 40% of the protein-coding genes in the human genome. Therefore, the intrinsicsubsets seem to be determined by a large fraction of the encoded genes. The module eigengene of each module is shown in a stem-plot below eachheatmap with intrinsic subsets indicated by color above the heatmap. Proliferative, red; inflammatory, purple; limited, yellow; normal-like, green. (A)Inflammatory modules (p,1029 and p,1027; Kruskal-Wallis non-parametric ANOVA corrected for multiple testing), (B) Limited Module (p,0.006),(C) Fibroproliferative modules (p,1027; p,1028), (D) Fibroproliferative and Limited expression module (p,1029). Enriched molecular processes areindicated for each subset to the right of each heat map.doi:10.1371/journal.pcbi.1004005.g002
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 6 January 2015 | Volume 11 | Issue 1 | e1004005
Materials and Methods.) We termed these community-derived
gene sets consensus clusters (CCs). Using g:Profiler [18], we found
that the consensus clusters are enriched for many biological
processes (summary in Table 3; raw data in S5 Data file) present
in both healthy and SSc biopsies. For example, CCs 1, 4, 5, 7, 8,
and 11 are enriched for basic metabolic and cellular processes,
while CC 12 showed enrichment for keratinocyte-specific
processes (Table 3; Fig. 3A, cyan). These consensus clusters show
Fig. 3. Information graph and consensus clusters for the MPH cohorts. (A) The information graph of the MPH cohorts is highly modular (cf.S2 Fig.), indicating approximate conservation of gene expression modules across datasets. The information graph is tripartite by construction, so atriangle in the graph necessarily connects modules across all three datasets. The triangles form communities of mutual edge sharing. Colored nodesand edges highlight four of these communities. The purple community contains modules that are up-regulated in the inflammatory subset (cf. panelB). The red community contains modules that are up-regulated in the fibroproliferative subset (cf. panel B). The cyan community contains modulesthat are enriched for keratinocyte-specific processes. The orange community contains modules that are enriched for fatty acid metabolism genes. Theremaining communities (22 in all and not colored to avoid cluttering the display) are enriched primarily for housekeeping processes and are neitherskin- nor disease-specific (see Table 3). (B) Modules from the communities were tested for their enrichment in the subsets. Each row corresponds to atriangle in the information graph and each column corresponds to a dataset. The black lines separate communities, e.g. all of the rows in the blockmarked ‘‘1’’ correspond triangles in community 1. The cells are colored according to whether the module was significantly differentially expressed in asubset with dark colors representing up-regulation and light colors representing down-regulation (Bonferroni-corrected Wilcoxon rank sum p-valuep,0.05). We assessed statistical significance of modules within each dataset for each of the three diffuse SSc intrinsic subsets, as well as all SSc vs.healthy controls (Purple- Inflammatory, Red- Proliferation, Green- Normal-like, Blue- All SSc). Note the inflammatory up community (*) and thefibroproliferative up community (**). Note also that community 2 is significantly highly expressed in the inflammatory subset and lowly expressed inthe proliferative subset in Milano only. Likewise, community 9 appears to be expressed at low levels in the inflammatory subset in Milano, but noneof the other data sets.doi:10.1371/journal.pcbi.1004005.g003
Table 2. Statistics of module conservation.
Dataset Milano Pendergrass Hinchcliff
Number of subset-specific modules (Number of modules total) 17 (54) 13 (66) 31 (58)
Portion subset specific 31% 18% 53%
Number of genes in subset-specific modules 10,373 4,004 8,517
Number of conserved modules 32 50 47
Portion conserved 59% 76% 81%
The number of modules identified by WGCNA varied across datasets. However, a large majority of modules from each dataset were at least partially conserved acrossthe three cohorts, meaning that they were present in at least one triangle of the information graph (Fig. 3A). A smaller fraction of the modules were subset-specificwithin their dataset. The Hinchcliff dataset had the largest fraction of subset-specific modules, which may be due to subject enrollment in the cohort rather than to SScbiology. The Milano dataset had the largest number of genes present in subset-specific modules. This shows that the subsets are associated with a large number ofgenes (up to 40% of the genome in Milano et al. assuming an upper bound of 25,000 human genes).doi:10.1371/journal.pcbi.1004005.t002
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 7 January 2015 | Volume 11 | Issue 1 | e1004005
that MICC extracts biologically coherent sets of genes that are
known to be active in skin as consensus clusters. This provides an
additional positive control for the MICC method.
More importantly, CC3 and CC9 showed enrichment for
processes implicated in SSc (Table 3; S5 Data file). CC3 was
enriched for response to interferons, B cell receptor signaling,
monocyte chemotaxis, and TGFb and PDGF signaling, as well as
ECM remodeling processes. CC9 showed enrichment for cell cycle
and cell proliferation processes, as well as integrin interactions with
fibrin. Note that CC3 and CC9 both show enrichment for distinct
ECM-related molecular processes. These data are consistent with
the analysis of experimentally derived pathway signatures [19].
The consensus clusters CC3 and CC9 map to the major
intrinsic subsets previously described [1]. We tested every module
for association with the intrinsic subsets (see Materials and
Methods) and we constructed a ‘‘heatmap’’ of the triangles in
the information graph by dataset (Fig. 3B). The rows were ordered
by community membership and the columns were ordered by
dataset. We concatenated each of these plots so all subsets,
datasets, and consensus clusters can be viewed simultaneously.
Only consensus clusters 3 and 9 were enriched for SSc intrinsic
subset specificity. Consensus cluster 3 contained modules that are
almost all significantly expressed at high levels in the inflammatory
group of patients (Fig. 3A, purple nodes; Fig. 3B). Consensus
cluster 9 contains modules that are almost all significantly
expressed at high levels in the fibroproliferative group (Fig. 3A,
red nodes; Fig. 3B). We also included tests for all SSc biopsies
versus healthy controls to determine if there were any consensus
clusters that were generally conserved across all SSc biopsies.
There were no consensus clusters that were enriched for all SSc
versus healthy controls, which illustrates quantitatively SSc
heterogeneity. Furthermore, there were no consensus clusters that
were consistently expressed at low levels in any of the subsets.
Some consensus clusters are enriched for a subset in some of the
datasets, but are not replicated across all datasets (Fig. 3B). For
example, CC2 is expressed at high levels in the inflammatory
subset and low levels in the proliferative subset in Milano, but
neither of the other datasets. Inflammatory-specific CC3 is
expressed at low levels in the proliferation subset in Milano and
in the normal-like subset in Pendergrass and Hinchcliff, and is
expressed at high levels in all SSc versus healthy controls in
Pendergrass and Hinchcliff only. Similarly, CC9, which is
proliferative-specific, is expressed at low levels in the inflammatory
subset in Milano only. These observations demonstrate that genes
with increased expression should be the focus in SSc.
Conserved inflammatory and fibrosis genes form anetwork with putative SSc risk alleles
The biology of CC3 and CC9 show the processes common to
the intrinsic subsets that have been observed across multiple gene
Table 3. Molecular processes enriched in the 13 largest consensus clusters.
Consensus Cluster Enriched Molecular Processes # Consensus genes
1 Tubulin processing 1144
2 Insulin signaling 1194
3 TGFb signaling 312
PDGF signaling
Collagen fibril organization
B cell receptor signaling
Monocyte chemotaxis
Response to interferons
Patterning of blood vessels
4 RNA processing and transport 873
Ribosome biogenesis
5 Tubulin processing 485
6 ARF activity 225
7 Mitochondria 73
8 Mitochondria 80
9 Fibrin/fibrinogen 274
Mitotic spindle
G2/M transition
Integrin interactions with fibrin
10 Alcohol metabolism 80
11 Fatty acid metabolism 98
12 DAP12 (TYROBP) signaling 63
Keratinocyte processes
13 Phosphatidic acid 11
An analysis with g:Profiler resulted in many biological processes enriched in the each of consensus clusters. This table contains a condensed list of the significantpathways (p,0.05, corrected for multiple testing by default in g:Profiler), retaining those that are specific for skin or other housekeeping biology. Note particularlyconsensus clusters 3 and 9, which are enriched for the inflammatory and fibroproliferative subsets respectively. Consensus clusters 3, 9 11, and 12 are highlighted inFig. 3A by the colored communities in the information graph.doi:10.1371/journal.pcbi.1004005.t003
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 8 January 2015 | Volume 11 | Issue 1 | e1004005
expression datasets: inflammation, cell interactions with ECM,
and cell proliferation (Table 3). To determine if there was a more
interconnected relationship between these conserved processes
(such as genes related to specific cell types) than could be gained
from an ontological annotation analysis like g:Profiler, we used
CC3 and CC9 as a query gene set for the IMP gene-gene
interaction Bayesian network (IMP) (Fig. 4) [20]. IMP is a gene-
gene interaction network developed using a large compendium of
high-throughput biological data including all publicly available
microarray data that predicts the probability that pairs of genes
have a co-expression interaction. A list of genes is imported into
IMP, and a list of high-probability interactions between the genes
on the imported list and (up to 50 additional genes in) the rest of
the genome is generated. IMP is completely agnostic to SSc-
specific biology and reports predicted interactions that are based
on the preponderance of evidence across all publicly available
gene expression data. As our query, we pooled the two consensus
clusters CC3 and CC9 to discover possible molecular links
between the inflammatory and fibroproliferative intrinsic subsets.
We added polymorphic genes from genome-wide association
studies (GWAS), as well as genes from candidate gene studies that
have been replicated in at least one follow up study (see Materials
and Methods; S6 Data file). In addition, we added four genes that
are putative predictors of Modified Rodnan Skin Score (MRSS), a
widely used clinical measure of skin fibrosis [5].
The output network from IMP was dominated by one large
interconnected network that had five distinct subnetworks (Fig. 4;
S7–S9 Data files). The five molecular subnetworks were each
enriched for a distinct biological process: interferon response, M2
macrophage activation, adaptive immunity, ECM deposition and
remodeling and TGFb signaling, and cell proliferation.
One subnetwork was dominated by interferons and interferon-
inducible genes (Fig. 4, top middle; S9 Data file). The interferon
subnetwork contained genes solely from the inflammatory
consensus cluster (Fig. 4, purple nodes). This subnetwork con-
tained the interferon inducible genes IFI16 and IFI44, the latter
Fig. 4. Molecular network of inflammatory and fibroproliferative consensus genes. The consensus genes for the inflammatory andfibroproliferative subsets are connected in the IMP functional network. Inflammatory genes are colored purple, while fibroproliferative genes arecolored red. Genes with polymorphisms are colored in green and MRSS biomarker genes are colored yellow. One MRSS biomarker gene (IFI44) wasalso an inflammatory consensus gene (pink), while three polymorphic genes were inflammatory consensus genes (turquoise). Note the five distinctsubnetworks corresponding to type I interferons, M2 macrophages, ECM proteins and TGFb signaling, adaptive immunity, and cell proliferation. Theinterferon, M2 macrophage, and adaptive immunity subnetworks are composed almost exclusively of inflammatory genes, while the ECMsubnetwork shares genes from both intrinsic subsets. Furthermore, the polymorphic genes interact primarily with inflammatory subset genesindicating that the genetic risk in SSc is related to immune abnormalities.doi:10.1371/journal.pcbi.1004005.g004
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 9 January 2015 | Volume 11 | Issue 1 | e1004005
of which is a putative biomarker of fibrosis [5]. This subnetwork
also contains the polymorphic interferon regulatory factor genes
IRF5, IRF7, and IRF8.
A second subnetwork contained genes characteristic of M2
macrophage activation (Fig. 4, bottom left; S9 Data file). The
genes in this network, which include major histocompatibility
complex (MHC) class II genes with SSc-associated polymor-
phisms, are derived primarily from the inflammatory consensus
cluster, implicating macrophages as mediators of inflammation.
Polarized macrophages can broadly be categorized as ‘‘classically
activated’’ (M1) or ‘‘alternatively activated’’ (M2), although it is
important to recognize that macrophage polarization encompasses
a broad spectrum of activation states. M1 macrophages may be
elicited through stimulation with IFN-c and LPS, are microbicidal,
and promote Th1-mediated immune responses. In contrast, M2
cells, which mediate immune suppression, may be activated by
various stimuli, including IL-4 and/or IL-13, which are elevated in
SSc sera [21,22]. Genes associated with M2 activation, including
CX3CR1 [23], IL10R [24], and HLA-DMB [25], were consis-
tently expressed in this subnetwork, in accordance with previous
studies that found increased M2-polarized macrophages in SSc
skin compared to healthy skin [26]. As M2-polarized cells regulate
vascularization and are a potent source of TGFb, PDGF, and
inflammatory cytokines [27–29], activated M2 macrophages may
play a role in mediating fibrosis and inflammation in SSc.
A third molecular subnetwork contained genes related to
adaptive immunity (Fig. 4, top left; S9 Data file). There are
relationships to both B and T cells in the genes in this subnetwork.
Two chains of the T cell receptor complex are represented: CD3G(gamma chain of the T cell receptor (CD3)) and CD247 (the zeta
chain of the T cell receptor), which contains SSc-associated
polymorphisms. The IL-12 pathway, which mediates Th1 cell
differentiation and activation [30,31], is represented through
IL12RB2. Binding of IL-12 to IL12RB2 on activated T cells
initiates a signal transduction cascade that results in activation of
STAT transcription factors, including STAT4 [32] (also repre-
sented in this subnetwork), which regulate T cell signaling and
immune activation [33]. Aberrant expression of IL12RB2 has
been reported in autoimmune and infectious diseases [34,35],
implicating this gene as an important regulator of inflammation
and immune defense.
B cell receptor activation and signaling are also represented in
this subnetwork. DOCK10 expression is up-regulated in B cells by
pro-inflammatory IL-4 [36], and BANK1 and BLK are B cell
proteins that have polymorphisms associated with SSc. Both LYNand CSK appear in this subnetwork and are directly connected to
each other. The tyrosine kinase LYN, which plays a critical role in
down-regulating B cell activation and mediating self-tolerance
[37,38], is phosphorylated by CSK [39]. Polymorphisms in CSK
have been linked to both SSc and systemic lupus erythematosus
(SLE) and are associated with aberrant B cell signaling [40]. CSK
also associates with Lyp [41], which is the product of the tyrosine
phosphatase PTPN22. The PTPN22 gene also contains an SSc-
associated polymorphism. Mutations in PTPN22 that interfere
with its ability to bind to CSK also interfere with both B and T cell
receptor activation [42,43]. Moreover, mutations in PTPN22have been reported in a variety of other autoimmune diseases,
including SLE, rheumatoid arthritis, and type 1 diabetes [44].
Negative regulators of B and T cell activation such as SOCS2and SOCS3, are included in this network. SOCS3 has been shown
to directly inhibit IL-12-induced STAT4 activation [45]. The co-
occurrence of pro- and anti-inflammatory signals in this subnet-
work is notable and is likely because our data are derived from
whole skin biopsies (see Discussion).
The fourth molecular subnetwork contained TGFb pathway
genes (which have long been implicated in the activation of fibrosis
in SSc [46,47]) and ECM structural proteins (Fig. 4, bottom
middle). This TGFb/ECM subnetwork contained genes from both
the inflammatory and fibroproliferative consensus clusters (Fig. 4,
red and purple nodes; S9 Data file). We also found expression of
genes associated with Notch signaling such as NOTCH4, which
contains SSc-associated polymorphisms, and with the epithelial-
mesenchymal transition (EMT) such as LATS2. Alternatively
activated macrophages are known to produce large quantities of
TGFb in SSc pulmonary fibrosis [29], suggesting that the M2
macrophage subnetwork could drive activation of the TGFb/
ECM subnetwork.
The final molecular subnetwork contained cell cycle/cell
proliferation genes, which were primarily from the fibroprolifera-
tive consensus cluster (Fig. 4, right). The expression of prolifera-
tion genes is commonly observed in cancer [2,7,48] and their
presence in the gene expression data of SSc was a surprising and
unexpected finding [1]. The large and densely interconnected
subnetwork of genes in Fig. 4 (right, red nodes) was composed
almost exclusively of cell cycle-regulated genes including AURKA/B, CCNA2, CCNB1, CHK1, and DHFR [7]. This subnetwork
was conserved and showed increased expression in the fibropro-
liferative subset of patients across all three cohorts, and constituted
the core gene expression signature in that subset of patients (Fig. 4,
red nodes). Therefore, the cell proliferation signature of the
fibroproliferative subset of patients first observed in Milano et al.
[1] is a conserved feature of SSc across three independent cohorts
from three separate clinical centers. This molecular subnetwork
has connections to each of the other four subnetworks (interferon,
M2 macrophages, adaptive immunity, and TGFb/ECM) suggest-
ing that cell proliferation in SSc skin is modulated by the
inflammatory and ECM remodeling processes in skin.
IMP predicts that the genes linked to SSc-associated polymor-
phisms (30/41 total) and the putative MRSS biomarker genes of
Lafyatis and co-workers (4/4) have interactions within this large
component of the molecular network (Fig. 4). Polymorphisms in
IRF5, IRF7, and IRF8 were linked to the interferon subnetwork.
IRF7 is also differentially expressed in the inflammatory subset.
The polymorphisms associated with human leukocyte antigen
(HLA) alleles predominantly have interactions with the M2
macrophage subnetwork of genes. Polymorphisms in and differ-
ential expression of NOTCH4 were linked to the TGFb/ECM
subnetwork. The same was true for the MRSS biomarker genes;
IFI44 was linked to the interferon subnetwork; SIGLEC1 was
linked to the M2 macrophage subnetwork; and both COMP and
THBS1 were linked to the TGFb/ECM subnetwork. These
results suggest that prediction of worsening skin disease requires
sampling genes from each molecular subnetwork.
The molecular network contains SSc-pathology-specifichubs
The molecular network contains genes that are hubs (i.e. highly
connected nodes) of the subnetworks.
Interferon-induced protein 44 (IFI44) is a hub of the interferon
subnetwork. It has conserved high expression across all three of
our cohorts in the inflammatory subset and is one of the most
highly connected genes in the interferon subnetwork (Fig. 5, top
right). IFI44 is predicted to have co-expression interactions with
several other interferon-inducible and interferon-regulating genes,
including IFI16, IRF7, IFITM2, ISG20, GBP1, and TRIM22.
Allograft Inflammatory Factor 1 (AIF-1) is a hub of the M2
macrophage. AIF1 is consistently highly expressed in the
inflammatory subset across all three SSc skin cohorts and is one
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 10 January 2015 | Volume 11 | Issue 1 | e1004005
of the most highly connected genes in the M2 macrophage
subnetwork (Fig. 5, bottom left). In the molecular network (Fig. 5,
bottom left), AIF1 has many links including: ITGB2, a binding
partner of the monocyte marker ITGAL, and MHC class II genes
HLA-DMB, HLA-DPA1, and HLA-DQB1. In addition, AIF1 has
connections to chemokine receptors CCR1 and CX3CR1, which
are connected to chemokines CX3CL1 (fractalkine) and CCL2(MCP-1).
The tyrosine kinase gene LYN is a hub of the adaptive immunity
subnetwork (Fig. 5, top left). LYN has predicted edges with four
polymorphic genes in this subnetwork: BLK, BANK1, CSK, and
GRB10. LYN also has connections to the polymorphic, bridge
genes PLAUR and LCP2 (see below), and suppressors of cytokine
signaling genes SOCS2 and SOCS3. The conserved finding of high
expression of LYN in the inflammatory subset and its centrality
within the adaptive immune subnetwork suggests that LYN plays a
key role in the adaptive immune component of SSc in skin.
Fibrillin-1 (FBN1) is a hub of the TGFb/ECM subnetwork
(Fig. 5, bottom right). High expression of FBN1 is conserved
across the inflammatory subset of all three cohorts of SSc skin, and
FBN1 is highly connected within the TGFb/ECM subnetwork of
the molecular network (Fig. 5, bottom right). The TGFb/ECM
subnetwork includes genes that primarily show high expression in
the inflammatory subset but also includes genes that are highly
expressed in the fibroproliferative group, thus providing a putative
molecular link between the two groups. FBN1 has predicted
connections to many genes whose increased expression is
conserved, including: pro-fibrotic genes including COL1A2,
COL5A2, and elastin (ELN), CTGF, SPARC, THBS1, THBS4,
COMP, TNC and ECM remodeling and wound response genes
LOX, NNMT, and FBLN5. In addition, FBN1 has connections
with growth factor genes and receptors such as HTRA1 and
NOTCH4; cell adhesion genes CDH11 and LAMA4; as well as
the complement system gene C1S.
The molecular network shows genes that bridgesubnetworks
In addition to containing discrete subnetworks, the molecular
network also shows genes that bridge the subnetworks (Fig. 6).
These genes are of particular interest because they have predicted
connections between multiple, distinct subnetworks. The primary
reason for using CC3 and CC9 simultaneously as queries to IMP
Fig. 5. Hubs in the inflammatory and ECM components of the network. The putative MRSS biomarker gene IFI44 is a hub of the type 1interferon subnetwork. AIF1, which contains SSc-associated polymorphisms and is related to M2 macrophage polarization, is a hub of the M2macrophage network. FBN1, which contains SSc-associated polymorphisms in some populations and is a key component of ECM that regulatesmatrix stiffness, is a hub of the TGFb/ECM network. The tyrosine kinase gene LYN is associated with B cell activation and mediating self-tolerance andis a hub in the adaptive immunity subnetwork.doi:10.1371/journal.pcbi.1004005.g005
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 11 January 2015 | Volume 11 | Issue 1 | e1004005
was to identify possible molecular connections between the core
molecular processes of the inflammatory and fibroproliferative
intrinsic subsets. The bridge genes live at the interfaces between
the subnetworks that constitute these core molecular processes.
The genes CXCR4 and LCP2 are the major connections
between the adaptive immunity subnetwork and the M2
macrophage subnetwork (Fig. 6). LCP2 (SLP-76), which modu-
lates T cell activation [49], has predicted interactions with AIF1and IL10RA in the M2 macrophage subnetwork and to SOCS2,
SOCS3, STAT4, LYN, and CSK in the adaptive immunity
subnetwork (see edges extending from LCP2 in Fig. 6). The
chemokine CXCR4 has predicted interactions with the cytokines/
chemokines IL10RA, CX3CR1, CCR1, and the polymorphic
CCR6 in the M2 macrophage subnetwork (Fig. 6). CXCR4 has
predicted interactions with SOCS3 and JAK3 in the adaptive
immunity subnetwork.
GRB10 contains an SSc-associated polymorphism and is also
expressed at high levels in the inflammatory subset (see blue
GRB10 node, Fig. 6). GRB10 is part of a complex path from the
adaptive immune subnetwork to the M2 macrophage subnetwork
that includes genes containing pleckstrin homology domains
including PLEKHO1, PLEKHO2, CYTH4 and ADAP2.
The major connection between the M2 macrophage subnet-
work hub AIF1 and the interferon subnetwork hub IFI44 is
through RAC2. RAC2 encodes a member of the Rac family of
signaling molecules and has multiple predicted interactions with
both the interferon subnetwork and the M2 macrophage
subnetwork (Fig. 6). In the interferon subnetwork (Fig. 6, upper
middle), RAC2 connects to CTSC (cathepsin C), IFITM1 and
IFI16, as well as the Rho GTPase related genes ARHGDIB and
RAB31. In the M2 macrophage subnetwork (Fig. 6, lower left),
RAC2 connects to ITGB2, the actin cytoskeleton related proteins
LCP1 and COTL1, and GMFG. COTL1 is also related to
leukotriene biosynthesis through a known interaction with
ALOX5. These diverse interactions suggest that RAC2 is involved
simultaneously in macrophage motility, leukotriene biosynthesis,
and interferon signaling.
The major bridges between the M2 macrophage subnetwork
and the ECM subnetwork are THY1 (CD90) and CD14 (Fig. 6,
lower left). THY1 connects to SIGLEC1, MXRA5 and COL1A2.
THY1 mediates adhesion of leukocytes and monocytes to
endothelial cells and fibroblasts [50], may also have a role in lung
fibrosis (a major complication of SSc); THY1 knockout mice have
increased lung fibrosis [51,52]. CD14 is a cell surface protein
Fig. 6. Bridges between components of the network. Several genes bridge the component subnetworks of the molecular network. PLAUR is agene that contains SSc-associated polymorphisms that forms a bridge between the interferon subnetwork and TGFb/ECM subnetwork. The geneRAC2 is a bridge between the interferon and M2 macrophage subnetworks. The genes LCP2 and CXCR4 are bridges between the M2 macrophagesubnetwork and the adaptive immunity subnetwork. There are also several paths through GRB10 to ADAP2 between the M2 macrophage subnetworkand the adaptive immunity subnetwork. The genes CD14 and THY1 (CD90) are bridges between the M2 macrophage subnetwork and the TGFb/ECMsubnetwork. The genes IRAK1 and PXK are bridges between the TGFb/ECM subnetwork and the cell proliferation subnetwork.doi:10.1371/journal.pcbi.1004005.g006
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 12 January 2015 | Volume 11 | Issue 1 | e1004005
mainly expressed by macrophages, is inducible by and connected
to AIF1 [53]. It also has connections to the polymorphic genes
TLR2 and HLA-DRA (Fig. 6, lower left).
PLAUR (UPAR) contains a putative SSc-associated polymor-
phism, is a member of the interferon subnetwork, and has
numerous links with the ECM, M2 macrophage, and the adaptive
immunity subnetworks (Fig. 6). PLAUR encodes the plasminogen
activator, urokinase receptor protein and is a pleiotropic gene at
the interface of ECM remodeling, as a component of the
fibrinolysis system, and in both adaptive and innate immune
processes, including monocyte migration [54]. PLAUR is induc-
ible by proinflammatory cytokines IL1b and TNFa. PLAURconnects to the tyrosine kinase LYN, the hub gene of the adaptive
immunity subnetwork, and to the integrin gene ITGB2 in the M2
macrophage subnetwork. It is also connected to the polymorphic
genes TNFSF10B and TNFAIP in the interferon network and to
TPM4, INHBA, THBS1, and CCL2 in the ECM subnetwork.
The centrality of PLAUR within the consensus gene network
suggests that PLAUR may be a key mediator of inflammatory and
ECM remodeling signals in SSc skin.
The proliferation subnetwork has predicted interactions with
the inflammatory and ECM subnetworks. The most pronounced
connection is between the ECM subnetwork and the cell
proliferation subnetwork through TGFb pathway genes (Fig. 6).
The TGFb pathway is known to modulate cell proliferation.
There are multiple paths from the TGFb pathway genes
TGFB3 and TGFBR2 to the cell proliferation subnetwork
through the polymorphic genes IRAK1 and PXK, which have
predicted interactions with the serine/threonine kinases LATS2,
WNK4, and PRKAA1. Serine/threonine kinases are well
known to be important regulators of cell proliferation and they
are bridges between the ECM subnetwork and cell proliferation
network.
Discussion
The intrinsic subsets of SSc have been found in multiple skin
gene expression datasets. Until now, the majority of experimental
data has indicated that the subsets are mutually exclusive—i.e.
patients are categorized as being in one of the subsets, and that the
core molecular processes and subsets of genes, are reproducible
across cohorts. Despite this consistency, the exact set of intrinsic
genes varies across datasets. We address both of these issues here.
Our consensus clustering approach allowed us to detect a
conserved set of genes from a module perspective across the three
independent SSc patient cohorts by considering molecular
processes first and constituent genes second. The predicted co-
expression interactions between these consensus genes indicate
that the key processes represented by the consensus genes
(inflammation, ECM remodeling, and cell proliferation) may
interact at a molecular level, with specific links between the
subnetworks. Thus, we have demonstrated theoretical connectionsbetween the genes of the SSc intrinsic subsets that are difficult to
capture experimentally.
It is clear that in addition to its clinical heterogeneity, SSc is a
genetically complex disease. Many risk alleles for SSc have been
identified, but each has only a modest odds ratio and the complete
picture of SSc will likely develop from the interactions between
various risk factors. The network of consensus genes demonstrates
that a significant fraction of the genes with risk alleles for SSc have
probable interactions with the consensus genes that underlie the
intrinsic gene expression subsets. This implicates these polymor-
phic genes as interacting with genes differentially expressed in the
subsets. This simultaneously provides a picture of the key gene
expression abnormalities in the intrinsic subsets and the validated
genetic associations at a systems level.
These data and the resulting network were developed from a
detailed meta-analysis of SSc skin gene expression datasets using
MICC, a consensus clustering framework we developed. Our
method reports only consensus clusters that are conserved across
all input datasets and dispenses with non-conserved gene
expression. The concept of mutual information gives MICC a
theoretical foundation, but like any data mining algorithm, its
value is gauged by performance on real data.
The rationale for gene coexpression clustering algorithms like
WGCNA is that co-expression networks are inherently modular
and that co-expression hub genes are likely related to the
regulation of the modules. This has been borne out by several
studies in humans [55], mice [56], and even across species [57].
The genes identified by MICC are disproportionately more hub-
like than a random population of the same size (S1 Fig.).
Therefore, MICC does not identify spurious overlaps but rather
detects network-relevant overlaps that are enriched for key hub
genes. At the same time, the information graph used by MICC is
not simply a disconnected set of triangles, which would indicate a
one-to-one mapping of modules between datasets. Instead, the
modules in one dataset are broken into a small set of pieces that
are re-assorted to build the modules in another dataset. This is
likely due to variations in study design and protocols between the
datasets, but also the inherent heterogeneity of SSc, therapy
effects, and environmental exposures. The MICC method is
explicitly designed to handle this unavoidable variance by
broadening the definition of consensus cluster to allow for
imperfect conservation of gene coexpression. We also note that
MICC is completely general with respect to the data that are
clustered and which clustering algorithms are used. In principle,
gene expression from different tissues (e.g. blood and skin) or
different species (e.g. mouse and human) or data from multiple
experimental modalities (e.g. transcriptomics and proteomics) can
be compared using MICC. These types of data exist for multiple
tissues in SSc and multiple animal models of SSc. Follow-up
studies will integrate these to further elaborate the molecular
underpinnings of SSc.
The consensus clusters from MICC show both skin-specific
processes that represent basic biological processes in this tissue as
well as disease-specific processes. Nearly all (24 out of 26)
consensus clusters are enriched only for general cellular or
otherwise skin-specific biology: metabolism, cell turnover, kerati-
nocyte-specific gene expression, etc. We view these consensus
clusters as a useful positive control for the MICC method. Such
housekeeping processes are clearly biologically relevant and
MICC would be missing important structure in the data if these
were not found. By taking a ‘‘module first’’ approach, MICC is
able to identify consensus genes that are specifically clustered into
pathologically active modules (the subset-specific modules).
The MICC-derived consensus clusters are enriched forknown mediators of SSc pathology and genetic riskfactors
Two of the consensus clusters were SSc subset-specific (Fig. 3B).
These clusters contain the key gene expression abnormalities in
SSc that are conserved across all three cohorts. The consensus
clusters are enriched for inflammatory process Gene Ontology
terms, as well as TGFb signaling, PDGF signaling, and cell
proliferation (Table 3). Most (30 out of 41) of the genes with
replicated SSc-associated polymorphisms are predicted to interact
with genes in the consensus clusters; 28 out of 30 of these interact
in the immune (interferon, M2 macrophage, and adaptive
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 13 January 2015 | Volume 11 | Issue 1 | e1004005
immunity) and TGFb/ECM subnetworks (Fig. 4). The inflamma-
tory-specific consensus cluster also contains the genes FBN1 and
AIF1. Previous work implicates FBN1 in SSc pathogenesis, as a
duplication of FBN1 causes fibrosis in the Tsk1 mouse [58] and a
point mutation in FBN1 causes the fibrotic phenotype in the Stiff
Skin Syndrome mouse [59]. Fibrillin-1 forms a matrix of elastic
microfibrils that provide a scaffold for elastins and collagens, and a
means for sequestering matricellular growth factors. Mouse
embryonic fibroblasts expressing the Tsk1 mutant FBN1 have
altered microfibril morphology that results in increased collagen
deposition [58]. While polymorphisms in FBN1 might cause
dosage effects that result in fibrosis in some models (e.g. Tsk1), it is
possible that chronic inflammation causes chronic high expression
of FBN1 to similar effect in humans. Rare polymorphisms in
FBN1 have been associated with SSc in some subpopulations [60–
62].
Similarly, AIF1 is implicated in SSc disease progression. A SNP
in AIF1 has been implicated in anticentromere antibody (ACA)
positive SSc [63]. Moreover, AIF-1 is interferon-inducible,
constitutively expressed in macrophages [64], and plays a role in
vasculogenesis and endothelial cell proliferation and migration
[65]. In the Sclerodermatous Graft-Versus-Host Disease
(sclGVHD) mouse model of SSc, AIF1 was found to be highly
expressed in skin [66] and to induce fibroblast and monocyte
chemotaxis [53]. AIF1 has many predicted interactions with
chemokine receptors CCR1 and CX3CR1, which are connected to
chemokines CX3CL1 (fractalkine) and CCL2 (MCP-1). The genes
CX3CL1 and CCL2 are M1 and M2 macrophage-related genes
respectively [67] and are chemotactic for monocytes, macrophag-
es, and T cells [68], suggesting enhanced recruitment of
inflammatory cells to this subnetwork. A recent study of a mouse
model of SSc demonstrated that both CCR2 and CX3CR1regulate skin fibrosis, further implicating these mediators in the
pathogenesis of SSc [69]. In addition, CCL2 has been shown to
induce M2 macrophage polarization [70], which may result in
persistent M2 activation. The repeated and conserved finding of
high AIF-1 levels in the inflammatory subset and its tight
connection to innate immune mediators of inflammation suggest
it may be involved in enhanced macrophage chemotaxis and
activation in SSc skin.
LYN, a hub of the adaptive immunity subnetwork, modulates B
cell activation and plays a role in self-tolerance. B cell signaling has
been implicated in SSc development and progression, as B cells
have been shown to play a role in both the development of
autoantibodies and cutaneous fibrosis in the Tight Skin 1 (Tsk1)
mouse model of SSc. Notably, LYN is overactive in response to
overexpression of CD19 in this model [71]. Thus, LYN may play a
role in the autoimmune component of SSc in human patients.
The molecular network shows putative connectionsbetween subnetworks
The consensus gene network (Figs. 4 and 6) also implicates
genes as bridges between the subnetworks. These notably include
the polymorphic genes PLAUR, IRAK1, PXK, and GRB10. In
addition, we find differentially expressed genes straddling the
subnetworks including RAC2 and LCP2. The interconnections
between the subnetworks present possible molecular paths through
which these processes interact.
The finding that most SSc-associated polymorphisms are
associated with immune system mediators suggests that the initial
events in SSc are likely to be immune-regulated and to involve
interferon activation (Fig. 7). The immune response in SSc likely
differs from a normal response because of predisposing genetic
variants in these and associated genes. This may lead to the
secondary recruitment of macrophages via RAS-RAC signaling
(Fig. 7).
We predict that the interferon network suppresses cell
proliferation, given the clear distinction between the inflammatory
and fibroproliferative subgroups. This inference is based on known
interferon biology and not on the network itself. In contrast, it is
possible that the ECM network stimulates cell proliferation
through the TGFb pathway and serine/threonine kinases IRAK1,
LATS2, WNK4, and PRKAA1. In this model, inflammatory gene
expression creates a balancing feedback loop that modulates
fibroproliferative gene expression (Fig. 7).
A major strength of the IMP network and its data integration
capabilities derives from its ability to provide a more detailed
picture of SSc development and progression compared with more
conventional approaches. For example, while all of the purple
nodes in Fig. 4 are highly expressed in the inflammatory group
across all data sets, the IMP network provides information
regarding gene-gene interactions in addition to expression data.
In this example, the IMP network indicates which subnetworks
correspond to discrete processes (interferon, M2 macrophages,
ECM, and adaptive immunity) and which interactions are
mediated through the network. Thus, we gain insight by
recognizing that the interferon component is distinct from the
M2 macrophage component, despite their co-expression and
known interdependence. The value of the IMP network is as much
in the connections that are not present as those that are.
What is the nature of the interactions between theintrinsic gene expression subsets?
Since the original publication of the intrinsic subsets, two
important questions have been central to their interpretation and
their clinical relevance: First, can a patient’s subset change over
the course of their disease? And second, can the subsets predict
therapeutic response?
Pendergrass et al. [11] demonstrated that a patient’s subset is
stable over time scales of 6 to 12 months. This means either that
patients never change subsets and the intrinsic subsets are
effectively distinct diseases, or that the subsets are long-lived states
of the same disease. Our analysis shows that the inflammatory and
fibroproliferative subsets share a molecular network containing
TGFb pathway genes and ECM component genes, suggesting that
inflammatory patients may transition to the fibroproliferative
subset, perhaps in response to successful immunosuppressive
therapy. Indeed, immunosuppressive therapy has not been widely
successful for treatment of SSc [72]. On the other hand,
fibroproliferative biopsies still have some activation of the
TGFb/ECM network despite the absence of the inflammatory
signature (Fig. 4). The connection of the subsets through the
TGFb/ECM subnetwork indicates that the fibroproliferative
subset shares a common pathway with the inflammatory subset
and that the fibroproliferative subset is tied to chronic TGFbactivation and ECM deposition. Thus, based on the molecular
network, it is possible that immunosuppressive therapy can move
patients to the fibroproliferative subset rather than restoring their
gene expression to that of healthy skin. Our data from an ongoing
MMF clinical trial and analysis of mouse models of SSc suggests
that gene expression changes precede clinical changes [4,66];
therefore gene expression could act as a readout for the
effectiveness of a drug. This idea should be rigorously tested in
clinical trials that carefully monitor gene expression in patient skin
biopsies.
The pathogenesis of SSc has been enigmatic, but a number of
genetic risk factors have been identified by genome-wide
association studies and candidate gene studies. Three of these
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 14 January 2015 | Volume 11 | Issue 1 | e1004005
polymorphic genes, NOTCH4, IRF7, and GRB10, are in the
inflammatory consensus cluster, and hence are consistently
differentially expressed in the inflammatory subset (Fig. 4). This
suggests that these may be cis-acting alleles and demonstrates the
need for candidate gene studies to determine if differential
expression is genetically driven in a subset of patients. The IMP
functional network predicts that twenty-five of the remaining forty-
one polymorphic genes interact with genes from the inflammatory
consensus cluster (Fig. 4). Rather than being scattered evenly
across all of the subsets or unrelated to any of the consensus genes,
the risk alleles are overwhelmingly related to the inflammatory
subset. The genetic studies, however, did not stratify their patients
by intrinsic gene expression subset. The studies were carried out as
case versus control or case versus case, when stratified by
autoantibody status or other clinical outcomes. Risk alleles
associated with a particular gene expression subset have not been
reported. We reemphasize the fact that we found no consensus
clusters that were differentially regulated in all SSc vs. healthy
control biopsies.
These data support the hypothesis that the subsets are related to
disease progression and that SSc starts with immune activation,
perhaps in response to an environmental trigger [73,74]. The
SNPs associated with SSc would then likely be risk factors for an
aberrant immune response to this trigger. Should such a model be
correct, we are still left with the question of why we have different
subsets that generally show little or no correlation with disease
duration. The simplest explanation for this result is that patients
progress through the gene expression subsets at dramatically
different rates and that our measures of disease duration are
currently inadequate.
Another possibility is that any given patient transitions between
these intrinsic gene expression groups in a dynamic manner that
we do not observe using serial skin biopsies across 6–12 month
time interval. This would mean that cross-sectional studies of
patients would still capture all subsets while maintaining a weak
correlation to disease duration. We think this is unlikely because
serial biopsies are generally found in the same subset.
The final possibility is that the subset a patient stays in, and the
duration in which they remain, is dependent on many outside and
as yet poorly characterized factors. These could include environ-
mental stimuli that trigger an inflammatory response, or genetic
factors that determine the rate at which one progresses through
the mechanistic stages of SSc. It is possible that patients in each
intrinsic subset have a different set of predisposing genetic
polymorphisms or similar environmental triggers. This can only
be addressed if we can look for genetic risk factors in a cohort of
Fig. 7. Model of interactions among the components of the network. The molecular network of Fig. 4 is densely interconnected, implicatingmany possible interactions between the core molecular processes (interferon activation, M2 macrophage activation, adaptive immunity, ECMremodeling, and cell proliferation). Stepping back from the granular detail of single genes, we see a system of distinct parts through which SSc couldbe initiated and maintained. Among these are paths of particular interest. The interferon subnetwork and the M2 macrophage subnetwork areconnected by RAC2. The M2 macrophage subnetwork in turn is connected to the ECM subnetwork through paths through CD14 and THY1.Suggesting macrophages may influence or drive ECM abnormalities in skin. The interferon subnetwork and the ECM subnetwork are connectedthrough paths containing the pleiotropic and polymorphic gene PLAUR. The M2 macrophage subnetwork is connected to the adaptive immunitysubnetwork through several distinct sets of paths through the genes GRB10, LCP2, and CXCR4. The ECM subnetwork is connected to the cellproliferation cluster through TGFb pathway genes and paths containing the polymorphic genes IRAK1 and PXK, which suggests that ECM remodelingmodulates cell proliferation through the TGFb pathway. The interferon node may negatively regulate proliferation via the ERK/MAPK pathwayresulting in the general mutual exclusivity of the inflammatory and fibroproliferative subsets. Thus we see a set of interconnected, balancingfeedback loops that can enforce subset homeostasis, but also allow for patients to transition between the subsets, possibly in response to therapy.doi:10.1371/journal.pcbi.1004005.g007
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 15 January 2015 | Volume 11 | Issue 1 | e1004005
patients stratified by gene expression subset for genetic risk alleles.
There may be genetic risk factors that cause a patient to ‘‘stall’’ at
particular point along the progression from inflammatory to
proliferative to normal-like. Genetic modifiers of the molecular
links in the consensus gene network (Fig. 4) might hold the key to
showing why many patients go into spontaneous remission while
others experience rapid clinical progression, and indeed, our
network analysis suggests candidates for explaining this (Figs. 6, 7).
For example, IRAK1 and PXK are polymorphic genes that exist
on paths in the network between the TGFb/ECM network and
the cell proliferation network. This strongly argues for future
studies that test their possible roles in TGFb-modulated cell
proliferation, with particular attention to their roles in influencing
other serine-threonine kinases that modulate the cell cycle.
Associations with autoantibodiesThe presence of antinuclear autoantibodies in patient serum is a
widely used biomarker of SSc. To date the intrinsic subsets have
shown no clear association with autoantibody status [1,4,11],
which is consistent with a model by which the subsets represent
disease progression. Several genetic polymorphisms are associated
with autoantibody status (S9 Data file), including BLK and
BANK1, which are related to ACA- and ATA-positive SSc
respectively. These B cell proteins are already attractive candidates
for autoantibody production, as they are directly associated with
the cells that produce the antibodies, but our network analysis also
shows that they are functionally related to adaptive immune genes
that are highly expressed in the inflammatory subset.
Conclusions and limitationsA primary role of bioinformatics in complex diseases is to pare
down the possibilities to a coherent set of candidates for future
study. The many risk alleles for SSc each have modest odds ratio
and the final picture of SSc will likely lie in the interactionsbetween various risk factors, but the number of possible
interactions between these combinatorial factors is prohibitively
large. It is here that the network approach may be most useful in
delineating candidates for interaction studies. We might speculate,
for example, that SSc results from the presence of multiple,
functionally distinct alleles, but that it does not matter what gene is
mutated as long as the mutation has a particular functional
outcome. The predicted interactions in the network suggest which
alleles might be functionally related and which might be distinct
from each other, as the alleles either cluster within a subnetwork or
straddle the subnetworks.
This report is limited by our utilization of whole skin biopsies,
which are complex mixtures of cells, and in that the studies were
observational. The use of whole skin means that we cannot directly
ascribe gene expression to specific cell types. For example, we inferthat the M2 macrophage subnetwork is related to that cell type
based on the coherent expression of monocyte markers and
cytokines related to M2 polarization of macrophages. Our study is
therefore hypothesis generating. Mechanistic studies will be needed
to evaluate the existence of the molecular links suggested by the
network analysis.
Nevertheless, our analyses place the intrinsic subsets as a
possible readout of SSc pathology. The consensus gene expression
of the subsets implicates a number of molecular mechanisms that
have been associated with SSc and suggests functional roles for a
large fraction of the replicated SSc-associated polymorphisms. We
demonstrate that the core molecular processes of the inflammatory
and fibroproliferative subsets are molecularly connected to each
other. This suggests the possibility that SSc subsets may be
dynamic and interconnected.
Materials and Methods
Ethics statementThe analysis of prospectively collected human samples in this
study was approved by the Committee for the Protection of
Human Subjects at Dartmouth College (CPHS#16631) and by
the IRB review panel at Northwestern Feinberg School of
Medicine (STU00004428). All subjects in the study provided
written consent, which was approved by the IRB review panels at
Dartmouth College and Northwestern Feinberg School of
Medicine.
Patient informationThis study used data from three previously published cohorts
(Table 1). Each of the studies is available from NCBI GEO at the
following accession numbers: Milano et al. (GSE9285), Pender-
grass et al. (GSE32413) and Hinchcliff et al. (GSE45485). We used
an expanded version of the Hinchcliff dataset that contained an
additional 12 SSc patients, 1 healthy control and 1 morphea
patient beyond what was included in Hinchcliff et al.[4]
(GSE59785).
Each of the three study cohorts contained patients with SSc
defined using the 1980 ACR criteria. Specifically, all patients met
the American College of Rheumatology classification criteria for
SSc [75] and were further characterized as the diffuse (dSSc), or
the limited (lSSc) subsets. Limited SSc patients had 3 of the 5
features of CREST syndrome, or had Raynaud’s phenomenon
with abnormal nail fold capillaries and scleroderma-specific
autoantibodies.
Preprocessing and clustering of microarray dataAll three studies used Agilent Technologies 44,000 element
DNA microarrays representing the full human genome. All
samples were processed and all microarrays hybridized in the
Whitfield lab providing consistency between the datasets. The
DNA probes between these datasets are identical and thus were
indexed using the same probe identifiers allowing direct mapping
from one data set to another without significant loss of data.
Microarray data from each cohort were Log2 Lowess-normalized
and only spots with mean fluorescent signal at least 1.5 greater
than median local background in Cy3- or Cy5- channels were
included in the analysis. Genes with less than 80% good data were
excluded. Since a common reference experimental design was
used for all cohorts, each probe was centered on its median value
across all arrays. Data were multiplied by -1 to convert them to
Log2(Cy3/Cy5) ratios.
The three cohorts were clustered into coexpression modules
using the WGCNA procedure. We used the WGCNA R package
available on the Comprehensive R Archive Network (http://cran.
r-project.org) and described in [10]. We used the default
parameters for running the software except that we used the
‘‘signed’’ network option and a soft thresholding parameter d = 12.
These parameters are described in depth in [10,15]. Genes that
were classified as outliers were discarded from further analysis.
The information graph and consensus clustersTo each pair of modules from different datasets we associate an
overlap score W. Specifically, if Ci is a module in, say, Milano et
al. and Cj a module in Pendergrass et al., then we define
Wij~DCi
TCj D
Nlog
DCi
TCj DN
DCi DDCj D
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 16 January 2015 | Volume 11 | Issue 1 | e1004005
where N is the total number of genes in the genome. The W-scores
can be interpreted as edge weights in a module-module network
(the information graph). This network encodes the mutual
information between the WGCNA-derived genomic partitions.
We computed the W-scores between each pair of modules across
all three datasets by the above formula and set the small and
negative W-scores below a threshold to zero (S4 Fig.). A
mathematical derivation of the relationship between the W-scores
and mutual information and a detailed description of the
thresholding procedure are available in the supporting information
(S1 Text). The resulting 3-partite information graph was mined for
consensus clusters.
Since triangles in the information graph represent a module
conserved across all three datasets, we clustered the information
graph using a variant of triangle percolation [17], which is a
community detection procedure designed to find sets of modules
that are members of many triangles together. Specifically, from the
information graph we constructed an auxiliary graph, called the
triangle graph, and detected communities in the triangle graph by
greedy modularity maximization [76]. A description of the
construction of the triangle graph is available in the supporting
information (S1 Text).
We define a final consensus cluster as all of the genes that are
contained in a module from the community for each of the three
data sets community. Note that triangle percolation allows for
overlapping communities in the underlying information graph.
For example, the inflammatory consensus cluster and the
keratinocyte consensus cluster overlap by one module (Fig. 3A).
This is one of MICC’s strengths because it does not require a
whole module from one dataset to be associated with only one
consensus cluster. To derive a gene set associated to the consensus
cluster, we took all modules within that community, computed
their unions within their dataset, and then computed their
intersection across datasets. In symbols, let Comm denote a set of
modules that form a community in the information graph (e.g. the
dotted circles Fig. 1B and the colored nodes of Fig. 3A). Let
MComm, PComm, and HComm denote respectively the sets of Milano,
Pendergrass and Hinchcliff modules within Comm. Let m, p, and hdenote modules in the Milano, Pendergrass, and Hinchcliff data
sets respectively; note that these are sets of genes. We associate a
gene set CCComm with the community Comm through the
following formula:
CCComm~[
m[MComm
m
0@
1A\ [
p[PComm
p
0@
1A\ [
h[HComm
h
0@
1A
We call CCComm the consensus cluster associated with thecommunity Comm and it consists of all genes that are present in
a module from each data set within the community. The elements
of CCComm are the consensus genes. It is clear by definition that the
consensus clusters are nonoverlapping even though communities
can share modules. This is because a gene needs to be present in a
module in the community from each of the three data sets. Since
modules do not overlap within data sets, consensus clusters cannot
either.
Statistical tests for subset specificityTo determine if a WGCNA-derived module was significantly
differentially regulated in a subset, we performed one-tailed
Wilcoxon rank sum tests. Specifically, we computed the module
eigengene of each module by first normalizing the gene expression
so that each gene expression vector had Euclidean length 1. The
module eigengene is the first principal component of the
normalized gene expression vectors within the WGCNA module.
The module eigengene is a one-dimensional summary score for
the module’s gene expression across all biopsies. To determine if
the module was significantly up- or down-regulated in a particular
subset, we determined if the median of the module eigengene for
that subset was above or below that of the whole population, and
then performed a one-tailed Wilcoxon rank sum test to determine
the significance of the median being above or below that of the
population as a whole. We used the subset assignments reported in
the previous papers describing these datasets [1,4,11]. We used
Bonferroni corrections for multiple comparisons. There were 178
modules in total across the datasets. In Table 2, we corrected for
17863 tests for each of the subset-specificity tests. In Fig. 3B, we
corrected for 17864 tests because we included tests for all non-
normal-like SSc versus normal-like SSc and healthy controls (see
also S4 Data file).
The IMP Bayesian networkThe IMP Bayesian network is available through an online
interface at (http://imp.princeton.edu). To build our network, we
queried IMP with four gene sets: inflammatory and fibroproli-
ferative consensus genes derived from the consensus clusters, SSc-
associated polymorphisms (as described below), and the four gene
MRSS biomarker reported in [5]. IMP provides export of the
subnetwork corresponding to the query genes as a weighted edge
list (a three-column table indicating which genes are connected
and with what probability). IMP automatically thresholds the
probabilities at 0.5 and exports the network with up to an
additional 50 genes that provide extra context for the query genes.
In our case, the 50 genes were predominantly cell cycle genes. This
is probably because the cell cycle is heavily studied in the
microarray compendium from which IMP was built. In that case,
IMP would be highly confident about predicting interactions
between the fibroproliferative genes and other cell cycle genes.
We developed in-house Matlab and R scripts to transform the
edge list data into the Graph Exchange Format (gexf), which
allows for manipulation in Gephi, an open source network
visualization program [77]. Data files S7-S8 contain post-
processed networks and Data files S10-S11 provide R data and
code snippets for manipulating the network programmatically.
Genetic polymorphisms for network analysisWe collected genes with SSc-associated polymorphisms from
the literature and curated them according to the following criteria.
We included polymorphic genes that were reported in genome-
wide association studies of SSc [78–82], from a recent study using
the Immunochip platform [83] and from case-control candidate
gene studies that were replicated in at least one other study
[54,84–105]. This resulted in a list of 41 polymorphic genes (S6
Data file).
Supporting Information
S1 Fig Consensus genes are enriched for coexpression hubs. The
consensus genes reported by MICC are more correlated to their
module eigengene (red density) than is typical for an arbitrary
gene-eigengene correlation (blue density). Genes are compared
only to their module eigengene, i.e. to the hub that the gene is
closest to in the coexpression network. Note the evident shift in the
red density toward 1, which is perfect correlation, indicating that
consensus genes are more ‘‘hub-like’’.
(TIF)
S2 Fig Adjacency matrix for the triangle graph. The triangle
graph is a weighted graph whose nodes are triangles in the
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 17 January 2015 | Volume 11 | Issue 1 | e1004005
information graph and whose edges indicate that the correspond-
ing triangles in the information graph share and edge. (A) The
weighted adjacency matrix for the triangle graph. Rows and
columns of the adjacency matrix are indexed by nodes of the
triangle graph (i.e. by triangles in the information graph). The
rows and columns of the matrix are sorted according to
community order. Note the distinct block structure of the
matrix indicating that the underlying graph is highly modular.
(B) The same matrix, but unweighted so that the matrix
contains only 0’s and 1’s (blue and red cells in the matrix,
respectively) indicating that the nodes are either connected or
disconnected. This aids in the visualization of the community
structure of the graph (block structure of the matrix), although
community detection was performed on the weighted triangle
graph.
(TIF)
S3 Fig Schematic for building consensus gene sets. To each
community (1) in the information graph we associate a consensus
gene set by (2) computing the union of modules within a data set
and then (3) computing the intersection across data sets.
(TIF)
S4 Fig Construction of the information graph. (A) Three pairs of
partitions of a 12-element set and their associated bipartite
information graphs. Edge width denotes the size of the W-score for
a pair of modules. Dotted edges represent negative W-scores. The
highest possible mutual information occurs when modules are
perfectly conserved. The information graph is disconnected with
edges denoting the mapping between conserved modules. In the
intermediate case, modules break into pieces that are reassorted
among each other. The information graph here has strong
community structure, but is not completely disconnected. The low
mutual information case occurs when the partitions labels are
random with respect to each other. In this case, all edges are small
and are partially cancelled by the negative edges also present in the
graph. (B,C) W-scores are calculated for each pair of modules; in
this case one from Milano and one from Pendergrass. (B) Most W-
scores are small in absolute value (blue histogram; logarithm of
density), while their distribution has a right tail of significantly
large scores. We can threshold the small and negative W-scores by
keeping only those scores that contribute positively to the total
mutual information (red curve; x-intercept). The sum of all W-
scores is the total mutual information between the Milano and
Pendergrass genomic partitions (dashed blue horizontal line). (C)
The W-scores are positively correlated with the size of the overlap
between gene clusters, but the relationship is not perfect. The W-
score threshold is shown by a dotted blue vertical line and the
overlaps that exceed the threshold are plotted in red. In particular,
note that there are relatively large overlaps that fail to meet the
threshold. Likewise, there are relatively small overlaps that have
high W-scores.
(TIF)
S1 Text Additional mathematical details about the MICC
method and glossary of keywords used in main text.
(PDF)
S1 Data file WGCNA clustered PCL file for Milano skin data.
(ZIP)
S2 Data file WGCNA clustered PCL file for Pendergrass skin
data.
(ZIP)
S3 Data file WGCNA clustered PCL file for Hinchcliff skin
data.
(ZIP)
S4 Data file Table of p-values for modules in each dataset
(includes module sizes).
(XLSX)
S5 Data file Full output of g:Profiler for consensus clusters.
(XLS)
S6 Data file Complete list of polymorphic genes used in this
study.
(XLSX)
S7 Data file Molecular network plotting file GEXF format.
(GEXF)
S8 Data file Molecular network plotting file Gephi format.
(ZIP)
S9 Data file Molecular network plotted in PDF (text searchable
for genes).
(PDF)
S10 Data file R data for programmatic access to network
(iGraph format).
(ZIP)
S11 Data file R code snippet demonstrating reading and writing
graphs from R to GEXF format.
(R)
Author Contributions
Conceived and designed the experiments: JMM MLW. Performed the
experiments: TAW MEH. Analyzed the data: JMM JT VM TAW CSG.
Contributed reagents/materials/analysis tools: MEH. Wrote the paper:
JMM PAP MEH MLW.
References
1. Milano A, Pendergrass SA, Sargent JL, George LK, McCalmont TH, et al.
(2008) Molecular subsets in the gene expression signatures of scleroderma skin.
PLoS ONE 3: e2696.
2. Whitfield ML, George LK, Grant GD, Perou CM (2006) Common markers of
proliferation. Nat Rev Cancer 6: 99–106.
3. Steen VD, Medsger TA (2007) Changes in causes of death in systemic sclerosis,
1972–2002. Ann Rheum Dis 66: 940–944.
4. Hinchcliff M, Huang C-C, Wood TA, Matthew Mahoney J, Martyanov V, et
al. (2013) Molecular Signatures in Skin Associated with Clinical Improvement
during Mycophenolate Treatment in Systemic Sclerosis. J Invest Dermatol: -.
5. Farina G, Lafyatis D, Lemaire R, Lafyatis R (2010) A four-gene biomarker
predicts skin disease in patients with diffuse cutaneous systemic sclerosis.
Arthritis Rheum 62: 580–588.
6. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display
of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868.
7. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, et al. (2002)
Identification of genes periodically expressed in the human cell cycle and their
expression in tumors. Mol Biol Cell 13: 1977–2000.
8. Barabasi AL (2007) Network medicine—from obesity to the ‘‘diseasome’’.
N Engl J Med 357: 404–407.
9. Barabasi AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-
based approach to human disease. Nat Rev Genet 12: 56–68.
10. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted
correlation network analysis. BMC Bioinformatics 9: 559.
11. Pendergrass SA, Lemaire R, Francis IP, Mahoney JM, Lafyatis R, et al. (2012)
Intrinsic gene expression subsets of diffuse cutaneous systemic sclerosis are
stable in serial skin biopsies. J Invest Dermatol 132: 1363–1373.
12. Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for
combining multiple partitions. The Journal of Machine Learning Research 3:
583–617.
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 18 January 2015 | Volume 11 | Issue 1 | e1004005
13. Monti ST, Pablo; Mesirov, Jill; Golub, Todd (2003) Consensus clustering: aresampling-based method for class discovery and visualization of gene
expression microarray data. Machine learning 52.
14. Cover TM, Thomas JA (2012) Elements of information theory: Wiley-
interscience.
15. Horvath S, Dong J (2008) Geometric interpretation of gene coexpressionnetwork analysis. PLoS Comput Biol 4: e1000117.
16. Whitfield ML, Finlay DR, Murray JI, Troyanskaya OG, Chi JT, et al. (2003)Systemic and cell type-specific gene expression patterns in scleroderma skin.
Proc Natl Acad Sci U S A 100: 12319–12324.
17. Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping
community structure of complex networks in nature and society. Nature 435:
814–818.
18. Reimand J, Arak T, Vilo J (2011) g:Profiler-a web server for functional
interpretation of gene lists (2011 update). Nucleic Acids Res 39: W307–315.
19. Johnson ME, Mahoney JM, Taroni J, Sargent JL, Marmarelis E, et al. (in press)
Experimentally-derived fibroblast gene signatures identify molecular pathwaysassociated with distinct subsets of systemic sclerosis patients in three
independent cohorts. PLoS ONE, 10.1371/journal.pone.0114017.
20. Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, et al. (2012) IMP: amulti-species functional genomics portal for integration, visualization and
prediction of protein functions and networks. Nucleic Acids Res 40: W484–490.
21. Scala E, Pallotta S, Frezzolini A, Abeni D, Barbieri C, et al. (2004) Cytokine
and chemokine levels in systemic sclerosis: relationship with cutaneous andinternal organ involvement. Clin Exp Immunol 138: 540–546.
22. Riccieri V, Rinaldi T, Spadaro A, Scrivo R, Ceccarelli F, et al. (2003)Interleukin-13 in systemic sclerosis: relationship to nailfold capillaroscopy
abnormalities. Clin Rheumatol 22: 102–106.
23. Auffray C, Fogg D, Garfa M, Elain G, Join-Lambert O, et al. (2007)
Monitoring of blood vessels and tissues by a population of monocytes with
patrolling behavior. Science 317: 666–670.
24. Katakura T, Miyazaki M, Kobayashi M, Herndon DN, Suzuki F (2004)
CCL17 and IL-10 as effectors that enable alternatively activated macrophagesto inhibit the generation of classically activated macrophages. J Immunol 172:
1407–1413.
25. Weis N, Weigert A, von Knethen A, Brune B (2009) Heme oxygenase-1
contributes to an alternative macrophage activation profile induced by
apoptotic cell supernatants. Mol Biol Cell 20: 1280–1288.
26. Higashi-Kuwata N, Makino T, Inoue Y, Takeya M, Ihn H (2009) Alternatively
activated macrophages (M2 macrophages) in the skin of patient with localizedscleroderma. Exp Dermatol 18: 727–729.
27. Gordon S, Martinez FO (2010) Alternative activation of macrophages:mechanism and functions. Immunity 32: 593–604.
28. Mosser DM, Edwards JP (2008) Exploring the full spectrum of macrophage
activation. Nat Rev Immunol 8: 958–969.
29. Atamas SP, White B (2003) Cytokine regulation of pulmonary fibrosis in
scleroderma. Cytokine Growth Factor Rev 14: 537–550.
30. Zhou W, Zhang F, Aune TM (2003) Either IL-2 or IL-12 is sufficient to direct
Th1 differentiation by nonobese diabetic T cells. J Immunol 170: 735–740.
31. Gollob JA, Murphy EA, Mahajan S, Schnipper CP, Ritz J, et al. (1998) Altered
interleukin-12 responsiveness in Th1 and Th2 cells is associated with the
differential activation of STAT5 and STAT1. Blood 91: 1341–1354.
32. Naeger LK, McKinney J, Salvekar A, Hoey T (1999) Identification of a
STAT4 binding site in the interleukin-12 receptor required for signaling. J BiolChem 274: 1875–1878.
33. O’Shea JJ, Lahesmaa R, Vahedi G, Laurence A, Kanno Y (2011) Genomicviews of STAT function in CD4+ T helper cell differentiation. Nat Rev
Immunol 11: 239–250.
34. Mackay IR (2009) Clustering and commonalities among autoimmune diseases.J Autoimmun 33: 170–177.
35. de Paus RA, Geilenkirchen MA, van Riet S, van Dissel JT, van de Vosse E(2013) Differential expression and function of human IL-12Rbeta2 polymor-
phic variants. Mol Immunol 56: 380–389.
36. Yelo E, Bernardo MV, Gimeno L, Alcaraz-Garcia MJ, Majado MJ, et al.
(2008) Dock10, a novel CZH protein selectively induced by interleukin-4 in
human B lymphocytes. Mol Immunol 45: 3411–3418.
37. Xu Y, Harder KW, Huntington ND, Hibbs ML, Tarlinton DM (2005) Lyn
tyrosine kinase: accentuating the positive and the negative. Immunity 22: 9–18.
38. Silver K, Bouriez-Jones T, Crockford T, Ferry H, Tang HL, et al. (2006)
Spontaneous class switching and B cell hyperactivity increase autoimmunity
against intracellular self antigen in Lyn-deficient mice. Eur J Immunol 36:2920–2927.
39. Hata A, Sabe H, Kurosaki T, Takata M, Hanafusa H (1994) Functionalanalysis of Csk in signal transduction through the B-cell antigen receptor. Mol
Cell Biol 14: 7306–7313.
40. Manjarrez-Orduno N, Marasco E, Chung SA, Katz MS, Kiridly JF, et al.
(2012) CSK regulatory polymorphism is associated with systemic lupus
erythematosus and influences B-cell signaling and activation. Nat Genet 44:1227–1230.
41. Levinson NM, Seeliger MA, Cole PA, Kuriyan J (2008) Structural basis for therecognition of c-Src by its inactivator Csk. Cell 134: 124–134.
42. Vang T, Liu WH, Delacroix L, Wu S, Vasile S, et al. (2012) LYP inhibits T-cellactivation when dissociated from CSK. Nat Chem Biol 8: 437–446.
43. Fiorillo E, Orru V, Stanford SM, Liu Y, Salek M, et al. (2010) Autoimmune-
associated PTPN22 R620W variation reduces phosphorylation of lymphoid
phosphatase on an inhibitory tyrosine residue. J Biol Chem 285: 26506–26518.
44. Burn GL, Svensson L, Sanchez-Blanco C, Saini M, Cope AP (2011) Why is
PTPN22 a good candidate susceptibility gene for autoimmune disease? FEBS
Lett 585: 3689–3698.
45. Yamamoto K, Yamaguchi M, Miyasaka N, Miura O (2003) SOCS-3 inhibits
IL-12-induced STAT4 activation by binding through its SH2 domain to the
STAT4 docking site in the IL-12 receptor beta2 subunit. Biochem Biophys Res
Commun 310: 1188–1193.
46. Varga J, Whitfield ML (2009) Transforming growth factor-beta in systemic
sclerosis (scleroderma). Front Biosci (Schol Ed) 1: 226–235.
47. Katsumoto TR, Whitfield ML, Connolly MK (2011) The pathogenesis of
systemic sclerosis. Annu Rev Pathol 6: 509–537.
48. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, et al. (2000)
Molecular portraits of human breast tumours. Nature 406: 747–752.
49. Liu H, Thaker YR, Stagg L, Schneider H, Ladbury JE, et al. (2013) SLP-76
sterile alpha motif (SAM) and individual H5 alpha helix mediate oligomer
formation for microclusters and T-cell activation. J Biol Chem 288: 29539–
29549.
50. Rege TA, Hagood JS (2006) Thy-1, a versatile modulator of signaling affecting
cellular adhesion, proliferation, survival, and cytokine/growth factor responses.
Biochim Biophys Acta 1763: 991–999.
51. Cohen PY, Breuer R, Wallach-Dayan SB (2009) Thy1 up-regulates FasL
expression in lung myofibroblasts via Src family kinases. Am J Respir Cell Mol
Biol 40: 231–238.
52. Phipps RP, Penney DP, Keng P, Quill H, Paxhia A, et al. (1989)
Characterization of two major populations of lung fibroblasts: distinguishing
morphology and discordant display of Thy 1 and class II MHC. Am J Respir
Cell Mol Biol 1: 65–74.
53. Yamamoto A, Ashihara E, Nakagawa Y, Obayashi H, Ohta M, et al. (2011)
Allograft inflammatory factor-1 is overexpressed and induces fibroblast
chemotaxis in the skin of sclerodermatous GVHD in a murine model.
Immunol Lett 135: 144–150.
54. Manetti M, Allanore Y, Revillod L, Fatini C, Guiducci S, et al. (2011) A genetic
variation located in the promoter region of the UPAR (CD87) gene is
associated with the vascular complications of systemic sclerosis. Arthritis
Rheum 63: 247–256.
55. Saris CG, Horvath S, van Vught PW, van Es MA, Blauw HM, et al. (2009)
Weighted gene co-expression network analysis of the peripheral blood from
Amyotrophic Lateral Sclerosis patients. BMC Genomics 10: 405.
56. Farber CR, Bennett BJ, Orozco L, Zou W, Lira A, et al. (2011) Mouse genome-
wide association and systems genetics identify Asxl2 as a regulator of bone
mineral density and osteoclastogenesis. PLoS Genet 7: e1002038.
57. Horvath S, Nazmul-Hossain AN, Pollard RP, Kroese FG, Vissink A, et al.
(2012) Systems analysis of primary Sjogren’s syndrome pathogenesis in salivary
glands identifies shared pathways in human and a mouse model. Arthritis Res
Ther 14: R238.
58. Lemaire R, Farina G, Kissin E, Shipley JM, Bona C, et al. (2004) Mutant
fibrillin 1 from tight skin mice increases extracellular matrix incorporation of
microfibril-associated glycoprotein 2 and type I collagen. Arthritis Rheum 50:
915–926.
59. Gerber EE, Gallo EM, Fontana SC, Davis EC, Wigley FM, et al. (2013)
Integrin-modulating therapy prevents fibrosis and autoimmunity in mouse
models of scleroderma. Nature.
60. Tan FK, Wang N, Kuwana M, Chakraborty R, Bona CA, et al. (2001)
Association of fibrillin 1 single-nucleotide polymorphism haplotypes with
systemic sclerosis in Choctaw and Japanese populations. Arthritis Rheum 44:
893–901.
61. Tan FK, Tercero GM, Arnett FC, Wang N, Chakraborty R (2003)
Examination of the possible role of biologically relevant genes around FBN1
in systemic sclerosis in the Choctaw population. Arthritis Rheum 48: 3295–
3296.
62. Zhou X, Tan FK, Wang N, Xiong M, Maghidman S, et al. (2003) Genome-
wide association study for regions of systemic sclerosis susceptibility in a
Choctaw Indian population with high disease prevalence. Arthritis Rheum 48:
2585–2592.
63. Alkassab F, Gourh P, Tan FK, McNearney T, Fischbach M, et al. (2007) An
allograft inflammatory factor 1 (AIF1) single nucleotide polymorphism (SNP) is
associated with anticentromere antibody positive systemic sclerosis. Rheuma-
tology (Oxford) 46: 1248–1251.
64. Sibinga NE, Feinberg MW, Yang H, Werner F, Jain MK (2002) Macrophage-
restricted and interferon gamma-inducible expression of the allograft
inflammatory factor-1 gene requires Pu.1. J Biol Chem 277: 16202–16210.
65. Tian Y, Jain S, Kelemen SE, Autieri MV (2009) AIF-1 expression regulates
endothelial cell activation, signal transduction, and vasculogenesis. Am J Phy-
siol, Cell Physiol 296: C256–266.
66. Greenblatt MB, Sargent JL, Farina G, Tsang K, Lafyatis R, et al. (2012)
Interspecies comparison of human and murine scleroderma reveals IL-13 and
CCL2 as disease subset-specific targets. Am J Pathol 180: 1080–1094.
67. Mantovani A, Garlanda C, Locati M (2009) Macrophage diversity and
polarization in atherosclerosis: a question of balance. Arterioscler Thromb
Vasc Biol 29: 1419–1423.
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 19 January 2015 | Volume 11 | Issue 1 | e1004005
68. Mantovani A, Sica A, Sozzani S, Allavena P, Vecchi A, et al. (2004) The
chemokine system in diverse forms of macrophage activation and polarization.Trends Immunol 25: 677–686.
69. Arai M, Ikawa Y, Chujo S, Hamaguchi Y, Ishida W, et al. (2013) Chemokine
receptors CCR2 and CX3CR1 regulate skin fibrosis in the mouse model ofcytokine-induced systemic sclerosis. J Dermatol Sci 69: 250–258.
70. Roca H, Varsos ZS, Sud S, Craig MJ, Ying C, et al. (2009) CCL2 andinterleukin-6 promote survival of human CD11b+ peripheral blood mononu-
clear cells and induce M2-type macrophage polarization. J Biol Chem 284:
34342–34354.71. Saito E, Fujimoto M, Hasegawa M, Komura K, Hamaguchi Y, et al. (2002)
CD19-dependent B lymphocyte signaling thresholds influence skin fibrosis andautoimmunity in the tight-skin mouse. J Clin Invest 109: 1453–1462.
72. Opitz C, Klein-Weigel PF, Riemekasten G (2011) Systemic sclerosis - asystematic overview: part 2 - immunosuppression, treatment of SSc-associated
vasculopathy, and treatment of pulmonary arterial hypertension. VASA 40:
20–30.73. Arron ST, Dimon MT, Li Z, Johnson ME, T AW, et al. (2014) High
Rhodotorula sequences in skin transcriptome of patients with diffuse systemicsclerosis. J Invest Dermatol 134: 2138–2145.
74. Farina A, Cirone M, York M, Lenna S, Padilla C, et al. (2014) Epstein-Barr
virus infection induces aberrant TLR activation pathway and fibroblast-myofibroblast conversion in scleroderma. J Invest Dermatol 134: 954–964.
75. Committee. SfSCotARADaTC (1980) Preliminary criteria for the classificationof systemic sclerosis (scleroderma). 23: 581–590.
76. Newman MEJ (2006) Modularity and community structure in networks.Proceedings of the National Academy of Sciences 103: 8577–8582.
77. Bastian M, Heymann S, Jacomy M. Gephi: an open source software for
exploring and manipulating networks.; 2009.78. Radstake TR, Gorlova O, Rueda B, Martin JE, Alizadeh BZ, et al. (2010)
Genome-wide association study of systemic sclerosis identifies CD247 as a newsusceptibility locus. Nat Genet 42: 426–429.
79. Gorlova O, Martin JE, Rueda B, Koeleman BP, Ying J, et al. (2011)
Identification of novel genetic markers associated with clinical phenotypes ofsystemic sclerosis through a genome-wide association strategy. PLoS Genet 7:
e1002178.80. Allanore Y, Saad M, Dieude P, Avouac J, Distler JH, et al. (2011) Genome-
wide scan identifies TNIP1, PSORS1C1, and RHOB as novel risk loci forsystemic sclerosis. PLoS Genet 7: e1002091.
81. Martin JE, Broen JC, Carmona FD, Teruel M, Simeon CP, et al. (2012)
Identification of CSK as a systemic sclerosis genetic risk factor throughGenome Wide Association Study follow-up. Hum Mol Genet 21: 2825–2835.
82. Martin JE, Assassi S, Diaz-Gallo LM, Broen JC, Simeon CP, et al. (2013) Asystemic sclerosis and systemic lupus erythematosus pan-meta-GWAS reveals
new shared susceptibility loci. Hum Mol Genet 22: 4021–4029.
83. Mayes MD, Bossini-Castillo L, Gorlova O, Martin JE, Zhou X, et al. (2014)Immunochip analysis identifies multiple susceptibility loci for systemic sclerosis.
Am J Hum Genet 94: 47–61.84. Rueda B, Gourh P, Broen J, Agarwal SK, Simeon C, et al. (2010) BANK1
functional variants are associated with susceptibility to diffuse systemic sclerosisin Caucasians. Ann Rheum Dis 69: 700–705.
85. Gourh P, Agarwal SK, Martin E, Divecha D, Rueda B, et al. (2010)
Association of the C8orf13-BLK region with systemic sclerosis in North-American and European populations. J Autoimmun 34: 155–162.
86. Manetti M, Allanore Y, Saad M, Fatini C, Cohignac V, et al. (2012) Evidencefor caveolin-1 as a new susceptibility gene regulating tissue fibrosis in systemic
sclerosis. Ann Rheum Dis 71: 1034–1041.
87. Koumakis E, Bouaziz M, Dieude P, Ruiz B, Riemekasten G, et al. (2013) Aregulatory variant in CCR6 is associated with susceptibility to antitopoisome-
rase-positive systemic sclerosis. Arthritis Rheum 65: 3202–3208.
88. Dieude P, Guedj M, Truchetet ME, Wipff J, Revillod L, et al. (2011)
Association of the CD226 Ser(307) variant with systemic sclerosis: evidence of acontribution of costimulation pathways in systemic sclerosis pathogenesis.
Arthritis Rheum 63: 1097–1105.
89. Hoshino K, Satoh T, Kawaguchi Y, Kuwana M (2011) Association ofhepatocyte growth factor promoter polymorphism with severity of interstitial
lung disease in Japanese patients with systemic sclerosis. Arthritis Rheum 63:2465–2472.
90. Diaz-Gallo LM, Simeon CP, Broen JC, Ortego-Centeno N, Beretta L, et al.
(2013) Implication of IL-2/IL-21 region in systemic sclerosis geneticsusceptibility. Ann Rheum Dis 72: 1233–1238.
91. Bossini-Castillo L, Martin JE, Broen J, Gorlova O, Simeon CP, et al. (2012) AGWAS follow-up study reveals the association of the IL12RB2 gene with
systemic sclerosis in Caucasian populations. Hum Mol Genet 21: 926–933.92. Dieude P, Bouaziz M, Guedj M, Riemekasten G, Airo P, et al. (2011) Evidence
of the contribution of the X chromosome to systemic sclerosis susceptibility:
association with the functional IRAK1 196Phe/532Ser haplotype. ArthritisRheum 63: 3979–3987.
93. Dieude P, Guedj M, Wipff J, Avouac J, Fajardy I, et al. (2009) Associationbetween the IRF5 rs2004640 functional polymorphism and systemic sclerosis: a
new perspective for pulmonary fibrosis. Arthritis Rheum 60: 225–233.
94. Sharif R, Mayes MD, Tan FK, Gorlova OY, Hummers LK, et al. (2012) IRF5polymorphism predicts prognosis in patients with systemic sclerosis. Ann
Rheum Dis 71: 1197–1202.95. Dieude P, Dawidowicz K, Guedj M, Legrain Y, Wipff J, et al. (2010)
Phenotype-haplotype correlation of IRF5 in systemic sclerosis: role of 2haplotypes in disease severity. J Rheumatol 37: 987–992.
96. Carmona FD, Gutala R, Simeon CP, Carreira P, Ortego-Centeno N, et al.
(2012) Novel identification of the IRF7 region as an anticentromereautoantibody propensity locus in systemic sclerosis. Ann Rheum Dis 71:
114–119.97. Carmona FD, Simeon CP, Beretta L, Carreira P, Vonk MC, et al. (2011)
Association of a non-synonymous functional variant of the ITGAM gene with
systemic sclerosis. Ann Rheum Dis 70: 2050–2052.98. Anaya JM, Kim-Howard X, Prahalad S, Chernavsky A, Canas C, et al. (2012)
Evaluation of genetic association between an ITGAM non-synonymous SNP(rs1143679) and multiple autoimmune diseases. Autoimmun Rev 11: 276–280.
99. Coustet B, Agarwal SK, Gourh P, Guedj M, Mayes MD, et al. (2011)Association study of ITGAM, ITGAX, and CD58 autoimmune risk loci in
systemic sclerosis: results from 2 large European Caucasian cohorts.
J Rheumatol 38: 1033–1038.100. Wu SP, Leng L, Feng Z, Liu N, Zhao H, et al. (2006) Macrophage migration
inhibitory factor promoter polymorphisms and the clinical expression ofscleroderma. Arthritis Rheum 54: 3661–3669.
101. Diaz-Gallo LM, Gourh P, Broen J, Simeon C, Fonollosa V, et al. (2011)
Analysis of the influence of PTPN22 gene polymorphisms in systemic sclerosis.Ann Rheum Dis 70: 454–462.
102. Rueda B, Broen J, Simeon C, Hesselstrand R, Diaz B, et al. (2009) The STAT4gene influences the genetic predisposition to systemic sclerosis phenotype. Hum
Mol Genet 18: 2071–2077.103. Broen JC, Bossini-Castillo L, van Bon L, Vonk MC, Knaapen H, et al. (2012) A
rare polymorphism in the gene for Toll-like receptor 2 is associated with
systemic sclerosis phenotype and increases the production of inflammatorymediators. Arthritis Rheum 64: 264–271.
104. Dieude P, Guedj M, Wipff J, Ruiz B, Riemekasten G, et al. (2010) Associationof the TNFAIP3 rs5029939 variant with systemic sclerosis in the European
Caucasian population. Ann Rheum Dis 69: 1958–1964.
105. Gourh P, Arnett FC, Tan FK, Assassi S, Divecha D, et al. (2010) Association ofTNFSF4 (OX40L) polymorphisms with susceptibility to systemic sclerosis. Ann
Rheum Dis 69: 550–555.
Systems Level Analysis of Systemic Sclerosis
PLOS Computational Biology | www.ploscompbiol.org 20 January 2015 | Volume 11 | Issue 1 | e1004005