ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND DISEASE-
RELATED PHENOTYPES
A DISSERTATION
SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Chirag Jagdish Patel
August 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/mg775gw7130
© 2011 by Chirag Jagdish Patel. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Atul Butte, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jayanta Bhattacharya
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mark Cullen
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
ABSTRACT
Common diseases arise out of a combination of both genetic and environmental
influences. Advances in genomic technology have enabled investigators to
create hypotheses regarding the contribution of genetic factors at a
breathtaking pace. However, the assessment of multiple and specific
environmental factors, and of their interactions with the genome, has not kept pace. We
lack high-throughput analytic methodologies to comprehensively and
systematically associate multiple physical and specific environmental factors,
or the “envirome”, to disease and human health.
We claim that the creation of hypotheses regarding the environmental
contribution to disease is practicable through high-throughput analytic methods
that have been well established in genomics. In the following dissertation, we
develop and apply methods to systematically and comprehensively associate
specific factors of the envirome with disease states, prioritizing factors for in-
depth future study.
The current disciplines of studying the environmental determinants of health
include toxicology and epidemiology, which operate on molecular and
population scales, respectively. This dissertation proposes approaches in both
of these disciplines. For example, we have developed a framework to conduct
the first “Environment-wide Association Study” (EWAS), systematically
associating environmental factors to disease on a population scale. We have
applied this framework to investigate type 2 diabetes and heart disease on
cohorts that are representative of the United States population, finding novel and
robust associations in diverse and independent cohorts. Given the lack of
explained risk resulting from current day genome-wide studies, the time is ripe
to usher in a more comprehensive study of the environment, or “enviromics”,
toward better understanding of multifactorial diseases and their prevention.
ACKNOWLEDGEMENTS
Foremost, I thank my advisor, Dr. Atul Butte, for his undying confidence,
inspiration, and guidance. Even just three years ago, it was far from my belief
that the scientist whom I admired from afar would eventually take me on as a
student and teach me how to compute, see, and enlighten. For Dr. Atul Butte’s
supervision I am forever indebted and most fortunate.
I am also indebted to my dissertation committee, Drs. Jay Bhattacharya, Mark
Cullen, John Ioannidis, and Robert Tibshirani. Much of this work has come
out of discussions with these individuals and it is inspired by and stands on
their fundamental teachings. I thank my academic advisors, Drs. Mark Musen
and Betty Cheng, for encouraging me to keep taking courses that enabled this
work.
I thank my many friends and colleagues in the Butte Laboratory and in the
Biomedical Informatics program whom I continue to look up to and draw
inspiration from. I feel honored and privileged to be among you. In particular,
I thank Dr. Rong Chen, Alex Morgan, Joel Dudley, and Nick Tatonetti for
providing support and encouragement when it was least expected but most
needed.
From teaching me how to read and write to gifting me the newest
computers, I thank my parents, Neela and Jagdish Patel. I will always be
grateful to them for initiating this most rewarding journey of lifelong learning.
I thank my brother, Ankur Patel, for his unflagging support and faith through
thick and thin.
I thank my in-laws, Tapan and Kokila Chaudhuri, for their support and
encouragement.
I do not have the words to thank my partner in life, Trina Chaudhuri. I hope
that I can some day enable her to achieve her aspirations as she has done for
me.
I am grateful to the National Library of Medicine and Applied Biosystems, Inc.
for financial support. I thank the Centers for Disease Control and Prevention
(CDC), the National Center for Health Statistics (NCHS), and the staff and
individuals who take part in the National Health and Nutrition Examination
Survey (NHANES). In particular, I thank Vijay Gambhir and Peter Meyer of
the CDC/NCHS for their support in accessing and processing NHANES
restricted genetic data. I am grateful again to Dr. Atul Butte for providing
funds to access the NHANES restricted data. I thank the staff of the
Biomedical Informatics Training program and the Butte Laboratory, Mary
Jeanne Oliva, Susan Aptekar, Alex Skrenchuk, Dr. Russ Altman, and Dr. Larry
Fagan. Without the support of these institutions and people, this work would
not have been possible.
A portion of the work in this dissertation derives from two published articles
and two articles currently in review for publication:
Chapter 2:
1. Patel, C. J. and A. J. Butte, Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics, 2010. 3(1): p. 17.
Chapter 4:
2. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.
3. Patel, C.J., M. R. Cullen, J.P.A. Ioannidis, A.J. Butte, Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. Submitted, 7/2011.
Chapter 5:
4. Patel, C.J., R. Chen, J.P.A. Ioannidis, A.J. Butte, Systematic identification of interaction effects between validated genome- and environment-wide associations on Type 2 Diabetes Mellitus. Submitted, 8/2011.
In the Chapter 2 work, I devised the methodology and wrote the manuscript
with my advisor, Atul Butte. In the Chapter 4 work, I devised the
“Environment-wide-Association Study” (EWAS) framework and carried out
the analyses. For the EWAS on Type 2 Diabetes, I wrote the manuscripts with
Jay Bhattacharya and Atul Butte. For the EWAS on serum lipid levels, I wrote
and edited the manuscripts with Mark Cullen, John Ioannidis, and Atul Butte.
Finally, in the Chapter 5 work, I devised the “Gene-Environment-Wide
Association Study” (G-EWAS) framework and implemented the software to
carry out the analyses. Rong Chen and Atul Butte provided the database of
curated genetic information. I interpreted the data and wrote the manuscript
with Rong Chen, John Ioannidis, and Atul Butte.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING ENVIRONMENTAL ASSOCIATIONS TO DISEASE
  What is the “Environment”? What is the “Envirome”?
  Creation of robust hypotheses connecting the environment, genome, and multifactorial disease
    Creating hypotheses comprehensively on a population scale
    Creating hypotheses comprehensively on a molecular or toxicological scale
  Discussion
CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES TO COMPLEX DISEASE
  INTRODUCTION
  METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO GENE EXPRESSION RESPONSE
  RESULTS
    Verification Phase
    Predicting Environmental Chemicals Associated with Cancer Data Sets
    Clustering Significant Predictions by PubChem-derived Biological Activity
  DISCUSSION
CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE ASSOCIATIONS ON DISEASE AND DISEASE-RELATED PHENOTYPES ON POPULATIONS.
  INTRODUCTION
  METHODS BACKGROUND
    Genome-wide association to disease
    Environment-wide association to disease
    Genetic versus non-genetic associations in population scaled studies
  EWAS METHOD
    Stage 1: Linear Modeling
    Stage 2: Controlling for Multiple Hypotheses by Estimating the False Discovery Rate
    Stage 3: Validation
    Stage 4: Sensitivity Analyses
    Stage 5: Correlation Globes
  DISCUSSION
CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2 DIABETES (T2D) AND SERUM LIPID LEVELS
  INTRODUCTION
  ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2 DIABETES
    EWAS on T2D: Methods
    EWAS on T2D: Results
    EWAS on T2D: Conclusions
  ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID LEVELS
    EWAS on Serum Lipids: Methods
    EWAS on Serum Lipids: Results
    EWAS on Serum Lipids: Conclusions
  DISCUSSION
CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN ASSOCIATION TO TYPE 2 DIABETES.
  INTRODUCTION
    Background
    Screening for Gene-Environment Interactions: “G-EWAS”
  METHODS
    Data and selected genetic and environmental factors
    Regression Analyses
    Multiplicity Correction and FDR
  RESULTS
    Allele frequencies
    Power Calculations
    Marginal Associations
    Correlation between genetic variants with environmental variables
    Screening for Genetic Variant by Environment Interactions
    Sensitivity Analyses limited to non-Hispanic white and other Hispanic participants and older individuals
    Limited Evidence to Support Interactions with Body Mass Index
  DISCUSSION
CHAPTER 6: CONCLUSION AND DISCUSSION
REFERENCES
LIST OF TABLES
Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms.
Table 2. Gene expression dataset summary for verification stage.
Table 3. Chemical Prediction Results from the Verification Phase.
Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919).
Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072).
Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883).
Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort.
Table 8. Marginal association of each locus (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL).
LIST OF FIGURES
Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward.
Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE.
Figure 3. Envirome-disease network for WHO priority diseases.
Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases.
Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome.
Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD).
Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures.
Figure 8. Predicting environmental chemical association to gene expression datasets.
Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem.
Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets.
Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets.
Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets.
Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity.
Figure 14. Sample data structure for EWAS.
Figure 15. Method summary for EWAS on NHANES data.
Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D.
Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels.
Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI.
Figure 19. Forest plot for validated environmental factors associated with LDL-C.
Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C.
Figure 21. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(triglycerides).
Figure 22. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug or had heart disease.
Figure 23. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(HDL-C).
Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides.
Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.
Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C.
Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D.
Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL).
Figure 29. Manhattan plot of significance values of interaction term (-log10(p-value) for interaction term of pair of factors).
Figure 30. Per-risk allele effect sizes for top putative interactions with p < 0.05.
CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-
DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING
ENVIRONMENTAL ASSOCIATIONS TO DISEASE
Environmental factors, “non-genetic” and often modifiable attributes such as
diet, drugs, chemical pollutants, ecological processes, and infectious agents,
contribute to disease and health in addition to genetic factors [1, 2]. For
decades, epidemiologists and toxicologists have formally studied how much
disease risk both environmental and genetic factors impart on the population
[3, 4].
For example, epidemiologists have measured the attributable fraction, or the
fraction of disease that would be eliminated if the factor were to be eliminated.
For some complex diseases, up to 70-90% of risk can be attributed to differing
“environments” [4, 5], defined here as a cumulative effect of specific factors.
Genetic factors, on the other hand, may also play a large role; for example,
the heritability of obesity is estimated to range from 40-70% [6]. To
determine what specific genetic or environmental factors contribute to disease,
epidemiologists begin with an associative study in which variation in a factor
is compared to disease status; for example, the presence of an environmental
factor is compared in diseased versus non-diseased individuals.
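The two quantities mentioned above can be illustrated with a minimal sketch. It assumes a simple 2x2 table of exposure by disease status for the associative comparison, and Levin's formula for the population attributable fraction; the cited studies may use other estimators. All counts, the exposure prevalence, and the relative risk are hypothetical values chosen for illustration, not data from this dissertation.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among diseased divided by odds of exposure among non-diseased."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's population attributable fraction: p(RR - 1) / (1 + p(RR - 1))."""
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical study: 100 diseased (40 exposed) and 100 non-diseased (20 exposed).
or_value = odds_ratio(40, 60, 20, 80)

# Hypothetical factor: 25% of the population exposed, relative risk of 3.
paf = attributable_fraction(0.25, 3.0)

print(f"odds ratio = {or_value:.2f}, attributable fraction = {paf:.2f}")
```

Under these assumed inputs, about one third of disease would be attributed to the factor; the estimate depends strongly on both the prevalence of exposure and the size of the relative risk.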
Despite both types of factors contributing to disease and health, many
investigations since the 1990s have focused on the genetic factors
(Figure 1). Recently, genetic association studies have advanced through a
framework known as the “Genome-wide Association Study”, or GWAS. A notable
example is the Wellcome Trust Case Control Consortium study of 7 common
diseases [7]. In GWAS, 100,000 to 1 million genetic factors
are compared in frequency between diseased and non-diseased populations [8]
and to date, there have already been over 350 of these studies [9]. In
contrast, environmental epidemiological studies currently investigate at most
a handful of candidate factors at a time in association to disease or
phenotype. Relatedly, we lack methods to comprehensively and systematically
report environmental associations to disease [10]. Environmental
epidemiological studies are neither data-driven nor multi-dimensional and
therefore do not allow for discovery in the way that genome-wide studies do.
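To make the contrast concrete, the sketch below shows the basic shape of a data-driven scan: the same frequency comparison between diseased and non-diseased groups is repeated for every factor, and the factors are then ranked. It is only an illustration under simplifying assumptions (a Pearson chi-square test on a 2x2 table per factor, no covariates, no correction for multiple hypotheses), and the factor names and counts are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts per factor:
# (present in diseased, absent in diseased, present in non-diseased, absent in non-diseased)
factors = {
    "factor_A": (60, 40, 40, 60),
    "factor_B": (50, 50, 50, 50),
    "factor_C": (55, 45, 48, 52),
}

# Test every factor the same way, then rank by p-value (smallest first).
results = sorted((p_value_1df(chi2_2x2(*table)), name) for name, table in factors.items())

for p, name in results:
    print(f"{name}: p = {p:.4f}")
```

A real scan over thousands of factors would also need to account for the number of tests performed, for example by estimating the false discovery rate.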
While epidemiology attempts to ascertain the contribution of factors on a
population scale, toxicology uses model biological systems to study the
influence of specific physical factors through assessment of molecular
responses [3, 11]. While technology exists to measure these molecular
responses on a genome-wide dimension [12], we lack methods to
comprehensively connect these with human disease state. In fact, the National
Academy of Sciences has called for a molecular-based effort to
comprehensively map toxicological findings to health and risk-associated
phenotypes [13-15].
Our claim is that creation of hypotheses regarding the environmental
associations to disease is possible through data-driven analytical methods that
are standard in multi-dimensional genome-wide studies. To this end, we
propose and implement (1) an integrative approach to connect molecular and
toxicological response to human disease states (Chapter 2); (2) a
population-based study framework to correlate multiple environmental factors
to disease, called an “Environment-wide Association Study” (EWAS) (Chapters
3 and 4); and (3) a method to predict how environmental factors
interact with genetic variants through unbiased integration of EWAS and
GWAS (Chapter 5). First, we introduce the “envirome”, a concept that assumes
a large subset of specific and possible environmental factors.
Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward. We queried MEDLINE for all articles investigating World Health Organization priority diseases (Cardiovascular Disease, Coronary Disease, Hypertension, Type 2 Diabetes, Lung Cancer, Breast Cancer, Colon Cancer, Asthma, Chronic Obstructive Pulmonary Disease, Premature birth, Kidney Disease, and Alzheimer Disease) and tabulated those strictly studying either genetics or the environment. Only articles investigating either disease and genetics or disease and environment as a primary subject matter were considered. Since the 1990s, genetic factors have been studied more than environmental factors for WHO priority diseases.
What is the “Environment”? What is the “Envirome”?
The “environment” is a loosely defined, heterogeneous mixture of non-genetic
factors. For example, specific physical environmental factors include
infectious agents (bacteria, viruses, fungi), dietary components (nutrients and
vitamins), ultraviolet radiation, and non-nutrient chemicals (such as drugs or
cigarette smoke). They may be byproducts of man-made processes, such as
air pollutants, or of natural processes, such as toxins from animals and plants.
Other types of environmental factors are non-“physical” (not associated with
one concrete factor) and are intertwined with behavior and lifestyle, such as
stress and exercise. The environment also includes factors that arise as a
result of interaction between internal biological processes and other
environmental factors, such as metabolites of environmental chemicals or
infectious agents [16]. Air pollution, lead, tobacco smoke, ultraviolet
radiation, occupational risk factors, and climate, in addition to infectious
agents, are example factors prioritized by the World Health Organization for
constant monitoring [17].
Environmental factors differ in their routes of exposure, modes of measurement,
and dynamics. This heterogeneous mixture contrasts with the genetic variant
factors assessed in GWAS, which are discrete and static units. The
homogeneous nature of genetic factors has helped enable
standardization of measurement (e.g., polymerase chain reaction [18] and gene
“chips” [19]) and organization (through efforts by the National Center for
Biotechnology Information, NCBI [20]).
Despite these characteristics, we propose a concept that allows for the
multi-dimensional study of the environment: the “envirome”, the total “ensemble
of the environment” [21] that can influence biological processes such as disease.
While the “environment” refers to non-genetic factors in an abstract sense, we
refer to the “envirome” in a way analogous to the genome, as a collection of
specific and varying factors.
To better grasp the categories of factors and the specific factors investigated
in the context of disease, we searched all of the scientific and medical
literature in MEDLINE up to the year 2009 for evidence of an investigated
relationship between an environmental factor and a disease condition. This
search is made possible by Medical Subject Headings (MeSH), an annotation
system administered by the National Library of Medicine (NLM) and applied to
all articles in MEDLINE [22, 23]. These subject headings contain categories of
terms, such as diseases (or conditions), chemicals, drugs, and physiological
attributes. These
sets of terms also have “qualifiers”, which contextualize relationships between
terms; sample qualifiers include “etiology” (indicating an etiological
relationship between a term corresponding to a condition and an environmental
factor), “epidemiology” (indicating a population-based study of a
condition and factor), and “drug therapy” (indicating a therapeutic
relationship studied between the condition and factor), among others.
Specifically, we went through the MeSH annotations for each article and
looked for terms for environmental factors and diseases, qualified by an
etiological or epidemiological relationship. For example, an article examining
the effects of smoking on incident type 2 diabetes (e.g., [24]) is annotated with
the condition “Diabetes Mellitus, Type 2” and the specific environmental factor
“Smoking”, qualified by “etiology”. Because many different aspects of factors
and conditions can be reported for a given article, we looked specifically for
those factors and conditions that MEDLINE annotators deemed the “major”, or
main, subjects of the article. From this scraping of MEDLINE, we obtained
a comprehensive list of disease-environmental factor pairs investigated in the
medical literature. We also obtained the number of scientific
publications that investigated each factor-disease pair, which indicates the
degree or intensity to which a particular disease and factor have been studied.
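The extraction described above can be sketched as follows. This is an illustration only, not the software used in this work: the record layout (a list of term, qualifier, major-topic triples per article), the small term vocabularies, and the assumption that the qualifier is attached to the disease term are all hypothetical simplifications of how MeSH annotations are stored.

```python
from collections import Counter

# Hypothetical vocabularies; the real lists come from MeSH categories.
ENV_FACTORS = {"Smoking", "Asbestos"}
DISEASES = {"Diabetes Mellitus, Type 2", "Mesothelioma"}
ACCEPTED_QUALIFIERS = {"etiology", "epidemiology"}

# Hypothetical articles: each is a list of (term, qualifier, is_major_topic).
articles = [
    [("Diabetes Mellitus, Type 2", "etiology", True), ("Smoking", None, True),
     ("Middle Aged", None, False)],
    [("Diabetes Mellitus, Type 2", "epidemiology", True), ("Smoking", None, True)],
    [("Mesothelioma", "etiology", True), ("Asbestos", None, True)],
    # Skipped: wrong qualifier on the disease and the factor is not a major topic.
    [("Mesothelioma", "drug therapy", True), ("Asbestos", None, False)],
]


def factor_disease_pairs(annotations):
    """Return the (factor, disease) pairs supported by one article's annotations."""
    factors = {term for term, _, major in annotations
               if major and term in ENV_FACTORS}
    diseases = {term for term, qualifier, major in annotations
                if major and term in DISEASES and qualifier in ACCEPTED_QUALIFIERS}
    return {(factor, disease) for factor in factors for disease in diseases}


# Number of publications supporting each factor-disease pair.
pair_counts = Counter()
for annotations in articles:
    pair_counts.update(factor_disease_pairs(annotations))

for (factor, disease), n in pair_counts.most_common():
    print(f"{factor} / {disease}: {n} publication(s)")
```

Each article contributes at most one count per pair, so the tally for a pair approximates how intensively that factor and disease have been studied together.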
We attempt to assemble the envirome into coherent categories of
environmental factors using the MeSH annotations. Specifically, each factor in
MeSH is described by a set of terms. For example, “Smoking” is described by
the term “Individual Behavior” and “Lead” by the terms “Hazardous Substance”
and “Element”. We manually assigned each factor to one of 21 categories
based on these descriptors (Table 1). There is
notable overlap between factor categories involving drugs and chemicals; for
these factors, we opted to categorize all factors that are drugs as “drug” and
non-drug environmental chemicals (such as pesticides or materials) in
chemically oriented bins (“organic chemical”, “inorganic chemical”,
“element”).
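The precedence rule for overlapping drug and chemical descriptors can be written down as a small sketch. The categorization in this work was done manually across 21 categories; the function below only illustrates the stated tie-break (drugs first, then chemically oriented bins), and the descriptor sets, the ordering among the chemical bins, and the "manual review" fallback are assumptions made for the example.

```python
# Chemically oriented bins named in the text; their order here is an assumption.
CHEMICAL_BINS = ("organic chemical", "inorganic chemical", "element")


def categorize(descriptors):
    """Apply the drug-before-chemical precedence rule to a factor's descriptors."""
    normalized = {d.lower() for d in descriptors}
    if "drug" in normalized:
        return "drug"  # any factor that is a drug is binned as "drug"
    for chemical_bin in CHEMICAL_BINS:
        if chemical_bin in normalized:
            return chemical_bin  # non-drug chemicals go to a chemical bin
    return "manual review"  # every other category is left to manual assignment


# Hypothetical descriptor sets for three factors.
examples = {
    "Aspirin": {"Drug", "Organic Chemical"},
    "Lead": {"Hazardous Substance", "Element"},
    "Smoking": {"Individual Behavior"},
}

for factor, descriptors in examples.items():
    print(f"{factor}: {categorize(descriptors)}")
```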
The most highly investigated factor categories are drugs and medical
procedures, comprising 24% and 29% of all factor-disease pairs, respectively.
In comparison, organic chemicals comprise only 8% of all factor-disease
relations. However, organic chemicals are numerous in
relation to all other factors (15%), suggesting an opportunity to explore these factors.
Another opportunity lies in pinning down the composite factors lying in the
most abstract categories, such as “chemical”; for example, air pollution is a
complex matrix of specific factors belonging to other categories, each of which
might have a distinct contribution to disease. The envirome is a complex
entity due to the heterogeneity of its factors.
We acknowledge overlap and imbalance in this categorization; however, we
stress that this first pass is exploratory and not definitive. We propose future
work, the “envirome sequencing project”, focused on how exactly the
envirome is defined and categorized, considering composition of physical
factors (e.g., chemical structure), scope of biological effect (e.g., toxicological
responses), potential routes of exposure, and modes of direct measurement in
human tissue (e.g., cell assays, mass spectrometry, self-report). The next phase
of such a project would follow in the footsteps of the HapMap project [25],
characterizing how common environmental factors vary in different
populations.
Our search resulted in a total of 89,653 unique disease-factor pairs assembled
from 4977 unique environmental factors and 3189 unique disease conditions.
Figure 2 shows the number of publications for each factor versus the
number of diseases investigated in the context of that factor. Specifically, the
median number of publications for a factor was 8; the most highly investigated
factor was “Smoking” (2416 publications), and 17% of all factors were
investigated only once (1 publication). The median number of diseases investigated
per factor was 6; the most highly investigated diseases were “Dermatitis” (3353
publications), “Drug Eruptions” (3151 publications), and “Occupational
Disease” (3105 publications). Of all diseases, 11% have been investigated only
once (1 publication). In general, the more a factor is investigated, the more
diseases it is investigated with (Figure 2). The most highly studied factor-
disease pairs are seen in Table 1 and include factors such as Asbestos, UV
Radiation, Smoking, and Air Pollution studied with diseases such as Lung
Cancer, Skin Cancer, and Mesothelioma.
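The bookkeeping behind these counts can be illustrated with a minimal Python sketch. It assumes each citation has already been reduced to a (publication, factor, disease) record; the records and all names below are invented for illustration and are not the MEDLINE data summarized above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (publication_id, factor, disease) records; illustrative only.
records = [
    (1, "Smoking", "Lung Neoplasms"),
    (2, "Smoking", "Lung Neoplasms"),
    (3, "Smoking", "Asthma"),
    (4, "Asbestos", "Mesothelioma"),
    (5, "Asbestos", "Lung Neoplasms"),
    (6, "Latex", "Dermatitis"),
]


def summarize(records):
    """Group publication ids by factor-disease pair and by factor."""
    pair_pubs = defaultdict(set)        # (factor, disease) -> publication ids
    factor_pubs = defaultdict(set)      # factor -> publication ids
    factor_diseases = defaultdict(set)  # factor -> diseases studied with it
    for pub, factor, disease in records:
        pair_pubs[(factor, disease)].add(pub)
        factor_pubs[factor].add(pub)
        factor_diseases[factor].add(disease)
    return pair_pubs, factor_pubs, factor_diseases


pair_pubs, factor_pubs, factor_diseases = summarize(records)

# One point per factor in a plot like Figure 2: (publications, diseases).
pubs_per_factor = {f: len(p) for f, p in factor_pubs.items()}
diseases_per_factor = {f: len(d) for f, d in factor_diseases.items()}

median_pubs = median(pubs_per_factor.values())
# Share of factors investigated in exactly one publication.
single_study_share = sum(n == 1 for n in pubs_per_factor.values()) / len(pubs_per_factor)
```

Using sets of publication identifiers, rather than raw counters, keeps a publication from being counted twice for the same factor when it is annotated with several diseases.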
Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE. Each point in the figure represents an environmental factor, where the x-axis represents the number of publications for the factor and the y-axis represents the number of diseases investigated for that factor. For example, the factor “Smoking” is the top right-most point in the plot, with 2,416 publications investigating it among 560 diseases. The red line depicts the median and the grey lines depict deciles. Markers are faint and jittered to depict concentration of data.
[Figure 2 plot. x-axis: number of publications for a factor; y-axis: number of diseases for a factor.]
Category                     Total number     Number of factor-      Top factor (disease)
                             of factors (%)   disease relations (%)
Animal (General)             7 (<1%)          15 (<1%)               Birds (Bird Disease)
Bacteria                     189 (4%)         857 (1%)               E. coli (Diarrhea)
Behavior                     28 (<1%)         1101 (1%)              Smoking (Lung neoplasms)
Chemical (General)           220 (4%)         4392 (5%)              Air pollution (Asthma)
Dietary component            290 (6%)         4912 (5%)              Lipids (Cardiovascular disease)
Drug                         1322 (27%)       21071 (24%)            Analgesics, Opioid (Pain)
Element                      134 (2%)         3229 (4%)              Lead (Lead Poisoning)
Eukaryotes                   99 (2%)          218 (<1%)              Mites (Mite Infestations)
Fungus                       35 (<1%)         168 (<1%)              Candida (Candidiasis)
Hormone                      263 (5%)         5283 (5%)              Estrogens (Breast neoplasms)
Immune factors               38 (5%)          5826 (6%)              Measles-Mumps-Rubella Vaccine (Autistic Disorder)
Poisoning / injury process   38 (<1%)         930 (1%)               Occupational Exposure (Dermatitis)
Inorganic chemical           145 (2%)         2554 (3%)              Asbestos (Lung neoplasms)
Man-made object              58 (1%)          1153 (1%)              Laser (Eye injuries)
Nucleic Acids                28 (<1%)         207 (<1%)              RNA, viral (Hepatitis C)
Occupation                   23 (<1%)         177 (<1%)              Equipment design (Burns)
Organic chemical             737 (15%)        7318 (8%)              Latex (Dermatitis)
Procedure                    801 (16%)        26388 (29%)            Radiotherapy (Neoplasms)
Natural Process              73 (1%)          2244 (3%)              UV Rays (Neoplasms)
Protein                      83 (2%)          952 (1%)               Streptokinase (Thrombosis)
Virus                        141 (3%)         658 (<1%)              HIV (AIDS)
Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms. The second column shows the number of factors within each category and the percentage of all factors. The third column shows the number of factor-disease relations in each category. The right-most column shows an example factor belonging to the category and a disease studied with that factor. The left-most column is the color key for Figure 3.
Just as the “Human disease network” has enabled the definition of the
“diseaseome”, a comprehensive representation of diseases and their
interrelationships with genomic factors [26], we introduce the “Envirome-
disease network” to aid in the definition of the envirome and its interplay with
common diseases (Figure 3, Figure 4). The “Envirome-disease network”
consists of weighted links corresponding to the number of publications
between specific factors and WHO-prioritized diseases [27], including
cardiovascular disease, coronary disease, hypertension, type 2 diabetes (T2D),
kidney disease, premature births, lung cancer, prostate cancer, colorectal
cancer, Alzheimer’s disease, asthma, and chronic obstructive pulmonary
disease (COPD). In other words, the more publications that link a
disease and a factor, the stronger the link. As seen in the network (Figure 4),
the cardiovascular-related and metabolic diseases (cardiovascular disease,
hypertension, T2D, kidney disease) appear to share many diet- and
therapy-related factors, such as carbohydrates and anti-hypertensive drugs. Further,
lung-related diseases such as asthma, COPD, and lung cancer share factors
such as smoking, tobacco smoke, and air pollution. Conditions such as
premature births, colorectal cancer, and COPD are less studied relative to
cardiovascular related diseases or breast and lung cancer.
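A weighted factor-disease network of this kind can be sketched in a few lines of Python. The citation counts below are invented for illustration, and the function names are our own; the sketch only assumes that each link carries the number of publications for that factor-disease pair and that node degree is the number of distinct partners.

```python
from collections import defaultdict

# Hypothetical citation counts per (factor, disease) pair; illustrative only.
citations = {
    ("Smoking", "Lung Neoplasms"): 30,
    ("Smoking", "Asthma"): 12,
    ("Smoking", "Hypertension"): 11,
    ("Air Pollution", "Asthma"): 5,
    ("Air Pollution", "Lung Neoplasms"): 4,
    ("Dietary Carbohydrates", "Hypertension"): 7,
}


def build_network(citations):
    """Return a weighted bipartite adjacency map and each node's degree."""
    adjacency = defaultdict(dict)
    for (factor, disease), weight in citations.items():
        adjacency[factor][disease] = weight  # link weight = number of publications
        adjacency[disease][factor] = weight
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    return adjacency, degree


def shared_factors(adjacency, disease_a, disease_b):
    """Factors linked to both diseases, in alphabetical order."""
    return sorted(set(adjacency[disease_a]) & set(adjacency[disease_b]))


adjacency, degree = build_network(citations)
strongest_link = max(citations, key=citations.get)
lung_related = shared_factors(adjacency, "Asthma", "Lung Neoplasms")
```

In this toy network the factors shared by the two lung-related diseases are recovered by a set intersection, which mirrors how shared factors appear toward the center of the drawn network.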
Figure 3. Envirome-disease network for WHO priority diseases [27]. A link between an environmental factor and disease node denotes their association in MEDLINE. Each link is scaled in size according to the number of citations for that association; for example, the largest link is observed between “Smoking” and “Lung Neoplasms”. Environmental factor and disease nodes are scaled in proportion to the number of connections they have with other nodes (their “degree”); similarly, labels appear for nodes with degree ≥ 4. Factors studied with only one disease appear on the outer portion of the figure with single links, while factors linked with many diseases are toward the center. Factors are colored according to their category (Table 1); disease nodes are not colored. The top 5 cited factors for each disease are annotated, offset, with the number of citations in parentheses.
[Figure 3 network image. Node labels and the per-disease top-5 factor annotations are not reproduced here.]
Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases. For clarity, only nodes of degree ≥ 4 are seen here. See Figure 3 for caption and full network.
[Figure 4 network image. Node labels are not reproduced here.]
From such a comprehensive representation, we may begin to assemble the
envirome from a set of factors investigators have prioritized through study of
their relationship with disease. However, for discovery and hypothesis
generation, arguably the most important associations are ones that have few or
no citations, for example factors found in the lower left quadrant of Figure 2,
or corresponding factors with a single or no link in the envirome-disease
network (Figure 3). In the following section, we present how we may
systematically create these hypotheses to establish links for further study.
Creation of robust hypotheses connecting the environment, genome, and
multifactorial disease
Genomics and informatics have enabled the creation of novel and validated
hypotheses on a multidimensional breadth for multifactorial diseases. This
scale is required for complex diseases where many such factors are thought to
take part. Through GWAS we have discovered genetic variants that have
enabled scientists to postulate about the function of the pathways discovered in
diseased individuals and have led to biological and clinical experimentation
(for example, [28-31]). While GWAS has fallen short of explaining the total heritable
risk of complex disease [32], the data-driven methodology has enabled us to
collect robust, multidimensional evidence, opening new avenues of
investigation [32, 33].
In this dissertation, we have developed and implemented analytic methods to
associate environmental factors to multifactorial disease (Figure 5).
Specifically, we propose methods that scale across experimental frameworks,
from populations (Figure 5 A, C) down to molecules (Figure 5 D, F). Further,
they scale in resource utilization through use of publicly accessible data sets.
Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome. Examples are fictional and shown for illustrative purposes. A.) An Environment-wide association study (EWAS) is a population scale method to screen multiple factors in the envirome for association with a disease of interest. Depicted is a “Manhattan Plot”, a way to visualize strength of association for each factor over the envirome. For example, the pesticide Heptachlor is depicted as the highest ranking finding in an EWAS in association to Type 2 Diabetes. B.) Genome-wide Association study or GWAS in which genomic variants are associated to disease on a genome-wide dimension. C.) Genome-Environment-Wide Association Study (G-EWAS), in which marginal findings from Genome-wide Association Studies (GWAS) and EWAS are integrated and screened jointly for interaction in context of a disease of interest. For example, evidence for Heptachlor and SLC30A8 interaction against Type 2 Diabetes is shown in the 2D plot. D.) Representation of an “Envirome Map” whereby gene expression “signatures” induced by physical environmental factors on model systems are summarized in a matrix. For example, Bisphenol A has a signature consisting of CYP1A1, MAPK1, and ESR1 expression. E.) Standard gene expression disease responses collected on multiple diseases, for example Type 2 Diabetes, Breast Cancer, and Coronary Heart Disease. F.) Method to correlate environmental factor expression signatures to disease state, for example Bisphenol A to Breast Cancer. A-C.) considers studies on a population scale, D-F.) on a molecular or toxicological scale. Depiction of the envirome domain is seen in green, the genome domain in red. Examples are shown in italics.
[Figure 5 image. Panels: A. Environment-wide Association Studies (EWAS) (Chapter 3); B. Genome-wide Association Studies (GWAS); C. G-EWAS (Chapter 4); D. Envirome Map (Chapter 2); E. Disease expression studies; F. Envirome-disease expression signature correlation (Chapter 2). Panels A-C are at the population scale, with disease association significance on the vertical axis; panels D-F are at the molecular scale.]
Creating hypotheses comprehensively on a population scale
In EWAS (Chapter 3 and 4, Figure 5A), as in GWAS (Figure 5B), multiple
environmental factors, or the “envirome”, are interrogated against disease state
using an epidemiological study design. These studies can be “case-control”,
whereby factor variability is compared between incident disease cases
(individuals recently diagnosed) and non-diseased individuals. EWAS can be
utilized in other observational study designs, most notably a “cohort” or
“cross-sectional” one, in which individuals are sampled from a pre-defined
population and their disease status is estimated after the sampling process.
Both the selection and the determination of cases and controls are well studied in
epidemiology, and non-optimal selection can lead to biased estimates [34].
For example, if disease cases are misclassified as controls, estimates may
be attenuated.
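The attenuation caused by misclassification can be shown with simple arithmetic on a 2x2 table. The sketch below is illustrative only: the counts are invented, and it assumes that a fixed fraction of true cases is recorded as controls regardless of exposure (non-differential misclassification).

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of exposure among cases versus controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def misclassify_cases(table, fraction):
    """Move a fraction of true cases into the control group, regardless of exposure."""
    a, b, c, d = table
    return (a * (1 - fraction), b * (1 - fraction), c + a * fraction, d + b * fraction)


# Hypothetical counts: exposed cases, unexposed cases, exposed controls, unexposed controls.
true_table = (40, 60, 20, 80)

true_or = odds_ratio(*true_table)
observed_or = odds_ratio(*misclassify_cases(true_table, 0.2))
```

With these counts, `observed_or` lies between 1 and `true_or`: the control group is contaminated with exposed cases, which pulls the estimate toward the null.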
Other types of biases abound in observational studies and must be
acknowledged and considered. For example, in a cross-sectional study one
cannot easily resolve the temporal relationship between exposure to a factor
and disease (i.e., did the disease come first, or the exposure?). This bias is
known as “reverse causality” [34]. Other biases, such as confounding, also
hinder inference in observational studies. A confounding variable is one that
correlates with both the disease state and the factor; thus, the association of the
factor with the disease may merely stand in for the association between the
confounding variable and the disease. Confounding variables need not be measured. While we
introduce means to estimate confounding bias in this dissertation utilizing
measured variables (Chapter 3), confounding bias cannot be avoided altogether.
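As an illustration of how a measured confounder can manufacture an association, the following sketch compares a crude odds ratio with a stratified one. It uses the standard Mantel-Haenszel pooled odds ratio, which is our choice for this example and not necessarily the approach of Chapter 3; the counts are invented.

```python
def odds_ratio(a, b, c, d):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata of a measured confounder."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts per stratum: (exposed cases, unexposed cases,
# exposed controls, unexposed controls).
strata = [
    (80, 20, 40, 10),  # confounder present: exposure and disease both common
    (10, 40, 20, 80),  # confounder absent: exposure and disease both rarer
]

# Collapsing over the confounder gives the crude table.
crude_table = tuple(sum(column) for column in zip(*strata))
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata)
```

Here the factor has no association with disease within either stratum, yet the crude odds ratio is well above 1 because the confounder is tied to both exposure and disease. An unmeasured confounder would leave no such stratum to adjust on.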
We have developed and applied EWAS [35, 36] using a cross-sectional dataset
known as the National Health and Nutrition Examination Survey (NHANES),
a representative survey of the non-institutionalized United States population
carried out by the Centers for Disease Control and Prevention and the National
Center for Health Statistics [37]. Participants of NHANES are surveyed
regarding their health and disease through a battery of questionnaires,
physician-led physical exam, and urine- and blood-based laboratory tests. A
series of lab tests (N=150-266, depending on survey year) consist of blood or
urine markers of environmental factors, such as heavy metals, persistent
organic pollutants, nutrients and vitamins, antibodies against allergens, and
indicators of pathogenic exposure. In addition, several questionnaire items
are used as proxies of environmental factors (N=300-1000, depending on
survey year), such as years smoked and pharmaceutical and drug use.
Finally, many clinical biomarkers used for disease diagnosis and risk
prediction, such as body mass index, fasting glucose, and serum lipid levels, are
jointly measured, providing a platform to create hypotheses regarding
prevalent diseases without investment in recruitment of subjects.
In its current form, each factor in an EWAS (together, the “envirome”) is
comprehensively associated to disease or phenotypic state, often depicted in a
“Manhattan plot” (Figure 5A), a transparent representation of all findings. Like
GWAS, the framework calls for a massive number of comparisons, which can
lead to false positives. EWAS utilizes the “false discovery rate” (FDR) to
account for multiple comparisons, which provides a quantitative estimate of
the proportion of false discoveries at a given level of statistical significance and is
less conservative than family-wise error control (e.g., the Bonferroni
correction) [38-40]. Along with a
more stringent threshold to account for false positives, positive findings are
evaluated in independent cohorts and surveys. Multiple comparisons are not
considered in most environmental epidemiological studies, and this level of
stringency allows for robust and quantitative prioritization of findings [10, 41,
42].
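To make the contrast with Bonferroni concrete, the sketch below applies the Benjamini-Hochberg step-up procedure, one widely used way to control the FDR. This is an assumption made for illustration: the text above does not say which FDR estimator is used, and the p-values are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def bonferroni(p_values, alpha=0.05):
    """Return indices of hypotheses rejected under Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


# Hypothetical p-values, one per environmental factor tested.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

fdr_hits = benjamini_hochberg(p_values, q=0.05)
bonferroni_hits = bonferroni(p_values, alpha=0.05)
```

On these ten p-values the FDR procedure retains the two smallest, while Bonferroni retains only the smallest, which illustrates the less conservative behavior described above.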
We have applied EWAS to multiple disease-related phenotypes, notably T2D
[35] and serum lipid levels [36], risk factors for coronary heart disease
(Chapter 4). As with GWAS, we were able to identify and validate in independent
surveys both novel and known factors associated with the phenotypes that
should be followed up in additional epidemiological or toxicological studies.
For example, we have created hypotheses about pollutant factors such as
polychlorinated biphenyls and organochlorine pesticides, both associated with
significantly increased T2D prevalence and adverse lipid profiles. Surprisingly, a
vitamin marker, γ-tocopherol, was also observed to have an adverse
relationship with the diseases, leading us to hypothesize about both reverse
causality and possible harmful effects of the vitamin [43].
Common disease is hypothesized to arise from a combination of both genetic and
environmental factors, but GWAS and EWAS each examine one domain without
consideration of the other; that is, they test marginal associations. We propose
another population-based study, called a Genome-Environment-Wide Association Study (or
G-EWAS, Figure 5C, Chapter 5), to examine the joint effect of these factors.
Specifically, we interrogate individual findings from GWAS and EWAS,
testing whether each pair-wise combination of a genetic and an environmental
factor “interacts”, a phenomenon known as “gene-environment interaction” [44]. When testing
for interaction, we examine whether the joint effect is greater or smaller than
expected from the marginal effects alone.
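One simple way to quantify such a departure is on the multiplicative odds-ratio scale, comparing the joint odds ratio with the product of the two separate odds ratios. The sketch below is illustrative only: the counts are invented, the function names are our own, and the regression models actually used in Chapter 5 may differ.

```python
def odds(cases, controls):
    """Odds of disease in one genotype-by-exposure group."""
    return cases / controls


def interaction_ratio(counts):
    """counts maps (has_variant, is_exposed) -> (cases, controls).

    Returns OR_joint / (OR_variant_only * OR_exposure_only). A value of 1
    means the joint effect equals the product of the separate effects.
    """
    baseline = odds(*counts[(0, 0)])
    or_variant = odds(*counts[(1, 0)]) / baseline
    or_exposure = odds(*counts[(0, 1)]) / baseline
    or_joint = odds(*counts[(1, 1)]) / baseline
    return or_joint / (or_variant * or_exposure)


# Hypothetical (cases, controls) per group; illustrative only.
counts = {
    (0, 0): (100, 400),  # no variant, unexposed
    (1, 0): (60, 200),   # variant only
    (0, 1): (75, 200),   # exposure only
    (1, 1): (90, 100),   # variant and exposure
}

ratio = interaction_ratio(counts)
```

With these counts the separate odds ratios are 1.2 and 1.5 and the joint odds ratio is 3.6, so the ratio is 2: the joint effect is twice what the marginal effects alone would predict under a multiplicative model.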
Interaction effects enable investigators to postulate about biological
mechanisms underlying the disease of interest. As an example, investigators
have recently confirmed an increased risk of bladder cancer for individuals who
carry variants in the NAT2 gene and who smoke [45]. NAT2 encodes an enzyme that metabolizes
chemical compounds; the presence of a statistical interaction has prompted a
hypothesis linking altered NAT2 function, chemical compounds found in
cigarette smoke and their metabolites, and the pathology of bladder cancer [46].
However, there are a few problems with current gene-environment
investigations [44, 47] (Chapter 5). First, like most environmental
epidemiological investigations, gene-environment interaction studies rely on a
priori selection of factors to test. The task of choosing factors is particularly
daunting given that the number of common genetic and environmental variants
is approximately on the order of thousands to millions. Second,
gene-environment studies are resource intensive, requiring exponentially greater
sample sizes than studies that examine either component alone. Relatedly, these
studies can be analytically intensive, incurring a large multiple-hypothesis cost
due to the many combinations of factors to test. Therefore, in the spirit of
EWAS and GWAS, we propose G-EWAS. This approach attempts to solve
the problem of choosing which factors to test while alleviating some of the
analytical burden of testing a large hypothesis space.
Like EWAS and GWAS, G-EWAS is a data-driven method; it searches for interactions
between robust and replicated variants marginally associated with disease in
EWAS and GWAS (Figure 5B). This method avoids the variable selection
bias that has plagued candidate gene studies [48] while keeping the hypothesis
space constrained. Specifically, each possible pair of factors found in EWAS
and GWAS is tested for an interaction association with the disease of interest,
screening a 2-dimensional hypothesis space on the order of hundreds, not
thousands (Figure 5C). Gene-environment studies are power-intensive, and
keeping the hypothesis space as small as possible is desirable [47]. Relatedly,
increasing the number of hypotheses increases the burden of false positives;
frequentist methods for multiple hypothesis control (e.g., Bonferroni)
become too conservative, and methods to estimate the FDR are necessary.
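The effect of screening on the size of the hypothesis space can be sketched as follows. The numbers of variants and factors are hypothetical placeholders (not results from this work), as are the variable names; the sketch only enumerates pairs of marginal findings and compares the Bonferroni threshold with that of an unscreened search.

```python
from itertools import product

# Hypothetical marginal findings; names and counts are placeholders.
gwas_variants = [f"variant_{i}" for i in range(1, 19)]  # 18 replicated loci
ewas_factors = [f"factor_{i}" for i in range(1, 6)]     # 5 validated factors

# Every variant-factor pair is one interaction hypothesis.
candidate_pairs = list(product(gwas_variants, ewas_factors))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


screened_threshold = bonferroni_threshold(0.05, len(candidate_pairs))

# An unscreened search over a hypothetical 500,000 variants and 250 factors.
unscreened_tests = 500_000 * 250
unscreened_threshold = bonferroni_threshold(0.05, unscreened_tests)
```

Restricting the search to marginal findings leaves 90 tests instead of 125 million in this toy setting, so each interaction test faces a far less stringent threshold.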
We demonstrated the utility of G-EWAS in application to T2D (Chapter 5) and
found an interaction between a non-synonymous variant in the SLC30A8 gene
and two vitamin markers, γ-tocopherol and trans-β-carotene, after adjustment for
risk factors and consideration of multiple comparisons. Of note, we observed
up to 30-40% increased genetic risk when considering specific environmental
factors. Of course, evidence of statistical interaction does not imply an etiological
relationship between the factors. However, investigators have observed that
diabetes is induced in SLC30A8 knockout models only in the presence of a
high-fat diet [49, 50]. Results here strengthen this hypothesis and suggest specific
dietary factors whose presence or absence may induce a diabetic state. With G-EWAS,
we have a platform to produce multiple data-driven hypotheses regarding
biological mechanisms through gene-environment interactions. Further,
these interaction findings have implications for personalized genetic risk [23].
Creating hypotheses comprehensively on a molecular or toxicological scale
Toxicology is concerned with the physical substances and exposures that lead
to adverse changes at the organism and/or molecular level, and with how
organisms are exposed to substances. Specifically, toxicologists utilize the
physical sciences to measure how physical substances interact with biological
systems to induce physiological change. First, how do biochemical
processes alter a substance for digestion, absorption, and excretion; that is, what are
the “toxicokinetic” properties of a system given exposure? Second, how does
the substance induce “toxicodynamic” change; for example, how does the
substance influence specific targets? Last, how do toxicokinetics and
toxicodynamics influence functional change, such as in cell viability and metabolism
[3]? For example, a cornerstone of toxicology is known as “dose-response”
modeling, in which a molecular response is correlated with controlled doses of
a substance. Ascertaining a dose-response relationship enables inference
regarding the type of relationship between a substance and a biological system
connected to the response (e.g., adverse or protective effect).
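A dose-response relationship of the simplest kind can be sketched as a straight-line fit of response against log dose. The following is a minimal illustration under stated assumptions: the doses and responses are invented, the linear-in-log-dose form is our simplification, and real dose-response modeling often uses sigmoidal curves instead.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical controlled doses and a measured molecular response
# (for instance, log2 fold change of a marker gene); illustrative only.
doses = [0.1, 1.0, 10.0, 100.0]
response = [0.1, 0.6, 1.1, 1.6]

log_doses = [math.log10(d) for d in doses]
slope, intercept = fit_line(log_doses, response)
```

The sign and size of `slope` summarize the direction and strength of the response per ten-fold increase in dose; whether that response is adverse or protective depends on what the measured endpoint represents.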
Another way to ascertain molecular response is to utilize genome-wide
measurements, such as commoditized gene expression microarrays. This
subfield of toxicology is known as “toxicogenomics” and aims to “study the
response of a whole genome to toxicants” [51]. In contrast to the
population-based EWAS approach, toxicogenomics shows how specific environmental
factors may perturb a biological system; however, these responses are
unconnected to complex diseases.
In Chapter 2, we show how one may use the tools of integrative genomics to
connect toxicogenomic responses with disease-associated responses, thus
enabling hypothesis generation linking specific physical environmental
factors to complex diseases, such as cancer (Figure 5D-F). For example,
landmark genomic research has linked chemically induced functional changes
to disease and related phenotypes in the context of therapeutic prediction. Lamb et
al., in an effort dubbed the “Connectivity Map”, correlated 164 drug-induced
gene expression changes on cell lines to human disease-associated gene
expression states, predicting novel molecules for therapeutics [52]. Analogous
to this work, we ask which environmentally induced changes are
correlated with, and might explain variation in, functional disease states.
The proposed method takes full advantage of the publicly available
toxicological and disease-related data such as the Gene Expression Omnibus
(GEO) [53], the Comparative Toxicogenomics Database (CTD) [54], the Toxin
and Toxin-Target Database (T3DB) [55], and the National Toxicology
Program’s ToxCast effort [12, 56], thus providing a scalable way to derive
hypotheses with minimal effort in upfront experimental design.
To begin, numerous experiments examining disease-associated gene
expression are accessible in GEO. Furthermore, it is possible to compare these
disparate datasets to make inferences over the aggregate. For example, Dudley
et al. have collected 238 disease-associated expression responses from GEO
for cross-disease analysis [57]. Further, the authors have shown that signal
associated with disease is stronger than experimental or tissue-of-origin
artifacts [57]. And, most importantly, the authors have used this representation
to predict novel therapeutics connected with disease in an unbiased manner
[58]. We claim the same can be achieved with environmental factors.
Toward this goal, we represent a compendium of disease-associated gene
expression datasets in matrix form, where columns represent different diseases,
and rows genes, and each entry a measure of differential gene expression
(Figure 5F) corresponding to the gene-disease pair. This representation allows
one to systematically infer over these disparate datasets. Of course, when
aggregating data over multiple experiments from different investigators, care
must be taken to ensure the inter-comparability of the data [59, 60].
Nevertheless, the commoditization of measurement platforms has enabled data
standardization, easing some of the burden of ensuring their comparability [57].
Prototypical gene expression experiments in toxicogenomics can be framed
similarly (single columns of Figure 5D). These experiments typically involve
characterizing a range of dosages of a handful of environmental chemicals on
gene expression of a model organism, such as mouse or rat, or cell line system.
These experimental data files are then submitted to GEO or are summarized in
databases such as the CTD. Just as the Connectivity Map has
compiled “signatures”, or patterns of expression, for individual chemicals to
predict therapeutics for disease states, one can do the analogous with
numerous toxicogenomic experiments covering multiple environmental
chemicals: we call this effort an “Envirome Map” (Figure 5D).
We then query the Envirome Map (Figure 5D) with specific disease-associated
expression datasets (Figure 5E), correlating environmental signatures with
disease-associated expression states (Figure 5F). In Chapter 2, we develop a
method to compute correlations between these datasets [61]. Specifically, each
environmental chemical expression signature is queried for enrichment for
genes expressed in the disease expression set. If a disease expression dataset
has more genes expressed in a chemical signature than expected by chance alone,
one concludes the chemical signature and disease expression states are highly
correlated. This process is repeated for every chemical in the Envirome Map.
Again, multiple hypothesis testing is accounted for by estimating the FDR, and the
top-ranked correlations are the top hypotheses generated from the procedure.
We utilized the Envirome Map to create hypotheses regarding breast, lung,
and prostate cancers (Chapter 2, Figure 5F). Specifically, gene signatures of
established factors such as estrogens were highly correlated with breast cancer.
We also observed an endocrine disruptor, bisphenol A, to be associated with
expression states of the disease. However, we lack the directionality of such
associations, and further experimentation is needed to characterize the
toxicokinetics and toxicodynamics of BPA in relation to breast cancer. We
discuss validation of these hypotheses in the next section.
Discussion
We have introduced a representation of the collection of all dynamic and
specific environmental factors called the envirome. In our brief survey of how
this entity is studied, we observed its breadth and heterogeneity (Figure 1,
Figure 2, Figure 3). However, despite its breadth, the envirome is not studied
as rigorously as the genome. To this end, we propose population- and
molecular-scale methods to enable scalable hypothesis generation between the
envirome and disease.
The next question is: what happens with these hypotheses? First, we discuss and
introduce methods for validation to infer population risk; second, we discuss
new study designs to investigate the molecular response of predicted factors.
On a population scale, validation of associations to affirm risk ideally occurs in
diverse populations in study designs that minimize confounding and reverse
causal bias. For example, the randomized trial is the “gold standard” for
validation of therapeutics. As directly randomized trials are not possible for
factors with adverse associations, prospective studies may be executed,
utilizing cohorts followed through time or even of multiple familial
generations, such as the Framingham Study [62, 63]. However, while we may
understand the temporal pattern of the exposure-disease relationship, biases
still cannot be excluded.
Taking a step back, standardization of methodology and measurements has
enabled validation of results and verification of risk estimates derived from
genome-wide studies. For example, it is now typical for large consortia to
validate genetic results; recent GWAS results have been
strengthened by multiple meta-analyses covering up to
100,000 individuals around the world [64, 65]. We argue that data standards
for the envirome would enable validations similarly, aiding design of
longitudinal studies that can be combined for meta-analyses. On this point, the
PhenX project is centered on building consensus on the type and measurement of
high-priority environmental factors and phenotypes [66, 67]; however, its measures
have yet to be adopted in high-profile epidemiological studies.
Introduction of standards would enable methods such as “Mendelian
randomization” to be comprehensively evaluated as a tool for validation [68,
69]. Mendelian randomization provides a way to approximate a randomized
trial through use of genetic loci that vary with exposure independent of
phenotypic state. Therefore, the association between disease and an exposure
is mimicked by the association between the genetic variant and disease. Given
that genetic variants assort randomly, we avoid the biases in traditional
association analyses by using variants as stand-ins for exposures. Following from this, a set of these
variants can possibly be utilized to validate factors found in EWAS. The
central challenge would be in determining which variants vary with the
massive numbers of possible factors that could be found in an EWAS. Further,
the variation must be described in populations of interest. Nevertheless,
GWAS have explored genetic variation in relation to environmental factors,
such as smoking dependence and consumption [70, 71], alcohol consumption
and dependence [72-74], infection susceptibility [75-77], coffee consumption
[78], exercise [79], and vitamin B levels [80], in addition to existing
“pharmacogenetics” studies which associate genetic variation to drug response
[81]. For example, suppose one hypothesizes a relationship between coffee
consumption and diabetes. Suppose also a (hypothetical) genetic variant called
COFFEE has been identified as a strong marker correlated with the amount of
coffee an individual drinks per day. Evidence that the COFFEE
marker is associated with diabetes would then support our original hypothesis; if it
is not, we might conclude that the original association is biased. This is just a
simple example; there is a need for investigation and novel methods that utilize
sets of such proxy genetic variants as means to validate environment-disease
associations.
On the other hand, we must test how chronic, low doses of specific
environmental factors modulate molecular responses in model, but clinically
relevant, systems in order to learn about disease biology. On a molecular scale,
it has long been possible to attain a wealth of both phenotypic and genotypic
data from model systems [82] and these methods should complement
toxicology to elucidate disease biology of hypothesized factors [83]. Ideally,
however, we should attempt to study molecular response in actual populations,
eventually investigating both disease biology and population risk
simultaneously.
In fact, initial investigations have blurred the line between traditional
molecular- and population-based approaches in studying external factors in
context of complex phenotypes. For example, Idaghdour and colleagues have
shown the relationship between genetic variation and leukocyte gene
expression in the context of urban and/or rural habitation among 194
individuals in Morocco [84]; however, geography is an abstraction of specific
environmental factors. Future studies of similar design should assess how
hypothesized factors correlate with changes in genome-wide molecular
measures on a population scale among diseased individuals.
In conclusion and in the following chapters, we describe and apply methods to
comprehensively associate multiple specific environmental factors, a subset of
the “envirome”, to complex disease for hypothesis generation. Specifically,
we introduce methods to link molecular responses to disease states through a
representation we call the “Envirome Map”. Second, we describe population-
level methods to find novel and robust associations between the envirome and
disease called EWAS. Third, we apply EWAS in the context of T2D and
serum lipid levels. Last, we show how we can integrate EWAS findings with
GWAS to predict how environmental factors modify genetic risk. Future work
in deciphering environmental contributions to disease will benefit from
specific definition of the envirome, standardization of its measurement, and
comprehensive integration of molecular-scale measures on at-risk populations.
CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES
TO COMPLEX DISEASE
“All substances are poisons; there is none which is not a poison.”
-- Paracelsus (1493–1541)
INTRODUCTION
Certainly Paracelsus, a Renaissance physician credited with
the beginnings of the study of “poisons” and toxicology, would remark
similarly if living today. As a result of modernization and commercialization,
the breadth of “poisons”, which we will call in the abstract environmentally
sourced physical exposures, has become immense in type and property [85].
As introduced in Chapter 1, toxicology is the study of the effect of
physical agents on biological systems, often with regard to
adverse effect [3].
Specifically, toxicologists try to understand fundamental mechanisms – be they
biochemical, cellular, genetic, or molecular – of these effects. The history of
the field goes back to Paracelsus’ time, and there is a rich literature on
methodology for elucidating these effects. One area of practice in toxicology is
known as “risk assessment”, or prediction of how toxicological effect results in
changes in health [3]. However, “risk assessment” is framed in terms of potential,
immediate hazard and non-chronic effect, inferred from model systems and
organisms exposed to high doses [11, 13]. Toxicological risk assessment says very
little about chronic or complex disease [86], which is the subject of this
dissertation.
Nonetheless, our knowledge regarding the ways chemical exposures induce
low-level biological response is increasing with the advent of high-throughput
measurement and screening modalities [12, 54, 87, 88]. However, while
toxicological response remains unconnected to complex disease and public
health, it is also currently difficult to ascertain multiple associations of
chemicals to health status without significant experimental investment or large-
scale epidemiological study. Use of publicly available environmental
chemical factor and genomic response data – such as toxicogenomic gene
expression data – may facilitate the discovery of these associations.
What is “toxicogenomics”? Toxicogenomics ramps up signatures of
“biological response” to the dimension of the entire genome. That is,
toxicogenomics refers to the patterns of changes in response due to exposure to
physical agents, measured via modalities in functional genomics, such as
proteomic mass spectrometry and gene expression microarrays, and analyzed
using bioinformatics techniques [51]. These modalities have become
entrenched in functional genomics such that there already exist rich,
publicly available data sources and methods to explore toxicogenomic-level
response (Chapter 1).
In the following, we propose to use pre-existing datasets and knowledge-bases
in order to derive hypotheses regarding chemical toxicological association to
disease without upfront experimental design, extending the work of
toxicogenomics. Specifically, we have asked what environmental chemicals
could be associated with gene expression data of disease states such as cancer,
and what analytic methods and data are required to query for such correlations.
This study describes a method for answering these questions. We integrated
publicly available data from gene expression studies of cancer and toxicology
experiments to examine disease/environment associations. Central to our
investigation was the Comparative Toxicogenomics Database (CTD) [54],
which contains information about chemical/gene/protein responses and
chemical/gene/disease relationships, and the Gene Expression Omnibus (GEO)
[53], the largest public gene expression data repository. Information in the
CTD is curated from the peer-reviewed literature, while gene expression data
in GEO is uploaded by submitters of manuscripts. We use the CTD to create
an “Envirome Map” which is ultimately used to create hypotheses about the
molecular links between environmental factors and disease states (Figure 5D-
F).
Most approaches to date to associate environmental chemicals with genome-
wide response can be put into two categories. These approaches either 1.) have
tested a small number of chemicals on cells and measured responses on a
genomic scale, or 2.) have used existing knowledge bases, such as Gene Ontology,
to associate annotated pathways to environmental insult.
The first method involves measuring physiological response on a gene
expression microarray. This approach allows researchers to test chemical
association on a genomic scale, but the breadth of discoveries is constrained by
the number of chemicals tested against a cell line or model organism. These
experiments are not intended for hypothesis generation across hundreds of
potential chemical factors with multiple phenotypic states. Only a few
chemicals can be tractably tested for association to gene activity [89, 90], or
disease on cell lines [91], or on model organisms, including rat and mouse [92].
In rare cases, this approach has reached the level of hundreds or thousands of
chemical compounds, such as the Connectivity Map, developed by Lamb,
Golub, and colleagues [52], which attempts to associate drugs with gene
expression changes. After measuring the genome-wide effect on gene
expression after application of hundreds of drugs at various doses, drug
signatures are calculated and are then queried with other datasets for which a
potential therapeutic is desired. While this has proven to be an excellent
system to find chemicals that essentially reverse the genome-wide effects seen
in disease, the approach of measuring gene expression and calculating
signatures across tens of thousands of environmental chemicals is not always
feasible or scalable. Although other data-driven approaches have been
described [93], few have given insight into external causes of disease.
A second approach has been to use knowledge bases, such as Gene Ontology
[94] to aid in the interpretation of genomic results. For example, Gene
Ontology analysis of a cancer experiment might elucidate a molecular
mechanism related to an environmental chemical. Unfortunately, there is still
a lack of methodology to derive hypotheses for environmental-genetic
associations in disease pathogenesis, as Gene Ontology and general gene-set
based approaches have limited information on environmental chemicals.
In contrast to the previous approaches, we claim that the integration of pre-
existing data and knowledge bases can derive hypotheses regarding the
association of chemicals to gene activity and disease from multiple datasets in
a scalable manner. Gohlke et al. have proposed an approach to predict
environmental chemicals associated with phenotypes also using knowledge
from the CTD [95]. Their method utilizes the Genetic Association Database
(GAD) [96] to associate phenotypes to genetic pathways and the CTD to link
pathways to environmental factors. This method has proved its utility,
allowing for production of hypotheses for chemicals associated with diseases
categorized as metabolic or neuropsychiatric disorders. However, in its current
configuration, their method is dependent on the GAD, which contains statically
annotated phenotypes in relation to genes containing variants; such DNA
changes are not likely to be reflective of molecular profiles of tissues being
suspected for environmental influence. Unlike this method, our proposed
approach is tissue- and data-driven in that the phenotype is determined by the
individual measurements of gene expression in cells and tissues, allowing for
the dynamic capture of phenotypes.
The approach we propose here is agnostic to experiment protocol, such as cell
line or chemical agent tested, and provides for a less resource-intensive
screening of chemicals to biologically validate. Our methodology essentially
combines the best features of these current approaches. We start by compiling
“chemical signatures” in a scalable way using the CTD. As the CTD is a hand-
curated collection of chemical-response data, a chemical signature
can theoretically be constructed from primary data of individual chemical
expression experiments in GEO. These chemical signatures capture known
changes in gene expression secondary to hundreds of environmental chemicals.
The representation of these gene expression states for all of these chemicals we
dub the “Envirome Map”, introduced in Chapter 1 (Figure 5D). In the
following, we describe how to merge the Envirome Map with disease states
(Figure 5D-F).
In a manner similar to how Gene Ontology categories are tested for
over-representation, we then calculate the genes differentially expressed in
disease-related experiments and determine which chemical signatures are
significantly over-represented. We first verified the accuracy of our
methodology by analyzing microarray data of samples with known chemical
exposure. After these verification studies yielded positive results, we then
applied the method to predict disease-chemical associations in breast, lung, and
prostate cancer datasets. We validated some of these predictions with curated
disease-chemical relations, warranting further study regarding pathogenesis
and biological mechanism in context of environmental exposure. Our method
appears to be a promising and scalable way to use existing datasets to connect
genome-wide toxicological response to disease [61].
METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO
GENE EXPRESSION RESPONSE
The Comparative Toxicogenomics Database (CTD) includes manually-curated,
cross-species relations between chemicals and genes, proteins, and mRNA
transcripts [97]. In January 2009, we downloaded the knowledge base, which
spanned 4,078 chemicals, 15,461 genes, and 85,937 relationships between them.
An example of a relationship in the CTD is “Chemical TCDD results in
higher expression of CYP1A1 mRNA as cited by Anwar-Mohamed et al. in H.
sapiens” (demonstrated in Figure 6A). The median, 70th, and 75th percentiles of
the number of genes related to a chemical are 2, 5, and 7, respectively.
With the single gene, single chemical relationships, we created “chemical
signatures”, or gene sets associated with each chemical (Figure 6B). Gene sets
were created from gene-expression relations spanning 249 species, but most
relations came from H. sapiens, M. musculus, R. norvegicus, and D. rerio. We
eliminated chemical-gene sets that had fewer than 5 genes in the set. This step
yielded a total of 1,338 chemical-gene sets.
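As a minimal illustration of this step, the following Python sketch groups single chemical-gene relations into gene sets and applies the 5-gene minimum. It is not the code used in this work (our analyses were conducted in R). The function name build_signatures, the input format of (chemical, gene) pairs, and the toy relations, including the placeholder genes GENE_X and GENE_Y and the chemical ChemicalX, are assumptions made for illustration only.

```python
from collections import defaultdict

MIN_GENES = 5  # minimum signature size stated in the text


def build_signatures(relations, min_genes=MIN_GENES):
    """Group (chemical, gene) pairs into gene sets and drop small sets."""
    gene_sets = defaultdict(set)
    for chemical, gene in relations:
        gene_sets[chemical].add(gene)  # duplicate relations collapse into one entry
    return {
        chemical: genes
        for chemical, genes in gene_sets.items()
        if len(genes) >= min_genes
    }


# Toy relations for illustration only (not taken from the CTD download).
toy_relations = [
    ("Dioxins", "CYP1A1"), ("Dioxins", "AHR"), ("Dioxins", "AHR2"),
    ("Dioxins", "GENE_X"), ("Dioxins", "GENE_Y"), ("Dioxins", "CYP1A1"),
    ("ChemicalX", "GENE_X"), ("ChemicalX", "GENE_Y"),
]
signatures = build_signatures(toy_relations)
```

In this toy input, only the "Dioxins" set reaches 5 distinct genes, so "ChemicalX" is dropped.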
We assemble the Envirome Map (Figure 7) by aggregating all 1,338 chemical-
gene signatures from the CTD (Figure 6B). Specifically, each signature can be
depicted as a vector whereby each entry represents a functional link between a
chemical and gene. The Envirome Map is a collection of these vectors. While
we present and apply the Map as a matrix of binary associations, it can easily
be configured to represent a richer set of relationships, such as ordinal values
depicting the scale of association.
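The binary matrix form can be sketched as follows, assuming signatures are held as a mapping from chemical name to gene set. The function name envirome_map, the row and column ordering (alphabetical), and the toy signatures are assumptions for illustration; the toy sets are smaller than the 5-gene minimum only to keep the example short.

```python
def envirome_map(signatures):
    """Return (genes, chemicals, matrix) with matrix[i][j] = 1 if gene i is in signature j."""
    genes = sorted({gene for gene_set in signatures.values() for gene in gene_set})
    chemicals = sorted(signatures)
    matrix = [
        [1 if gene in signatures[chemical] else 0 for chemical in chemicals]
        for gene in genes
    ]
    return genes, chemicals, matrix


# Toy signatures for illustration only.
toy_signatures = {
    "Dioxins": {"CYP1A1", "AHR", "AHR2"},
    "ChemicalX": {"AHR", "GENE_X"},
}
genes, chemicals, matrix = envirome_map(toy_signatures)
```

Replacing the 1/0 entries with, for example, citation counts would give the richer ordinal representation mentioned above.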
The CTD also contains curated data regarding the association of diseases to
chemicals. These associations are either shown in an experimental model
physiological system or through epidemiological studies. We used these
curated associations to validate our predicted factors associated to disease.
There are 3,997 disease-chemical associations in the CTD, consisting of 653
diseases (annotated by unique MeSH terms) and 1,515 chemicals (Figure 6C).
The median, 70th, 75th, and 80th percentiles of the number of curated
chemicals per disease are 2, 3, 4, and 5, respectively.
Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD). A.) The CTD contained 85,937 total unique chemical-gene relations over 4,078 chemicals and 15,461 genes. Each relation had one or more citations of support. An example hypothetical relation, “TCDD leads to higher expression of CYP1A1 mRNA in H. sapiens as shown in Anwar-Mohamed et al.”, is seen on the right panel. B.) Creation of chemical-gene set relations. Each chemical-gene relation had a number of citations of support, xi. For each chemical, we constructed a gene set, or “signature”, from the individual chemical-gene relations. We filtered out signatures that had fewer than 5 genes in the set, leaving a total of 1,338 chemical-gene sets. An example of one chemical-gene set (a column of Figure 5D, Figure 7) is seen on the right panel of B: the genes CYP1A1, AHR, and AHR2 are shown to have multiple citations for the relation, 60, 40, and 9 respectively. Together, these signatures form the “Envirome Map” (Figure 7). C.) Representation of disease-chemical associations in the CTD, used for validation.
[Figure 6 panels. A: CTD contents (4,078 chemicals; 15,461 genes; 249 organisms; 85,937 total unique chemical-gene relations), with the example “Chemical TCDD leads to higher expression of CYP1A1 mRNA in H. sapiens (Anwar-Mohamed et al)”. B: 1,338 total chemical-gene sets or “signatures”, with the example gene set for the Dioxins chemical (CYP1A1, AHR, and AHR2, with 60, 40, and 9 references); xi denotes the number of references for a chemical-gene relation. C: disease-chemical relations for 653 diseases, with the example chemical set for the “Prostatic Neoplasms” disease (MeSH: D011471): sodium arsenite, cadmium, and bisphenol A; xi denotes the number of references for a disease-chemical relation.]
Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures. A.) Each chemical in the CTD is functionally associated with a set of genes, described earlier (Left panel, see also Figure 6B). This signature can be represented in the ‘Envirome Map’ (Right panel), whereby each column represents, in binary form, the genes (rows) associated with each environmental chemical. B.) An example representation for the “Dioxins” signature. The entire Envirome Map is populated with 1,338 signatures (columns).
We built a system to test whether genes significantly differentially expressed
within a gene expression dataset could be associated with any of the calculated
chemical signatures in the Envirome Map (Figure 8A). We conducted two
phases of analysis in this study. The first phase was a verification one, testing
whether the method could accurately predict known chemical exposures
applied to samples (Figure 8B). Our inputs for this first phase were gene
expression datasets of chemically-exposed samples and unexposed control
samples, and our outputs were lists of chemicals predicted to be associated with
each dataset. The second phase, the investigation phase, involved predicting chemicals
associated with cancer gene expression datasets (Figure 8C). Our inputs for this
second phase were gene expression datasets of cancer samples and control
samples, and our outputs were lists of chemicals predicted to be associated with
each dataset. We attempted to validate these findings further by using curated
disease-chemical relations (Figure 8D). Finally, we attempted to group our
chemical predictions associated with the cancer datasets by PubChem-derived
BioActivity similarity measures, seeking further evidence of potential
underlying mechanism or similar modes of action between chemicals.
Figure 8. Predicting environmental chemical association to gene expression datasets. A.) A representation of the 1,338 chemical-gene sets in the Envirome Map. B.) For the verification step, we conducted SAM to find genes whose expression was altered in each of our datasets. We then mapped the differentially expressed genes to corresponding genes of other species in our database by using Homologene. For each chemical-gene set signature in the Map, we conducted a hypergeometric test for enrichment and ranked each result by p-value. C.) We applied the approach used in B to predict chemical association to prostate, breast, and lung cancer data and validated these results with curated disease-chemical annotations from the CTD, represented in D. D.) Representation of the curated disease-chemical associations in the CTD.
We used Significance Analysis of Microarrays (SAM) software to select
differentially expressed genes from a microarray experiment [98]. The FDR
for SAM for all of our predictions was controlled at a maximum of 5 to
7% in order to reduce false associations.
We mapped microarray annotations to other corresponding representative
species, H. sapiens, M. musculus, and R. norvegicus using Homologene [99].
In the CTD, gene identifiers were commonly associated with H. sapiens;
however, some are mapped to specific organisms, such as M. musculus and R.
norvegicus. Most mappings in the CTD are among these 3 organisms. By
mapping our expression annotation to these organisms, we ensured gene
compatibility with a large portion of the CTD.
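The cross-species mapping can be pictured with the sketch below. It assumes homology information has already been loaded into a list of groups, each a plain dictionary from species to gene symbol; this layout, the function name map_to_species, and the example groups are hypothetical and do not reflect the actual Homologene file format or any Homologene API.

```python
def map_to_species(genes, homology_groups, target_species):
    """Map (species, gene) pairs to their homologs in target_species, where one exists."""
    lookup = {}
    for group in homology_groups:
        for species, gene in group.items():
            lookup[(species, gene)] = group
    mapped = set()
    for species, gene in genes:
        group = lookup.get((species, gene))
        if group is not None and target_species in group:
            mapped.add(group[target_species])
    return mapped


# Hypothetical homology groups, for illustration only.
toy_groups = [
    {"H. sapiens": "CYP1A1", "M. musculus": "Cyp1a1", "R. norvegicus": "Cyp1a1"},
    {"H. sapiens": "AHR", "M. musculus": "Ahr"},
]
mouse_genes = [
    ("M. musculus", "Cyp1a1"),
    ("M. musculus", "Ahr"),
    ("M. musculus", "Unmapped1"),  # no homology group, so it is skipped
]
human_genes = map_to_species(mouse_genes, toy_groups, "H. sapiens")
rat_genes = map_to_species(mouse_genes, toy_groups, "R. norvegicus")
```

Genes without a homolog in the target species are silently dropped in this sketch; a real pipeline would likely want to report how many were lost.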
We checked for enrichment of differentially expressed genes among each of
the 1,338 chemical-gene sets in the Envirome Map with the hypergeometric
test. To account for multiple hypothesis testing, we computed the q-value, or
false discovery rate for a given p-value, by using 100 random resamplings of
genes from the microarray experiment and testing each of these random
resamplings for enrichment against each of the 1,338 chemical-gene sets. This
methodology is similar to the q-value estimation method described in
“GoMiner”, a gene ontology enrichment assessment tool [100]. We called a
prediction positive if it passed a chosen p-value and q-value
threshold in our list of 1,338 tested associations. All analyses were conducted
using the R statistical environment [101].
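The enrichment test and the resampling-based q-value can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the R code used here: the upper-tail hypergeometric p-value is standard, but the specific q-value estimator (expected number of null p-values at or below the observed p-value per resampling, divided by the number of observed p-values at or below it) is an assumption, as are all function names and the toy data.

```python
import random
from math import comb


def hypergeom_pvalue(universe_size, set_size, n_selected, overlap):
    """P(X >= overlap) when drawing n_selected genes without replacement."""
    upper = min(set_size, n_selected)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, n_selected)


def enrichment_pvalues(selected, universe, signatures):
    """Hypergeometric p-value of each signature against the selected genes."""
    selected = set(selected)
    pvalues = {}
    for chemical, gene_set in signatures.items():
        measured = gene_set & universe  # only genes present on the array count
        overlap = len(measured & selected)
        pvalues[chemical] = hypergeom_pvalue(
            len(universe), len(measured), len(selected), overlap
        )
    return pvalues


def resampling_qvalues(selected, universe, signatures, n_resamples=100, seed=0):
    """Estimate a q-value per signature from random gene lists of the same size."""
    rng = random.Random(seed)
    observed = enrichment_pvalues(selected, universe, signatures)
    pool = sorted(universe)
    null_pvalues = []
    for _ in range(n_resamples):
        random_genes = rng.sample(pool, len(selected))
        null_pvalues.extend(enrichment_pvalues(random_genes, universe, signatures).values())
    qvalues = {}
    for chemical, p in observed.items():
        expected_false = sum(q <= p for q in null_pvalues) / n_resamples
        n_called = sum(o <= p for o in observed.values())  # at least 1 (itself)
        qvalues[chemical] = min(1.0, expected_false / n_called)
    return observed, qvalues


# Toy data for illustration only.
universe = {f"g{i}" for i in range(50)}
toy_signatures = {
    "Dioxins": {f"g{i}" for i in range(5)},
    "ChemicalX": {f"g{i}" for i in range(40, 45)},
}
differentially_expressed = {f"g{i}" for i in range(6)}
observed, qvalues = resampling_qvalues(differentially_expressed, universe, toy_signatures)
```

Exact integer binomial coefficients are used here to avoid a dependency on a statistics library; for array-sized universes a library routine for the hypergeometric tail would be the more practical choice.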
Method Verification Phase
For our verification phase, we surveyed publicly available data from the Gene
Expression Omnibus (GEO) for experiments in which sets of samples exposed
to chemicals were compared with controls. We found and used six datasets in
the verification phase. Set 1 included GSE5145 (3 study samples and 3 controls)
in which H. sapiens muscle cell samples were exposed to Vitamin D [102].
Set 2 was GSE10082 (6 study samples and 5 controls) in which wild-type M.
musculus were exposed to tetrachlorodibenzodioxin (TCDD) [103]. Set 3 was
GSE17624 in which H. sapiens Ishikawa cells (4 study samples and 4 controls)
were exposed to high doses of bisphenol A (no reference). Set 4 was GSE2111
in which H. sapiens bronchial tissue (4 study samples and 4 controls) was
exposed to zinc sulfate [104]. The CTD had some chemical-gene relations
based on this dataset; we removed these relations prior to computing the
predictions for this dataset. Set 5 was GSE2889 in which M. musculus thymus
tissues (2 study samples and 2 controls) were exposed to estradiol [105].
Finally, set 6 was GSE11352 in which H. sapiens MCF-7 cell line was
exposed to estradiol at 3 different time points [106]. In all cases except for set
6, we treated SAM analysis as unpaired t-tests; for set 6, we used the time-
course option in SAM. See Table 2 for the number of differentially expressed
genes found for each dataset along with their median false discovery rate.
Dataset | Chemical Tested | Number of Samples/Controls (tissue type) | SAM: median FDR | Number of Differentially Expressed Genes / Total
GSE5145 [102] | Vitamin D3 | 3/3 (H. sapiens muscle) | 0.04 | 805/20555
GSE10082 [103] | TCDD | 6/5 (M. musculus injection) | 0.05 | 2066/21863
GSE17624 | Bisphenol A | 4/4 (H. sapiens Ishikawa cells)* | 0.04 | 8406/20828
GSE2111 [104] | Zinc sulfate | 4/4 (H. sapiens bronchial tissue) | 0.05 | 31/13306
GSE2889 [105] | Estradiol | (M. musculus thymus) | 0.07 | 112/13383
GSE11352 [106] | Estradiol | (H. sapiens MCF7) | 0.05 | 114/20555
Table 2. Gene expression dataset summary for verification stage. 1st column denotes GEO accession, 2nd column is the chemical exposed to the samples. 4th column is the median FDR for SAM. * denotes “high” dosage of Bisphenol A used for the exposed sample group.
Predicting Environmental Factors Associated with Disease-related Gene
Expression Data Sets: Prostate, Lung, and Breast Cancer
We used previously measured cancer gene expression datasets to identify
potential environmental associations with cancer. We used measurements from
human prostate cancer from GSE6919 [107, 108], lung cancer from GSE10072
[109], and breast cancer from GSE6883 [110]. We conducted all SAM
analyses using an unpaired t-test between disease and control samples. See
Table 2 for the number of differentially expressed genes measured for each
dataset along with the level of FDR control.
We deliberately chose cancer datasets that used a different population of
controls rather than normal tissues from the same patients. The prostate cancer
dataset (GSE6919) consisted of 65 prostate tissue cancer samples and 17
normal prostate tissue samples as controls.
The lung cancer dataset (GSE10072) consisted of two patient groups: non-
smokers with cancer (historically and currently), and current smokers with
cancer. We conducted the predictions on these groups separately. The cancer-
non smoker group consisted of 16 samples and the cancer-smoker group had
24 samples. The control group consisted of 15 samples.
The breast cancer dataset (GSE6883) consisted of two distinct cancer sub-
groups: non-tumorigenic and tumorigenic. As with the lung cancer data, we
conducted our predictions on these groups separately. The non-tumorigenic
group consisted of three samples and the tumorigenic group had six samples.
The control group contained three samples.
We then validated our highly ranked factor predictions with disease-chemical
knowledge from the CTD. In particular, we determined if the highly
significant chemicals in our prediction list included those that had a curated
relationship with cancer in the CTD (disease-chemical relation). This step was
similar to measuring association to chemicals via enriched gene sets using the
hypergeometric test as described above. We used curated factors associated
with Prostatic Neoplasms (MeSH ID: D011471), Lung Neoplasms (D008175),
and Breast Neoplasms (D001943), to validate our predictions generated with
the prostate cancer, lung cancer, and breast cancer datasets respectively.
Further, we assessed the validation by computing the actual number of false
positives and true negatives. To compute this number, we assessed whether
the prediction list was enriched for chemicals associated with any of the other
diseases in the CTD at a higher significance level than the true disease; for this
test, we chose diseases that had at least 5 chemical associations, a total of 141
diseases. As an example, to assess the false positive rate for the prostate
cancer (MeSH ID: D011471) predictions, we determined the curated
enrichment of our predictions for all 140 other disease-chemical sets and
counted the number of diseases that had a lower p-value than that computed for
D011471.
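The false positive count described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name, the dictionary layout, and the toy p-values are all hypothetical, and the enrichment p-values are assumed to have been computed beforehand with the hypergeometric test.

```python
def false_positive_rate(enrichment_p, n_curated, true_disease, min_chemicals=5):
    """Fraction of other eligible diseases enriched more strongly than the true disease.

    enrichment_p: dict mapping disease ID -> enrichment p-value of the prediction list
    n_curated:    dict mapping disease ID -> number of curated chemical associations
    """
    p_true = enrichment_p[true_disease]
    # Only diseases with enough curated chemicals are compared (at least 5 in the text).
    others = [d for d in enrichment_p
              if d != true_disease and n_curated[d] >= min_chemicals]
    # A "false positive" is any other disease with a lower p-value than the true one.
    false_positives = sorted(d for d in others if enrichment_p[d] < p_true)
    return len(false_positives) / len(others), false_positives


# Hypothetical values for illustration only; these are not results from the CTD.
toy_p = {"D011471": 1e-7, "disease_a": 1e-9, "disease_b": 0.2,
         "disease_c": 0.5, "disease_d": 1e-12}
toy_n = {"D011471": 60, "disease_a": 30, "disease_b": 12,
         "disease_c": 8, "disease_d": 3}
rate, hits = false_positive_rate(toy_p, toy_n, "D011471")
```

In this toy example, disease_d is excluded because it has fewer than 5 curated chemicals, so only disease_a counts against the true disease.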
Clustering Significant Predictions By PubChem-derived Biological Activity
Chemical-gene sets derived from the CTD are but one representation of how a
chemical might affect biological activity. Biological activity of chemicals may
also be derived from high-throughput, in-vitro chemical screens such as those
archived in PubChem [87, 111]. Specifically, the PubChem database provides
a large number of phenotypic measurements (or “BioAssays”) for many of the
chemicals we predicted for cancer. In addition, PubChem provides tools to
compare BioAssay measurements for different chemicals. Quantitative and
standardized BioAssay measurements (normalized “scores”) allow comparison
of biological activities of chemicals and derivation of biological activity
similarity between chemicals. For example, PubChem represents the biological
activity of a compound through a vector of BioAssay scores and assembles a
bioactivity similarity matrix between each pair of chemicals with this data.
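To make the idea of a bioactivity similarity matrix concrete, the sketch below correlates vectors of BioAssay scores for each pair of chemicals. It assumes Pearson correlation over complete score vectors; the exact similarity measure used by the PubChem tool is not specified here, and the chemical names and scores are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no information about similarity
    return sxy / sqrt(sxx * syy)


def similarity_matrix(scores):
    """scores: dict mapping chemical -> list of BioAssay scores (same assays, same order)."""
    names = sorted(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical score vectors over four assays.
toy_scores = {
    "chem_a": [1, 2, 3, 4],
    "chem_b": [2, 4, 6, 8],
    "chem_c": [4, 3, 2, 1],
}
sim = similarity_matrix(toy_scores)
```

Here chem_a and chem_b have identical activity profiles up to scale, while chem_c is anti-correlated with both.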
We sought further external evidence of the relevance of the predicted
chemicals through comparison of their patterns of PubChem-sourced biological
activity (Figure 9). First, we produced a list of chemical predictions for each
cancer dataset as described above (Figure 8, Figure 9A, Figure 9B). Second, we
submitted our list of chemicals to PubChem for activity comparison (Figure 9).
Finally, we clustered the chemicals in the prediction list by their biological
activity and looked for patterns of correlation between the PubChem-derived
biological activities of the compounds and their chemical-gene set association
significance.
Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem. A.) A representation of the CTD-based Envirome Map as shown in detail in Figure 6. B.) Prediction of the chemicals associated to each cancer dataset using chemical-gene sets from the CTD. We selected highly significant chemical predictions for each cancer and clustered these chemicals by their “Bioactivity” similarity as defined and computed in PubChem. C.) Within PubChem, each of these chemicals has a vector of standardized BioAssay scores. PubChem had 790 BioAssay scores for 66 of our significant predictions. The PubChem BioActivity similarity tool uses these vectors of scores to compute the biological activity similarity for each pair of chemicals and represents the similarity as a matrix.
RESULTS
We implemented a method to predict a list of environmental factors associated
with differentially expressed genes (Figure 8). The method is centered on
creation of the Envirome Map (Figure 5D-F, Figure 7), an aggregation of
chemical-gene sets that are derived from single curated chemical-gene
response relationships in the CTD (Figure 6). We determine whether the
differentially expressed genes are associated to a chemical by assessing if the
expressed genes are enriched for a chemical-gene set, or contain more genes
from the chemical-gene set than expected at random using the hypergeometric
test. We applied this method in two phases: first, a verification phase in
which we sought to rediscover known exposures applied to samples, and second,
a query phase in which we sought to find factors associated with cancer gene
expression datasets. We refer to significant chemical-gene set associations to
gene expression data as “associations” or “predictions” in the following.
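The enrichment test at the center of the method can be sketched as below. This is an illustrative implementation under stated assumptions rather than the code used in this work: the function names and the toy gene identifiers are hypothetical, and the gene universe is assumed to be the set of genes measured on the array.

```python
from math import comb


def enrichment_p_value(de_genes, chemical_gene_set, measured_genes):
    """Hypergeometric P(X >= k): chance of seeing at least k gene-set members
    among the differentially expressed (DE) genes if genes were drawn at random."""
    measured = set(measured_genes)
    de = set(de_genes) & measured
    gene_set = set(chemical_gene_set) & measured
    k = len(de & gene_set)                      # observed overlap
    N, K, n = len(measured), len(gene_set), len(de)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def rank_chemicals(de_genes, envirome_map, measured_genes):
    """Return (chemical, p-value) pairs, most significant first."""
    results = [(chem, enrichment_p_value(de_genes, genes, measured_genes))
               for chem, genes in envirome_map.items()]
    return sorted(results, key=lambda pair: pair[1])


# Hypothetical toy data: 100 measured genes, 10 of them differentially expressed.
measured = [f"g{i}" for i in range(100)]
de = [f"g{i}" for i in range(10)]
toy_map = {
    "chem_hit": {"g0", "g1", "g2", "g3", "g50"},   # 4 of 5 genes are DE
    "chem_miss": {"g60", "g61", "g62", "g63", "g64"},  # no DE genes
}
ranked = rank_chemicals(de, toy_map, measured)
```

Exact integer binomial coefficients are used so that the tail probability is not affected by floating-point underflow for very small p-values.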
Verification Phase
We first applied our method to gene expression data from experiments in
which samples were exposed to specific chemicals, reasoning that if our
method could identify these known chemical exposures, we could use the
method to predict chemicals that may have perturbed gene expression in
unknown experimental or disease conditions. Our goal was to determine
where a gene expression-altering chemical might lie in the range of
significance rankings applied by the prediction method.
We applied our method on datasets that measured gene expression after
exposure to vitamin D, tetrachlorodibenzodioxin (TCDD), bisphenol A, zinc,
and estradiol (2 datasets) on different tissue types. Table 3 shows the results of
our predictions along with a subset of genes in the chemical-gene set that were
differentially expressed.
Actual Chemical Exposure (GEO accession) | Chemicals Predicted | Hypergeometric P-value | Rank (Percentile) | q-value | Relevant Genes Expressed
Vitamin D3 on H. sapiens muscle cells (GSE5145) | Calcitriol | 1x10^-23 | 1 (100) | ~0 | VDR (25), CYP24A1 (14)
TCDD on M. musculus (GSE10082) | TCDD | 2x10^-15 | 3 (99) | ~0 | CYP1A1 (59), CYP1B1 (15), AHRR (6), CYP1A2 (14)
Bisphenol A on H. sapiens Ishikawa cells (GSE17624) | Bisphenol A | 1x10^-6 | 15 (99) | ~0 | ESR1 (31), ESR2 (7), S100G (6)
Zinc sulfate on H. sapiens bronchial tissue (GSE2111) | Zinc sulfate | 3x10^-3 | 15 (99) | 0.04 | SLC30A1 (3), MT1F (2), MT1G (2)
Estradiol on M. musculus thymus (GSE2889) | Estradiol | 5x10^-3 | 17 (99) | 0.08 | C3 (6), LPL (4), CTSB (2)
Estradiol on H. sapiens MCF7 cell line (GSE11352) | Estradiol | 5x10^-3 | 19 (99) | 0.08 | ISG20 (2), MGP (2), SERPINA1 (2)
Table 3. Chemical Prediction Results from the Verification Phase. Each row represents a gene expression dataset and relevant prediction and ranking. The first column specifies the gene expression dataset, the 2nd column the actual exposure applied to the samples for the gene expression set. The 3rd and 4th columns represent the hypergeometric p-value for chemical-gene set enrichment along with the rank of the chemical in the prediction list. The 5th column shows the 5th percentile of the ranking derived from 100 random samplings of genes from the gene expression dataset. The 6th column shows notable genes expressed in the chemical-gene set along with the number of references for the chemical-gene relation in the CTD.
We were able to satisfactorily predict the exposures applied to the gene
expression datasets. We ascertained a positive prediction if the exposure had a
relatively high ranking (low p-value for enrichment) and if the q-value was
lower than 0.1. For the datasets measuring expression after exposure to
Vitamin D, calcitriol, a type of vitamin D, was ranked first in the list (p=10^-23,
q~=0). Similarly, TCDD was predicted third in its respective list (p=10^-15,
q~=0). The other exposures ranked within the top percentile, at ranks 15
to 19; their p-values were between 10^-6 and 0.01 and their q-values were
less than 0.1. We reasoned that we could detect true associations between
environmental chemicals and gene expression phenotypes provided they met
these significance thresholds.
Predicting Environmental Chemicals Associated with Cancer Data Sets
We applied our prediction methods to predict association to cancer disease
states, specifically merging the Envirome Map with prostate, breast, and lung
cancer datasets. In particular, we computed predictions for prostate cancer
from primary prostate tumor tissue, lung adenocarcinomas from lung tissue
from non-smoking individuals, and non-tumorigenic breast cancer cells grown
in mouse xenografts. To validate and select specific predictions from our
ranked list of 1,338 environmental chemicals from the Envirome Map, we
measured how enriched top-ranking chemicals were for annotated disease-chemical
citations for the diseases of interest (“Prostatic Neoplasms”, “Breast
Neoplasms”, and “Lung Neoplasms”). To call a positive chemical association
or prediction to disease phenotype, we used p-value thresholds similar to what
we observed during the verification phase (α ≤ 10^-4, 0.001, 0.01) along with
q-values as low as possible, specifically less than 0.1. For comparison, we also
used the typical p-value threshold of 0.05.
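The calling rule above can be sketched as a simple filter over the ranked prediction list. This is a minimal illustration assuming each chemical already has a p-value and a q-value; the function name, data layout, and toy values are hypothetical.

```python
def call_predictions(results, alphas=(1e-4, 1e-3, 1e-2, 0.05), q_max=0.1):
    """Select positive predictions at each p-value threshold.

    results: dict mapping chemical -> (p_value, q_value)
    Returns a dict mapping each alpha to the chemicals with p <= alpha and
    q < q_max, ordered from most to least significant.
    """
    calls = {}
    for alpha in alphas:
        selected = [chem for chem, (p, q) in results.items()
                    if p <= alpha and q < q_max]
        calls[alpha] = sorted(selected, key=lambda chem: results[chem][0])
    return calls


# Hypothetical (p, q) pairs for illustration only.
toy_results = {
    "chem_a": (4e-10, 0.0),
    "chem_b": (3e-5, 6e-4),
    "chem_c": (9e-4, 0.01),
    "chem_d": (6e-3, 0.08),
    "chem_e": (0.03, 0.2),   # passes alpha = 0.05 but fails the q-value cutoff
}
calls = call_predictions(toy_results)
```

Evaluating several thresholds at once mirrors the comparison across α values reported in Figures 10 to 12.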
Figure 10, Figure 11, and Figure 12 show the results of the disease validation
phase. In all cases, the significant chemicals contained many of the specific
curated disease-chemical relations. For example, if we call chemicals with
p-values less than 0.01 positive predictions, then we capture 18%,
16%, and 7% of all of the curated relationships for prostate, lung, and breast
cancers respectively (p=10^-7, 10^-4, and 4x10^-5). We assessed specificity of our
list by computing how many curated chemicals we found for all other diseases
in the CTD (Figure 10, Figure 11, and Figure 12, offset points in orange and
black). We achieved false positive rates of 1 to 4% for prostate cancer, 8
to 20% for lung cancer, and 2 to 10% for breast cancer. However, almost all of
the “false positives” were other types of neoplasms or cancers (Figure 10,
Figure 11, and Figure 12, examples annotated in italics/arrows). For example,
for the lung and prostate cancer predictions at α=0.001, only one disease other
than a neoplasm or carcinoma was detected: Liver Cirrhosis, Experimental
(MeSH ID: D008325).
Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets. For a prediction list, we selected chemicals that ranked within α=10^-4, 10^-3, 10^-2, and 0.05. This –log10(threshold) along with number of total chemicals found (in parentheses) for each threshold is seen on the x-axis of each figure. We tested if these highly ranked chemicals found under each threshold were enriched for chemicals that had known curated association with the cancer in question. The –log10(p-value) for this enrichment is seen on the y-axis. The solid round red marker represents the enrichment test for the actual disease for which the predictions were based; the number underneath represents the total number of chemicals found in the prediction list that had a curated association with the disease and the percent found among all curated relations for that disease. We estimated accuracy and precision by computing disease-chemical enrichment for all other diseases; false positives are offset in black and true negatives are in yellow. The false positive rate is bracketed and in italics. Examples of false positives are annotated in blue italics along with the number of chemicals found in the prediction list corresponding to that disease and the percent found among all curated relations for that disease.
[Figure 10 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (89), 2 (68), 3 (50), 4 (27). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 10. Legend: True Negatives, False Positives, Prostatic Neoplasms. Curated Prostatic Neoplasms chemicals found, from the strictest to the loosest threshold: 7 (10%), 10 (15%), 12 (18%), 13 (19%). Bracketed false positive rates: 2%, 1%, 3%, 4%. Annotated false positives: Carcinoma, Hepatocellular 10 (30%); Liver Neoplasms 10 (29%); Liver Cirrhosis 10 (16%).]
Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets. See Figure 10 for complete legend.
[Figure 11 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (84), 2 (73), 3 (42), 4 (29). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 14. Legend: True Negatives, False Positives, Lung Neoplasms. Curated Lung Neoplasms chemicals found, from the strictest to the loosest threshold: 4 (8%), 7 (14%), 8 (16%), 9 (18%). Bracketed false positive rates: 10%, 9%, 8%, 20%. Annotated false positives: Prostatic Neoplasms (15%); Carcinoma, Hepatocellular 9 (27%); Mammary Neoplasms, Experimental 7 (28%); Liver Cirrhosis 9 (15%).]
Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets. See Figure 10 for complete legend.
[Figure 12 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (86), 2 (28), 3 (11), 4 (5). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 6. Legend: True Negatives, False Positives, Breast Neoplasms. Curated Breast Neoplasms chemicals found, from the strictest to the loosest threshold: 0 (0%), 3 (3%), 7 (7%), 9 (10%). Bracketed false positive rates: 2%, 10%, 4%. Annotated false positives: Carcinoma, Hepatocellular 5 (33%); Prostatic Neoplasms 7 (11%); Skin Neoplasms 4 (19%).]
For the prostate cancer dataset, we chose a chemical signature association
threshold of 0.001 (q ≤ 0.01). Of 1,338 chemicals tested, 50 were found
under this threshold. Of these 50 predicted chemicals, 10 had a curated
relation with the MeSH term “Prostatic Neoplasms”. This amounted to
prediction of 15% of all CTD curated disease-chemical relations for the
Prostatic Neoplasms term (p = 3x10^-7). These chemicals are seen in Table 4
and include estradiol, sodium arsenite, cadmium, and bisphenol A. Also
predicted were known therapeutics, including raloxifene, doxorubicin,
genistein, diethylstilbestrol, fenretinide, and zinc. We observed that many of
the genes detected were well studied, lending additional support to our
predictions. For example, ESR2, PGR, and MAPK1 had 37, 34, and 14 references
respectively citing their activity in the context of estradiol exposure (Table 4,
second-to-right column). We also observed common occurrence of genes
such as ESR2, BCL2, and MAPK1 among some of the gene sets associated
with chemicals such as estradiol, raloxifene, sodium arsenite, doxorubicin,
diethylstilbestrol, and genistein.
Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Estradiol | 4x10^-10 | 5 (99) | ~0 | ESR2 (37), PGR (34), MAPK1 (14) | [112]
Raloxifene | 1x10^-9 | 6 (99) | ~0 | ESR2 (6), IGF1 (5), BCL2 (4) | [113]
Sodium arsenite | 1x10^-8 | 8 (99) | ~0 | JUN (13), MAPK1 (9), CCND1 (8), FOS (6) | [114]
Doxorubicin | 7x10^-7 | 11 (99) | ~0 | BCL2 (23), MAPK1 (14), TNF (10) | [115-118]
Cadmium | 6x10^-6 | 13 (99) | ~0 | MT2A (14), MT1A (12), MT3 (11), MT1 (6) | [119]
Genistein | 3x10^-5 | 19 (99) | 6x10^-4 | ESR2 (22), PGR (10), MAPK1 (5) | [120-122]
Diethylstilbestrol | 3x10^-5 | 22 (98) | 0.001 | ESR2 (8), FOS (8), HOXA10 (4) | [123, 124]
Fenretinide | 3x10^-4 | 40 (97) | 0.004 | BCL2 (3), ELF3 (2), LDHA (2) | [125]
Bisphenol A | 6x10^-4 | 47 (96) | 0.01 | PGR (8), ESR2 (7), IL4RA (2) | [112]
Zinc | 9x10^-4 | 53 (96) | 0.01 | MT3 (18), MT2A (13), MT1A (11) | [126-129]
Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919). Shown in the table is a subset of the highly ranked chemicals (p < 0.001) that were predicted to have association with prostate cancer gene expression and had evidence of association with the MeSH term “Prostatic Neoplasms” as in the CTD. The 1st column represents the chemical predicted and the 2nd and 3rd columns show the hypergeometric p-value and ranking. The 4th column shows q-value derived from random samples of genes. The 5th column shows the notable genes in the chemical-gene set that were differentially expressed. The 6th column contains references for the prostate cancer and chemical association found from the CTD.
For the lung cancer dataset, we also chose a threshold of 0.001 (q ≤ 0.004). Of
1,338 chemicals tested, 42 were found under this threshold. Of these 42
chemicals, 7 had a cited relation with “Lung Neoplasms”, 14% of all curated
disease-chemical relations for the term (p = 1x10^-5). These chemicals are seen
in Table 5. For lung cancer, we observed cited chemicals such as sodium
arsenite, vanadium pentoxide, dimethylnitrosamine, 2-acetylaminofluorene, and
asbestos. Therapeutics observed included doxorubicin and indomethacin. We
did not observe common genes represented for different chemical-gene sets,
unlike the prostate cancer predictions. Predictions for the smoker-lung cancer
samples were similar, resulting in sodium arsenite, dimethylnitrosamine, and
vanadium pentoxide, albeit through different differentially expressed genes.
Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Doxorubicin | 1x10^-6 | 16 (99) | 4x10^-4 | CASP3 (60), ABCB1 (28), BAX (26), BCL2 (23) | [130]
Sodium arsenite | 8x10^-6 | 20 (98) | 4x10^-4 | JUN (13), NQO1 (6), EGR1 (6) | [131-133]
Vanadium pentoxide | 1x10^-5 | 24 (98) | 6x10^-4 | HBEGF (3), CDK7 (1), CDKN1B (1), CDKN1C (1) | [134]
Dimethylnitrosamine | 6x10^-5 | 27 (98) | 7x10^-4 | TGFB1 (23), TIMP1 (15), PCNA (6) | [135]
Indomethacin | 2x10^-4 | 34 (97) | 0.002 | BIRC5 (3), CDKN1B (2), MMP9 (2) | [136-138]
2-Acetylaminofluorene | 3x10^-4 | 36 (97) | 0.003 | ABCB1 (4), ABCG2 (4), KRT19 (2) | [139]
Asbestos, Serpentine | 4x10^-4 | 39 (97) | 0.004 | IL6 (2), MMP9 (2), MMP12 (2), PDGFB (2) | [140]
Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072). Shown in the table are subsets of the highly ranked chemicals (p < 0.001) that were predicted to have association with lung cancer gene expression (non-smokers) and had evidence of association with the MeSH term “Lung Neoplasms”.
For the breast cancer dataset, we chose a threshold of 0.01 (q ≤ 0.08). Of
1,338 chemicals tested, 28 were found under this threshold. Of these 28
chemicals, 7 had a cited relation with “Breast Neoplasms”, 7% of all curated
disease-chemical relations for the disease (p = 4x10^-5). These chemicals are
seen in Table 6. The chemicals predicted included progesterone and bisphenol A.
Therapeutics found included indomethacin and cyclophosphamide. For chemicals
such as estradiol, genistein, and diethylstilbestrol, there was evidence of both
a harmful and a therapeutic role in breast cancer. Unlike the
predictions shown for prostate and lung cancer, the genes utilized in the
predictions for breast cancer were not as well studied, with 1 to 3 references
for the gene and environment association. We observed some commonality in
chemical-gene sets, such as the presence of IL6 and CEBPD in several of the
top chemicals predicted in association to the disease. Similar chemicals were
predicted for the tumorigenic breast cancer dataset, such as estradiol and
progesterone. However, chemicals not highly ranked in the non-tumorigenic
predictions included benzene and the therapies tamoxifen and resveratrol.
Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Progesterone | 2x10^-4 | 6 (99) | 0.01 | IL6 (3), STC1 (3), CEBPD (2) | [141, 142]
Genistein | 6x10^-4 | 10 (99) | 0.03 | CEBPD (1), APLP2 (1), MLF1 (1) | [143-145]
Estradiol | 7x10^-4 | 11 (99) | 0.03 | LPL (4), IL6 (3), CEBPD (2) | [146-150]
Indomethacin | 3x10^-3 | 17 (99) | 0.05 | CCDC50 (1), BIRC3 (1), DNAJB (1) | [151]
Diethylstilbestrol | 3x10^-3 | 18 (99) | 0.05 | IL6 (1), MARCKS (1), MXD1 (1), MMP7 (1) | [152, 153]
Cyclophosphamide | 4x10^-4 | 19 (99) | 0.06 | IL6 (3), MARCKS (1), PSMA5 (1) | [154-156]
Bisphenol A | 6x10^-3 | 21 (99) | 0.08 | CEBPD (1), MLF1 (1), DTL (1) | [157]
Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883). Shown in the table are subsets of the highly ranked chemicals (p < 0.01) that were predicted to have association with breast cancer gene expression (non-tumorigenic) and had evidence of association with the MeSH term “Breast Neoplasms”.
Some of the chemicals found were common to more than one type of cancer.
For example, we predicted chemicals such as sodium arsenite for both prostate
cancer and lung cancers, and bisphenol A for both prostate and breast cancers.
In some of the cases, the predicted chemical overlap across different cancers
is due to the expression of distinct genes for each dataset, highlighting the
potential of many possibilities for interaction between environmental
chemicals and genes.
Clustering Significant Predictions by PubChem-derived Biological
Activity
We have described a method of generating a list of chemical predictions
associated with disease-annotated gene expression datasets and applied the
method on gene expression data for several cancers, in effect merging a
comprehensive representation known as the Envirome Map with disease
datasets. We have validated a subset of our predictions with evidence from the
literature as described above.
We sought further evidence of the biological relevance of our
predictions through internal comparison of their potential activity archived in
PubChem. Specifically, we expected some degree of correlation between
“similar” chemicals and their gene set significance to the cancer datasets. We
opted to use PubChem BioActivity to assess chemical similarity, assuming this
measure of phenotypic similarity would be representative of underlying
biological pathways of action. We picked chemicals that were deemed
significant for thresholds used above (p=0.001, 0.001, 0.01, for the prostate,
lung, and breast cancer datasets) for all of the cancer datasets. This resulted in
a total of 130 chemicals, 66 of which had BioActivity data in PubChem. The
BioActivity similarity for each of the 66 chemicals was computed through 790
BioAssay scores. Figure 13 shows the –log10 of significance for the highest
ranked chemical predictions clustered by their BioActivity similarity.
We found some chemicals with similar biological activity profiles in
PubChem had similar patterns of chemical-gene set association across the
cancer datasets. For example, sodium arsenite, sodium arsenate, and
doxorubicin have closely related biological profiles as well as high
significance of chemical-gene set association for the prostate and lung cancer
data (Figure 13, enclosed in orange box); however, we did not observe this pattern for other
biologically similar chemicals such as tetrachlorodibenzodioxin. On the other
hand, we also observed correlation between the biological activity similarity
and chemical-gene set association for hormone or steroidal chemicals such as
ethinyl estradiol, estradiol, and diethylstilbestrol as well as progesterone and
corticosterone (Figure 13, enclosed in purple boxes).
Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity. Highly significant chemical prediction p-values (p=0.001, 0.001, 0.01 for the prostate, lung, and breast cancer datasets respectively) are reordered by their BioActivity similarity computed by PubChem. A column represents the cancer analyzed and each cell corresponds to the chemical-gene set association –log10(p-value). Examples of correlation between BioActivity similarity and chemical-gene set significance include the sodium arsenite, sodium arsenate, and Doxorubicin cluster (labeled in orange), and the Genistein, Estradiol, Ethinyl Estradiol, and Diethylstilbestrol cluster and the Progesterone, Tretinoin, and Corticosterone cluster (labeled in purple). Other examples of BioActivity similarity and chemical-gene set association include the chemicals vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride (outlined in blue).
[Figure 13 heatmap. Columns: prostate cancer (GSE6919); lung cancer, non-smokers (GSE10072); breast cancer, non-tumorigenic. Color key: −log10(p-value), 0 to 8, with a histogram of counts. Rows, ordered by BioActivity similarity: Indomethacin, resveratrol, Corticosterone, Tretinoin, Progesterone, Vitamin K 3, Disulfiram, Mifepristone, Calcitriol, 4-hydroxytamoxifen, Diethylstilbestrol, Ethinyl Estradiol, Estradiol, Genistein, Fenretinide, pirinixic acid, Am 580, pyrazole, Cadmium Chloride, hydroquinone, Furosemide, Acetaminophen, Vitamin A, Methylnitronitrosoguanidine, Cholecalciferol, alachlor, indole-3-carbinol, Fenthion, Trichloroethylene, naphthalene, 3-dinitrobenzene, Piperonyl Butoxide, Acrolein, Ethanol, mono-(2-ethylhexyl)phthalate, Dimethylnitrosamine, Folic Acid, Cyclosporine, Mechlorethamine, Lindane, Isotretinoin, Tamoxifen, Etoposide, bisphenol A, Methapyrilene, Raloxifene, 4-(N-methyl-N-nitrosamino)-1-(3-pyridyl)-1-butanone, benzyloxycarbonylleucyl-leucyl-leucine aldehyde, Carbon Tetrachloride, 2-nitrofluorene, tert-Butylhydroperoxide, vinclozolin, alitretinoin, Metribolone, fulvestrant, flavopiridol, nickel chloride, 2-Acetylaminofluorene, Aflatoxin B1, Hydrogen Peroxide, Benzene, Thioacetamide, Tetrachlorodibenzodioxin, Doxorubicin, sodium arsenate, sodium arsenite.]
DISCUSSION
We have developed a knowledge- and data-driven method to predict chemical
associations with gene expression datasets, using publicly available and
previously disjoint datasets. Specifically, we have created a functional, gene
expression representation of 1,338 environmental chemicals called the
Envirome Map (Figure 5D-F) and have developed a quantitative method to
query this map. To our knowledge, there are few methods that generate
hypotheses regarding environmental associations with disease from gene
expression data. Most current approaches in toxicology have focused on a
small number of environmental influences on single or small groups of genes,
while current approaches in toxicogenomics have been concentrated on
measuring genome-wide responses for a few chemicals [158]. Our prediction
method enables the generation of hypotheses in a more scalable manner using
existing data, examining the potential role of hundreds of chemicals over
thousands of genome-wide measurements and diseases.
As an example, we predicted associations of sodium arsenite with prostate and
lung cancers, of estrogenic compounds such as bisphenol A and estradiol with
prostate and breast cancers, and of dimethylnitrosamine with lung cancer.
Although each has curated knowledge behind the association in the CTD, the
mechanisms of action are not well known and call for further study. So far,
Benbrahim-Tallaa et al. have found hypomethylation patterns in the presence of
arsenic in prostate cancer cells [114]. Zanesi et al. showed a potential
interaction between the FHIT gene and dimethylnitrosamine in producing lung
cancers [135]. Evidence of a complex mechanistic action of estrogens, such as
estradiol, in breast carcinogenesis has been established [159]; however, the
role of other estrogen-like compounds has only recently been studied. For
example, bisphenol A has been shown to invoke an aggressive response in cancer cell
lines [160], possibly by affecting estrogen-dependent pathways [161]. It is
evident that more experimentation is required involving the measurements of
exposure-affected proteins and genes and their activation state in cellular
models and their relation to the chemical signatures.
An overlap of activity of the same genes induced by different chemicals would
suggest a common physiological action by these chemicals. For example, the
ESR2 and MAPK1 genes in the prostate cancer predictions, and the IL6 and
CEBPD genes in the breast cancer predictions, were associated with several
chemicals for each of the diseases. We also found an overlap in predicted
chemicals among different cancers. This overlap arises from the
correlation in the significant pathways shared by these cancers; however, it
may also indicate a need to explore less significant associations in order to find
unique and specific gene expression/chemical exposure relationships for a
given disease. Furthermore, this result may also indicate a bias of gene and
chemical relationships cataloged in the CTD. For example, it could be that
genes specific to common cancer-related pathways are those that are well
studied, such as BCL2 or ESR2.
Related to this, we have attempted to show how biological activity, as assayed
in a high-throughput chemical screen in PubChem, can be correlated with
chemical-gene set associations. Observing a correlation between PubChem-derived
bioactivity and chemical-gene set association from the
CTD provides a way to identify shared modes of action among groups of
similar or related chemicals. These data serve both to provide internal
validation for the list of predicted chemicals acting through similar pathways (such
as those induced by estrogen) and to prioritize hypotheses. For example,
we did not find curated evidence in the CTD for association of the chemicals
vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride to prostate or
lung cancers; however, their similar bioactivity profiles (Figure 13, enclosed in
blue box) and high chemical-gene set association call for further review.
We do acknowledge some arbitrariness in our choice of methods and
thresholds; most of these were chosen to show significance in our
methodology without adding complexity. We could have chosen any of
several alternative approaches to implementing our method; however,
predictions made with the Gene Set Enrichment Analysis (GSEA) [162]
method during the verification phase were not as sensitive (not shown).
Another limitation in our first implementation is that in calculating the
chemical signatures associating chemicals with gene sets, we ignored the
specific degree of expression change (up or down) encoded in the CTD. We
decided not to use this information due to the presence of contradictions (some
references may point to an increase of exposure-induced gene expression while
another reference might claim the opposite), and other preliminary work
suggesting that filtering by the degree of change reduced sensitivity (data not
shown). Because of these limitations, direction of association cannot be
inferred. Further still, we acknowledge that richer and more refined chemical
signatures, along with further integration with resources like PubChem, will
need to be built to make the most accurate predictions.
Another issue with querying the microarray data of any experiment is the lack
of full sample information to stratify results; for example, different exposures
may be associated with a subset of the samples. A related concern is the
small sample sizes of some of the datasets used to evaluate the method. For
example, the best predictive power was seen with the largest dataset (prostate
cancer, GSE6919), and the worst with one of the smallest (breast cancer,
GSE6883). Despite this heterogeneity and lack of power, we still arrived at
noteworthy and literature-backed findings warranting further study. We also
urge that more evaluation must occur with datasets that have a larger number
of samples.
Most importantly, we stress that these types of association remain as
predictions and hypotheses that need validation and verification. The method
presented here is not a substitute for traditional toxicology or epidemiology.
These studies are required to provide quantitative and population-generalizable
estimates of disease risk and dose-response relationships. However, as the
space of environmental chemicals potentially causing biological
effects is large, we suggest that this methodology would give investigators at
least some clue where to start the search for environmental causal factors to
study in these other modes. We believe that, like the Connectivity Map, the
Envirome Map is a feasible and practical way to represent toxicological
response for use in prediction. Predicting a linkage between chemicals, genes,
and clinically-relevant disease phenotypes using existing resources falls in line
with the National Academies’ vision of high-throughput efforts to decipher
genome-wide toxicity response to disease [13].
CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE
ASSOCIATIONS ON DISEASE AND DISEASE-RELATED
PHENOTYPES ON POPULATIONS.
INTRODUCTION
Complex diseases and adverse phenotypes arise due to the contribution of
multiple interacting genetic and environmental factors [2], but despite this
many recent epidemiological or population-based studies have emphasized the
genetic components. For example, the Genome-wide Association Study
(GWAS) is a low-cost, commoditized, and popular framework used by
researchers to evaluate genetic factors that correlate with disease status on a
genome-wide scale [9, 163-165] (Figure 5B). As a function of its accessibility
and the simple nature of the measurements assayed in GWAS, standards for
cross-study comparison and reporting of genetic association in epidemiology
have been established, at the very least calling for comprehensive, systematic, and
agnostic reporting of associations and their validation results. As a result, over
370 GWAS have been published, often over 20 for specific diseases such as
T2D [9].
While GWAS has strengthened the epidemiological process and methodology
of screening and validating genetic variants, most of the findings attained
through the many studies have been unable to explain a large portion of risk
variability between individuals and are of modest effect size [32, 33].
Furthermore, variants found have not been able to shed light on the biology of
disease. One hypothesis is that complex disease arises from the
sum of effects of variants that are less prevalent than those assayed in
GWAS [166]. Another is that these studies have not considered the joint
contributions of both genetics and the environment. However, before we may
address the latter hypothesis, we must understand what environmental factors
are associated with disease.
Despite the limited relevance of the genetic variants found and, more importantly, the
fact that diseases arise out of the contribution of both genetics and the
environment, there exists no analogous or comparable platform to assay and
analyze enviromic associations to disease on an epidemiological scale.
Specifically, given multiple environmental factors, or the “envirome” (Chapter 1),
measured on a population, we ask a question analogous to that asked in
genome-wide association studies: what specific environmental factors are
associated with a disease or phenotype of interest out of all possible individual
environmental factors measured on an epidemiological scale? Note that these
associations are of a different scale than those covered in Chapter 2 (“Mapping
Multiple Toxicological Responses To Complex Disease”).
To answer this question, we propose an analogous framework to GWAS,
called “Environment-wide association study” (EWAS), to search for and
analytically validate environmental factors associated with continuous
phenotypes or discrete ones such as disease (Figure 5A). This type of question
differs from a hypothesis-driven approach, in which candidate
environmental factors are chosen a priori and tested individually for their
association to a phenotype, and is analogous to the questions facilitated by GWAS.
We begin our description of EWAS by introducing the genome-wide analog,
GWAS. Second, we describe the EWAS framework and third, describe
differences between genome-wide and envirome-wide epidemiological studies.
Fourth, we describe the current EWAS methodology. Last, we discuss our
results and posit ways to extend the EWAS methodology. In the following
chapter, we describe specific, peer-reviewed, published applications of EWAS.
METHODS BACKGROUND
Genome-wide association to disease
With the sequencing of the genome and projects that characterized common
genetic variation such as the HapMap, investigators are now able to interrogate
how genome-wide genetic differences are associated with disease and disease-
related phenotypes on an epidemiological scale [25, 167]. These revolutionary
studies, known as “genome-wide association studies” (GWAS), have enabled
investigators to ask what common genetic loci are associated to a phenotype in
an agnostic, systematic, and comprehensive way with explicit control of
multiplicity.
Specifically, during the HapMap project, common single nucleotide polymorphism
(SNP) variants were catalogued on the basis of their population frequency (≥
10% population frequency) and their major and minor allele versions [168]. The
location of each SNP along the genome is referred to as a “locus” and the
presence of variation at a particular locus denotes a “polymorphism” or a
“polymorphic” locus. “Common” polymorphisms are those that occur at
approximately greater than 5-10% in the population. Thus, by definition, a
“common” SNP must reside at a polymorphic locus. There are greater than 1
million common SNPs in the genome [25]. While SNPs are the most common
type of polymorphism in the genome, accounting for 90% of genetic variation,
many other types of genetic variation exist, such as copy number variants,
insertions, and deletions.
GWAS relate traits to variation at each common polymorphic locus in the
genome, or at a large subset of them, and are enabled by genomic technologies,
known as “SNP microarrays”, which can assay greater than 1 million loci
simultaneously for an individual. These microarrays are now mere commodity
items, like computers, making accessible genome-wide measurements on a
large number of individuals [169]. Further, these technology platforms are
known to have very low measurement error [10].
GWAS are constructed by recruiting thousands of individuals with (“cases”)
and without a trait or disease (“controls”). Genotype frequencies at each locus
across the genome are then compared between cases and controls using
common statistical tests such as the chi-squared test [8], assuming
independence between loci. Continuous traits, such as levels of a
biomarker, may also be related to genetic variation by modeling the
continuous phenotype in a linear regression model [170]. Multiple
comparisons are accounted for through conservative Bonferroni adjustment
and significant loci are validated in independent populations, often (but not
always) of differing demographic character than that of the original screen.
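The per-locus comparison and Bonferroni threshold described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular GWAS: the genotype counts, the locus names, and the function names are hypothetical, and the p-value uses the closed form of the chi-squared survival function for a 2 x 3 genotype table (2 degrees of freedom), which assumes every genotype is observed at least once.

```python
import math

def chi_squared_genotype_test(case_counts, control_counts):
    """Pearson chi-squared test on a 2 x 3 table of genotype counts.

    Each argument holds the counts of the three genotypes (e.g. AA, Aa, aa).
    Returns (statistic, p_value). With 2 degrees of freedom the chi-squared
    survival function is exactly exp(-x / 2).
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2.0)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni adjustment."""
    return alpha / n_tests

# Hypothetical genotype counts: (cases, controls) for each locus.
loci = {
    "locus_1": ((300, 500, 200), (300, 500, 200)),  # identical frequencies
    "locus_2": ((450, 400, 150), (300, 500, 200)),  # shifted in cases
}
threshold = bonferroni_threshold(0.05, n_tests=1_000_000)
results = {name: chi_squared_genotype_test(*counts) for name, counts in loci.items()}
significant = [name for name, (_, p) in results.items() if p < threshold]
```

With one million tests the per-locus threshold is 5e-8, so only loci with very large differences in genotype frequency between cases and controls are carried forward to validation.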
Preceding GWAS were “candidate gene studies”, hypothesis-driven studies that
correlate a handful of genetic variants to a trait of interest using a “smaller”
sample size. As a consequence of lack of power and prohibitive genotyping
cost, the agnostic, comprehensive, and systematic analytical and validation
procedure of GWAS eluded traditional genetic association studies [32, 33,
171-173]. To facilitate discussion regarding “Environment-wide association”,
we describe these “agnostic”, “systematic”, and “comprehensive”
characteristics of GWAS.
GWAS is agnostic and data-driven, not hypothesis-driven. Traditionally,
genetic epidemiology association studies were hypothesis driven, testing a
handful of genetic variants at a time against a phenotype. The process of
GWAS also calls for systematic associations both within an individual
study and between multiple studies. First, the process of simultaneous association
requires accounting for multiple testing by controlling the family-wise error
rate and thus false positives, a notable problem in the fragmented literature. Often
the threshold for significance is fixed a priori (Bonferroni correction). Second,
GWAS significant results are validated in additional populations at the same
stringent level. Last, and related to the agnostic characteristic, GWAS
comes close to being comprehensive: each common variant present on the
measurement chip is associated to the phenotype and its strength of association
is reported in the context of all other common genotypes assayed (as seen in the
“Manhattan plot”, Figure 5A).
Environment-wide association to disease
In the following, we propose a study design analogous to GWAS, called
“environment-wide association study” (EWAS) to search for and analytically
validate environmental factors associated with complex diseases and
phenotypes.
EWAS assumes a “data structure” similar to that of GWAS. Recall that in
GWAS, multiple genetic factors are assayed along with phenotypic
information on each individual (Figure 14). In other words, the genetic factors
are the independent variables, and the phenotype is the dependent variable. In
EWAS, the genome domain is substituted with the envirome domain (Figure 5A).
Specifically, the quantity or presence of environmental factors is directly
measured on each individual, such as the amount of a chemical in bodily tissue,
or measured by proxy, such as self-reported historical exposure (Chapter 2). This is
in contrast to data that are self-reported and subject to bias, and to “ecological”
data [11], which are summarized at a level higher than that of individuals, on samples
grouped by some common characteristic, such as family, social network [174],
or town or city region [175]. As discussed in Chapter 1 and further below,
the environment is a dynamic entity, unlike the static genome assayed in GWAS. Thus,
the dimension of time may also be added to the EWAS data
structure, framing it in a longitudinal context. While we describe methods to
accommodate a longitudinal data structure below, specific applications
described do not consider time (Figure 14).
Figure 14. Sample data structure for EWAS. “Phenotype” is the dependent variable. “Sex”, “Age”, “ethnicity”, and “SES” (socioeconomic status) are examples of adjustment variables. X1 through Xp are environmental factors; sample1…samplen are the individuals that make up the sample. Values inside each cell denote an example of the data type for the variable. For example, “Phenotype” here is a binary variable taking on 1 if the phenotype is present, 0 if absent; “sex” is a categorical variable for males and females. X variables representing environmental factors may be continuous (e.g., X1, Xp), positive/negative (e.g., X2), or ordinal (X3). Data might be missing (e.g., NA cells). The vertical axis denotes individuals in the sample. Each environmental factor belongs to a disjoint “class”, or grouping, that represents a common characteristic of those factors, represented in the figure as “Class A”, “Class B”, and “Class Z”.
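The layout in Figure 14 can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class names `Individual` and `EwasData`, the helper methods, and every cell value are hypothetical and are not taken from the dissertation's data; `None` stands in for an "NA" cell.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

# None stands in for a missing ("NA") cell.
Value = Optional[Union[float, int, str]]

@dataclass
class Individual:
    phenotype: int                 # dependent variable, e.g. 1 = present, 0 = absent
    adjustments: Dict[str, Value]  # e.g. sex, age, ethnicity, SES
    exposures: Dict[str, Value]    # environmental factors X1 ... Xp

@dataclass
class EwasData:
    individuals: List[Individual]
    factor_class: Dict[str, str]   # each factor belongs to exactly one class

    def factors_in_class(self, class_name: str) -> List[str]:
        """Names of the factors that belong to the given class."""
        return sorted(f for f, c in self.factor_class.items() if c == class_name)

    def observed(self, factor: str) -> List[Tuple[int, Value]]:
        """(phenotype, value) pairs for individuals with a non-missing value."""
        return [(ind.phenotype, ind.exposures[factor])
                for ind in self.individuals
                if ind.exposures.get(factor) is not None]

# Hypothetical rows: continuous X1, positive/negative X2, ordinal X3.
data = EwasData(
    individuals=[
        Individual(1, {"sex": "M", "age": 35, "ethnicity": "W", "SES": 2},
                   {"X1": -0.2, "X2": "+", "X3": 1}),
        Individual(0, {"sex": "F", "age": 60, "ethnicity": "B", "SES": 1},
                   {"X1": None, "X2": "-", "X3": 2}),
        Individual(0, {"sex": "M", "age": 55, "ethnicity": "M", "SES": 0},
                   {"X1": 0.53, "X2": "-", "X3": 0}),
    ],
    factor_class={"X1": "Class A", "X2": "Class A", "X3": "Class B"},
)
```

Keeping the class membership in a separate mapping mirrors the figure, where the grouping is a property of the factor rather than of any individual.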
GWAS variables are “binned” by their chromosomal location, facilitating the
description of their correlation structure, known as Linkage Disequilibrium
(LD), when visualizing associations. Specifically, LD is the correlation between
two loci in the genome. Further, LD is a function of the relative location of the
two loci; that is, in general, the closer together two loci are on a chromosome,
the higher their LD. Suppose we are considering one locus: in this scenario,
we inherit alleles from our parents, one from the mother and one from the
father. The genotype at one locus is a random event and is dependent on the
frequency of alleles present at that one locus in the mother and father. Now
suppose two loci (two sets of genotypes) are in “LD”. This means that their
patterns of inheritance are correlated; that is, the occurrences of a particular allele
“A” at locus A and “B” at locus B are non-random, or dependent with
respect to one another. In other words, the presence of one allele can predict
the presence of another. LD among different populations has been
characterized by the HapMap project and characterization is ongoing with the
1000 Genomes project [25]. In GWAS, LD structure “buys” us several things. First,
since we assay only a prevalent subset of polymorphic loci, LD allows us to
narrow down what variants might be causal; for example, given an association
signal for a variant, the causal variant might be one in strong LD with it. LD
also gives us an internal gauge of validity; for example, given a strong
association signal for a variant at locus X, one would expect measured common
variants that are in LD with X to also harbor some signal.
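As a concrete illustration of LD as a correlation between two loci, the sketch below computes the commonly used r-squared statistic from phased haplotypes. The choice of r-squared, the 0/1 allele coding, and the example haplotypes are assumptions made for illustration; the text above does not specify a particular LD measure.

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between alleles at two biallelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B) pairs, each
    allele coded 0 or 1. Both loci must be polymorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Hypothetical samples of haplotypes.
perfect_ld = [(1, 1)] * 5 + [(0, 0)] * 5             # alleles always travel together
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]             # alleles are independent
```

Here `ld_r_squared(perfect_ld)` is 1.0 and `ld_r_squared(no_ld)` is 0.0, the two extremes between which real pairs of loci fall.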
At present, LD in EWAS is qualitative not quantitative as in GWAS. In our
applications (Chapter 4) we binned factors according to categories that
described the compound “class”, had shared environmental health “relevance”,
or described some other arbitrary shared characteristic as a group of factors.
Current categories and examples within each are seen in Chapter 1. We
anticipate, as investigators characterize the envirome, that these categories will
encompass assays for stress, microbial flora, drugs, noise, and ecological
measures. A future research effort will be to fully characterize the LD of the
“envirome”, including its correlation/covariance structure and
population-wide prevalence, as has been done for the genome with the HapMap.
EWAS achieves the agnostic, systematic, and comprehensive qualities that
characterize GWAS. First, instead of testing a few environmental associations
at a time, EWAS evaluates multiple environmental factors agnostically.
EWAS is comprehensive in that each factor measured is associated with
phenotype. Next, associations are systematically adjusted for multiplicity of
comparisons. Further, EWAS calls for validation of significant associations in
an independent population.
The EWAS framework calls for systematic and comprehensive sensitivity
analyses of highly significant or validated factors. Specifically, all possible
measured confounders are included in final models and their effect on the
estimate of the environmental factor is assessed. Last, given the dense web of
correlation for non-genetic measures, such as between environmental factors
and clinical measures, the correlation structure between validated
environmental factors and risk factors is systematically computed and
visualized to understand the degree of their interdependence. By visualizing
relationships in this way, we can infer groups of non-independent exposures
associated with phenotype, similar to “relevance network” or clustering
analyses [176, 177].
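One simple way to compute such a correlation structure is to take all pairwise Pearson correlations and keep the pairs above a cutoff as edges of a network. The sketch below assumes complete data, and the cutoff of 0.5, the factor names, and the measurements are all hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(factors, threshold=0.5):
    """Pairs of factors whose absolute correlation meets the threshold.

    factors: dict mapping factor name -> list of measurements, with the same
    individuals in the same order for every factor.
    Returns a list of (name_a, name_b, correlation) tuples.
    """
    edges = []
    for (name_a, a), (name_b, b) in combinations(sorted(factors.items()), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            edges.append((name_a, name_b, round(r, 3)))
    return edges

# Hypothetical measurements of three validated factors on five individuals.
factors = {
    "factor_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "factor_2": [2.0, 4.0, 6.0, 8.0, 10.5],   # tracks factor_1 closely
    "factor_3": [3.0, 1.0, 4.0, 1.0, 5.0],    # only weakly related to the others
}
edges = correlation_edges(factors)
```

Connected groups in the resulting edge list correspond to sets of exposures that are not independent of one another.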
Genetic versus non-genetic associations in population scaled studies
While EWAS has been inspired by GWAS, there are both critical differences
and similar drawbacks between genetic versus non-genetic epidemiology. In
the following, we discuss these differences and similarities, between 1.) current
day non-genetic association studies versus GWAS and, 2.) the new paradigm
of non-genetic association studies, or EWAS, and GWAS. Work done by
Ioannidis et al. guides this section [10].
Current association studies seeking association between environmental or non-
genetic factors and phenotypes test a few factors at a time. Results may be
further biased by selective reporting of subsets of analyses, phenotypes, and
adjustments, leading to a fragmented body of literature [10, 178-180]. Second,
related to selective reporting of subsets of analyses, the
multiplicity of tests is not considered. Current environmental epidemiology
studies are not agnostic, systematic, and comprehensive; however, the EWAS
analytic method amends these differences as described in the previous section.
However, there remain some critical differences and drawbacks between the
new paradigm of “enviromic” and genome-wide association. First, high-
throughput, low-error, commoditized, assay technologies have facilitated
systematic, agnostic, and comprehensive interrogation of genome-wide
variants. An analogous high-throughput and low-error assay technology
platform does not exist for the environmental factors.
Of course, a high-throughput assaying technology can be realized only after
the domain of what to measure, the common loci of the genome, has been
characterized. The HapMap project has enabled us to characterize the
variability across the genome. Further, as a result of this characterization, we
also have an idea of how genetic variants are “correlated”, or the pattern of
linkage disequilibrium.
We are far from describing the “LD” of the envirome, let alone what
environmental factors make up the envirome (Chapter 1). However, in our
own applications (Chapter 4) and from others we know that the correlation
matrix of environmental variables is dense [181]; many variables are correlated
with each other strongly. Therefore, it is difficult to pinpoint both what factors
are independently associated with the phenotype and the directionality of
association.
Issues related to observational studies influence all association studies, be they
hypothesis-driven candidate factor studies, GWAS, or EWAS. In contrast
to “gold-standard” randomized trial study data, both genetic and non-genetic
studies rely on observational study data, such as longitudinal cohort, case-
control, or cross-sectional data. Both types of epidemiological studies are
subject to confounding biases that hinder causal inference and are avoided, to
some degree, in randomized studies [182]; however, the gold-standard scenario
of a clinical trial is not suited for agnostic study of the envirome as it is
impossible to randomize such a matrix of factors.
“Confounding” is used to describe a scenario in which a variable is correlated
with both the factor of interest (the independent variable) and phenotype
(dependent variable) [183]; in our analyses, the factor acts as a “proxy” to the
confounding variable, resulting in a false association between the dependent
and independent variable. A partial solution to this type of bias is including the
confounder as a covariate in the statistical model, or “controlling” for the
confounder. This of course is only possible when the confounder is known and
measured.
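A small numerical illustration of confounding follows. It uses stratification on a binary confounder as a simple stand-in for including the confounder as a covariate; the data are hypothetical and constructed so that the factor has no association with the phenotype within either level of the confounder.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: the confounder z shifts both the factor x and the
# phenotype y, but x has no effect on y within either level of z.
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [-1, 1, -1, 1, 2, 4, 2, 4]
y = [-1, -1, 1, 1, 1, 1, 3, 3]

crude = slope(x, y)  # association when z is ignored
within = {
    level: slope([a for a, c in zip(x, z) if c == level],
                 [b for b, c in zip(y, z) if c == level])
    for level in sorted(set(z))
}
```

Ignoring `z` yields a positive slope of 12/26 (about 0.46), while the slope within each level of `z` is exactly zero: the factor was acting as a proxy for the confounder.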
In modern-day genome-wide studies, a notable example of confounding
included the initial association of a variant belonging to the FTO gene and T2D
[165]. In subsequent analyses adjusting for body mass index (BMI), a clinical
risk factor associated with both T2D and the FTO variant, the association was
nullified. Subsequently, FTO was shown and validated in its association to
BMI and obesity in GWAS [170]. Confounding is a major issue in non-genetic
studies, especially noting the dense correlation structure of non-genetic and
environmental variables and many such examples exist of associations biased
by confounders. Famous examples include associations derived from
observational studies later contradicted by randomized controlled trials (RCT): 1.)
β-carotene, thought to mute the risk of smoking-induced cancer [184], only
to be refuted by an RCT later [185]; 2.) likewise vitamin E and decreased
risk of coronary heart disease (CHD) [186]; and even 3.) vitamin C and
CHD, for which relative risks from observational studies and RCTs had
switched direction [69]!
Another source of “bias” includes “reverse causality”, or reverse association.
Reverse causality leads to the failure to infer proper “forward” direction
between the independent variable and dependent variable, the phenotype.
Specifically, it occurs when the independent variable comes directly or
indirectly as a result of the dependent variable. An example of this is
a sample-wide behavioral shift due to the dependent variable, such as increased
intake of a vitamin due to an adverse phenotype. If we were to associate the
environmental factor, the vitamin, with the phenotype as the dependent
variable, the interpretation of the model as is suggests that change in vitamin
exposure leads to change in phenotype when in fact the opposite is true. These
biases are especially present in case-control or cross-sectional studies in which
individuals are measured at one point in time. A way to take into account the
dynamic nature of non-genetic variables and biases such as reverse causality
includes conducting a longitudinal study in which we may observe jointly
changes in phenotype and exposure pattern as a function of time [34]. Lastly,
the notion of reverse causality is a non-issue in genetic variant association
studies due to the static state of nucleotide variants.
The nature of the environmental factors themselves also biases results. First,
the assessment of the quantity of environmental factors in blood and serum is
subject to measurement error [10] and self-report variables are subject to recall
bias. Further, physiological characteristics of factors themselves influence
estimates, including the variability of the kinetics of chemical factors, such as
how long they are retained in accessible body tissue. For example, chemical
compounds that are easily measured include those that are lipophilic, persistent
in fatty tissue. As adiposity is related to both the measurement of the factor
and often the phenotype of interest (e.g., metabolic syndrome), a positive
correlation might indicate confounding. On the other hand, many types of
factors are excreted quickly, also affecting their measurement and association
to the phenotype of interest; however, “steady-state” or constant exposure
might allay a kinetic effect of environmental chemicals [187]. Nevertheless, in
genetic studies, these issues are altogether avoided: error rates of array-based
assays and DNA sequencing are minuscule compared to those for environmental factors.
EWAS METHOD
The EWAS methodology and analysis framework is analogous to that utilized
in GWAS. First, we conduct an initial scan for environmental factors
associated with a phenotype of interest through generalized linear modeling, such
as logistic or linear regression. Since environmental association occurs in the
observational (vs. randomized) scenario, these models include variables that
adjust for known confounders, such as clinical risk factors. Second, we
account for multiple hypotheses by estimating the false discovery rate (FDR).
Third, factors that we deem significantly associated with the phenotype beyond
the region of false discovery are “validated” in independent cohorts. Factors
that are validated are considered true discoveries.
The EWAS framework also calls for systematic sensitivity analyses, whereby
validated factors are modeled under different assumptions or with additional
covariates. Further, the pair-wise correlation between each pair of validated
factors is computed and examined to determine their dependence, which can be
interpreted as potential evidence for route of exposure or confounding. Each
step is described further.
Stage 1: Linear Modeling
Each environmental factor is associated with a phenotype of interest using
generalized linear models; for example, each factor is associated with disease
status using logistic regression. Normally distributed continuous phenotypes are
correlated to environmental factors with linear regression. Common risk, demographic,
and clinical factors are added as adjusting variables, such as age, sex, ethnicity,
and socioeconomic status, as phenotypic states and environmental factors are
confounded by these variables. Thus, for an environmental factor Xi in our list
of measured factors X1 … Xp we model the disease state (Y) as a linear
function of the environmental factor and adjustment variables (represented by Z):
Y = α + βi Xi + ζ Z
Xi corresponds to the environmental factor and βi corresponds to the effect size
of that factor, adjusted by other variables.
The strength of association is computed by the 2-sided p-value for βi, which
tests the “null hypothesis” that βi is equal to zero. When modeling the
logit of the phenotype (logistic regression), the exponentiation of βi serves as
the odds ratio, or the multiplicative change in the odds of diseased versus
un-diseased status for a unit change of the factor. In the linear regression
setting, βi can be interpreted as the change in phenotype per unit change of the
factor. In summary, the screening procedure of stage 1 can be described by this
pseudo-code:
1. Pvalues <- NewList()
2. Effect_sizes <- NewList()
3. For xi in [X1…Xp]:
4.     Modi <- GeneralLinearModel(phenotype, xi, Xses, Xeth, Xsex, Xage)
5.     ListAppend(Pvalues, getPvalue(Modi, xi))
6.     ListAppend(Effect_sizes, getEffectSize(Modi, xi))
Algorithm 1. Screening for Environmental Factors (Stage 1) of EWAS.

In Algorithm 1, lines 1 and 2 initialize array data structures to store p-values
and effect sizes (coefficients) for each environmental factor. In lines 3 and 4,
we fit a linear model (‘GeneralLinearModel’) that models phenotype as a
function of each environmental factor Xi and the adjusting factors. In lines 5
and 6, we simply take the p-value and coefficient from each model and store them
in our lists. P-values are computed through common tests of significance, such as Wald tests.
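A minimal runnable sketch of the Stage 1 screen for a continuous phenotype is given below. It makes several assumptions that are not part of the method as described: it uses ordinary least squares only (no logistic regression), complete data, a normal approximation for the two-sided Wald p-value, and a hand-written solver so that it depends only on the Python standard library. The function names and the example data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(y, columns):
    """Ordinary least squares of y on an intercept plus the given columns.

    Returns (coefficients, standard_errors), intercept first.
    """
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(design, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se.append(math.sqrt(sigma2 * solve(xtx, unit)[j]))  # j-th diagonal of (X'X)^-1
    return beta, se

def screen(phenotype, factors, adjustments):
    """Fit one adjusted model per factor; collect p-values and effect sizes."""
    p_values, effect_sizes = {}, {}
    for name, x in factors.items():
        beta, se = fit_linear(phenotype, [x] + adjustments)
        z = beta[1] / se[1]                                  # Wald statistic for the factor
        p_values[name] = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided, normal approximation
        effect_sizes[name] = beta[1]
    return p_values, effect_sizes

# Hypothetical data: the phenotype depends on x_signal and age, not on x_null.
age = [30, 42, 55, 38, 61, 47, 33, 58, 44, 50, 36, 65]
x_signal = [-1.2, 0.4, 1.1, -0.3, 0.9, -0.8, 0.2, 1.5, -1.0, 0.6, -0.5, 0.1]
x_null = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.1, 0.05, -0.05]
phenotype = [2 * s + 0.05 * a + e for s, a, e in zip(x_signal, age, noise)]

p_values, effect_sizes = screen(
    phenotype, {"x_signal": x_signal, "x_null": x_null}, adjustments=[age]
)
```

In practice a statistics package would supply the model fit and exact tests; the point of the sketch is only the loop structure, in which every factor is modeled separately against the same phenotype and adjustment variables.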
Continuous factors are z-transformed (centered about the mean and divided by
their standard deviation) in order to compare the effect sizes. Many factors
measured in tissue have a right skew and thus are log-transformed prior to
z-transformation. Binary factors (such as presence or absence of a factor) are
standardized such that effect size reflects a unit change between exposed and
un-exposed status; that is, the referent is consistently the “negative” result of a
binary test. Ordinal factors are left untransformed.
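The transformation of continuous factors can be sketched as follows. The natural logarithm and the sample standard deviation (dividing by n - 1) are assumptions made here, as the text does not specify either, and the function names are hypothetical; the log step requires strictly positive values.

```python
import math

def z_transform(values):
    """Center about the mean and divide by the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def standardize_continuous(values, log_first=True):
    """Optionally log-transform a right-skewed factor, then z-transform it."""
    if log_first:
        values = [math.log(v) for v in values]
    return z_transform(values)

# Hypothetical right-skewed tissue measurements spanning two orders of magnitude.
standardized = standardize_continuous([1.0, 10.0, 100.0])
```

After the log step the three values are evenly spaced, so `standardized` is approximately [-1.0, 0.0, 1.0]; an effect size for this factor is then expressed per standard deviation of the log-transformed measurement.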
Stage 2: Controlling for Multiple Hypotheses by Estimating the False
Discovery Rate
Given a set of “discoveries”, or a list of potentially significant factors, how can
we determine which are false discoveries? In the GWAS setting, Bonferroni
correction is utilized to adjust for multiple comparisons. The Bonferroni
adjustment is straightforward: it simply divides the significance threshold α by
the total number of tests conducted. This adjustment controls the “family-wise
error rate”, the probability of having one or more false positives in a set
of results, at level α, equivalent to a setting in which only one hypothesis was
tested at level α. However, the threshold is conservative and therefore we lose
power for detection.
To account for multiple comparisons, we compute an empirical estimate of the
False Discovery Rate (FDR) derived through permutations of the phenotype
multiple times, effectively creating a “null distribution” of test statistics. In
contrast to the Bonferroni correction, the FDR provides a quantitative estimate
of the proportion of false positives in a set of “discoveries”. The FDR is less
conservative and therefore more powerful than the Bonferroni correction [38].
Further, since our estimate of the FDR utilizes the data itself, it inherently
considers the covariance structure of the data, an important quality given the
dense correlation of non-genetic factors [38].
The FDR is the estimated proportion of false discoveries made versus the
number of discoveries made in the real data for a given significance level α,
and is used to control for multiple hypothesis testing. To estimate the number
of false discoveries, we create a “null distribution” of regression test
statistics by shuffling the phenotype a large number of times (100-1000) and
refitting the regression models each time. The FDR
is the ratio of the proportion of results that were called significant at a given
level α in the null distribution to the proportion of results called significant
in our real tests. We use a significance level that corresponds to an FDR of
5-10% to select associations.
The pseudo-code to compute the FDR follows:
1. Do: ‘Algorithm 1’.
2. nullPvalues <- NewList()
3. For i in [1…numberPermutations]:
4.     randomPheno <- permutePhenotypeWithoutReplacement(phenotype)
5.     For xi in [X1…Xp]:
6.         Modi <- GeneralLinearModel(randomPheno, xi, Xses, Xeth, Xsex, Xage)
7.         ListAppend(nullPvalues, getPvalue(Modi, xi))
8. fdrRaw <- NewList()
9. For pvalue in Pvalues (sorted in ascending order):
10.     numerator <- sum(nullPvalues <= pvalue) / numberPermutations
11.     denominator <- sum(Pvalues <= pvalue)
12.     ListAppend(fdrRaw, numerator / denominator)
13. fdrs <- NewList()
14. For i in [1…p]:
15.     fdr <- min(fdrRaw[i…p])
16.     ListAppend(fdrs, fdr)
Algorithm 2. Computing the FDR (q-value) for each p-value during Stage 1 of EWAS.

To begin Algorithm 2, we need to have completed stage 1 of EWAS with
Algorithm 1. Then, for a number of permutations, we refit the regression
model for the random phenotype for each environmental factor and collect all
of these ‘null’ p-values (lines 3-7). For each p-value computed in Stage 1, we
compute the raw FDR, or the ratio of the average number of results per
permutation that pass that p-value threshold in the permuted data to the number
of results that pass that p-value threshold in stage 1 (lines 10-12). As the FDR
should be a monotonically increasing function of the p-value, we ensure, with
the p-values sorted in ascending order, that the FDR for a p-value is the
minimum of the FDRs for all p-values equal to or greater than that p-value
(line 15). The resulting array of FDR values corresponds to the FDR for each
p-value computed in Stage 1.
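The second half of this computation, turning the observed and permuted p-values into one FDR per factor, can be sketched as below. The sketch assumes the null p-values have already been collected by refitting the models on permuted phenotypes; the function name, the capping of the estimate at 1, and the example numbers are assumptions. Comparisons use `<=` so that each p-value counts itself and the denominator is never zero.

```python
def empirical_fdr(p_values, null_p_values, n_permutations):
    """Permutation-based FDR estimate for each observed p-value.

    p_values: observed p-values, one per factor, in any order.
    null_p_values: p-values pooled over all permutations and factors.
    Returns FDR estimates in the same order as p_values.
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    raw = []
    for i in order:
        p = p_values[i]
        # Average number of null results per permutation passing this threshold.
        expected_false = sum(1 for q in null_p_values if q <= p) / n_permutations
        # Number of observed results passing the same threshold.
        called = sum(1 for q in p_values if q <= p)
        raw.append(expected_false / called)
    # Enforce monotonicity: take the minimum over equal or larger p-values.
    running = float("inf")
    monotone = []
    for value in reversed(raw):
        running = min(running, value)
        monotone.append(running)
    monotone.reverse()
    fdr = [0.0] * len(p_values)
    for rank, i in enumerate(order):
        fdr[i] = min(1.0, monotone[rank])  # cap the estimate at 1
    return fdr

# Hypothetical example: 3 factors, 2 permutations (6 pooled null p-values).
observed = [0.5, 0.001, 0.04]
null = [0.02, 0.3, 0.7, 0.45, 0.9, 0.6]
fdr = empirical_fdr(observed, null, n_permutations=2)
```

For these numbers `fdr` is [0.5, 0.0, 0.25]: no permuted result is as small as 0.001, one of the pooled null results falls below 0.04 (0.5 per permutation against 2 observed), and three fall below 0.5 (1.5 per permutation against 3 observed).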
Of course, the original method for estimating the FDR can be used [39],
eliminating the need for Algorithm 2. However, as discussed earlier,
estimating the FDR through permutations of the dependent variable is
preferred in the scenario in which the variables are correlated. In addition,
much has been documented about what variables to permute or bootstrap. For
example, it has been suggested that model residuals, the difference between the
predicted and true values, should be permuted (or bootstrapped) as opposed to
the original outcome variables (replacing line 4 in Algorithm 2 appropriately).
In our experience (Chapter 4), we had similar estimates of the FDR under
different documented methods of permuting. The reader is advised to refer to
Manly, Efron, and Westfall and Young for more in this area [188-190].
Stage 3: Validation
Findings deemed significant at some nominal FDR level are validated in one or
more additional independent cohorts. As a rule, the FDR level required of the
validation result must be the same as, or more stringent than, that applied
to the initial cohort. For example, if a factor is deemed significant at an
FDR of 10% in one cohort, it must also have an FDR of 10% or lower in one of
the validation cohorts. Furthermore, and importantly, the sign of the effect
size in the validation cohort must be the same as that in the initial screen.
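The validation rule can be written down in a few lines of Python. This is a minimal illustration, assuming each cohort's result for a factor is summarized by an FDR and a signed effect size; the function name `is_validated` and the tuple layout are hypothetical.

```python
def is_validated(discovery, replication, fdr_threshold=0.10):
    """Apply the two-part validation rule to one factor.

    Each argument is a hypothetical (fdr, effect_size) pair for one cohort.
    """
    fdr_d, effect_d = discovery
    fdr_r, effect_r = replication
    # Both cohorts must meet the same FDR level (assumed inclusive here).
    meets_fdr = fdr_d <= fdr_threshold and fdr_r <= fdr_threshold
    # The effect must point in the same direction in both cohorts.
    same_sign = effect_d * effect_r > 0
    return meets_fdr and same_sign
```

For example, a factor with FDR 0.08 and a positive effect in the initial cohort is not validated by a second cohort with FDR 0.05 but a negative effect.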
We also compute the empirical FDR of the validation step, the overall FDR of
validating a factor. We first estimate the number of false positives by counting
the number of factors found significant at level α in multiple cohorts from the
permuted analyses. For example, to assess the FDR of validating a factor in 2
cohorts, we collected the factors that fell below the significance threshold α in
the permuted data corresponding to two different cohorts and counted the
number of factors found significant in both. We repeated this operation on all
possible pairs of cohorts, adding up numbers found to be significant in each
pair. We then estimate the FDR by computing the ratio between the total
number of false positives and the number of true validated factors (factors
found to be significant in more than 1 cohort). We repeated the analogous
operation for factors significant in however many cohorts we use to validate
our results. The pseudo code for this procedure follows:
1. numCohorts <- numberOf(cohorts)
2. fdrThreshold <- 10%
3. significantFactors <- NewHash(key=factor,init=0)
4. significantNullFactors <- NewHash(key=factor,init=0)
5. For cohort in cohorts:
6.     Do ‘Algorithm 1’.
7.     Do ‘Algorithm 2’.
8.     pvalueThresh <- max(Pvalues[fdrs < fdrThreshold])
9.     significantFactorsInCohort <- whichFactors(Pvalues <= pvalueThresh)
10.     significantFactorsInNullCohort <- whichFactors(nullPvalues <= pvalueThresh)
11.     For factor in significantFactorsInCohort:
12.         significantFactors[factor] ++
13.     For factor in significantFactorsInNullCohort:
14.         significantNullFactors[factor] ++
15. validatedFactors <- NewHash(key=numCohort,init=0)
16. For numCohort in [2..numCohorts]:
17.     For factor in significantFactors:
18.         If (significantFactors[factor] >= numCohort):  # need to check the effect size direction
19.             validatedFactors[numCohort] ++
20. nullValidatedFactors <- NewHash(key=numCohort,init=0)
21. For numCohort in [2..numCohorts]:
22.     For factor in significantNullFactors:
23.         If (significantNullFactors[factor] >= numCohort):
24.             nullValidatedFactors[numCohort] ++
25. validatedFDR <- NewHash(key=numCohort)
26. For numCohort in [2..numCohorts]:
27.     falsePosValidRate <- nullValidatedFactors[numCohort]/numberPermutations
28.     fdr <- falsePosValidRate / validatedFactors[numCohort]
29.     validatedFDR[numCohort] <- fdr
Algorithm 3. Computing the FDR for the multi-cohort validation.
In lines 1 and 2, we retrieve the number of independent cohorts we use to
tentatively validate a significant result and initialize our significance
threshold for a finding in 1 cohort (e.g., FDR < 10%). We then initialize two
hash data structures, indexed by the environmental factor name string: one
holds the number of cohorts in which the factor is significant, and the other
holds the same count for ‘null’ results, or results attained through
permutation of the phenotype label (lines 3 and 4). Next, for each individual
cohort, we do our EWAS stage 1 screen (line 6) and compute the within-cohort
FDR (line 7). Then, we collect the factors that meet the nominal FDR threshold
in the real and permuted data (lines 8-10). Then, we iterate through these
significant factors and increment a counter for the factor (lines 11-12); for
factors that are tentatively validated, the count will be greater than 1. We do
the analogous operation for the permuted dataset (lines 13-14). Next, we count
the number of validated factors by totaling up the factors that had a count
greater than 1 (lines 15-19) and do the analogous operation for the permuted
dataset (lines 20-24). Finally, we compute the FDR of validating a factor in
lines 25-29: for each possible validation scenario numCohort (where numCohort
is the number of cohorts, 2, 3, 4, etc., in which a factor must be
significant), we estimate the number of false positives as the number of
“validated” null findings per permutation, that is, the rate at which a factor
was found to be significant in numCohort permuted cohorts, and divide it by
the actual number of factors validated in the real dataset. Thus, this FDR
corresponds to validating a factor under the single-cohort FDR significance
rule.
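Lines 15-29 of Algorithm 3 can be sketched in Python for a single validation scenario. This is a hedged sketch under stated assumptions: the significant factors have already been collected per cohort as sets of names, the null hits are kept separately for each permutation (one list of per-cohort sets per permutation), and the check of the effect size direction is omitted. The function name `validation_fdr` and its input layout are hypothetical.

```python
from collections import Counter

def validation_fdr(real_hits, null_hits, min_cohorts=2):
    """FDR of calling a factor validated when significant in >= min_cohorts.

    real_hits: list with one set of significant factor names per cohort.
    null_hits: list with one entry per permutation; each entry is a list
               with one set of significant factor names per permuted cohort.
    """
    # Count, per factor, the number of real cohorts where it is significant.
    real_counts = Counter(f for cohort in real_hits for f in cohort)
    validated = sorted(f for f, c in real_counts.items() if c >= min_cohorts)
    # Count "validated" factors within each permutation.
    null_validated = 0
    for permuted in null_hits:
        counts = Counter(f for cohort in permuted for f in cohort)
        null_validated += sum(1 for c in counts.values() if c >= min_cohorts)
    if not validated:
        return None, validated  # FDR is undefined without discoveries
    false_pos_rate = null_validated / len(null_hits)
    return false_pos_rate / len(validated), validated
```

Calling the function once for each value of `min_cohorts` from 2 up to the number of cohorts reproduces the loop over numCohort.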
Final estimates for validated factors are computed by combining independent
cohorts. Tests for heterogeneity between cohorts are also performed to ensure
the final overall estimate is unbiased by any specific cohort.
Stage 4: Sensitivity Analyses
Confounding and reverse causality influence the strength of association, bias
the effect size estimate, and in general affect causal inference about the
effect of environmental factors on phenotypes. Thus, we propose a method to
begin to measure these biases approximately. We cannot claim to find all such
biases or to eliminate confounding; we can only assess bias arising from
variables that were measured.
In this method, we systematically comb through all measured variables that
were not considered in our list of environmental factors but could influence
the association, and sequentially add each to the linear model as an
additional covariate. Then the p-value of association and effect size
corresponding to the environmental factor calculated from the extended model
are compared with those from the original model computed in Stage 2. The
difference between the extended and original factor coefficients quantifies
the approximate bias due to the new variable.
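The comparison of extended and original coefficients can be illustrated with ordinary least squares. This is a minimal sketch that uses plain linear regression as a stand-in for the survey-weighted generalized linear models; the helper names `ols` and `coefficient_shift` are hypothetical.

```python
def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def coefficient_shift(exposure, covariate, y):
    """Exposure coefficient without and with the extra covariate, and the shift."""
    original = ols([exposure], y)[1]
    extended = ols([exposure, covariate], y)[1]
    return original, extended, extended - original
```

A large shift relative to the original coefficient suggests that the added variable accounts for part of the association.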
Types of variables that might bias our associations depend on the phenotype
and environmental factors under study, but often include knowledge of clinical
status (e.g., diagnosis of a disease), recent food, supplement, or drug intake,
and physical activity. For example, knowledge regarding one’s disease state
might induce behavioral change, resulting in increased exposure to foods high
in vitamins and certain nutrients; association between these vitamin factors and
disease might then be attributed to reverse causality. Or, use of a drug might
induce phenotypic change, biasing estimated effects toward the null.
This method depends on the availability of many measured potential
confounders. Large epidemiological datasets arising from the public domain or
from large consortia often measure many of these other clinical and behavioral
non-genetic variables, which can be utilized to test the “sensitivity” of the
final validated effects of environmental factors associated with a phenotype. We
give specific examples of our sensitivity analyses when covering applications
in sections below.
Stage 5: Correlation Globes
The correlation/covariance structure between non-genetic measures is known
to be “dense”, and this structure also influences our ability to infer the
independent effect of factors on phenotype as discovered in EWAS.
Furthermore, our initial screen methodology tests each factor separately,
treating factors as independent, and therefore tells us little about their
correlation.
Concretely, given a list of discovered factors, their joint association to the
phenotype of interest might be due to their correlation, such as similar routes
of exposure. We assess the degree of dependency between validated factors by
computing their raw correlation coefficient (Pearson’s ρ) and visualizing this
with a correlation “globe”. By visualizing relationships in this way, we can
infer non-independent exposures associated with phenotype [176, 177].
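The pairs of factors drawn as connections in a correlation globe can be computed as below. This is a minimal sketch: the function names are hypothetical, and both the cutoff of 0.2 and the use of the absolute value of the correlation are assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def globe_edges(factors, threshold=0.2):
    """Return (factor_a, factor_b, rho) for each pair with |rho| > threshold.

    factors: dict mapping factor name to its list of measurements, taken on
             the same individuals in the same order.
    """
    edges = []
    for a, b in combinations(sorted(factors), 2):
        rho = pearson(factors[a], factors[b])
        if abs(rho) > threshold:
            edges.append((a, b, rho))
    return edges
```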
DISCUSSION
As described above, EWAS may facilitate many different ways of screening
for factors. We describe extensions that might be used off-the-shelf to
accommodate longitudinal data and statistical learning methods that consider
the entire matrix of independent variables at once.
Longitudinal data
As discussed, environmental factors are dynamic. One way to capture the
dynamic relationship between environmental factors and a phenotype of
interest includes repeatedly measuring individuals over time. An example
includes a longitudinal cohort study, in which a cohort is followed for a certain
amount of time beginning prior to disease onset, such as childhood or
adolescence. This type of study design might lessen the bias of reverse
causality, but not completely [34].
For a binary dependent variable, the Cox proportional hazards model is a
common analytic model that can accommodate both time-dependent
independent and dependent variables. With this model, we simply substitute
line 4 of algorithm 1 with the Cox model that inputs time-dependent variables.
For both continuous and binary dependent variables, hierarchical modeling
techniques such as generalized estimating equations may be utilized. The EWAS
as described by algorithm 1 depends on the computation of an individual
p-value and effect size for each environmental factor, and statistical tests
for these modeling techniques satisfy this requirement. Calculation of the
empirical FDR also proceeds in the same way [191].
Feature Selection: Shrinkage Methods
The EWAS screening method considers each environmental factor in a
separate linear model iteratively (algorithm 1). This makes the screening and
interpretation of many variables feasible without over-fitting the linear
model (i.e., p << n within each model, where p is the number of predictors and
n is the number of individuals). However, this falsely assumes independence
between environmental factors. Statistical learning methods, such as
“shrinkage” methods, enable one to model the independent variables
simultaneously, even in the “under-determined” (p ≥ n) setting.
Two popular shrinkage methods are the “Lasso” [192] and the “elastic net”
[193]. These methods are extensions of multiple regression, have some
relation to tree “boosting” methods [194], and are applicable over the
generalized linear model family, including Cox proportional hazards for
longitudinal data [191]. Both the lasso and elastic net are able to fit an
under-determined model by constraining the size of coefficients (“shrinking”).
Because these methods consider the entire set of independent variables
simultaneously (i.e., multiple regression), algorithm 1 is supplanted by the
shrinkage procedure. Further, k-fold cross-validation is utilized to select
the features that give the lowest prediction error on the k subsets of the
data held out of the model building process [194].
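How shrinkage performs feature selection can be seen in the simplest case. For an orthonormal design and the objective (1/2)·RSS + λ·Σ|β|, the lasso estimate of each coefficient is the least-squares estimate pulled toward zero by λ and set exactly to zero when it is smaller than λ. The sketch below shows only this special case; the function name `soft_threshold` is an illustrative choice, and general designs require an iterative solver.

```python
import math

def soft_threshold(beta_ols, penalty):
    """Lasso estimate for one coefficient under an orthonormal design.

    beta_ols: least-squares estimate of the coefficient.
    penalty:  the lasso penalty (lambda), assumed non-negative.
    """
    # Shrink the magnitude by the penalty, never below zero.
    magnitude = max(abs(beta_ols) - penalty, 0.0)
    # Keep the sign of the least-squares estimate.
    return math.copysign(magnitude, beta_ols)
```

Coefficients shrunk exactly to zero correspond to factors dropped from the model, which is why a larger penalty selects fewer features.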
Feature selection operates through optimizing prediction accuracy of the
dependent variable and not through ordering of test statistics of individual
coefficients used in inference. Thus, we must re-configure parts of Stage 1
(FDR estimation) and Stage 2 (Validation) to accommodate this.
Reconfiguring Stage 1, we use one cohort as the “discovery” cohort, applying
the shrinkage method to find factors associated with the phenotype. Within
this cohort, k-fold cross-validation is applied in order to optimize prediction
accuracy on held-out data. Thereafter, the top factors found through this
method are “validated” individually in additional validation cohorts using
common tools for inference (e.g., GLM). Successful validation requires low
nominal p and FDR values for the validation analyses.
Of course, “classical” methods for feature selection exist in the linear
regression domain, such as “forward-stepwise” and “backward-stepwise”.
These methods may be used to select environmental factors, but we opt out of
discussing them because the step-wise procedure leads to high variability in
subset selection, ultimately reducing their prediction accuracy [195].
The shrinkage methods discussed above avoid this problem.
In this chapter, we have presented a straightforward and generalizable way to
associate environmental variables of large dimension to disease. Furthermore,
we present a way of ranking what variables we may want to pursue for further
study through computation of the FDR. Because of its proposed utility, the
method has become a center point of discussion and debate [1, 196-200]. In
the following chapter, we demonstrate this claimed utility, applying the
method to Type 2 Diabetes and Serum Lipid Levels, risk factors for
cardiovascular disease.
CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE
AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2
DIABETES (T2D) AND SERUM LIPID LEVELS
INTRODUCTION
In the following, we exemplify methods and techniques presented in the
previous chapter with published or submitted “Environment-wide Association
Studies” (EWAS) on diseases including type 2 diabetes (T2D) [35] as well as
on phenotypes that are risk factors for disease, such as serum lipid levels [36].
As described in the previous chapter, EWAS is a framework to
comprehensively and systematically test for environmental association to
disease analogous to “Genome-wide association Study” (GWAS), a now
standard framework in genetic epidemiology to associate genetic variants in a
genome-wide dimension to disease.
The EWASes presented concern complex diseases. First, these diseases are
known to be multifactorial in etiology, with many environmental and genetic
factors known to play a role [2]. Second, they are of great concern given
their rise to epidemic status [201]. Third, through GWAS executed on samples
of significant size, we have a robust set of common genetic variants
associated with these diseases, for example T2D [202] and lipid levels [65].
Furthermore, and most importantly, this list of genomic loci is being updated
and examined continuously [9, 203]; however, we lag behind in identifying the
comprehensive set of environmental factors (Chapter 1).
The following studies are made possible by the National Health and Nutrition
Examination Survey (NHANES), a representative biennial health survey of the
non-institutionalized population of the US [37]. In NHANES, participants are
queried regarding their health status and an extensive battery of clinical and
laboratory tests is performed on a subset of these individuals. Specific
environmental attributes are assayed, such as chemical toxins, pollutants,
allergens, bacterial/viral organisms, and nutrients. Of biomedical relevance,
we identified novel environmental factors such as nutrients and industrial
pollutants associated with these diseases that should be examined in follow-up
validation studies.
ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2
DIABETES
EWAS on T2D: Methods
We associate 266 unique environmental factors to T2D status from the
NHANES. We downloaded all of the available NHANES data for the 1999-2000,
2001-2002, 2003-2004, and 2005-2006 cohorts and collated corresponding
variables across them. For example, if a variable LBXVIE from 1999-2000
described “A-Tocopherol ug/dL” and a variable with name LBXATC from
2001-2002 also described “a-tocopherol ug/dL”, we applied the same name for
each, LBXATC.
Figure 15 presents a schematic representation of our analysis methodology. We
analyzed all environmental factors from the NHANES that were a direct
measurement of an environmental attribute, such as the amount of pesticide or
heavy metal present in urine or blood. We did not consider internal biological
system laboratory measures such as red blood cell count, triglyceride level,
cholesterol level, or other physiological measures. By using direct and
quantitative measures of factors, we potentially avoid issues of self-report bias.
There was a total of 543 factors in our EWAS, but not all factors were present
in all cohorts: 111 factors were measured in the 1999-2000 cohort, 146 in
2001-2002, 211 in 2003-2004, and 75 in 2005-2006. These comprised 266
unique environmental factors in total, with 157 factors measured in more than
one cohort. Using NHANES categorization, we binned factors into 21 “class”
groupings in order to discern patterns among related groups of factors,
analogous to chromosomal units in GWAS (not shown). Different
environmental factors were measured in varying numbers of participants,
ranging from 507 to 3318 individuals.
[Figure 15: schematic of the EWAS analysis methodology, panels A-H. Panel A tabulates the number of factors per environmental class in each NHANES cohort (1999-2000, 2001-2002, 2003-2004, 2005-2006). Classes: Acrylamide; Allergen Test; Bacterial; Cotinine; Dialkyl; Dioxins; Furans; Heavy Metals; Hydrocarbons; Latex; Carotenoid Nutrients; Mineral Nutrients; Vitamin A; Vitamin B; Vitamin C; Vitamin D; Vitamin E; Polychlorinated Biphenyls; Perchlorate; Phenols; Phthalates; Phytoestrogens; Polybrominated Ethers; Polyfluorochemicals; Virus; Volatile Compounds; Pesticides (Atrazine, Carbamate, Chlorophenol, Organochlorine, Organophosphate, Pyrethroid). Phenotype panels: fasting glucose > 125 mg/dL (N=109-3190, 8% of total); log10(HDL) (N=222-7485); log10(LDL) (N=101-3368); log10(triglycerides) (N=109-3618). Analysis panels: empirical FDR estimation by permuting phenotype levels 1000x; P-value(βfactor) < α in 2 or more cohorts, FDR(α) ≤ 10%; combined cohort βfactor; estimation of R2; sensitivity analyses (24- or 48-hour dietary recall (n=58), physical activity, total supplement use, any use of drugs (metformin, statins, etc), any metabolic health history); correlation globes of tentatively validated factors (ρ > 0.2).]
Figure 15. A.) Summary of the 32 factor classes and the number of factors within them for each NHANES cohort. Each factor is measured in blood or urine. B.) 100-7,500 individuals had their fasting blood glucose (FBG), HDL-C, LDL-C and triglyceride levels measured for each of these factors in each cohort; lipid levels were log transformed to approximate normality for least squares regression. Type 2 Diabetes status was assessed by considering those who had a FBG > 125 mg/dL. C.) Each of these 96 to 258 factors was tested for association with the logarithm of HDL-C, LDL-C, and triglyceride level with a linear regression model adjusted for age, age-squared, sex, BMI, ethnicity, and SES. To test against T2D status, a logistic regression model was utilized, adjusting for age, sex, ethnicity, SES, and BMI. D.) To account for multiple testing, we estimated the empirical null distribution by permuting the lipid levels and estimating the false-discovery rate (FDR). The p-value threshold (α) for statistical significance was determined by controlling the FDR to be under 10%. We deemed a factor to be tentatively validated if it was found to be significant in 2 or more cohorts with an effect in the same direction in all cohorts where it was significant. E.) For lipid level phenotypes, we estimated a final coefficient for tentatively validated factors by combining all cohorts and adjusting for age, age-squared, sex, ethnicity, SES, BMI, waist circumference, T2D status, blood pressure, and cohort. F.) We estimated the coefficient of determination (R2) for the final, combined models. G.) We re-computed our final models, adding 62 self-report variables one-by-one in an attempt to check the validity of the environmental effect. H.) We computed the pair-wise correlation between each of the tentatively validated factors along with other clinical co-variates and analyzed these relationships with correlation globes [10].
We omitted from our EWAS 73 factors that varied little across individuals in
our sample. Specifically, we omitted those that had a large majority (> 90%)
of the observations below a detection limit threshold as defined in the
NHANES codebook. We also removed factors that targeted a subset of the
population, such as the test for Trichomonas vaginalis, an infectious pathogen
found primarily in women.
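The detection-limit filter can be expressed as a small predicate. This is a minimal sketch, assuming the measurements for one factor are held in a list with `None` for missing values and that the detection limit is known from the codebook; the function name `keep_factor` is hypothetical.

```python
def keep_factor(measurements, detection_limit, max_fraction_below=0.9):
    """Return True if the factor varies enough to be kept in the screen.

    A factor is dropped when more than max_fraction_below of its observed
    measurements fall below the detection limit.
    """
    observed = [m for m in measurements if m is not None]
    if not observed:
        return False  # nothing measured, nothing to test
    below = sum(m < detection_limit for m in observed)
    return below / len(observed) <= max_fraction_below
```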
T2D cases were individuals who had a fasting blood glucose (FBG) level
greater than or equal to 126 mg/dL, as advised by the American Diabetes
Association (ADA) [204] (Figure 15B). We chose specificity and accuracy of
diagnosis over sensitivity; we acknowledge that this definition ignores those
who were previously diagnosed as diabetic, but now keep their blood glucose
under tight control. In fact, a larger proportion of NHANES respondents
described themselves as diabetics or were taking medications often used to
treat diabetes than were classified by FBG levels. Neither FBG nor the
self-reported diabetes status distinguishes between Type 1 Diabetes (T1D) and
T2D; as T2D has a prevalence rate more than 40 times higher than T1D, we
assumed all our cases have T2D.
We used survey-weighted logistic regression to associate each of the 543
environmental attributes to diabetes status while adjusting for age, sex, body
mass index (BMI), ethnicity, and an estimate for SES (Figure 15C). We
acknowledge that estimating SES is difficult; nevertheless, we used the tertile
of poverty index, equivalent to the participant’s household income divided by
the time-adjusted poverty threshold, as the estimate for SES. We used R with
the survey package to conduct all survey-weighted analyses [205, 206].
Exposures were captured as either continuous or categorical variables. Most
chemical exposure data arising from mass spectrometry or absorption
measurements occurred within a very small range and had a right skew; thus,
we log transformed these variables. Further, we applied a z-score
transformation (centering each observation on the mean and scaling by the
standard deviation) in order to compare odds ratios from the many regressions.
Similarly, for categorical variables, we made the definition of the referent
consistent, defining it to be the “negative” result of the test.
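The transformation of a continuous exposure can be sketched as follows. This is a minimal sketch, assuming strictly positive measurements, a base-10 logarithm, and the sample standard deviation; the base and the choice of standard deviation are assumptions, and the function name `log_zscore` is hypothetical.

```python
import math
import statistics

def log_zscore(values):
    """Log-transform a right-skewed exposure, then standardize it."""
    logged = [math.log10(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in logged]
```

After this transformation a one-unit change in any exposure corresponds to one standard deviation on the log scale, which is what makes the odds ratios comparable across factors.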
We calculated the false discovery rate (FDR), the estimated proportion of false
discoveries made versus the number of real discoveries made at a given
significance level, to control for type I error due to multiple hypotheses testing
in associating the factors to disease status [40]. To estimate the number of
false discoveries, we created a “null distribution” of regression test statistics by
shuffling the diabetes status labels 1000 times and recomputing the regressions.
The FDR was then estimated to be the ratio of the proportion of results that
were called significant at a given level α in the null distribution and the
proportion of results called significant from our real tests. To choose factors
significantly associated with T2D in the first single-cohort phase, we used a
significance level (α=0.02), which corresponded to an FDR of 10% in three
out of four cohorts (1999-2000, 2003-2004, and 2005-2006) and 30% for the
2001-2002 cohort.
To improve our power, we used the four independent cohorts to validate
significant findings (Figure 15D). We considered a significant factor as
“validated” if it was found to be significant (α=0.02) in more than one cohort,
at the expense of having to drop those factors not measured in a second cohort.
We then assessed the FDR of the multi-cohort validation. We first estimated
the number of false positives by counting the number of factors found
significant at a level α in two or more cohorts from the permuted datasets. We
then estimated the FDR by computing the ratio between the number of false
positives and the number of validated factors. This value was 2% with α equal
to 0.02.
We fit a final logistic regression model with data combined from multiple
NHANES cohorts utilizing all measurements for a specific environmental
factor, attaining an overall odds ratio. The covariates of the final model were
age, sex, BMI, ethnicity, SES, and cohort. We computed new sample weights
for the combined datasets by taking the average of the original sample weights
as described by the NHANES analytic guidelines [207].
We conducted 3 secondary analytic tests for the validity and sensitivity of our
final estimates. We first attempted to check for reverse causality, or
association of exposure due to T2D diagnosis. Our second test attempted to
take into account the lipophilic characteristics of the environmental factors
found. Our last test attempted to take into account recent food and supplement
consumption as a potential bias for exposure measures. For adequate sample
size and ease of comparison to the final fit model, we utilized all available data
combining multiple NHANES cohorts as the sample to conduct these tests.
To attempt to account for one’s T2D diagnosis as a modifier of environmental
exposure, known as “reverse causality”, we recomputed our models omitting
those who had been diagnosed with diabetes. Individuals with a diabetes
diagnosis were identified through yes answers submitted on a NHANES health
questionnaire (“Doctor told you have diabetes?”). Thus, we refit our final
models with individuals who were only at risk for T2D diagnosis.
Our second test attempted to account for the lipophilic chemical characteristics
of our significant factors. Many of the environmental factors measured in
NHANES are absorbed readily into fatty tissue; the presence of fatty tissue is
also associated with T2D and is a potential confounder. Thus, we recomputed
the models taking into account total triglycerides and cholesterol measured in
the blood specimens of participants.
In our third test, we attempted to compare dietary and supplement consumption
of cases and controls gathered from 24- and 48-hour recall and supplement use
questionnaires, reasoning that recent intake may confound the exposure-disease
association. The NHANES data contain the amounts of food components
consumed based on the dietary recall, available for all participants examined
above. Specifically, amounts of food components are computed from the
questionnaire using the United States Department of Agriculture (USDA) Food
and Nutrient Database. Some of the vitamin and nutrient components included
vitamin A, vitamin B-6, vitamin B-12, vitamin C, vitamin E, vitamin K,
carotenes, lycopene, thiamin, riboflavin, niacin, folate, calcium, iron,
magnesium, phosphorus, potassium, sodium, zinc, copper, and selenium.
Other components included macronutrients, such as protein, carbohydrates, fat,
fiber, and cholesterol. The total number of food components considered
ranged from 51 to 63 across the different cohorts. Further, the 2003-2004 and
2005-2006 cohorts contained both 24- and 48-hour recall data. Supplement use
included a count of consumption of vitamins, minerals, botanicals, and/or
mixtures of them over the month prior to the survey. To check for possible
confounding by recent consumption, we added each food and supplement
variable to the logistic regression models specified above and re-evaluated
significance and effect size of the validated environmental factors. We coded
food component content as the logarithm (base 10) of the amount entered. We
coded supplement use as an integer count value. We acknowledge the
potential of bias with the use of questionnaire data and a pre-determined
database of food items but assumed it was a reliable proxy of consumption and
behavioral data in lieu of other information.
EWAS on T2D: Results
Population characteristics
Across the cohorts, the total non-weighted and weighted numbers of those who
were diabetic compared to non-diabetic were similar. However, we did see
significant differences with demographic factors such as sex, age, and
socioeconomic status between cases and controls. T2D occurred in higher age
groups in all cohorts (p < 0.001, 2-sided t-test). There were significantly more
male participants than females in all cohorts (p < 0.001, 0.02, 0.03, χ2 test)
except for 2005-2006. Furthermore, there was a significant association
between lowest SES (first tertile of poverty index) and T2D (p=0.006, 0.03,
0.04, logistic regression) for the 1999-2000, 2001-2002, and 2005-2006
cohorts, respectively. While we did not see a univariate association between
ethnicity and T2D as diagnosed by FBG, we did confirm previously reported
associations of ethnicity to T2D when stratifying by age and sex [208]. As
expected, BMI was significantly associated with T2D status (p < 0.001, t-test)
for all cohorts. Given these differences between the cases and controls, we
adjusted our logistic regression models described below accordingly.
Environment Associations to T2D
Figure 16 shows the distribution of p-values of association for each
environmental factor and class, adjusted for sex, age, BMI, ethnicity, and the
estimate for SES, plotted in a “Manhattan plot” analogous to the association
results from a GWAS. The 37 significant or notable factors are
annotated in the figure. Specific categories show association with T2D, such as
organochlorine pesticides, nutrients/vitamins, polychlorinated biphenyls, and
dioxins (Figure 16), with 10 to 30% of the factors in each class having
p-values less than 0.02. Many positive (low p-values) and negative (high
p-values) associations replicated well among the different cohorts.
Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D. Y-axis indicates -log10(p-value) of the adjusted logistic regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15A. Within each environmental class, factors are arranged left to right in order from lowest to highest odds ratio (OR). Plot symbols represent different cohorts: diamond (1999-2000), square (2001-2002), filled dot (2003-2004), circle (2005-2006). Red horizontal line is -log10(α)=1.7 (α=0.02). Validated factors significant in 2 or more NHANES cohorts are in bold face (α=0.02 in two or more cohorts, FDR of 2%) with larger plot points. Other significant factors (α=0.02) are annotated with a numeric label corresponding to the environmental factor class color key on the right. Figure abbreviations: Validated factors: t-β-carotene: trans β-carotene; c-β-carotene: cis β-carotene; PCB170: 2,2',3,3',4,4',5-Heptachlorobiphenyl. Group 1 (dioxins): 1-hxcdd: 1,2,3,6,7,8-Hexachlorodibenzo-p-dioxin; 2-hxcdd: 1,2,3,7,8,9-Hexachlorodibenzo-p-dioxin. Group 2 (furans): OCDF: 1,2,3,4,6,7,8,9-Octachlorodibenzofuran. Group 3 (heavy metals): Ur: uranium; Sb: antimony; Pb: lead. Group 4 (nutrients): tot-β-car: total β-carotene; α-car: alpha-carotene; retnl: retinol; Vita. D: vitamin D; δ-t: delta-tocopherol. Group 5 (organochlorine pesticides): DDE: dichlorodiphenyldichloroethylene. Group 6 (PCB): PCB169: 3,3',4,4',5,5'-Hexachlorobiphenyl; PCB138: 2,2',3,4,4',5'-Hexachlorobiphenyl; PCB195: 2,2',3,3',4,4',5,6-Octachlorobiphenyl; PCB183: 2,2',3,4,4',5',6-Heptachlorobiphenyl; PCB199: 2,2',3,3',4,5,5',6'-Octachlorobiphenyl; PCB178: 2,2',3,3',5,5',6-Heptachlorobiphenyl; PCB187: 2,2',3,4',5,5',6-Heptachlorobiphenyl; PCB180: 2,2',3,4,4',5,5'-Heptachlorobiphenyl; PCB146: 2,2',3,4',5,5'-Hexachlorobiphenyl; PCB196: 2,2',3,4,4',5,5',6-Octachlorobiphenyl. Group 7 (bacteria): H2: Herpes Simplex 2; HSBA: Hepatitis B Surface Antibody.
Table 7 shows those factors that were validated as being significant in two or
more of the independent cohorts (multi-cohort validation FDR of 2%).
Predicted probabilities of having T2D were computed for a prototype
participant, a 45-year-old white male with BMI of 27 (middle of the range for
non-diabetics in the NHANES sample) and from the middle SES, at high and
low exposure levels. For combined cohorts, the predicted probability applies to
a prototype participant from the 2005-2006 cohort. We also computed the
overall estimate by combining NHANES cohort data in a final model
additionally adjusted for cohort; the predicted probabilities for these models
were computed for a prototype participant as defined above. We defined low
exposure as having a log transformed exposure level one standard deviation
lower than the transformed mean, and high exposure as having a log
transformed exposure level one standard deviation higher than the transformed
mean. For example, a 45-year-old male from the 1999-2000 cohort with high
levels (0.09 ng/g) of heptachlor epoxide has a 6% likelihood of being in our
diabetes subset.
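The relationship between an odds ratio per standard deviation and the low/high predicted probabilities can be illustrated with a short calculation. The following is a minimal sketch, not the fitted models themselves: it assumes a logistic model in which moving from low exposure (1 SD below the transformed mean) to high exposure (1 SD above) shifts the log-odds by twice the log of the per-SD odds ratio, with all other covariates held at the prototype participant's values. The function names are illustrative. Because the reported odds ratios and probabilities are rounded, the result only approximately matches the tabulated values.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit: probability corresponding to log-odds x."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_at_high_exposure(p_low, or_per_sd, sd_span=2.0):
    """Predicted probability at high exposure given the probability at low exposure.

    p_low     : predicted probability at mean - 1 SD of log exposure
    or_per_sd : odds ratio for a 1 SD increase in log exposure
    sd_span   : number of SDs separating low and high exposure (2 by the
                definition of low and high exposure used in the text)
    """
    return expit(logit(p_low) + sd_span * math.log(or_per_sd))

# Illustration with rounded combined-cohort values for cis-beta-carotene:
# OR of 0.6 per SD and a predicted probability of 0.15 at low exposure.
p_high = prob_at_high_exposure(0.15, 0.6)
print(round(p_high, 3))
```

With these rounded inputs the computed high-exposure probability is close to the reported 0.06.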
Factor | Cohort | N† (T2D, No T2D) | P | OR (95% CI) | Factor Level (Lo-Hi) | Predicted Probability (Lo-Hi)
cis-β-carotene | 2001-2002 | 211, 2852 | 0.01 | 0.6 (0.5, 0.8) | 0.4-1.4 µg/dL | 0.12-0.05
 | 2003-2004 | 207, 2698 | 0.002 | 0.63 (0.5, 0.7) | 0.4-1.9 | 0.13-0.06
 | 2005-2006 | 186, 2425 | 0.02 | 0.6 (0.5, 0.8) | 0.4-1.6 | 0.15-0.06
 | 2001-2006* | 604, 7975 | < 0.001 | 0.6 (0.5, 0.7) | 0.4-1.7 | 0.15-0.06
trans-β-carotene | 2001-2002 | 211, 2854 | 0.01 | 0.6 (0.5, 0.8) | 5.1-27.2 µg/dL | 0.13-0.05
 | 2003-2004 | 207, 2698 | 0.002 | 0.7 (0.6, 0.8) | 4.8-24.7 | 0.13-0.06
 | 2005-2006 | 203, 2701 | 0.004 | 0.6 (0.4, 0.7) | 4.8-29.0 | 0.16-0.06
 | 2001-2006* | 621, 8253 | < 0.001 | 0.6 (0.5, 0.7) | 4.9-27.0 | 0.15-0.06
γ-tocopherol | 1999-2000 | 146, 2091 | 0.02 | 1.8 (1.3, 2.4) | 107-360 µg/dL | 0.03-0.09
 | 2003-2004 | 207, 2698 | 0.01 | 1.6 (1.3, 2.0) | 103-356 | 0.06-0.13
 | 1999-2006* | 767, 10307 | < 0.001 | 1.5 (1.3, 1.7) | 107-352 | 0.06-0.13
Heptachlor Epoxide | 1999-2000 | 46, 635 | 0.002 | 3.2 (2.4, 4.4) | 0.02-0.09 ng/g | 0.01-0.06
 | 2003-2004 | 67, 809 | 0.01 | 1.9 (1.3, 2.6) | 0.01-0.07 | 0.02-0.07
 | 1999-2004* | 178, 2367 | < 0.001 | 1.7 (1.3, 2.1) | 0.02-0.08 | 0.03-0.07
PCB170 | 1999-2000 | 45, 716 | 0.02 | 2.3 (1.5, 3.6) | 0.03-0.12 ng/g | 0.01-0.06
 | 2003-2004 | 53, 773 | 0.01 | 4.5 (2.1, 9.9) | 0.01-0.12 | 0.03-0.42
 | 1999-2004* | 165, 2426 | < 0.001 | 2.2 (1.6, 3.2) | 0.02-0.13 | 0.04-0.15
Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort. Odds ratio for each exposure, adjusted for BMI, age, sex, ethnicity, and SES is calculated for a change in the log exposure level by one standard deviation. Factor level is the amount of exposure defined by the low (1 SD lower than the average logged exposure level) and high range (1 SD higher than the average logged exposure level). The predicted probability range is an estimate for a 45-year-old white male with BMI of 27 kg/m2 from the middle SES to develop the disease in the low to high range of exposure. * denotes analysis using combined NHANES cohorts; models adjusted for age, sex, ethnicity, BMI, SES, and cohort; predicted probabilities for combined cohorts applies to an individual from the 2005-2006 cohort. † denotes unweighted number.
Nutrients and Vitamins: Carotenes and γ-tocopherol
Several vitamins were found to have levels inversely associated with T2D.
The first type included an antioxidant in the isoforms of β-carotene (final
adjusted OR of 0.6; 95% CI 0.5-0.7; p < 0.001). For the prototypical
participant, high levels of trans or cis β-carotene equated to a 9 percentage-point
lower predicted probability of T2D (6% vs. 15%). We were able to confirm the inverse
association of β-carotenes seen in multiple epidemiological studies in Saudi
Arabia [209], among older people [210], among Swedish men [211], and in an
earlier NHANES III cohort (pre-1999) [212], as well as another small study
that showed an inverse response between fasting glucose level and β-carotene
[213]. However, in a prospective case-control study β-carotene was not
significantly inversely associated to T2D [214]. Because T2D is associated
with reduced anti-oxidant defense, anti-oxidants, such as carotenes, have been
occasionally recommended as a therapy [215]. However, the evidence of
mitigation of T2D with these vitamins as therapies has been negligible in
clinical trials, including trials in women at high risk of cardiovascular disease
[216] or male smokers [217].
We also found a vitamin associated with increased risk for T2D. Surprisingly,
γ-tocopherol, a form of vitamin E, was highly significantly and positively
associated with T2D (final adjusted OR 1.5; 95% CI 1.3, 1.7; p < 0.001) in two
cohorts (adjusted OR of 1.8 and 1.6; p=0.02 and 0.01 for 1999-2000 and
2003-2004 cohorts; Table 7) and nearly significant in the two others (adjusted
OR of 1.3 and 1.6; p=0.06 and 0.04 for 2001-2002 and 2005-2006 cohorts). For the
prototypical participant, low levels of γ-tocopherol equated to a 7
percentage-point lower predicted probability of T2D (6% vs. 13%). To our
knowledge, this is a novel association between γ-tocopherol and T2D.
Persistent Pollutants: Polychlorinated Biphenyls and Organochlorine
Pesticides
We found organochlorine pesticides and polychlorinated biphenyls (PCBs),
both related pollutant factors, to be highly positively associated with T2D.
Among the PCBs, we specifically identified PCB170
(2,2',3,3',4,4',5-Heptachlorobiphenyl; final adjusted OR of 2.2; 95% CI 1.6-3.2; p < 0.001).
The effect sizes in the individual cohorts for PCB170 were large (adjusted OR
2.3 and 4.5; p = 0.02 and 0.01 for 1999-2000 and 2003-2004 cohorts). The
models predicted up to 15% T2D risk for the prototype participant, more than
double the risk of those with low concentrations of PCB170. The association
between the class of PCBs with T2D has been well described within Native
American [218], Japanese [219], Swedish [220], and Taiwanese [221] cohorts.
Heptachlor epoxide, an oxidation product of the organochlorine pesticide
heptachlor, was among the most highly associated factors (final adjusted OR
1.7; 95% CI 1.3-2.1; p < 0.001) in our EWAS. The effect sizes in the
individual cohorts were also large (adjusted OR 3.2 and 1.9; p=0.002 and 0.01
for 1999-2000 and 2003-2004 cohorts). The predicted probability for the
prototypical participant with high levels of the pollutant was 7%, more than 2
times greater than those who had lower levels of this pollutant.
Secondary analysis to test validity of the final estimates
We then attempted to test the validity of our final estimates by conducting 3
additional analytic tests. In the first test, we attempted to consider the
possibility of “reverse causality” or differential exposure status due to T2D
diagnosis. Second, we attempted to assess the effect of potential confounding
bias due to the lipophilic characteristics on our final environmental factor
effect estimates. Third, we attempted to assess the effect of recent nutrient and
supplement consumption on our final effect estimates.
To consider T2D diagnosis as a modulator of exposure, we removed all
individuals who answered yes when questioned about a past history of diabetes
in the NHANES health questionnaire (“Doctor told you have diabetes?”).
Thus, T2D cases were those who had a FBG higher than 125 mg/dL and were
at risk for T2D diagnosis. We recomputed the effect of exposure, adjusted for
age, sex, SES, ethnicity, BMI, and cohort. For all validated factors significant
in 2 or more cohorts above (Table 7), the estimates remained stable and
statistically significant. The effect size for Heptachlor Epoxide was marginally
smaller with an adjusted OR of 1.6 (95% CI 1.1, 2.1; p = 0.008). The adjusted
OR for PCB170 was also marginally smaller, 2.1 (95% CI 1.2, 3.9; p = 0.02).
The effect of γ-tocopherol was larger, with an adjusted OR of 1.8 (95% CI 1.3,
2.2; p < 0.001) and there was no change to effect sizes of the carotenes
(adjusted OR 0.6; 95% CI 0.5, 0.7; p < 0.001). We concluded that there was
not enough evidence to support the phenomenon of reverse causality based on
the effect sizes estimated for those who were at risk for T2D.
We next attempted to account for potential confounding bias of lipid levels.
To assess the degree of possible confounding we refit the logistic regression
adjusting for the logarithm (base 10) of total triglyceride and cholesterol levels
in addition to age, sex, BMI, SES, ethnicity, and cohort. We did not observe a
great change in effect size estimates for the environmental factors after this
further adjustment for total triglycerides and cholesterol. The odds ratio after
adjusting for lipid levels for carotenes was 14% higher, 0.7 (95% CI 0.6, 0.8; p
< 0.001) compared to 0.6. Similarly, the odds ratio for γ-tocopherol was
attenuated by 7%, 1.4 (95% CI 1.2, 1.6; p < 0.001) compared to 1.5 (Table 7).
For the pesticide factor, the odds ratio was smaller by 6%, 1.6 (95% CI 1.3,
2.0; p < 0.001) versus 1.7 (Table 7). Lastly, for PCB factor, we observed a 3%
higher odds ratio of 2.3 (95% CI 1.4, 3.7, p = 0.002) versus 2.2. Consistent
with this secondary analysis, we observed a similar degree of effect size
differences when using the “Lipid Adjusted” NHANES environmental factors,
which are provided for only a few of the pollutant factors (not shown). We
concluded that the effect sizes of the environmental factors were affected by
lipid levels, but not substantially biased by them.
We then searched for differences in food and supplement consumption patterns
between diabetics and non-diabetics for all 4 cohorts close to the time of
survey derived from dietary recall and supplement use questionnaires. In
comparing dietary nutrients, we did not observe a difference for any dietary
nutrient except one between cases and controls. This exception included a
lower total carbohydrate intake for diabetics versus controls, confirming that
many diabetics may have known about their disease; specifically, the adjusted
OR was 0.7 (95% CI 0.6, 0.8; p=0.001) for a 10% increase in total
carbohydrate consumption, adjusted for sex, age, ethnicity, SES, and cohort.
We also observed an inverse association between any supplement use and T2D,
with an adjusted OR of 0.6 (95% CI 0.5, 0.8, p < 0.001), also consistent with
our expectation of increased health awareness for those with T2D. However,
we specifically could find no difference in consumption of carotenes or
tocopherol (p=0.85 and 0.2 respectively) between cases and controls, two of
the validated nutrient factors found in our EWAS (Table 7).
Having observed some difference in consumption behavior between cases and
controls, we then attempted to assess the influence of recalled dietary
consumption on the environmental associations by recomputing the logistic
regression models in presence of dietary and supplement use variables. Adding
the new dietary or supplementary vitamin consumption variables did not
attenuate the odds ratios (maximum change of 1-2%), nor did they lessen the
strength of the associations for all of the 5 validated environmental factors
described in Table 7. Thus, we did not have evidence to support that recent
consumption influenced the factor-disease effect sizes for the validated factors
found in our EWAS.
We took a further step in assessing the strength of the environmental
associations, adjusting for total triglycerides and cholesterol, any supplement
use, and food intake simultaneously. Specifically, the odds ratio for a SD
increase in γ-tocopherol levels was 1.3 (95% CI 1.1, 1.5; p=0.004) when
adjusting for logarithm base 10 of triglycerides, cholesterol, total vitamin E
consumption, beta carotene consumption, total carbohydrate consumption, and
any supplement use along with age, sex, ethnicity, BMI, and SES. The
analogous models for the cis and trans β-carotene resulted in adjusted OR of
0.7 (95% CI 0.6, 0.8; p < 0.001). Odds ratios were consistently high and
significant for the pollutant factors Heptachlor Epoxide and PCB170 after
further analogous adjustment of recent consumption and total lipid levels, with
odds ratios of 1.6 (95% CI 1.3, 2.1; p < 0.001) and 2.2 (95% CI 1.4, 3.5;
p=0.003) respectively. We concluded that recent consumption as encoded by
the dietary recall questionnaire in conjunction with lipid levels did not alter the
validity of the associations of the 5 environmental factors found.
To summarize our secondary tests for validity, we concluded that reverse
causality, recent food and supplement consumption, and total lipid levels did
not substantially bias our effect estimates for the 5 validated factors. These
tests were made possible by the extensive list of co-variates available in the
NHANES.
EWAS on T2D: Conclusions
We have described a prototype Environment-Wide Association Study
(EWAS) and applied it to the study of Type 2 Diabetes (T2D), and validated
many of our significant findings across independent cohorts and confirmed
some of them through the literature. This study is made possible by the
examination of multiple cohorts present in the nationally representative
NHANES dataset. We have rediscovered factors such as carotenes and PCBs
with previously known association to T2D. Unexpectedly, we found higher
levels of γ-tocopherol were associated with higher likelihood of T2D,
independent of dietary intake. Of the components of Vitamin E, γ-tocopherol is
the most abundant form in the US diet [222], and makes up to 50% of the total
vitamin E in human muscle and adipose tissue [223], two known insulin-target
tissues. As γ-tocopherol has been previously suggested as a preventive agent
against colon cancer [224], any potential adverse metabolic effects for this
vitamin should be studied closely.
Another novel finding was in the significant association between heptachlor
epoxide levels and T2D. Heptachlor is a pesticide; most uses of heptachlor
were discontinued in the late 1980s [225]. The main source of heptachlor and
its breakdown product, heptachlor epoxide, is from food, but heptachlor
epoxide is persistent in the environment and can even be passed in breast milk
[226]. While a significant association with T2D has been reported across
thirty-thousand pesticide applicators who used the pesticide heptachlor [227],
to our knowledge, this broad association between heptachlor epoxide and T2D
in the general public, as surveyed by NHANES, is novel.
While GWAS has allowed us to find novel variants associated with T2D of
possible mechanistic importance and provided a model for a comprehensive
study of the environment described here, associated variants have had only
moderate effect sizes to date. Most of the risk loci identified with GWAS have
small individual odds ratios, generally less than 1.3 [164, 202, 228] and the
highest has been reported to be 1.71, belonging to a variant in the TCF7L2
gene [163, 165]. Albeit from different populations and analytical scenarios, the
effect sizes of our validated environmental factors on T2D were comparable to
the highest odds ratios seen in GWAS.
ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID
LEVELS
Serum lipid levels correlate with the risk of coronary heart disease (CHD),
atherosclerosis, stroke, and even the disease described above, type 2 diabetes
(T2D). Both genetic and environmental factors influence lipid phenotypes.
Lipid level variation is 20-70% heritable [229, 230], while well-documented
environmental or lifestyle factors include physical exercise, smoking, and diet
[231-235]. Other less tangible factors, such as air pollution [236], may also
be important. Here, we have applied the EWAS paradigm,
extended from above, to evaluate 332 environmental attributes for their
association with triglycerides, high-density lipoprotein-cholesterol (HDL-C),
and low-density lipoprotein cholesterol (LDL-C).
EWAS on Serum Lipids: Methods
Data
Laboratory data analyzed included serum and urine measures of environmental
factors and clinical measures including lipid levels. We analyzed all factors
that were a direct measurement of an extrinsic environmental attribute (e.g.
amount of pesticide or heavy metal in urine or blood) as described earlier. Of
824 potentially eligible variables across all cohorts, we omitted 119 that varied
little across individuals (continuous variables with > 99% of the observations
below the detection limit threshold and binary variables with > 99% of
observations in either the “negative” or “positive” class). Of the 705
remaining variables, 169 were measured in the 1999-2000 cohort, 182 from
2001-2002, 258 from 2003-2004, and 96 from 2005-2006. Cumulatively, they
comprised 332 unique environmental factors, with 207 factors measured in >1
cohort. We binned these factors into 32 “classes” of related factors, analogous
to chromosomal units in GWAS (Figure 15A) as described earlier.
Different environmental factors were measured in varying numbers of
participants: 109-3610 (median 938), 101-3388 (median 896), and 222-7485
(median 1958) individuals for triglyceride, LDL-C, and HDL-C levels
respectively (Figure 15B). Individuals are selected randomly to have these
measurements and the selection procedure is dependent on their demographic
characteristics due to the complex stratified population sampling of NHANES
[237]. Serum triglyceride levels were measured in the morning after >8.5 hours’
fasting. LDL-C levels were derived from total cholesterol and direct HDL-C
measurements using the Friedewald calculation.
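For reference, the Friedewald calculation can be written out explicitly. The sketch below uses the standard form of the equation for values in mg/dL, LDL-C = total cholesterol - HDL-C - triglycerides/5, together with the conventional restriction that the estimate is not used when triglycerides are 400 mg/dL or higher. The function name and the choice to return None in that case are illustrative assumptions, not a description of how NHANES implements the derivation.

```python
def friedewald_ldl(total_chol_mg_dl, hdl_mg_dl, triglycerides_mg_dl):
    """Estimate LDL-C (mg/dL) with the Friedewald equation.

    Triglycerides/5 approximates VLDL cholesterol when values are in mg/dL.
    Returns None when triglycerides are >= 400 mg/dL, where the
    approximation is conventionally considered unreliable.
    """
    if triglycerides_mg_dl >= 400:
        return None
    return total_chol_mg_dl - hdl_mg_dl - triglycerides_mg_dl / 5.0

# Example: total cholesterol 200, HDL-C 50, fasting triglycerides 150 (mg/dL)
ldl = friedewald_ldl(200.0, 50.0, 150.0)
print(ldl)  # 120.0
```

The dependence on triglycerides is why the calculation requires a fasting sample.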
Statistical analysis
The systematic EWAS analysis encompasses multiple steps (Figure 15 C-H) as
described earlier. First, survey-weighted linear regressions are performed for
each environmental factor against log10 transformed lipid levels, adjusting for
age, age-squared, sex, body mass index (BMI), ethnicity, and socioeconomic
status (SES) (Figure 15C). For SES we used the tertile of poverty index
(participant’s household income divided by the time-adjusted poverty
threshold), as previously described. Ethnicity was coded in 5 groups (Mexican
American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other).
We used the R survey package for all survey-weighted analyses [205, 206].
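The structure of the per-factor regression can be sketched as follows. This is an illustration in Python rather than the R survey package used for the analyses, and it makes simplifying assumptions: it computes only weighted least-squares point estimates, it does not reproduce the design-based standard errors that a survey package derives from strata and sampling units, and the data, weights, and covariate set are synthetic. It also shows how a coefficient on a log10-transformed outcome converts to a percent change.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Weighted least-squares coefficients minimizing sum(w * (y - X b)^2)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def percent_change(beta_log10):
    """Percent change in the outcome per unit of predictor when the
    outcome was log10-transformed."""
    return 100.0 * (10.0 ** beta_log10 - 1.0)

# Synthetic, noise-free illustration (all values hypothetical).
rng = np.random.default_rng(0)
n = 200
exposure_z = rng.normal(size=n)              # z-scored log exposure
age_c = (rng.uniform(20, 80, size=n) - 50) / 10.0  # centered age in decades
weights = rng.uniform(0.5, 2.0, size=n)      # stand-in for sampling weights
log10_lipid = 2.0 + 0.05 * exposure_z + 0.01 * age_c

# Design matrix: intercept, exposure, age, age-squared.
X = np.column_stack([np.ones(n), exposure_z, age_c, age_c ** 2])
beta = weighted_ols(X, log10_lipid, weights)
print(round(float(beta[1]), 4), round(percent_change(beta[1]), 2))
```

In this synthetic case the exposure coefficient of 0.05 on the log10 scale corresponds to roughly a 12% higher lipid level per SD of exposure.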
We calculated the false discovery rate (FDR), the estimated proportion of false
discoveries made versus the number of total discoveries made for a given
significance level α, to control for multiple hypothesis testing (Figure 15D)[40,
238]. We created a “null distribution” of regression test statistics for each
cohort separately, shuffling the triglycerides, HDL-C, and LDL-C levels 1000
times and refitting the linear regression models. FDR is the ratio of the results
called significant at a given level α in the null distribution and the results
called significant from our real tests. We used FDR<10% to select significant
associations. This corresponds to α=0.02 for triglycerides, 0.02 for HDL-C,
and 0.01 for LDL-C.
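The permutation-based FDR estimate described above can be sketched as follows. The sketch assumes that p-values have already been computed for the real data and for each shuffled dataset, and it normalizes the null count by averaging over permutations, which is one reasonable reading of the ratio described in the text rather than a confirmed detail of the implementation. The function name and the small arrays are hypothetical.

```python
import numpy as np

def permutation_fdr(real_p, null_p, alpha):
    """Estimate FDR at significance level alpha.

    real_p : 1-D array of p-values from the real data (one per factor)
    null_p : 2-D array (n_permutations, n_factors) of p-values obtained
             after shuffling the outcome
    Returns the average number of null results called significant per
    permutation, divided by the number of real results called significant.
    """
    real_p = np.asarray(real_p)
    null_p = np.asarray(null_p)
    n_real = int(np.sum(real_p <= alpha))
    if n_real == 0:
        return float("nan")
    expected_false = float(np.mean(np.sum(null_p <= alpha, axis=1)))
    return expected_false / n_real

# Toy example: 6 factors, 4 permutations (values are made up).
real_p = [0.001, 0.01, 0.015, 0.2, 0.5, 0.8]
null_p = [
    [0.01, 0.30, 0.50, 0.60, 0.70, 0.90],
    [0.20, 0.25, 0.40, 0.55, 0.65, 0.95],
    [0.03, 0.40, 0.45, 0.60, 0.75, 0.85],
    [0.50, 0.35, 0.60, 0.70, 0.80, 0.99],
]
fdr = permutation_fdr(real_p, null_p, alpha=0.02)
print(round(fdr, 4))
```

Scanning alpha over a grid and choosing the largest value with estimated FDR below 10% would yield thresholds of the kind reported for each lipid outcome.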
Next, we used the four independent cohorts to validate significant findings
(Figure 15D). We considered a significant factor as “tentatively validated” if it
was significant (FDR<10%) in more than one cohort and with the same
direction of effect in all cohorts. Of the 332 factors, 125 were assessed in only
1 cohort and thus they could not be considered validated; 73 factors were
assessed in 2 cohorts, 102 were assessed in 3 cohorts, and 32 were assessed in
all 4 cohorts. We assessed the FDR of the multi-cohort validation empirically
through permutations, as described in the previous chapter. Briefly, we first
estimated the number of false positives by counting the number of factors
found significant at level α in multiple cohorts from the permuted analyses.
For example, to assess the FDR of validating a factor in 2 cohorts, we collected
the factors that fell below the significance threshold α in the permuted data
corresponding to two different cohorts and counted the number of factors
found significant in both. We repeated this operation on all possible pairs
(n=6) of cohorts, adding up numbers found to be significant in each pair. We
then estimated the FDR by computing the ratio between the total number of
false positives and the number of true validated factors. We repeated the
analogous operation for factors significant in 3 and 4 cohorts. For triglyceride
levels, FDR=0.008, 0.0003, and 5x10^-5 for results validated in 2, 3 and 4
cohorts, respectively. For HDL-C, the respective FDR is 0.01, 0.0002, and
5x10^-5. For LDL-C, the respective FDR is 0.009, 3x10^-5, and < 10^-9.
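The counting step for the multi-cohort validation FDR can be sketched as follows. The sketch assumes that, for one permuted analysis, the set of factors falling below the significance threshold has been recorded for each cohort; it then sums, over every combination of k cohorts, the number of factors significant in all cohorts of that combination, and divides by the number of factors validated in the real data. How counts are averaged across the 1000 permutations is not specified here, and the factor names and numbers are hypothetical.

```python
from itertools import combinations

def multi_cohort_false_positives(null_hits, k=2):
    """Count factors significant in every cohort of each k-cohort combination.

    null_hits : dict mapping cohort label -> set of factor names that were
                significant at level alpha in permuted data
    """
    total = 0
    for cohorts in combinations(sorted(null_hits), k):
        shared = set.intersection(*(null_hits[c] for c in cohorts))
        total += len(shared)
    return total

def validation_fdr(null_hits, n_validated, k=2):
    """Ratio of permuted multi-cohort hits to real validated factors."""
    if n_validated == 0:
        return float("nan")
    return multi_cohort_false_positives(null_hits, k) / n_validated

# Toy example with four cohorts (six possible pairs) and made-up factors.
null_hits = {
    "1999-2000": {"f1", "f2"},
    "2001-2002": {"f2", "f3"},
    "2003-2004": {"f3"},
    "2005-2006": set(),
}
false_pos = multi_cohort_false_positives(null_hits, k=2)
print(false_pos, validation_fdr(null_hits, n_validated=20, k=2))
```

Setting k to 3 or 4 gives the analogous counts for factors significant in three or four cohorts.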
We then fit a final linear regression model with data combined from multiple
NHANES cohorts utilizing all measurements available for a tentatively
validated environmental factor, attaining an overall estimate and p-value
(Figure 15E). We utilized the larger sample size to adjust for additional
quantitative factors that we were unable to adjust for in the single cohort
analyses (due to small residual degrees of freedom). In addition to initial
covariates, we also adjusted for waist circumference, T2D status (as defined in
the previous section, fasting blood glucose ≥ 126 mg/dL), systolic and diastolic
blood pressure (mm Hg), and cohort. To estimate how much of the variance
was described by each environmental factor, we estimated the change in the
coefficient of determination (R^2) when adding that factor versus a model including
only the adjusting factors (Figure 15F). We also performed regressions on
untransformed lipid levels to estimate raw effect size.
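The change in R^2 attributable to a single environmental factor can be illustrated with a pair of nested linear models. This is a minimal sketch using ordinary (unweighted) least squares on synthetic data; it does not account for survey weights, and the function names and variables are assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination for a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def delta_r_squared(covariates, factor, y):
    """R^2 of the model with the factor minus R^2 of the covariate-only model."""
    n = len(y)
    base = np.column_stack([np.ones(n), covariates])
    full = np.column_stack([base, factor])
    return r_squared(full, y) - r_squared(base, y)

# Synthetic example: outcome depends on one covariate and one factor.
rng = np.random.default_rng(1)
n = 300
covariate = rng.normal(size=n)
factor = rng.normal(size=n)
y = 1.0 + 2.0 * covariate + 3.0 * factor + rng.normal(scale=0.5, size=n)

gain = delta_r_squared(covariate, factor, y)
print(round(gain, 3))
```

Because the models are nested, the difference is non-negative; it measures the additional variance described by the factor beyond the adjusting covariates.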
Sensitivity analyses
We conducted sensitivity analyses to account for recent food, alcohol,
supplements, medications, exercise, and history of cardiovascular health
(Figure 15G). Sixty-two questionnaire items were used. For adequate sample
size and consistency with the final-fit model, we combined all available
NHANES cohorts.
Intake variables (total calories, carbohydrates, saturated fat, monounsaturated
fat, alcohol, cholesterol, vitamins, and iron) are computed from the
questionnaire using the USDA Food and Nutrient Database. All cohorts
contained 24-hour recall data. For 2003-2004 and 2005-2006 cohorts we
computed an average of 24- and 48-hour recall data. Supplement use included
the integer count of consumption of vitamins, minerals, botanicals, and/or their
mixtures the month prior to the survey. Consumption of any fish or shellfish
during the last month was also considered.
For individuals with abnormal levels of lipids, drug therapies such as statins,
resins, and fibrates are often prescribed [239]. Therefore, we sought to adjust
for any use of these medications. Drug use definition required that the
individual used the drug during the month prior to the survey and the
interviewer saw the prescription bottle.
Physical activity also influences lipid levels. We therefore classified
individuals in high, medium, or light intensity weekly exercise categories by
computing metabolic equivalents of recalled activity levels [240, 241],
including components such as leisure time, occupational and household
routines-related activity.
Finally, recalled cardiovascular health history was based on positive response
to questions on the presence of coronary heart disease, angina/angina pectoris,
heart attack(s), or congestive heart failure.
To evaluate the impact of these 62 adjusting variables, we recomputed the
regression models by adding each variable to our final model one-by-one and
observed the change in the effect size for each putative environmental factor.
We also built a model adjusting for lipid-lowering drugs, supplement use,
exercise, and self-reported cardiovascular-related disease simultaneously.
Correlation pattern between factors associated with lipid levels
Identified factors that are associated with lipid levels may not be independent.
Therefore, we also computed all Pearson correlation coefficients between each
of the validated environmental factors as well as the demographic (age, sex,
ethnicity, SES), and clinical (BMI, waist circumference, blood pressure, and
diabetes status) risk factors to ascertain the pattern of relationships among
them (Figure 15H). Next, we visualized all of these correlations as a
“correlation globe” to infer their inter-dependence as a function of all variables
examined. This approach has been utilized to discover inter-related or
dependent sets of genes in a gene expression microarray experiment [176, 177].
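The pairwise correlation step can be sketched as follows. The sketch computes Pearson correlation coefficients between all columns of a data matrix and lists the pairs whose absolute correlation exceeds a cutoff. The cutoff, the variable names, and the toy data are hypothetical, and the "correlation globe" visualization itself is not reproduced.

```python
import numpy as np

def correlation_pairs(data, names, min_abs_r=0.5):
    """Return (name_i, name_j, r) for column pairs with |r| >= min_abs_r.

    data  : 2-D array with one column per variable
    names : list of variable names, in column order
    """
    r = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= min_abs_r:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

# Toy data: "b" is a linear function of "a"; "c" alternates in sign.
x = np.arange(10, dtype=float)
data = np.column_stack([x, 2.0 * x + 1.0, np.where(x % 2 == 0, 1.0, -1.0)])
pairs = correlation_pairs(data, ["a", "b", "c"])
print(pairs)
```

Each retained pair would correspond to an edge between two variables in a network-style display of the correlations.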
Power calculations
We estimated [242] that the EWAS had >90% median power for all cohorts for
detection of 5% change in HDL-C (p<0.02) and LDL-C (p<0.01) and 10%
change of triglyceride levels (p<0.02).
EWAS on Serum Lipids: Results
Demographic and baseline associations with lipid levels
As expected [243], demographics, BMI, ethnicity, and SES correlated with
lipid levels. For example, consistent positive correlations existed between age
and triglycerides (5-10% higher per 10 years, p-values<0.02), and BMI and
triglycerides (2% higher per 1 unit of BMI, p-values<0.004), and consistent
negative correlations between black ethnicity and triglycerides (13% lower vs.
white, p-values<0.001) [244]. Consistent polynomial relationships existed
between age and both HDL-C and LDL-C. Negative correlations existed
between BMI and HDL-C (1% lower per BMI unit, p-values<0.0001). In
addition, SES was associated with HDL-C (1-5% lower for lower vs. higher
tertile, p-values<0.03).
Environment associations with lipid levels
Figure 17 shows the distribution of p-values of association for each
environmental factor binned by its class, adjusted for sex, age, age-squared,
BMI, ethnicity, and SES. For triglyceride levels, 10/169, 24/182, 49/258, and
12/96 factors passed the required threshold of significance for the 1999-2000,
2001-2002, 2003-2004, and 2005-2006 cohorts respectively. Likewise for
LDL-C, 1/169, 8/182, 15/258, and 11/96 were significant, respectively. For
HDL-C, 2/169, 21/182, 39/258, and 15/96 were significant. Using other
cohorts, we tentatively validated significant findings from our initial screen.
Across cohorts, there were 22, 8, and 17 tentatively validated factors for
triglycerides, LDL-C and HDL-C, respectively (Figure 17 A-C).
The data were combined across cohorts for each tentatively validated factor and
estimates were further adjusted for waist circumference, T2D status, blood
pressure, and cohort. The variance ascribed to baseline co-variates was 22-25%
(triglycerides), 15-16% (LDL-C), and 23-26% (HDL-C). Each of the
tentatively validated environmental factors described an additional 0.7-18.4%
(triglycerides), 1.8-14.1% (LDL-C), and 0.4-4.0% (HDL-C) of the variance in
lipid levels.
Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels. Y-axis indicates -log10(p-value) of the adjusted linear regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15. Plot symbols represent different cohorts: 1999-2000 (diamonds), 2001-2002 (squares), 2003-2004 (filled dots), 2005-2006 (circles). Red horizontal line represents the level of significance corresponding to FDR less than 10%. A) log10(triglycerides), B) log10(LDL-C), C) log10(HDL-C).
[Figure 17 plot: panels A (triglycerides), B (LDL-C), C (HDL-C). Y-axis: -log10(p-value), 0 to 4. X-axis: environmental factor classes (acrylamide, allergen test, bacterial infection, cotinine, diakyl, dioxins, furans dibenzofuran, heavy metals, hydrocarbons, latex, nutrients carotenoid, nutrients minerals, nutrients vitamin A, nutrients vitamin B, nutrients vitamin C, nutrients vitamin D, nutrients vitamin E, pcbs, perchlorate, pesticides atrazine, pesticides carbamate, pesticides chlorophenol, pesticides organochlorine, pesticides organophosphate, pesticides pyrethyroid, phenols, phthalates, phytoestrogens, polybrominated ethers, polyflourochemicals, viral infection, volatile compounds). Legend: cohort markers for 1999-2000, 2001-2002, 2003-2004, 2005-2006.]
Effects for the top tentatively validated associations for triglycerides, LDL-C,
and HDL-C are shown in Figure 18, Figure 19, and Figure 20. Although we
found 22 and 17 factors for triglycerides and HDL-C respectively, we display
the top 2 findings (total of 12) for each category for visualization. Furthermore,
we discuss here some of them in more detail. Effect sizes for continuous
variables are for 1 SD of log-transformed value of the environmental factor.
Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI. Combined cohort denotes the estimate attained when combining all cohorts available for exposure in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI, waist circumference, T2D status, blood pressure, and cohort. Percent change (x-axis) is the percent change of lipid level for a change in 1SD of logged exposure value. Effect size (in mg/dL) attained when fitting the untransformed lipids to the model. Symbols proportional to sample size and colors represent different environmental classes as represented in Figure 15.
[Figure 18 plot: forest plot with columns for cohort, N, p-value, and effect (mg/dL); x-axis: % change (-20 to 40). Factors shown: 1,2,3,4,7,8-hxcdf; trans-b-carotene; cis-b-carotene; retinol; retinyl palmitate; a-tocopherol; g-tocopherol; PCB74; PCB170; oxychlordane; trans-nonachlor; enterolactone.]
Figure 19. Forest plot for validated environmental factors associated with LDL-C. See Figure 18.
[Figure 19 plot: forest plot with columns for cohort, N, p-value, and effect (mg/dL); x-axis: % change (0 to 20). Factors shown: trans-b-carotene; cis-b-carotene; b-cryptoxanthin; combined lutein/zeaxanthin; trans-lycopene; retinyl palmitate; a-tocopherol; g-tocopherol.]
Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C. See Figure 18.
[Figure 20 plot: forest plot with columns for cohort, N, p-value, and effect (mg/dL); x-axis: % change (-6 to 6). Factors shown: cotinine; mercury, total; 2-fluorene; 3-fluorene; combined lutein/zeaxanthin; cis-b-carotene; iron, frozen serum; retinyl stearate; folate, serum; vitamin C; vitamin D; g-tocopherol; heptachlor epoxide.]
Vitamins A and E: unfavorable association with lipid levels
For all three lipids, we found a consistent association for lipid-soluble,
anti-oxidant vitamins, such as vitamins A and E and the carotenoids (Figure 17,
Figure 18, Figure 19, Figure 20). For example, a form of vitamin A, retinol, was
positively associated with triglycerides (p=6×10⁻²¹, effect=10% or 25 mg/dL
higher triglycerides per 1 SD) in all cohorts examined. Another form of
vitamin A, retinyl palmitate, was also positively associated with triglycerides
(p=6×10⁻²¹, effect=10%) and LDL-C (p=4×10⁻¹³, effect=5% or 6 mg/dL).
Retinyl stearate was negatively associated with HDL-C (p=4×10⁻⁵, effect=-3%
or -1 mg/dL). Retinol is the functional form of vitamin A produced in the
body from β-carotene and is a co-factor in biological processes associated with
vision and gene transcription [245]. Retinyl palmitate and stearate are
animal- and supplement-sourced vitamin A esters stored in the liver [245].
We observed a consistent association between forms of vitamin E (α- and
γ-tocopherol) and lipid levels. α-tocopherol strongly correlated with higher
triglyceride and LDL-C levels (effect=35% (p=8×10⁻²⁰) and 16% (p=7×10⁻¹⁹),
or 67 and 16 mg/dL, respectively). γ-tocopherol was also correlated with
higher triglycerides (effect=17% higher, p=10⁻¹⁷) and LDL-C (6% higher,
p=3×10⁻¹⁴) levels, but also with lower HDL-C (effect=-2%, p=6×10⁻⁶).
Vitamin E is consumed via vegetables, nuts, oils, and supplements.
Tocopherols are highly lipophilic and their absorption is enhanced by
triglycerides.
Carotenoids: favorable association with HDL-C and triglycerides and
unfavorable association with LDL-C
Both isomers of β-carotene, cis- and trans-, were associated with lower
triglyceride levels (p=10⁻⁶, effect=-7% or 12 mg/dL; p=10⁻⁸, effect=-10% or 16
mg/dL, respectively). However, both isomers of carotene, in addition to other
carotenoids such as β-cryptoxanthin and lycopene, were consistently associated
with higher levels of both HDL-C and LDL-C. The effect was 5% (p=3×10⁻¹²)
and 6% (p=5×10⁻¹¹) for HDL-C and LDL-C levels respectively for
cis-β-carotene and 3% (p=10⁻¹⁰) and 12% (p=8×10⁻¹⁷) for lycopene. Carotenoids
are primarily sourced from consumption of fruits and leafy vegetables [246];
β- and α-carotene (but not lycopene) are vitamin A precursors [245, 246].
Favorable lipid correlations with vitamins B, C, D, iron, mercury, and
enterolactone
We found serum levels of folate (a B vitamin), vitamin C, vitamin D, iron, and
mercury to be favorably associated with HDL-C (Figure 20). Effect sizes of
vitamin, iron, and mercury levels on HDL-C were similar, ranging from 3 to 4%
(1-2 mg/dL) higher HDL-C (p<0.002). Last, we found enterolactone, a product of
lignan metabolism in the intestine, to be associated with 10% (17 mg/dL) lower
triglyceride levels (p=2×10⁻⁷, Figure 18).
Persistent pollutants: unfavorable association with triglycerides and HDL-C
Polychlorinated biphenyls (PCBs), dibenzofurans, and organochlorine
pesticides, all persistent organic pollutants, were unfavorably associated with
both triglyceride and HDL-C levels (Figure 18, Figure 20). Seven PCB factors
were tentatively validated, and the most significant congeners, PCB74 and
PCB170, were associated with 15% (p=10⁻⁶) and 19% (p=4×10⁻⁶) higher
triglyceride levels. Five organochlorine factors were tentatively validated,
among which oxychlordane and trans-nonachlor were linked to 29%
and 30% higher triglyceride levels (p=5×10⁻⁹ and p=1×10⁻⁸, respectively).
Another organochlorine pesticide, heptachlor epoxide, was associated with 3%
lower HDL-C (p=0.006). While use of these compounds is banned, they are known
to persist and accumulate due to their stability and lipophilicity.
Markers for air pollution and nicotine: unfavorable association with HDL-C
Several markers of air pollution and nicotine exposure were unfavorably
associated with HDL-C (Figure 20). The polyaromatic hydrocarbon markers
of fluorene, 3-hydroxyfluorene and 2-hydroxyfluorene, were associated with
3% lower HDL-C (p=0.006 and p=0.004). Cotinine, a serum biomarker for
nicotine, was also associated with 3% lower HDL-C (p=2×10⁻⁶).
Polyaromatic compounds are formed as a result of burning of
hydrocarbon-based substances, such as tobacco, coal, gas, oil, and meats.
Sensitivity analyses with further adjustments
For most questionnaire variable adjustments, we did not see a sizable
difference in estimated coefficients or p-values for the environmental factors
(Figure 21, Figure 22, Figure 23), including questionnaire items regarding
self-reported cardiovascular-related disease status and use of drugs. Interestingly, in
most exceptions, adjustments increased the effect size of the environmental
factor. For example, after adjusting for vitamin and supplement intake, the
associations between γ- and α-tocopherol and triglyceride and LDL-C levels
became stronger. Similarly, adjustment for total fiber intake also strengthened
the association of β-carotenes and cryptoxanthin with LDL-C. The association
of cotinine, 3-, and 2-fluorene with HDL-C strengthened after adjustment for
alcohol intake. Adjusting for any fish and shellfish consumption strengthened
the association between pollutants and triglycerides. Adjustment for fish and
shellfish consumption strengthened the association between retinyl stearate and
HDL-C and triglyceride levels. Conversely, the effect of vitamin C and folate
in relation to HDL-C decreased when taking supplement count, total fiber
intake, and physical activity into account. Adjusting for supplement count
decreased the effect of γ-tocopherol on HDL-C.
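The comparison underlying these sensitivity analyses can be sketched as follows, using the quantity plotted in Figures 21 to 23, 100*(extended-original)/original, and the 10% annotation threshold mentioned in the Figure 21 caption. This is an illustrative sketch only: the function names and all coefficient values are hypothetical, and applying the threshold to the absolute change is an assumption.

```python
def percent_change(original, extended):
    """Relative change in the coefficient after adding a covariate."""
    return 100.0 * (extended - original) / original


def flag_sensitive(estimates, threshold=10.0):
    """Return the adjustments whose coefficient moved by more than
    `threshold` percent (in absolute value) from the original model.

    estimates: dict mapping adjustment name -> (original_beta, extended_beta)
    """
    flagged = {}
    for adjustment, (original, extended) in estimates.items():
        change = percent_change(original, extended)
        if abs(change) > threshold:
            flagged[adjustment] = change
    return flagged


# Hypothetical coefficients for one environmental factor.
example = {
    "count": (0.050, 0.058),              # strengthened by about 16%
    "any_fish": (0.050, 0.051),           # essentially unchanged
    "physical_activity": (0.050, 0.041),  # attenuated by about 18%
}
flagged = flag_sensitive(example)
```

A positive value indicates that the adjustment strengthened the estimate and a negative value that it attenuated it, provided the original and extended coefficients share the same sign.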
Figure 21. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(triglycerides). “Original” estimates were adjusted for sex, age, age-squared, SES, ethnicity, BMI, waist circumference, blood pressure, T2D status (fasting blood glucose >= 125 mg/dL), and cohort. “Extended” estimates were adjusted for the same co-variates in the original model (age, age-squared, sex, ethnicity, SES, BMI, waist circumference, blood pressure, diabetes, cohort), in addition to questionnaire items added sequentially. For points annotated as “cardiovascular” (red diamond), the extended estimates were adjusted for the same co-variates in the original model in addition to “count”, “physical_activity”, “any_heart_disease”, and “cholesterol_lowering” simultaneously. The estimates of βfactor that differed by more than 10% from the original estimate upon adding the extra co-variate are annotated in color. P-values for the “extended” βfactor are shown to the left of the point. Legend abbreviations: TLZ: total lutein/zeaxanthin; TATOC: total tocopherol; TBCAR: total β-carotene; any_shellfish: any shellfish consumed in past 30 days; any_fish: any fish consumed in past 30 days; count: total supplements used in past 30 days; physical_activity: total physical activity in metabolic equivalents in past 30 days.
[Figure 21 plot: x-axis: 100*(extended-original)/original (-15 to 20), with p-values of the extended estimates printed beside each point. Factors shown: 1,2,3,4,7,8-hxcdf; a-carotene; trans-b-carotene; cis-b-carotene; retinyl palmitate; retinyl stearate; retinol; g-tocopherol; a-tocopherol; PCB199; PCB74; PCB99; PCB156; PCB170; PCB196 & 203; PCB206; beta-hexachlorocyclohexane; dieldrin; heptachlor epoxide; oxychlordane; trans-nonachlor; enterolactone. Legend entries: TLZ, cardiovascular, TATOC, TBCAR, any_shellfish, any_fish, count, physical_activity.]
Figure 22. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease.
[Figure 22 plot: x-axis: 100*(extended-original)/original (-5 to 15), with p-values of the extended estimates printed beside each point. Factors shown: trans-b-carotene; cis-b-carotene; b-cryptoxanthin; combined lutein/zeaxanthin; trans-lycopene; retinyl palmitate; g-tocopherol; a-tocopherol. Legend entries: TFIBE, TVC, TCRYP, count, cardiovascular.]
Figure 23. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(HDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease.; TALCO: total alcohol; TMAGN: total magnesium; TFF: total food folate; any_fish: any fish consumed in past 30 days; any_shellfish: any shellfish consumed in past 30 days; count: total supplement use past 30 days; physical activity: total physical activity in past 30 days; TPOTA: total potassium.
[Figure 23 plot: x-axis: 100*(extended-original)/original (-40 to 20), with p-values of the extended estimates printed beside each point. Factors shown: cotinine; mercury, total; 3-fluorene; 2-fluorene; a-carotene; trans-b-carotene; cis-b-carotene; b-cryptoxanthin; combined lutein/zeaxanthin; trans-lycopene; iron, frozen serum; retinyl stearate; folate, serum; vitamin C; vitamin D; g-tocopherol; heptachlor epoxide. Legend entries: TFIBE, cardiovascular, TALCO, TMAGN, TFF, any_fish, any_shellfish, count, physical_activity, TPOTA.]
Simultaneous adjustment for self-reported cardiovascular-related disease,
supplement count, lipid-lowering drugs, and physical activity strengthened the
association between tocopherols and pollutant factors and triglycerides, while
attenuating the association to α-carotene (Figure 21). For HDL-C levels,
effects of cotinine, mercury, 3- and 2-fluorene, folate, vitamin C, vitamin D,
and γ-tocopherol were all attenuated by more than 15% (Figure 23). However, the
direction and significance of the effects were preserved throughout.
Correlation patterns
Evaluation of Pearson correlations showed a dense correlation pattern for
triglycerides (Figure 24) and more sporadic strong correlations between
various factors for HDL-C and LDL-C (Figure 25, Figure 26). As expected,
we observed strong correlations among closely related factors, such as between
PCBs (ρ > 0.6) or carotenoids (ρ > 0.5), and even within factor classes, such as
organochlorine pesticides (ρ > 0.3). Of note, the hydrocarbon factors 2- and
3-hydroxyfluorene were highly correlated with cotinine (ρ=0.6 and 0.7). The
baseline demographics were not strongly related (e.g., ρ > 0.5) with any of the
environmental factors, with the exception of age, which showed strong
associations with many of them.
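The construction of such a correlation network can be sketched as below, assuming pairwise Pearson correlations and the display rule given in the caption of Figure 24 (edges drawn only for correlations above 0.2 or below -0.2, red for positive and blue for negative). The factor names and data values are toy placeholders, not NHANES measurements, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(columns, threshold=0.2):
    """List (factor_a, factor_b, r, color) for pairs with |r| > threshold."""
    edges = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            edges.append((a, b, r, "red" if r > 0 else "blue"))
    return edges


# Toy data: three strongly related factors and one unrelated factor.
toy = {
    "factor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "factor_b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "factor_c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "factor_d": [1.0, -1.0, 0.0, -1.0, 1.0],
}
edges = correlation_edges(toy)
```

In a drawing step, the absolute value of r would set the line thickness, so that only the list of edges, and not the full correlation matrix, needs to be passed to the plotting code.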
Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides. Each node corresponds to a validated environmental (in color of environmental class, see Figure 15) or demographic/clinical risk factor (in white). Correlations > 0.2 and < -0.2 are shown with line thickness proportional to the absolute value of correlation. Line color corresponds to the sign of correlation (positive=red, negative=blue). To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.
Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.
Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C. To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.
EWAS on Serum Lipids: Conclusions
In the current application, our findings reveal complex relationships between
serum lipid levels and fat-soluble antioxidant vitamins. Randomized studies
and meta-analyses [43, 247-250] have shown these vitamins to have no benefit
or even to confer harm when given in high doses, in contrast to previous
favorable associations in observational studies [251, 252]. The unfavorable
lipid profile that we observed with vitamin E forms is possibly consistent with
the randomized evidence on hard clinical outcomes and we also found an
unfavorable lipid profile for vitamin A forms. Carotenoids have a mixed effect,
improving triglycerides and HDL-C, but worsening LDL-C.
These associations may reflect a complex web of physiological correlation or
even reverse causality. For example, α-tocopherol and carotenes are
transported in serum with HDL and LDL [246, 253, 254] and accurate
measurement of serum α-tocopherol is dependent on serum lipids [255]. In this
regard, the strong association between α-tocopherol and LDL-C and
triglycerides might be considered a true positive result. On the other hand,
given the lack of evidence for γ-tocopherol or retinol associating with
lipoprotein complexes, their association might be due to reverse causality, or
increased anti-oxidant consumption among those who know about their
adverse lipid level profile. Nevertheless, given that vitamin E consumption has
been found to increase mortality in meta-analysis [43], the large effect sizes
suggest that prospective studies may be scrutinized for any potentially adverse
effects of vitamin E on lipid levels and other metabolic disorders, such as
T2D.
We observed an association of vitamins B (folate), C, and D, mercury, and iron
with higher HDL-C levels. Folate [256] and vitamin D [257] have previously
been associated with higher HDL-C. Fish, a source of cardioprotective omega-3
fatty acids, are also a large source of mercury [258]; however, we did not
observe a large change in effect size of mercury when accounting for
consumption of fish. These nutrients and metals may be to some extent
surrogate markers of “healthy diet” behaviors; however, what exactly
constitutes a “healthy diet” is currently very difficult to define, in contrast to
earlier claims [259, 260]. The strength of association with HDL-C is similar
across these dietary markers, ranging from 1-3 mg/dL for a standardized
change per factor. These are small effects and it is unclear whether
cumulatively they could have a much larger impact in raising HDL-C level,
given the correlations between these markers.
We also identified enterolactone to be strongly associated with favorable
triglyceride levels in this study. Enterolactone is a metabolite of lignans,
which are found in foods such as flaxseed, and have been associated with
favorable cholesterol profiles in this form [261, 262]. Again, it is unclear what
role, if any, this marker plays as a surrogate of “healthy diets” and effects on
heart disease have been inconsistent [263].
We found biomarkers of hydrocarbons, 2- and 3-hydroxyfluorene, to be
strongly associated with unfavorable HDL-C levels. While others have shown
the association of these metabolites with self-reported cardiovascular disease in
the NHANES data [264], to our knowledge the association with HDL-C is
novel. Relatedly, we also found a marker of nicotine, cotinine, to have a
similar association with HDL-C. Particulate matter air pollution, composed of
many types of hydrocarbons, and smoking have long been major concerns for
cardiovascular-related diseases [236, 265, 266]. Smoking is well known to
influence HDL-C levels [267, 268] and acute and chronic exposures to tobacco
smoke have been shown to decrease HDL-C substantially [269]. The high
correlation of the hydrocarbon markers to cotinine suggests that these
associations might all indicate exposure to cigarette smoke.
We also found that persistent organic pollutants, such as organochlorine
pesticides and polychlorinated biphenyls, were associated with large increases
in triglycerides and large decreases in HDL-C. These compounds have been
implicated in other metabolic-related diseases and other populations. For
example, PCB170 and heptachlor epoxide were associated with T2D in our
previous EWAS on T2D. Similarly, PCBs and dibenzofurans have been
associated with metabolic syndrome in a Japanese population [270] and have
been claimed to have an “obesogenic” effect [271].
We should acknowledge that these associations might be confounded due to
the fat solubility of these pollutants. Nevertheless, there have been efforts to
characterize this relationship. For example, in a study analytically considering
causal pathways and confounding bias via structural equation modeling, the
investigators found a relationship between polychlorinated biphenyls and lipid
levels consistent with forward causality for a native population with high
exposure to these pollutants in upstate New York [272]. Another study found
an ecological relationship between cardiovascular-related hospitalization rates
and residence in areas close to PCB pollution [273]. Further, higher incidence of
cardiovascular disease was observed in an occupational Swedish cohort [274].
Nevertheless, and notably, these studies took place in populations in which the
source of exposure was known and dosages were much higher than the general
population levels seen in NHANES.
Finally, identifying specific heritable components through GWAS has proven
difficult: one recent study attributed 10-12% of variability of the lipid levels to
95 genetic loci in a sample of >100,000 individuals [65], and each genetic
factor explained less than 0.5% of the variance. By comparison, each of the
validated environmental factors explained a larger portion of the variance in
lipid levels, occasionally exceeding 10%; however, unlike for the genetic
variants, reverse causality cannot be excluded.
DISCUSSION
By combing through all environmental exposure measures using a systematic
EWAS approach, we have found multiple novel environmental factors
associated with type 2 diabetes and lipid levels beyond the level of false
discovery. The method is general enough to apply to diverse datasets, and, in
fact, collaborators have begun utilizing EWAS to study blood pressure and
kidney disease. The EWAS approach bypasses the problem of selectively
testing and reporting one or a few associations at a time that has been
repeatedly debated as a source of biased results and false positives in
epidemiological studies [10, 42, 179, 275, 276]. In its current form, EWAS
offers a new way to generate a comprehensive list of associations that have
robust support after multiple comparisons, a practice not followed in
environmental epidemiology currently. The ensuing associations are then
carefully scrutinized for their validity in sensitivity analyses. Adjustments for
potential confounders are also systematic and correlation patterns between
variables are evaluated and visualized.
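To illustrate what controlling the false discovery rate across many simultaneous tests involves, the sketch below applies the Benjamini-Hochberg step-up procedure at the 10% level used as the significance line in Figure 17. This is a familiar stand-in chosen for illustration; the FDR estimation procedure actually used in this work is the one described in Chapter 3 and may differ. The p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up procedure at the given FDR level."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k such that p_(k) <= fdr * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five environmental factors.
toy_p = [0.001, 0.2, 0.03, 0.04, 0.5]
significant = benjamini_hochberg(toy_p, fdr=0.10)
```

The point of the design is that the threshold applied to each p-value depends on its rank among all tests performed, so that reporting a factor as significant accounts for every other factor that was screened.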
Like GWAS, the EWAS framework can be used to propose targets for further
study. For example, many factors are correlated; some are similar structurally,
such as the isomers of β-carotene, or show dependent patterns of exposure
environment, such as the PCBs and organochlorine pesticides or the serum
markers for cotinine and urine markers for hydrocarbons. As we extend the
GWAS analogy, and provide a precise definition of the envirome (Chapter 1,
3), these and other environmental factors could be said to be in “linkage
disequilibrium” with each other. Just as is done for preliminary GWAS
findings, EWAS findings can and should be used to identify further factors that
may be in “disequilibrium,” for further detailed measurement and causal
identification.
EWAS allows for comprehensive and systematic analysis of the effects of the
environment in association to disease on a broad scale. While many
investigators have already utilized the NHANES to address the effect of a
limited number of factors on disease, these studies do not provide a global
view of these associations [277, 278]. Further, the previous studies use
differing definitions of disease status (such as a medical questionnaire) and
exposure coding (discretization vs. log transformation), and lack methods for
multiple comparison control [279-281]. It is the well-established toolkit of the GWAS
that has provided us with methods to overcome these limitations and to enable
us to postulate about environment-wide association to disease.
Limitations of these studies remind us that measuring environment-wide
aspects in relation to phenotypic states such as disease will be a difficult
undertaking [10]. While the NHANES provides a large number of factors to
study, a comprehensive assessment will require precise definition over a
broader dimension (including more factors). While laboratory measurements
are collected during a baseline fasting state for all participants in NHANES, we
will still have to account for the dynamic and heterogeneous nature of different
exposures and their associated responses by taking replicate measurements at
different physiological states.
Furthermore, the observed cross-sectional correlations in the EWAS setting do
not offer proof for causality (Chapter 3). While we attempted to check the
validity of our estimates by systematically adjusting for known, self-reported
confounders, residual confounding and confounding from unmeasured
variables cannot be ruled out, and reverse causality always remains a possibility
for findings from cross-sectional data. We have also shown how to
systematically evaluate the correlation pattern between known and novel
environmental correlates of lipids to communicate the complex
inter-relationships between these variables. We hypothesize that this approach
would be helpful in designing future studies, such as randomized trials, that may
try to intervene at one or more nodes in the correlation globes.
To more formally ascertain causality, we would need to perform prospective
EWAS over the life course, consider incident cases, consider randomization
methods [69], or even evaluate gene-environment interactions as additional
validation (Chapters 3 and 5). Due to the number of hypotheses generated, we
would need to integrate more evidence from large-scale collaborative studies
in order to confirm (or refute) etiological aspects of these factors while being
as comprehensive as possible in the observation of potential confounding
variables. For example, additional factors such as behavior (food consumption,
drug use, and/or exercise patterns), geographic location, and occupation must
also be ascertained to account for associated risk factors and reverse causality.
The measurement of 300 environmental factors is hardly a comprehensive
study of the environment, but this is still a greater number of factors measured
than the 30 microsatellite markers [282], or 100 single nucleotide
polymorphisms (SNPs) in some of the earliest implementations of GWAS
[283]. We suggest that measurement technologies for the environment can and
will improve in resolution, as novel associations are made using even few
measurements in these prototype studies. Measurement of the panel of
environmental factors used here, most of which are performed by mass
spectrometry, currently costs an estimated $40,000 per individual [284], or
close to the current pricing for candidate SNPs and copy number variation
sequencing.
Another type of hypothesis we may generate is regarding the complex cause of
disease. For example, we can now use an EWAS to hypothesize about “gene-
environment” interactions and their relation to disease etiology. In the next
chapter, we address how to screen for gene-environment interactions through
integration of GWAS and EWAS, where genetic variability is assessed
simultaneously along with robustly identified environmental factors. As will
be seen, while resource intensive, this type of study design could perhaps
facilitate an explanation of disease causation that has eluded genome-wide
scans, provide additional validity for the marginal effect of exposure, and
enable more accurate estimates of risk [32].
CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN
THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY
SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN
ASSOCIATION TO TYPE 2 DIABETES.
INTRODUCTION
In previous chapters, we focused on comprehensive and agnostic methods to
attain robust environmental associations with disease on a population scale,
known as EWAS (Chapter 3), and we applied the method to find factors
robustly associated with Type 2 Diabetes and serum lipid levels (Chapter 4).
It is hypothesized that both multiple genetic and environmental factors interact
to induce complex disease. Genome-wide association studies (GWAS) have
led to the discovery of many common variants associated with disease [9, 203];
however, each of these common genetic variants confers very modest risk, and
cumulatively they explain only a limited portion of the disease variance [32]. It is
hypothesized that some of the yet unexplained risk may be accounted for by
“gene-environment interactions”, or that the joint effect of genetics and of the
environment may be different than the marginal effects of each of these two
factors alone [32, 44, 47, 285, 286].
In the following, we introduce a method for screening for gene-environment
interactions between prevalent environmental factors found in EWAS and
common variants found in GWAS in application to T2D, overcoming a few
outstanding challenges in the field. Before describing these challenges, the
method, and its application, we first define gene-environment interactions and
report some examples in the context of disease.
Background
A classic example of a gene-environment interaction involves the disease
phenylketonuria (PKU) [287]. Those with PKU have inherited a rare genetic
variant that codes for a deficient phenylalanine hydroxylase liver enzyme and
are unable to convert the amino acid compound phenylalanine from their diets
to another amino acid, tyrosine. In the presence of both the deficient enzyme
and phenylalanine, an intermediate compound accumulates, phenylketone,
leading to mental retardation. However, even with the rare genetic variant
coding for the deficient protein, controlling phenylalanine exposure mitigates
adverse phenotypes.
The study of gene-environment interactions is akin to “pharmacogenetics” [81,
288], which relates genetic differences and variability of molecular responses to
drugs. In fact, the term “ecogenetics” (the study of gene-environment
interactions [289]) discriminates environmental responses from drug
responses. Over 8 decades ago Archibald Garrod undertook initial studies
regarding genetics and metabolic-focused response to environmental chemicals,
observing “inborn errors of metabolism” in adverse phenotypes such as
alkaptonuria [290]. In 1931, Garrod further described adverse phenotypes that
occur only in certain individuals, “…substances contained in particular foods,
certain drugs, and exhalations of animals or plants produce in some people
effects wholly out of proportion to any which they bring about in average
individuals (sic)” [291]. This observation was the first description of the
classic pharmacogenetic “responder” vs. “non-responder” phenotypes that would
come to dominate the field. Later, Motulsky set the stage for pharmacogenetics (and later
ecogenetics) in which he described the adverse response to drugs as an
environmental and dose-dependent “trigger” for genetically susceptible
individuals [292].
The interaction of human genetic variants and specific environmental factors
can be assessed through population-based studies, in which the presence of
both a genetic variant and factor is associated with a disease phenotype [289,
293]. In statistical models, the hypothesis of a joint effect is tested against the
marginal association between each of the factors alone and the phenotype, as
discussed below. However, as epidemiologists and toxicologists alike
would note, population-based statistical interaction does not mean biological or
molecular interaction [34, 294, 295]. Nevertheless, presence of a statistical
interaction enables us to hypothesize about underlying biological processes
that occur between genes and environmental factors.
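One common way to write such a test, given here only as an illustrative sketch and not necessarily the exact model used later in this chapter, is a logistic regression with a product term. The symbols D (disease status), G (genotype coding), E (exposure level), and the β coefficients are notation assumed for this illustration.

```latex
\[
\operatorname{logit} \Pr(D = 1 \mid G, E)
  = \beta_0 + \beta_G\, G + \beta_E\, E + \beta_{GE}\,(G \times E),
\qquad H_0 : \beta_{GE} = 0 .
\]
```

Under the null hypothesis the model reduces to the marginal (main-effect) terms alone. A nonzero β_GE indicates a departure from additivity on the log-odds scale, which is a statistical statement and not by itself evidence of a molecular interaction.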
Most documented gene-environment interactions between genetic variants and
chemical factors come in the context of genes that control metabolic processes,
or “pharmacokinetic” genes, such as the direct conversion of chemical factors
to products for use or excretion. Often these interaction studies occur amongst
a finite set of diseases, most notably cancer. A famous example includes the
product of CYP1A1, which oxidizes polyaromatic hydrocarbons. Early
hypotheses surrounded the metabolic efficiency of different variants of this
gene and hydrocarbons in lung cancer [296]. Molecular processes involving
acetylation carried out by the N-acetyltransferase and associated proteins
(NATs) have received the most attention in relation to variable host responses.
For example, altered function due to NAT variants and exposure to cigarette
smoke in the context of colon and bladder cancer has been well studied [297,
298], gathering robust evidence in GWAS [45].
Evidence for interactions of specific factors has been less strong for T2D. First,
the environment is often attributed to abstract factors such as “lifestyle” or a
proxy for a collection of environmental influences, such as body mass index
[299-302], but there are exceptions, such as the interaction hypothesized between
variants in PPARG (Pro12Ala) and dietary fat intake [303]. More surprisingly,
there are few examples of interaction between the strongest hits from GWAS
for T2D and environmental factors, such as between rs7903146 (TCF7L2) and
dietary carbohydrate, although the strength of association was weak to
moderate [304]. What is needed is a method to screen a space of possible
interactions to prioritize further study.
Screening for Gene-Environment Interactions: “G-EWAS”
Despite the hypothesis that gene-environment interactions play a role in
diseases of multifactorial nature such as T2D, there is an absence of
documented evidence for specific gene-environment interactions for the
disease. Investigating gene-environment interactions is a challenging
undertaking. First, analyzing gene-environment interactions is a complex and
power-intensive exercise [47]. Second, traditionally, most population-based
epidemiological studies examine either only genetic risk factors or only
environmental risk factors; there is a smaller set of studies that capture
information comprehensively on both genetics and the environment. It is quite
rare for significant numbers of genotypes and environmental factors to be
measured concurrently. Even still, there is another practical challenge that we
propose to address using comprehensive analytic techniques introduced in
Chapter 3: the selection of which candidate factors to measure in the first place.
The outstanding practical challenge we address here revolves around the
domain of factors: which of the millions of genetic variants or thousands of
environmental factors do we choose to measure jointly? Often, genetic
variants and environmental factors are selected by convenience, without
sufficient documentation of the strength of their marginal associations. It is
possible that given the complexity of gene-environment interaction analyses
[47], there is a problem with selective analyses and selective reporting of only
some of the results from each study in a fragmented and possibly biased
fashion [42, 305]. Many studies do not account for the multiplicity of all the
interaction effects that they have explored. There is a need to select common
variants and exposures resulting from comprehensive studies, and in turn,
systematically screen their interactions to avoid the spurious results seen in
many candidate-driven investigations [299, 301, 306].
Instead of the traditional candidate locus and environmental factor approach,
one new way forward would be to screen a set of gene and environmental
factors and use the “best hits” as candidates for further study and validation.
To construct such a screen, we propose utilizing factors arising from
comprehensive and systematic studies that have resulted in robust and
replicated associations with disease of interest. For example, much has been
written about using genome-wide association studies (GWAS) to find common
genetic variants associated with complex disease [203]. We recently published
an analogous approach for finding associated environmental factors, called an
environment-wide association study (EWAS) [35].
We propose here a systematic approach to select and test gene-environment
interactions in association to a common disease such as T2D, specifically
testing the interaction between robust factors found in GWAS and EWAS. We
are able to conduct this study because of the specific nature of the Centers for
Disease Control (CDC) National Health and Nutrition Examination Survey
(NHANES) [37], which we introduced earlier in Chapter 4. To recap, the
survey includes 261 genotyped loci, more than 50 environmental factors
measured in blood and urine (e.g. nutrients, vitamins, and pollutants), and
clinical biomarkers (such as fasting blood glucose) for the same individuals.
Focusing on the top GWAS and EWAS hits on T2D, we systematically
investigate variant-environment interactions in association with the disease on
these cohorts, creating hypotheses for further investigation.
METHODS
Figure 27 schematically shows the systematic approach for testing gene-
environment interactions. The analysis of interactions is conducted in a dataset
that has available measurements for both genetic variants and environmental
factors. We select genetic variants and environmental factors that have strong
evidence of association for their marginal effects.
For genetic associations, the current paradigm of GWAS has provided the
framework to assemble robustly replicated sets of common genetic variants
with previously documented genome-wide significance (p < 5 x 10^-8 [8]). For
environmental associations, we conducted EWAS to comprehensively search
for and validate prevalent environmental factors in association to T2D
(Chapter 4). For environmental variables, there is less strong consensus on
what are robust enough standards of replication [307] and it should be
acknowledged that, in contrast to genetics, reverse causality cannot be easily
excluded. Here we selected environmental exposures that have shown
significant associations in at least two (and up to 4) independent cohorts after
accounting for the multiplicity of analyses and after adjusting for demographic
factors.
First, we examined the marginal effects of each of the “G” number of genetic
variants or “E” number of environmental factors on T2D separately. Second,
we computed the association between each environmental factor and variant
pair (total of E x G tests) to ascertain the degree of their dependence. In our
main screen, each environmental factor and variant pair (total of E x G tests) is
tested for interaction while adjusting for other known risk factors (Figure 27B).
Finally, multiplicity of analyses is accounted for with both Bonferroni-adjusted
p-values and false discovery rate (FDR) estimation (Figure 27C).
Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D. A.) Genetic and environmental factors are chosen by their strength of marginal association in GWAS and EWAS, B.) Each genetic variant and exposure pair is tested for interaction in association to disease (example shown: γ-tocopherol and the variant rs13266634 in association to T2D (fasting blood glucose [FBG] > 125 mg/dL)) in a logistic regression model adjusting for other risk factors and main effects of exposure and variant, C.) Multiple hypotheses are controlled for using a modified Bonferroni correction and the FDR is estimated. The correction factor for the multiplicity adjustment is the number of estimated independent tests conducted, and the empirical false positive rate for FDR estimation is estimated through a parametric bootstrap approach. D.) Sensitivity analyses are conducted, restricting the samples analyzed to a Caucasian-only subgroup and an over-age-40 subgroup.
Data and selected genetic and environmental factors
We used data from the National Health and Nutrition Examination Survey
(NHANES) described in Chapter 4 [37]. On the genetics side, we considered
18 genetic variants that have been previously documented through GWAS to
have robust association (reaching genome-wide significance) with T2D. These
variants have been assayed among consenting individuals in two NHANES
surveys, those conducted in 1999-2000 and 2001-2002. A total of 8000
individuals from these cohorts had both consented use of their DNA for
research and had blood samples available for genetic testing. Genetic variants
were chosen a priori by different groups of independent researchers
investigating other research topics. We computed allele frequencies of each
variant stratified on self-report race to confirm their presence. In NHANES,
this was coded in 5 groups (Mexican American, Non-Hispanic Black, Non-
Hispanic White, Other Hispanic, Other).
On the environment side, we previously identified and tentatively validated 5
environmental factors associated with T2D, including trans-β-carotene, cis-β-
carotene, γ-tocopherol, heptachlor epoxide, and PCB170 after systematically
screening 266 environmental factors measured by blood or urine tests (Chapter
4) [10]. To recap, the false-discovery rate for each of these 5 associations was
less than 10% in at least 2 independent cohorts and the overall FDR, assessed
by considering all combinations of attaining significance in more than 1 cohort,
was less than 1% for all 5 factors.
T2D cases are defined as individuals who had 8.5-hour fasting blood glucose
(FBG) greater than or equal to 126 mg/dL as advised by the American
Diabetes Association (ADA), similar to our EWAS on T2D (Chapter 4). To
increase our power for detection of interaction effects, we combined data from
the two cohorts. Depending on the genetic and environmental variables tested
for interaction, there were a total of 921 to 2924 controls and 82 to 278 cases.
Each genetic variable was coded for the number of risk alleles as designated by
the papers from which they were found [23, 308]. Environmental factors were
log-transformed and standardized (expressed in standard deviation units).
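The variable coding described above can be sketched as follows. This is a minimal, unweighted illustration: the function names, the sample measurements, and the genotype strings are hypothetical, and the log base is assumed to be 10 (the base does not affect the standardized values).

```python
import math
import statistics

def standardize_log(values):
    """Log-transform positive exposure values, then express them in SD units."""
    logged = [math.log10(v) for v in values]  # assumption: base 10
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (unweighted)
    return [(x - mean) / sd for x in logged]

def count_risk_alleles(genotype, risk_allele):
    """Code a genotype such as "CT" as the number (0, 1, or 2) of risk alleles."""
    return sum(1 for allele in genotype if allele == risk_allele)

# Hypothetical serum measurements and genotypes, for illustration only.
exposure = [0.5, 1.0, 2.0, 4.0, 8.0]
z = standardize_log(exposure)

genotypes = ["CC", "CT", "TT"]
coded = [count_risk_alleles(g, "C") for g in genotypes]
```

After this coding, a one-unit change in an environmental variable corresponds to one standard deviation of the log-transformed exposure, which is how the per-SD odds ratios are expressed.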
Regression Analyses
We assessed the marginal effect between the genetic variant or environmental
factor on T2D with survey-weighted logistic regression, adjusting for self-
report race, age, sex, and BMI. Next, we ascertained whether genetic variants
might be correlated with levels of environmental factor. We evaluated the
correlation between genetic and environmental factors through survey-
weighted linear regression, regressing log base 10 of the environmental
exposure variable on each genetic variable, adjusted for self-report race, sex,
age, and BMI. We used 4-year survey weights corresponding to the smallest
subsample analyzed as advised by the National Center for Health Statistics
(NCHS) [207].
Next, we conducted our systematic interaction screen (Figure 27A-B).
Specifically, we screened the space of possible pairs of interactions, totaling 90
(18 genetic loci times 5 environmental factors). We utilized survey-weighted
logistic regression to associate each pair of factors to T2D incorporating a
multiplicative interaction term and main effects of both factors. Each model
was further adjusted by age, sex, self-reported race, and BMI. As above, we used
4-year survey weights corresponding to the smallest subsample analyzed, as
advised by the NCHS [207].
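In one plausible notation, the interaction model described above can be written as below. Here G is the number of risk alleles, E is the standardized log-transformed environmental factor, and Z collects the adjustment covariates (age, sex, self-reported race, and BMI). The symbols for the intercept, the covariate vector, and its coefficients are notational assumptions; the interaction coefficient follows the β GxE notation used later in this chapter.

```latex
\operatorname{logit} \Pr(\mathrm{T2D} = 1)
  = \beta_0 + \beta_G \, G + \beta_E \, E
  + \beta_{G \times E} \, (G \cdot E)
  + \boldsymbol{\gamma}^{\top} \mathbf{Z}
```

The screen tests the null hypothesis that the interaction coefficient equals zero for each of the 90 variant-factor pairs.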
Multiplicity Correction and FDR
We corrected for multiple hypotheses through direct Bonferroni correction of
the statistical significance level and FDR estimation (Figure 27C). Bonferroni
multiplicity correction adjusts the threshold for statistical significance (for
example, p=0.05) by the number of statistical tests conducted. Since our tests
are not independent, we estimated the total number of “effective” genetic loci
and environmental exposures tested jointly by taking into account the
correlation between the selected factors. This approach, which more
accurately estimates the number of hypotheses for a group of correlated factors,
has been applied previously to the study of genetic variants [309]. Here, we
expanded the use of the method for environmental factors. For the 18 genetic
loci, we calculated the correlation between the genetic factors stratified by
ethnicity and concluded that there were 17.7 effective genetic factors. For the
5 environmental factors, we calculated 4.41 effective factors. Thus, the total
number of effective tests was 78.1 (17.7 x 4.41). The adjusted level of
significance for a single test threshold of p=0.05 therefore is 0.0006
(0.05/78.1).
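The arithmetic behind the adjusted threshold can be checked with a short sketch. The effective factor counts (17.7 and 4.41) are taken from the text; the function name is hypothetical, and the estimation of the effective counts themselves [309] is not reproduced here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

m_genetic = 17.7   # effective number of genetic factors (from the text)
m_env = 4.41       # effective number of environmental factors (from the text)
m_effective = m_genetic * m_env  # about 78.1 effective tests

threshold = bonferroni_threshold(0.05, m_effective)

# For comparison: the threshold if all 90 tests were treated as independent.
naive_threshold = bonferroni_threshold(0.05, 18 * 5)
```

Because the effective number of tests is smaller than 90, the adjusted threshold is slightly less stringent than a correction that treats every pair as independent.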
We also calculated the FDR, the estimated proportion of false positives among
the total of significant hypotheses for a given significance level [238]. To
estimate the number of false positives, we empirically generated a distribution
of null test statistics corresponding to the interaction term while preserving the
main effects of the variant and exposure terms using a parametric bootstrap
method [310].
Briefly, a parametric bootstrap approach samples with replacement responses
from a model representing the “null” hypothesis many times to enable the
creation of a null distribution of test statistics corresponding to the interaction
term. To create the null distribution of test statistics corresponding to the
interaction term (β GxE), we fit a “null” logistic regression model omitting the
interaction term (β GxE = 0) while leaving in the model the parameters
modeling the main effects of the environmental factor, genetic variant, and
remaining covariates (age, sex, race, and BMI). We bootstrapped the
responses from this null model and refit the original model described above,
adding the covariate corresponding to the interaction between variant and
environmental factor. To estimate a null distribution, this procedure is
repeated 100 times. Finally, the FDR is estimated as the ratio of the number of
interaction results called significant in the null distribution to the number of
all results called significant (both true and false positives) at a given
significance level.
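The two computational pieces of this procedure, drawing responses from the null model and forming the FDR ratio, can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analyses: the function names are hypothetical, the model refitting step is omitted, and averaging the null counts over bootstrap replicates is an assumed way of forming the numerator.

```python
import random

def simulate_null_responses(fitted_probs, rng):
    """Draw one bootstrap set of case/control labels from null-model probabilities."""
    return [1 if rng.random() < p else 0 for p in fitted_probs]

def estimate_fdr(observed_pvalues, null_pvalue_replicates, alpha):
    """Empirical FDR at significance level alpha.

    observed_pvalues: interaction p-values from the real data, one per pair.
    null_pvalue_replicates: list of lists; each inner list holds the interaction
        p-values obtained by refitting the full model to one set of responses
        simulated from the null (no-interaction) model.
    """
    n_called = sum(p <= alpha for p in observed_pvalues)
    if n_called == 0:
        return 0.0
    null_counts = [sum(p <= alpha for p in rep) for rep in null_pvalue_replicates]
    expected_false = sum(null_counts) / len(null_counts)  # assumption: mean over replicates
    return min(1.0, expected_false / n_called)

# Toy numbers for illustration only (not results from this study).
observed = [0.001, 0.02, 0.3, 0.6]
null_reps = [[0.04, 0.5, 0.7, 0.9], [0.2, 0.3, 0.8, 0.6]]
fdr = estimate_fdr(observed, null_reps, alpha=0.05)
```

In the toy example, two observed results fall below the threshold and the null replicates yield on average 0.5 such results, giving an estimated FDR of 25%.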
All presented analyses include data from diverse ancestry and age groups, as
reflected in the US population that is sampled by NHANES. Given that the
stronger evidence for the specific T2D associations has been procured by
studies in Caucasian-descent individuals, we also performed a sensitivity
analysis limiting the data to participants who were coded as non-Hispanic
white and other Hispanic (Figure 27D). Next, as NHANES is cross-sectional
and presumably many genetically at risk participants might not be diagnosed at
time of sampling, we performed an analysis using an older set of individuals
(greater than 40 years of age).
BMI is a notable risk factor for T2D [311, 312]. As means of comparison, we
also sought to document any interaction between the 18 genetic variants and
BMI. To conduct these interaction tests, we standardized BMI by centering
the measurements about the mean and dividing by the standard deviation. As
above, we fit a survey-weighted logistic regression model, modeling diabetes
status as a function of the main effect of the variant (coded as number of risk
alleles), main effect of BMI, interaction term, in addition to sex, age, and self-
report race. We estimated greater than 90% power to detect interactions
between BMI and these genetic loci for interaction OR 1.5 at the 0.05 level of
significance [313].
For all analyses, we used SAS version 9.2 accessed through a Remote Data
Center (RDC) located in Hyattsville, Maryland. As NHANES is a complex,
multi-stage, stratified survey, we utilized survey sampling units, strata, and
weights for all analyses as in the previous chapter [207].
RESULTS
We implemented a systematic screen for detecting interaction effects between
18 genetic variants identified in GWAS on T2D and 5 environmental factors
identified in EWAS on T2D (Chapter 4) [35], a total of 90 interaction tests
(Figure 27A). We modeled T2D using logistic regression with a multiplicative
interaction term between each genetic variant and environmental factor pair
while adjusting for age, sex, self-report race, and body mass index (BMI)
(Figure 27B). We assessed multiple hypotheses using a Bonferroni and a false
discovery rate (FDR) approach (Figure 27C) and conducted sensitivity analysis
to assess the robustness of our results (Figure 27D). We begin by assessing
power, marginal associations, and correlation between genetic variants and
environmental factors.
Allele frequencies
We estimated the minor allele frequencies of each of the 18 variants in our two
US NHANES cohorts. For most of the loci, minor allele frequencies were
greater than 5% for all of the self-reported ethnicities. The only exception was
rs1801282, which showed a 3% minor allele frequency in self-reported Blacks.
This suggests that all the surveyed ethnicities had a reasonable frequency of
minor alleles at these loci for study.
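A minimal sketch of the allele frequency computation, stratified by group, is shown below. The genotype counts and group labels are hypothetical, the function name is an assumption, and survey weighting is ignored for simplicity.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from counts of AA, AB, and BB genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)

# Hypothetical genotype counts (AA, AB, BB) per self-reported group.
counts_by_group = {
    "Group 1": (940, 59, 1),
    "Group 2": (600, 340, 60),
}
maf_by_group = {
    group: minor_allele_frequency(*counts)
    for group, counts in counts_by_group.items()
}
```

A group whose estimate falls below a chosen cutoff, such as 5%, would be flagged in the way rs1801282 is flagged above.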
Power Calculations
In order to study the interactions between concurrently measured genotypes
and environmental factors in our two US NHANES cohorts, we determined
whether we had sufficient power to proceed. Power computations for
genotype-environment interactions depend on minor allele frequency for loci,
environmental factor variability, ratio of cases to controls, and marginal
associations to disease [313]. We estimated the minor allele frequencies from
our cohorts (5-44%), the exact ratio of cases and controls available for each
genotype-environment factor pair, assumed standardized environmental
variables (SD=1), and assumed a marginal OR of 1.2 and 2.0 for the genetic
loci and environmental factor (gathered from previous GWAS and EWAS)
respectively. Under these assumptions, we estimated 30-96%
(median=71%) and 63-99% (median=98%) power, with 82 to 278 cases (921 to
2924 controls), to detect interaction ORs of 1.5 and 2.0 respectively at a
significance threshold of α=0.05 [313] (Figure 28).
Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL) [313]. Assumptions include an interaction odds ratio of 1.5, a main effect of genetic locus of 1.2 and environmental factor 2.0 estimated from previous studies, minor allele frequencies as in Supplementary Table 1, approximately 10 controls per case, environmental factor SD of 1, and p-value of 0.05. Markers alternate between filled and open for each locus.
Marginal Associations
Three of 18 genetic variants were marginally associated with T2D in the
NHANES cohorts at significance level of 0.05 (uncorrected for multiple
hypotheses here, given that they have been previously robustly documented to
be associated with T2D) after adjustment for age, sex, race, and BMI. These
included loci rs10923931 (NOTCH2), rs7903146 (TCF7L2), and rs13266634
(SLC30A8) (Table 8). In addition to these genetic variants, we also list here
the five environmental factors we previously found associated strongly with
T2D in cohorts examined here (Table 8) [35].
[Figure 28 plot: power (y-axis, 0.3 to 0.9) for each of the 18 loci (x-axis), titled “T2D Power: p=0.05, OR=1.5”; legend: PCB170, trans-β-carotene, cis-β-carotene, γ-tocopherol, Heptachlor Epoxide.]
Locus (gene) or Environmental Factor   N (cases)    p-value     OR (95% CI)
rs10923931 (NOTCH2)                    3429 (297)   0.0043      1.50 (1.14, 1.98)
rs7903146 (TCF7L2)                     3401 (296)   0.015       1.32 (1.06, 1.65)
rs13266634 (SLC30A8)                   3427 (298)   0.018       1.33 (1.05, 1.69)
rs1260326 (GCKR)                       3408 (296)   0.13        1.27 (0.93, 1.73)
rs7901695 (TCF7L2)                     3402 (293)   0.13        1.22 (0.94, 1.57)
rs780094 (GCKR)                        3430 (298)   0.15        1.25 (0.92, 1.70)
rs4607103 (Unknown)                    3421 (298)   0.41        0.89 (0.68, 1.17)
rs2383208 (Unknown)                    3122 (298)   0.42        0.86 (0.60, 1.24)
rs4402960 (IGF2BP2)                    3406 (296)   0.52        0.93 (0.74, 1.17)
rs7578597 (THADA)                      3416 (291)   0.52        0.88 (0.59, 1.30)
rs12779790 (Unknown)                   3415 (293)   0.58        0.91 (0.66, 1.27)
rs2237895 (KCNQ1)                      3415 (296)   0.61        0.95 (0.77, 1.17)
rs1801282 (PPARG)                      3405 (296)   0.62        0.90 (0.58, 1.38)
rs4712523 (CDKAL1)                     3431 (298)   0.63        1.07 (0.81, 1.42)
rs8050136 (FTO)                        3403 (295)   0.74        0.96 (0.75, 1.23)
rs1111875 (Unknown)                    3407 (296)   0.75        1.04 (0.83, 1.30)
rs864745 (JAZF1)                       3430 (298)   0.8         0.96 (0.70, 1.31)
rs10811661 (Unknown)                   3406 (296)   0.84        0.96 (0.62, 1.48)
trans-β-carotene                       3033 (189)   5 x 10^-5   0.64 (0.52, 0.79)*
γ-tocopherol                           5349 (314)   5 x 10^-5   1.46 (1.25, 1.72)*
cis-β-carotene                         3032 (189)   2 x 10^-4   0.63 (0.50, 0.81)*
PCB170                                 1807 (98)    0.005       1.72 (1.18, 2.52)*
Heptachlor Epoxide                     1711 (94)    0.005       1.49 (1.13, 1.98)*
Table 8. Marginal association of each locus (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL). Per-risk-allele ORs are shown, adjusted by sex, age, ethnicity, and BMI. *Per 1 SD OR are shown, adjusted by sex, age, ethnicity, and BMI.
Correlation between genetic variants with environmental variables
We found little evidence for correlation between the 18 genetic variants and
the 5 environmental factors. Nominal relationships included a negative
association between rs10923931 and Heptachlor Epoxide (p=0.02), where
levels of Heptachlor Epoxide decreased 10% per risk allele. We also observed
a negative association between rs10923931 and cis-β-carotene (p=0.04), where
levels of cis-β-carotene decreased 5% per risk allele.
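Because the environmental variable was regressed on the log base 10 scale, a slope can be converted to a percent change per risk allele by back-transforming it. The sketch below illustrates this conversion; it is an assumption about how such percentages can be derived from a log10 regression coefficient, and the coefficient shown is illustrative rather than a fitted value.

```python
import math

def percent_change_per_allele(beta_log10):
    """Percent change in exposure level per additional risk allele,
    given the slope from a regression of log10(exposure) on allele count."""
    return (10 ** beta_log10 - 1) * 100

# Illustrative slope corresponding to a 10% decrease per risk allele.
beta = math.log10(0.90)
change = percent_change_per_allele(beta)
```

A negative slope therefore corresponds to a multiplicative decrease in exposure level with each additional risk allele.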
Screening for Genetic Variant by Environment Interactions
We then proceeded to study interactions between 18 genetic variants and the 5
environmental factors, a total of 90 interactions tested using survey-weighted
logistic regression adjusted for age, sex, self-reported race, and BMI. Figure
29 presents a Manhattan-style plot where all the 90 interaction terms are
plotted with their p-values. From these 90, we found 8 results at p < 0.05 and
false discovery rates between 1.5 and 37%, involving six genetic variants and
four environmental factors. Further, from these 90, we found 4 results with
FDR less than 20% (p < 0.01) involving 2 variants and 3 environmental factors
worth pursuing for further study.
Figure 29. Manhattan plot of significance values of interaction term (-log10(p-value) for interaction term of pair of factors). The x-axis is grouped by variant (n=18); within each group are 5 points corresponding to the environmental factor tested in interaction with that variant. Top 8 factors (p-value ≤ 0.05) are annotated with their false discovery rate. For example, the interaction between rs13266634 (in SLC30A8 gene) and γ-tocopherol is annotated and has a FDR of 18%. The Bonferroni threshold is seen in the dotted line. Markers alternate between filled and open for each locus.
[Figure 29 plot: -log10(p-value of interaction term) (y-axis, 0 to 4) for each of the 18 loci (x-axis); annotated FDRs: 0.015, 0.06, 0.16, 0.18, 0.22, 0.24, 0.24, 0.37; legend: PCB170, trans-β-carotene, cis-β-carotene, γ-tocopherol, Heptachlor Epoxide.]
The top four of eight findings are discussed here. Our top result included the
interaction between the nutrient marker trans-β-carotene and the non-
synonymous SNP rs13266634 (SLC30A8) and it was significant beyond the
Bonferroni-adjusted cutoff significance level (interaction p = 5 x 10^-5,
Bonferroni-adjusted p-value 0.006, FDR=1.5%). At lower levels of trans-β-
carotene, defined as a point value 1 SD below the mean level of the factor, the
per-allele effect size (odds ratio, OR) was 1.8 (95% CI: 1.3, 2.6), 40% greater
than the marginal effect (Figure 30). The adjusted OR per change in trans-β-
carotene levels was protective for those who had 2 risk alleles for the variant
(adjusted OR 0.5, 95% CI: 0.4, 0.8), while it was negligible for those with 0 or
1 risk alleles. We observed similar effects for cis-β-carotene and
rs13266634 (Figure 30).
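Under a logistic model with a multiplicative interaction term, the per-allele OR at a given standardized exposure level z follows from the main-effect and interaction coefficients as exp(β G + β GxE · z). The sketch below illustrates this relationship; the function name is hypothetical and the coefficients are illustrative values chosen to echo the magnitude of the reported ORs, not the fitted estimates.

```python
import math

def per_allele_or(beta_g, beta_gxe, z):
    """Per-risk-allele odds ratio at standardized exposure level z."""
    return math.exp(beta_g + beta_gxe * z)

# Illustrative coefficients (assumed, not fitted values).
beta_g = math.log(1.1)     # per-allele log OR at the mean exposure level
beta_gxe = math.log(0.61)  # change in per-allele log OR per 1 SD of exposure

or_low = per_allele_or(beta_g, beta_gxe, -1.0)   # 1 SD below the mean
or_mean = per_allele_or(beta_g, beta_gxe, 0.0)   # at the mean
or_high = per_allele_or(beta_g, beta_gxe, 1.0)   # 1 SD above the mean
```

With these illustrative values, the per-allele OR is about 1.8 at low exposure and falls below 1 at high exposure, the same qualitative pattern described for trans-β-carotene and rs13266634.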
On the other hand, we observed an opposite effect for individuals who carried
the rs13266634 risk alleles with rising levels of γ-tocopherol (interaction
p=0.0095, FDR=18%). The adjusted OR for individuals with γ-tocopherol
levels 1 SD higher than the mean was 1.6 (adjusted 95% CI: 1.3, 2.1), a 25%
increase in per-allele adjusted OR when compared to the marginal effect
(Figure 30). For individuals below the mean levels of γ-tocopherol, their
genetic risk appears mitigated.
While we did not detect a marginal individual association between intergenic
SNP rs12779790 and T2D, we observed an interaction with this locus and
trans-β-carotene (p < 0.01, FDR = 16%) with T2D (Figure 30). Specifically,
the protective effect of trans-β-carotene increased 50%, with an adjusted per-SD
environmental factor OR of 0.3 (95% CI: 0.2, 0.5) for those with 2 risk alleles
compared to 0.6 for the marginal per-SD effect of the factor.
Interestingly, our weakest result (not in the top 4) was an interaction between
rs7903146 (TCF7L2), the most highly replicated T2D GWAS variant in
Caucasian populations as observed in the NHGRI catalog [203], and trans-β-
carotene (interaction p=0.04, FDR=40%). While the result may be spurious,
we observed that those with 2 risk alleles and low levels of trans-β-carotene
have roughly 8% higher OR (1.4, 95% CI: 1.1, 1.9) compared to the significant
marginal effect (Figure 30).
Figure 30. Per-risk allele effect sizes for top putative interactions with p < 0.05. Black markers denote OR for marginal effect of variant; the red markers denote interaction OR computed at low (at 1 SD lower than the mean), mean, or high (at 1SD greater than the mean) levels of exposure respectively. Marker sizes are proportional to inverse variance.
Figure 30 values: per-risk-allele OR [95% CI] for the marginal effect and at low (-1 SD), mean, and high (+1 SD) exposure, with interaction p-value (FDR). X-axis: per risk allele OR (0 to 3).
Variant (gene) x factor                  Marginal          Low (-1 SD)        Mean              High (+1 SD)      p-value (FDR)
rs13266634 (SLC30A8) x trans-β-carotene  1.3 [1.1,1.7]     1.8 [1.3,2.6]      1.1 [0.8,1.5]     0.67 [0.41,1.1]   7.8e-05 (0.015)
rs13266634 (SLC30A8) x cis-β-carotene    1.3 [1.1,1.7]     1.8 [1.3,2.5]      1.2 [0.85,1.6]    0.76 [0.47,1.2]   0.0015 (0.06)
rs12779790 (Unknown) x cis-β-carotene    0.91 [0.66,1.3]   1.2 [0.72,1.9]     0.78 [0.55,1.1]   0.52 [0.34,0.8]   0.0062 (0.16)
rs13266634 (SLC30A8) x γ-tocopherol      1.3 [1.1,1.7]     0.84 [0.52,1.3]    1.2 [0.88,1.5]    1.6 [1.3,2.1]     0.0095 (0.18)
rs4402960 (IGF2BP2) x trans-β-carotene   0.93 [0.74,1.2]   0.82 [0.58,1.2]    1.1 [0.83,1.4]    1.4 [1,1.8]       0.014 (0.22)
rs4712523 (CDKAL1) x trans-β-carotene    1.1 [0.81,1.4]    1.4 [0.86,2.4]     1.1 [0.76,1.6]    0.85 [0.61,1.2]   0.021 (0.24)
rs2237895 (KCNQ1) x PCB170               0.95 [0.77,1.2]   0.44 [0.21,0.93]   0.61 [0.34,1.1]   0.85 [0.5,1.5]    0.023 (0.24)
rs7903146 (TCF7L2) x trans-β-carotene    1.3 [1.1,1.6]     1.4 [1.1,1.9]      1.1 [0.88,1.5]    0.88 [0.58,1.4]   0.043 (0.37)
Sensitivity Analyses limited to non-Hispanic white and other Hispanic
participants and older individuals
In sensitivity analyses limited to only Caucasian participants (self-reported
non-Hispanic white and Hispanics, 55 to 58% of the population in the
originally analyzed NHANES cohorts), we were able to reconfirm the top 4
interactions (FDR < 20%) found to the extent of their effect and strength of
association. Specifically, there was less than a 10% change in interaction
effect sizes between the Caucasian-only analysis and the full cohort analyzed.
Further, the statistical significance of association remained less than 0.05
despite decreased power. We concluded we had limited power to claim the 4-
8th ranked interactions were preserved in this sub-sample (p > 0.05); however,
there was a negligible change in their effect sizes.
As NHANES is cross-sectional, there remains a possibility that individuals
who are at genetic risk for T2D might not be diagnosed as such at the time of
their sampling. To estimate the effect of this bias, we estimated the size of
interaction effects for a sub-sample aged 40 and older (63-64% of the
population of the originally analyzed NHANES cohorts). When analyzing
the sub-sample of older individuals, there was negligible change in interaction
effects for all of the top 4 factors (FDR < 20%) and their statistical significance
level remained less than 0.05.
Limited Evidence to Support Interactions with Body Mass Index
We sought to compare interaction effects between BMI, a notable risk factor
for T2D, and the 18 top genetic factors tested in this pilot study. Interestingly,
while adequately powered, we were unable to uncover substantial interaction
effects that would survive Bonferroni correction. We did observe a modest
interaction between BMI and rs8050136 of the FTO gene (uncorrected
interaction p-value=0.03). rs8050136 is known to be an obesity related locus
whose association with T2D is explained primarily through its effect on BMI
[228].
DISCUSSION
We have shown here how results from two comprehensive association
approaches on genetics and the environment, GWAS and EWAS, can be
combined to systematically screen for gene-environment interactions. We
implemented ways to correct for multiple hypotheses using a modified
Bonferroni adjustment method and through estimation of the FDR. In
particular, we have implemented a method to estimate the FDR empirically
using the parametric bootstrap [310], a less conservative way to mitigate the
cost of multiple hypotheses. We propose that the most promising hypotheses
that emerge from this systematic process are candidates for replication in
additional independent cohorts in prospective studies.
We restricted our analyses to environmental factors and genetic variants on the
basis of strength of the evidence on their marginal associations in GWAS and
EWAS. One could also consider evaluating gene-environment interactions for
genetic loci or environmental factors that do not have robust support for the
presence of marginal associations. Given the small marginal effects for most
common genetic variants, many genuine associations do not reach genome-
wide significance and remain false-negatives [41]. Some of those may have
strong interactions with the environment [301], and may only be discovered if
the appropriate joint environmental variables are considered. However,
selecting which of the millions of non-genome-wide-significant SNPs to test is
challenging. It is well known that testing for interactions is power-intensive
[44]; furthermore, testing a large number or all of them imposes an even
greater power and multiplicity burden [47]. For environmental factors, the
choice of which ones to test for interaction is even more tenuous. Notably, in
contrast to common genetic variants, there is yet no high-throughput
measurement platform that captures all the environmental factors and lack of
measuring capacity limits data availability. Measurement error can be
substantial for many environmental exposures [10, 314].
We had the ability to screen 266 environmental factors measured in serum and
urine systematically through a prior EWAS process for association with T2D.
We selected for interaction testing only the 5 of them that had the strongest
support, as judged by FDR, persistence of effects after adjustment for
confounders, and replication in independent cohorts. The proposed approach
creates a systematic list of tested interaction terms while keeping the number
of tested terms manageable.
However, it is still very important to account for multiple hypothesis testing.
Here, we used two approaches, multiplicity correction and the FDR, but
other approaches may also be employed [307].
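To make the multiplicity burden concrete, the following sketch enumerates every factor-by-variant pair and computes a plain Bonferroni threshold. It is an illustration only: the factor and SNP labels are placeholders, the SNP count is hypothetical, and the standard Bonferroni rule is used here rather than the modified adjustment applied in our analysis.

```python
from itertools import product


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a standard Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Placeholder labels: 5 environmental factors (as in the text) and a
# hypothetical set of 4 GWAS-supported SNPs.
env_factors = [f"E{i}" for i in range(1, 6)]
snps = [f"SNP{j}" for j in range(1, 5)]

# Every (environmental factor, SNP) pair is one interaction hypothesis.
interaction_pairs = list(product(env_factors, snps))
threshold = bonferroni_threshold(len(interaction_pairs))
```

With 5 factors and 4 hypothetical SNPs there are 20 interaction hypotheses, so the per-test threshold at an overall alpha of 0.05 is 0.0025. Restricting the screen to robustly supported factors and variants keeps this threshold from becoming prohibitively small.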
Our application highlights other challenges of testing and validating gene-
environment interactions. First, we had low to moderate power to detect
moderate interaction effects for some of the interactions tested. Not
surprisingly, we found modest p-values, of which only one survived the
Bonferroni correction, and we had modest FDR estimates for the other highest-
ranking interactions. This underscores both the great caution needed in claiming
gene-environment interactions and, more importantly, the need for extensive
replication of the top findings in larger, well-powered studies. We stress that
the current exercise focuses on hypothesis generation.
Replication studies can also examine whether genetic effects and their
respective interactions are similar in populations of different ancestry.
Population stratification [181] is the equivalent of confounding for genetic
effects. We adjusted our analyses for self-reported race; however, we should
acknowledge that the genetic effects identified to date from GWAS are best
documented in Caucasian populations and that self-reported ethnicity is subject
to bias. Genetic effects for GWAS-discovered markers may be different in
different ancestry groups [315-320]. While under-powered, analyses limited to
Caucasians showed similar effects for the top hits in our analyses. However,
little is known on how gene-environment interactions may behave in
populations of different ancestry and this should be further investigated.
Another issue in studying complex and age-related diseases such as T2D
is the classification of cases and controls. For example, a fraction of
non-cases at high genetic risk for T2D will not yet have been diagnosed at the time of their
sampling. To estimate the effect of this bias, we conducted a sensitivity
analysis limiting the cohort to those older than 40. While we observed little
difference in our estimates, we acknowledge that effects might be diluted due
to this type of bias.
There are but a few documented examples of interaction effects between
genome-wide significant loci and environmental or dietary factors on T2D
[304]. Through this screen, we have been able to hypothesize about possible
new ones. For example, the strongest evidence for interaction in our data
existed between rs13266634, a non-synonymous coding variant in the
SLC30A8 gene, and three nutrient factors, trans- and cis-β-carotene, and γ-
tocopherol. The non-synonymous variant rs13266634 is a highly replicated
variant, connected with beta cell function and insulin secretion [29, 50, 64].
For example, in a SLC30A8 knockout mouse model, normal glucose-induced
insulin release was preserved; however, after a high fat diet, the SLC30A8
knockout mouse became glucose intolerant and diabetic [49]. Our data-driven
gene-environment screen has enabled us to hypothesize that impaired insulin
secretion imparted by the rs13266634 variant, combined with presence or
deficiency of specific nutrients in the diet, leads to greater risk for T2D. We
plan to test this hypothesis in depth in both model systems and in other human
populations.
Nevertheless, attributing causality of interactions is challenging. For
environmental factors, confounding [68] and reverse causality [10] are major
issues. First, little is known about the causal
nature, if any, regarding these environmental factors and T2D [216]. Second,
statistical interaction does not equate to biological interaction [295]. Given the
modest interaction effect sizes and levels of false discovery, the joint effects of
these factors need to be evaluated in independent, larger populations, including
prospective cohorts where the time-dependent associations of environmental
factors can be assessed. Nevertheless, genetic information can sometimes be
helpful in identifying genuine environmental risk factors through Mendelian
randomization [69].
Finally, other infrastructure-related challenges remain for the future of
systematic screening for gene-environment interactions [44]. First, we lack a
complete list of candidate environmental factors and of their marginal effects
on disease. In comparison, the analogous list of common genetic
variants is well known and is constantly being updated [9]. Second, screening and
validating gene-environment interactions is power-intensive and will require
new types of biobanks that can accommodate large numbers of environmental
and genetic measures collected on the same individuals across multiple studies
and cohorts [10]. A straightforward first step includes augmenting current
GWAS with environmental information [45, 301] and adopting high-priority
exposure measures published by public initiatives such as PhenX
(www.phenx.org), whose goal is to build consensus regarding the minimal set
of impactful environmental factors to measure in large GWAS-like studies [66,
67]. A systematic approach to measuring and testing the environment, as we
have shown in previous chapters, and its interaction with the genetic profile
of individuals may help find and explain a substantial component of the
disease risk for some common health conditions or even lead to hypotheses
regarding disease pathology.
CHAPTER 6: CONCLUSION AND DISCUSSION
In this dissertation we have described and implemented methods to create
robust and ranked hypotheses through massive, comprehensive, and systematic
association of the envirome to disease and adverse phenotypes, on both
molecular and population scales.
In the first method, operating on molecular-based toxicogenomics data, we used
tools of integrative genomics to merge once-disparate datasets: toxicological
gene expression responses and disease gene expression states. We developed a
generalizable representation of environmentally induced molecular responses
called the “Envirome Map”. Assuming that functional states induced by
environmental agents are similar to disease, we correlated the Envirome Map
to cancer expression states. Specifically, we show how the expression states
associated with certain factors, such as Bisphenol A, are correlated with breast
cancer, prompting further study. Importantly, the Envirome Map enables
hypothesis generation in a scalable and practical way, utilizing data from the
public domain in databases such as the Gene Expression Omnibus.
We also have developed and implemented a method to associate the envirome
to disease on a population scale, called the “Environment-wide Association
Study” (EWAS), analogous to the data-driven genome-wide association study
(GWAS). We showed how EWAS provides the benefits of GWAS through
transparent and comprehensive reporting. Most importantly, EWAS has
enabled the discovery of pollutants and dietary markers associated with
common diseases such as Type 2 Diabetes and with risk factors for heart disease,
such as serum lipid levels, at low levels of false discovery. This type of discovery is
not standard practice in current-day epidemiology. In fact, it has drawn attention in
environmental health circles [1, 197, 200], and has introduced a new area of
study to genome scientists [196, 198, 199]. Last, and critically, EWAS calls for
the study of the highest-ranking novel and robust associations in depth in
different study designs and scenarios.
For example, we present a novel way to systematically screen for gene-
environment interactions, integrating results from comprehensive studies on
the envirome and genome, dubbed “G-EWAS”. In our prototype G-EWAS
on T2D, we screened all possible interactions between robustly identified
environmental factors from EWAS and genetic factors in GWAS. Further, we
implemented two ways of accounting for multiple hypotheses, and after
diverse sensitivity analyses, converged on an interaction between a non-
synonymous functional variant in the SLC30A8 gene and nutrient markers.
With experimental SLC30A8 knockout models, investigators have
hypothesized that the associated risk for T2D comes as a result of a dietary
interaction; through G-EWAS we are able to speculate about the role of
specific factors for this hypothesized interaction.
We predict that studies like G-EWAS are just the tip of the iceberg. Our
capability of measuring molecular modalities on a population scale is
improving exponentially. We are on the brink of a deluge of genomic
sequence data, capable of measuring both genotypes and molecular responses
on a massive number of individuals. How will we merge data of different
scales with dynamic environmental information to better predict disease?
Given that traditionally these data have not been analyzed jointly, we face
immense infrastructural and analytic challenges with more complex data
modes. It is inevitable that new methods of comprehensive inference in the spirit
of EWAS and G-EWAS will need to be developed to take advantage of them.
To even begin to design and conduct such studies, collaborative efforts now
need to be focused on defining and standardizing -- from nomenclature to means of
measurement -- the “envirome”, a concept we only introduced here. For example,
genomic studies have benefitted from the efforts of the National
Center for Biotechnology Information, which catalogs genetic information in an
accessible manner for the scientific and engineering communities to utilize.
Further, standardization enabled projects such as the HapMap, in which a
federation of institutions was assembled to characterize common genomic
variation on the planet. Such efforts need to take place to conduct true
envirome-wide studies. Surveys such as the National Health and Nutrition
Examination Survey may enable us to attain the map of environmental
variation, but they are only a start.
Investigators have now associated genomic variation with hundreds of
common diseases [321]. Specifically, studies such as GWAS (Figure 5B) have
allowed us to hypothesize about disease etiology and predict genetic risk
[203]. As a result, we now have a greater understanding of genetic
contribution to disease, but critically, the majority of genetic risk for common
disease has yet to be explained. More envirome-based studies are needed
to achieve a fuller understanding of disease. Furthermore, the
investigation of the environment lags behind that of the genome (Figure 1). To begin
closing the gap, we propose conducting many more EWAS beginning with
common, multifactorial diseases prioritized by the World Health Organization
(e.g., cardiovascular disease, T2D, hypertension, premature births, lung cancer,
asthma). For example, as of this writing, we are conducting EWAS on
hypertension on multiple European cohorts, on chronic kidney disease with
cohorts in the United States, on asthma with cohorts in the United States, and
on mothers with premature births in the United States.
As a result of multiple GWAS on many diseases, we are now able to provide
clinically relevant information to patients for disease prognosis and prevention
[23]. However, we have yet to combine this data with environmental
information. For example, how might specific environmental factors modify
our genetic risk for disease? Individuals are now quantifying toxicants in their
own tissue [322] and aggregated results from multiple EWAS will enable
investigators to accurately estimate personal environmental risk. Furthermore,
results from G-EWAS have implications for personal genomics whereby we
may stratify genetic risk for disease by levels of specific environmental factors.
This effort will help us attain truly personalized medicine, whereby specific
modifiable environmental attributes can be the new targets of therapeutics
based on individuals’ genetic profile.
In this dissertation, we have presented and applied new analytical paradigms to
comprehensively connect the environment to disease. Just as in the last 10
years we have witnessed the fruits of genome-wide studies, it is now time to
usher in envirome-wide studies for a more complete understanding of etiology
aligned towards therapeutics and prevention.
REFERENCES
1. Rappaport, S.M. and M.T. Smith, Environment and Disease Risks. Science, 2010. 330(6003): p. 460-461.
2. Schwartz, D. and F. Collins, Medicine. Environmental biology and human disease. Science, 2007. 316(5825): p. 695-6.
3. Klaassen, C.D., ed. Casarett and Doull's Toxicology - The Basic Science of Poisons (7th Edition). 7 ed., ed. C.D. Klaassen. 2008, McGraw-Hill.
4. Willett, W.C., Balancing life-style and genomics research for disease prevention. Science, 2002. 296(5568): p. 695-8.
5. Christiani, D.C., Combating Environmental Causes of Cancer. New Engl J Med, 2011. 364(9): p. 791-793.
6. Ramachandrappa, S. and I.S. Farooqi, Genetic approaches to understanding human obesity. J Clin Invest, 2011. 121(6): p. 2080-6.
7. The Wellcome Trust Case Control Consortium, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007. 447(7145): p. 661-678.
8. Pearson, T.A. and T.A. Manolio, How to interpret a genome-wide association study. J Am Med Assoc, 2008. 299(11): p. 1335-44.
9. Hindorff, L., et al. A Catalog of Published Genome-Wide Association Studies. 2009 [cited 2009 7/28/2009]; Available from: http://www.genome.gov/gwastudies.
10. Ioannidis, J., et al., Researching Genetic Versus Nongenetic Determinants of Disease: A Comparison and Proposed Unification. Sci. Transl. Med., 2009. 1(7): p. 8.
11. Baker, D. and M. Nieuwenhuijsen, eds. Environmental Epidemiology. 2008, Oxford University Press: Oxford.
12. Judson, R.S., et al., In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization -- The ToxCast Project. Environ Health Perspect, 2009. 118(4).
13. Committee on Toxicity Testing and Assessment of Environmental Agents and National Research Council, Toxicity Testing in the 21st Century: A Vision and a Strategy. 2007, Washington, D.C.: National Academies Press.
14. Hubal, E.A., Biologically relevant exposure science for 21st century toxicity testing. Toxicol Sci, 2009. 111(2): p. 226-32.
15. Krewski, D., et al., New directions in toxicity testing. Annu Rev Publ Health, 2011. 32: p. 161-78.
16. Wild, C.P., Complementing the genome with an "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev, 2005. 14(8): p. 1847-50.
17. World Health Organization. Global Health Observatory Data Repository. 2011 [cited 2011 8/9/2011]; Available from: http://apps.who.int/ghodata/.
18. Mullis, K., Process for amplifying nucleic acid sequences, Cetus Corporation: USA.
19. Illumina, I. 2011 [cited 2011 7/19/2011]; Available from: http://www.illumina.com/.
20. National Center for Biotechnology Information. National Center for Biotechnology Information. 2011 [cited 2011 7/18/2011]; Available from: http://www.ncbi.nlm.nih.gov/guide/.
21. Anthony, J.C., The promise of psychiatric enviromics. Br J Psychiatry Suppl, 2001. 40: p. s8-11.
22. Liu, Y.I., P.H. Wise, and A.J. Butte, The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 2009. 10 Suppl 2: p. S14.
23. Ashley, E.A., et al., Clinical assessment incorporating a personal genome. Lancet, 2010. 375(9725): p. 1525-1535.
24. Kawakami, N., et al., Effects of smoking on the incidence of non-insulin-dependent diabetes mellitus. Replication and extension in a Japanese cohort of male employees. Am J Epidemiol, 1997. 145(2): p. 103-9.
25. International HapMap, C., A haplotype map of the human genome. Nature, 2005. 437(7063): p. 1299-320.
26. Goh, K.I., et al., The human disease network. Proc Natl Acad Sci U S A, 2007. 104(21): p. 8685-90.
27. Alan D. Lopez, et al., eds. Global Burden of Disease and Risk Factors. 2006, The International Bank for Reconstruction and Development / The World Bank: Washington DC.
28. Lettre, G. and J.D. Rioux, Autoimmune diseases: insights from genome-wide association studies. Hum Mol Genet, 2008. 17(R2): p. R116-21.
29. Rutter, G.A., Think zinc: New roles for zinc in the control of insulin secretion. Islets, 2009. 2(1): p. 49-50.
30. Majithia, A.R. and J.C. Florez, Clinical translation of genetic predictors for type 2 diabetes. Curr Opin Endocrinol, 2009. 16(2): p. 100-6.
31. Lyssenko, V., et al., Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes. J Clin Invest, 2007. 117(8): p. 2155-2163.
32. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-753.
33. Goldstein, D.B., Common genetic variation and human traits. N Engl J Med, 2009. 360(17): p. 1696-8.
34. Rothman, K., S. Greenland, and T. Lash, eds. Modern Epidemiology, 3rd Ed. 3rd ed. 2008, Lippincott Williams & Wilkins: Philadelphia.
35. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.
36. Patel, C.J., et al., Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. In Review, 2011.
37. Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey. 2009 [cited 2009 9/1/2009]; Available from: http://www.cdc.gov/nchs/nhanes/.
38. Noble, W.S., How does multiple testing correction work? Nat Biotech, 2009. 27(12): p. 1135-1137.
39. Benjamini, Y. and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B, 1995.
40. Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100(16): p. 9440-5.
41. Ioannidis, J.P., R. Tarone, and J.K. McLaughlin, The False-positive to False-negative Ratio in Epidemiologic Studies. Epidemiology, 2011. 22(4): p. 450-6.
42. Ioannidis, J.P.A., Why Most Published Research Findings Are False. PLoS Med, 2005. 2(8): p. e124.
43. Miller, E.R., 3rd, et al., Meta-analysis: high-dosage vitamin E supplementation may increase all-cause mortality. Ann Intern Med, 2005. 142(1): p. 37-46.
44. Hunter, D.J., Gene-environment interactions in human diseases. Nat Rev Genet, 2005. 6(4): p. 287-98.
45. Rothman, N., et al., A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet, 2010. 42(11): p. 978-84.
46. Garcia-Closas, M., et al., NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet, 2005. 366(9486): p. 649-59.
47. Thomas, D., Gene-environment-wide association studies: emerging approaches. Nat Rev Genet, 2010. 11(4): p. 259-272.
48. Ioannidis, J.P., Genetic associations: false or true? Trends Mol Med, 2003. 9(4): p. 135-8.
49. Lemaire, K., et al., Insulin crystallization depends on zinc transporter ZnT8 expression, but is not required for normal glucose homeostasis in mice. Proc Natl Acad Sci USA, 2009. 106(35): p. 14872-7.
50. Nicolson, T.J., et al., Insulin Storage and Glucose Homeostasis in Mice Null for the Granule Zinc Transporter ZnT8 and Studies of the Type 2 Diabetes-Associated Variants. Diabetes, 2009. 58(9): p. 2070-2083.
51. Waters, M.D. and J.M. Fostel, Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet, 2004. 5(12): p. 936-48.
52. Lamb, J., et al., The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 2006. 313(5795): p. 1929-35.
53. Barrett, T., et al., NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res, 2007. 35(Database issue): p. D760-5.
54. Davis, A.P., et al., Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res, 2009. 37(Database issue): p. D786-92.
55. Lim, E., et al., T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res, 2010. 38(Database issue): p. D781-6.
56. Dix, D.J., et al., The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci, 2007. 95(1): p. 5-12.
57. Dudley, J.T., et al., Disease signatures are robust across tissues and experiments. Mol Syst Biol, 2009. 5.
58. Sirota, M., et al., Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Sci Transl Med, 2011. 3(96): p. 96ra77.
59. MAQC Consortium, The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotech, 2006. 24(9): p. 1151-1161.
60. MAQC Consortium, The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotech, 2010. 28(8): p. 827-838.
61. Patel, C. and A. Butte, Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics, 2010. 3(1): p. 17.
62. Wang, T.J., et al., Metabolite profiles and the risk of developing diabetes. Nat Med, 2011. 17(4): p. 448-53.
63. Dawber, T.R., G.F. Meadors, and F.E. Moore, Jr., Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health Nations Health, 1951. 41(3): p. 279-81.
64. Voight, B.F., et al., Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet, 2010. 42(7): p. 579-589.
65. Teslovich, T.M., et al., Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 2010. 466(7307): p. 707-713.
66. Hamilton, C.M., et al., The PhenX Toolkit: Get the Most From Your Measures. Am J Epidemiol, 2011. 174(3): p. 253-260.
67. NHGRI. PhenX. 2011; Available from: http://www.phenx.org.
68. Davey Smith, G., Use of genetic markers and gene-diet interactions for interrogating population-level causal influences of diet on health. Genes Nutr, 2010. 6(1): p. 27-43.
69. Davey Smith, G. and S. Ebrahim, 'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol, 2003. 32(1): p. 1-22.
70. Thorgeirsson, T.E., et al., A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 2008. 452(7187): p. 638-42.
71. Liu, J.Z., et al., Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet, 2010. 42(5): p. 436-440.
72. Wang, K.S., et al., A meta-analysis of two genome-wide association studies identifies 3 new loci for alcohol dependence. J Psychiatr Res, 2011.
73. Heath, A.C., et al., A Quantitative-Trait Genome-Wide Association Study of Alcoholism Risk in the Community: Findings and Implications. Biol Psychiatry, 2011.
74. Schumann, G., et al., Genome-wide association and genetic functional studies identify autism susceptibility candidate 2 gene (AUTS2) in the regulation of alcohol consumption. Proc Natl Acad Sci U S A, 2011. 108(17): p. 7119-24.
75. Rauch, A., et al., Genetic variation in IL28B is associated with chronic hepatitis C and treatment failure: a genome-wide association study. Gastroenterology, 2010. 138(4): p. 1338-45, 1345 e1-7.
76. Kamatani, Y., et al., A genome-wide association study identifies variants in the HLA-DP locus associated with chronic hepatitis B in Asians. Nat Genet, 2009. 41(5): p. 591-5.
77. Petrovski, S., et al., Common human genetic variants and HIV-1 susceptibility: a genome-wide survey in a homogeneous African population. AIDS, 2011. 25(4): p. 513-8.
78. Sulem, P., et al., Sequence variants at CYP1A1-CYP1A2 and AHR associate with coffee consumption. Hum Mol Genet, 2011. 20(10): p. 2071-7.
79. De Moor, M.H., et al., Genome-wide association study of exercise behavior in Dutch and American adults. Med Sci Sports Exerc, 2009. 41(10): p. 1887-95.
80. Tanaka, T., et al., Genome-wide association study of vitamin B6, vitamin B12, folate, and homocysteine blood concentrations. Am J Hum Genet, 2009. 84(4): p. 477-82.
81. Yang, J.J. and R.M. Plenge, Genomic Technology Applied to Pharmacological Traits. J Am Med Assoc, 2011. 306(6): p. 652-653.
82. Peters, L.L., et al., The mouse as a model for human biology: a resource guide for complex trait analysis. Nat Rev Genet, 2007. 8(1): p. 58-69.
83. Romanoski, C.E., et al., Systems Genetics Analysis of Gene-by-Environment Interactions in Human Cells. Am J Hum Genet, 2010.
84. Idaghdour, Y., et al., Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet, 2010. 42(1): p. 62-7.
85. Judson, R., et al., The toxicity data landscape for environmental chemicals. Environ Health Perspect, 2009. 117(5): p. 685-95.
86. Weis, B.K., et al., Personalized exposure assessment: promising approaches for human environmental health research. Environ Health Perspect, 2005. 113(7): p. 840-8.
87. Wang, Y., et al., PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res, 2009. 37(Web Server issue): p. W623-33.
88. Williams-DeVane, C.R., M.A. Wolf, and A.M. Richard, DSSTox chemical-index files for exposure-related experiments in ArrayExpress and Gene Expression Omnibus: enabling toxico-chemogenomics data linkages. Bioinformatics, 2009. 25(5): p. 692-694.
89. Andrew, A.S., et al., Drinking-water arsenic exposure modulates gene expression in human lymphocytes from a U.S. population. Environ. Health Perspect., 2008. 116(4): p. 524-31.
90. Malard, V., et al., Global gene expression profiling in human lung cells exposed to cobalt. BMC Genomics, 2007. 8: p. 147.
91. Wang, W., et al., NDRG3 is an androgen regulated and prostate enriched gene that promotes in vitro and in vivo prostate cancer cell growth. Int J Cancer, 2009. 124(3): p. 521-30.
92. Gottipolu, R.R., et al., One-month diesel exhaust inhalation produces hypertensive gene expression pattern in healthy rats. Environ. Health Perspect., 2009. 117(1): p. 38-46.
93. Bild, A.H., et al., Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 2006. 439(7074): p. 353-7.
94. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
95. Gohlke, J.M., et al., Genetic and environmental pathways to complex diseases. BMC Syst Biol, 2009. 3: p. 46.
96. Becker, K.G., et al., The genetic association database. Nat Genet, 2004. 36(5): p. 431-2.
97. Mattingly, C.J., et al., The comparative toxicogenomics database: a cross-species resource for building chemical-gene interaction networks. Toxicol Sci, 2006. 92(2): p. 587-95.
98. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.
99. National Center for Biotechnology Information. Homologene. 2010 3/2008]; Available from: http://www.ncbi.nlm.nih.gov/homologene.
100. Zeeberg, B.R., et al., High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics, 2005. 6: p. 168.
101. R Core Team, R: A language and environment for statistical computing, 2008, R Foundation for Statistical Computing: Vienna, Austria.
102. Bossé, Y., K. Maghni, and T.J. Hudson, 1alpha,25-dihydroxy-vitamin D3 stimulation of bronchial smooth muscle cells induces autocrine, contractility, and remodeling processes. Physiol Genomics, 2007. 29(2): p. 161-8.
103. Tijet, N., et al., Aryl hydrocarbon receptor regulates distinct dioxin-dependent and dioxin-independent gene batteries. Mol Pharmacol, 2006. 69(1): p. 140-53.
104. Li, Z., et al., Discrimination of vanadium from zinc using gene profiling in human bronchial epithelial cells. Environ. Health Perspect., 2005: p. 1747-54.
105. Selvaraj, V., et al., Gene expression profiling of 17beta-estradiol and genistein effects on mouse thymus. Toxicol Sci, 2005. 87(1): p. 97-112.
106. Lin, C.Y., et al., Whole-genome cartography of estrogen receptor alpha binding sites. PLoS Genet, 2007. 3(6): p. e87.
107. Chandran, U.R., et al., Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process. BMC Cancer, 2007. 7: p. 64.
108. Yu, Y.P., et al., Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J Clin Oncol, 2004. 22(14): p. 2790-9.
109. Landi, M.T., et al., Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE, 2008. 3(2): p. e1651.
110. Liu, R., et al., The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med, 2007. 356(3): p. 217-26.
111. Wang, Y., et al., An overview of the PubChem BioAssay resource. Nucleic Acids Res, 2010. 38(Database issue): p. D255-66.
112. Ho, S.M., et al., Developmental exposure to estradiol and bisphenol A increases susceptibility to prostate carcinogenesis and epigenetically regulates phosphodiesterase type 4 variant 4. Cancer Res, 2006. 66(11): p. 5624-32.
113. Shazer, R.L., et al., Raloxifene, an oestrogen-receptor-beta-targeted therapy, inhibits androgen-independent prostate cancer growth: results from preclinical studies and a pilot phase II clinical trial. BJU Int, 2006. 97(4): p. 691-7.
114. Benbrahim-Tallaa, L., et al., Molecular events associated with arsenic-induced malignant transformation of human prostatic epithelial cells: aberrant genomic DNA methylation and K-ras oncogene activation. Toxicol Appl Pharmacol, 2005. 206(3): p. 288-98.
115. Bertilaccio, M.T., et al., Vasculature-targeted tumor necrosis factor-alpha increases the therapeutic index of doxorubicin against prostate cancer. Prostate, 2008. 68(10): p. 1105-15.
116. Borden, L.S., Jr., et al., Vinorelbine, doxorubicin, and prednisone in androgen-independent prostate cancer. Cancer, 2006. 107(5): p. 1093-100.
117. Amato, R.J. and H. Sarao, A phase I study of paclitaxel/doxorubicin/thalidomide in patients with androgen-independent prostate cancer. Clin Genitourin Cancer, 2006. 4(4): p. 281-6.
118. Kang, J., et al., Subtoxic concentration of doxorubicin enhances TRAIL-induced apoptosis in human prostate cancer cell line LNCaP. Prostate Cancer Prostatic Dis, 2005. 8(3): p. 274-9.
119. Benbrahim-Tallaa, L., et al., Estrogen signaling and disruption of androgen metabolism in acquired androgen-independence during cadmium carcinogenesis in human prostate epithelial cells. Prostate, 2007. 67(2): p. 135-45.
120. Raschke, M., K. Wahala, and B.L. Pool-Zobel, Reduced isoflavone metabolites formed by the human gut microflora suppress growth but do not affect DNA integrity of human prostate cancer cells. Br J Nutr, 2006. 96(3): p. 426-34.
121. Takahashi, Y., et al., Using DNA microarray analyses to elucidate the effects of genistein in androgen-responsive prostate cancer cells: identification of novel targets. Mol Carcinog, 2004. 41(2): p. 108-119.
122. Li, Y., et al., Regulation of gene expression and inhibition of experimental prostate cancer bone metastasis by dietary genistein. Neoplasia, 2004. 6(4): p. 354-63.
123. Koike, H., et al., Insulin-like growth factor binding protein-6 inhibits prostate cancer cell proliferation: implication for anticancer effect of diethylstilbestrol in hormone refractory prostate cancer. Br J Cancer, 2005. 92(8): p. 1538-44.
124. Oh, W.K., The evolving role of estrogen therapy in prostate cancer. Clin Prostate Cancer, 2002. 1(2): p. 81-9.
125. Tokar, E.J., et al., Cholecalciferol (vitamin D3) and the retinoid N-(4-hydroxyphenyl)retinamide (4-HPR) are synergistic for chemoprevention of prostate cancer. J Exp Ther Oncol, 2006. 5(4): p. 323-33.
126. Costello, L.C. and R.B. Franklin, The clinical relevance of the metabolism of prostate cancer; zinc and tumor suppression: connecting the dots. Mol Cancer, 2006. 5: p. 17.
127. Uzzo, R.G., et al., Diverse effects of zinc on NF-kappaB and AP-1 transcription factors: implications for prostate cancer progression. Carcinogenesis, 2006. 27(10): p. 1980-90.
128. Michael, I.P., et al., Human tissue kallikrein 5 is a member of a proteolytic cascade pathway involved in seminal clot liquefaction and potentially in prostate cancer progression. J Biol Chem, 2006. 281(18): p. 12743-50.
129. Uzzo, R.G., et al., Zinc inhibits nuclear factor-kappa B activation and sensitizes prostate cancer cells to cytotoxic agents. Clin Cancer Res, 2002. 8(11): p. 3579-83.
130. Filyak, Y., O. Filyak, and R. Stoika, Transforming growth factor beta-1 enhances cytotoxic effect of doxorubicin in human lung adenocarcinoma cells of A549 line. Cell Biol Int, 2007. 31(8): p. 851-5.
131. Shen, J., et al., Fetal onset of aberrant gene expression relevant to pulmonary carcinogenesis in lung adenocarcinoma development induced by in utero arsenic exposure. Toxicol Sci, 2007. 95(2): p. 313-20.
132. Waalkes, M.P., et al., Enhanced urinary bladder and liver carcinogenesis in male CD1 mice exposed to transplacental inorganic arsenic and postnatal diethylstilbestrol or tamoxifen. Toxicol Appl Pharmacol, 2006. 215(3): p. 295-305.
133. Waalkes, M.P., et al., Animal models for arsenic carcinogenesis: inorganic arsenic is a transplacental carcinogen in mice. Toxicol Appl Pharmacol, 2004. 198(3): p. 377-84.
134. Devereux, T.R., et al., Map kinase activation correlates with K-ras mutation and loss of heterozygosity on chromosome 6 in alveolar bronchiolar carcinomas from B6C3F1 mice exposed to vanadium pentoxide for 2 years. Carcinogenesis, 2002. 23(10): p. 1737-43.
135. Zanesi, N., et al., Lung cancer susceptibility in Fhit-deficient mice is increased by Vhl haploinsufficiency. Cancer Res, 2005. 65(15): p. 6576-82.
136. Diament, M.J., et al., Inhibition of tumor progression and paraneoplastic syndrome development in a murine lung adenocarcinoma by medroxyprogesterone acetate and indomethacin. Cancer Invest, 2006. 24(2): p. 126-31.
137. Moody, T.W., et al., Indomethacin reduces lung adenoma number in A/J mice. Anticancer Res, 2001. 21(3B): p. 1749-55.
138. Levin, G., et al., Indomethacin inhibits the accumulation of tumor cells in mouse lungs and subsequent growth of lung metastases. Chemotherapy, 2000. 46(6): p. 429-37.
139. Meira, L.B., et al., Cancer predisposition in mutant mice defective in multiple genetic pathways: uncovering important genetic interactions. Mutat Res, 2001. 477(1-2): p. 51-8.
140. Fan, J.G., Q.E. Wang, and S.J. Liu, Chrysotile-induced cell transformation and transcriptional changes of c-myc oncogene in human embryo lung cells. Biomed Environ Sci, 2000. 13(3): p. 163-9.
141. Carvajal, A., et al., Progesterone pre-treatment potentiates EGF pathway signaling in the breast cancer cell line ZR-75. Breast Cancer Res Treat, 2005. 94(2): p. 171-83.
142. Kato, S., et al., Progesterone increases tissue factor gene expression, procoagulant activity, and invasion in the breast cancer cell line ZR-75-1. J Clin Endocrinol Metab, 2005. 90(2): p. 1181-8.
143. Verheus, M., et al., Plasma phytoestrogens and subsequent breast cancer risk. J Clin Oncol, 2007. 25(6): p. 648-55.
144. Nobert, G.S., M.M. Kraak, and S. Crawford, Estrogen dependent growth inhibitory effects of tamoxifen but not genistein in solid tumors derived from estrogen receptor positive (ER+) primary breast carcinoma MCF7: single agent and novel combined treatment approaches. Bull Cancer, 2006. 93(7): p. E59-66.
145. Seo, H.S., et al., Stimulatory effect of genistein and apigenin on the growth of breast cancer cells correlates with their ability to activate ER alpha. Breast Cancer Res Treat, 2006. 99(2): p. 121-34.
146. Lakshmanaswamy, R., R.C. Guzman, and S. Nandi, Hormonal prevention of breast cancer: significance of promotional environment. Adv Exp Med Biol, 2008. 617: p. 469-75.
147. Bergman Jungestrom, M., L.U. Thompson, and C. Dabrosin, Flaxseed and its lignans inhibit estradiol-induced growth, angiogenesis, and secretion of vascular endothelial growth factor in human breast cancer xenografts in vivo. Clin Cancer Res, 2007. 13(3): p. 1061-7.
148. Vogel, V.G., Recent results from clinical trials using SERMs to reduce the risk of breast cancer. Ann N Y Acad Sci, 2006. 1089: p. 127-42.
149. Eliassen, A.H., et al., Endogenous steroid hormone concentrations and risk of breast cancer among premenopausal women. J Natl Cancer Inst, 2006. 98(19): p. 1406-15.
150. Russo, J., et al., Estrogen and its metabolites are carcinogenic agents in human breast epithelial cells. J Steroid Biochem Mol Biol, 2003. 87(1): p. 1-25.
151. Ackerstaff, E., et al., Anti-inflammatory agent indomethacin reduces invasion and alters metabolism in a human breast cancer cell line. Neoplasia, 2007. 9(3): p. 222-35.
152. Green, M., et al., Diallyl sulfide induces the expression of estrogen metabolizing genes in the presence and/or absence of diethylstilbestrol in the breast of female ACI rats. Toxicol Lett, 2007. 168(1): p. 7-12.
153. Walter, G., R. Liebl, and E. von Angerer, Synthesis and biological evaluation of stilbene-based pure estrogen antagonists. Bioorg Med Chem Lett, 2004. 14(18): p. 4659-63.
154. Vegran, F., et al., Overexpression of caspase-3s splice variant in locally advanced breast carcinoma is associated with poor response to neoadjuvant chemotherapy. Clin Cancer Res, 2006. 12(19): p. 5794-800.
155. Untch, M., et al., Cardiac safety of trastuzumab in combination with epirubicin and cyclophosphamide in women with metastatic breast cancer: results of a phase I trial. Eur J Cancer, 2004. 40(7): p. 988-97.
156. Machiels, J.P., et al., Cyclophosphamide, doxorubicin, and paclitaxel enhance the antitumor immune response of granulocyte/macrophage-colony stimulating factor-secreting whole-cell vaccines in HER-2/neu tolerized mice. Cancer Res, 2001. 61(9): p. 3689-97.
157. Murray, T.J., et al., Induction of mammary gland ductal hyperplasias and carcinoma in situ following fetal bisphenol A exposure. Reprod Toxicol, 2007. 23(3): p. 383-90.
158. Uehara, T., et al., A toxicogenomics approach for early assessment of potential non-genotoxic hepatocarcinogenicity of chemicals in rats. Toxicology, 2008. 250(1): p. 15-26.
159. Yager, J.D. and N.E. Davidson, Estrogen carcinogenesis in breast cancer. N Engl J Med, 2006. 354(3): p. 270-82.
160. Dairkee, S.H., et al., Bisphenol A induces a profile of tumor aggressiveness in high-risk cells from breast cancer patients. Cancer Res, 2008. 68(7): p. 2076-80.
161. Buteau-Lozano, H., et al., Xenoestrogens modulate vascular endothelial growth factor secretion in breast cancer cells through an estrogen receptor-dependent mechanism. J Endocrinol, 2008. 196(2): p. 399-412.
162. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 2005. 102(43): p. 15545-50.
163. Salonen, J.T., et al., Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium. Am J Hum Genet, 2007. 81(2): p. 338-45.
164. Saxena, R., et al., Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 2007. 316(5829): p. 1331-6.
165. Sladek, R., et al., A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 2007. 445(7130): p. 881-5.
166. McClellan, J. and M.-C. King, Genetic Heterogeneity in Human Disease. Cell, 2010. 141(2): p. 210-217.
167. Hardy, J. and A. Singleton, Genomewide Association Studies and Human Disease. N Engl J Med, 2009. 360(17): p. 1759-1768.
168. Manolio, T.A., L.D. Brooks, and F.S. Collins, A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 2008. 118(5): p. 1590-605.
169. Wetterstrand, K. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. 2011 [cited 2011/08/12]; Available from: http://www.genome.gov/sequencingcosts.
170. Frayling, T.M., et al., A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science, 2007. 316(5826): p. 889-94.
171. McCarthy, M.I., et al., Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, 2008. 9(5): p. 356-369.
172. NCI-NHGRI Working Group on Replication in Association Studies, Replicating genotype-phenotype associations. Nature, 2007. 447(7145): p. 655-660.
173. Ioannidis, J.P., et al., Replication validity of genetic association studies. Nat Genet, 2001. 29(3): p. 306-9.
174. Christakis, N.A. and J.H. Fowler, The spread of obesity in a large social network over 32 years. N Engl J Med, 2007. 357(4): p. 370-9.
175. Pearson, J.F., et al., Association Between Fine Particulate Matter and Diabetes Prevalence in the U.S. Diabetes Care, 2010. 33(10): p. 2196-2201.
176. Butte, A.J. and I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 2000: p. 418-29.
177. Butte, A.J., et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci U S A, 2000. 97(22): p. 12182-6.
178. Austin, P.C., et al., Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol, 2006. 59(9): p. 964-9.
179. Young, S.S., Acknowledge and fix the multiple testing problem. Int J Epidemiol, 2010. 39(3): p. 934; author reply 934-5.
180. Young, S.S. and M. Yu, Association of bisphenol A with diabetes and other abnormalities. J Am Med Assoc, 2009. 301(7): p. 720-1; author reply 721-2.
181. Smith, G.D., et al., Clustered environments and randomized genes: a fundamental distinction between conventional and genetic epidemiology. PLoS Med, 2007. 4(12): p. e352.
182. Greenland, S., Randomization, Statistics, and Causal Inference. Epidemiology, 1990. 1(6): p. 421-429.
183. Greenland, S. and H. Morgenstern, Confounding in Health Research. Annu Rev Public Health, 2001. 22(1): p. 189-212.
184. Peto, R., et al., Can dietary beta-carotene materially reduce human cancer rates? Nature, 1981. 290(5803): p. 201-208.
185. Omenn, G.S., et al., Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med, 1996. 334(18): p. 1150-5.
186. Hooper, L., A.R. Ness, and G.D. Smith, Antioxidant strategy for cardiovascular diseases. Lancet, 2001. 357(9269): p. 1705-6.
187. Bartell, S.M., W.C. Griffith, and E.M. Faustman, Temporal error in biomarker-based mean exposure estimates for individuals. J Expo Anal Environ Epidemiol, 2004. 14(2): p. 173-179.
188. Manly, B.F., Randomization, Bootstrap and Monte Carlo Methods in Biology. 3 ed. 2007, Boca Raton: Chapman and Hall/CRC.
189. Efron, B., Large-Scale Inference. 2010, Cambridge: Cambridge University Press.
190. Westfall, P.H. and S.S. Young, Resampling-based Multiple Testing. 1993, New York: Wiley.
191. Witten, D.M. and R. Tibshirani, Survival analysis with high-dimensional covariates. Stat Methods Med Res, 2010. 19(1): p. 29-51.
192. Tibshirani, R., Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996. 58(1): p. 267-288.
193. Zou, H. and T. Hastie, Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 2005. 67(2): p. 301-320.
194. Hastie, T., R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 ed. 2009: Springer.
195. Vittinghoff, E., et al., Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2005, New York: Springer.
196. Todd, J.A., D'oh! Genes and Environment Cause Crohn's Disease. Cell, 2010. 141(7): p. 1114-1116.
197. Fallin, M.D. and W.H.L. Kao, Is 'X'-WAS the Future for All of Epidemiology? Epidemiology, 2011. 22(4): p. 457-459. doi: 10.1097/EDE.0b013e31821d3a9f.
198. Mak, H.C., Trends in computational biology - 2010. Nat Biotech, 2011. 29(1): p. 45-45.
199. Heard, E., et al., Ten years of genetics and genomics: what have we achieved and where are we heading? Nat Rev Genet, 2010. 11(10): p. 723-733.
200. Borrell, B., Epidemiology: Every bite you take. Nature, 2011. 470(7334): p. 320-2.
201. Mathers, C.D. and D. Loncar, Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med, 2006. 3(11): p. e442.
202. Zeggini, E., et al., Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet, 2008. 40(5): p. 638-45.
203. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA, 2009. 106(23): p. 9362-9367.
204. ADA. Diabetes Information -- All About Diabetes. 2009 [cited 6/1/2009]; Available from: http://www.diabetes.org/about-diabetes.jsp.
205. Lumley, T., survey: analysis of complex survey samples, 2009.
206. R Development Core Team, R: A language for statistical computing, 2009, R Foundation for Statistical Computing: Vienna, Austria.
207. CDC and National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Analytic Guidelines. 2003 [cited 2010 2/19/2010]; Available from: http://www.cdc.gov/nchs/data/nhanes/nhanes_03_04/nhanes_analytic_guidelines_dec_2005.pdf.
208. Cowie, C.C., et al., Prevalence of diabetes and impaired fasting glucose in adults in the U.S. population: National Health And Nutrition Examination Survey 1999-2002. Diabetes Care, 2006. 29(6): p. 1263-8.
209. Abahusain, M.A., et al., Retinol, alpha-tocopherol and carotenoids in diabetes. Eur J Clin Nutr, 1999. 53(8): p. 630-5.
210. Polidori, M.C., et al., Plasma levels of lipophilic antioxidants in very old patients with type 2 diabetes. Diabetes Metab Res Rev, 2000. 16(1): p. 15-9.
211. Arnlov, J., et al., Serum and dietary beta-carotene and alpha-tocopherol and incidence of type 2 diabetes mellitus in a community-based study of Swedish men: report from the Uppsala Longitudinal Study of Adult Men (ULSAM) study. Diabetologia, 2009. 52(1): p. 97-105.
212. Ford, E.S., et al., Diabetes mellitus and serum carotenoids: findings from the Third National Health and Nutrition Examination Survey. Am J Epidemiol, 1999. 149(2): p. 168-76.
213. Ylonen, K., et al., Dietary intakes and plasma concentrations of carotenoids and tocopherols in relation to glucose metabolism in subjects at high risk of type 2 diabetes: the Botnia Dietary Study. Am J Clin Nutr, 2003. 77(6): p. 1434-41.
214. Wang, L., et al., Plasma lycopene, other carotenoids, and the risk of type 2 diabetes in women. Am J Epidemiol, 2006. 164(6): p. 576-85.
215. Montonen, J., et al., Dietary antioxidant intake and risk of type 2 diabetes. Diabetes Care, 2004. 27(2): p. 362-6.
216. Song, Y., et al., Effects of vitamins C and E and beta-carotene on the risk of type 2 diabetes in women at high risk of cardiovascular disease: a randomized controlled trial. Am J Clin Nutr, 2009. 90(2): p. 429-37.
217. Kataja-Tuomola, M., et al., Effect of alpha-tocopherol and beta-carotene supplementation on the incidence of type 2 diabetes. Diabetologia, 2008. 51(1): p. 47-53.
218. Codru, N., et al., Diabetes in relation to serum levels of polychlorinated biphenyls and chlorinated pesticides in adult Native Americans. Environ Health Perspect, 2007. 115(10): p. 1442-7.
219. Uemura, H., et al., Associations of environmental exposure to dioxins with prevalent diabetes among general inhabitants in Japan. Environ Res, 2008. 108(1): p. 63-8.
220. Rignell-Hydbom, A., L. Rylander, and L. Hagmar, Exposure to persistent organochlorine pollutants and type 2 diabetes mellitus. Hum Exp Toxicol, 2007. 26(5): p. 447-52.
221. Wang, S.L., et al., Increased risk of diabetes and polychlorinated biphenyls and dioxins: a 24-year follow-up study of the Yucheng cohort. Diabetes Care, 2008. 31(8): p. 1574-9.
222. Jiang, Q., et al., gamma-tocopherol, the major form of vitamin E in the US diet, deserves more attention. Am J Clin Nutr, 2001. 74(6): p. 714-22.
223. Burton, G.W., et al., Human plasma and tissue alpha-tocopherol concentrations in response to supplementation with deuterated natural and synthetic vitamin E. Am J Clin Nutr, 1998. 67(4): p. 669-84.
224. Campbell, S., et al., Development of gamma (gamma)-tocopherol as a colorectal cancer chemopreventive agent. Crit Rev Oncol Hematol, 2003. 47(3): p. 249-59.
225. Agency for Toxic Substances and Disease Registry. Heptachlor and Heptachlor Epoxide. 2007 [cited 2009 8/1/2009]; Available from: http://www.atsdr.cdc.gov/tfacts12.html.
226. Office of Water Regulations and Standards, Ambient Water Quality Criteria for Heptachlor, ed. U.S.E.P. Agency. Vol. EPA 440 5-80-052. 1980, Washington, DC: United States Environmental Protection Agency.
227. Montgomery, M.P., et al., Incident diabetes and pesticide exposure among licensed pesticide applicators: Agricultural Health Study, 1993-2003. Am J Epidemiol, 2008. 167(10): p. 1235-46.
228. Zeggini, E., et al., Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 2007. 316(5829): p. 1336-41.
229. Heller, D.A., et al., Genetic and environmental influences on serum lipid levels in twins. N Engl J Med, 1993. 328(16): p. 1150-6.
230. Costanza, M.C., et al., Relative Contributions of Genes, Environment, and Interactions to Blood Lipid Concentrations in a General Adult Population. Am J Epidemiol, 2005. 161(8): p. 714-724.
231. Kris-Etherton, P.M., et al., The effect of diet on plasma lipids, lipoproteins, and coronary heart disease. J Am Diet Assoc, 1988. 88(11): p. 1373-400.
232. Schaefer, E.J., Lipoproteins, nutrition, and heart disease. Am J Clin Nutr, 2002. 75(2): p. 191-212.
233. Varady, K.A. and P.J. Jones, Combination diet and exercise interventions for the treatment of dyslipidemia: an effective preliminary strategy to lower cholesterol levels? J Nutr, 2005. 135(8): p. 1829-35.
234. Craig, W.Y., G.E. Palomaki, and J.E. Haddow, Cigarette smoking and serum lipid and lipoprotein concentrations: an analysis of published data. BMJ, 1989. 298(6676): p. 784-8.
235. Kraus, W.E., et al., Effects of the amount and intensity of exercise on plasma lipoproteins. N Engl J Med, 2002. 347(19): p. 1483-92.
236. Brook, R.D., et al., Particulate Matter Air Pollution and Cardiovascular Disease: An Update to the Scientific Statement From the American Heart Association. Circulation, 2010. 121(21): p. 2331-2378.
237. Ezzati, T.M., et al., Sample design: Third National Health and Nutrition Examination Survey. Vital Health Stat 2, 1992(113): p. 1-35.
238. Storey, J.D., A Direct Approach to False Discovery Rates. J R Statist Soc B, 2002. 64: p. 479-498.
239. American Heart Association. Drug Therapy for Cholesterol. 2010 [cited 2010 10/5]; Available from: http://www.heart.org/HEARTORG/Conditions/Cholesterol/PreventionTreatmentofHighCholesterol/Drug-Therapy-for-Cholesterol_UCM_305632_Article.jsp.
240. Ainsworth, B.E., et al., Compendium of physical activities: an update of activity codes and MET intensities. Med Sci Sports Exerc, 2000. 32(9 Suppl): p. S498-504.
241. Nelson, M.E., et al., Physical activity and public health in older adults: recommendation from the American College of Sports Medicine and the American Heart Association. Med Sci Sports Exerc, 2007. 39(8): p. 1435-45.
242. Cohen, J., Statistical power analysis for the behavioral sciences. 2 ed. 1988, Hillsdale, NJ: Erlbaum.
243. Fryar, C.D., et al., Hypertension, high serum total cholesterol, and diabetes: racial and ethnic prevalence differences in U.S. adults, 1999-2006. NCHS Data Brief, 2010(36): p. 1-8.
244. Ford, E.S., et al., Hypertriglyceridemia and Its Pharmacologic Treatment Among US Adults. Arch Intern Med, 2009. 169(6): p. 572-578.
245. Harrison, E.H., Mechanisms of digestion and absorption of dietary vitamin A. Annu Rev Nutr, 2005. 25: p. 87-103.
246. Willett, W.C., Nutritional Epidemiology. 1998, New York: Oxford University Press.
247. Yusuf, S., et al., Vitamin E supplementation and cardiovascular events in high-risk patients. The Heart Outcomes Prevention Evaluation Study Investigators. N Engl J Med, 2000. 342(3): p. 154-60.
248. Omenn, G.S., et al., Long-term vitamin A does not produce clinically significant hypertriglyceridemia: results from CARET, the beta-carotene and retinol efficacy trial. Cancer Epidemiol Biomarkers Prev, 1994. 3(8): p. 711-3.
249. Redlich, C.A., et al., Effect of long-term beta-carotene and vitamin A on serum cholesterol and triglyceride levels among participants in the Carotene and Retinol Efficacy Trial (CARET). Atherosclerosis, 1999. 145(2): p. 425-32.
250. Vivekananthan, D.P., et al., Use of antioxidant vitamins for the prevention of cardiovascular disease: meta-analysis of randomised trials. Lancet, 2003. 361(9374): p. 2017-2023.
251. Mente, A., et al., A Systematic Review of the Evidence Supporting a Causal Link Between Dietary Factors and Coronary Heart Disease. Arch Intern Med, 2009. 169(7): p. 659-669.
252. Willcox, B.J., J.D. Curb, and B.L. Rodriguez, Antioxidants in cardiovascular health and disease: key lessons from epidemiologic studies. Am J Cardiol, 2008. 101(10A): p. 75D-86D.
253. Bender, D., Nutritional Biochemistry of the Vitamins. 2003, Cambridge: University of Cambridge Press.
254. Ogihara, T., et al., Distribution of tocopherol among human plasma lipoproteins. Clin Chim Acta, 1988. 174(3): p. 299-305.
255. Winbauer, A.N., S.S. Pingree, and K.L. Nuttall, Evaluating serum alpha-tocopherol (vitamin E) in terms of a lipid ratio. Ann Clin Lab Sci, 1999. 29(3): p. 185-91.
256. Semmler, A., et al., Plasma folate levels are associated with the lipoprotein profile: a retrospective database analysis. Nutrition Journal, 2010. 9(1): p. 31.
257. Jorde, R., et al., High serum 25-hydroxyvitamin D concentrations are associated with a favorable serum lipid profile. Eur J Clin Nutr, 2010.
258. Smith, K.M., et al., Relationship between fish intake, n-3 fatty acids, mercury and risk markers of CHD (National Health and Nutrition Examination Survey 1999-2002). Public Health Nutr, 2009. 12(8): p. 1261-9.
259. Hu, F.B. and W.C. Willett, Optimal Diets for Prevention of Coronary Heart Disease. J Am Med Assoc, 2002. 288(20): p. 2569-2578.
260. Joshipura, K.J., et al., The Effect of Fruit and Vegetable Intake on Risk for Coronary Heart Disease. Ann Intern Med, 2001. 134(12): p. 1106-1114.
261. Bassett, C.M., D. Rodriguez-Leyva, and G.N. Pierce, Experimental and clinical research findings on the cardiovascular benefits of consuming flaxseed. Appl Physiol Nutr Metab, 2009. 34(5): p. 965-74.
262. Pan, A., et al., Meta-analysis of the effects of flaxseed interventions on blood lipids. Am J Clin Nutr, 2009. 90(2): p. 288-97.
263. Park, D., T. Huang, and W.H. Frishman, Phytoestrogens as cardioprotective agents. Cardiol Rev, 2005. 13(1): p. 13-7.
264. Xu, X., et al., Studying associations between urinary metabolites of polycyclic aromatic hydrocarbons (PAHs) and cardiovascular diseases in the United States. Sci Total Environ, 2010. 408(21): p. 4943-4948.
265. Pope, C.A., III, et al., Cardiovascular Mortality and Long-Term Exposure to Particulate Air Pollution: Epidemiological Evidence of General Pathophysiological Pathways of Disease. Circulation, 2004. 109(1): p. 71-77.
266. Miller, K.A., et al., Long-term exposure to air pollution and incidence of cardiovascular events in women. N Engl J Med, 2007. 356(5): p. 447-58.
267. Wilson, P.W., et al., Factors associated with lipoprotein cholesterol levels. The Framingham study. Arteriosclerosis, 1983. 3(3): p. 273-81.
268. Njolstad, I., E. Arnesen, and P.G. Lund-Larsen, Smoking, serum lipids, blood pressure, and sex differences in myocardial infarction. A 12-year follow-up of the Finnmark Study. Circulation, 1996. 93(3): p. 450-6.
269. Moffatt, R.J., et al., Acute exposure to environmental tobacco smoke reduces HDL-C and HDL2-C. Prev Med, 2004. 38(5): p. 637-41.
270. Uemura, H., et al., Prevalence of metabolic syndrome associated with body burden levels of dioxin and related compounds among Japan's general population. Environ Health Perspect, 2009. 117(4): p. 568-73.
271. Dirinck, E., et al., Obesity and Persistent Organic Pollutants: Possible Obesogenic Effect of Organochlorine Pesticides and Polychlorinated Biphenyls. Obesity (Silver Spring), 2010.
272. Goncharov, A., et al., High serum PCBs are associated with elevation of serum lipids and cardiovascular disease in a Native American population. Environ Res, 2008. 106(2): p. 226-39.
273. Sergeev, A.V. and D.O. Carpenter, Hospitalization rates for coronary heart disease in relation to residence near areas contaminated with persistent organic pollutants and other pollutants. Environ Health Perspect, 2005. 113(6): p. 756-61.
274. Gustavsson, P. and C. Hogstedt, A cohort study of Swedish capacitor manufacturing workers exposed to polychlorinated biphenyls (PCBs). Am J Ind Med, 1997. 32(3): p. 234-9.
275. Morgan, T.M., et al., Nonvalidation of reported genetic risk factors for acute coronary syndrome in a large-scale replication study. J Am Med Assoc, 2007. 297(14): p. 1551-61.
276. Boffetta, P., et al., False-positive results in cancer epidemiology: a plea for epistemological modesty. J Natl Cancer Inst, 2008. 100(14): p. 988-95.
277. Lang, I.A., et al., Association of urinary bisphenol A concentration with medical disorders and laboratory abnormalities in adults. J Am Med Assoc, 2008. 300(11): p. 1303-10.
278. Navas-Acien, A., et al., Arsenic exposure and prevalence of type 2 diabetes in US adults. J Am Med Assoc, 2008. 300(7): p. 814-22.
279. Everett, C.J., et al., Association of a polychlorinated dibenzo-p-dioxin, a polychlorinated biphenyl, and DDT with diabetes in the 1999-2002 National Health and Nutrition Examination Survey. Environ Res, 2007. 103(3): p. 413-8.
280. Lee, D.H., et al., Association between serum concentrations of persistent organic pollutants and insulin resistance among nondiabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetes Care, 2007. 30(3): p. 622-8.
281. Lee, D.H., et al., Relationship between serum concentrations of persistent organic pollutants and the prevalence of metabolic syndrome among non-diabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetologia, 2007. 50(9): p. 1841-51.
282. Kitao, Y., et al., A contribution to genome-wide association studies: search for susceptibility loci for schizophrenia using DNA microsatellite markers on chromosomes 19, 20, 21 and 22. Psychiatr Genet, 2000. 10(3): p. 139-43.
283. Ohnishi, Y., et al., A high-throughput SNP typing system for genome-wide association studies. J Hum Genet, 2001. 46(8): p. 471-7.
284. Duncan, D.E., Experimental man: what one man's body reveals about his future, your health, and our toxic world. 2009, Hoboken, NJ: Wiley.
285. Ober, C. and D. Vercelli, Gene-environment interactions in human disease: nuisance or opportunity? Trends Genet, 2011. 27(3): p. 107-15.
286. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50.
287. National Institute of Child Health and Human Development. Phenylketonuria. 2010 3/24/2010 [cited 2010 8/18]; Available from: http://www.nichd.nih.gov/health/topics/phenylketonuria.cfm.
288. Crowley, J.J., P.F. Sullivan, and H.L. McLeod, Pharmacogenomic genome-wide association studies: lessons learned thus far. Pharmacogenomics, 2009. 10(2): p. 161-3.
289. Khoury, M.J., M.J. Adams, Jr., and W.D. Flanders, An epidemiologic approach to ecogenetics. Am J Hum Genet, 1988. 42(1): p. 89-95.
290. Garrod, A., Alkaptonuria. Lancet, 1902: p. 653-656.
291. Garrod, A., The Inborn Factors in Disease: An Essay. 1931, Oxford: Clarendon Press.
292. Motulsky, A.G., Drug reactions, enzymes, and biochemical genetics. J Am Med Assoc, 1957. 165(7): p. 835-837.
293. Khoury, M.J., T.H. Beaty, and B. Cohen, Fundamentals of Genetic Epidemiology. 1 ed. Vol. 1. 1993, New York: Oxford University Press.
294. Siemiatycki, J. and D.C. Thomas, Biological models and statistical interactions: an example from multistage carcinogenesis. Int J Epidemiol, 1981. 10(4): p. 383-7.
295. Wang, X., R.C. Elston, and X. Zhu, Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet, 2010. 12(1): p. 74.
296. Kellermann, G., C.R. Shaw, and M. Luyten-Kellerman, Aryl hydrocarbon hydroxylase inducibility and bronchogenic carcinoma. N Engl J Med, 1973. 289(18): p. 934-7.
297. Stern, M.C., et al., Polymorphisms in DNA repair genes, smoking, and bladder cancer risk: findings from the international consortium of bladder cancer. Cancer Res, 2009. 69(17): p. 6857-64.
298. Vineis, P., et al., Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: a pooled analysis of genotype-based studies. Cancer Epidemiol Biomarkers Prev, 2001. 10(12): p. 1249-52.
299. Grarup, N. and G. Andersen, Gene-environment interactions in the pathogenesis of type 2 diabetes and metabolism. Curr Opin Clin Nutr Metab Care, 2007. 10(4): p. 420-6.
300. Romao, I. and J. Roth, Genetic and environmental interactions in obesity and type 2 diabetes. J Am Diet Assoc, 2008. 108(4 Suppl 1): p. S24-8.
301. Khoury, M.J. and S. Wacholder, Invited commentary: from genome-wide association studies to gene-environment-wide interaction studies--challenges and opportunities. Am J Epidemiol, 2009. 169(2): p. 227-30; discussion 234-5.
302. Hetherington, M.M. and J.E. Cecil, Gene-environment interactions in obesity. Forum Nutr, 2010. 63: p. 195-203.
303. Memisoglu, A., et al., Interaction between a peroxisome proliferator-activated receptor gamma gene polymorphism and dietary fat intake in relation to body mass. Hum Mol Genet, 2003. 12(22): p. 2923-9.
304. Cornelis, M.C., et al., TCF7L2, dietary carbohydrate, and risk of type 2 diabetes in US women. Am J Clin Nutr, 2009. 89(4): p. 1256-1262.
305. Ioannidis, J.P., Why most discovered true associations are inflated. Epidemiology, 2008. 19(5): p. 640-8.
306. Omenn, G.S., Overview of the symposium on public health significance of genomics and eco-genetics. Annu Rev Public Health, 2010. 31: p. 1-8.
307. Ioannidis, J.P., Commentary: grading the credibility of molecular evidence for complex diseases. Int J Epidemiol, 2006. 35(3): p. 572-8; discussion 593-6.
308. Chen, R., et al., Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS One, 2010. 5(10): p. e13574.
309. Nyholt, D.R., A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet, 2004. 74(4): p. 765-9.
310. Bůžková, P., T. Lumley, and K. Rice, Permutation and Parametric Bootstrap Tests for Gene–Gene and Gene–Environment Interactions. Ann Hum Genet, 2011. 75(1): p. 36-45.
311. Wilson, P.W., et al., Prediction of incident diabetes mellitus in middle-aged adults: the Framingham Offspring Study. Arch Intern Med, 2007. 167(10): p. 1068-74.
312. Lyssenko, V., et al., Predictors of and Longitudinal Changes in Insulin Sensitivity and Secretion Preceding Onset of Type 2 Diabetes. Diabetes, 2005. 54(1): p. 166-174.
313. Gauderman, J. and J. Morrison, QUANTO - a program to compute power for G x E and G x G studies, 2009: Los Angeles.
314. Vineis, P., A self-fulfilling prophecy: are we underestimating the role of the environment in gene-environment interaction research? Int J Epidemiol, 2004. 33(5): p. 945-946.
315. Hayes, M.G., et al., Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies. Diabetes, 2007. 56(12): p. 3033-44.
316. Ioannidis, J.P., Population-wide generalizability of genome-wide discovered associations. J Natl Cancer Inst, 2009. 101(19): p. 1297-9.
317. Shu, X.O., et al., Identification of new genetic risk variants for type 2 diabetes. PLoS Genet, 2010. 6(9): p. e1001127.
318. Yamauchi, T., et al., A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet, 2010. 42(10): p. 864-8.
319. Tsai, F.J., et al., A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet, 2010. 6(2): p. e1000847.
320. Unoki, H., et al., SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat Genet, 2008. 40(9): p. 1098-102.
321. Mailman, M.D., et al., The NCBI dbGaP database of genotypes and phenotypes. Nat Genet, 2007. 39(10): p. 1181-6.
322. Environmental Working Group and Commonweal. EWG || Human Toxome Project. [cited 2009 11/11/2009]; Available from: http://www.ewg.org/sites/humantoxome/.