ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND DISEASE-RELATED PHENOTYPES

A DISSERTATION

SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Chirag Jagdish Patel

August 2011

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/mg775gw7130

© 2011 by Chirag Jagdish Patel. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Atul Butte, Primary Adviser


Jayanta Bhattacharya


Mark Cullen

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

ABSTRACT

Common diseases arise out of a combination of both genetic and environmental

influences. Advances in genomic technology have enabled investigators to

create hypotheses regarding the contribution of genetic factors at a

breathtaking pace. However, the assessment of multiple and specific

environmental factors, and their interactions with the genome, has not kept pace. We

lack high-throughput analytic methodologies to comprehensively and

systematically associate multiple physical and specific environmental factors,

or the “envirome”, to disease and human health.

We claim that the creation of hypotheses regarding the environmental

contribution to disease is practicable through high-throughput analytic methods

that have been well established in genomics. In the following dissertation, we

develop and apply methods to systematically and comprehensively associate

specific factors of the envirome with disease states, prioritizing factors for in-

depth future study.

The current disciplines of studying the environmental determinants of health

include toxicology and epidemiology, which operate on molecular and

population scales, respectively. This dissertation proposes approaches in both

of these disciplines. For example, we have developed a framework to conduct

the first “Environment-wide Association Study” (EWAS), systematically

associating environmental factors to disease on a population scale. We have

applied this framework to investigate type 2 diabetes and heart disease on

cohorts that are representative of the United States population, finding novel and

robust associations in diverse and independent cohorts. Given the lack of

explained risk resulting from current day genome-wide studies, the time is ripe

to usher in a more comprehensive study of the environment, or “enviromics”,

toward better understanding of multifactorial diseases and their prevention.

ACKNOWLEDGEMENTS

Foremost, I thank my advisor, Dr. Atul Butte, for his undying confidence,

inspiration, and guidance. Even just three years ago, it was far from my belief

that the scientist whom I admired from afar would eventually take me on as a

student and teach me how to compute, see, and enlighten. For Dr. Atul Butte’s

supervision I am forever indebted and most fortunate.

I am also indebted to my dissertation committee, Drs. Jay Bhattacharya, Mark

Cullen, John Ioannidis, and Robert Tibshirani. Much of this work has come

out of discussions with these individuals and it is inspired by and stands on

their fundamental teachings. I thank my academic advisors, Drs. Mark Musen

and Betty Cheng, for encouraging me to keep taking courses that enabled this

work.

I thank my many friends and colleagues in the Butte Laboratory and in the

Biomedical Informatics program whom I continue to look up to and draw

inspiration from. I feel honored and privileged to be among you. In particular,

I thank Dr. Rong Chen, Alex Morgan, Joel Dudley, and Nick Tatonetti for

providing support and encouragement when it was least expected but most

needed.

For everything from teaching me how to read and write to gifting me the newest

computers, I thank my parents, Neela and Jagdish Patel. I will always be

grateful to them for initiating this most rewarding journey of lifelong learning.

I thank my brother, Ankur Patel, for his unflagging support and faith through

thick and thin.

I thank my in-laws, Tapan and Kokila Chaudhuri, for their support and

encouragement.

I do not have the words to thank my partner in life, Trina Chaudhuri. I hope

that I can some day enable her to achieve her aspirations as she has done for

me.

I am grateful to the National Library of Medicine and Applied Biosystems, Inc.

for financial support. I thank the Centers for Disease Control and Prevention

(CDC), the National Center for Health Statistics (NCHS), and the staff and

individuals who take part in the National Health and Nutrition Examination

Survey (NHANES). In particular, I thank Vijay Gambhir and Peter Meyer of

the CDC/NCHS for their support in accessing and processing NHANES

restricted genetic data. I am grateful again to Dr. Atul Butte for providing

funds to access the NHANES restricted data. I thank the staff of the

Biomedical Informatics Training program and the Butte Laboratory, Mary

Jeanne Oliva, Susan Aptekar, Alex Skrenchuk, Dr. Russ Altman, and Dr. Larry

Fagan. Without the support of these institutions and people, this work would

not have been possible.

A portion of the work in this dissertation derives from two published articles

and two articles currently in review for publication:

Chapter 2:

1. Patel, C. J. and A. J. Butte, Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics, 2010. 3(1): p. 17.

Chapter 4:

2. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.

3. Patel, C.J., M. R. Cullen, J.P.A. Ioannidis, A.J. Butte, Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. Submitted, 7/2011.

Chapter 5:

4. Patel, C.J., R. Chen, J.P.A. Ioannidis, A.J. Butte, Systematic identification of interaction effects between validated genome- and environment-wide associations on Type 2 Diabetes Mellitus. Submitted, 8/2011.

In the Chapter 2 work, I devised the methodology and wrote the manuscript

with my advisor, Atul Butte. In the Chapter 4 work, I devised the

“Environment-wide-Association Study” (EWAS) framework and carried out

the analyses. For the EWAS on Type 2 Diabetes, I wrote the manuscripts with

Jay Bhattacharya and Atul Butte. For the EWAS on serum lipid levels, I wrote

and edited the manuscripts with Mark Cullen, John Ioannidis, and Atul Butte.

Finally, in the Chapter 5 work, I devised the “Gene-Environment-Wide

Association Study” (G-EWAS) framework and implemented the software to

carry out the analyses. Rong Chen and Atul Butte provided the database of

curated genetic information. I interpreted the data and wrote the manuscript

with Rong Chen, John Ioannidis, and Atul Butte.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING ENVIRONMENTAL ASSOCIATIONS TO DISEASE
  What is the “Environment”? What is the “Envirome”?
  Creation of robust hypotheses connecting the environment, genome, and multifactorial disease
    Creating hypotheses comprehensively on a population scale
    Creating hypotheses comprehensively on a molecular or toxicological scale
  Discussion

CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES TO COMPLEX DISEASE
  INTRODUCTION
  METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO GENE EXPRESSION RESPONSE
  RESULTS
    Verification Phase
    Predicting Environmental Chemicals Associated with Cancer Data Sets
    Clustering Significant Predictions by PubChem-derived Biological Activity
  DISCUSSION

CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE ASSOCIATIONS ON DISEASE AND DISEASE-RELATED PHENOTYPES ON POPULATIONS.
  INTRODUCTION
  METHODS BACKGROUND
    Genome-wide association to disease
    Environment-wide association to disease
    Genetic versus non-genetic associations in population scaled studies
  EWAS METHOD
    Stage 1: Linear Modeling
    Stage 2: Controlling for Multiple Hypotheses by Estimating the False Discovery Rate
    Stage 3: Validation
    Stage 4: Sensitivity Analyses
    Stage 5: Correlation Globes
  DISCUSSION

CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2 DIABETES (T2D) AND SERUM LIPID LEVELS
  INTRODUCTION
  ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2 DIABETES
    EWAS on T2D: Methods
    EWAS on T2D: Results
    EWAS on T2D: Conclusions
  ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID LEVELS
    EWAS on Serum Lipids: Methods
    EWAS on Serum Lipids: Results
    EWAS on Serum Lipids: Conclusions
  DISCUSSION

CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN ASSOCIATION TO TYPE 2 DIABETES.
  INTRODUCTION
    Background
    Screening for Gene-Environment Interactions: “G-EWAS”
  METHODS
    Data and selected genetic and environmental factors
    Regression Analyses
    Multiplicity Correction and FDR
  RESULTS
    Allele frequencies
    Power Calculations
    Marginal Associations
    Correlation between genetic variants with environmental variables
    Screening for Genetic Variant by Environment Interactions
    Sensitivity Analyses limited to non-Hispanic white and other Hispanic participants and older individuals
    Limited Evidence to Support Interactions with Body Mass Index
  DISCUSSION

CHAPTER 6: CONCLUSION AND DISCUSSION

REFERENCES

LIST OF TABLES

Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms.
Table 2. Gene expression dataset summary for verification stage.
Table 3. Chemical Prediction Results from the Verification Phase.
Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919).
Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072).
Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883).
Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort.
Table 8. Marginal association of each locus (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL).

LIST OF FIGURES

Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward.
Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE.
Figure 3. Envirome-disease network for WHO priority diseases.
Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases.
Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome.
Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD).
Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures.
Figure 8. Predicting environmental chemical association to gene expression datasets.
Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem.
Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets.
Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets.
Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets.
Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity.
Figure 14. Sample data structure for EWAS.
Figure 15. Method summary for EWAS on NHANES data.
Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D.
Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels.
Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI.
Figure 19. Forest plot for validated environmental factors associated with LDL-C.
Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C.
Figure 21. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(triglycerides).
Figure 22. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug or had heart disease.
Figure 23. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(HDL-C).
Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides.
Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.
Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C.
Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D.
Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL).
Figure 29. Manhattan plot of significance values of interaction term (-log10(p-value) for interaction term of pair of factors).
Figure 30. Per-risk allele effect sizes for top putative interactions with p < 0.05.

CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING ENVIRONMENTAL ASSOCIATIONS TO DISEASE

Environmental factors (“non-genetic” and often modifiable attributes such as

diet, drugs, chemical pollutants, ecological processes, and infectious agents),

in addition to genetic factors, contribute to disease and health [1, 2].

Epidemiologists and toxicologists have for decades formally studied how

much disease risk environmental and genetic factors each impart on the

population [3, 4].

For example, epidemiologists have measured the attributable fraction, or the

fraction of disease that would be eliminated if the factor were to be eliminated.

For some complex diseases, up to 70-90% of disease risk can be attributed

to differing “environments” [4, 5], defined here as a cumulative effect of

specific factors. Genetic factors, on the other hand, may also play a large role;

for example, heritability in obesity is estimated to range from 40-70% [6]. To

determine what specific genetic or environmental factors contribute to disease,

epidemiologists begin with an associative study in which variation for a factor

is compared to disease status; for example, the presence of an environmental

factor is compared in diseased versus undiseased individuals.
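
To make the two quantities above concrete, the following Python sketch computes an odds ratio from a 2x2 table of exposure by disease status, and an attributable fraction using Levin's formula, AF = p(RR - 1) / (1 + p(RR - 1)), where p is the exposure prevalence and RR is the relative risk. The choice of Levin's formula, the function names, and all counts are illustrative assumptions and are not taken from the studies cited above.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: share of disease that would vanish without the exposure."""
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical counts: 30 of 100 cases exposed, 10 of 100 controls exposed.
example_or = odds_ratio(30, 70, 10, 90)

# Hypothetical inputs: 30% of the population exposed, relative risk of 2.
example_af = attributable_fraction(0.30, 2.0)

print(f"odds ratio = {example_or:.2f}, attributable fraction = {example_af:.2f}")
```

An attributable fraction of zero corresponds to a relative risk of 1, that is, a factor whose removal would not change disease occurrence.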

Despite both types of factors contributing to disease and health, many

investigations since the 1990s have focused on investigating the genetic factors

(Figure 1). Recently, genetic association studies have advanced through a

framework known as the “Genome-wide Association Study”, or GWAS. A notable

example is the Wellcome Trust Case Control Consortium study of 7 common

diseases [7]. In GWAS, 100,000 to 1 million genetic factors

are compared in frequency between diseased and non-diseased populations [8]

and to date, there have already been over 350 of these studies [9]. This is

contrasted with environmental epidemiological studies, which currently

investigate at most a handful of candidate factors at a time in association to

disease or phenotype. Relatedly, we lack methods to comprehensively and

systematically report environmental associations to disease [10].

Environmental epidemiological studies are neither data-driven nor multi-

dimensional and therefore do not allow for discovery as is the case in genome-

wide studies.
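
As a minimal illustration of comparing the frequency of a single genetic factor between diseased and non-diseased groups, the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts. The counts are hypothetical, and the Bonferroni-style threshold for one million tests is an assumption used only to show why testing many factors demands a stringent significance level; it is not necessarily the procedure used in the cited studies.

```python
import math


def allelic_chi_square(case_risk, case_other, control_risk, control_other):
    """Pearson chi-square (1 degree of freedom) on a 2x2 table of allele counts."""
    n = case_risk + case_other + control_risk + control_other
    cross = case_risk * control_other - case_other * control_risk
    denominator = (
        (case_risk + case_other)
        * (control_risk + control_other)
        * (case_risk + control_risk)
        * (case_other + control_other)
    )
    statistic = n * cross ** 2 / denominator
    # With 1 degree of freedom the upper-tail probability reduces to erfc.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical allele counts for a single variant.
statistic, p_value = allelic_chi_square(60, 40, 40, 60)

# Assumed Bonferroni-style threshold if one million variants were tested.
threshold = 0.05 / 1_000_000

print(statistic, p_value, p_value < threshold)
```

A genome-wide study would repeat this test for every measured variant, which is why a nominally small p-value for one variant can still fall short of the corrected threshold.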

While epidemiology attempts to ascertain the contribution of factors on a

population scale, toxicology uses model biological systems to study the

influence of specific physical factors through assessment of molecular

responses [3, 11]. While technology exists to measure these molecular

responses on a genome-wide dimension [12], we lack methods to

comprehensively connect these with human disease state. In fact, the National

Academy of Sciences has called for a molecular-based effort to

comprehensively map toxicological findings with health and risk-associated

phenotypes [13-15].

Our claim is that creation of hypotheses regarding the environmental

associations to disease is possible through data-driven analytical methods that

are standard in multi-dimensional genome-wide studies. To this end, we

propose and implement (1) an integrative approach to connect molecular and

toxicological response to human disease states (Chapter 2); (2) a

population-based study framework to correlate multiple environmental factors

to disease, called an “Environment-wide Association Study” (EWAS) (Chapters

3 and 4); and (3) a method to predict how environmental factors

interact with genetic variants through unbiased integration of EWAS and

GWAS (Chapter 5). First, we introduce the “envirome”, a concept that denotes

a large collection of specific and possible environmental factors.

Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward, plotted as total publications by year published. We queried MEDLINE for all articles investigating World Health Organization priority diseases (Cardiovascular Disease, Coronary Disease, Hypertension, Type 2 Diabetes, Lung Cancer, Breast Cancer, Colon Cancer, Asthma, Chronic Obstructive Pulmonary Disease, Premature Birth, Kidney Disease, and Alzheimer Disease) and tabulated those strictly studying either genetics or the environment. Only articles investigating either disease and genetics or disease and environment as a primary subject matter were considered. Since the 1990s, genetic factors have been studied more than environmental factors for WHO priority diseases.

What is the “Environment”? What is the “Envirome”?

The “environment” is a loosely defined, heterogeneous mixture of non-genetic

factors. For example, specific physical environmental factors include

infectious agents (bacteria, viruses, fungi), dietary components (nutrients and

vitamins), ultraviolet radiation, and non-nutrient chemicals (such as drugs or

cigarette smoke). They may be byproducts of man-made processes, such as

air pollutants, or of natural processes, such as toxins from animals and plants.

Other types of environmental factors are non-“physical” (not associated with

one concrete factor) and are intertwined with behavior and life-style, such as

stress and exercise. The environment also includes factors that arise as a

result of interaction between internal biological processes and other

environmental factors, such as metabolites of environmental chemicals or

infectious agents [16]. Air pollution, lead, tobacco smoke, ultraviolet

radiation, occupational risk factors, and climate, in addition to infectious

agents, are example factors prioritized by the World Health Organization for

constant monitoring [17].

Environmental factors vary in their routes of exposure, modes of measurement,

and dynamics. This heterogeneous mixture is contrasted with genetic variant

factors assessed in GWAS, which are discrete and static units. The

homogeneous nature of genetic factors has helped to enable

standardization of measurement (e.g., polymerase chain reaction [18] and gene

“chips” [19]) and organization (through efforts by the National Center for

Biotechnology Information, NCBI [20]).

Despite these characteristics, we propose a concept that allows for the multi-

dimensional study of the environment, the “envirome”, the total “ensemble of

the environment” [21] that can influence biological processes such as disease.

While the “environment” refers to non-genetic factors in an abstract sense, we

refer to the “envirome” in a way analogous to the genome, as a collection of

specific and varying factors.

To get a better grasp of the types of and specific factors investigated in the

context of disease, we searched through all of the scientific and medical

literature in MEDLINE up to year 2009 for evidence of investigation between

an environmental factor and disease condition. This search is made possible

using Medical Subject Headings (MeSH), an annotation system administered

by the National Library of Medicine (NLM) and applied to all articles in

MEDLINE [22, 23]. These subject headings contain categories of terms, such

as diseases (or condition), chemicals, drugs and physiological attributes. These

sets of terms also have “qualifiers”, which contextualize relationships between

terms; sample qualifiers include “etiology” (indicating an etiological

relationship between a condition and an environmental

factor), “epidemiology” (indicating a population-based study of a

condition and factor), and “drug therapy” (indicating a therapeutic

relationship studied between the condition and factor), among others.

Specifically, we went through the MeSH annotations for each article and

looked for terms with environmental factors and diseases, qualified by an

etiological or epidemiological relationship. For example, an article examining

the effects of smoking on incident type 2 diabetes (e.g., [24]) is annotated with

the condition “Diabetes Mellitus, Type 2”, the specific environmental factor,

“Smoking”, qualified by “etiology”. Because many different aspects of factors

and conditions can be reported for a given article, we looked specifically for

those factors and conditions that MEDLINE annotators deemed the “major”, or

the main, subjects of the article. From this scraping of MEDLINE, we obtained

a comprehensive list of disease-environmental factor pairs investigated in the

medical literature. We also obtained the number of scientific

publications that investigated each factor-disease pair. This indicated the

degree or intensity to which a particular disease and factor has been studied.
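
The tallying step described above can be sketched as follows. The record layout, field names, and example articles are hypothetical simplifications and do not reflect the actual MEDLINE data format or the software used in this work; in particular, attaching the qualifier to the disease heading is an assumption.

```python
from collections import Counter

# Qualifiers treated as evidence of an environment-disease investigation.
ENVIRONMENT_QUALIFIERS = {"etiology", "epidemiology"}

# Hypothetical, simplified records (not the real MEDLINE format).
ARTICLES = [
    {
        "diseases": [{"term": "Diabetes Mellitus, Type 2", "major": True,
                      "qualifiers": ["etiology"]}],
        "factors": [{"term": "Smoking", "major": True}],
    },
    {
        "diseases": [{"term": "Diabetes Mellitus, Type 2", "major": True,
                      "qualifiers": ["epidemiology"]}],
        "factors": [{"term": "Smoking", "major": True}],
    },
    {
        "diseases": [{"term": "Lung Neoplasms", "major": True,
                      "qualifiers": ["etiology"]}],
        "factors": [{"term": "Asbestos", "major": True},
                    {"term": "Smoking", "major": False}],
    },
    {
        # Excluded: the only qualifier is therapeutic.
        "diseases": [{"term": "Diabetes Mellitus, Type 2", "major": True,
                      "qualifiers": ["drug therapy"]}],
        "factors": [{"term": "Metformin", "major": True}],
    },
]


def count_factor_disease_pairs(articles):
    """Count publications per (factor, disease) pair among major subjects."""
    pairs = Counter()
    for article in articles:
        diseases = {
            d["term"] for d in article["diseases"]
            if d["major"] and ENVIRONMENT_QUALIFIERS.intersection(d["qualifiers"])
        }
        factors = {f["term"] for f in article["factors"] if f["major"]}
        for factor in factors:
            for disease in diseases:
                pairs[(factor, disease)] += 1
    return pairs


pair_counts = count_factor_disease_pairs(ARTICLES)
for (factor, disease), n in pair_counts.most_common():
    print(f"{factor} / {disease}: {n}")
```

Each key of the resulting counter is one disease-factor pair, and its value is the number of publications investigating that pair.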

We attempt to assemble the envirome into coherent categories of

environmental factors using the MeSH annotations. Specifically, each factor in

MeSH is described by a set of terms. For example, “Smoking” is described by

the term “Individual Behavior” and “Lead” by the terms “Hazardous Substance”

and “Element”. We manually categorized each factor into 21 categories

based on these descriptors (Table 1). There is

notable overlap between factor categories involving drugs and chemicals; for

these factors, we opted to categorize all factors that are drugs as “drug” and

non-drug environmental chemicals (such as pesticides or materials) in

chemically oriented bins (“organic chemical”, “inorganic chemical”,

“element”).

The most highly investigated factor categories are drugs and medical

procedures, comprising 24% and 29% of all factor-disease pairs respectively.

In comparison, organic chemicals comprise only 8% of all factor-disease

relations. However, organic chemicals make up a comparatively large share of

all factors (15%), suggesting an opportunity to explore these factors.

Another opportunity lies in pinning down further composite factors lying in the

most abstract categories, such as “chemical”; for example, air pollution is a

complex matrix of specific factors belonging to other categories, each of which

might have a distinct contribution to disease. The envirome is a complex

entity due to the heterogeneity of factors.

We acknowledge overlap and imbalance of this categorization; however, we

stress that this first pass is exploratory and not definitive. We propose future

work, the “envirome sequencing project”, focused on how exactly the

envirome is defined and categorized, considering composition of physical

factors (e.g., chemical structure), scope of biological effect (e.g., toxicological

responses), potential routes of exposure, and modes of direct measurement in

human tissue (e.g., cell assays, mass spectrometry, self-report). The next phase

of such a project would follow in the footsteps of the HapMap project [25],

characterizing how common environmental factors vary in different

populations.

Our search resulted in a total of 89,653 unique disease-factor pairs assembled

from 4977 unique environmental factors and 3189 unique disease conditions.

Figure 2 shows the number of publications published for each factor versus the

number of diseases investigated in context of that factor. Specifically, the

median number of publications for a factor was 8; the most highly investigated

factor was “Smoking” (2416 publications) and 17% of all factors were only

investigated once (1 publication). The median number of diseases investigated

per factor was 6; the most highly investigated diseases were “Dermatitis” (3353

publications), “Drug Eruptions” (3151 publications), and “Occupational

Disease” (3105 publications). 11% of diseases have only been investigated

once (1 publication). In general, the more a factor is investigated the more

diseases it is investigated with (Figure 2). The most highly studied factor-

disease pairs are seen in Table 1 and include factors such as Asbestos, UV

Radiation, Smoking, and Air Pollution studied with diseases such as Lung

Cancer, Skin Cancer, and Mesothelioma.
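
As a minimal sketch of how summary statistics of this kind might be computed, consider a list of factor-disease pairs with publication counts. The records, the `summarize` function, and the choice to sum pair counts per factor (which could double count a publication annotated with several diseases) are illustrative assumptions, not the data or procedure used in this work.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (factor, disease, publication count) records; values are made up.
pairs = [
    ("Smoking", "Lung Neoplasms", 30),
    ("Smoking", "Asthma", 5),
    ("Asbestos", "Lung Neoplasms", 12),
    ("Asbestos", "Mesothelioma", 20),
    ("Latex", "Dermatitis", 1),
]


def summarize(records):
    """Return per-factor publication totals and distinct disease counts."""
    pubs = defaultdict(int)
    diseases = defaultdict(set)
    for factor, disease, n_pubs in records:
        pubs[factor] += n_pubs  # assumption: pair counts are summed per factor
        diseases[factor].add(disease)
    n_diseases = {factor: len(names) for factor, names in diseases.items()}
    return dict(pubs), n_diseases


pubs, n_diseases = summarize(pairs)
median_pubs = median(pubs.values())
median_diseases = median(n_diseases.values())
# Share of factors investigated in a single publication only.
share_single = sum(1 for n in pubs.values() if n == 1) / len(pubs)
print(median_pubs, median_diseases, round(share_single, 2))
```

Plotting `pubs` against `n_diseases` for every factor would give a scatter of the kind described for Figure 2.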

Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE. Each point in the figure represents an environmental factor, where the x-axis represents the number of publications for the factor and the y-axis represents the number of diseases investigated for that factor. For example, the factor “Smoking” is the top-right-most point in the plot, with 2,416 publications investigating it among 560 diseases. The red line depicts the median and the grey lines depict deciles. Markers are faint and jittered to depict concentration of data.

[Figure 2 plot. x-axis: number of publications for a factor; y-axis: number of diseases for a factor.]

Category                    Number of factors (%)   Number of factor-disease relations (%)  Top factor (disease)
Animal (General)            7 (<1%)                 15 (<1%)                                Birds (Bird Disease)
Bacteria                    189 (4%)                857 (1%)                                E. coli (Diarrhea)
Behavior                    28 (<1%)                1101 (1%)                               Smoking (Lung neoplasms)
Chemical (General)          220 (4%)                4392 (5%)                               Air pollution (Asthma)
Dietary component           290 (6%)                4912 (5%)                               Lipids (Cardiovascular disease)
Drug                        1322 (27%)              21071 (24%)                             Analgesics, Opioid (Pain)
Element                     134 (2%)                3229 (4%)                               Lead (Lead Poisoning)
Eukaryotes                  99 (2%)                 218 (<1%)                               Mites (Mite Infestations)
Fungus                      35 (<1%)                168 (<1%)                               Candida (Candidiasis)
Hormone                     263 (5%)                5283 (5%)                               Estrogens (Breast neoplasms)
Immune factors              38 (5%)                 5826 (6%)                               Measles-Mumps-Rubella Vaccine (Autistic Disorder)
Poisoning / injury process  38 (<1%)                930 (1%)                                Occupational Exposure (Dermatitis)
Inorganic chemical          145 (2%)                2554 (3%)                               Asbestos (Lung neoplasms)
Man-made object             58 (1%)                 1153 (1%)                               Laser (Eye injuries)
Nucleic Acids               28 (<1%)                207 (<1%)                               RNA, viral (Hepatitis C)
Occupation                  23 (<1%)                177 (<1%)                               Equipment design (Burns)
Organic chemical            737 (15%)               7318 (8%)                               Latex (Dermatitis)
Procedure                   801 (16%)               26388 (29%)                             Radiotherapy (Neoplasms)
Natural Process             73 (1%)                 2244 (3%)                               UV Rays (Neoplasms)
Protein                     83 (2%)                 952 (1%)                                Streptokinase (Thrombosis)
Virus                       141 (3%)                658 (<1%)                               HIV (AIDS)

Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms. The second column shows the number of factors within that category and the percentage of all factors. The third column shows the number of factor-disease relationships for a category and the percentage of all relationships. The right-most column shows an example factor belonging to the category and a disease studied with that factor. The left-most column is the color key for Figure 3.

Just as the “Human disease network” has enabled the definition of the

“diseasome”, a comprehensive representation of diseases and their

interrelationships with genomic factors [26], we introduce the “Envirome-

disease network” to aid in the definition of the envirome and its interplay with

common diseases (Figure 3, Figure 4). The “Envirome-disease network”

consists of weighted links corresponding to the number of publications

between specific factors and WHO-prioritized diseases [27], including

cardiovascular disease, coronary disease, hypertension, type 2 diabetes (T2D),

kidney disease, premature births, lung cancer, prostate cancer, colorectal

cancer, Alzheimer’s disease, asthma, and chronic obstructive pulmonary

disease (COPD). In other words, the more publications that link a

disease and a factor, the stronger the link. As seen in the network (Figure 4),

the cardiovascular-related and metabolic diseases (cardiovascular disease,

hypertension, T2D, kidney disease) appear to share many diet- and

therapy-related factors, such as carbohydrates and anti-hypertensive drugs. Further,

lung-related diseases such as asthma, COPD, and lung cancer share factors

such as smoking, tobacco smoke, and air pollution. Conditions such as

premature births, colorectal cancer, and COPD are less studied relative to

cardiovascular related diseases or breast and lung cancer.
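
To make the network construction concrete, the following sketch builds a small weighted factor-disease network and derives node degree and the factors shared between two diseases. It is a minimal illustration under stated assumptions: the edge weights are made up, and `build_network` and `shared_factors` are hypothetical helper names rather than code from this work.

```python
from collections import defaultdict

# Hypothetical weighted links: (factor, disease, number of publications).
edges = [
    ("Smoking", "Lung Neoplasms", 30),
    ("Smoking", "Asthma", 5),
    ("Smoking", "Hypertension", 11),
    ("Air Pollution", "Asthma", 6),
    ("Air Pollution", "Lung Neoplasms", 4),
    ("Antihypertensive Agents", "Hypertension", 133),
]


def build_network(edge_list):
    """Build a bipartite adjacency map: node -> {neighbor: link weight}."""
    adjacency = defaultdict(dict)
    for factor, disease, weight in edge_list:
        adjacency[factor][disease] = weight
        adjacency[disease][factor] = weight
    return adjacency


def shared_factors(adjacency, disease_a, disease_b):
    """Return the factors linked to both diseases, sorted by name."""
    return sorted(set(adjacency[disease_a]) & set(adjacency[disease_b]))


network = build_network(edges)
# Degree: number of distinct nodes a node is linked to.
degree = {node: len(neighbors) for node, neighbors in network.items()}
# The strongest link is the pair with the most publications.
strongest = max(edges, key=lambda edge: edge[2])
print(degree["Smoking"], shared_factors(network, "Asthma", "Lung Neoplasms"), strongest)
```

In a drawing of such a network, the link weight would set the link width and the degree would set the node size.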

Figure 3. Envirome-disease network for WHO priority diseases [27]. A link between an environmental factor and a disease node denotes their association in MEDLINE. Each link is scaled in size according to the number of citations for that association; for example, the largest link is observed between “Smoking” and “Lung Neoplasms”. Environmental factor and disease nodes are scaled in proportion to the number of connections they have with other nodes (their “degree”); similarly, labels appear for nodes with degree ≥ 4. Factors studied with only a single disease are observed on the outer portion of the figure with single links, while factors linked with many diseases are toward the center. Specific factors are colored according to their category (Table 1); disease nodes are not colored. The top 5 cited factors for each disease are annotated offset with the number of citations in parentheses.

[Figure 3 graphic: network of environmental factor and disease nodes. Node labels and the per-disease top-five citation annotations belong to the image and are not reproduced here.]

Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases. For clarity, only nodes of degree ≥ 4 are shown here. See Figure 3 for the caption and the full network.

[Figure 4 graphic: subset of the Figure 3 network restricted to nodes of degree ≥ 4. Node labels belong to the image and are not reproduced here.]

From such a comprehensive representation, we may begin to assemble the

envirome from a set of factors investigators have prioritized through study of

their relationship with disease. However, for discovery and hypothesis

generation, arguably the most important associations are ones that have few or

no citations, for example factors found in the lower left quadrant of Figure 2,

or corresponding factors with a single or no link in the envirome-disease

network (Figure 3). In the following section, we present how we may

systematically create these hypotheses to establish links for further study.

Creation of robust hypotheses connecting the environment, genome, and

multifactorial disease

Genomics and informatics have enabled the creation of novel and validated

hypotheses on a multidimensional breadth for multifactorial diseases. This

scale is required for complex diseases where many such factors are thought to

take part. Through GWAS we have discovered genetic variants that have

enabled scientists to postulate about the function of the pathways discovered in

diseased individuals and have led to biological and clinical experimentation

(for example, [28-31]). While GWAS has fallen short of explaining the total heritable

risk of complex disease [32], the data-driven methodology has enabled us to

collect robust, multidimensional evidence, opening new avenues of

investigation [32, 33].

In this dissertation, we have developed and implemented analytic methods to

associate environmental factors to multifactorial disease (Figure 5).

Specifically, we propose methods that scale across experimental frameworks,

from populations (Figure 5 A, C) down to molecules (Figure 5 D, F). Further,

they scale in resource utilization through use of publicly accessible data sets.

Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome. Examples are fictional and shown for illustrative purposes. A.) An Environment-wide association study (EWAS) is a population-scale method to screen multiple factors in the envirome for association with a disease of interest. Depicted is a “Manhattan Plot”, a way to visualize strength of association for each factor over the envirome. For example, the pesticide Heptachlor is depicted as the highest ranking finding in an EWAS in association to Type 2 Diabetes. B.) Genome-wide association study (GWAS), in which genomic variants are associated to disease on a genome-wide dimension. C.) Genome-Environment-Wide Association Study (G-EWAS), in which marginal findings from GWAS and EWAS are integrated and screened jointly for interaction in context of a disease of interest. For example, evidence for Heptachlor and SLC30A8 interaction against Type 2 Diabetes is shown in the 2D plot. D.) Representation of an “Envirome Map” whereby gene expression “signatures” induced by physical environmental factors on model systems are summarized in a matrix. For example, Bisphenol A has a signature consisting of CYP1A1, MAPK1, and ESR1 expression. E.) Standard gene expression disease responses collected on multiple diseases, for example Type 2 Diabetes, Breast Cancer, and Coronary Heart Disease. F.) Method to correlate environmental factor expression signatures to disease state, for example Bisphenol A to Breast Cancer. Panels A-C consider studies on a population scale, panels D-F on a molecular or toxicological scale. The envirome domain is depicted in green, the genome domain in red. Examples are shown in italics.

[Figure 5 graphic. Panel titles: A. Environment-wide Association Studies (EWAS) (Chapter 3); B. Genome-wide Association Studies (GWAS); C. G-EWAS (Chapter 4); D. Envirome Map (Chapter 2); E. Disease expression studies; F. Envirome-disease expression signature correlation (Chapter 2). Panels A-C are labeled “Population scale” and panels D-F “Molecular scale”. Axis labels: disease association significance (e.g., Type 2 Diabetes); genome-wide function (i.e., mRNA expression). Legend: illustrative examples in italics.]

Creating hypotheses comprehensively on a population scale

In EWAS (Chapters 3 and 4, Figure 5A), as in GWAS (Figure 5B), multiple

environmental factors, or the “envirome”, are interrogated against disease state

using an epidemiological study design. These studies can be “case-control”,

whereby factor variability is compared between incident disease cases

(individuals recently diagnosed) and non-diseased individuals. EWAS can be

utilized in other observational study designs, most notably a “cohort” or

“cross-sectional” one, in which individuals are sampled from a pre-defined

population and their disease status is estimated after the sampling process.

The selection and determination of cases and controls are well studied in

epidemiology, and non-optimal selection can lead to biases in estimates [34].

For example, if disease cases are misclassified as controls, estimates may

be attenuated.

Other types of biases abound in observational studies and must be

acknowledged and considered. For example, in a cross-sectional study one

cannot easily resolve the temporal relationship between exposure to a factor

and disease (i.e., did the disease come first, or the exposure?). This bias is

known as “reverse causality” [34]. Other biases, such as confounding, also

hinder inference in observational studies. A confounding variable is one that

correlates with both the disease state and the factor; thus, the association of the

factor with the disease may be a stand-in for the association between the confounding

variable and the disease. Confounding variables may also go unmeasured. While we

introduce means to estimate confounding bias in this dissertation utilizing

measured variables (Chapter 3), confounding bias cannot be avoided altogether.

We have developed and applied EWAS [35, 36] using a cross-sectional dataset

known as the National Health and Nutrition Examination Survey (NHANES),

a representative survey of the non-institutionalized United States population

carried out by the Centers for Disease Control and Prevention and the National

Center for Health Statistics [37]. Participants of NHANES are surveyed

regarding their health and disease through a battery of questionnaires,

physician-led physical exam, and urine- and blood-based laboratory tests. A

series of lab tests (N=150-266, depending on survey year) consist of blood or

urine markers of environmental factors, such as heavy metals, persistent

organic pollutants, nutrients and vitamins, antibodies against allergens, and

indicators of pathogenic exposure. Furthermore, several questionnaire items

are used as proxies of environmental factors (N=300-1000, depending on

survey year), such as years smoked and pharmaceutical and drug use.

In addition, many clinical biomarkers used for disease diagnosis and risk

prediction, such as body mass index, fasting glucose, and serum lipid levels, are

jointly measured, providing a platform to create hypotheses regarding

prevalent diseases without investment in recruitment of subjects.

In its current form, each factor in an EWAS (that is, each measured component of the “envirome”) is

comprehensively associated to disease or phenotypic state, often depicted in a

“Manhattan plot” (Figure 5A), a transparent representation of all findings. Like

GWAS, the framework calls for a massive number of comparisons, which can

lead to false positives. EWAS utilizes the “false discovery rate” (FDR) to

account for multiple comparisons, which provides a quantitative estimate of

the proportion of false discoveries at a given level of statistical significance and is

less conservative than family-wise error rate methods such as the Bonferroni correction [38-40]. Along with a

more stringent threshold to account for false positives, positive findings are

evaluated in independent cohorts and surveys. Multiple comparisons are not

considered in most environmental epidemiological studies and this level of

stringency allows for robust and quantitative prioritization of findings [10, 41,

42].
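
To illustrate why FDR-based control is less conservative than a Bonferroni correction, the sketch below applies both to the same list of p-values. Note the assumption: the FDR estimator actually used in this work is described in the cited references and later chapters; the Benjamini-Hochberg step-up procedure is substituted here only because it is a widely used and compact FDR-controlling method. The p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at family-wise level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values, one per environmental factor tested.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_rejected = benjamini_hochberg(p_values)
bonf_rejected = bonferroni(p_values)
print(sum(bh_rejected), sum(bonf_rejected))
```

With these toy values the step-up procedure retains two findings and the Bonferroni correction one; the gap generally widens as the number of tested factors grows.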

We have applied EWAS to multiple disease-related phenotypes, notably T2D

[35] and serum lipid levels [36], risk factors for coronary heart disease

(Chapter 4). Like GWAS, we were able to identify and validate in independent

surveys both novel and known factors associated with the phenotypes that

should be followed up in additional epidemiological or toxicological studies.

For example, we have created hypotheses about pollutant factors such as

polychlorinated biphenyls and organochlorine pesticides, both associated with

significantly increased T2D prevalence and adverse lipid profiles. Surprisingly, a

vitamin marker, γ-tocopherol, was also observed to have an adverse

relationship with these phenotypes, leading us to hypothesize about both reverse

causality and possible harmful effects of the vitamin [43].

Common disease is hypothesized to be a combination of both genetic and

environmental factors, but GWAS and EWAS each examine one domain without

consideration of the other; that is, they estimate marginal associations. We propose

another population-based study, called a Genome-Environment-Wide Association Study (or

G-EWAS, Figure 5C, Chapter 5), to examine the joint effect of these factors.

Specifically, we interrogate individual findings from GWAS and EWAS,

testing whether each pair-wise combination of a genetic and environmental

factor “interact”, known as “gene-environment interaction” [44]. When testing

for interaction, we examine whether the joint effects are greater or smaller than

when considering marginal effects alone.
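
One simple way to see what “greater or smaller than marginal effects alone” can mean is to compare odds ratios across genotype and exposure strata. The sketch below is illustrative only: the counts are made up, the multiplicative scale is an assumed choice, and the interaction model actually used is the one described in Chapter 5.

```python
def odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Odds ratio of disease for a stratum relative to a reference stratum."""
    return (cases_exposed / controls_exposed) / (cases_ref / controls_ref)


# Hypothetical case/control counts by genotype (G) and exposure (E) stratum.
counts = {
    ("G-", "E-"): {"cases": 100, "controls": 400},
    ("G+", "E-"): {"cases": 60, "controls": 200},
    ("G-", "E+"): {"cases": 75, "controls": 200},
    ("G+", "E+"): {"cases": 90, "controls": 100},
}
reference = counts[("G-", "E-")]


def or_vs_reference(stratum):
    """Odds ratio of a stratum against the unexposed, non-carrier stratum."""
    s = counts[stratum]
    return odds_ratio(s["cases"], s["controls"],
                      reference["cases"], reference["controls"])


or_gene = or_vs_reference(("G+", "E-"))   # genetic factor alone
or_env = or_vs_reference(("G-", "E+"))    # environmental factor alone
or_joint = or_vs_reference(("G+", "E+"))  # both factors present

# Ratio > 1: the joint effect exceeds the product of the separate effects.
# Ratio < 1: the joint effect is smaller. Ratio == 1: no multiplicative interaction.
interaction_ratio = or_joint / (or_gene * or_env)
print(round(or_gene, 2), round(or_env, 2), round(or_joint, 2), round(interaction_ratio, 2))
```

In practice such a ratio would be estimated with a regression model that also adjusts for covariates and reports a standard error; the stratified counts are used here only to keep the arithmetic visible.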

Interaction effects enable investigators to postulate about biological

mechanisms underlying the disease of interest. As an example, investigators

have recently confirmed an increased risk of bladder cancer for individuals who

carry variants in the NAT2 gene and who smoke [45]. NAT2 encodes an enzyme that metabolizes

chemical compounds; the presence of a statistical interaction has prompted a

hypothesis between altered NAT2 function, chemical compounds found in

cigarette smoke and their metabolites, and pathology of bladder cancer [46].

However, there are a few problems with current gene-environment

investigations [44, 47] (Chapter 5). First, like most environmental

epidemiological investigations, gene-environment interaction studies rely on a

priori selection of factors to test. The task of choosing factors is particularly

daunting given that the number of common genetic and environmental variants

is on the order of thousands to millions. Second,

gene-environment studies are resource intensive, requiring substantially greater

sample sizes than studies that examine either component alone. Relatedly, these

studies can be analytically intensive, incurring a large multiple-testing cost

due to the many combinations of factors to test. Therefore, in the spirit of

EWAS and GWAS, we propose G-EWAS. This approach attempts to solve

the problem of choosing which factors to test while alleviating some of the

analytical burden of testing a large hypothesis space.

Like EWAS and GWAS, G-EWAS is a data-driven method to find interactions

between robust and replicated variants marginally associated with disease in

EWAS and GWAS (Figure 5B). This method avoids the variable selection

bias that has plagued candidate gene studies [48] while keeping the hypothesis

space constrained. Specifically, each possible pair of factors found in EWAS

and GWAS is tested for interaction in association with the disease of interest,

screening a 2-dimensional hypothesis space on the order of hundreds, not

thousands (Figure 5C). Gene-environment studies are power-intensive and

keeping the hypothesis space as small as possible is desired [47]. Relatedly,

increasing the number of hypotheses increases the burden of false positives, and

using family-wise error rate methods for multiple hypothesis control (e.g., Bonferroni)

becomes too conservative and methods to estimate the FDR are necessary.
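
The effect of constraining the hypothesis space can be seen with simple arithmetic. In the sketch below, all counts are hypothetical: 18 replicated genetic variants, 10 replicated environmental factors, and an unconstrained screen of 500,000 variants against 200 factors are assumed purely for illustration.

```python
from itertools import product

# Hypothetical marginal findings carried forward from GWAS and EWAS.
gwas_hits = [f"variant_{i}" for i in range(1, 19)]  # 18 assumed variants
ewas_hits = [f"factor_{j}" for j in range(1, 11)]   # 10 assumed factors

# Every gene-environment pair among the marginal findings is one interaction test.
candidate_pairs = list(product(gwas_hits, ewas_hits))
n_tests_screened = len(candidate_pairs)

# Hypothetical unconstrained screen: every variant against every factor.
n_tests_all = 500_000 * 200

alpha = 0.05
bonferroni_screened = alpha / n_tests_screened
bonferroni_all = alpha / n_tests_all
print(n_tests_screened, n_tests_all, bonferroni_screened, bonferroni_all)
```

Under these assumptions the constrained screen involves 180 tests rather than 100 million, so the per-test significance threshold is several orders of magnitude less stringent.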

We demonstrated the utility of G-EWAS in application to T2D (Chapter 5) and

found an interaction between a non-synonymous variant in the SLC30A8 gene

and 2 vitamin markers, γ-tocopherol and trans-β-carotene after adjustment for

risk factors and consideration of multiple comparisons. Of note, we observed

up to 30-40% increased genetic risk when considering specific environmental

factors. Of course, proof of statistical interaction does not imply an etiological

relationship between the factors. However, investigators have observed that

diabetes is only induced for SLC30A8 knockout models in presence of a high-

fat diet [49, 50]. Results here strengthen this hypothesis and offer specific

factors present or absent within diet to induce a diabetic state. With G-EWAS,

we have a platform to produce multiple data-driven hypotheses regarding

biological mechanisms through gene-environment interactions. Further,

these interaction findings have implications for personalized genetic risk [23].

Creating hypotheses comprehensively on a molecular or toxicological scale

Toxicology is concerned with the physical substances and exposures that lead

to adverse changes at the organism and/or molecular level, and with how

organisms are exposed to substances. Specifically, toxicologists utilize the

physical sciences to measure how physical substances interact with biological

systems to induce physiological change. For example, how do biochemical

processes alter a substance for digestion, absorption, and excretion, or what are

the “toxicokinetic” properties of a system given exposure? Second, how does

the substance induce “toxicodynamic” change, for example, how does the

substance influence specific targets? Last, how do toxicokinetics and

dynamics influence functional change such as in cell viability and metabolism

[3]? For example, a cornerstone of toxicology is known as “dose-response”

modeling, in which a molecular response is correlated with controlled doses of

a substance. Ascertaining a dose-response relationship enables inference

regarding the type of relationship between a substance and a biological system

connected to the response (e.g., adverse or protective effect).

Another way to ascertain molecular response includes utilizing genome-wide

measurements, such as commoditized gene expression microarrays. This

subfield of toxicology is known as “toxicogenomics” and aims to “study the

response of a whole genome to toxicants” [51]. In contrast to the

population-based EWAS approach, toxicogenomics offers insight into how specific environmental

factors may perturb a biological system; however, these responses are

unconnected to complex diseases.

In Chapter 2, we show how one may use the tools of integrative genomics to

connect toxicogenomic responses with disease-associated responses, thus

enabling hypothesis generation linking specific physical environmental

factors to complex diseases, such as cancer (Figure 5D-F). For example,

landmark genomic research has linked chemically induced functional changes

to disease and related phenotypes in context of therapeutic prediction. Lamb et

al., in an effort dubbed the “Connectivity Map”, correlated 164 drug-induced

gene expression changes on cell lines to human disease-associated gene

expression states, predicting novel molecules for therapeutics [52]. Analogous

to this work, we ask which potentially environmentally induced changes are

correlated with, and might explain variation in, functional disease states.

The proposed method takes full advantage of the publicly available

toxicological and disease-related data such as the Gene Expression Omnibus

(GEO) [53], the Comparative Toxicogenomics Database (CTD) [54], the Toxin

and Toxin-Target Database (T3DB) [55], and the U.S. Environmental Protection

Agency’s ToxCast effort [12, 56], thus providing a scalable way to derive

hypotheses with minimal effort in upfront experimental design.

To begin, numerous experiments examining disease-associated gene

expression are accessible in GEO. Furthermore, it is possible to compare these

disparate datasets to make inferences over the aggregate. For example, Dudley

et al. have collected 238 disease-associated expression responses from GEO

for cross-disease analysis [57]. Further, the authors have shown that signal

associated with disease is stronger than experimental or tissue-of-origin

artifacts [57]. And, most importantly, the authors have used this representation

to predict novel therapeutics connected with disease in an unbiased manner

[58]. We claim the same can be achieved with environmental factors.

Toward this goal, we represent a compendium of disease-associated gene

expression datasets in matrix form, where columns represent diseases,

rows represent genes, and each entry is a measure of differential gene expression

(Figure 5F) for the corresponding gene-disease pair. This representation allows

one to make inferences systematically across these disparate datasets. Of course, when

aggregating data over multiple experiments from different investigators, care

must be taken to ensure the inter-comparability of the data [59, 60].

Nevertheless, the commoditization of measurement platforms has enabled data

standardization easing some of the burden of ensuring their comparability [57].
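
The matrix representation described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name, the disease labels, the scores, and the choice to fill unmeasured genes with 0.0 are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_disease_matrix(
    datasets: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Arrange per-disease differential expression scores as a gene-by-disease matrix.

    `datasets` maps a disease name to {gene symbol: differential expression score}.
    Genes missing from a dataset are filled with 0.0 (an assumption of this sketch).
    """
    diseases = sorted(datasets)
    genes = sorted({g for scores in datasets.values() for g in scores})
    # Rows are genes, columns are diseases, as in the text.
    matrix = [[datasets[d].get(g, 0.0) for d in diseases] for g in genes]
    return genes, diseases, matrix


# Hypothetical scores, invented for illustration only.
example = {
    "breast cancer": {"ESR1": 2.1, "CYP1A1": 0.8},
    "lung cancer": {"CYP1A1": 1.5, "AHR": -0.4},
}
genes, diseases, matrix = build_disease_matrix(example)
```

In practice, the fill value for unmeasured genes and the choice of differential expression score both affect comparability across datasets, which is the concern raised in the paragraph above.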

Prototypical gene expression experiments in toxicogenomics can be framed

similarly (single columns of Figure 5D). These experiments typically involve

characterizing a range of dosages of a handful of environmental chemicals on

gene expression of a model organism, such as mouse or rat, or cell line system.

These experimental data files are then submitted to GEO or are summarized in

databases such as the CTD. Just as the Connectivity Map has

compiled “signatures”, or patterns of expression for individual chemicals, for

prediction of therapeutics for disease states, one can do the same using

numerous toxicogenomic experiments covering multiple environmental

chemicals: we call this effort an “Envirome Map” (Figure 5D).

We then query the Envirome Map (Figure 5D) with specific disease-associated

expression datasets (Figure 5E), correlating environmental signatures with

disease-associated expression states (Figure 5F). In Chapter 2, we develop a

method to compute correlations between these datasets [61]. Specifically, each

environmental chemical expression signature is queried for enrichment for

genes expressed in the disease expression set. If a disease expression dataset

shares more genes with a chemical signature than expected by chance alone,

one concludes the chemical signature and disease expression states are highly

correlated. This process is repeated for every chemical in the Envirome Map.

Again, multiple hypotheses are accounted for by estimating the FDR, and the top

ranked correlations are the top hypotheses generated from the procedure.
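
One way to make “more than expected by chance alone” concrete is the hypergeometric tail probability, the test named in the method description of Chapter 2. The notation is introduced here for illustration only: N is the number of genes measured, K the number of those genes in the chemical signature, n the number of differentially expressed genes in the disease dataset, and k the number of genes shared by the two sets.

```latex
p = P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
```

A small p indicates that an overlap of at least k genes would be unlikely if the n differentially expressed genes were drawn at random from the N measured genes.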

We utilized the Envirome Map to create hypotheses regarding breast, lung,

and prostate cancers (Chapter 2, Figure 5F). Specifically, gene signatures of

established factors such as estrogens were highly correlated with breast cancer.

We also observed an endocrine disruptor, bisphenol A, to be associated with

expression states of the disease. However, we lack the directionality of such

associations, and further experimentation is needed to characterize the

toxicokinetics and toxicodynamics of BPA in relation to breast cancer. We

discuss validation of these hypotheses in the next section.

Discussion

We have introduced a representation of the collection of all dynamic and

specific environmental factors called the envirome. In our brief survey of how

this entity is studied, we observed its breadth and heterogeneity (Figure 1,

Figure 2, Figure 3). However, despite its breadth, the envirome is not studied

as rigorously as the genome. To this end, we propose population- and

molecular-scale methods to enable scalable hypothesis generation between the

envirome and disease.

The next question is what happens to these hypotheses. First, we discuss and

introduce methods for validation to infer population risk; second, we discuss

new study designs to investigate molecular response to predicted factors.

On a population scale, validation of associations to affirm risk ideally occurs in

diverse populations in study designs that minimize confounding and reverse

causal bias. For example, the randomized trial is the “gold standard” for

validation of therapeutics. Because randomized trials are not possible for

factors with adverse associations, prospective studies may be executed,

utilizing cohorts followed through time or even across multiple familial

generations, such as the Framingham Study [62, 63]. However, while we may

understand the temporal pattern of the exposure-disease relationship, biases

still cannot be excluded.

Taking a step back, standardization of methodology and measurements has

enabled validation of results and verification of risk estimates derived from

genome-wide studies. For example, it is now typical for large consortia to

validate genetic results: recent GWAS results have been

strengthened by multiple meta-analyses collecting data on up to

100,000 individuals around the world [64, 65]. We argue that data standards

for the envirome would enable validations similarly, aiding design of

longitudinal studies that can be combined for meta-analyses. On this point, the

PhenX project is centered on building consensus on the type and measurement of

high priority environmental factors and phenotypes [66, 67]; however, these

standards have yet to be adopted in high-profile epidemiological studies.

Introduction of standards would enable methods such as “Mendelian

randomization” to be comprehensively evaluated as a tool for validation [68,

69]. Mendelian randomization provides a way to approximate a randomized

trial through use of genetic loci that vary with exposure independent of

phenotypic state. Therefore, the association between disease and an exposure

is mimicked by the association between the genetic variant and disease. Given that genetic variants

assort randomly, we avoid the biases in traditional association analyses by

using variants as stand-ins for exposures. Following from this, a set of these

variants can possibly be utilized to validate factors found in EWAS. The

central challenge would be determining which variants vary with the

massive numbers of possible factors that could be found in an EWAS. Further,

the variation must be described in populations of interest. Nevertheless,

GWAS have explored genetic variation in relation to environmental factors,

such as smoking dependence and consumption [70, 71], alcohol consumption

and dependence [72-74], infection susceptibility [75-77], coffee consumption

[78], exercise [79], and vitamin B levels [80], in addition to existing

“pharmacogenetics” studies which associate genetic variation to drug response

[81]. For example, suppose one hypothesizes a relationship between coffee

consumption and diabetes. Suppose also a (hypothetical) genetic variant called

COFFEE has been identified as a strong marker correlated with the amount of

coffee an individual drinks per day. Evidence that the COFFEE

marker is associated with diabetes would then support our original hypothesis; if it

is not, we might conclude that the original association is biased. This is just a

simple example; there is need for investigation and novel methods that utilize

sets of such proxy genetic variants as means to validate environment-disease

associations.
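
The logic of the COFFEE example can be written as a simple ratio estimate, commonly called the Wald ratio. The notation is an assumption of this sketch and not part of the original analysis: let \hat{\beta}_{GX} be the estimated association of the variant G with the exposure X (coffee consumption) and \hat{\beta}_{GY} its estimated association with the disease Y (diabetes).

```latex
\hat{\beta}_{XY} = \frac{\hat{\beta}_{GY}}{\hat{\beta}_{GX}}
```

If the variant is strongly associated with consumption but shows no association with diabetes, the implied effect of consumption on diabetes is close to zero. The ratio is interpretable only under the usual instrumental variable assumptions, namely that the variant influences disease solely through the exposure and is independent of confounders of the exposure-disease relationship.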

On the other hand, we must test how chronic, low doses of specific

environmental factors modulate molecular responses in model, but clinically

relevant, systems in order to learn about disease biology. On a molecular scale,

it has long been possible to attain a wealth of both phenotypic and genotypic

data from model systems [82] and these methods should complement

toxicology to elucidate disease biology of hypothesized factors [83]. Ideally,

however, we should attempt to study molecular response in actual populations,

eventually investigating both disease biology and population risk

simultaneously.

In fact, initial investigations have blurred the line between traditional

molecular- and population-based approaches in studying external factors in

context of complex phenotypes. For example, Idaghdour and colleagues have

shown the relationship between genetic variation and leukocyte gene

expression in the context of urban and/or rural habitation in Morocco on 194

individuals [84]; however, geography is an abstraction of specific

environmental factors. Future studies of similar design should assess how

hypothesized factors correlate with changes in genome-wide molecular

measures on a population scale among diseased individuals.

In conclusion and in the following chapters, we describe and apply methods to

comprehensively associate multiple specific environmental factors, a subset of

the “envirome”, to complex disease for hypothesis generation. Specifically,

we introduce methods to link molecular responses to disease states through a

representation we call the “Envirome Map”. Second, we describe population-

level methods to find novel and robust associations between the envirome and

disease called EWAS. Third, we apply EWAS in the context of T2D and

serum lipid levels. Last, we show how we can integrate EWAS findings with

GWAS to predict how environmental factors modify genetic risk. Future work

in deciphering environmental contributions to disease will benefit from

specific definition of the envirome, standardization of its measurement, and

comprehensive integration of molecular-scale measures on at-risk populations.

CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES

TO COMPLEX DISEASE

“All substances are poisons; there is none which is not a poison.”

-- Paracelsus (1493–1541)

INTRODUCTION

Certainly Paracelsus, a physician of the Renaissance credited with

the beginnings of the study of “poisons” and toxicology, would remark

similarly if living today. As a result of modernization and commercialization,

the breadth of “poisons”, which we will call in the abstract environmentally

sourced physical exposures, has become immense in type and property [85].

As introduced in Chapter 1, toxicology is the discipline that studies the effect of

physical agents on biological systems, often with regard to

adverse effect [3].

Specifically, toxicologists try to understand the fundamental mechanisms of these

effects, be they biochemical, cellular, genetic, or molecular. The history of

the field goes back to Paracelsus’ time, and there is a rich literature on

methodology for elucidating these effects. One area of practice in toxicology is

known as “risk assessment”, or prediction of how toxicological effect results in

changes in health [3]. However, “risk assessment” usually concerns potential,

immediate hazard and non-chronic effect, inferred from model systems and

organisms exposed to high doses [11, 13]. Toxicological risk assessment says very

little about chronic or complex disease [86], which is the subject of this

dissertation.

Nonetheless, our knowledge regarding the ways chemical exposures induce

low-level biological response is increasing with the advent of high-throughput

measurement and screening modalities [12, 54, 87, 88]. However, while

toxicological response remains unconnected to complex disease and public

health, it is also currently difficult to ascertain multiple associations of

chemicals to health status without significant experimental investment or large-

scale epidemiological study. Use of publicly available environmental

chemical factor and genomic response data, such as toxicogenomic gene

expression data, may facilitate the discovery of these associations.

What is “toxicogenomics”? Toxicogenomics scales up signatures of

“biological response” to the dimension of the entire genome. That is,

toxicogenomics refers to the patterns of changes in response due to exposure to

physical agents, measured via modalities in functional genomics such as

proteomic mass spectrometry and gene expression microarrays, and analyzed

using bioinformatics techniques [51]. These modalities have become so

entrenched in functional genomics that there already exist rich,

publicly available data sources and methods to explore toxicogenomic-level

response (Chapter 1).

In the following, we propose to use pre-existing datasets and knowledge-bases

in order to derive hypotheses regarding chemical toxicological association to

disease without upfront experimental design, extending the work of

toxicogenomics. Specifically, we have asked what environmental chemicals

could be associated with gene expression data of disease states such as cancer,

and what analytic methods and data are required to query for such correlations.

This study describes a method for answering these questions. We integrated

publicly available data from gene expression studies of cancer and toxicology

experiments to examine disease/environment associations. Central to our

investigation was the Comparative Toxicogenomics Database (CTD) [54],

which contains information about chemical/gene/protein responses and

chemical/gene/disease relationships, and the Gene Expression Omnibus (GEO)

[53], the largest public gene expression data repository. Information in the

CTD is curated from the peer-reviewed literature, while gene expression data

in GEO is uploaded by submitters of manuscripts. We use the CTD to create

an “Envirome Map” which is ultimately used to create hypotheses about the

molecular links between environmental factors and disease states (Figure 5D-

F).

Most approaches to date to associate environmental chemicals with genome-

wide response fall into two categories. These approaches have either (1)

tested a small number of chemicals on cells and measured responses on a

genomic scale, or (2) used existing knowledge bases, such as Gene Ontology,

to associate annotated pathways with environmental insult.

The first method involves measuring physiological response on a gene

expression microarray. This approach allows researchers to test chemical

association on a genomic scale, but the breadth of discoveries is constrained by

the number of chemicals tested against a cell line or model organism. These

experiments are not intended for hypothesis generation across hundreds of

potential chemical factors with multiple phenotypic states. Only a few

chemicals can be tractably tested for association to gene activity [89, 90], or

disease on cell lines [91], or on model organisms, including rat and mouse [92].

In rare cases, this approach has reached the level of hundreds or thousands of

chemical compounds, such as the Connectivity Map, developed by Lamb,

Golub, and colleagues [52], which attempts to associate drugs with gene

expression changes. After measuring the genome-wide effect on gene

expression following application of hundreds of drugs at various doses, drug

signatures are calculated and are then queried with other datasets for which a

potential therapeutic is desired. While this has proven to be an excellent

system to find chemicals that essentially reverse the genome-wide effects seen

in disease, the approach of measuring gene expression and calculating

signatures across tens of thousands of environmental chemicals is not always

feasible or scalable. Although other data-driven approaches have been

described [93], few have given insight into external causes of disease.

A second approach has been to use knowledge bases, such as Gene Ontology

[94] to aid in the interpretation of genomic results. For example, Gene

Ontology analysis of a cancer experiment might elucidate a molecular

mechanism related to an environmental chemical. Unfortunately, there is still

a lack of methodology to derive hypotheses for environmental-genetic

associations in disease pathogenesis, as Gene Ontology and general gene-set

based approaches have limited information on environmental chemicals.

In contrast to the previous approaches, we claim that the integration of pre-

existing data and knowledge bases can derive hypotheses regarding the

association of chemicals to gene activity and disease from multiple datasets in

a scalable manner. Gohlke et al. have proposed an approach to predict

environmental chemicals associated with phenotypes also using knowledge

from the CTD [95]. Their method utilizes the Genetic Association Database

(GAD) [96] to associate phenotypes to genetic pathways and the CTD to link

pathways to environmental factors. This method has proved its utility,

allowing for production of hypotheses for chemicals associated with diseases

categorized as metabolic or neuropsychiatric disorders. However, in its current

configuration, their method is dependent on the GAD, which contains statically

annotated phenotypes in relation to genes containing variants; such DNA

changes are not likely to reflect the molecular profiles of tissues

suspected of environmental influence. Unlike this method, our proposed

approach is tissue- and data-driven in that the phenotype is determined by the

individual measurements of gene expression in cells and tissues, allowing for

the dynamic capture of phenotypes.

The approach we propose here is agnostic to experimental protocol, such as cell

line or chemical agent tested, and provides a less resource-intensive way to

screen chemicals for biological validation. Our methodology essentially

combines the best features of these current approaches. We start by compiling

“chemical signatures” in a scalable way using the CTD. As the CTD is a hand-

curated collection of chemical-response data, a chemical signature

could in theory also be constructed from primary data of individual chemical

expression experiments in GEO. These chemical signatures capture known

changes in gene expression secondary to hundreds of environmental chemicals.

The representation of these gene expression states for all of these chemicals we

dub the “Envirome Map”, introduced in Chapter 1 (Figure 5D). In the

following, we describe how to merge the Envirome Map with disease states

(Figure 5 D-F).

In a manner similar to how Gene Ontology categories are tested for

over-representation, we then identify the genes differentially expressed in

disease-related experiments and determine which chemical signatures are

significantly over-represented among them. We first verified the accuracy of our

methodology by analyzing microarray data of samples with known chemical

exposure. After these verification studies yielded positive results, we then

applied the method to predict disease-chemical associations in breast, lung, and

prostate cancer datasets. We validated some of these predictions with curated

disease-chemical relations, warranting further study regarding pathogenesis

and biological mechanism in context of environmental exposure. Our method

appears to be a promising and scalable way to use existing datasets to connect

genome-wide toxicological response to disease [61].

METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO

GENE EXPRESSION RESPONSE

The Comparative Toxicogenomics Database (CTD) includes manually-curated,

cross-species relations between chemicals and genes, proteins, and mRNA

transcripts [97]. We downloaded the knowledge-base spanning 4,078

chemicals and 15,461 genes and 85,937 relationships between them in January

2009. An example of a relationship in the CTD is “Chemical TCDD results in

higher expression of CYP1A1 mRNA as cited by Anwar-Mohamed et al. in H.

sapiens” (demonstrated in Figure 6A). The median, 70th, and 75th percentiles of

the number of genes related to a chemical are 2, 5, and 7, respectively.

With the single gene, single chemical relationships, we created “chemical

signatures”, or gene sets associated with each chemical (Figure 6B). Gene sets

were created from gene-expression relations spanning 249 species, but most

relations came from H. sapiens, M. musculus, R. norvegicus, and D. rerio. We

eliminated chemical-gene sets that had fewer than 5 genes in the set. This step

yielded a total of 1,338 chemical-gene sets.
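
The signature-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (chemical, gene) pairs already extracted from the CTD, and the function name and the example relations are hypothetical. Only the threshold of 5 genes comes from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

MIN_GENES = 5  # from the text: sets with fewer than 5 genes are eliminated


def build_signatures(
    relations: Iterable[Tuple[str, str]], min_genes: int = MIN_GENES
) -> Dict[str, Set[str]]:
    """Collapse single chemical-gene relations into one gene set per chemical."""
    gene_sets: Dict[str, Set[str]] = defaultdict(set)
    for chemical, gene in relations:
        gene_sets[chemical].add(gene)
    # Keep only chemicals whose gene set has at least `min_genes` members.
    return {c: genes for c, genes in gene_sets.items() if len(genes) >= min_genes}


# Hypothetical relations, invented for illustration only.
relations = [("chemA", f"GENE{i}") for i in range(6)]
relations += [("chemB", "GENE1"), ("chemB", "GENE2")]
signatures = build_signatures(relations)
```

Here "chemA" is retained because it has six genes, while "chemB" is dropped because it has only two.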

We assemble the Envirome Map (Figure 7) by aggregating all 1,338 chemical-

gene signatures from the CTD (Figure 6B). Specifically, each signature can be

depicted as a vector whereby each entry represents a functional link between a

chemical and gene. The Envirome Map is a collection of these vectors. While

we present and apply the Map as a matrix of binary associations, it can easily

be configured to represent a richer set of relationships, such as ordinal values

depicting the scale of association.
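
A binary Envirome Map of the kind described above could be assembled as in the sketch below. The function name is hypothetical, and "chemX" with its gene set is invented for the example; the Dioxins genes are those shown in Figure 6B.

```python
from typing import Dict, List, Set, Tuple


def envirome_map(
    signatures: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genes, chemicals, matrix); matrix[i][j] is 1 if gene i is in the signature of chemical j."""
    chemicals = sorted(signatures)
    genes = sorted(set().union(*signatures.values())) if signatures else []
    # Rows are genes and columns are chemicals, matching Figure 7.
    matrix = [[1 if g in signatures[c] else 0 for c in chemicals] for g in genes]
    return genes, chemicals, matrix


# "chemX" and its genes are hypothetical placeholders.
toy = {"Dioxins": {"CYP1A1", "AHR", "AHR2"}, "chemX": {"AHR", "ESR1"}}
genes, chemicals, matrix = envirome_map(toy)
```

Replacing the 0/1 entries with citation counts or ordinal scores would give the richer representation mentioned in the text, without changing the layout of the matrix.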

The CTD also contains curated data regarding the association of diseases to

chemicals. These associations have been shown either in an experimental model

physiological system or through epidemiological studies. We used these

curated associations to validate our predicted factors associated with disease.

There are 3,997 disease-chemical associations in the CTD, consisting of 653

diseases (annotated by unique MeSH terms) and 1,515 chemicals (Figure 6C).

The median, 70th, 75th, and 80th percentiles of the number of curated

chemicals per disease are 2, 3, 4, and 5, respectively.

Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD). A.) The CTD contained 85,937 total unique chemical-gene relations over 4,078 chemicals and 15,461 genes. Each relation had one or more citations of support. An example hypothetical relation, “TCDD leads to higher expression of CYP1A1 mRNA in H. sapiens as shown in Anwar-Mohamed et al.”, is seen on the right panel. B.) Creation of chemical-gene set relations. Each chemical-gene relation had a number of citations of support, xi. For each chemical, we constructed a gene set, or “signature”, from the individual chemical-gene relations. We filtered out signatures that had fewer than 5 genes in the set, leaving a total of 1,338 chemical-gene sets. An example of one chemical-gene set (a column of Figure 5D, Figure 7) is seen on the right panel of B: the genes CYP1A1, AHR, and AHR2 are shown to have multiple citations for the relation, 60, 40, and 9 respectively. In the aggregate, these signatures form the “Envirome Map” (Figure 7). C.) Representation of disease-chemical associations in the CTD used for validation.

[Figure 6 panel labels. A: chemical (4,078), gene (15,461), organism (249); 85,937 total unique chemical-gene relations, each with 1 to n cited relations; example: TCDD, increased expression of CYP1A1, H. sapiens (Anwar-Mohamed et al.). B: 1,338 total chemical-gene sets or "signatures"; example gene set for the Dioxins chemical, with 60, 40, and 9 references for CYP1A1, AHR, and AHR2 interactions; xi denotes the number of references for a chemical-gene relation. C: 653 diseases; example chemical set for the "Prostatic Neoplasms" disease (MeSH: D011471), with references associating the disease to sodium arsenite, cadmium, and bisphenol A; xi denotes the number of references for a disease-chemical relation.]

Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures. A.) Each chemical in the CTD is functionally associated with a set of genes, described earlier (Left panel, see also Figure 6B). This signature can be represented in the ‘Envirome Map’ (Right panel), whereby each column represents the genes (rows) associated for each environmental chemical in binary form. B.) An example representation for the “Dioxins” signature. The entire Envirome Map is populated with 1338 signatures (columns).

We built a system to test whether genes significantly differentially expressed

within a gene expression dataset could be associated with any of the calculated

chemical signatures in the Envirome Map (Figure 8A). We conducted two

phases of analysis in this study. The first phase was a verification one, testing

whether the method could accurately predict known chemical exposures

applied to samples (Figure 8B). Our inputs for this first phase were gene

expression datasets of chemically-exposed samples and unexposed control

samples, and our outputs were lists of chemicals predicted to be associated with

each dataset. The second investigation phase involved predicting chemicals

associated with cancer gene expression datasets (Figure 8C). Our inputs for this

second phase were gene expression datasets of cancer samples and control

samples, and our outputs were lists of chemicals predicted to be associated with

each dataset. We attempted to validate these findings further by using curated

disease-chemical relations (Figure 8D). Finally, we attempted to group our

chemical predictions associated with each cancer dataset by PubChem-derived

BioActivity similarity measures, seeking further evidence of potential

underlying mechanism or similar modes of action between chemicals.

Figure 8. Predicting environmental chemical association to gene expression datasets. A.) A representation of the 1,338 chemical-gene sets in the Envirome Map. B.) For the verification step, we used SAM to find genes whose expression was altered in each of our datasets. We then mapped the differentially expressed genes to corresponding cross-species genes in our database by using Homologene. For each chemical-gene set signature in the Map, we conducted a hypergeometric test for enrichment and ranked each result by p-value. C.) We applied the approach used in B to predict chemical association to prostate, breast, and lung cancer data and validated these results with curated disease-chemical annotations from the CTD, represented in D. D.) Representation of the curated disease-chemical associations in the CTD.

We used Significance Analysis of Microarrays (SAM) software to select

differentially expressed genes from a microarray experiment [98]. The FDR

for SAM was controlled at a maximum of 5 to 7% across all of our predictions

in order to reduce false associations.

We mapped microarray annotations to other corresponding representative

species, H. sapiens, M. musculus, and R. norvegicus using Homologene [99].

In the CTD, gene identifiers were commonly associated with H. sapiens;

however, some are mapped to specific organisms, such as M. musculus and R.

norvegicus. Most mappings in the CTD are among these 3 organisms. By

mapping our expression annotation to these organisms, we ensured gene

compatibility with a large portion of the CTD.
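
The cross-species mapping step can be sketched as a lookup through shared homology groups. This is an illustration only: the in-memory table, its group identifiers, and the function name are hypothetical and do not reproduce the actual Homologene file format or contents.

```python
from typing import Dict, Iterable, Set, Tuple

Gene = Tuple[str, str]  # (species, gene symbol)


def map_to_homologs(
    genes: Iterable[Gene],
    group_of: Dict[Gene, int],
    target_species: Set[str],
) -> Set[Gene]:
    """Expand genes to their homologs in the target species via shared homology group IDs."""
    members: Dict[int, Set[Gene]] = {}
    for gene, group in group_of.items():
        members.setdefault(group, set()).add(gene)
    mapped: Set[Gene] = set()
    for gene in genes:
        group = group_of.get(gene)
        if group is None:
            continue  # no homology group known; the gene is dropped in this sketch
        mapped |= {m for m in members[group] if m[0] in target_species}
    return mapped


# Hypothetical homology table; group IDs are invented placeholders.
table = {
    ("H. sapiens", "CYP1A1"): 1,
    ("M. musculus", "Cyp1a1"): 1,
    ("R. norvegicus", "Cyp1a1"): 1,
    ("D. rerio", "cyp1a"): 1,
    ("H. sapiens", "ESR1"): 2,
}
result = map_to_homologs(
    [("M. musculus", "Cyp1a1")],
    table,
    {"H. sapiens", "M. musculus", "R. norvegicus"},
)
```

Restricting the targets to the three species named in the text keeps the mapped genes aligned with the organisms that account for most CTD identifiers.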

We checked for enrichment of differentially expressed genes among each of

the 1,338 chemical-gene sets in the Envirome Map with the hypergeometric

test. To account for multiple hypothesis testing, we computed the q-value, or

false discovery rate for a given p-value, by using 100 random resamplings of

genes from the microarray experiment and testing each of these random

resamplings for enrichment against each of the 1,338 chemical-gene sets. This

methodology is similar to the q-value estimation method described in

“GoMiner”, a gene ontology enrichment assessment tool [100]. We called a

prediction positive if it passed a chosen p-value and q-value

threshold in our list of 1,338 tested associations. All analyses were conducted

using the R statistical environment [101].
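
The enrichment test and the resampling-based q-value can be sketched as below. This is a simplified stand-in written in Python rather than R, not the original implementation: the function names and toy data are hypothetical, and the q-value formula used here (expected number of null p-values at or below p per resampling, divided by the number of observed p-values at or below p) is one common resampling estimate that is assumed for illustration. Only the hypergeometric test and the use of 100 random resamplings come from the text.

```python
import random
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_pvalue(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N, K of which are in the signature."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def enrichment(universe: List[str], hits: Set[str], signatures: Dict[str, Set[str]]) -> Dict[str, float]:
    """Hypergeometric p-value of every signature for one set of differentially expressed genes."""
    measured = set(universe)
    out = {}
    for chem, sig in signatures.items():
        sig_m = sig & measured  # only genes measured on the array can overlap
        out[chem] = hypergeom_pvalue(len(sig_m & hits), len(measured), len(sig_m), len(hits))
    return out


def q_values(
    universe: List[str],
    hits: Set[str],
    signatures: Dict[str, Set[str]],
    n_resamples: int = 100,
    seed: int = 0,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (p-values, q-values); q is estimated from random gene sets of the same size as `hits`."""
    rng = random.Random(seed)
    observed = enrichment(universe, hits, signatures)
    null: List[float] = []
    for _ in range(n_resamples):
        fake_hits = set(rng.sample(universe, len(hits)))
        null.extend(enrichment(universe, fake_hits, signatures).values())
    q = {}
    for chem, p in observed.items():
        expected_false = sum(x <= p for x in null) / n_resamples
        n_called = sum(x <= p for x in observed.values())  # at least 1, since p itself counts
        q[chem] = min(1.0, expected_false / n_called)
    return observed, q


# Toy data, invented for illustration only.
universe = [f"g{i}" for i in range(100)]
hits = set(universe[:10])
signatures = {"enriched": set(universe[:8]), "unrelated": set(universe[50:58])}
p, q = q_values(universe, hits, signatures)
```

In this toy case the "enriched" signature lies entirely within the differentially expressed genes and receives a very small p-value, while the "unrelated" signature has no overlap and receives a p-value of 1.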

Method Verification Phase

For our verification phase, we surveyed publicly available data from the Gene

Expression Omnibus (GEO) for experiments in which sets of samples exposed

to chemicals were compared with controls. We found and used six datasets in

the verification phase. Set 1 was GSE5145 (3 study samples and 3 controls)

in which H. sapiens muscle cell samples were exposed to Vitamin D [102].

Set 2 was GSE10082 (6 study samples and 5 controls) in which wild-type M.

musculus were exposed to tetrachlorodibenzodioxin (TCDD) [103]. Set 3 was

GSE17624 in which H. sapiens Ishikawa cells (4 study samples and 4 controls)

were exposed to high doses of bisphenol A (no reference). Set 4 was GSE2111

in which H. sapiens bronchial tissue (4 study samples and 4 controls) were

exposed to zinc sulfate [104]. The CTD had some chemical-gene relations

based on this dataset; we removed these relations prior to computing the

predictions for this dataset. Set 5 was GSE2889 in which M. musculus thymus

tissues (2 study samples and 2 controls) were exposed to estradiol [105].

Finally, set 6 was GSE11352 in which the H. sapiens MCF-7 cell line was

exposed to estradiol at 3 different time points [106]. In all cases except for set

6, we ran the SAM analysis as an unpaired t-test; for set 6, we used the time-

course option in SAM. See Table 2 for the number of differentially expressed

genes found for each dataset along with their median false discovery rate.

Dataset          Chemical tested  Samples/controls (tissue type)      SAM median FDR  Differentially expressed genes / total
GSE5145 [102]    Vitamin D3       3/3 (H. sapiens muscle)             0.04            805/20555
GSE10082 [103]   TCDD             6/5 (M. musculus injection)         0.05            2066/21863
GSE17624         Bisphenol A      4/4 (H. sapiens Ishikawa cells)*    0.04            8406/20828
GSE2111 [104]    Zinc sulfate     4/4 (H. sapiens bronchial tissue)   0.05            31/13306
GSE2889 [105]    Estradiol        (M. musculus thymus)                0.07            112/13383
GSE11352 [106]   Estradiol        (H. sapiens MCF7)                   0.05            114/20555

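Table 2. Gene expression dataset summary for verification stage. The 1st column denotes the GEO accession and the 2nd column the chemical to which the samples were exposed. The 3rd column gives the number of exposed samples and controls with the tissue type, the 4th column the median FDR for SAM, and the 5th column the number of differentially expressed genes out of the total measured. * denotes “high” dosage of Bisphenol A used for the exposed sample group.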


Predicting Environmental Factors Associated with Disease-related Gene

Expression Data Sets: Prostate, Lung, and Breast Cancer

We found previously measured cancer gene expression datasets to identify

potential environmental associations with cancer. We used measurements from

human prostate cancer from GSE6919 [107, 108], lung cancer from GSE10072

[109], and breast cancer from GSE6883 [110]. We conducted all SAM

analyses using an unpaired t-test between disease and control samples. See

Table 2 for the number of differentially expressed genes measured for each

dataset along with the level of FDR control.

We deliberately chose cancer datasets that used a different population of

controls rather than normal tissues from the same patients. The prostate cancer

dataset (GSE6919) consisted of 65 prostate tissue cancer samples and 17

normal prostate tissue samples as controls.

The lung cancer dataset (GSE10072) consisted of two patient groups: non-

smokers with cancer (historically and currently), and current smokers with

cancer. We conducted the predictions on these groups separately. The cancer-

non smoker group consisted of 16 samples and the cancer-smoker group had

24 samples. The control group consisted of 15 samples.

The breast cancer dataset (GSE6883) consisted of two distinct cancer sub-

groups: non-tumorigenic and tumorigenic. As with the lung cancer data, we

conducted our predictions on these groups separately. The non-tumorigenic

group consisted of three samples and the tumorigenic group had six samples.

The control group contained three samples.

We then validated our highly ranked factor predictions with disease-chemical

knowledge from the CTD. In particular, we determined if the highly


significant chemicals in our prediction list included those that had a curated

relationship with cancer in the CTD (disease-chemical relation). This step was

similar to measuring association to chemicals via enriched gene sets using the

hypergeometric test as described above. We used curated factors associated

with Prostatic Neoplasms (MeSH ID: D011471), Lung Neoplasms (D008175),

and Breast Neoplasms (D001943), to validate our predictions generated with

the prostate cancer, lung cancer, and breast cancer datasets respectively.

Further, we assessed the validation by computing the actual number of false

positives and true negatives. To compute this number, we assessed whether

the prediction list was enriched for chemicals associated with any of the other

diseases in the CTD at a higher significance level than the true disease; for this

test, we chose diseases that had at least 5 chemical associations, a total of 141

diseases. As an example, to assess the false positive rate for the prostate

cancer (MeSH ID: D011471) predictions, we determined the curated

enrichment of our predictions for all 140 other disease-chemical sets and

counted the number of diseases that had a lower p-value than that computed for

D011471.
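The false positive count described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function name `false_positive_rate` and the example p-values are hypothetical, and only the rule itself (count the other diseases whose enrichment p-value is lower than that of the true disease) comes from the text.

```python
def false_positive_rate(pvalues_by_disease, true_disease):
    """Fraction of *other* diseases enriched at a lower p-value than the true disease.

    pvalues_by_disease maps a disease identifier to the p-value of the
    curated disease-chemical enrichment test for one prediction list.
    """
    true_p = pvalues_by_disease[true_disease]
    others = {d: p for d, p in pvalues_by_disease.items() if d != true_disease}
    # A "false positive" is any other disease that looks more enriched
    # than the disease the predictions were actually generated for.
    false_positives = sorted(d for d, p in others.items() if p < true_p)
    return len(false_positives) / len(others), false_positives


# Hypothetical enrichment p-values (illustrative only, not results from this work).
example = {
    "D011471": 3e-7,  # Prostatic Neoplasms, the true disease
    "Carcinoma, Hepatocellular": 1e-8,
    "Liver Neoplasms": 5e-6,
    "Asthma": 0.4,
    "Hypertension": 0.7,
}
rate, fps = false_positive_rate(example, "D011471")
print(rate, fps)
```

In the analysis described here the dictionary would hold 141 diseases, so the denominator would be the 140 other disease-chemical sets.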

Clustering Significant Predictions By PubChem-derived Biological Activity

Chemical-gene sets derived from the CTD are but one representation of how a

chemical might affect biological activity. Biological activity of chemicals may

also be derived from high-throughput, in-vitro chemical screens such as those

archived in PubChem [87, 111]. Specifically, the PubChem database provides

a large number of phenotypic measurements (or “BioAssays”) for many of the

chemicals we predicted for cancer. In addition, PubChem provides tools to

compare BioAssay measurements for different chemicals. Quantitative and

standardized BioAssay measurements (normalized “scores”) allow comparison

of biological activities of chemicals and derivation of biological activity

similarity between chemicals. For example, PubChem represents the biological


activity of a compound through a vector of BioAssay scores and assembles a

bioactivity similarity matrix between each pair of chemicals with this data.

We sought further external evidence of the relevance of the predicted

chemicals through comparison of their patterns of PubChem-sourced biological

activity (Figure 9). First, we produced a list of chemical predictions for each

cancer dataset as described above (Figure 8, Figure 9A, Figure 9B) and

submitted our list of chemicals to PubChem for activity comparison (Figure 9).

Finally, we observed patterns of correlation between PubChem-derived

biological activities of the compounds to their chemical-gene set association

significance by clustering the chemicals in the prediction list by their

biological activity.
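To make the idea of grouping chemicals by bioactivity similarity concrete, the sketch below computes a pairwise Pearson correlation between vectors of BioAssay scores and groups chemicals whose correlation exceeds a cutoff. It is an illustration under stated assumptions: the similarity in this work was computed by PubChem's own tool, the thresholded single-linkage grouping here is a simplification swapped in for brevity, and the chemical names, scores, and the `min_similarity` cutoff are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def similarity_matrix(scores):
    """Map every ordered pair of chemicals to the correlation of their scores."""
    names = sorted(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}


def cluster(scores, min_similarity=0.8):
    """Single-linkage grouping: join chemicals linked by similarity >= cutoff."""
    sim = similarity_matrix(scores)
    clusters = []
    for name in sorted(scores):
        # Existing clusters that contain at least one sufficiently similar member.
        linked = [c for c in clusters
                  if any(sim[(name, m)] >= min_similarity for m in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical BioAssay score vectors (four assays per chemical).
scores = {
    "chem_a": [1, 2, 3, 4],
    "chem_b": [2, 4, 6, 9],
    "chem_c": [4, 3, 2, 1],
    "chem_d": [9, 6, 4, 2],
}
groups = cluster(scores)
print(groups)
```

A hierarchical clustering with a dendrogram, as in a heatmap display, would preserve more of the similarity structure than a single cutoff does; the cutoff version is used here only because it fits in a few lines.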


Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem. A.) A representation of the CTD-based Envirome Map as shown in detail in Figure 6. B.) Prediction of the chemicals associated with each cancer dataset using chemical-gene sets from the CTD. We selected highly significant chemical predictions for each cancer and clustered these chemicals by their “Bioactivity” similarity as defined and computed in PubChem. C.) Within PubChem, each of these chemicals has a vector of standardized BioAssay scores. PubChem had 790 BioAssay scores for 66 of our significant predictions. The PubChem BioActivity similarity tool uses these vectors of scores to compute the biological activity similarity for each pair of chemicals, and the similarity is represented as a matrix.

RESULTS

We implemented a method to predict a list of environmental factors associated

with differentially expressed genes (Figure 8). The method is centered on

creation of the Envirome Map (Figure 5D-F, Figure 7), an aggregation of

chemical-gene sets that are derived from single curated chemical-gene

response relationships in the CTD (Figure 6). We determine whether the

differentially expressed genes are associated to a chemical by assessing if the


expressed genes are enriched for a chemical-gene set, or contain more genes

from the chemical-gene set than expected at random using the hypergeometric

test. We applied this method in two phases, the first a verification phase in

which we sought to rediscover known exposures applied to samples, and a

query phase, in which we sought to find factors associated with cancer gene

expression datasets. We refer to significant chemical-gene set associations to

gene expression data as “associations” or “predictions” in the following.
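The enrichment test described above can be written out as a hypergeometric upper-tail probability. The sketch below is a minimal standard-library illustration, not the implementation used in this work; the function and parameter names are assumptions introduced for the example.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, set_size, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    total_genes: genes measured on the array
    set_size:    genes in the chemical-gene set (among the measured genes)
    n_selected:  differentially expressed genes
    n_overlap:   differentially expressed genes that are also in the set
    """
    upper = min(set_size, n_selected)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(set_size, k) * comb(total_genes - set_size, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(total_genes, n_selected)


# Toy example: 10 genes, a set of 4, 3 selected, 2 of them in the set.
# The tail is (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
print(p_toy)
```

Exact integer arithmetic is used so that very small tail probabilities are not lost to rounding before the final division.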

Verification Phase

We first applied our method to gene expression data from experiments in

which samples were exposed to specific chemicals, reasoning that if our

method could identify these known chemical exposures, we could use the

method to predict chemicals that may have perturbed gene expression in

unknown experimental or disease conditions. Our goal was to determine

where a gene expression-altering chemical might lie in the range of

significance rankings applied by the prediction method.

We applied our method on datasets that measured gene expression after

exposure to vitamin D, tetrachlorodibenzodioxin (TCDD), bisphenol A, zinc,

and estradiol (2 datasets) on different tissue types. Table 3 shows the results of

our predictions along with a subset of genes in the chemical-gene set that were

differentially expressed.

Actual Chemical Exposure (GEO accession) | Chemicals Predicted | Hypergeometric P-value | Rank (Percentile) | q-value | Relevant Genes Expressed
Vitamin D3 on H. sapiens muscle cells (GSE5145) | Calcitriol | 1x10^-23 | 1 (100) | ~0 | VDR (25), CYP24A1 (14)
TCDD on M. musculus (GSE10082) | TCDD | 2x10^-15 | 3 (99) | ~0 | CYP1A1 (59), CYP1B1 (15), AHRR (6), CYP1A2 (14)
Bisphenol A on H. sapiens Ishikawa cells (GSE17624) | Bisphenol A | 1x10^-6 | 15 (99) | ~0 | ESR1 (31), ESR2 (7), S100G (6)
Zinc sulfate on H. sapiens bronchial tissue (GSE2111) | Zinc sulfate | 3x10^-3 | 15 (99) | 0.04 | SLC30A1 (3), MT1F (2), MT1G (2)
Estradiol on M. musculus thymus (GSE2889) | Estradiol | 5x10^-3 | 17 (99) | 0.08 | C3 (6), LPL (4), CTSB (2)
Estradiol on H. sapiens MCF7 cell line (GSE11352) | Estradiol | 5x10^-3 | 19 (99) | 0.08 | ISG20 (2), MGP (2), SERPINA1 (2)

Table 3. Chemical Prediction Results from the Verification Phase. Each row represents a gene expression dataset and relevant prediction and ranking. The first column specifies the gene expression dataset, the 2nd column the actual exposure applied to the samples for the gene expression set. The 3rd and 4th columns represent the hypergeometric p-value for chemical-gene set enrichment along with the rank of the chemical in the prediction list. The 5th column shows the 5th percentile of the ranking derived from 100 random samplings of genes from the gene expression dataset. The 6th column shows notable genes expressed in the chemical-gene set along with the number of references for the chemical-gene relation in the CTD.

We were able to satisfactorily predict the exposures applied to the gene

expression datasets. We ascertained a positive prediction if the exposure had a

relatively high ranking (low p-value for enrichment) and if the q-value was

lower than 0.1. For the dataset measuring expression after exposure to

vitamin D, calcitriol, a form of vitamin D, was ranked first in the list (p=10^-23,

q~=0). Similarly, TCDD was ranked third in its respective list (p=10^-15,

q~=0). The other exposures ranked within the top percentile, at ranks 15

to 19, with p-values between 10^-6 and 0.01 and q-values

less than 0.1. We reasoned that we could detect true associations between

environmental chemicals and gene expression phenotypes provided they met

these significance thresholds.
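The decision rule used to call a positive prediction (a high rank, that is a low enrichment p-value, together with a q-value below 0.1) can be sketched as below. The function name, the tuple layout, and the example values are hypothetical, and the p-value cutoff `alpha` is an assumed parameter; only the q < 0.1 criterion is stated explicitly in the text.

```python
def call_predictions(results, alpha=0.01, q_max=0.1):
    """Rank chemicals by p-value and keep those passing both thresholds.

    results: iterable of (chemical, p_value, q_value) tuples.
    Returns a list of (rank, chemical) for the positive calls.
    """
    ranked = sorted(results, key=lambda r: r[1])  # rank 1 = smallest p-value
    return [
        (rank, chemical)
        for rank, (chemical, p, q) in enumerate(ranked, start=1)
        if p <= alpha and q < q_max
    ]


# Hypothetical prediction list (illustrative values only).
results = [
    ("chemical C", 5e-3, 0.2),   # low p-value but q-value too high
    ("chemical A", 1e-6, 0.0),
    ("chemical D", 0.2, 0.6),
    ("chemical B", 3e-3, 0.04),
]
calls = call_predictions(results)
print(calls)
```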

Predicting Environmental Chemicals Associated with Cancer Data Sets

We applied our prediction methods to predict association to cancer disease

states, specifically merging the Envirome Map with prostate, breast, and lung

cancer datasets. In particular, we computed predictions for prostate cancer

from primary prostate tumor tissue, lung adenocarcinomas from lung tissue

from non-smoking individuals, and non-tumorigenic breast cancer cells grown

in mouse xenografts. To validate and select specific predictions from our

ranked list of 1,338 environmental chemicals from the Envirome Map, we

measured how enriched top-ranking chemicals were for annotated disease-

chemical citations in the CTD for diseases of interest (“Prostatic Neoplasms”, “Breast

Neoplasms”, and “Lung Neoplasms”). To call a positive chemical association

or prediction for a disease phenotype, we used p-value thresholds similar to those

we observed during the verification phase (α ≤ 10^-4, 0.001, 0.01) along with

q-values as low as possible, specifically less than 0.1. For comparison, we also

used the typical p-value threshold of 0.05.

Figure 10, Figure 11, and Figure 12 show the results of the disease validation

phase. In all cases, the significant chemicals contained many of the specific

curated disease-chemical relations. For example, if we call chemicals with p-

values less than 0.01 as positive predictions, then we were able to capture 18%,

16%, and 7% of all of the curated relationships for prostate, lung, and breast

cancers respectively (p=10^-7, 10^-4, and 4x10^-5). We assessed the specificity of our

list by computing how many curated chemicals we found for all other diseases

in the CTD (Figure 10, Figure 11, and Figure 12, offset points in orange and

black). We achieved false positive rates of 1 to 4% for prostate cancer, 8

to 20% for lung cancer, and 2 to 10% for breast cancer. However, almost all of

the “false positives” were other types of neoplasms or cancers (Figure 10,

Figure 11, and Figure 12, examples annotated in italics/arrows). For example,

for the lung and prostate cancer predictions at α=0.001 only 1 disease other

than neoplasm or carcinoma was detected: Liver Cirrhosis, Experimental

(MeSH ID: MESH:D008325).


Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets. For a prediction list, we selected chemicals that ranked within α=10^-4, 10^-3, 10^-2, and 0.05. This –log10(threshold) along with number of total chemicals found (in parentheses) for each threshold is seen on the x-axis of each figure. We tested if these highly ranked chemicals found under each threshold were enriched for chemicals that had known curated association with the cancer in question. The –log10(p-value) for this enrichment is seen on the y-axis. The solid round red marker represents the enrichment test for the actual disease for which the predictions were based; the number underneath represents the total number of chemicals found in the prediction list that had a curated association with the disease and the percent found among all curated relations for that disease. We estimated accuracy and precision by computing disease-chemical enrichment for all other diseases; false positives are offset in black and true negatives are in yellow. The false positive rate is bracketed and in italics. Examples of false positives are annotated in blue italics along with the number of chemicals found in the prediction list corresponding to that disease and the percent found among all curated relations for that disease.

[Figure 10 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (89), 2 (68), 3 (50), 4 (27). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 10. Legend: True Negatives, False Positives, Prostatic Neoplasms. Curated Prostatic Neoplasms chemicals found at each threshold: 13 (19%), 12 (18%), 10 (15%), 7 (10%). Bracketed false positive rates shown: 2%, 1%, 3%, 4%. Annotated false positives: Carcinoma, Hepatocellular 10 (30%); Liver Neoplasms 10 (29%); Liver Cirrhosis 10 (16%).]

Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets. See Figure 10 for complete legend.


[Figure 11 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (84), 2 (73), 3 (42), 4 (29). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 14. Legend: True Negatives, False Positives, Lung Neoplasms. Curated Lung Neoplasms chemicals found at each threshold: 9 (18%), 8 (16%), 7 (14%), 4 (8%). Bracketed false positive rates shown: 10%, 9%, 8%, 20%. Annotated false positives: Prostatic Neoplasms (15%); Carcinoma, Hepatocellular 9 (27%); Mammary Neoplasms, Experimental 7 (28%); Liver Cirrhosis 9 (15%).]

Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets. See Figure 10 for complete legend.

[Figure 12 plot. x-axis: −log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (86), 2 (28), 3 (11), 4 (5). y-axis: −log10(p-value) of factor-disease enrichment, 0 to 6. Legend: True Negatives, False Positives, Breast Neoplasms. Curated Breast Neoplasms chemicals found at each threshold: 9 (10%), 7 (7%), 3 (3%), 0 (0%). Bracketed false positive rates shown: 2%, 10%, 4%. Annotated false positives: Carcinoma, Hepatocellular 5 (33%); Prostatic Neoplasms 7 (11%); Skin Neoplasms 4 (19%).]

For the prostate cancer dataset, we chose a chemical signature association

threshold of 0.001 (q ≤ 0.01). Of 1,338 chemicals tested, 50 were found

under this threshold. Of these 50 predicted chemicals, 10 had a curated

relation with the MeSH term “Prostatic Neoplasms”. This amounted to

prediction of 15% of all CTD curated disease-chemical relations for the

Prostatic Neoplasms term (p = 3x10^-7). These chemicals are seen in Table 4

and include estradiol, sodium arsenite, cadmium, and bisphenol A. Also

predicted were known therapeutics, including raloxifene, doxorubicin,

genistein, diethylstilbestrol, fenretinide, and zinc. We observed that many of

the genes detected were well studied, lending additional support to our predictions.

For example, ESR2, PGR, and MAPK1 had 37, 34, and 14 references

respectively citing their activity in the context of estradiol exposure (Table 4,

second-to-right column). In addition, we observed common occurrence of genes

such as ESR2, BCL2, and MAPK1 among some of the gene sets associated

with chemicals such as estradiol, raloxifene, sodium arsenite, doxorubicin,

diethylstilbestrol, and genistein.

Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Estradiol | 4x10^-10 | 5 (99) | ~0 | ESR2 (37), PGR (34), MAPK1 (14) | [112]
Raloxifene | 1x10^-9 | 6 (99) | ~0 | ESR2 (6), IGF1 (5), BCL2 (4) | [113]
Sodium arsenite | 1x10^-8 | 8 (99) | ~0 | JUN (13), MAPK1 (9), CCND1 (8), FOS (6) | [114]
Doxorubicin | 7x10^-7 | 11 (99) | ~0 | BCL2 (23), MAPK1 (14), TNF (10) | [115-118]
Cadmium | 6x10^-6 | 13 (99) | ~0 | MT2A (14), MT1A (12), MT3 (11), MT1 (6) | [119]
Genistein | 3x10^-5 | 19 (99) | 6x10^-4 | ESR2 (22), PGR (10), MAPK1 (5) | [120-122]
Diethylstilbestrol | 3x10^-5 | 22 (98) | 0.001 | ESR2 (8), FOS (8), HOXA10 (4) | [123, 124]
Fenretinide | 3x10^-4 | 40 (97) | 0.004 | BCL2 (3), ELF3 (2), LDHA (2) | [125]
Bisphenol A | 6x10^-4 | 47 (96) | 0.01 | PGR (8), ESR2 (7), IL4RA (2) | [112]
Zinc | 9x10^-4 | 53 (96) | 0.01 | MT3 (18), MT2A (13), MT1A (11) | [126-129]

Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919). Shown in the table is a subset of the highly ranked chemicals (p < 0.001) that were predicted to have association with prostate cancer gene expression and had evidence of association with the MeSH term “Prostatic Neoplasms” as in the CTD. The 1st column represents the chemical predicted and the 2nd and 3rd columns show the hypergeometric p-value and ranking. The 4th column shows q-value derived from random samples of genes. The 5th column shows the notable genes in the chemical-gene set that were differentially expressed. The 6th column contains references for the prostate cancer and chemical association found from the CTD.

For the lung cancer dataset, we also chose a threshold of 0.001 (q ≤ 0.004). Of

1,338 chemicals tested, 42 were found under this threshold. Of these 42

chemicals, 7 had a cited relation with “Lung Neoplasms”, 14% of all curated

disease-chemical relations for the term (p = 1x10^-5). These chemicals are seen

in Table 5. For lung cancer, we observed cited chemicals such as sodium

arsenite, vanadium pentoxide, dimethylnitrosamine, 2-acetylaminofluorene, and

asbestos. Therapeutics observed included doxorubicin and indomethacin. We

did not observe common genes represented for different chemical-gene sets,

unlike the prostate cancer predictions. Predictions for the smoker-lung cancer

samples were similar, resulting in sodium arsenite, dimethylnitrosamine, and

vanadium pentoxide, albeit through different differentially expressed genes.

Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Doxorubicin | 1x10^-6 | 16 (99) | 4x10^-4 | CASP3 (60), ABCB1 (28), BAX (26), BCL2 (23) | [130]
Sodium arsenite | 8x10^-6 | 20 (98) | 4x10^-4 | JUN (13), NQO1 (6), EGR1 (6) | [131-133]
Vanadium pentoxide | 1x10^-5 | 24 (98) | 6x10^-4 | HBEGF (3), CDK7 (1), CDKN1B (1), CDKN1C (1) | [134]
Dimethylnitrosamine | 6x10^-5 | 27 (98) | 7x10^-4 | TGFB1 (23), TIMP1 (15), PCNA (6) | [135]
Indomethacin | 2x10^-4 | 34 (97) | 0.002 | BIRC5 (3), CDKN1B (2), MMP9 (2) | [136-138]
2-Acetylaminofluorene | 3x10^-4 | 36 (97) | 0.003 | ABCB1 (4), ABCG2 (4), KRT19 (2) | [139]
Asbestos, Serpentine | 4x10^-4 | 39 (97) | 0.004 | IL6 (2), MMP9 (2), MMP12 (2), PDGFB (2) | [140]

Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072). Shown in the table are subsets of the highly ranked chemicals (p < 0.001) that were predicted to have association with lung cancer gene expression (non-smokers) and had evidence of association with the MeSH term “Lung Neoplasms”.

For the breast cancer dataset, we chose a threshold of 0.01 (q ≤ 0.08). Of

1,338 chemicals tested, 28 were found under this threshold. Of these 28

chemicals, 7 had a cited relation with “Breast Neoplasms”, 7% of all curated

disease-chemical relations for the disease (p = 4x10^-5). These chemicals are seen

in Table 6. The chemicals predicted included progesterone and bisphenol A.

Therapeutics found included indomethacin and cyclophosphamide. There was

evidence for both a harmful chemical and a therapeutic for chemicals such as

estradiol, genistein, and diethylstilbestrol for breast cancer. Unlike the

predictions shown for prostate and lung cancer, the genes utilized in the

predictions for breast cancer were not as well studied, with 1 to 3 references


for the gene and environment association. We observed some commonality in

chemical-gene sets, such as the presence of IL6 and CEBPD in several of the

top chemicals predicted in association to the disease. Similar chemicals were

predicted for the tumorigenic breast cancer dataset, such as estradiol and

progesterone. However, chemicals not highly ranked in the non-tumorigenic

predictions included benzene and the therapies tamoxifen and resveratrol.

Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations
Progesterone | 2x10^-4 | 6 (99) | 0.01 | IL6 (3), STC1 (3), CEBPD (2) | [141, 142]
Genistein | 6x10^-4 | 10 (99) | 0.03 | CEBPD (1), APLP2 (1), MLF1 (1) | [143-145]
Estradiol | 7x10^-4 | 11 (99) | 0.03 | LPL (4), IL6 (3), CEBPD (2) | [146-150]
Indomethacin | 3x10^-3 | 17 (99) | 0.05 | CCDC50 (1), BIRC3 (1), DNAJB (1) | [151]
Diethylstilbestrol | 3x10^-3 | 18 (99) | 0.05 | IL6 (1), MARCKS (1), MXD1 (1), MMP7 (1) | [152, 153]
Cyclophosphamide | 4x10^-4 | 19 (99) | 0.06 | IL6 (3), MARCKS (1), PSMA5 (1) | [154-156]
Bisphenol A | 6x10^-3 | 21 (99) | 0.08 | CEBPD (1), MLF1 (1), DTL (1) | [157]

Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883). Shown in the table are subsets of the highly ranked chemicals (p < 0.01) that were predicted to have association with breast cancer gene expression (non-tumorigenic) and had evidence of association with the MeSH term “Breast Neoplasms”.


Some of the chemicals found were common to more than one type of cancer.

For example, we predicted chemicals such as sodium arsenite for both prostate

cancer and lung cancers, and bisphenol A for both prostate and breast cancers.

In some cases, the predicted chemical overlap across different cancers

is due to the expression of distinct genes for each dataset, highlighting the

potential of many possibilities for interaction between environmental

chemicals and genes.
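The overlap between cancers can be tabulated directly from the validated predictions listed in Tables 4, 5, and 6. The sketch below does this with set intersections; the chemical names are taken from those tables (lower-cased for consistency), while the variable and function names are hypothetical.

```python
from itertools import combinations

# Validated predictions as listed in Table 4 (prostate), Table 5 (lung,
# non-smokers), and Table 6 (breast, non-tumorigenic).
validated = {
    "prostate": {
        "estradiol", "raloxifene", "sodium arsenite", "doxorubicin", "cadmium",
        "genistein", "diethylstilbestrol", "fenretinide", "bisphenol A", "zinc",
    },
    "lung": {
        "doxorubicin", "sodium arsenite", "vanadium pentoxide",
        "dimethylnitrosamine", "indomethacin", "2-acetylaminofluorene",
        "asbestos, serpentine",
    },
    "breast": {
        "progesterone", "genistein", "estradiol", "indomethacin",
        "diethylstilbestrol", "cyclophosphamide", "bisphenol A",
    },
}


def pairwise_overlap(predictions):
    """Chemicals shared by each pair of cancers, keyed by the (sorted) pair."""
    return {
        (a, b): sorted(predictions[a] & predictions[b])
        for a, b in combinations(sorted(predictions), 2)
    }


overlap = pairwise_overlap(validated)
for pair, shared in overlap.items():
    print(pair, shared)
```

Note that this compares only the chemical names; whether a shared chemical is supported by the same or by distinct differentially expressed genes requires comparing the gene columns of the tables as well.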

Clustering Significant Predictions by PubChem-derived Biological

Activity

We have described a method of generating a list of chemical predictions

associated with disease-annotated gene expression datasets and applied the

method on gene expression data for several cancers, in effect merging a

comprehensive representation known as the Envirome Map with disease

datasets. We have validated a subset of our predictions with evidence from the

literature as described above.

We sought further evidence of the biological relevance of our

predictions through internal comparison of their potential activity archived in

PubChem. Specifically, we expected some degree of correlation between

“similar” chemicals and their gene set significance to the cancer datasets. We

opted to use PubChem BioActivity to assess chemical similarity, assuming this

measure of phenotypic similarity would be representative of underlying

biological pathways of action. We picked chemicals that were deemed

significant at the thresholds used above (p=0.001, 0.001, and 0.01 for the prostate,

lung, and breast cancer datasets, respectively). This resulted in

a total of 130 chemicals, 66 of which had BioActivity data in PubChem. The

BioActivity similarity for each of the 66 chemicals was computed through 790

BioAssay scores. Figure 13 shows the –log10 of significance for the highest

ranked chemical predictions clustered by their BioActivity similarity.


We found some chemicals with similar biological activity profiles in

PubChem had similar patterns of chemical-gene set association across the

cancer datasets. For example, sodium arsenite, sodium arsenate, and

doxorubicin have closely related biological profiles as well as high

significance of chemical-gene set association for the prostate and lung cancer

data (Figure 13, enclosed in orange box); however, we did not observe this pattern for other

biologically similar chemicals such as tetrachlorodibenzodioxin. On the other

hand, we also observed correlation between the biological activity similarity

and chemical-gene set association for hormone or steroidal chemicals such as

ethinyl estradiol, estradiol, and diethylstilbestrol as well as progesterone and

corticosterone (Figure 13, enclosed in purple boxes).


Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity. Highly significant chemical prediction p-values (p=0.001, 0.001, and 0.01 for the prostate, lung, and breast cancer datasets, respectively) are reordered by their BioActivity similarity computed by PubChem. A column represents the cancer analyzed and each cell corresponds to the chemical-gene set association –log10(p-value). Examples of correlation between BioActivity similarity and chemical-gene set significance include the sodium arsenite, sodium arsenate, and Doxorubicin cluster (labeled in orange), and the Genistein, Estradiol, Ethinyl Estradiol, and Diethylstilbestrol cluster and the Progesterone, Tretinoin, and Corticosterone cluster (labeled in purple). Other examples of BioActivity similarity and chemical-gene set association include the chemicals vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride (outlined in blue).

[Figure 13 heatmap. Rows: predicted chemicals ordered by PubChem BioActivity similarity. Columns: prostate cancer (GSE6919), lung cancer non-smokers (GSE10072), breast cancer non-tumorigenic; column labels in the plot: chandran, landi, liu. Color key and histogram: −log10(p-value), 0 to 8.]

DISCUSSION

We have developed a knowledge- and data-driven method to predict chemical

associations with gene expression datasets, using publicly available and

previously disjoint datasets. Specifically, we have created a functional, gene

expression representation of 1,338 environmental chemicals called the

Envirome Map (Figure 5D-F) and have developed a quantitative method to

query this map. To our knowledge, there are few methods that generate

hypotheses regarding environmental associations with disease from gene

expression data. Most current approaches in toxicology have focused on a

small number of environmental influences on single or small groups of genes,

while current approaches in toxicogenomics have been concentrated on

measuring genome-wide responses for a few chemicals [158]. Our prediction

method enables the generation of hypotheses in a more scalable manner using

existing data, examining the potential role of hundreds of chemicals over

thousands of genome-wide measurements and diseases.

As an example, we predicted chemicals such as sodium arsenite in

association with prostate and lung cancers, estrogenic compounds such as

bisphenol A and estradiol with prostate and breast cancers, and

dimethylnitrosamine with lung cancer. Although each has curated knowledge

behind the association in the CTD, mechanisms for the action are not well

known and call for further study. So far, Benbrahim-Talaa et al. have found

hypomethylation patterns in the presence of arsenic in prostate cancer cells

[114]. Zanesi et al. show a potential interaction between the FHIT gene and

dimethylnitrosamine in producing lung cancers [135]. Evidence of a complex

mechanistic action of estrogens, such as estradiol, on breast cancer

carcinogenesis has been established [159]; however, the role of other

estrogen-like compounds has only recently been studied. For example,

bisphenol A has been shown to invoke an aggressive response in cancer cell

lines [160], possibly by affecting estrogen-dependent pathways [161]. It is

evident that more experimentation is required involving the measurements of

exposure-affected proteins and genes and their activation state in cellular

models and their relation to the chemical signatures.

An overlap of activity of the same genes induced by different chemicals would

suggest a common physiological action by these chemicals. For example, the

ESR2 and MAPK1 genes in the prostate cancer prediction, and the IL6 and

CEBPD in the breast cancer predictions, were associated with several

chemicals for each of the diseases. We also found an overlap in predicted

chemicals among different cancers. This overlap arises from the

correlation in the significant pathways shared by these cancers; however, it

may also indicate a need to explore less significant associations in order to find

unique and specific gene expression/chemical exposure relationships for a

given disease. Furthermore, this result may also indicate a bias of gene and

chemical relationships cataloged in the CTD. For example, it could be that

genes specific to common cancer-related pathways are those that are well

studied, such as BCL2 or ESR2.

Related to this, we have attempted to show how biological activity, as assayed

in a high-throughput chemical screen in PubChem, can be correlated with

chemical gene-set associations. Observing both a correlation in PubChem-derived

bioactivity and a chemical-gene set association from the

CTD provides a way to identify shared modes of action among groups of

similar or related chemicals. These data serve both to provide internal

validation for the list of predicted chemicals acting through similar pathways (such

as those induced by estrogen) and to prioritize hypotheses. For example,

we did not find curated evidence in the CTD for association of the chemicals

vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride to prostate or

lung cancers; however, their similar bioactivity profiles (Figure 13, enclosed in

blue box) and high chemical-gene set association call for further review.

We do acknowledge some arbitrariness in our choice of methods and

thresholds; most of these were chosen to show significance in our

methodology without adding complexity. We could have chosen any of

several alternative approaches to implementing our method; however,

predictions made with the Gene Set Enrichment Analysis (GSEA) [162]

method during the verification phase were not as sensitive (not shown).

Another limitation in our first implementation is that in calculating the

chemical signatures associating chemicals with gene sets, we ignored the

specific degree of expression change (up or down) encoded in the CTD. We

decided not to use this information due to the presence of contradictions (some

references may point to an increase of exposure-induced gene expression while

another reference might claim the opposite), and other preliminary work

suggesting that filtering by the degree of change reduced sensitivity (data not

shown). Because of these limitations, direction of association cannot be

inferred. Further still, we acknowledge richer and more refined chemical

signatures along with further integration with resources like PubChem will

need to be built to make the most accurate predictions.

Another issue with querying the microarray data of any experiment is the lack

of full sample information to stratify results; for example, different exposures

may be associated with a subset of the samples. A related concern includes

small sample sizes of some of the datasets used to evaluate the method. For

example, the best predictive power was seen with the largest dataset (prostate

cancer, GSE6919), and the worst with one of the smallest (breast cancer,

GSE6883). Despite this heterogeneity and lack of power, we still arrived at

noteworthy and literature-backed findings warranting further study. We also

urge that more evaluation must occur with datasets that have a larger number

of samples.

Most importantly, we stress that these types of association remain as

predictions and hypotheses that need validation and verification. The method

presented here is not a substitute for traditional toxicology or epidemiology.

These studies are required to provide quantitative and population generalizable

estimates of disease risk and dose-response relationships. However, as the

space of environmental chemicals potentially causing biological

effects is large, we suggest that this methodology would give investigators at

least some indication of where to start the search for environmental causal factors to

study in these other modes. We believe that, like the Connectivity Map, the

Envirome Map is a feasible and practical way to represent toxicological

response for use in prediction. Predicting a linkage between chemicals, genes,

and clinically-relevant disease phenotypes using existing resources falls in line

with the National Academies’ vision of high-throughput efforts to decipher

genome-wide toxicity response to disease [13].

CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE

ASSOCIATIONS ON DISEASE AND DISEASE-RELATED

PHENOTYPES ON POPULATIONS.

INTRODUCTION

Complex diseases and adverse phenotypes arise due to the contribution of

multiple interacting genetic and environmental factors [2], but despite this,

many recent epidemiological or population-based studies have emphasized the

genetic components. For example, the Genome-wide Association Study

(GWAS) is a low-cost, commoditized, and popular framework used by

researchers to evaluate genetic factors that correlate with disease status on a

genome-wide scale [9, 163-165] (Figure 5B). As a function of its accessibility

and the nature of the simple measurements assayed in GWAS, standards for

cross-study comparison and reporting of genetic association in epidemiology

have been established, at the very least calling for comprehensive, systematic, and

agnostic reporting of associations and their validation results. As a result, over

370 GWAS have been published, often over 20 for specific diseases such as

T2D [9].

While GWAS has strengthened the epidemiological process and methodology

of screening and validating genetic variants, most of the findings attained

through the many studies have been unable to explain a large portion of risk

variability between individuals and are of modest effect size [32, 33].

Furthermore, the variants found have not been able to shed light on the biology of

disease. One hypothesis for this is that complex disease arises as a

result of the sum of effects of variants that are less prevalent than those assayed in

GWAS [166]. Another is that these studies have not considered the joint

contributions of both genetics and the environment. However, before we may

address the latter hypothesis, we must understand what environmental factors

are associated with disease.

Despite the limited relevance of the genetic variants found and, more importantly, the

fact that diseases arise out of the contribution of both genetics and the

environment, there exists no analogous or comparable platform to assay and

analyze enviromic associations to disease on an epidemiological scale.

Specifically, given multiple environmental factors, or the “envirome” (Chapter 1),

measured on a population, we ask an analogous question to that asked in

genome-wide association studies: what specific environmental factors are

associated with a disease or phenotype of interest out of all possible individual

environmental factors measured on an epidemiological scale? Note that these

associations are of a different scale than those covered in Chapter 2 (“Mapping

Multiple Toxicological Responses To Complex Disease”).

To answer this question, we propose an analogous framework to GWAS,

called “Environment-wide association study” (EWAS), to search for and

analytically validate environmental factors associated with continuous

phenotypes or discrete ones such as disease (Figure 5A). This type of question

is different from a hypothesis-driven approach in which candidate

environmental factors are chosen a priori and tested individually in their

association to a phenotype, and is analogous to questions facilitated by GWAS.

We begin our description of EWAS by introducing the genome-wide analog,

GWAS. Second, we describe the EWAS framework and third, describe

differences between genome-wide and envirome-wide epidemiological studies.

Fourth, we describe the current EWAS methodology. Last, we discuss our

results and posit ways to extend the EWAS methodology. In the following

chapter, we describe specific, peer-reviewed, published applications of EWAS.

METHODS BACKGROUND

Genome-wide association to disease

With the sequencing of the genome and projects that characterized common

genetic variation such as the HapMap, investigators are now able to interrogate

how genome-wide genetic differences are associated with disease and disease-

related phenotypes on an epidemiological scale [25, 167]. These revolutionary

studies, known as “genome-wide association studies” (GWAS), have enabled

investigators to ask what common genetic loci are associated to a phenotype in

an agnostic, systematic, and comprehensive way with explicit control of

multiplicity.

Specifically, during the HapMap project, common single nucleotide polymorphism (SNP)

variants have been catalogued on the basis of their population frequency (≥

10% population frequency), and major and minor allele versions [168]. The

location of each SNP along the genome is referred to as a “locus” and the

presence of variation at a particular locus denotes a “polymorphism” or a

“polymorphic” locus. “Common” polymorphisms are those that occur at

approximately greater than 5-10% in the population. Thus, by definition, a

“common” SNP must reside at a polymorphic locus. There are greater than 1

million common SNPs in the genome [25]. While SNPs are the most common

type of polymorphism in the genome accounting for 90% of genetic variation,

many other types of genetic variation exist, such as copy number variants,

insertions, and deletions.

GWAS relate traits to variation at each common polymorphic locus in the

genome, or at a large subset of these loci, and are enabled by genomic technologies,

known as “SNP microarrays”, which can assay greater than 1 million loci

simultaneously for an individual. These microarrays are now mere commodity

items, like computers, making accessible genome-wide measurements on a

large number of individuals [169]. Further, these technology platforms are

known to have very low measurement error [10].

GWAS are constructed by recruiting thousands of individuals with (“cases”)

and without a trait or disease (“controls”). Genotype frequencies at each locus

across the genome are then compared between cases and controls using

common statistical tests such as the chi-squared test [8], assuming

independence between loci. Continuous traits, such as levels of a

biomarker, may also be related to genetic variation by modeling the

continuous phenotype in a linear regression model [170]. Multiple

comparisons are accounted for through conservative Bonferroni adjustment,

and significant loci are validated in independent populations, often (but not

always) of differing demographic character than that of the original screen.
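
As a minimal sketch of the case-control comparison described above, the following Python computes a Pearson chi-squared test on a 2 x 3 genotype table at each locus and applies a Bonferroni threshold. The genotype counts, the locus names, and the function name `genotype_chi2` are hypothetical and chosen only for illustration; the closed-form p-value relies on the test having 2 degrees of freedom.

```python
import math

def genotype_chi2(cases, controls):
    """Pearson chi-squared test on a 2 x 3 genotype table (2 degrees of freedom)."""
    col_totals = [a + b for a, b in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, the chi-squared survival function is exp(-x / 2).
    return stat, math.exp(-stat / 2.0)

# Hypothetical genotype counts (AA, Aa, aa) for cases and controls at three loci.
loci = {
    "snp1": ([300, 500, 200], [400, 450, 150]),
    "snp2": ([250, 500, 250], [255, 495, 250]),
    "snp3": ([100, 400, 500], [105, 410, 485]),
}

alpha = 0.05
threshold = alpha / len(loci)  # Bonferroni-adjusted per-test threshold
results = {name: genotype_chi2(ca, co) for name, (ca, co) in loci.items()}
significant = [name for name, (_, p) in results.items() if p < threshold]
print(significant)
```

Each locus is tested on its own, which mirrors the independence assumption stated in the text; only loci whose p-value falls below the adjusted threshold would be carried forward to validation.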

Preceding GWAS were “candidate gene studies”, hypothesis-driven studies that

correlate a handful of genetic variants with a trait of interest using a “smaller”

sample size. As a consequence of lack of power and prohibitive genotyping

cost, the agnostic, comprehensive, and systematic analytical and validation

procedure of GWAS eluded traditional genetic association studies [32, 33,

171-173]. To facilitate discussion regarding “Environment-wide association”,

we describe these “agnostic”, “systematic”, and “comprehensive”

characteristics of GWAS.

GWAS is agnostic and data-driven, not hypothesis-driven. Traditionally,

genetic epidemiology association studies were hypothesis driven, testing a

handful of genetic variants at a time against a phenotype. The process of

GWAS also calls for systematic associations both within an individual

study and between multiple studies. The process of simultaneous association

requires accounting for multiple testing, controlling for the family-wise error

rate and false positives, a notable problem in the fragmented literature. Often

the threshold for significance is fixed a priori (Bonferroni correction). Second,

GWAS significant results are validated in additional populations at the same

stringent level. Last, and related to the agnostic characteristic, GWAS

comes close to being comprehensive: each common variant present on the

measurement chip is associated to the phenotype and its strength of association

is reported in context of all other common genotypes assayed (as seen in the

“Manhattan plot”, Figure 5A).

Environment-wide association to disease

In the following, we propose a study design analogous to GWAS, called

“environment-wide association study” (EWAS) to search for and analytically

validate environmental factors associated with complex diseases and

phenotypes.

EWAS assumes a similar “data structure” to that of GWAS. Recall that in

GWAS, multiple genetic factors are assayed along with phenotypic

information on each individual (Figure 14). In other words, the genetic factors

are the independent variables, and the phenotype is the dependent variable. In

EWAS, the genome domain is substituted with the envirome domain (Figure 5A).

Specifically, the quantity or presence of environmental factors is directly

measured on each individual, such as the amount of a chemical in bodily tissue,

or a proxy measure, such as self-report historical exposure (Chapter 2). This is

in contrast to data that are self-reported and subject to bias, or “ecological” data [11],

which are summarized on a level higher than that of individuals, on samples

grouped by some common characteristic, such as family, social network [174],

or town or city region [175]. As discussed in Chapter 1 and further below,

the environment is a dynamic entity, unlike the static genetic variants assayed in GWAS. Thus,

the dimension of time may also be added to the EWAS data

structure, framing it in a longitudinal context. While we describe methods to

accommodate a longitudinal data structure below, specific applications

described do not consider time (Figure 14).

Figure 14. Sample data structure for EWAS. “Phenotype” is the dependent variable. “Sex”, “Age”, “ethnicity”, and “SES” (socioeconomic status) are examples of adjustment variables. X1 through Xp are environmental factors; sample1…samplen are the individuals that make up the sample. Values inside each cell denote an example of the data type for the variable. For example, “Phenotype” here is a binary variable taking on 1 if the phenotype is present, 0 if absent; “sex” is a categorical variable for males and females. X variables representing environmental factors may be continuous (e.g., X1, Xp), positive/negative (e.g., X2), or ordinal (X3). Data might be missing (e.g., NA cells). The vertical axis denotes the individuals in the sample. Each environmental factor belongs to a disjoint “class”, or grouping, that represents a common characteristic of those factors, represented in the figure as “Class A”, “Class B”, and “Class Z”.

GWAS variables are “binned” by their chromosomal location, facilitating the

description of their correlation structure, known as Linkage Disequilibrium

(LD), when visualizing associations. Specifically, LD is the correlation between

two loci in the genome. Further, LD is a function of the relative location of the

two loci; that is, the closer together two loci are on a chromosome in general,

the higher their LD. Suppose we are considering one locus: in this scenario,

we inherit alleles from our parents, one from the mother and one from the

father. The genotype at one locus is a random event and is dependent on the

frequency of alleles present at that one locus in the mother and father. Now

suppose two loci (two sets of genotypes) are in “LD”. This means that their

patterns of inheritance are correlated; that is, the occurrence of a particular allele

“A” at a locus A and “B” at a locus B are non-random, or dependent with

respect to one another. In other words, the presence of one allele can predict

the presence of another. LD among different populations has been

characterized by the HapMap project and is ongoing with the 1000 Genomes

project [25]. In GWAS, LD structure “buys” us several things. First, since we

are assaying only a prevalent subset of polymorphic loci, LD allows us to

narrow down what variants might be causal; for example, given an association

signal for a variant, the causal variant might be one in strong LD with it. LD

also gives us an internal gauge of validity; for example, given a strong

association signal of a variant at locus X, one would expect measured common

variants that are also in LD with X to also harbor some signal.
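
To make the notion of correlated loci concrete, the sketch below computes two common LD summaries, the coefficient D and the squared correlation r², from two-locus haplotype counts. Using r² as the measure of LD is an assumption made here for illustration (the text only describes LD as a correlation), and the haplotype counts and the function name `ld_r2` are hypothetical.

```python
def ld_r2(hap_counts):
    """Return (D, r^2) from counts of the two-locus haplotypes AB, Ab, aB, ab."""
    n = sum(hap_counts.values())
    p_hap_ab = hap_counts["AB"] / n                      # frequency of the A-B haplotype
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / n      # frequency of allele A at locus A
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / n      # frequency of allele B at locus B
    d = p_hap_ab - p_a * p_b                             # departure from independence
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Alleles inherited independently: knowing one allele says nothing about the other.
d_indep, r2_indep = ld_r2({"AB": 25, "Ab": 25, "aB": 25, "ab": 25})

# Alleles always inherited together: one allele perfectly predicts the other.
d_full, r2_full = ld_r2({"AB": 50, "Ab": 0, "aB": 0, "ab": 50})

print(d_indep, r2_indep, d_full, r2_full)
```

An r² near 1 corresponds to the situation in the text where a measured variant can stand in for an unmeasured causal variant; an r² near 0 means the two loci carry independent information.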

At present, the analog of LD in EWAS is qualitative, not quantitative as in GWAS. In our

applications (Chapter 4) we binned factors according to categories that

described the compound “class”, had shared environmental health “relevance”,

or described some other arbitrary shared characteristic as a group of factors.

Current categories and examples within each are seen in Chapter 1. We

anticipate, as investigators characterize the envirome, that these categories will

encompass assays for stress, microbial flora, drugs, noise, and ecological

measures. A future research effort will be to fully characterize the LD of the

“envirome”, including their correlation/covariance structure and population-

wide prevalence as has been done with the HapMap.

EWAS achieves the agnostic, systematic, and comprehensive qualities that

characterize GWAS. First, instead of testing a few environmental associations

at a time, EWAS evaluates multiple environmental factors agnostically.

EWAS is comprehensive in that each factor measured is associated with

the phenotype. Next, associations are systematically adjusted for multiplicity of

comparisons. Further, EWAS calls for validation of significant associations in

an independent population.

The EWAS framework calls for systematic and comprehensive sensitivity

analyses of highly significant or validated factors. Specifically, all possible

measured confounders are included in final models and their effect on the

estimate of the environmental factor is assessed. Last, given the dense web of

correlation for non-genetic measures, such as between environmental factors

and clinical measures, the correlation structure between validated

environmental factors and risk factors is systematically computed and

visualized to understand the degree of their interdependence. By visualizing

relationships in this way, we can infer groups of non-independent exposures

associated with phenotype, similar to “relevance network” or clustering

analyses [176, 177].

Genetic versus non-genetic associations in population scaled studies

While EWAS has been inspired by GWAS, there are both critical differences

and similar drawbacks between genetic and non-genetic epidemiology. In

the following, we discuss these differences and similarities, between 1.) current

day non-genetic association studies versus GWAS and, 2.) the new paradigm

of non-genetic association studies, or EWAS, and GWAS. Work done by

Ioannidis et al. guides this section [10].

First, current association studies seeking association between environmental or non-genetic

factors and phenotypes test a few factors at a time. Results may be

further biased by selective reporting of subsets of analyses, phenotypes, and

adjustments, leading to a fragmented body of literature [10, 178-180]. Second,

related to selective reporting of subsets of analyses, the

multiplicity of tests is not considered. Current environmental epidemiology

studies are not agnostic, systematic, or comprehensive; however, the EWAS

analytic method addresses these shortcomings as described in the previous section.

However, there remain some critical differences and drawbacks between the

new paradigm of “enviromic” and genome-wide association. First, high-

throughput, low-error, commoditized, assay technologies have facilitated

systematic, agnostic, and comprehensive interrogation of genome-wide

variants. An analogous high-throughput and low-error assay technology

platform does not exist for the environmental factors.

Of course, a high-throughput assaying technology can be realized only after

the domain of what to measure, the common loci of the genome, has been

characterized. The HapMap project has enabled us to characterize the

variability across the genome. Further, as a result of this characterization, we

also have an idea of how genetic variants are “correlated”, or the pattern of

linkage disequilibrium.

We are far from describing the “LD” of the envirome, let alone what

environmental factors make up the envirome (Chapter 1). However, in our

own applications (Chapter 4) and from others we know that the correlation

matrix of environmental variables is dense [181]; many variables are correlated

with each other strongly. Therefore, it is difficult to pinpoint both what factors

are independently associated with the phenotype and the directionality of

association.

Issues related to observational studies influence all association studies, be it

a hypothesis-driven candidate factor study, GWAS, or EWAS. In contrast

to “gold-standard” randomized trial study data, both genetic and non-genetic

studies rely on observational study data, such as longitudinal cohort, case-

control, or cross-sectional data. Both types of epidemiological studies are

subject to confounding biases that hinder causal inference and are avoided, to

some degree, in randomized studies [182]; however, the gold-standard scenario

of a clinical trial is not suited for agnostic study of the envirome as it is

impossible to randomize such a matrix of factors.

“Confounding” is used to describe a scenario in which a variable is correlated

with both the factor of interest (the independent variable) and phenotype

(dependent variable) [183]; in our analyses, the factor acts as a “proxy” to the

confounding variable, resulting in a false association between the dependent

and independent variable. A partial solution to this type of bias is including the

confounder as a covariate in the statistical model, or “controlling” for the

confounder. This of course is only possible when the confounder is known and

measured.
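
A small worked example may help here. The sketch below uses hypothetical counts in which an exposure has no association with disease within either level of a confounder, yet appears strongly associated when the confounder is ignored. Stratifying by the confounder is used as the simplest stand-in for including it as a covariate in a regression model; all counts and names are assumptions for illustration.

```python
def odds_ratio(exp_cases, exp_controls, unexp_cases, unexp_controls):
    """Odds ratio of disease for exposed versus unexposed individuals."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)

# Hypothetical counts per confounder level:
# (exposed cases, exposed controls, unexposed cases, unexposed controls).
# When the confounder is present, both exposure and disease are more common.
strata = {
    "confounder present": (90, 90, 10, 10),
    "confounder absent": (2, 18, 18, 162),
}

# Crude analysis: pool the strata and ignore the confounder.
crude_counts = tuple(sum(cell) for cell in zip(*strata.values()))
crude_or = odds_ratio(*crude_counts)

# Adjusted analysis: estimate the odds ratio within each confounder level.
stratum_ors = {level: odds_ratio(*counts) for level, counts in strata.items()}

print(round(crude_or, 2), stratum_ors)
```

In this constructed example the crude odds ratio is well above 1 while both stratum-specific odds ratios equal 1, which is the pattern of a factor acting as a “proxy” for the confounding variable.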

In modern-day genome-wide studies, a notable example of confounding

is the initial association of a variant belonging to the FTO gene with T2D

[165]. In subsequent analyses adjusting for body mass index (BMI), a clinical

risk factor associated with both T2D and the FTO variant, the association was

nullified. Subsequently, FTO was shown and validated in its association to

BMI and obesity in GWAS [170]. Confounding is a major issue in non-genetic

studies, especially given the dense correlation structure of non-genetic and

environmental variables, and many examples exist of associations biased

by confounders. Famous examples include associations derived from

observational studies later contradicted by randomized controlled trials (RCTs): 1.)

β-carotene was thought to reduce risk for smoking-induced cancer [184], only

to be refuted by an RCT later [185]; 2.) the same occurred with vitamin E and decreased

risk of coronary heart disease (CHD) [186]; and, 3.) for vitamin C and

CHD, relative risks from observational studies and RCTs had even

switched direction [69]!

Another source of “bias” includes “reverse causality”, or reverse association.

Reverse causality leads to the failure to infer the proper “forward” direction

between the independent variable and dependent variable, the phenotype.

Specifically, it occurs when the independent variable comes directly or

indirectly as a result of the dependent variable. An example of this is

a sample-wide behavioral shift due to the dependent variable, such as increased

intake of a vitamin due to an adverse phenotype. If we were to associate the

environmental factor, the vitamin, with the phenotype as the dependent

variable, the interpretation of the model as is suggests that change in vitamin

exposure leads to change in phenotype when in fact the opposite is true. These

biases are especially present in case-control or cross-sectional studies in which

individuals are measured at one point in time. A way to take into account the

dynamic nature of non-genetic variables and biases such as reverse causality

is to conduct a longitudinal study in which we may jointly observe

changes in phenotype and exposure pattern as a function of time [34]. Lastly,

the notion of reverse causality is a non-issue in genetic variant association

studies due to the static state of nucleotide variants.

The nature of the environmental factors themselves also biases results. First,

the assessment of the quantity of environmental factors in blood and serum is

subject to measurement error [10] and self-report variables are subject to recall

bias. Further, physiological characteristics of factors themselves influence

estimates, including the variability of the kinetics of chemical factors, such as

how long they are retained in accessible body tissue. For example, chemical

compounds that are easily measured include those that are lipophilic, persistent

in fatty tissue. As adiposity is related to both the measurement of the factor

and often the phenotype of interest (e.g., metabolic syndrome), a positive

correlation might indicate confounding. On the other hand, many types of

factors are excreted quickly, also affecting their measurement and association

to the phenotype of interest; however, “steady-state” or constant exposure

might allay a kinetic effect of environmental chemicals [187]. In contrast, in

genetic studies, these issues are altogether avoided: error rates of array-based

assays and DNA sequencing are minuscule compared to those of environmental factors.

EWAS METHOD

The EWAS methodology and analysis framework is analogous to that utilized

in GWAS. First, we conduct an initial scan for environmental factors

associated with a phenotype of interest through generalized linear modeling, such

as logistic or linear regression. Since environmental association occurs in the

observational (vs. randomized) scenario, these models include variables that

adjust for known confounders, such as clinical risk factors. Second, we

account for multiple hypotheses by estimating the false discovery rate (FDR).

Third, factors that we deem significantly associated with the phenotype beyond

the region of false discovery are “validated” in independent cohorts. Factors

that are validated are considered true discoveries.

The EWAS framework also calls for systematic sensitivity analyses, whereby

validated factors are modeled under different assumptions or with additional

covariates. Further, the pair-wise correlation between validated factors is

computed and examined to determine their dependence, which can be

interpreted as potential evidence for route of exposure or confounding. Each

step is described further.
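
The pair-wise correlation step can be sketched as follows. This is an illustration only: the choice of the Pearson coefficient, the factor names, and the measurements are assumptions, not details given in the text.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements of three validated factors on five individuals.
validated = {
    "factor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "factor_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "factor_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}

# Correlation for every unordered pair of validated factors.
correlations = {
    (a, b): pearson(validated[a], validated[b])
    for a, b in combinations(validated, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Pairs with a correlation close to 1 or -1 would be candidates for a shared route of exposure or for confounding, and could be grouped before interpretation.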

Stage 1: Linear Modeling

Each environmental factor is associated with a phenotype of interest using

generalized linear models; for example, each is associated with disease status using

logistic regression. Normally distributed continuous phenotypes are correlated

to environmental factors with linear regression. Common risk, demographic,

and clinical factors are added as adjusting variables, such as age, sex, ethnicity,

socioeconomic status, as phenotypic states and environmental factors are

confounded by these variables. Thus, for an environmental factor Xi in our list

of measured factors X1 … Xp we model the disease state (Y) as a linear

function of the environmental factor and adjustment variables (represented by Z):

$Y = \alpha + \beta_i X_i + \zeta Z$

Xi corresponds to the environmental factor and βi corresponds to the effect size

of that factor, adjusted by other variables.

The strength of association is computed by the 2-sided p-value for βi, which

tests the “null hypothesis” that βi is equal to zero. When modeling the

phenotype as the logit (logistic regression), the exponentiation of βi serves as

the odds ratio, or the change in the odds in disease versus un-diseased status

for a unit change of the factor. In the linear regression setting, βi can be

interpreted as the change in phenotype per unit change of the factor. In

summary, the screening procedure of stage 1 can be described by this pseudo-

code:

1. Pvalues <- NewList()
2. Effect_sizes <- NewList()
3. For xi in [X1…Xp]:
4.     Modi <- GeneralLinearModel(phenotype, xi, Xses, Xeth, Xsex, Xage)
5.     ListAppend(Pvalues, getPvalue(Modi, xi))
6.     ListAppend(Effect_sizes, getEffectSize(Modi, xi))

Algorithm 1. Screening for Environmental Factors (Stage 1) of EWAS.

In Algorithm 1, lines 1 and 2 initialize list data structures to store p-values

and effect sizes (coefficients) for each environmental factor. In lines 3 and 4, we

loop over the factors and fit a linear model (‘GeneralLinearModel’) that models phenotype as a

function of environmental factor Xi and the adjusting factors. In lines 5 and 6, we simply take

the p-value and coefficient from each model and store them in our lists.

P-values are computed through common tests of significance, such as Wald tests.
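
The screening loop of Algorithm 1 might be realized as in the following sketch for a binary phenotype. It is not the implementation used in this work: the logistic model is fit with a hand-written Newton-Raphson routine so that the example needs only the standard library, the p-value is a two-sided Wald test under a normal approximation, and the data set, the column names, and the helper functions `solve`, `score_and_information`, and `logistic_wald` are all hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def score_and_information(design, y, beta):
    """Gradient of the log-likelihood and the Fisher information at beta."""
    k = len(beta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x, yi in zip(design, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        for a in range(k):
            score[a] += (yi - p) * x[a]
            for c in range(k):
                info[a][c] += p * (1.0 - p) * x[a] * x[c]
    return score, info

def logistic_wald(design, y, index, iterations=25):
    """Fit by Newton-Raphson; return the coefficient at `index` and its Wald p-value."""
    beta = [0.0] * len(design[0])
    for _ in range(iterations):
        score, info = score_and_information(design, y, beta)
        beta = [b + step for b, step in zip(beta, solve(info, score))]
    _, info = score_and_information(design, y, beta)
    unit = [1.0 if j == index else 0.0 for j in range(len(beta))]
    variance = solve(info, unit)[index]  # diagonal entry of the inverse information
    z = beta[index] / math.sqrt(variance)
    return beta[index], math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical data: X1 is associated with the phenotype (odds ratio 4); X2 is not.
rows = []
for x1, pheno, count in ((1, 1, 80), (1, 0, 40), (0, 1, 40), (0, 0, 80)):
    for j in range(count):
        rows.append({"phenotype": pheno, "X1": x1, "X2": j % 2, "sex": (j // 2) % 2})

factors, adjusters = ["X1", "X2"], ["sex"]
y = [r["phenotype"] for r in rows]
p_values, effect_sizes = {}, {}
for name in factors:
    # Design matrix: intercept, the factor under test, then the adjustment variables.
    design = [[1.0, float(r[name])] + [float(r[a]) for a in adjusters] for r in rows]
    effect_sizes[name], p_values[name] = logistic_wald(design, y, index=1)

print(effect_sizes, p_values)
```

Exponentiating `effect_sizes["X1"]` recovers the odds ratio built into the hypothetical data. In practice a statistics package would replace the hand-written fitting routine.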

Continuous factors are z-transformed (centered about the mean and divided by

their standard deviation) in order to compare the effect sizes. Many factors

measured in tissue have a right skew and thus are log-transformed prior to z-

transformation. Binary factors (such as presence or absence of a factor) are

standardized such that effect size reflects a unit change between exposed and

un-exposed status; that is, the referent is consistently the “negative” result of a

binary test. Ordinal factors are left untransformed.
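
The transformation of a right-skewed continuous factor can be sketched as below. The use of the natural logarithm and of the sample (n - 1) standard deviation are assumptions made for illustration; the function name `log_z_transform` and the input values are hypothetical.

```python
import math
import statistics

def log_z_transform(values):
    """Log-transform a positive, right-skewed factor, then center and scale it."""
    logged = [math.log(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / sd for x in logged]

# Hypothetical tissue concentrations spanning two orders of magnitude.
z = log_z_transform([1.0, 10.0, 100.0])
print([round(v, 6) for v in z])
```

Because the z-transformation rescales the logged values, the base of the logarithm does not change the result. After this step, a regression coefficient describes the change in phenotype per one standard deviation of the logged factor, which is what makes effect sizes comparable across factors.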

Stage 2: Controlling for Multiple Hypotheses by Estimating the False

Discovery Rate

Given a set of “discoveries”, or a list of potentially significant factors, how can

we determine which are false discoveries? In the GWAS setting, Bonferroni

correction is utilized to adjust for multiple comparisons. The Bonferroni

adjustment is straightforward: it simply divides the significance threshold α by

the total number of tests conducted. This adjustment controls the “family-wise

error rate”, the probability of having one or more false positives in a set

of results, at the level of a setting in which only one hypothesis is tested at

level α. However, the threshold is conservative and therefore we lose power

for detection.
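
For concreteness, the Bonferroni adjustment can be sketched in a few lines. The function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, number_of_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / number_of_tests


def bonferroni_significant(pvalues, alpha=0.05):
    """Indices of the p-values that pass the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(pvalues))
    return [i for i, p in enumerate(pvalues) if p < threshold]


pvalues = [0.0001, 0.004, 0.03, 0.2, 0.6]  # hypothetical screening results
threshold = bonferroni_threshold(0.05, len(pvalues))
hits = bonferroni_significant(pvalues)
```

Here five tests at α = 0.05 give a per-test threshold of 0.01, so the result with p = 0.03 is no longer called significant even though it passes the unadjusted level.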

To account for multiple comparisons, we compute an empirical estimate of the

False Discovery Rate (FDR) derived through permutations of the phenotype

multiple times, effectively creating a “null distribution” of test statistics. In

contrast to the Bonferroni correction, the FDR provides a quantitative estimate

of the number of false positives in a set of “discoveries”. The FDR is less

conservative and therefore more powerful than the Bonferroni correction [38].

Further, since our estimate of the FDR utilizes the data itself, it inherently

considers the covariance structure of the data, an important quality given the

dense correlation of non-genetic factors [38].

The FDR is the estimated proportion of false discoveries made versus the

number of real discoveries made for a given significance level α, to control for

multiple hypothesis testing. To estimate the number of false discoveries, we

create a “null distribution” of regression test statistics by shuffling the phenotype

a large number of times (100-1000) and refitting the regression models. The FDR

is the ratio of the proportion of results that were called significant at a given

level α in the null distribution and the proportion of results called significant

from our real tests. We use a significance level that corresponds to FDR of 5-

10% to select associations.

The pseudo-code to compute the FDR follows:

1. Do: ‘Algorithm 1’.
2. nullPvalues <- NewList()
3. For i in [1…numberPermutations]:
4.     randomPheno <- permutePhenotypeWithoutReplacement(phenotype)
5.     For xi in [X1…Xp]:
6.         Modi <- GeneralLinearModel(randomPheno, xi, Xses, Xeth, Xsex, Xage)
7.         ListAppend(nullPvalues, getPvalue(Modi, xi))
8. fdrRaw <- NewList()
9. For pvalue in Pvalues:
10.     numerator <- sum(nullPvalues <= pvalue) / numberPermutations
11.     denominator <- sum(Pvalues <= pvalue)
12.     ListAppend(fdrRaw, numerator / denominator)
13. fdrs <- NewList()
14. For i in [1…p]:
15.     fdr <- min(fdrRaw[i…p])
16.     ListAppend(fdrs, fdr)

Algorithm 2. Computing the FDR (q-value) for each p-value during Stage 1 of EWAS.

To begin Algorithm 2, we need to have established stage 1 of EWAS with Algorithm 1. Then, for a number of permutations, we refit the regression model for the random phenotype for each environmental factor and collect all of these ‘null’ p-values (lines 3-7). For each p-value computed in Stage 1, we compute the raw FDR, or the ratio of the number of results per permutation that fall at or below that p-value threshold in the permuted data to the number of results that fall at or below that p-value in stage 1 (lines 9-12). With Pvalues sorted in increasing order, the FDR should be a monotonically increasing function of the p-value, so we ensure that the FDR for a p-value is the

minimum of the FDRs for all p-values equal to or greater than that p-value

(line 15). The resulting array of FDR values corresponds to the FDR for each

p-value computed in Stage 1.
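
The second half of Algorithm 2 (raw FDR and the monotonicity step) can be sketched as follows, assuming the null p-values from all permuted refits have already been pooled into one list. The function name and the example values are hypothetical, and the comparison uses "at or below" so that the smallest observed p-value never yields a zero denominator.

```python
def empirical_fdr(pvalues, null_pvalues, number_permutations):
    """Permutation-based FDR (q-value) for each observed p-value.

    Returns the q-values in the same order as `pvalues`.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    raw = []
    for i in order:
        p = pvalues[i]
        # Expected number of null results at or below p, per permutation.
        expected_false = sum(q <= p for q in null_pvalues) / number_permutations
        called = sum(q <= p for q in pvalues)
        raw.append(expected_false / called)
    # Enforce monotonicity: running minimum from the largest p-value down.
    fdrs = [0.0] * len(pvalues)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, raw[rank])
        fdrs[order[rank]] = running_min
    return fdrs


# Hypothetical example: 4 factors, 2 permutations (8 pooled null p-values).
observed = [0.5, 0.001, 0.6, 0.04]
null = [0.03, 0.01, 0.02, 0.7, 0.8, 0.9, 0.95, 0.99]
qvalues = empirical_fdr(observed, null, number_permutations=2)
```

In this toy example the raw FDR at p = 0.04 is 0.75, but because the raw FDR at p = 0.6 is 0.375, the monotonicity step lowers it to 0.375.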

Of course, the original method for estimating the FDR can be used [39],

eliminating the need for Algorithm 2. However, as discussed earlier,

estimating the FDR through permutations of the dependent variable is

preferred in the scenario in which the variables are correlated. In addition,

much has been documented about what variables to permute or bootstrap. For

example, it has been suggested that model residuals, the difference between the

predicted and true values, should be permuted (or bootstrapped) as opposed to

the original outcome variables (replacing line 4 in Algorithm 2 appropriately).

In our experience (Chapter 4), we had similar estimates of the FDR under

different documented methods of permuting. The reader is advised to refer to

Manly, Efron, and Westfall and Young for more in this area [188-190].

Stage 3: Validation

Findings deemed significant corresponding to some nominal FDR level are

validated in one or more additional independent cohorts. As a rule, the

validation result must meet the same or a more stringent FDR level than

the result in the initial cohort. For example, if a factor is deemed significant

at an FDR of 10% in one cohort, it must also have an FDR of 10% or lower in one of the

validation cohorts. Furthermore, and importantly, the sign of the effect size in

the validation cohort must match that in the initial screen.

We also compute the empirical FDR of the validation step, the overall FDR of

validating a factor. We first estimate the number of false positives by counting

the number of factors found significant at level α in multiple cohorts from the

permuted analyses. For example, to assess the FDR of validating a factor in 2

cohorts, we collected the factors that fell below the significance threshold α in

the permuted data corresponding to two different cohorts and counted the

number of factors found significant in both. We repeated this operation on all

possible pairs of cohorts, adding up numbers found to be significant in each

pair. We then estimate the FDR by computing the ratio between the total

number of false positives and the number of true validated factors (factors

found to be significant in more than 1 cohort). We repeated the analogous

operation for factors significant in however many cohorts we use to validate

our results. The pseudo code for this procedure follows:

1. numCohorts <- numberOf(cohorts)
2. fdrThreshold <- 10%
3. significantFactors <- NewHash(key=factor, init=0)
4. significantNullFactors <- NewHash(key=factor, init=0)
5. For cohort in cohorts:
6.     Do ‘Algorithm 1’.
7.     Do ‘Algorithm 2’.
8.     pvalueThresh <- max(Pvalues[fdrs < fdrThreshold])
9.     significantFactorsInCohort <- whichFactors(Pvalues <= pvalueThresh)
10.     significantFactorsInNullCohort <- whichFactors(nullPvalues <= pvalueThresh)
11.     For factor in significantFactorsInCohort:
12.         significantFactors[factor] ++
13.     For factor in significantFactorsInNullCohort:
14.         significantNullFactors[factor] ++
15. validatedFactors <- NewHash(key=numCohort, init=0)
16. For numCohort in [2..numCohorts]:
17.     For factor in significantFactors:
18.         if (significantFactors[factor] >= numCohort):   # need to check the effect size direction
19.             validatedFactors[numCohort] ++
20. nullValidatedFactors <- NewHash(key=numCohort, init=0)
21. For numCohort in [2..numCohorts]:
22.     For factor in significantNullFactors:
23.         if (significantNullFactors[factor] >= numCohort):
24.             nullValidatedFactors[numCohort] ++
25. validatedFDR <- NewHash(key=numCohort)
26. For numCohort in [2..numCohorts]:
27.     falsePosValidRate <- nullValidatedFactors[numCohort] / numberPermutations
28.     fdr <- falsePosValidRate / validatedFactors[numCohort]
29.     validatedFDR[numCohort] <- fdr

Algorithm 3. Computing the FDR for the multi-cohort validation.

In lines 1 and 2, we retrieve the number of independent cohorts we use to tentatively

validate a significant result, and initialize our significance threshold for a

finding in 1 cohort (e.g., FDR < 10%). We then initialize two hash data

structures which contain the number of cohorts where the factor is significant,

indexed by the environmental factor name string and do the same for ‘null’

results, or results attained through permutation of the phenotype label (line 3

and 4). Next, for each individual cohort, we do our EWAS stage 1 screen (line

6) and compute the within-cohort FDR (line 7). Then, we collect the number

of significant factors that exceed the nominal FDR threshold (lines 8-10).

Then, we iterate through these significant factors and increment a counter for

the factor (lines 11-12); for factors that are tentatively validated, the count will

be greater than 1. We do the analogous operation for the permuted dataset

(lines 13-14). Next, we count the number of validated factors by totaling up

the factors that had a count greater than 1 (lines 15-19) and the analogous

operation for the permuted dataset (lines 20-24). Finally, we compute the FDR

of validating a factor in lines 25-29: for each possible validation scenario

numCohort (where numCohort is the number of cohorts, 2, 3, 4, etc., in which a

factor is significant), we estimate the FDR as the rate at which a factor was

found to be significant (“validated”) in numCohort or more

permuted cohorts divided by the actual number of factors validated in the real

dataset. Thus, this FDR corresponds to validating a factor with the significance

rule of FDR for a single cohort.
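
One reading of the final steps of Algorithm 3 can be sketched as below, assuming the sets of significant factors have already been computed for each cohort, both for the real data and for each permutation. The function names and factor labels are hypothetical, the count is taken within each permutation before averaging, and the check on the direction of the effect size is omitted for brevity.

```python
from collections import Counter


def count_validated(significant_by_cohort, min_cohorts):
    """Number of factors significant in at least `min_cohorts` cohorts."""
    counts = Counter()
    for factors in significant_by_cohort:
        counts.update(set(factors))
    return sum(1 for c in counts.values() if c >= min_cohorts)


def validation_fdr(real_significant, null_significant, min_cohorts=2):
    """FDR of validating a factor in `min_cohorts` or more cohorts.

    real_significant: one set of significant factor names per cohort.
    null_significant: for each permutation, one such list of per-cohort sets.
    """
    validated = count_validated(real_significant, min_cohorts)
    if validated == 0:
        return None  # nothing validated, so the ratio is undefined
    null_counts = [count_validated(perm, min_cohorts) for perm in null_significant]
    expected_false = sum(null_counts) / len(null_counts)
    return expected_false / validated


# Hypothetical example: three cohorts and four permutations.
real = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
null = [
    [{"a"}, {"a"}, set()],
    [{"b"}, {"c"}, set()],
    [set(), set(), set()],
    [{"d"}, set(), {"e"}],
]
fdr_two_cohorts = validation_fdr(real, null, min_cohorts=2)
fdr_three_cohorts = validation_fdr(real, null, min_cohorts=3)
```

In this example two factors are validated in two or more real cohorts, and one of the four permutations produces a single spurious "validation", giving an estimated FDR of 0.25 / 2 = 0.125.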

Final estimates for validated factors are computed by combining independent

cohorts. Tests for heterogeneity between cohorts are also performed to ensure

the final overall estimate is unbiased by any specific cohort.

Stage 4: Sensitivity Analyses

Confounding and reverse causality influence the strength of association, bias

the effect size estimate, and in general affect causal inference of environmental

factors to phenotypes. Thus, we propose a method to begin to measure these

biases approximately. We cannot claim to find all such biases or to

eliminate confounding; nevertheless, we describe methods to assess bias from

variables that were measured.

In this method, we systematically comb through all measured variables that were

not considered in our list of environmental factors – but could influence the

association – and sequentially add them to the linear model as an additional

covariate. Then the p-value of association and effect size corresponding to the

environmental factor calculated from the extended model are compared with those of the original

model computed in Stage 2. The difference between the extended and original

factor coefficients quantifies the approximate bias due to the new variable.

Types of variables that might bias our associations depend on the phenotype

and environmental factors under study, but often include knowledge of clinical

status (e.g., diagnosis of a disease), recent food, supplement, or drug intake,

and physical activity. For example, knowledge regarding one’s disease state

might induce behavioral change, resulting in increased exposure to foods high

in vitamins and certain nutrients; association between these vitamin factors and

disease might then be attributed to reverse causality. Or, use of a drug might

induce phenotypic change, biasing estimated effects toward the null.

This method is dependent on a multitude of measured potential confounders.

Large epidemiological datasets arising from the public domain or of large

consortia often measure many of these other clinical and behavioral non-

genetic variables which can be utilized to test the “sensitivity” of the final

validated effects of environmental factors associated with a phenotype. We

give specific examples of our sensitivity analyses when covering applications

in sections below.

Stage 5: Correlation Globes

The correlation/covariance structure between non-genetic measures is known

to be “dense”, and this structure also influences our ability to infer the

independent effect of factors on phenotype as discovered in EWAS.

Furthermore, our initial screen methodology assumes independence between

factors and we therefore have little idea about their correlation.

Concretely, given a list of discovered factors, their joint association to the

phenotype of interest might be due to their correlation, such as similar routes

of exposure. We assess the degree of dependency between validated factors by

computing their raw correlation coefficient (Pearson’s ρ) and visualizing this

with a correlation “globe”. By visualizing relationships in this way, we can

infer non-independent exposures associated with phenotype [176, 177].
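
The pairwise correlations that feed such a globe can be sketched as follows. The function names, factor labels, and the use of the absolute correlation with a 0.2 cut-off are assumptions for illustration; drawing the globe itself is left to a plotting tool.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def globe_edges(factors, threshold=0.2):
    """Pairs of factors whose absolute correlation exceeds `threshold`."""
    names = sorted(factors)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = pearson(factors[a], factors[b])
            if abs(rho) > threshold:
                edges.append((a, b, rho))
    return edges


# Hypothetical measurements for three validated factors.
factors = {
    "factor_a": [1, 2, 3, 4, 5],
    "factor_b": [2, 4, 6, 8, 10],
    "factor_c": [1, -1, 0, -1, 1],
}
edges = globe_edges(factors)
```

Each returned triple is one edge of the globe: two factor names and their correlation. Here only `factor_a` and `factor_b` are connected, since `factor_c` is uncorrelated with both.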

DISCUSSION

As described above, EWAS may facilitate many different ways of screening

for factors. We describe extensions that might be used off-the-shelf to

accommodate longitudinal data and statistical learning methods that consider

the entire matrix of independent variables at once.

Longitudinal data

As discussed, environmental factors are dynamic. One way to capture the

dynamic relationship between environmental factors and a phenotype of

interest includes repeatedly measuring individuals over time. An example

includes a longitudinal cohort study, in which a cohort is followed for a certain

amount of time beginning prior to disease onset, such as childhood or

adolescence. This type of study design might lessen the bias of reverse

causality, but not completely [34].

For a binary dependent variable, the Cox proportional hazard model is a

common analytic model that can accommodate both time-dependent

independent and dependent variables. With this model, we simply substitute

line 4 of algorithm 1 with the Cox model that inputs time-dependent variables.

For both continuous and binary dependent variables, hierarchical modeling techniques

such as generalized estimating equations may be utilized. The EWAS as

described by algorithm 1 depends on the computation of individual p-values

and effect size for the environmental factor, and statistical tests for these

modeling techniques satisfy this requirement. Calculation of the empirical

FDR proceeds also in the same way [191].

Feature Selection: Shrinkage Methods

The EWAS screening method considers each environmental factor in a

separate linear model iteratively (algorithm 1). This makes it feasible to

screen and interpret many variables without over-fitting the linear

model (i.e., p << n, where p is the number of predictors and n is the number of

individuals). However, this falsely assumes independence between

environmental factors. Statistical learning methods, such as “shrinkage”

methods, enable one to model the independent variables simultaneously in the

“under-determined” (p ≥ n) setting.

Two popular shrinkage methods are the “Lasso” [192] and the “elastic net”

[193]. These methods are extensions of multivariate regression and have some

relation to tree “boosting” methods [194] and are applicable over the

generalized linear model family, including Cox proportional hazards for

longitudinal data [191]. Both the lasso and elastic net are able to fit an under-

determined model by constraining the size of coefficients (“shrinking”).

Because these methods consider the entire set of independent variables

simultaneously (i.e., multiple regression), algorithm 1 is supplanted with the

shrinkage procedure. Further, k-fold cross-validation is utilized to select

features that have the lowest prediction variability on k number of datasets

held out of the model building process [194].

Feature selection operates through optimizing prediction accuracy of the

dependent variable and not through ordering of test-statistics of individual

coefficients used in inference. Thus, we must re-configure parts of the Stage 1

(FDR estimation) and Stage 2 (Validation) to accommodate this.

Reconfiguring Stage 1, we use one cohort as the “discovery” cohort, applying

the shrinkage method to find factors associated with the phenotype. Within

this cohort, k-fold cross-validation is applied in order to optimize prediction

accuracy within the discovery cohort. Thereafter, the top factors found through this

method are “validated” individually in additional validation cohorts using

common tools for inference (e.g., GLM). Successful validation requires low

nominal p and FDR values for the validation analyses.

Of course, “classical” methods for feature selection exist in the linear

regression domain, such as “forward-stepwise” and “backward-stepwise”.

These methods may be used to select environmental factors, but we opt out of

discussion of these methods due to their high variability in subset selection due

to the step-wise procedure, ultimately reducing their prediction accuracy [195].

The shrinkage methods discussed above avoid this problem.

In this chapter, we have presented a straightforward and generalizable way to

associate environmental variables of large dimension to disease. Furthermore,

we present a way of ranking what variables we may want to pursue for further

study through computation of the FDR. Because of its proposed utility, the

method has become a center point of discussion and debate [1, 196-200]. In

the following chapter, we demonstrate this claimed utility, applying the

method to Type 2 Diabetes and Serum Lipid Levels, risk factors for

cardiovascular disease.

CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE

AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2

DIABETES (T2D) AND SERUM LIPID LEVELS

INTRODUCTION

In the following, we exemplify methods and techniques presented in the

previous chapter with published or submitted “Environment-wide Association

Studies” (EWAS) on diseases including type 2 diabetes (T2D) [35] as well as

on phenotypes that are risk factors for disease, such as serum lipid levels [36].

As described in the previous chapter, EWAS is a framework to

comprehensively and systematically test for environmental association to

disease analogous to “Genome-wide association Study” (GWAS), a now

standard framework in genetic epidemiology to associate genetic variants in a

genome-wide dimension to disease.

First, the EWASes presented concern complex diseases that are known to be

multifactorial in etiology and in which many environmental and genetic

factors are known to play a role [2]. Second, they are of great concern given

their rise to epidemic status [201]. Third, through GWAS, we have a robust set

of common genetic variants associated to these diseases, for example T2D

[202] and lipid levels [65] executed on samples of significant size.

Furthermore, and most importantly, this list of genomic loci is being updated

and examined continuously [9, 203]; however, we lag behind in identifying the

comprehensive set of environmental factors (Chapter 1).

The following studies are made possible by the National Health and Nutrition

Examination Survey (NHANES), a representative biennial health survey of the

non-institutionalized population of the US [37]. In NHANES, participants are

queried regarding their health status and an extensive battery of clinical and

laboratory tests are performed on a subset of these individuals. Specific

environmental attributes are assayed, such as chemical toxins, pollutants,

allergens, bacterial/viral organisms, and nutrients. Of biomedical relevance,

we identified novel environmental factors such as nutrients and industrial

pollutants associated with these diseases that should be examined in follow-up

validation studies.

ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2

DIABETES

EWAS on T2D: Methods

We associate 266 unique environmental factors to T2D status from the

NHANES. We downloaded all of the available NHANES data for the 1999-

2000, 2001-2002, 2003-2004, and 2005-2006 cohorts and collated

corresponding variables across them. For example, if a variable LBXVIE from

1999-2000 described “A-Tocopherol ug/dL” and variable with name LBXATC

from 2001-2002 also described “a-tocopherol ug/dL”, we applied the same

name for each, LBXATC.
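
The collation step can be sketched with a small alias table. Only the LBXVIE to LBXATC pair comes from the text; the function names, the table layout, and the measurement values are hypothetical.

```python
# Alias table: cohort-specific variable name -> common name.
# Only the LBXVIE -> LBXATC pair is taken from the text; extend as needed.
ALIASES = {"LBXVIE": "LBXATC"}


def harmonize(cohort_table, aliases=ALIASES):
    """Rename the columns of one cohort so matching variables share a name."""
    return {aliases.get(name, name): values for name, values in cohort_table.items()}


def collate(cohorts):
    """Map each harmonized variable to the cohorts in which it was measured."""
    measured_in = {}
    for cohort_name, table in cohorts.items():
        for variable in harmonize(table):
            measured_in.setdefault(variable, []).append(cohort_name)
    return measured_in


cohorts = {
    "1999-2000": {"LBXVIE": [1100.0, 950.0]},  # made-up values
    "2001-2002": {"LBXATC": [1020.0, 880.0]},
}
measured_in = collate(cohorts)
```

After harmonization, both cohorts report the factor under the single name LBXATC, which is what allows a factor to be tested for validation across cohorts.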

Figure 15 presents a schematic representation of our analysis methodology. We

analyzed all environmental factors from the NHANES that were a direct

measurement of an environmental attribute, such as the amount of pesticide or

heavy metal present in urine or blood. We did not consider internal biological

system laboratory measures such as red blood cell count, triglyceride level,

cholesterol level, or other physiological measures. By using direct and

quantitative measures of factors, we potentially avoid issues of self-report bias.

There was a total of 543 factors in our EWAS, but not all factors were present

in all cohorts: 111 factors measured in the 1999-2000 cohort, 146 from 2001-

2002, 211 from 2003-2004, and 75 from 2005-2006. This comprised 266

unique environmental factors in total, with 157 factors measured in more than

one cohort. Using NHANES categorization, we binned factors into 21 “class”

groupings in order to discern patterns among related groups of factors,

analogous to chromosomal units in GWAS (not shown). Different

environmental factors were measured in varying numbers of participants,

ranging from 507 to 3318 individuals over the different environmental factors.

[Figure 15 graphic: schematic of the EWAS analysis pipeline, panels A-H. Text recoverable from the panels: a table of environmental factor “classes” by cohort (1999-2000, 2001-2002, 2003-2004, 2005-2006) listing Acrylamide, Allergen Test, Bacterial, Cotinine, Dialkyl, Dioxins, Furans, Heavy Metals, Hydrocarbons, Latex, Carotenoid Nutrients, Mineral Nutrients, Vitamin A, Vitamin B, Vitamin C, Vitamin D, Vitamin E, Polychlorinated Biphenyls, Perchlorate, Phenols, Phthalates, Phytoestrogens, Polybrominated Ethers, Polyfluorochemicals, Virus, Volatile Compounds, and Pesticides (Atrazine, Carbamate, Chlorophenol, Organochlorine, Organophosphate, Pyrethroid); phenotypes log10(HDL), N=222-7485; log10(LDL), N=101-3368; log10(trigly.), N=109-3618; Fasting Glucose > 125 mg/dL?, N=109-3190 (8% of total); zfactor = transformed xfactor - adjustment variables; βfactor; Empirical FDR estimation, Permute Phenotype Levels 1000x; FDR(α) ≤ 10%; P-value(βfactor) < α in 2 or more cohorts?; compute estimate of validated factor using all cohorts, Combined Cohort βfactor; Estimation of R2; Sensitivity Analyses, recompute βfactor adjusted by self-report data: 24- or 48-hour dietary recall (n=58), physical activity, total supplement use, any use of drugs (metformin, statins, etc), any metabolic health history; Correlation Globes of Tentatively Validated Factors (ρ > 0.2).]

Figure 15. A.) Summary of the 32 factor classes and the number of factors within them for each NHANES cohort. Each factor is measured in blood or urine. B.) 100-7,500 individuals had their fasting blood glucose (FBG), HDL-C, LDL-C and triglyceride levels measured for each of these factors in each cohort; lipid levels were log transformed to approximate normality for least squares regression. Type 2 Diabetes status was assessed by considering those who had a FBG > 125 mg/dL. C.) Each of these 96 to 258 factors was tested for association with the logarithm of HDL-C, LDL-C, and triglyceride level with a linear regression model adjusted for age, age-squared, sex, BMI, ethnicity, and SES. To test against T2D status, a logistic regression model was utilized, adjusting for age, sex, ethnicity, SES, and BMI. D.) To account for multiple testing, we estimated the empirical null distribution by permuting the lipid levels and estimating the false-discovery rate (FDR). The p-value threshold (α) for statistical significance was determined by controlling the FDR to be under 10%. We deemed a factor to be tentatively validated if it was found to be significant in 2 or more cohorts with an effect in the same direction in all cohorts where it was significant. E.) For lipid level phenotypes, we estimated a final coefficient for tentatively validated factors by combining all cohorts and adjusting for age, age-squared, sex, ethnicity, SES, BMI, waist circumference, T2D status, blood pressure, and cohort. F.) We estimated the coefficient of determination (R2) for the final, combined models. G.) We re-computed our final models, adding 62 self-report variables one-by-one in an attempt to check the validity of the environmental effect. H.) We computed the pair-wise correlation between each of the tentatively validated factors along with other clinical co-variates and analyzed these relationships with correlation globes [10].

We omitted from our EWAS 73 factors that varied little across individuals in

our sample. Specifically, we omitted those that had a majority (> 90%) of the

observations below a detection limit threshold as defined in the NHANES

codebook. We also removed factors that targeted a subset of the population,

such as the test for Trichomonas vaginalis, an infectious pathogen found

primarily in women.
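
The detection-limit filter described above can be sketched as follows. The function names and example values are hypothetical, and how "below the detection limit" is encoded in the data files is an assumption here (a simple comparison against the limit).

```python
def fraction_below_limit(values, detection_limit):
    """Share of non-missing observations below the detection limit."""
    observed = [v for v in values if v is not None]
    return sum(v < detection_limit for v in observed) / len(observed)


def keep_factor(values, detection_limit, max_fraction=0.90):
    """Keep a factor unless more than `max_fraction` of values are below the limit."""
    return fraction_below_limit(values, detection_limit) <= max_fraction


# Hypothetical factors: one rarely detected, one measured in most participants.
mostly_undetected = [0.01] * 19 + [0.5]
well_measured = [0.01, 0.2, None, 0.3, 0.4]
keep_first = keep_factor(mostly_undetected, detection_limit=0.05)
keep_second = keep_factor(well_measured, detection_limit=0.05)
```

The first factor has 95% of its observations below the limit and is dropped; the second has 25% and is retained.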

T2D cases were individuals who had a fasting blood glucose (FBG) level

greater than or equal to 126 mg/dL, as advised by the American Diabetes

Association (ADA) [204] (Figure 15B). We chose specificity and accuracy of

diagnosis over sensitivity, as we acknowledge this definition ignores those

who were previously diagnosed as diabetic, but now keep their blood glucose

under tight control; in fact, a larger proportion of NHANES respondents

described themselves as diabetics or were taking medications often used to

treat diabetes than were classified by FBG levels. Neither FBG nor the self-

reported diabetes status distinguishes between Type 1 Diabetes (T1D) and

T2D; as T2D has a prevalence rate more than 40 times higher than T1D, we

assumed all our cases have T2D.

We used survey-weighted logistic regression to associate each of the 543

environmental attributes to diabetes status while adjusting for age, sex, body

mass index (BMI), ethnicity, and an estimate for SES (Figure 15C). We

acknowledge that estimating SES is difficult; nevertheless, we used the tertile

of poverty index, equivalent to the participant’s household income divided by

the time-adjusted poverty threshold, as the estimate for SES. We used R with

the survey module to conduct all survey-weighted analyses [205, 206].
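
The case definition and the SES estimate described above can be sketched as follows. The function names and example incomes are hypothetical, and the tertile cut points here are unweighted, whereas the analyses in this chapter are survey-weighted.

```python
def is_t2d_case(fasting_glucose_mg_dl):
    """Case definition from the text: fasting blood glucose >= 126 mg/dL."""
    return fasting_glucose_mg_dl >= 126


def poverty_index(household_income, poverty_threshold):
    """Household income divided by the poverty threshold."""
    return household_income / poverty_threshold


def tertile_labels(values):
    """Assign each value to tertile 1, 2, or 3 using rank-based cut points."""
    ordered = sorted(values)
    n = len(ordered)
    cut1 = ordered[n // 3]        # first value of the second tertile
    cut2 = ordered[(2 * n) // 3]  # first value of the third tertile
    return [1 if v < cut1 else 2 if v < cut2 else 3 for v in values]


incomes = [10000, 20000, 30000, 40000, 50000, 60000]  # made-up incomes
indices = [poverty_index(income, 20000) for income in incomes]
ses = tertile_labels(indices)
cases = [is_t2d_case(g) for g in (98, 125, 126, 180)]
```

The resulting tertile label can then enter the regression as the SES covariate, with the first tertile corresponding to the lowest poverty index.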

Exposures were captured as either continuous or categorical variables. Most

chemical exposure data arising from mass spectrometry or absorption

measurements occurred within a very small range and had a right skew; thus,

we log transformed these variables. Further, we applied a z-score

transformation (adjusting each observation to the mean and scaling by the

standard deviation) in order to compare odds ratios from the many regressions.

Similarly, for categorical variables, we made the definition of the referent

consistent, defining them to be the “negative” results of the test.

We calculated the false discovery rate (FDR), the estimated proportion of false

discoveries made versus the number of real discoveries made at a given

significance level, to control for type I error due to multiple hypotheses testing

in associating the factors to disease status [40]. To estimate the number of

false discoveries, we created a “null distribution” of regression test statistics by

shuffling the diabetes status labels 1000 times and recomputing the regressions.

The FDR was then estimated to be the ratio of the proportion of results that

were called significant at a given level α in the null distribution and the

proportion of results called significant from our real tests. To choose factors

significantly associated with T2D in the first single-cohort phase, we used a

significance level (α=0.02), which corresponded to a FDR of 10% across three

out of four cohorts (1999-2000, 2003-2004, and 2005-2006) and 30% for the

2001-2002 cohort.

To improve our power, we used the four independent cohorts to validate

significant findings (Figure 15D). We considered a significant factor as

“validated” if it was found to be significant (α=0.02) in more than one cohort,

at the expense of having to drop those factors not measured in a second cohort.

We then assessed the FDR of the multi-cohort validation. We first estimated

the number of false positives by counting the number of factors found

significant at a level α in two or more cohorts from the permuted datasets. We

then estimated the FDR by computing the ratio between the number of false

positives and the number of validated factors. This value was 2% with α equal

to 0.02.

We fit a final logistic regression model with data combined from multiple

NHANES cohorts utilizing all measurements for a specific environmental

factor, attaining an overall odds ratio. The covariates of the final model were

age, sex, BMI, ethnicity, SES, and cohort. We computed new sample weights

for the combined datasets by taking the average of the original sample weights

as described by the NHANES analytic guidelines [207].

We conducted 3 secondary analytic tests for the validity and sensitivity of our

final estimates. We first attempted to check for reverse causality, or

association of exposure due to T2D diagnosis. Our second test attempted to

take into account the lipophilic characteristics of the environmental factors

found. Our last test attempted to take into account recent food and supplement

consumption as a potential bias for exposure measures. For adequate sample

size and ease of comparison to the final fit model, we utilized all available data

combining multiple NHANES cohorts as the sample to conduct these tests.

To attempt to account for one’s T2D diagnosis as a modifier of environmental

exposure, known as “reverse causality”, we recomputed our models omitting

those who had been diagnosed with diabetes. Individuals with a diabetes

diagnosis were identified through yes answers submitted on a NHANES health

questionnaire (“Doctor told you have diabetes?”). Thus, we refit our final

models with individuals who were only at risk for T2D diagnosis.

Our second test attempted to account for the lipophilic chemical characteristics

of our significant factors. Many of the environmental factors measured in

NHANES absorb readily in fatty tissue; presence of fatty tissue is also

associated with T2D and a potential confounder. Thus, we recomputed the

models taking into account total triglycerides and cholesterol measured in

blood specimen of participants.

In our third test, we attempted to compare dietary and supplement consumption

of cases or controls gathered from 24- and 48-hour recall and supplement use

questionnaires reasoning that recent intake may confound exposure-disease

association. The NHANES data contains amount of food components

consumed based on the dietary recall available for all participants examined

above. Specifically, amounts of food components are computed from the

questionnaire using the United States Department of Agriculture (USDA) Food

and Nutrient Database. Some of the vitamin and nutrient components included

vitamin A, vitamin B-6, vitamin B-12, vitamin C, vitamin E, vitamin K,

carotenes, lycopene, thiamin, riboflavin, niacin, folate, calcium, iron,

magnesium, phosphorus, potassium, sodium, zinc, copper, and selenium.

Other components included macronutrients, such as protein, carbohydrates, fat,

fiber, and cholesterol. The total amount of food components considered

numbered 51 to 63 for the different cohorts. Further, the 2003-2004 and 2005-

2006 cohorts contained both 24- and 48-hour recall data. Supplement use

included counts of consumption of vitamins, minerals, botanicals, and/or

mixtures of them over the month prior to the survey. To check for possible

confounding by recent consumption, we added each food and supplement

variable to the logistic regression models specified above and re-evaluated

significance and effect size of the validated environmental factors. We coded

food component content as the logarithm (base 10) of the amount entered. We

coded supplement use as an integer count value. We acknowledge the

potential of bias with the use of questionnaire data and a pre-determined

database of food items but assumed it was a reliable proxy of consumption and

behavioral data in lieu of other information.

EWAS on T2D: Results

Population characteristics

Across the cohorts, the total non-weighted and weighted numbers of those who

were diabetic compared to non-diabetic were similar. However, we did see

significant differences with demographic factors such as sex, age, and

socioeconomic status between cases and controls. T2D occurred in higher age

groups in all cohorts (p < 0.001, 2-sided t-test). There were significantly more

male participants than females in all cohorts (p < 0.001, 0.02, 0.03, χ2 test)

except for 2005-2006. Furthermore, there was a significant association

between lowest SES (first tertile of poverty index) and T2D (p=0.006, 0.03,

0.04, logistic regression) for the 1999-2000, 2001-2002, and 2005-2006

cohorts respectively. While we did not see a univariate association between

ethnicity and T2D as diagnosed by FBG, we did confirm previously reported

associations of ethnicity to T2D when stratifying by age and sex [208]. As

expected, BMI was significantly associated with T2D status (p < 0.001, t-test)

for all cohorts. Given these differences between the cases and controls, we

adjusted our logistic regression models described below accordingly.

Environment Associations to T2D

Figure 16 shows the distribution of p-values of association for each

environmental factor and class, adjusted for sex, age, BMI, ethnicity, and the

estimate for SES, plotted in a “Manhattan plot” analogous to the association

results from a GWAS study. The 37 significant or notable factors are

annotated in the figure. Specific categories show association with T2D, such as

organochlorine pesticides, nutrients/vitamins, polychlorinated biphenyls, and

dioxins (Figure 16), with 10 to 30% of the factors in each class having

p-values less than 0.02. Many positive (low p-values) and negative (high p-

values) associations replicated well among the different cohorts.

Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D. Y-axis indicates -log10(p-value) of the adjusted logistic regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15A. Within each environmental class, factors are arranged left to right in order from lowest to highest odds ratio (OR). Plot symbols represent different cohorts: 1999-2000 (diamonds), 2001-2002 (squares), 2003-2004 (filled dots), 2005-2006 (circles). Red horizontal line is -log10(α)=1.8 (α=0.02). Validated factors significant in 2 or more NHANES cohorts are in bold face (α=0.02 in two or more cohorts, FDR of 2%) with larger plot points. Other significant factors (α=0.02) are annotated with a numeric label corresponding to the environmental factor class color key on the right. Figure abbreviations: Validated factors: t-β-carotene: trans β-carotene; c-β-carotene: cis β-carotene; PCB170: 2,2',3,3',4,4',5-Heptachlorobiphenyl. Group 1 (dioxins): 1-hxcdd: 1,2,3,6,7,8-Hexachlorodibenzo-p-dioxin; 2-hxcdd: 1,2,3,7,8,9-Hexachlorodibenzo-p-dioxin. Group 2 (furans): OCDF: 1,2,3,4,6,7,8,9-Octachlorodibenzofuran. Group 3 (heavy metals): Ur: uranium; Sb: antimony; Pb: lead. Group 4 (nutrients): tot-β-car: total β-carotene; α-car: alpha-carotene; retnl: retinol; Vita. D: vitamin D; δ-t: delta-tocopherol. Group 5 (organochlorine pesticides): DDE: dichlorodiphenyldichloroethylene. Group 6 (PCB): PCB169: 3,3',4,4',5,5'-Hexachlorobiphenyl; PCB138: 2,2',3,4,4',5'-Hexachlorobiphenyl; PCB195: 2,2',3,3',4,4',5,6-Octachlorobiphenyl; PCB183: 2,2',3,4,4',5',6-Heptachlorobiphenyl; PCB199: 2,2',3,3',4,5,5',6'-Octachlorobiphenyl; PCB178: 2,2',3,3',5,5',6-Heptachlorobiphenyl; PCB187: 2,2',3,4',5,5',6-Heptachlorobiphenyl; PCB180: 2,2',3,4,4',5,5'-Heptachlorobiphenyl; PCB146: 2,2',3,4',5,5'-Hexachlorobiphenyl; PCB196: 2,2',3,4,4',5,5',6-Octachlorobiphenyl. Group 7 (bacteria): H2: Herpes Simplex 2; HSBA: Hepatitis B Surface Antibody.

Table 7 shows those factors that were validated as being significant in two or

more of the independent cohorts (multi-cohort validation FDR of 2%).

Predicted probabilities of having T2D were computed for a prototype

participant, a 45-year-old white male with a BMI of 27 (middle of the range for

non-diabetics in the NHANES sample) and from the middle SES, at high and

low exposure levels. For combined cohorts, the predicted probability applies to

a prototype participant from the 2005-2006 cohort. We also computed the

overall estimate by combining NHANES cohort data in a final model

additionally adjusted for cohort; the predicted probabilities for these models

were computed for a prototype participant as defined above. We defined low

exposure as having a log transformed exposure level one standard deviation

lower than the transformed mean, and high exposure as having a log

transformed exposure level one standard deviation higher than the transformed

mean. For example, a 45-year-old male from the 1999-2000 cohort with high

levels (0.09 ng/g) of heptachlor epoxide has a 6% likelihood of being in our

diabetes subset.
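
As a rough illustration of how a per-SD odds ratio relates the low and high predicted probabilities, the sketch below moves a prototype participant from low exposure (one SD below the mean of the logged exposure) to high exposure (one SD above), a shift of two SDs. The function name and the use of rounded inputs are assumptions made for illustration; the reported probabilities come from the survey-weighted models, so this back-of-envelope version can only approximate them.

```python
def shift_probability(p_low: float, or_per_sd: float, n_sd: float = 2.0) -> float:
    """Predicted probability after raising the logged exposure by n_sd standard deviations.

    p_low     -- predicted probability at the low exposure level
    or_per_sd -- odds ratio for a one-SD increase in logged exposure
    n_sd      -- SDs separating low from high exposure (2 in this chapter's definition)
    """
    odds_low = p_low / (1.0 - p_low)           # probability -> odds
    odds_high = odds_low * or_per_sd ** n_sd   # odds ratios multiply on the odds scale
    return odds_high / (1.0 + odds_high)       # odds -> probability


# Rounded combined-cohort values for β-carotene: OR 0.6, low-exposure probability 0.15.
print(round(shift_probability(p_low=0.15, or_per_sd=0.6), 2))
```

With these rounded inputs the result is close to the high-exposure probability reported for the carotenes; factors with wider confidence intervals agree less closely because of rounding.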

Factor              Cohort       N† (T2D, No T2D)   P         OR (95% CI)       Factor Level (Lo-Hi)   Predicted Probability (Lo-Hi)
cis-β-carotene      2001-2002    211, 2852          0.01      0.6 (0.5, 0.8)    0.4-1.4 µg/dL          0.12-0.05
                    2003-2004    207, 2698          0.002     0.63 (0.5, 0.7)   0.4-1.9                0.13-0.06
                    2005-2006    186, 2425          0.02      0.6 (0.5, 0.8)    0.4-1.6                0.15-0.06
                    2001-2006*   604, 7975          < 0.001   0.6 (0.5, 0.7)    0.4-1.7                0.15-0.06
trans-β-carotene    2001-2002    211, 2854          0.01      0.6 (0.5, 0.8)    5.1-27.2 µg/dL         0.13-0.05
                    2003-2004    207, 2698          0.002     0.7 (0.6, 0.8)    4.8-24.7               0.13-0.06
                    2005-2006    203, 2701          0.004     0.6 (0.4, 0.7)    4.8-29.0               0.16-0.06
                    2001-2006*   621, 8253          < 0.001   0.6 (0.5, 0.7)    4.9-27.0               0.15-0.06
γ-tocopherol        1999-2000    146, 2091          0.02      1.8 (1.3, 2.4)    107-360 µg/dL          0.03-0.09
                    2003-2004    207, 2698          0.01      1.6 (1.3, 2.0)    103-356                0.06-0.13
                    1999-2006*   767, 10307         < 0.001   1.5 (1.3, 1.7)    107-352                0.06-0.13
Heptachlor Epoxide  1999-2000    46, 635            0.002     3.2 (2.4, 4.4)    0.02-0.09 ng/g         0.01-0.06
                    2003-2004    67, 809            0.01      1.9 (1.3, 2.6)    0.01-0.07              0.02-0.07
                    1999-2004*   178, 2367          < 0.001   1.7 (1.3, 2.1)    0.02-0.08              0.03-0.07
PCB170              1999-2000    45, 716            0.02      2.3 (1.5, 3.6)    0.03-0.12 ng/g         0.01-0.06
                    2003-2004    53, 773            0.01      4.5 (2.1, 9.9)    0.01-0.12              0.03-0.42
                    1999-2004*   165, 2426          < 0.001   2.2 (1.6, 3.2)    0.02-0.13              0.04-0.15

Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort. Odds ratio for each exposure, adjusted for BMI, age, sex, ethnicity, and SES is calculated for a change in the log exposure level by one standard deviation. Factor level is the amount of exposure defined by the low (1 SD lower than the average logged exposure level) and high range (1 SD higher than the average logged exposure level). The predicted probability range is an estimate for a 45-year-old white male with BMI of 27 kg/m2 from the middle SES to develop the disease in the low to high range of exposure. * denotes analysis using combined NHANES cohorts; models adjusted for age, sex, ethnicity, BMI, SES, and cohort; predicted probabilities for combined cohorts applies to an individual from the 2005-2006 cohort. † denotes unweighted number.

Nutrients and Vitamins: Carotenes and γ-tocopherol

Several vitamins were found to have levels inversely associated with T2D.

The first was the antioxidant β-carotene, in both its cis and trans isoforms (final

adjusted OR of 0.6; 95% CI 0.5-0.7; p < 0.001). For the prototypical

participant, high levels of trans or cis β-carotene equated to a 9 percentage-point

lower predicted probability (6% vs. 15%) of T2D status. We were able to confirm the inverse

association of β-carotenes seen in multiple epidemiological studies in Saudi

Arabia [209], among older people [210], among Swedish men [211], and in an

earlier NHANES III cohort (pre-1999) [212], as well as another small study

that showed an inverse response between fasting glucose level and β-carotene

[213]. However, in a prospective case-control study β-carotene was not

significantly inversely associated to T2D [214]. Because T2D is associated

with reduced anti-oxidant defense, anti-oxidants, such as carotenes, have been

occasionally recommended as a therapy [215]. However, the evidence of

mitigation of T2D with these vitamins as therapies has been negligible in

clinical trials, including women who are high risk of cardiovascular disease

[216] or male smokers [217].

We also found a vitamin that was positively associated with T2D. Surprisingly,

γ-tocopherol, a form of vitamin E, was highly significantly and positively

associated with T2D (final adjusted OR 1.5; 95% CI 1.3, 1.7; p < 0.001). The

association was significant in two cohorts (adjusted OR of 1.8 and 1.6; p=0.02 and

0.01 for the 1999-2000 and 2003-2004 cohorts; Table 7) and nearly significant in the

two others (adjusted OR of 1.3 and 1.6; p=0.06 and 0.04 for the 2001-2002 and

2005-2006 cohorts). For the prototypical participant, low levels of γ-tocopherol equated

to a 7 percentage-point lower predicted probability (6% vs. 13%). To our knowledge, this is a novel

association between γ-tocopherol and T2D.

Persistent Pollutants: Polychlorinated Biphenyls and Organochlorine

Pesticides

We found organochlorinated pesticides and polychlorinated biphenyls (PCBs),

both related pollutant factors, to be highly positively associated with T2D.

Among the PCBs, we specifically discovered PCB170 (2,2',3,3',4,4',5-

Heptachlorobiphenyl; final adjusted OR of 2.2; 95% CI 1.6-3.2; p < 0.001).

The effect sizes in the individual cohorts for PCB170 were large (adjusted OR

2.3 and 4.5; p = 0.02 and 0.01 for 1999-2000 and 2003-2004 cohorts). The

models predicted up to 15% T2D risk for the prototype participant, more than

double the risk of those with low concentrations of PCB170. The association

between the class of PCBs with T2D has been well described within Native

American [218], Japanese [219], Swedish [220], and Taiwanese [221] cohorts.

Heptachlor epoxide, an oxidation product of the organochlorine pesticide

heptachlor, was among the most highly associated factors (final adjusted OR

1.7; 95% CI 1.3-2.1; p < 0.001) in our EWAS. The effect sizes in the

individual cohorts were also large (adjusted OR 3.2 and 1.9; p=0.002 and 0.01

for 1999-2000 and 2003-2004 cohorts). The predicted probability for the

prototypical participant with high levels of the pollutant was 7%, more than 2

times greater than those who had lower levels of this pollutant.

Secondary analysis to test validity of the final estimates

We then attempted to test the validity of our final estimates by conducting 3

additional analytic tests. In the first test, we attempted to consider the

possibility of “reverse causality” or differential exposure status due to T2D

diagnosis. Second, we attempted to assess the effect of potential confounding

bias due to the lipophilic characteristics on our final environmental factor

effect estimates. Third, we attempted to assess the effect of recent nutrient and

supplement consumption on our final effect estimates.

To consider T2D diagnosis as a modulator of exposure, we removed all

individuals who answered yes when questioned about a past history of diabetes

in the NHANES health questionnaire (“Doctor told you have diabetes?”).

Thus, T2D cases were those who had a FBG higher than 125 mg/dL and were

at risk for T2D diagnosis. We recomputed the effect of exposure, adjusted for

age, sex, SES, ethnicity, BMI, and cohort. For all validated factors significant

in 2 or more cohorts above (Table 7), the estimates remained stable and

statistically significant. The effect size for Heptachlor Epoxide was marginally

smaller with an adjusted OR of 1.6 (95% CI 1.1, 2.1; p = 0.008). The adjusted

OR for PCB170 was also marginally smaller, 2.1 (95% CI 1.2, 3.9; p = 0.02).

The effect of γ-tocopherol was larger, with an adjusted OR of 1.8 (95% CI 1.3,

2.2; p < 0.001) and there was no change to effect sizes of the carotenes

(adjusted OR 0.6; 95% CI 0.5, 0.7; p < 0.001). We concluded that there was

not enough evidence to support the phenomenon of reverse causality based on

the effect sizes estimated for those who were at risk for T2D.

We next attempted to account for potential confounding bias of lipid levels.

To assess the degree of possible confounding we refit the logistic regression

adjusting for the logarithm (base 10) of total triglyceride and cholesterol levels

in addition to age, sex, BMI, SES, ethnicity, and cohort. We did not observe a

great change in effect size estimates for the environmental factors after this

further adjustment for total triglycerides and cholesterol. The odds ratio after

adjusting for lipid levels for carotenes was 14% higher, 0.7 (95% CI 0.6, 0.8; p

< 0.001) compared to 0.6. Similarly, the odds ratio for γ-tocopherol was

attenuated by 7%, 1.4 (95% CI 1.2, 1.6; p < 0.001) compared to 1.5 (Table 7).

For the pesticide factor, the odds ratio was smaller by 6%, 1.6 (95% CI 1.3,

2.0; p < 0.001) versus 1.7 (Table 7). Lastly, for PCB factor, we observed a 3%

higher odds ratio of 2.3 (95% CI 1.4, 3.7, p = 0.002) versus 2.2. Consistent

with this secondary analysis, we observed a similar degree of effect size

differences when using the “Lipid Adjusted” NHANES environmental factors,

which are only provided for few of the pollutant factors (not shown). We

concluded that the effect sizes of the environmental factors were affected by

lipid levels, but not substantially biased by them.

We then searched for differences in food and supplement consumption patterns

between diabetics and non-diabetics for all 4 cohorts close to the time of

survey derived from dietary recall and supplement use questionnaires. In

comparing dietary nutrients, we did not observe a difference for any dietary

nutrient except one between cases and controls. The exception was a

lower total carbohydrate intake for diabetics versus controls, suggesting that

many diabetics may have known about their disease; specifically, the adjusted

OR was 0.7 (95% CI 0.6, 0.8; p=0.001) for a 10% increase in total

carbohydrate consumption, adjusted for sex, age, ethnicity, SES, and cohort.

We also observed an inverse association between any supplement use and T2D,

with an adjusted OR of 0.6 (95% CI 0.5, 0.8, p < 0.001), also consistent with

our expectation of increased health awareness for those with T2D. However,

we specifically could find no difference in consumption of carotenes or

tocopherol (p=0.85 and 0.2 respectively) between cases and controls, two of

the validated nutrient factors found in our EWAS (Table 7).

Having observed some difference in consumption behavior between cases and

controls, we then attempted to assess the influence of recalled dietary

consumption on the environmental associations by recomputing the logistic

regression models in presence of dietary and supplement use variables. Adding

the new dietary or supplementary vitamin consumption variables did not

attenuate the odds ratios (maximum change of 1-2%), nor did they lessen the

strength of the associations for all of the 5 validated environmental factors

described in Table 7. Thus, we did not have evidence to support that recent

consumption influenced the factor-disease effect sizes for the validated factors

found in our EWAS.

We took a further step in assessing the strength of the environmental

associations, adjusting for total triglycerides and cholesterol, any supplement

use, and food intake simultaneously. Specifically, the odds ratio for a SD

increase in γ-tocopherol levels was 1.3 (95 % CI 1.1, 1.5; p=0.004) when

adjusting for logarithm base 10 of triglycerides, cholesterol, total vitamin E

consumption, beta carotene consumption, total carbohydrate consumption, and

any supplement use along with age, sex, ethnicity, BMI, and SES. The

analogous models for the cis and trans β-carotene resulted in adjusted OR of

0.7 (95% CI 0.6, 0.8; p < 0.001). Odds ratios were consistently high and

significant for the pollutant factors Heptachlor Epoxide and PCB170 after

further analogous adjustment of recent consumption and total lipid levels, with

odds ratios of 1.6 (95% CI 1.3, 2.1; p < 0.001) and 2.2 (95% CI 1.4, 3.5;

p=0.003) respectively. We concluded that recent consumption as encoded by

the dietary recall questionnaire in conjunction with lipid levels did not alter the

validity of the associations of the 5 environmental factors found.

To summarize our secondary tests for validity, we concluded that reverse

causality, recent food and supplement consumption, and total lipid levels did

not substantially bias our effect estimates for the 5 validated factors. These

tests were made possible by the extensive list of co-variates available in the

NHANES.

EWAS on T2D: Conclusions

We have described a prototype Environment-Wide Association Study

(EWAS) and applied it to the study of Type 2 Diabetes (T2D), and validated

many of our significant findings across independent cohorts and confirmed

some of them through the literature. This study is made possible by the

examination of multiple cohorts present in the nationally representative

NHANES dataset. We have rediscovered factors such as carotenes and PCBs

with previously known association to T2D. Unexpectedly, we found higher

levels of γ-tocopherol were associated with higher likelihood of T2D,

independent of dietary intake. Of the components of Vitamin E, γ-tocopherol is

the most abundant form in the US diet [222], and makes up to 50% of the total

vitamin E in human muscle and adipose tissue [223], two known insulin-target

tissues. As γ-tocopherol has been previously suggested as a preventive agent

against colon cancer [224], any potential adverse metabolic effects for this

vitamin should be studied closely.

Another novel finding was the significant association between heptachlor

epoxide levels and T2D. Heptachlor is a pesticide; most uses of heptachlor

were discontinued in the late 1980s [225]. The main source of heptachlor and

its breakdown product, heptachlor epoxide, is from food, but heptachlor

epoxide is persistent in the environment and can even be passed in breast milk

[226]. While a significant association with T2D has been reported across

thirty-thousand pesticide applicators who used the pesticide heptachlor [227],

to our knowledge, this broad association between heptachlor epoxide and T2D

in the general public, as surveyed by NHANES, is novel.

While GWAS has allowed us to find novel variants associated with T2D of

possible mechanistic importance and provided a model for a comprehensive

study of the environment described here, associated variants have had only

moderate effect sizes to date. Most of the risk loci identified with GWAS have

small individual odds ratios, generally less than 1.3 [164, 202, 228] and the

highest has been reported to be 1.71, belonging to a variant in the TCF7L2

gene [163, 165]. Albeit from different populations and analytical scenarios, the

effect sizes of our validated environmental factors on T2D were comparable to

the highest odds ratios seen in GWAS.

ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID

LEVELS

Serum lipid levels correlate with the risk of coronary heart disease (CHD),

atherosclerosis, stroke, and even the disease described above, type 2 diabetes

(T2D). Both genetic and environmental factors influence lipid phenotypes.

Lipid level variation is 20-70% heritable [229, 230], while well-documented

environmental or lifestyle factors include physical exercise, smoking, and diet

[231-235]. Other less tangible factors, however, may also be important, such as

air pollution [236]. Here, we have applied the EWAS paradigm,

extended from above, to evaluate 332 environmental attributes for their

association with triglycerides, high-density lipoprotein-cholesterol (HDL-C),

and low-density lipoprotein cholesterol (LDL-C).

EWAS on Serum Lipids: Methods

Data

Laboratory data analyzed included serum and urine measures of environmental

factors and clinical measures including lipid levels. We analyzed all factors

that were a direct measurement of an extrinsic environmental attribute (e.g.

amount of pesticide or heavy metal in urine or blood) as described earlier. Of

824 potentially eligible variables across all cohorts, we omitted 119 that varied

little across individuals (continuous variables with > 99% of the observations

below the detection limit threshold and binary variables with > 99% of

observations in either the “negative” or “positive” class). Of the 705

remaining variables, 169 were measured in the 1999-2000 cohort, 182 from

2001-2002, 258 from 2003-2004, and 96 from 2005-2006. Cumulatively, they

comprised 332 unique environmental factors, with 207 factors measured in >1

cohort. We binned these factors into 32 “classes” of related factors, analogous

to chromosomal units in GWAS (Figure 15A) as described earlier.
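
The variability filter described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name, the use of `None` for missing values, and the representation of binary variables as 0/1 are all assumptions.

```python
def varies_enough(values, detection_limit=None, max_share=0.99):
    """Return False when a variable is too uniform to analyze.

    Continuous variables (detection_limit given) are dropped when more than
    max_share of the observations fall below the detection limit. Binary
    variables are dropped when more than max_share fall in a single class.
    """
    observed = [v for v in values if v is not None]  # ignore missing values
    if not observed:
        return False
    if detection_limit is not None:
        share = sum(v < detection_limit for v in observed) / len(observed)
    else:
        positives = sum(1 for v in observed if v)
        share = max(positives, len(observed) - positives) / len(observed)
    return share <= max_share


# Hypothetical pollutant with 99.5% of readings below a detection limit of 0.05.
readings = [0.01] * 995 + [0.5] * 5
print(varies_enough(readings, detection_limit=0.05))  # False: variable is omitted
```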

Different environmental factors were measured in varying numbers of

participants: 109-3610 (median 938), 101-3388 (median 896), and 222-7485

(median 1958) individuals for triglyceride, LDL-C, and HDL-C levels

respectively (Figure 15B). Individuals are selected randomly to have these

measurements and the selection procedure is dependent on their demographic

characteristics due to the complex stratified population sampling of NHANES

[237]. Serum triglyceride levels were measured in the morning after >8.5 hours’

fasting. LDL-C levels were derived from total cholesterol and direct HDL-C

measurements using the Friedewald calculation.
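
For reference, the Friedewald calculation estimates LDL-C from total cholesterol, HDL-C, and fasting triglycerides, all in mg/dL. The sketch below is illustrative only; the function name is hypothetical, and the 400 mg/dL triglyceride cutoff is the conventional validity limit for the formula, assumed here rather than taken from this chapter.

```python
def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL-C (mg/dL) as total cholesterol - HDL-C - triglycerides / 5.

    Returns None when triglycerides are at or above 400 mg/dL, where the
    estimate is conventionally considered unreliable (assumed cutoff).
    """
    if triglycerides >= 400:
        return None
    return total_chol - hdl - triglycerides / 5.0


# Hypothetical participant: TC 200, HDL-C 50, fasting triglycerides 150 mg/dL.
print(friedewald_ldl(200, 50, 150))  # 120.0
```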

Statistical analysis

The systematic EWAS analysis encompasses multiple steps (Figure 15 C-H) as

described earlier. First, survey-weighted linear regressions are performed for

each environmental factor against log10 transformed lipid levels, adjusting for

age, age-squared, sex, body mass index (BMI), ethnicity, and socioeconomic

status (SES) (Figure 15C). For SES we used the tertile of poverty index

(participant’s household income divided by the time-adjusted poverty

threshold), as previously described. Ethnicity was coded in 5 groups (Mexican

American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other).

We used the R survey package for all survey-weighted analyses [205, 206].
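
Because lipid levels enter the regressions on a log10 scale, a coefficient is most easily read as a percent change in the lipid level. The conversion below is standard arithmetic for log10-transformed outcomes; the function name and the example coefficient are hypothetical and are not taken from the fitted models.

```python
def percent_change(beta_log10, delta=1.0):
    """Percent change in the untransformed outcome for a `delta`-unit increase
    in a predictor whose coefficient on the log10 outcome is `beta_log10`."""
    return (10 ** (beta_log10 * delta) - 1.0) * 100.0


# A hypothetical coefficient of 0.0086 per BMI unit corresponds to roughly
# a 2% higher lipid level per unit of BMI.
print(round(percent_change(0.0086), 1))
```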

We calculated the false discovery rate (FDR), the estimated proportion of false

discoveries made versus the number of total discoveries made for a given

significance level α, to control for multiple hypothesis testing (Figure 15D) [40,

238]. We created a “null distribution” of regression test statistics for each

cohort separately, shuffling the triglycerides, HDL-C, and LDL-C levels 1000

times and refitting the linear regression models. FDR is the ratio of the results

called significant at a given level α in the null distribution and the results

called significant from our real tests. We used FDR<10% to select significant

associations. This corresponds to α=0.02 for triglycerides, 0.02 for HDL-C,

and 0.01 for LDL-C.
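
The permutation-based FDR estimate can be sketched as below. It assumes p-values from the real and shuffled regressions have already been computed, and it averages the null counts over permutations so that the numerator is on the same scale as the real analysis; that normalization and the function name are assumptions for illustration.

```python
def permutation_fdr(real_pvalues, permuted_pvalues, alpha):
    """Estimate the FDR at significance level alpha.

    real_pvalues     -- p-values from the regressions on the observed data
    permuted_pvalues -- one list of p-values per permutation of the outcome
    Returns None when nothing is significant in the real data.
    """
    n_real = sum(p < alpha for p in real_pvalues)
    if n_real == 0:
        return None
    # Average number of "discoveries" per permutation under the null.
    n_null = sum(sum(p < alpha for p in perm) for perm in permuted_pvalues)
    mean_null = n_null / len(permuted_pvalues)
    return min(1.0, mean_null / n_real)


# Toy example with 5 factors and 2 permutations (hypothetical values).
real = [0.001, 0.01, 0.015, 0.3, 0.6]
perms = [[0.5, 0.01, 0.9, 0.4, 0.7], [0.2, 0.3, 0.6, 0.8, 0.9]]
print(permutation_fdr(real, perms, alpha=0.02))
```

Scanning a grid of α values with this function and keeping the largest one with an estimate below 10% would give cohort-specific thresholds of the kind quoted above.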

Next, we used the four independent cohorts to validate significant findings

(Figure 15D). We considered a significant factor as “tentatively validated” if it

was significant (FDR<10%) in more than one cohort and with the same

direction of effect in all cohorts. Of the 332 factors, 125 were assessed in only

1 cohort and thus they could not be considered validated; 73 factors were

assessed in 2 cohorts, 102 were assessed in 3 cohorts, and 32 were assessed in

all 4 cohorts. We assessed the FDR of the multi-cohort validation empirically

through permutations, as described in the previous chapter. Briefly, we first

estimated the number of false positives by counting the number of factors

found significant at level α in multiple cohorts from the permuted analyses.

For example, to assess the FDR of validating a factor in 2 cohorts, we collected

the factors that fell below the significance threshold α in the permuted data

corresponding to two different cohorts and counted the number of factors

found significant in both. We repeated this operation on all possible pairs

(n=6) of cohorts, adding up numbers found to be significant in each pair. We

then estimated the FDR by computing the ratio between the total number of

false positives and the number of true validated factors. We repeated the

analogous operation for factors significant in 3 and 4 cohorts. For triglyceride

levels, FDR=0.008, 0.0003, and 5×10⁻⁵ for results validated in 2, 3 and 4

cohorts, respectively. For HDL-C, the respective FDR is 0.01, 0.0002, and

5×10⁻⁵. For LDL-C, the respective FDR is 0.009, 3×10⁻⁵, and < 10⁻⁹.
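
A minimal sketch of the multi-cohort validation FDR follows. The data layout (a dictionary mapping cohort to a dictionary of factor p-values), the function names, and the choice to count validated factors by summing over cohort subsets in both the permuted and the real data are assumptions; the direction-of-effect requirement is omitted for brevity.

```python
from itertools import combinations


def count_validated(cohort_pvalues, alpha, k=2):
    """Sum, over every k-subset of cohorts, the factors significant in all k."""
    total = 0
    for subset in combinations(cohort_pvalues, k):
        significant = [
            {f for f, p in cohort_pvalues[c].items() if p < alpha} for c in subset
        ]
        total += len(set.intersection(*significant))
    return total


def validation_fdr(real, permuted, alpha, k=2):
    """Ratio of the mean permuted count to the count in the real data."""
    n_real = count_validated(real, alpha, k)
    if n_real == 0:
        return None
    mean_false = sum(count_validated(p, alpha, k) for p in permuted) / len(permuted)
    return mean_false / n_real


# Toy example: four cohorts, two factors (hypothetical p-values).
real = {"c1": {"a": 0.001, "b": 0.5}, "c2": {"a": 0.01, "b": 0.01},
        "c3": {"a": 0.5, "b": 0.001}, "c4": {"a": 0.3}}
permuted = [
    {"c1": {"a": 0.01, "b": 0.4}, "c2": {"a": 0.015, "b": 0.9},
     "c3": {"a": 0.6, "b": 0.7}, "c4": {"a": 0.8}},
    {"c1": {"a": 0.5, "b": 0.5}, "c2": {"a": 0.4, "b": 0.9},
     "c3": {"a": 0.6, "b": 0.7}, "c4": {"a": 0.8}},
]
print(validation_fdr(real, permuted, alpha=0.02, k=2))
```

With four cohorts, `combinations` yields the six pairs mentioned above; setting `k` to 3 or 4 gives the analogous counts for validation in more cohorts.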

We then fit a final linear regression model with data combined from multiple

NHANES cohorts utilizing all measurements available for a tentatively

validated environmental factor, attaining an overall estimate and p-value

(Figure 15E). We utilized the larger sample size to adjust for additional

quantitative factors that we were unable to adjust for in the single cohort

analyses (due to small residual degrees of freedom). In addition to initial

covariates, we also adjusted for waist circumference, T2D status (as defined in

the previous section, fasting blood glucose ≥ 126 mg/dL), systolic and diastolic

blood pressure (mm Hg), and cohort. To estimate how much of the variance

was described by each environmental factor, we estimated the change in the

coefficient of determination (R²) after adding that factor versus a model including

only the adjusting factors (Figure 15F). We also performed regressions on

untransformed lipid levels to estimate raw effect size.
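
The change in R² attributable to one factor can be illustrated with two nested least-squares fits. The sketch below uses unweighted ordinary least squares solved through the normal equations, which is a simplification: the analysis itself is survey-weighted, and the function names and toy data are hypothetical.

```python
def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X (intercept added)."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented normal equations [A^T A | A^T y], solved by Gauss-Jordan elimination.
    aug = [[sum(rows[i][a] * rows[i][b] for i in range(n)) for b in range(k)]
           + [sum(rows[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    beta = [aug[j][k] / aug[j][j] for j in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def delta_r_squared(covariates, factor, y):
    """R^2 of (covariates + factor) minus R^2 of covariates alone."""
    full = [list(c) + [f] for c, f in zip(covariates, factor)]
    return r_squared(full, y) - r_squared(covariates, y)


# Toy data: the outcome depends on one covariate and one environmental factor.
cov = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
fac = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [2 * c[0] + 3 * f for c, f in zip(cov, fac)]
print(delta_r_squared(cov, fac, y))
```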

Sensitivity analyses

We conducted sensitivity analyses to account for recent food, alcohol,

supplements, medications, exercise, and history of cardiovascular health

(Figure 15G). Sixty-two questionnaire items were used. For adequate sample

size and consistency with the final-fit model, we combined all available

NHANES cohorts.

Intake variables (total calories, carbohydrates, saturated fat, monounsaturated

fat, alcohol, cholesterol, vitamins, and iron) are computed from the

questionnaire using the USDA Food and Nutrient Database. All cohorts

contained 24-hour recall data. For 2003-2004 and 2005-2006 cohorts we

computed an average of 24- and 48-hour recall data. Supplement use included

the integer count of consumption of vitamins, minerals, botanicals, and/or their

mixtures in the month prior to the survey. Consumption of any fish or shellfish

during the last month was also considered.

For individuals with abnormal levels of lipids, drug therapies such as statins,

resins, and fibrates are often prescribed [239]. Therefore, we sought to adjust

for any use of these medications. Drug use definition required that the

individual used the drug during the month prior to the survey and the

interviewer saw the prescription bottle.

Physical activity also influences lipid levels. We therefore classified

individuals in high, medium, or light intensity weekly exercise categories by

computing metabolic equivalents of recalled activity levels [240, 241],

including components such as leisure time, occupational and household

routines-related activity.

Finally, recalled cardiovascular health history was based on positive response

to questions on the presence of coronary heart disease, angina/angina pectoris,

heart attack(s), or congestive heart failure.

To evaluate the impact of these 62 adjusting variables, we recomputed the

regression models by adding each variable to our final model one-by-one and

observed the change in the effect size for each putative environmental factor.

We also built a model adjusting for lipid-lowering drugs, supplement use,

exercise, and self-reported cardiovascular-related disease simultaneously.

Correlation pattern between factors associated with lipid levels

Identified factors that are associated with lipid levels may not be independent.

Therefore, we also computed all Pearson correlation coefficients between each

of the validated environmental factors as well as the demographic (age, sex,

ethnicity, SES), and clinical (BMI, waist circumference, blood pressure, and

diabetes status) risk factors to ascertain the pattern of relationships among

them (Figure 15H). Next, we visualized all of these correlations as a

“correlation globe” to infer their inter-dependence as a function of all variables

examined. This approach has been utilized to discover inter-related or

dependent sets of genes in a gene expression microarray experiment [176, 177].
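
The pairwise correlations that feed such a visualization can be computed as sketched below. The variable names and toy values are hypothetical, and the sketch ignores survey weights and missing data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_pairs(variables):
    """Map each unordered pair of variable names to its Pearson correlation."""
    return {(a, b): pearson(variables[a], variables[b])
            for a, b in combinations(variables, 2)}


# Toy measurements for three hypothetical variables.
data = {"factor_a": [1, 2, 3, 4], "factor_b": [2, 4, 6, 8], "bmi": [4, 3, 2, 1]}
for pair, r in correlation_pairs(data).items():
    print(pair, round(r, 2))
```

Each entry of the resulting dictionary corresponds to one edge of the correlation globe, which could then be filtered by magnitude before plotting.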

Power calculations

We estimated [242] that the EWAS had >90% median power for all cohorts for

detection of 5% change in HDL-C (p<0.02) and LDL-C (p<0.01) and 10%

change of triglyceride levels (p<0.02).

EWAS on Serum Lipids: Results

Demographic and baseline associations with lipid levels

As expected [243], demographics, BMI, ethnicity, and SES correlated with

lipid levels. For example, consistent positive correlations existed between age

and triglycerides (5-10% higher per 10 years, p-values<0.02), and BMI and

triglycerides (2% higher per 1 unit of BMI, p-values<0.004), and consistent

negative correlations between black ethnicity and triglycerides (13% lower vs.

white, p-values<0.001) [244]. Consistent polynomial relationships existed

between age and both HDL-C and LDL-C. Negative correlations existed

between BMI and HDL-C (1% lower per BMI unit, p-values<0.0001). In

addition, SES was associated with HDL-C (1-5% lower for lower vs. higher

tertile, p-values<0.03).

Environment associations with lipid levels

Figure 17 shows the distribution of p-values of association for each

environmental factor binned by its class, adjusted for sex, age, age-squared,

BMI, ethnicity, and SES. For triglyceride levels, 10/169, 24/182, 49/258, and

12/96 factors passed the required threshold of significance for the 1999-2000,

2001-2002, 2003-2004, and 2005-2006 cohorts respectively. Likewise for

LDL-C, 1/169, 8/182, 15/258, and 11/96 were significant, respectively. For

HDL-C, 2/169, 21/182, 39/258, and 15/96 were significant. Using other

cohorts, we tentatively validated significant findings from our initial screen.

Across cohorts, there were 22, 8, and 17 tentatively validated factors for

triglycerides, LDL-C and HDL-C, respectively (Figure 17 A-C).

The data were combined across cohorts for each tentatively validated factor and

estimates were further adjusted for waist circumference, T2D status, blood

pressure, and cohort. The variance ascribed to baseline co-variates was 22-25%

(triglycerides), 15-16% (LDL-C), and 23-26% (HDL-C). Each of the

tentatively validated environmental factors described an additional 0.7-18.4%

(triglycerides), 1.8-14.1% (LDL-C), and 0.4-4.0% (HDL-C) of the variance in

lipid levels.

Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels. Y-axis indicates -log10(p-value) of the adjusted linear regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15. Plot symbols represent different cohorts: 1999-2000 (diamonds), 2001-2002 (squares), 2003-2004 (filled dots), 2005-2006 (circles). Red horizontal line represents the level of significance corresponding to FDR less than 10%. A) log10(triglycerides), B) log10(LDL-C), C) log10(HDL-C).

[Figure 17 plot not reproduced. Panels: A triglycerides, B LDL-C, C HDL-C. Y-axis: -log10(p-value), 0 to 4. X-axis: environmental factor classes (acrylamide, allergen test, bacterial infection, cotinine, dialkyl, dioxins, furans dibenzofuran, heavy metals, hydrocarbons, latex, nutrients carotenoid, nutrients minerals, nutrients vitamin A, nutrients vitamin B, nutrients vitamin C, nutrients vitamin D, nutrients vitamin E, PCBs, perchlorate, pesticides atrazine, pesticides carbamate, pesticides chlorophenol, pesticides organochlorine, pesticides organophosphate, pesticides pyrethroid, phenols, phthalates, phytoestrogens, polybrominated ethers, polyfluorochemicals, viral infection, volatile compounds). Legend: cohort markers 1999-2000, 2001-2002, 2003-2004, 2005-2006.]

Effects for the top tentatively validated associations for triglycerides, LDL-C,

and HDL-C are shown in Figure 18, Figure 19, and Figure 20. Although we

found 22 and 17 factors for triglycerides and HDL-C respectively, we display

the top 2 findings (total of 12) for each category for visualization. Furthermore,

we discuss here some of them in more detail. Effect sizes for continuous

variables are for 1 SD of log-transformed value of the environmental factor.
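
The percent changes reported below can be related to a regression coefficient as follows. This is a minimal sketch, assuming the lipid outcome is log10-transformed (as in Figure 17) and the exposure is expressed in SD units of its logged value, so that a coefficient β corresponds to a multiplicative change of 10^β in the lipid level per 1 SD; the function name `percent_change_per_sd` is hypothetical.

```python
import math


def percent_change_per_sd(beta_log10):
    """Percent change in the (untransformed) lipid level per 1 SD of exposure,
    given the coefficient from a model whose outcome is log10(lipid)."""
    return 100.0 * (10.0 ** beta_log10 - 1.0)


# A coefficient of log10(1.10) on the log10 scale corresponds to a 10% higher lipid level.
example_beta = math.log10(1.10)
example_change = percent_change_per_sd(example_beta)
print(round(example_change, 6))
```

Under the same assumption, a negative coefficient yields a negative percent change, which is how lower lipid levels appear in the forest plots.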

Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI. Combined cohort denotes the estimate attained when combining all cohorts available for exposure in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI, waist circumference, T2D status, blood pressure, and cohort. Percent change (x-axis) is the percent change of lipid level for a change in 1SD of logged exposure value. Effect size (in mg/dL) attained when fitting the untransformed lipids to the model. Symbols proportional to sample size and colors represent different environmental classes as represented in Figure 15.

[Figure 18 forest plot not reproduced. Rows: 1,2,3,4,7,8-hxcdf; trans-b-carotene; cis-b-carotene; Retinol; Retinyl palmitate; a-tocopherol; g-tocopherol; PCB74; PCB170; Oxychlordane; Trans-nonachlor; Enterolactone. Columns: cohort, N, p-value, effect (mg/dL). X-axis: % change, -20 to 40.]

Figure 19. Forest plot for validated environmental factors associated with LDL-C. See Figure 18.

[Figure 19 forest plot not reproduced. Rows: trans-b-carotene; cis-b-carotene; b-cryptoxanthin; Combined Lutein/zeaxanthin; trans-lycopene; Retinyl palmitate; a-tocopherol; g-tocopherol. Columns: cohort, N, p-value, effect (mg/dL). X-axis: % change, 0 to 20.]

Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C. See Figure 18.

[Figure 20 forest plot not reproduced. Row labels: Cotinine; Mercury, total; 2-fluorene; 3-fluorene; cis-b-carotene; Iron, Frozen Serum; Retinyl stearate; Folate, serum; Vitamin C; Vitamin D; g-tocopherol; Heptachlor Epoxide. Columns: cohort, N, p-value, effect (mg/dL). X-axis: % change, -6 to 6.]

Vitamins A and E: unfavorable association with lipid levels

For all three lipids, we found a consistent association for lipid-soluble, anti-

oxidant vitamins, such as vitamin A, E, and carotenoids (Figure 17, Figure 18,

Figure 19, Figure 20). For example, a form of vitamin A, retinol, was

positively associated with triglycerides (p=6x10^-21, effect=10% or 25 mg/dL

higher triglycerides per 1SD) in all cohorts examined. Another form of

vitamin A, retinyl palmitate, was also positively associated with triglycerides

(p=6x10^-21, effect=10%) and LDL-C (p=4x10^-13, effect=5% or 6 mg/dL).

Retinyl stearate was negatively associated with HDL-C (p=4x10^-5, effect=-3%

or -1 mg/dL). Retinol is the functional form of vitamin A produced in the

body from β-carotene and is a co-factor in biological processes associated with

vision and gene transcription [245]. Retinyl palmitate and stearate are animal-

and supplement-sourced vitamin A esters stored in the liver [245].

We observed a consistent association between forms of vitamin E (α and γ

tocopherol) and lipid levels. α-tocopherol strongly correlated with higher

triglyceride and LDL-C levels (effect=35% (p=8x10^-20) and 16% (p=7x10^-19),

or 67 and 16 mg/dL, respectively). γ-tocopherol was also correlated with

higher triglycerides (effect=17% higher, p=10^-17) and LDL-C (6% higher,

p=3x10^-14) levels, but also with lower HDL-C (effect=-2%, p=6x10^-6).

Vitamin E is consumed via vegetables, nuts, oils, and supplements.

Tocopherols are highly lipophilic and their absorption is enhanced by

triglycerides.

Carotenoids: favorable association with HDL-C and triglycerides and

unfavorable association with LDL-C

Both isomers of β-carotene, cis- and trans-, were associated with lower

triglyceride levels (p=10^-6, effect=-7% or -12 mg/dL; p=10^-8, effect=-10% or -16

mg/dL, respectively). However, both isomers of β-carotene, in addition to other

carotenoids such as β-cryptoxanthin and lycopene, were consistently associated

with higher levels of both HDL-C and LDL-C. The effect was 5% (p=3x10^-12)

and 6% (p=5x10^-11) for HDL-C and LDL-C levels, respectively, for cis-β-

carotene and 3% (p=10^-10) and 12% (p=8x10^-17) for lycopene. Carotenoids are

primarily sourced from consumption of fruits and leafy vegetables [246]; β-

and α-carotene (but not lycopene) are vitamin A precursors [245, 246].

Favorable lipid correlations with vitamins B, C, D, iron, mercury, and

enterolactone

We found serum levels of folate (vitamin B), C, D, iron, and mercury to be

favorably associated with HDL-C (Figure 20). Effect sizes of vitamin, iron

and mercury levels on HDL-C were similar, ranging from 3 to 4% (1-2 mg/dL)

higher HDL-C (p<0.002). Last, we found enterolactone, a product of lignan

metabolism in the intestine, to be associated with 10% (17 mg/dL) lower

triglyceride levels (p=2x10^-7, Figure 18).

Persistent pollutants: unfavorable association with triglycerides and HDL-C

Polychlorinated biphenyls (PCBs), dibenzofurans, and organochlorine

pesticides, all persistent organic pollutants, were unfavorably associated with

both triglyceride and HDL-C levels (Figure 18, Figure 20). Seven PCB factors

were tentatively validated and the most significant congeners, PCB74 and

PCB170, were associated with 15% (p=10^-6) and 19% (p=4x10^-6) higher

triglyceride levels. Five organochlorine factors were tentatively validated,

among which oxychlordane and trans-nonachlor were linked to 29%

and 30% higher (p=5x10^-9 and 1x10^-8) triglyceride levels. Another

organochlorine pesticide, heptachlor epoxide, was associated with 3% lower

HDL-C (p=0.006). While use of these compounds is banned, they are known

to persist and accumulate due to their stability and lipophilicity.

Markers for air pollution and nicotine: unfavorable association with HDL-C

Several markers of air pollution and nicotine exposure were unfavorably

associated with HDL-C (Figure 20). The polyaromatic hydrocarbon markers

of fluorene, 3-hydroxyfluorene and 2-hydroxyfluorene, were associated with

3% lower HDL-C (p=0.006 and p=0.004). Cotinine, a serum biomarker for

nicotine, was also associated with 3% lower HDL-C (p=2x10^-6).

Polyaromatic compounds are formed as a result of burning of hydrocarbon-

based substances, such as tobacco, coal, gas, oil, and meats.

Sensitivity analyses with further adjustments

For most questionnaire variable adjustments, we did not see a sizable

difference in estimated coefficients or p-values for the environmental factors

(Figure 21, Figure 22, Figure 23), including questionnaire items regarding self-

reported cardiovascular-related disease status and use of drugs. Interestingly, in

most exceptions, adjustments increased the effect size of the environmental

factor. For example, after adjusting for vitamin and supplement intake, the

associations between γ- and α-tocopherol and triglyceride and LDL-C levels

became stronger. Similarly, adjustment for total fiber intake also strengthened

the association of β-carotenes and cryptoxanthin with LDL-C. The association

of cotinine and of 3- and 2-fluorene with HDL-C strengthened after adjustment for

alcohol intake. Adjusting for any fish and shellfish consumption strengthened

the association between pollutants and triglycerides. Adjustment for fish and

shellfish consumption strengthened the association between retinyl stearate and

HDL-C and triglyceride levels. Conversely, the effect of vitamin C and folate

in relation to HDL-C decreased when taking supplement count, total fiber

intake, and physical activity into account. Adjusting for supplement count

decreased the effect of γ-tocopherol on HDL-C.
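
The comparison between "original" and "extended" models can be sketched as follows. The quantity computed is the one named on the x-axis of Figure 21, 100*(extended-original)/original, and the 10% cutoff follows the caption of that figure; the function names and the coefficient values are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_change_in_effect(original, extended):
    """100 * (extended - original) / original for a pair of regression coefficients."""
    return 100.0 * (extended - original) / original


def flag_large_shifts(original, extended_by_covariate, threshold=10.0):
    """Return the added covariates whose inclusion shifts the coefficient by more
    than `threshold` percent (in either direction), with the size of the shift."""
    flagged = {}
    for name, beta in extended_by_covariate.items():
        change = percent_change_in_effect(original, beta)
        if abs(change) > threshold:
            flagged[name] = change
    return flagged


# Hypothetical coefficients for one environmental factor.
original_beta = 0.040
extended_betas = {"count": 0.046, "physical_activity": 0.041}
flagged = flag_large_shifts(original_beta, extended_betas)
print(flagged)
```

A positive value means the adjustment strengthened the estimated effect and a negative value means it attenuated the effect, provided the sign of the coefficient itself is unchanged.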

Figure 21. Percent change in effect size (β_factor) between “original” and “extended” linear regression models predicting log10(triglycerides). “Original” estimates were adjusted for sex, age, age-squared, SES, ethnicity, BMI, waist circumference, blood pressure, T2D status (fasting blood glucose >= 125 mg/dL), and cohort. “Extended” estimates were adjusted for the same co-variates as in the original model, in addition to questionnaire items added sequentially. For points annotated as “cardiovascular” (red diamond), the extended estimates were adjusted for the same co-variates as in the original model in addition to “count”, “physical_activity”, “any_heart_disease”, and “cholesterol_lowering” simultaneously. Estimates of β_factor that changed by more than 10% relative to the original estimate upon adding the extra co-variate are annotated in color. P-values for the “extended” β_factor are shown to the left of the point. Legend abbreviations: TLZ: total lutein/zeaxanthin; TATOC: total tocopherol; TBCAR: total β-carotene; any_shellfish: any shellfish consumed in past 30 days; any_fish: any fish consumed in past 30 days; count: total supplements used in past 30 days; physical_activity: total physical activity in metabolic equivalents in past 30 days.

[Figure 21 plot not reproduced. Rows: 1,2,3,4,7,8-hxcdf; a-Carotene; trans-b-carotene; cis-b-carotene; Retinyl palmitate; Retinyl stearate; Retinol; g-tocopherol; a-tocopherol; PCB199; PCB74; PCB99; PCB156; PCB170; PCB196 & 203; PCB206; Beta-hexachlorocyclohexane; Dieldrin; Heptachlor Epoxide; Oxychlordane; Trans-nonachlor; Enterolactone. X-axis: 100*(extended-original)/original, -15 to 20. Legend: TLZ, cardiovascular, TATOC, TBCAR, any_shellfish, any_fish, count, physical_activity.]

Figure 22. Percent change in effect size (β_factor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease.

[Figure 22 plot not reproduced. Row labels: trans-b-carotene; cis-b-carotene; b-cryptoxanthin; trans-lycopene; Retinyl palmitate; g-tocopherol; a-tocopherol. X-axis: -5 to 15. Legend: TFIBE, TVC, TCRYP, count, cardiovascular.]

Figure 23. Percent change in effect size (β_factor) between “original” and “extended” linear regression models predicting log10(HDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease; TALCO: total alcohol; TMAGN: total magnesium; TFF: total food folate; any_fish: any fish consumed in past 30 days; any_shellfish: any shellfish consumed in past 30 days; count: total supplement use past 30 days; physical_activity: total physical activity in past 30 days; TPOTA: total potassium.

[Figure 23 plot not reproduced. Row labels: Cotinine; Mercury, total; 3-fluorene; 2-fluorene; a-Carotene; trans-b-carotene; cis-b-carotene; b-cryptoxanthin; trans-lycopene; Iron, Frozen Serum; Retinyl stearate; Folate, serum; Vitamin C; Vitamin D; g-tocopherol; Heptachlor Epoxide. X-axis: -40 to 20. Legend: TFIBE, cardiovascular, TALCO, TMAGN, TFF, any_fish, any_shellfish, count, physical_activity, TPOTA.]

Simultaneous adjustment for self-reported cardiovascular-related disease,

supplement count, lipid-lowering drugs, and physical activity strengthened the

association between tocopherols and pollutant factors and triglycerides, while

attenuating the association to α-carotene (Figure 21). For HDL-C levels,

effects of cotinine, mercury, 3- and 2-fluorene, folate, vitamin C, vitamin D,

and γ-tocopherol were all attenuated by more than 15% (Figure 23). However, the

direction and significance of the effects were preserved throughout.

Correlation patterns

Evaluation of Pearson correlations showed a dense correlation pattern for

triglycerides (Figure 24) and more sporadic strong correlations between

various factors for HDL-C and LDL-C (Figure 25, Figure 26). As expected,

we observed strong correlations among closely related factors, such as between

PCBs (ρ > 0.6) or carotenoids (ρ > 0.5), and even within factor classes, such as

organochlorine pesticides (ρ > 0.3). Of note, the hydrocarbon factors 2- and 3-

hydroxyfluorene were highly correlated with cotinine (ρ=0.6 and 0.7). The

baseline demographics were not strongly correlated (e.g., ρ > 0.5) with any of the

environmental factors, with the exception of age, which showed strong

associations with many of them.
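
A minimal sketch of the pair-wise correlation screen follows. It computes Pearson correlations between every pair of factors and keeps the pairs whose absolute correlation exceeds 0.2, the display cutoff stated in the caption of Figure 24. The function names, factor names, and values are hypothetical; they are chosen only so that the result is easy to verify by hand.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(factors, threshold=0.2):
    """Return (name_a, name_b, rho) for every pair with |rho| above the threshold."""
    edges = []
    for a, b in combinations(sorted(factors), 2):
        rho = pearson(factors[a], factors[b])
        if abs(rho) > threshold:
            edges.append((a, b, rho))
    return edges


# Hypothetical measurements: factor_b is a multiple of factor_a,
# while factor_c is uncorrelated with both.
factors = {
    "factor_a": [1, 2, 3, 4, 5],
    "factor_b": [2, 4, 6, 8, 10],
    "factor_c": [1, -1, 0, -1, 1],
}
edges = correlation_edges(factors)
print(edges)
```

In a plot such as the correlation globes, each returned tuple would become one line, with the sign of `rho` determining its color and the absolute value its thickness.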

Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides. Each node corresponds to a validated environmental (in color of environmental class, see Figure 15) or demographic/clinical risk factor (in white). Correlations > 0.2 and < -0.2 are shown with line thickness proportional to the absolute value of correlation. Line color corresponds to the sign of correlation (positive=red, negative=blue). To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.

Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.

Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C. To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.

EWAS on Serum Lipids: Conclusions

In the current application, our findings reveal complex relationships between

serum lipid levels and fat-soluble antioxidant vitamins. Randomized studies

and meta-analyses [43, 247-250] have shown these vitamins to have no benefits

or even confer harm when given in high doses, in contrast to previous

favorable associations in observational studies [251, 252]. The unfavorable

lipid profile that we observed with vitamin E forms is possibly consistent with

the randomized evidence on hard clinical outcomes and we also found an

unfavorable lipid profile for vitamin A forms. Carotenoids have a mixed effect,

improving triglycerides and HDL-C, but worsening LDL-C.

These associations may reflect a complex web of physiological correlation or

even reverse causality. For example, α-tocopherol and carotenes are

transported in serum with HDL and LDL [246, 253, 254] and accurate

measurement of serum α-tocopherol is dependent on serum lipids [255]. In this

regard, the strong association between α-tocopherol and LDL-C and

triglycerides might be considered a true positive result. On the other hand,

given the lack of evidence for γ-tocopherol or retinol associating with

lipoprotein complexes, their association might be due to reverse causality, or

increased anti-oxidant consumption among those who know about their

adverse lipid level profile. Nevertheless, given that vitamin E consumption has

been found to increase mortality in meta-analysis [43], the large effect sizes

suggest that prospective studies should be scrutinized for any potentially adverse

effects of vitamin E on lipid levels and other metabolic disorders, such as

T2D.

We observed an association of vitamins B (folate), C and D, mercury, and iron,

to higher HDL-C levels. Folate [256] and vitamin D [257] have previously

been associated with higher HDL-C. Fish, a source of cardioprotective omega-

3 fatty acids, are also a large source of mercury [258]; however, we did not

observe a large change in effect size of mercury when accounting for

consumption of fish. These nutrients and metals may be to some extent

surrogate markers of “healthy diet” behaviors; however what exactly

constitutes “healthy diets” is currently very difficult to define, in contrast to

earlier claims [259, 260]. The strength of the association for these dietary

markers is similar on HDL-C, ranging from 1-3 mg/dL for a standardized

change per factor. These are small effects and it is unclear whether

cumulatively they could have a much larger impact in raising HDL-C level,

given the correlations between these markers.

We also identified enterolactone to be strongly associated with favorable

triglyceride levels in this study. Enterolactone is a metabolite of lignans,

which are found in foods such as flaxseed, and have been associated with

favorable cholesterol profiles in this form [261, 262]. Again, it is unclear what

role, if any, this marker plays as a surrogate of “healthy diets” and effects on

heart disease have been inconsistent [263].

We found biomarkers of hydrocarbons, 2- and 3-hydroxyfluorene, to be

strongly associated with unfavorable HDL-C levels. While others have shown

the association of these metabolites with self-reported cardiovascular disease in

the NHANES data [264], to our knowledge the association with HDL-C is

novel. Relatedly, we also found a marker of nicotine, cotinine, to have a

similar association with HDL-C. Particulate matter air pollution, composed of

many types of hydrocarbons, and smoking have long been major concerns for

cardiovascular-related diseases [236, 265, 266]. Smoking is well known to

influence HDL-C levels [267, 268] and acute and chronic exposures to tobacco

smoke have been shown to decrease HDL-C substantially [269]. The high

correlation of the hydrocarbon markers to cotinine suggests that these

associations might all indicate exposure to cigarette smoke.

We also found that persistent organic pollutants, such as organochlorine

pesticides and polychlorinated biphenyls, were associated with large increases

in triglycerides and large decreases in HDL-C. These compounds have been

implicated in other metabolic-related diseases and other populations. For

example, PCB170 and heptachlor epoxide were associated with T2D in our

previous EWAS on T2D. Similarly, PCBs and dibenzofurans have been

associated with metabolic syndrome in a Japanese population [270] and have

been claimed to have an “obesogenic” effect [271].

We should acknowledge that these associations might be confounded due to

the fat solubility of these pollutants. Nevertheless, there have been efforts to

characterize this relationship. For example, in a study analytically considering

causal pathways and confounding bias via structural equation modeling, the

investigators found a relationship between polychlorinated biphenyls and lipid

levels consistent with forward causality for a native population with high

exposure of these pollutants in upstate New York [272]. Another study found

an ecological relationship between cardiovascular-related hospitalization rates

in areas close to PCB pollution [273]. Further, higher incidence of

cardiovascular disease was observed in an occupational Swedish cohort [274].

Nevertheless, and notably, these studies took place in populations in which the

source of exposure was known and dosages were much higher than the general

population levels seen in NHANES.

Finally, identifying specific heritable components through GWAS has proven

difficult: one recent study attributed 10-12% of variability of the lipid levels to

95 genetic loci in a sample of >100,000 individuals [65], and each genetic

factor explained less than 0.5% of the variance. By comparison, each of the

validated environmental factors described a larger portion of the variance in

lipid levels, occasionally exceeding 10%; however, unlike for the genetic

variants, reverse causality cannot be excluded.

DISCUSSION

By combing through all environmental exposure measures using a systematic

EWAS approach, we have found multiple novel environmental factors

associated with type 2 diabetes and lipid levels beyond the level of false

discovery. The method is general enough to apply to diverse datasets, and, in

fact, collaborators have begun utilizing EWAS to study blood pressure and

kidney disease. The EWAS approach bypasses the problem of selectively

testing and reporting one or a few associations at a time that has been

repeatedly debated as a source of biased results and false positives in

epidemiological studies [10, 42, 179, 275, 276]. In its current form, EWAS

offers a new way to generate a comprehensive list of associations that have

robust support after multiple comparisons, a practice not followed in

environmental epidemiology currently. The ensuing associations are then

carefully scrutinized for their validity in sensitivity analyses. Adjustments for

potential confounders are also systematic and correlation patterns between

variables are evaluated and visualized.
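
To make the multiple-comparison step concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of p-values at an FDR of 10%. This is a commonly used stand-in and an assumption on our part for illustration: the FDR estimation actually used in this work is defined in its methods and may differ. The function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the input order, that is True where the
    corresponding hypothesis is rejected at the requested false discovery rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for six environmental factors.
p_values = [1e-8, 0.004, 0.03, 0.2, 0.5, 0.9]
significant = benjamini_hochberg(p_values, fdr=0.10)
print(significant)
```

With six tests the per-rank thresholds are 0.0167, 0.0333, 0.05, 0.0667, 0.0833, and 0.1, so the three smallest p-values pass and the remaining three do not.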

Like GWAS, the EWAS framework can be used to propose targets for further

study. For example, many factors are correlated; some are similar structurally,

such as the isomers of β-carotene, or show dependent patterns of exposure

environment, such as the PCBs and organochlorine pesticides or the serum

markers for cotinine and urine markers for hydrocarbons. As we extend the

GWAS analogy, and provide a precise definition of the envirome (Chapter 1,

3), these and other environmental factors could be said to be in “linkage

disequilibrium” with each other. Just as is done for preliminary GWAS

findings, EWAS findings can and should be used to identify further factors that

may be in “disequilibrium,” for further detailed measurement and causal

identification.

EWAS allows for comprehensive and systematic analysis of the effects of the

environment in association to disease on a broad scale. While many

investigators have already utilized the NHANES to address the effect of a

limited number of factors on disease, they do not provide a global view of

these associations [277, 278]. Further, the previous studies use differing

definitions of disease status (such as a medical questionnaire), exposure coding

(discretization vs. log transformation), and lack methods for multiple

comparison control [279-281]. It is the well-established toolkit of the GWAS

that has provided us with methods to overcome these limitations and to enable

us to postulate about environment-wide association to disease.

Limitations of these studies remind us that measuring environment-wide

aspects in relation to phenotypic states such as disease will be a difficult

undertaking [10]. While the NHANES provides a large number of factors to

study, a comprehensive assessment will require precise definition over a

broader dimension (including more factors). While laboratory measurements

are collected during a baseline fasting state for all participants in NHANES, we

will still have to account for the dynamic and heterogeneous nature of different

exposures and their associated responses by taking replicate measurements at

different physiological states.

Furthermore, the observed cross-sectional correlations in the EWAS setting do

not offer proof for causality (Chapter 3). While we attempted to check the

validity of our estimates by systematically adjusting for known, self-reported

confounders, residual confounding and confounding from unmeasured

variables cannot be ruled out and reverse causality remains always a possibility

for findings of cross-sectional data. We have also shown how to

systematically evaluate the correlation pattern between known and novel

environmental correlates of lipids to communicate the complex inter-

relationships between these variables. We hypothesize that this approach

would be helpful in designing future studies such as randomized trials that may

try to intervene at one or more nodes in the correlation globes.

To more formally ascertain causality, we would need to perform prospective

EWAS over the life course, consider incident cases, consider randomization

methods [69], or even evaluate gene-environment interactions as additional

validation (Chapters 3 and 5). Due to the number of hypotheses generated, we

would need to integrate more evidence from large-scale collaborative studies

in order to confirm (or refute) etiological aspects of these factors while being

as comprehensive as possible in the observation of potential confounding

variables. For example, additional factors such as behavior (food consumption,

drug use, and/or exercise patterns), geographic location, and occupation must

also be ascertained to account for associated risk factors and reverse causality.

The measurement of 300 environmental factors is hardly a comprehensive

study of the environment, but this is still a greater number of factors measured

than the 30 microsatellite markers [282], or 100 single nucleotide

polymorphisms (SNPs) in some of the earliest implementations of GWAS

[283]. We suggest that measurement technologies for the environment can and

will improve in resolution, as novel associations are made using even few

measurements in these prototype studies. Measurement of the panel of

environmental factors used here, most of which are performed by mass

spectrometry, currently costs an estimated $40,000 per individual [284], or

close to the current pricing for candidate SNPs and copy number variation

sequencing.

Another type of hypothesis we may generate is regarding the complex cause of

disease. For example, we can now use an EWAS to hypothesize about “gene-

environment” interactions and their relation to disease etiology. In the next

chapter, we address how to screen for gene-environment interactions through

integration of GWAS and EWAS, where genetic variability is assessed

simultaneously along with robustly identified environmental factors. As will

be seen, while resource intensive, this type of study design could perhaps

facilitate an explanation of disease causation that has eluded genome-wide

scans, provide additional validity for the marginal effect of exposure, and

enable more accurate estimates of risk [32].

CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN

THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY

SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN

ASSOCIATION TO TYPE 2 DIABETES.

INTRODUCTION

In previous chapters, we focused on comprehensive and agnostic methods to

attain robust environmental disease associations on a population scale, notably

known as EWAS (Chapter 3) and we applied the method to find factors

robustly associated with Type 2 Diabetes and serum lipid levels (Chapter 4).

It is hypothesized that both multiple genetic and environmental factors interact

to induce complex disease. Genome-wide association studies (GWAS) have

led to the discovery of many common variants associated with disease [9, 203];

however, each of these common genetic variants confers very modest risk, and

together they explain only a limited portion of the disease variance [32]. It is

hypothesized that some of the yet unexplained risk may be accounted for by

“gene-environment interactions”, or that the joint effect of genetics and of the

environment may be different than the marginal effects of each of these two

factors alone [32, 44, 47, 285, 286].

In the following, we introduce a method for screening for gene-environment

interactions between prevalent environmental factors found in EWAS and

common variants found in GWAS in application to T2D, overcoming a few

outstanding challenges in the field. Before describing these challenges, the

method, and the application, we first define and report some examples of gene-

environment interactions in the context of disease.

Background

A classic example of a gene-environment interaction involves the disease

phenylketonuria (PKU) [287]. Those with PKU have inherited a rare genetic

variant that codes for a deficient phenylalanine hydroxylase liver enzyme and

are unable to convert the amino acid compound phenylalanine from their diets

to another amino acid, tyrosine. In the presence of both the deficient enzyme

and phenylalanine, an intermediate compound accumulates, phenylketone,

leading to mental retardation. However, even with the rare genetic variant

coding for the deficient protein, controlling phenylalanine exposure mitigates

adverse phenotypes.

The study of gene-environment interactions is akin to “pharmacogenetics” [81,

288], which relates genetic differences and variability of molecular responses to

drugs. In fact, the term “ecogenetics” – the study of gene-environment

interactions [289] – discriminates environmental responses from drug

responses. Over 8 decades ago Archibald Garrod undertook initial studies

regarding genetics and metabolic-focused response to environmental chemicals,

observing “inborn errors of metabolism” in adverse phenotypes such as

alkaptonuria [290]. In 1931, Garrod further described adverse phenotypes that

occur only in certain individuals, “…substances contained in particular foods,

certain drugs, and exhalations of animals or plants produce in some people

effects wholly out of proportion to any which they bring about in average

individuals (sic)” [291]. This observation was the first of the classic pharmacogenetic

“responder” vs. “non-responder” phenotypes that would come to dominate the

field. Later, Motulsky set the stage for pharmacogenetics (and later

ecogenetics) in which he described the adverse response to drugs as an

environmental and dose-dependent “trigger” for genetically susceptible

individuals [292].

The interaction of human genetic variants and specific environmental factors

can be assessed through population-based studies, in which the presence of

both a genetic variant and factor is associated with a disease phenotype [289,

293]. In statistical models, the hypothesis of joint effect is tested against the

marginal association between each of the factors alone and phenotype, as to be

discussed below. However, as both epidemiologists and toxicologists alike

would note, population-based statistical interaction does not mean biological or

molecular interaction [34, 294, 295]. Nevertheless, presence of a statistical

interaction enables us to hypothesize about underlying biological processes

that occur between genes and environmental factors.

Most documented gene-environment interactions between genetic variants and

chemical factors come in the context of genes that control metabolic processes,

or “pharmacokinetic” genes, such as the direct conversion of chemical factors

to products for use or excretion. Often these interaction studies occur amongst

a finite set of diseases, most notably cancer. A famous example includes the

product of CYP1A1, which oxidizes polyaromatic hydrocarbons. Early

hypotheses surrounded the metabolic efficiency of different variants of this

gene and hydrocarbons in lung cancer [296]. Molecular processes involving

acetylation carried out by the N-acetyltransferase and associated proteins

(NATs) have received the most attention in relation to variable host responses.

For example, altered function due to NAT variants and exposure to cigarette

smoke in the context of colon and bladder cancer has been well studied [297,

298], gathering robust evidence in GWAS [45].

Evidence for interactions of specific factors has been less strong for T2D. First,

the environment is often attributed to abstract factors such as “lifestyle” or a

proxy for a collection of environmental influences, such as body mass index

[299-302], but there are exceptions, such as the interaction hypothesized between

variants in PPARG (Pro12Ala) and dietary fat intake [303]. More surprisingly,

there are few examples of interaction between the strongest hits from GWAS

for T2D and environmental factors, such as between rs7903146 (TCF7L2) and

dietary carbohydrate, although the strength of association was weak to

moderate [304]. What is needed is a method to screen a space of possible

interactions to prioritize further study.

Screening for Gene-Environment Interactions: “G-EWAS”

Despite the hypothesis that gene-environment interactions play a role in

diseases of multifactorial nature such as T2D, there is an absence of

documented evidence for specific gene-environment interactions for the

disease. Investigating gene-environment interactions is a challenging

undertaking. First, analyzing gene-environment interactions is a complex and

power-intensive exercise [47]. Second, traditionally, most population-based

epidemiological studies examine either only genetic risk factors or only

environmental risk factors; there is a smaller set of studies that capture

information comprehensively on both genetics and the environment. It is quite

rare for significant numbers of genotypes and environmental factors to be

measured concurrently. Even still, there is another practical challenge that we

propose to address using comprehensive analytic techniques introduced in

Chapter 3: the selection of which candidate factors to measure in the first place.

The outstanding practical challenge we address here revolves around the

domain of factors: which of the millions of genetic variants or thousands of

environmental factors do we choose to measure jointly? Often, genetic

variants and environmental factors are selected by convenience, without

sufficient documentation of the strength of their marginal associations. It is

possible that given the complexity of gene-environment interaction analyses

[47], there is a problem with selective analyses and selective reporting of only

some of the results from each study in a fragmented and possibly biased

fashion [42, 305]. Many studies do not account for the multiplicity of all the

interaction effects that they have explored. There is a need to select common

variants and exposures resulting from comprehensive studies, and in turn,

systematically screen their interactions to avoid the spurious results seen in

many candidate-driven investigations [299, 301, 306].

Instead of the traditional candidate locus and environmental factor approach,

one new way forward would be to screen a set of gene and environmental

factors and use the “best hits” as candidates for further study and validation.

To construct such a screen, we propose utilizing factors arising from

comprehensive and systematic studies that have resulted in robust and

replicated associations with disease of interest. For example, much has been

written about using genome-wide association studies (GWAS) to find common

genetic variants associated with complex disease [203]. We recently published

an analogous approach for finding associated environmental factors, called an

environment-wide association study (EWAS) [35].

We propose here a systematic approach to select and test gene-environment

interactions in association to a common disease such as T2D, specifically

testing the interaction between robust factors found in GWAS and EWAS. We

are able to conduct this study because of the specific nature of the Centers for

Disease Control and Prevention (CDC) National Health and Nutrition Examination Survey

(NHANES) [37], which we introduced earlier in Chapter 4. To recap, the

survey includes 261 genotyped loci, more than 50 environmental factors

measured in blood and urine (e.g. nutrients, vitamins, and pollutants), and

clinical biomarkers (such as fasting blood glucose) for the same individuals.

Focusing on the top GWAS and EWAS hits on T2D, we systematically

investigate variant-environment interactions in association with the disease on

these cohorts, creating hypotheses for further investigation.

METHODS

Figure 27 schematically shows the systematic approach for testing gene-

environment interactions. The analysis of interactions is conducted in a dataset

that has available measurements for both genetic variants and environmental

factors. We select genetic variants and environmental factors that have strong

evidence of association for their marginal effects.

For genetic associations, the current paradigm of GWAS has provided the

framework to assemble robustly replicated sets of common genetic variants

with previously documented genome-wide significance (p < 5 × 10⁻⁸ [8]). For

environmental associations, we conducted EWAS to comprehensively search

for and validate prevalent environmental factors in association to T2D

(Chapter 4). For environmental variables, there is weaker consensus on

what constitutes a sufficiently robust standard of replication [307], and it should be

acknowledged that, in contrast to genetics, reverse causality cannot be easily

excluded. Here we selected environmental exposures that have shown

significant associations in at least two (and up to 4) independent cohorts after

accounting for the multiplicity of analyses and after adjusting for demographic

factors.

First, we examined the marginal effects of each of the “G” number of genetic

variants or “E” number of environmental factors on T2D separately. Second,

we computed the association between each environmental factor and variant

pair (total of E x G tests) to ascertain the degree of their dependence. In our

main screen, each environmental factor and variant pair (total of E x G tests) is

tested for interaction while adjusting for other known risk factors (Figure 27B).

Finally, multiplicity of analyses is accounted for with both Bonferroni-adjusted

p-values and false discovery rate (FDR) estimation (Figure 27C).
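The size of the screen follows directly from crossing the selected loci with the selected exposures. A minimal sketch in Python, using the locus identifiers and factor names reported in this chapter; the variable names are illustrative and are not taken from the original SAS analysis:

```python
from itertools import product

# The 18 GWAS loci and 5 EWAS factors named in this chapter.
LOCI = [
    "rs10923931", "rs7903146", "rs13266634", "rs1260326", "rs7901695",
    "rs780094", "rs4607103", "rs2383208", "rs4402960", "rs7578597",
    "rs12779790", "rs2237895", "rs1801282", "rs4712523", "rs8050136",
    "rs1111875", "rs864745", "rs10811661",
]
FACTORS = [
    "trans-beta-carotene", "cis-beta-carotene", "gamma-tocopherol",
    "heptachlor epoxide", "PCB170",
]

# One interaction test per (locus, factor) pair: G x E = 18 x 5.
PAIRS = list(product(LOCI, FACTORS))
print(len(PAIRS))  # 90
```

Each element of `PAIRS` would correspond to one fitted interaction model in the main screen.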

Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D. A.) Genetic and environmental factors are chosen by their strength of marginal association in GWAS and EWAS, B.) Each genetic variant and exposure pair is tested for interaction in association to disease (example shown: γ-tocopherol and the variant rs13266634 in association to T2D (fasting blood glucose [FBG] > 125 mg/dL)) in a logistic regression model adjusting for other risk factors and main effects of exposure and variant, C.) Multiple hypotheses are controlled for using a modified Bonferroni correction and the FDR is estimated. The correction factor for the multiplicity adjustment is the number of estimated independent tests conducted, and the empirical false positive rate for FDR estimation is estimated through a parametric bootstrap approach. D.) Sensitivity analyses are conducted, restricting the samples analyzed to a Caucasian-only subgroup and an over-age-40 subgroup.

Data and selected genetic and environmental factors

We used data from the National Health and Nutrition Examination Survey

(NHANES) described in Chapter 4 [37]. On the genetics side, we considered

18 genetic variants that have been previously documented through GWAS to

have robust association (reaching genome-wide significance) with T2D. These

variants have been assayed among consenting individuals in two NHANES

surveys, those conducted in 1999-2000 and 2001-2002. A total of 8000

individuals from these cohorts had both consented to the use of their DNA for

research and had blood samples available for genetic testing. Genetic variants

were chosen a priori by different groups of independent researchers

investigating other research topics. We computed allele frequencies of each

variant stratified by self-reported race to confirm their presence. In NHANES,

race was coded in 5 groups (Mexican American, Non-Hispanic Black, Non-

Hispanic White, Other Hispanic, Other).
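The stratified allele-frequency check can be sketched as follows. This is an illustrative Python example under simplifying assumptions: genotypes are coded as the number of risk alleles (0, 1, or 2), frequencies are unweighted, and the function name and toy data are hypothetical rather than drawn from NHANES.

```python
from collections import defaultdict

def allele_frequency_by_group(records):
    """records: iterable of (group, n_risk_alleles), n_risk_alleles in {0, 1, 2}.

    Returns {group: (risk_allele_freq, minor_allele_freq)}.
    """
    allele_sum = defaultdict(int)   # risk alleles observed per group
    n_people = defaultdict(int)     # genotyped individuals per group
    for group, n_risk in records:
        allele_sum[group] += n_risk
        n_people[group] += 1
    out = {}
    for group, n in n_people.items():
        freq = allele_sum[group] / (2 * n)        # two alleles per person
        out[group] = (freq, min(freq, 1 - freq))  # minor allele is the rarer one
    return out

# Toy data only (not NHANES values).
toy = [("A", 0), ("A", 1), ("A", 2), ("A", 0), ("B", 2), ("B", 2), ("B", 1)]
freqs = allele_frequency_by_group(toy)
```

The risk allele is not necessarily the minor allele, which is why the sketch returns both quantities.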

On the environment side, we previously identified and tentatively validated 5

environmental factors associated with T2D, including trans-β-carotene, cis-β-

carotene, γ-tocopherol, heptachlor epoxide, and PCB170 after systematically

screening 266 environmental factors measured by blood or urine tests (Chapter

4) [10]. To recap, the false-discovery rate for each of these 5 associations was

less than 10% in at least 2 independent cohorts and the overall FDR, assessed

by considering all combinations of attaining significance in more than 1 cohort,

was less than 1% for all 5 factors.

T2D cases are defined as individuals who had 8.5-hour fasting blood glucose

(FBG) greater than or equal to 126 mg/dL as advised by the American

Diabetes Association (ADA), similar to our EWAS on T2D (Chapter 4). To

increase our power for detection of interaction effects, we combined data from

the two cohorts. Depending on the genetic and environmental variables tested

for interaction, there were a total of 921 to 2924 controls and 82 to 278 cases.

Each genetic variable was coded for the number of risk alleles as designated by

the papers from which they were found [23, 308]. Environmental factors were

log-transformed and standardized (expressed in standard deviation units).
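The exposure coding can be illustrated with a short Python sketch. It assumes a base-10 logarithm (the base used for the linear regressions in this chapter) and an unweighted sample mean and standard deviation; the function name and inputs are hypothetical.

```python
import math
import statistics

def log_standardize(values):
    """Log10-transform positive measurements, then convert to z-scores."""
    logged = [math.log10(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)   # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in logged]

# Toy values whose logs are 0, 1, and 2.
z = log_standardize([1.0, 10.0, 100.0])
```

After this transformation, a one-unit change in the exposure variable corresponds to one standard deviation on the log scale, which is how the per-SD odds ratios in this chapter are expressed.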

Regression Analyses

We assessed the marginal effect of each genetic variant or environmental

factor on T2D with survey-weighted logistic regression, adjusting for self-

reported race, age, sex, and BMI. Next, we ascertained whether genetic variants

might be correlated with levels of environmental factors. We evaluated the

correlation between genetic and environmental factors through survey-

weighted linear regression, regressing log base 10 of the environmental

exposure variable on each genetic variable, adjusted for self-report race, sex,

age, and BMI. We used 4-year survey weights corresponding to the smallest

subsample analyzed as advised by the National Center for Health Statistics

(NCHS) [207].
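Because the outcome of this regression is on the log10 scale, a slope translates into a multiplicative change in exposure level per risk allele. The following Python sketch shows that back-transformation. It is an assumed reading of how a per-allele percent change can be derived from such a model, and the slope used is hypothetical.

```python
import math

def percent_change_per_allele(beta_log10):
    """Percent change in exposure level per additional risk allele, given a
    slope from a regression of log10(exposure) on risk-allele count."""
    return (10 ** beta_log10 - 1) * 100

beta = math.log10(0.90)   # hypothetical slope, chosen to give a 10% decrease
print(round(percent_change_per_allele(beta), 1))  # -10.0
```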

Next, we conducted our systematic interaction screen (Figure 27A-B).

Specifically, we screened the space of possible pairs of interactions, totaling 90

(18 genetic loci times 5 environmental factors). We utilized survey-weighted

logistic regression to associate each pair of factors to T2D incorporating a

multiplicative interaction term and main effects of both factors. Each model

was further adjusted for age, sex, self-reported race, and BMI. As above, we used 4-year

survey weights corresponding to the smallest subsample analyzed, as advised

by the NCHS [207].
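Written out, the model fitted for each variant-exposure pair takes the following form. The notation is a sketch: the interaction coefficient follows the text, while the remaining symbols (G for the risk-allele count, E for the standardized log exposure, and Z_k and γ_k for the adjustment covariates and their coefficients) are introduced here for illustration only.

```latex
\operatorname{logit}\,\Pr(\mathrm{T2D}=1)
  = \beta_0 + \beta_G\,G + \beta_E\,E
  + \beta_{G\times E}\,(G \cdot E)
  + \sum_k \gamma_k Z_k
```

Under this form, the screen amounts to testing the null hypothesis \beta_{G\times E} = 0 separately for each of the 90 pairs.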

Multiplicity Correction and FDR

We corrected for multiple hypotheses through direct Bonferroni correction of

the statistical significance level and FDR estimation (Figure 27C). Bonferroni

multiplicity correction adjusts the threshold for statistical significance (for

example, p=0.05) by the number of statistical tests conducted. Since our tests

are not independent, we estimated the total number of “effective” genetic loci

and environmental exposures tested jointly by taking into account the

correlation between the selected factors. This approach, which more

accurately estimates the number of hypotheses for a group of correlated factors,

has been applied previously to the study of genetic variants [309]. Here, we

expanded the use of the method for environmental factors. For the 18 genetic

loci, we calculated the correlation between the genetic factors stratified by

ethnicity and concluded that there were 17.7 effective genetic factors. For the

5 environmental factors, we calculated 4.41 effective factors. Thus, the total

number of effective tests was 78.1 (17.7 x 4.41). The adjusted level of

significance for a single test threshold of p=0.05 therefore is 0.0006

(0.05/78.1).
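The arithmetic in this paragraph can be checked directly. A small Python sketch, in which the function name is illustrative and the two effective counts are the values reported above:

```python
def bonferroni_threshold(alpha, m_eff_genetic, m_eff_env):
    """Per-test significance level after dividing alpha by the effective
    number of tests (product of the two effective factor counts)."""
    m_eff = m_eff_genetic * m_eff_env
    return m_eff, alpha / m_eff

m_eff, threshold = bonferroni_threshold(0.05, 17.7, 4.41)
print(round(m_eff, 1), round(threshold, 4))  # 78.1 0.0006
```

For comparison, an uncorrected count of 90 tests would give a threshold of 0.05/90, slightly stricter than the effective-test threshold.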

We also calculated the FDR, the estimated proportion of false positives among

the total of significant hypotheses for a given significance level [238]. To

estimate the number of false positives, we empirically generated a distribution

of null test statistics corresponding to the interaction term while preserving the

main effects of the variant and exposure terms using a parametric bootstrap

method [310].

Briefly, a parametric bootstrap approach samples with replacement responses

from a model representing the “null” hypothesis many times to enable the

creation of a null distribution of test statistics corresponding to the interaction

term. To create the null distribution of test statistics corresponding to the

interaction term (β GxE), we fit a “null” logistic regression model omitting the

interaction term (β GxE = 0) while leaving in the model the parameters

modeling the main effects of the environmental factor, genetic variant, and

remaining covariates (age, sex, race, and BMI). We bootstrapped the

responses from this null model and refit the original model described above,

adding the covariate corresponding to the interaction between variant and

environmental factor. To estimate a null distribution, this procedure is

repeated 100 times. Finally, the FDR is estimated as the ratio of the number of

interaction terms called significant in the null distribution to the number of all

results called significant (both true and false positives) at a given significance

level.
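Two pieces of this procedure can be sketched in Python: drawing bootstrap responses from a fitted null model, and forming the FDR ratio. The sketch makes assumptions that are not stated in the text: null hits are averaged per bootstrap iteration before dividing, the estimate is capped at 1, and the model-refitting step is omitted. All names and numbers are hypothetical.

```python
import random

def simulate_null_responses(null_probs, rng):
    """Parametric bootstrap draw: one Bernoulli response per individual, using
    fitted probabilities from the model without the interaction term."""
    return [1 if rng.random() < p else 0 for p in null_probs]

def estimate_fdr(observed_p, null_p_by_iter, alpha):
    """Average number of null hits per bootstrap iteration divided by the
    number of observed hits at the same threshold."""
    n_obs = sum(p <= alpha for p in observed_p)
    if n_obs == 0:
        return 0.0   # convention used here when nothing is called significant
    n_null = sum(sum(p <= alpha for p in it) for it in null_p_by_iter)
    expected_false = n_null / len(null_p_by_iter)
    return min(1.0, expected_false / n_obs)

# Toy numbers, not study results.
rng = random.Random(0)
y_star = simulate_null_responses([0.1, 0.5, 0.9, 0.2], rng)
observed = [0.001, 0.02, 0.3, 0.8]
null_iters = [[0.04, 0.5, 0.6, 0.9], [0.2, 0.3, 0.7, 0.8]]
fdr = estimate_fdr(observed, null_iters, alpha=0.05)
```

In a full implementation, each simulated response vector would be used to refit the model with the interaction term, and the resulting interaction p-values would populate `null_iters`.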

All presented analyses include data from diverse ancestry and age groups, as

reflected in the US population that is sampled by NHANES. Given that the

stronger evidence for the specific T2D associations has been procured by

studies in Caucasian-descent individuals, we also performed a sensitivity

analysis limiting the data to participants who were coded as non-Hispanic

white and other Hispanic (Figure 27D). Next, as NHANES is cross-sectional

and presumably many genetically at-risk participants might not be diagnosed at

the time of sampling, we performed an analysis using an older set of individuals

(greater than 40 years of age).

BMI is a notable risk factor for T2D [311, 312]. As a means of comparison, we

also sought to document any interaction between the 18 genetic variants and

BMI. To conduct these interaction tests, we standardized BMI by centering

the measurements about the mean and dividing by the standard deviation. As

above, we fit a survey-weighted logistic regression model, modeling diabetes

status as a function of the main effect of the variant (coded as number of risk

alleles), main effect of BMI, interaction term, in addition to sex, age, and self-

report race. We estimated greater than 90% power to detect interactions

between BMI and these genetic loci for interaction OR 1.5 at the 0.05 level of

significance [313].

For all analyses, we used SAS version 9.2 accessed through a Remote Data

Center (RDC) located in Hyattsville, Maryland. As NHANES is a complex,

multi-stage, stratified survey, we utilized survey sampling units, strata, and

weights for all analyses as in the previous chapter [207].

RESULTS

We implemented a systematic screen for detecting interaction effects between

18 genetic variants identified in GWAS on T2D and 5 environmental factors

identified in EWAS on T2D (Chapter 4) [35], a total of 90 interaction tests

(Figure 27A). We modeled T2D using logistic regression with a multiplicative

interaction term between each genetic variant and environmental factor pair

while adjusting for age, sex, self-report race, and body mass index (BMI)

(Figure 27B). We assessed multiple hypotheses using a Bonferroni and a false

discovery rate (FDR) approach (Figure 27C) and conducted sensitivity analysis

to assess the robustness of our results (Figure 27D). We begin by assessing

power, marginal associations, and correlation between genetic variants and

environmental factors.

Allele frequencies

We estimated the minor allele frequencies of each of the 18 variants in our two

US NHANES cohorts. For most of the loci, minor allele frequencies were

greater than 5% for all of the self-reported ethnicities. The only exception was

rs1801282, which showed a 3% minor allele frequency in self-reported Blacks.

This suggests that all the surveyed ethnicities had a reasonable frequency of

minor alleles at these loci for study.

Power Calculations

In order to study the interactions between concurrently measured genotypes

and environmental factors in our two US NHANES cohorts, we determined

whether we had sufficient power to proceed. Power computations for

genotype-environment interactions depend on minor allele frequency for loci,

environmental factor variability, ratio of cases to controls, and marginal

associations to disease [313]. We estimated the minor allele frequencies from

our cohorts (5-44%), the exact ratio of cases and controls available for each

genotype-environment factor pair, assumed standardized environmental

variables (SD=1), and assumed a marginal OR of 1.2 and 2.0 for the genetic

loci and environmental factor (gathered from previous GWAS and EWAS)

respectively. Under these assumptions, we estimated that we had 30-96%

(median=71%) and 63-99% (median=98%) power for 82 to 278 cases (921 to

2924 controls) to detect an interaction OR of 1.5 and 2.0 respectively for a

significance threshold α=0.05 [313] (Figure 28).

Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL) [313]. Assumptions include an interaction odds ratio of 1.5, a main effect of genetic locus of 1.2 and environmental factor 2.0 estimated from previous studies, minor allele frequencies as in Supplementary Table 1, approximately 10 controls per case, environmental factor SD of 1, and p-value of 0.05. Markers alternate between filled and open for each locus.

Marginal Associations

Three of 18 genetic variants were marginally associated with T2D in the

NHANES cohorts at significance level of 0.05 (uncorrected for multiple

hypotheses here, given that they have been previously robustly documented to

be associated with T2D) after adjustment for age, sex, race, and BMI. These

included loci rs10923931 (NOTCH2), rs7903146 (TCF7L2), and rs13266634

(SLC30A8) (Table 8). In addition to these genetic variants, we also list here

the five environmental factors we previously found associated strongly with

T2D in cohorts examined here (Table 8) [35].

Locus (gene) or Environmental Factor    N (cases)     p-value      OR (95% CI)

rs10923931(NOTCH2)                      3429 (297)    0.0043       1.50 (1.14, 1.98)
rs7903146(TCF7L2)                       3401 (296)    0.015        1.32 (1.06, 1.65)
rs13266634(SLC30A8)                     3427 (298)    0.018        1.33 (1.05, 1.69)
rs1260326(GCKR)                         3408 (296)    0.13         1.27 (0.93, 1.73)
rs7901695(TCF7L2)                       3402 (293)    0.13         1.22 (0.94, 1.57)
rs780094(GCKR)                          3430 (298)    0.15         1.25 (0.92, 1.70)
rs4607103(Unknown)                      3421 (298)    0.41         0.89 (0.68, 1.17)
rs2383208(Unknown)                      3122 (298)    0.42         0.86 (0.60, 1.24)
rs4402960(IGF2BP2)                      3406 (296)    0.52         0.93 (0.74, 1.17)
rs7578597(THADA)                        3416 (291)    0.52         0.88 (0.59, 1.30)
rs12779790(Unknown)                     3415 (293)    0.58         0.91 (0.66, 1.27)
rs2237895(KCNQ1)                        3415 (296)    0.61         0.95 (0.77, 1.17)
rs1801282(PPARG)                        3405 (296)    0.62         0.90 (0.58, 1.38)
rs4712523(CDKAL1)                       3431 (298)    0.63         1.07 (0.81, 1.42)
rs8050136(FTO)                          3403 (295)    0.74         0.96 (0.75, 1.23)
rs1111875(Unknown)                      3407 (296)    0.75         1.04 (0.83, 1.30)
rs864745(JAZF1)                         3430 (298)    0.8          0.96 (0.70, 1.31)
rs10811661(Unknown)                     3406 (296)    0.84         0.96 (0.62, 1.48)
trans-β-carotene                        3033 (189)    5 × 10⁻⁵     0.64 (0.52, 0.79)*
γ-tocopherol                            5349 (314)    5 × 10⁻⁵     1.46 (1.25, 1.72)*
cis-β-carotene                          3032 (189)    2 × 10⁻⁴     0.63 (0.50, 0.81)*
PCB170                                  1807 (98)     0.005        1.72 (1.18, 2.52)*
Heptachlor Epoxide                      1711 (94)     0.005        1.49 (1.13, 1.98)*

Table 8. Marginal association of each locus (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL). Per-risk-allele ORs are shown, adjusted by sex, age, ethnicity, and BMI. *Per 1 SD OR are shown, adjusted by sex, age, ethnicity, and BMI.
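Odds ratios and confidence intervals of the kind tabulated here are conventionally obtained by exponentiating a log-odds coefficient and its interval bounds. The Python sketch below illustrates that conversion with a Wald-type interval. The standard error is back-calculated for illustration and is not reported in the source, and the survey-weighted analysis may have used a different critical value.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error into an odds
    ratio with a Wald-type 95% confidence interval."""
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))

# Illustrative inputs chosen to be consistent with the rs10923931 row;
# the standard error (0.1408) is an assumption, not a reported value.
or_, lo, hi = odds_ratio_with_ci(math.log(1.50), 0.1408)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 1.5 1.14 1.98
```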

Correlation between genetic variants with environmental variables

We found little evidence for correlation between the 18 genetic variants and

the 5 environmental factors. Nominal relationships included a negative

association between rs10923931 and Heptachlor Epoxide (p=0.02), where

levels of Heptachlor Epoxide decreased 10% per risk allele. We also observed

a negative association between rs10923931 and cis-β-carotene (p=0.04), where

levels of cis-β-carotene decreased 5% per risk allele.

Screening for Genetic Variant by Environment Interactions

We then proceeded to study interactions between 18 genetic variants and the 5

environmental factors, a total of 90 interactions tested using survey-weighted

logistic regression adjusted for age, sex, self-reported race, and BMI. Figure

29 presents a Manhattan-style plot where all the 90 interaction terms are

plotted with their p-values. From these 90, we found 8 results at p < 0.05 and

false discovery rates between 1.5 and 37%, involving six genetic variants and

four environmental factors. Further, from these 90, we found 4 results with

FDR less than 20% (p < 0.01) involving 2 variants and 3 environmental factors

worth pursuing for further study.

Figure 29. Manhattan plot of significance values of interaction term (-log10(p-value) for interaction term of pair of factors). The x-axis is grouped by variant (n=18); within each group are 5 points corresponding to the environmental factor tested in interaction with that variant. Top 8 factors (p-value ≤ 0.05) are annotated with their false discovery rate. For example, the interaction between rs13266634 (in SLC30A8 gene) and γ-tocopherol is annotated and has a FDR of 18%. The Bonferroni threshold is seen in the dotted line. Markers alternate between filled and open for each locus.

[Figure 29 plot. y-axis: −log10(p-value) of the interaction term; x-axis: the 18 loci, each with 5 points for PCB170, trans-β-carotene, cis-β-carotene, γ-tocopherol, and heptachlor epoxide; FDR annotations on the top 8 results: 0.015, 0.06, 0.16, 0.18, 0.22, 0.24, 0.24, 0.37.]

The top four of eight findings are discussed here. Our top result was the

interaction between the nutrient marker trans-β-carotene and the non-

synonymous SNP rs13266634 (SLC30A8), which was significant beyond the

Bonferroni-adjusted cutoff significance level (interaction p = 5 × 10⁻⁵,

Bonferroni adjusted p-value 0.006, FDR=1.5%). At lower levels of trans-β-

carotene, defined as a point value 1 SD below the mean level of the factor, the

per-allele effect size (odds ratio, OR) was 1.8 (95% CI: 1.3, 2.6), 40% greater

than the marginal effect (Figure 30). The adjusted OR per SD change in trans-β-

carotene levels was protective for those who had 2 risk alleles for the variant

(adjusted OR 0.5, 95% CI: 0.4, 0.8), while those with 0 or 1 risk alleles showed

negligible effects. We observed similar effects for cis-β-carotene and

rs13266634 (Figure 30).
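Per-allele odds ratios at low, mean, and high exposure follow from the main-effect and interaction coefficients of a model with a multiplicative interaction term. A minimal Python sketch, assuming a standardized exposure and using hypothetical coefficients rather than the fitted values from this study:

```python
import math

def per_allele_or_at_exposure(beta_g, beta_gxe, z_exposure):
    """Per-risk-allele OR when the standardized exposure equals z_exposure,
    under a model with main effects and a multiplicative G x E term."""
    return math.exp(beta_g + beta_gxe * z_exposure)

# Hypothetical coefficients for illustration only (not the fitted values).
beta_g, beta_gxe = math.log(1.3), -0.3
for z in (-1, 0, 1):   # low (-1 SD), mean, high (+1 SD) exposure
    print(z, round(per_allele_or_at_exposure(beta_g, beta_gxe, z), 2))
```

With a negative interaction coefficient, as in this toy example, the per-allele OR is largest at low exposure and shrinks as exposure rises, which is the qualitative pattern described for the carotenes.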

On the other hand, we observed an opposite effect for individuals who carried

the rs13266634 risk alleles with rising levels of γ-tocopherol (interaction

p=0.0095, FDR=18%). The adjusted OR for individuals with γ-tocopherol

levels 1 SD higher than the mean was 1.6 (adjusted 95% CI: 1.3, 2.1), a 25%

increase in per-allele adjusted OR when compared to the marginal effect

(Figure 30). For individuals below the mean levels of γ-tocopherol, their

genetic risk appears mitigated.

While we did not detect a marginal individual association between intergenic

SNP rs12779790 and T2D, we observed an interaction between this locus and

trans-β-carotene (p < 0.01, FDR = 16%) in association with T2D (Figure 30). Specifically,

the protective effect of trans-β-carotene increased by 50%, with an adjusted per-SD

environmental factor OR of 0.3 (95% CI: 0.2, 0.5) for those with 2 risk alleles

compared to 0.6 for the marginal per-SD effect of the factor.

Interestingly, our weakest result (not in the top 4) was an interaction between

rs7903146 (TCF7L2), the most highly replicated T2D GWAS variant in

Caucasian populations as observed in the NHGRI catalog [203], and

trans-β-carotene (interaction p=0.04, FDR=40%). While the result may be spurious,

we observed that those with 2 risk alleles and low levels of trans-β-carotene

have roughly 8% higher OR (1.4, 95% CI: 1.1, 1.9) compared to the significant

marginal effect (Figure 30).
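
The per-allele odds ratios reported at low, mean, and high exposure can be illustrated with a short sketch. It assumes a logistic model with a genotype term, a standardized exposure term, and their product, so that the per-allele log odds ratio at exposure level z (in SD units) is beta_g + beta_gxe * z. The function name per_allele_or and all coefficient and variance values below are hypothetical and chosen only for illustration; they are not estimates from the NHANES data.

```python
import math


def per_allele_or(beta_g, beta_gxe, z, var_g, var_gxe, cov_g_gxe, crit=1.96):
    """Return (OR, lower, upper) for one risk allele at standardized exposure z.

    Assumes a logistic model with terms beta_g * G + beta_e * Z + beta_gxe * G * Z,
    where G is the risk-allele count and Z is the exposure in SD units.
    """
    log_or = beta_g + beta_gxe * z
    # Variance of a linear combination of the two coefficient estimates.
    variance = var_g + z ** 2 * var_gxe + 2.0 * z * cov_g_gxe
    se = math.sqrt(variance)
    return (
        math.exp(log_or),
        math.exp(log_or - crit * se),
        math.exp(log_or + crit * se),
    )


# Hypothetical coefficients for illustration only; not estimates from NHANES.
BETA_G, BETA_GXE = math.log(1.1), -0.5
VAR_G, VAR_GXE, COV = 0.025, 0.03, 0.0

for label, z in [("low (-1 SD)", -1.0), ("mean", 0.0), ("high (+1 SD)", 1.0)]:
    odds_ratio, lower, upper = per_allele_or(BETA_G, BETA_GXE, z, VAR_G, VAR_GXE, COV)
    print(f"{label}: OR {odds_ratio:.2f} [{lower:.2f}, {upper:.2f}]")
```

With a negative interaction coefficient, as in this toy example, the per-allele OR is largest at low exposure and falls below 1 at high exposure, the same qualitative pattern described for trans-β-carotene and rs13266634.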

Figure 30. Per-risk allele effect sizes for top putative interactions with p < 0.05. Black markers denote OR for marginal effect of variant; the red markers denote interaction OR computed at low (1 SD lower than the mean), mean, or high (1 SD greater than the mean) levels of exposure, respectively. Marker sizes are proportional to inverse variance.

Figure 30 plot data (per risk allele OR [95% CI]; x-axis from 0 to 3). Each line gives the marginal OR, then the OR at low (-1 SD), mean, and high (+1 SD) exposure, then the interaction p-value (FDR). Exposure labels recovered from the plot, in order: trans-β-carotene, cis-β-carotene, cis-β-carotene, γ-tocopherol, PCB170; the remaining labels did not survive extraction.

rs13266634 (SLC30A8): marginal 1.3 [1.1, 1.7]; low 1.8 [1.3, 2.6]; mean 1.1 [0.8, 1.5]; high 0.67 [0.41, 1.1]; p = 7.8e-05 (FDR 0.015)

rs13266634 (SLC30A8): marginal 1.3 [1.1, 1.7]; low 1.8 [1.3, 2.5]; mean 1.2 [0.85, 1.6]; high 0.76 [0.47, 1.2]; p = 0.0015 (FDR 0.06)

rs12779790 (Unknown): marginal 0.91 [0.66, 1.3]; low 1.2 [0.72, 1.9]; mean 0.78 [0.55, 1.1]; high 0.52 [0.34, 0.8]; p = 0.0062 (FDR 0.16)

rs13266634 (SLC30A8): marginal 1.3 [1.1, 1.7]; low 0.84 [0.52, 1.3]; mean 1.2 [0.88, 1.5]; high 1.6 [1.3, 2.1]; p = 0.0095 (FDR 0.18)

rs4402960 (IGF2BP2): marginal 0.93 [0.74, 1.2]; low 0.82 [0.58, 1.2]; mean 1.1 [0.83, 1.4]; high 1.4 [1, 1.8]; p = 0.014 (FDR 0.22)

rs4712523 (CDKAL1): marginal 1.1 [0.81, 1.4]; low 1.4 [0.86, 2.4]; mean 1.1 [0.76, 1.6]; high 0.85 [0.61, 1.2]; p = 0.021 (FDR 0.24)

rs2237895 (KCNQ1): marginal 0.95 [0.77, 1.2]; low 0.44 [0.21, 0.93]; mean 0.61 [0.34, 1.1]; high 0.85 [0.5, 1.5]; p = 0.023 (FDR 0.24)

rs7903146 (TCF7L2): marginal 1.3 [1.1, 1.6]; low 1.4 [1.1, 1.9]; mean 1.1 [0.88, 1.5]; high 0.88 [0.58, 1.4]; p = 0.043 (FDR 0.37)

Sensitivity Analyses limited to non-Hispanic white and other Hispanic

participants and older individuals

In sensitivity analyses limited to only Caucasian participants (self-reported

non-Hispanic white and Hispanics, 55 to 58% of the population in the

originally analyzed NHANES cohorts), we reconfirmed the top 4

interactions (FDR < 20%) in both effect size and strength of

association. Specifically, interaction effect sizes changed by less than 10%

between the Caucasian-only analysis and the full cohort analyzed.

Further, the interaction p-values remained below 0.05

despite decreased power. We had limited power to claim that the 5th- to

8th-ranked interactions were preserved in this sub-sample (p > 0.05); however,

there was a negligible change in their effect sizes.

As NHANES is cross-sectional, there remains a possibility that individuals

who are at genetic risk for T2D might not be diagnosed as such at the time of

their sampling. To gauge the effect of this bias, we estimated the size of

interaction effects for a sub-sample aged 40 and older (63-64% of the

population of the originally analyzed NHANES cohorts). When analyzing

the sub-sample of older individuals, there was negligible change in interaction

effects for all of the top 4 factors (FDR < 20%) and their statistical significance

level remained less than 0.05.

Limited Evidence to Support Interactions with Body Mass Index

We sought to compare interaction effects between BMI, a notable risk factor

for T2D, and the 18 top genetic factors tested in this pilot study. Interestingly,

while adequately powered, we were unable to uncover substantial interaction

effects that would survive Bonferroni correction. We did observe a modest

interaction between BMI and rs8050136 of the FTO gene (uncorrected

interaction p-value=0.03). The SNP rs8050136 is a known obesity-related locus

whose association with T2D is explained primarily through its effect on BMI

[228].

DISCUSSION

We have shown here how results from two comprehensive association

approaches on genetics and the environment, GWAS and EWAS, can be

combined to systematically screen for gene-environment interactions. We

implemented ways to correct for multiple hypotheses using a modified

Bonferroni adjustment method and through estimation of the FDR. In

particular, we have implemented a method to estimate the FDR empirically

using the parametric bootstrap [310], a less conservative way to mitigate the

cost of multiple hypotheses. We propose that the most promising hypotheses

that emerge from this systematic process are candidates for replication in

additional independent cohorts in prospective studies.
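
The idea behind an empirical FDR estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name empirical_fdr and the toy p-values are hypothetical, and the null p-values are assumed to have been generated beforehand by refitting the models to data simulated under the no-interaction null (the parametric bootstrap step itself is not shown). At a given p-value threshold, the estimate is the average number of null p-values passing the threshold per replicate, divided by the number of observed p-values passing it.

```python
def empirical_fdr(observed_p, null_p_replicates, threshold):
    """Estimate the FDR at a p-value threshold.

    observed_p: p-values from the real data, one per tested interaction.
    null_p_replicates: list of p-value lists, one list per null replicate.
    """
    n_called = sum(p <= threshold for p in observed_p)
    if n_called == 0:
        return 0.0  # nothing is called significant, so no false discoveries
    # Average number of null p-values passing the threshold per replicate.
    null_hits = [sum(p <= threshold for p in rep) for rep in null_p_replicates]
    expected_false = sum(null_hits) / len(null_hits)
    return min(1.0, expected_false / n_called)


# Toy, hypothetical p-values (not results from this study).
observed = [0.001, 0.02, 0.2, 0.6]
null_replicates = [[0.01, 0.5, 0.7, 0.9], [0.3, 0.4, 0.8, 0.95]]
print(empirical_fdr(observed, null_replicates, threshold=0.05))
```

In practice the number of null replicates would be far larger than two, and each replicate would contain one p-value per tested gene-environment pair.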

We restricted our analyses to environmental factors and genetic variants on the

basis of strength of the evidence on their marginal associations in GWAS and

EWAS. One could also consider evaluating gene-environment interactions for

genetic loci or environmental factors that do not have robust support for the

presence of marginal associations. Given the small marginal effects for most

common genetic variants, many genuine associations do not reach genome-

wide significance and remain false-negatives [41]. Some of those may have

strong interactions with the environment [301], and may only be discovered if

the appropriate joint environmental variables are considered. However,

selecting which of the millions of non-genome-wide-significant SNPs to test is

challenging. It is well known that testing for interactions is power-intensive

[44]; furthermore, testing a large number or all of them imposes an even

greater power and multiplicity burden [47]. For environmental factors, the

choice of which ones to test for interaction is even more tenuous. Notably, in

contrast to common genetic variants, there is yet no high-throughput

measurement platform that captures all the environmental factors, and the lack of

measuring capacity limits data availability. Measurement error can be

substantial for many environmental exposures [10, 314].

We had the ability to screen 266 environmental factors measured in serum and

urine systematically through a prior EWAS process for association with T2D.

We selected for interaction testing only the 5 of them that had the strongest

support, as judged by FDR, persistence of effects after adjustment for

confounders, and replication in independent cohorts. The proposed approach

creates a systematic list of tested interaction terms, while at the same time

keeping the number of tested interaction terms manageable.

However, it is still very important to account for multiple hypothesis testing.

Here, we used two approaches, multiplicity correction and the FDR, but

other approaches may also be employed [307].

Our application highlights other challenges of testing and validating gene-

environment interactions. First, we had low to moderate power to detect

moderate interaction effects for some of the interactions tested. Not

surprisingly, we found modest p-values of which only one survived the

Bonferroni correction and we had modest FDR estimates for the other highest-

ranking interactions. This underscores that great caution is needed in claiming

gene-environment interactions and, more importantly, that the top findings need

extensive replication in larger, well-powered studies. We stress that

the current exercise focuses on hypothesis generation.

Replication studies can also examine whether genetic effects and their

respective interactions are similar in populations of different ancestry.

Population stratification [181] is the equivalent of confounding for genetic

effects. We adjusted our analyses for self-reported race; however, we should

acknowledge that the genetic effects identified to date from GWAS are best

documented in Caucasian populations and that self-reported ethnicity is subject

to bias. Genetic effects for GWAS-discovered markers may be different in

different ancestry groups [315-320]. While under-powered, analyses limited to

Caucasians showed similar effects for the top hits in our analyses. However,

little is known on how gene-environment interactions may behave in

populations of different ancestry and this should be further investigated.

Another issue in studying complex and age-related diseases such as T2D

is the classification of cases and controls. For example, a fraction of

non-cases at high genetic risk for T2D will not be diagnosed at the time of their

sampling. To estimate the effect of this bias, we conducted a sensitivity

analysis limiting the cohort to those aged 40 and older. While we observed little

difference in our estimates, we acknowledge that effects might be diluted due

to this type of bias.

There are but a few documented examples of interaction effects between

genome-wide significant loci and environmental or dietary factors on T2D

[304]. Through this screen, we have been able to hypothesize about possible

new ones. For example, the strongest evidence for interaction in our data

existed between rs13266634, a non-synonymous coding variant in the

SLC30A8 gene, and three nutrient factors, trans- and cis-β-carotene, and γ-

tocopherol. The non-synonymous variant rs13266634 is a highly replicated

variant, connected with beta cell function and insulin secretion [29, 50, 64].

For example, in a SLC30A8 knockout mouse model, normal glucose-induced

insulin release was preserved; however, after a high fat diet, the SLC30A8

knockout mouse became glucose intolerant and diabetic [49]. Our data-driven

gene-environment screen has enabled us to hypothesize that impaired insulin

secretion imparted by the rs13266634 variant, combined with presence or

deficiency of specific nutrients in the diet, leads to greater risk for T2D. We

plan to test this hypothesis in depth in both model systems and in other human

populations.

Nevertheless, attributing causality of interactions is challenging. For

environmental factors, confounding [68] and reverse causality [10] are major

issues. First, little is known about the causal

nature, if any, regarding these environmental factors and T2D [216]. Second,

statistical interaction does not equate to biological interaction [295]. Given the

modest interaction effect sizes and levels of false discovery, the joint effect of

these factors needs to be evaluated in independent, larger populations, including

prospective cohorts where the time-dependent associations of environmental

factors can be assessed. Nevertheless, genetic information can sometimes be

helpful in identifying genuine environmental risk factors through Mendelian

randomization [69].

Finally, other infrastructure-related challenges remain for the future of

systematic screening for gene-environment interactions [44]. First, we lack a

complete list of candidate environmental factors regarding the marginal effects

of exposure to disease. In comparison, the analogous list of common genetic

variants is well known and is constantly being updated [9]. Screening and

validating gene-environment interactions is power-intensive and will require

new types of biobanks that can accommodate large amounts of environmental

and genetic measures measured on the same individuals across multiple studies

and cohorts [10]. A straightforward first step includes augmenting current

GWAS with environmental information [45, 301] and adopting high-priority

exposure measures published by public initiatives such as PhenX

(www.phenx.org), whose goal is to build consensus regarding the minimal set

of impactful environmental factors to measure in large GWAS-like studies [66,

67]. A systematic approach to measuring and testing the environment, as we

have shown in previous chapters, and its interaction with the genetic profile

of individuals may help find and explain a substantial component of the

disease risk for some common health conditions or even lead to hypotheses

regarding disease pathology.

CHAPTER 6: CONCLUSION AND DISCUSSION

In this dissertation we have described and implemented methods to create

robust and ranked hypotheses through massive, comprehensive, and systematic

association of the envirome to disease and adverse phenotypes, on both the

molecular and population scales.

In the first method operating on molecular-based toxicogenomics data, we use

tools of integrative genomics to merge once disparate datasets, toxicological

gene expression responses and disease gene expression states. We developed a

generalizable representation of environmentally induced molecular responses

called the “Envirome Map”. Assuming that functional states induced by

environmental agents are similar to disease, we correlated the Envirome Map

to cancer expression states. Specifically, we show how the expression states

associated with certain factors, such as Bisphenol A, are correlated with breast

cancer, prompting further study. Importantly, the Envirome Map enables

hypothesis generation in a scalable and practical way, utilizing data from the

public domain in databases such as the Gene Expression Omnibus.

We also have developed and implemented a method to associate the envirome

to disease on a population scale, called the “Environment-wide Association

Study” (EWAS), analogous to the data-driven genome-wide association study

(GWAS). We showed how EWAS provides the benefits of GWAS through

transparent and comprehensive reporting. Most importantly, EWAS has

enabled the discovery of pollutants and dietary markers in association with

common diseases such as Type 2 Diabetes and risk factors for heart disease,

such as serum lipid levels, at low levels of false discovery. This type of discovery is

not standard practice in current day epidemiology. In fact, it has rung a bell in

environmental health circles [1, 197, 200], and has introduced a new area of

study to genome scientists [196, 198, 199]. Last, and critically, EWAS calls for

the study of the highest-ranking novel and robust associations in depth in

different study designs and scenarios.

For example, we present a novel way to systematically screen for

gene-environment interactions, integrating results from comprehensive studies on

the envirome and genome, which we dub a “G-EWAS”. In our prototype G-EWAS

on T2D, we screened all possible interactions between robustly identified

environmental factors from EWAS and genetic factors in GWAS. Further, we

implemented two ways of accounting for multiple hypotheses, and after

diverse sensitivity analyses, converged on interaction between a non-

synonymous functional variant in the SLC30A8 gene and nutrient markers.

With experimental models for the SLC30A8 knockout gene, investigators have

hypothesized that the risk for T2D comes as a result of a dietary

interaction; through G-EWAS we are able to speculate about the role of

specific factors for this hypothesized interaction.

We predict that studies like G-EWAS are just the tip of the iceberg. Our

capability of measuring molecular modalities on a population scale is

improving exponentially. We are on the brink of a deluge of genomic

sequence data, capable of measuring both genotypes and molecular responses

on a massive number of individuals. How will we merge data of different

scales with dynamic environmental information to better predict disease?

Given that traditionally these data have not been analyzed jointly, we face

immense infrastructural and analytic challenges with more complex data

modes. It is inevitable that new methods of comprehensive inference in the spirit

of EWAS and G-EWAS will need to be developed to take advantage of them.

To even begin to design and conduct such studies, collaborative efforts now

need to be focused on defining and standardizing, from nomenclature to means of

measurement, the “envirome”, a concept we only introduced here. For example,

genomic studies have benefitted from the efforts put forth by the National

Center for Biotechnology Information, cataloging genetic information in an

accessible manner for the scientific and engineering communities to utilize.

Further, standardization enabled projects such as the HapMap in which a

federation of institutions was assembled to characterize common genomic

variation on the planet. Such efforts need to take place to conduct true

envirome-wide studies. Surveys such as the National Health and Nutrition

Examination Survey may enable us to attain the map of environmental

variation, but they are only a start.

Investigators have now associated genomic variation with hundreds of

common diseases [321]. Specifically, studies such as GWAS (Figure 5B) have

allowed us to hypothesize about disease etiology and predict genetic risk

[203]. As a result, we now have a greater understanding of genetic

contribution to disease, but critically, the majority of genetic risk for common

disease has yet to be explained. There needs to be more envirome-based

studies to achieve a fuller understanding of disease. Furthermore, the

investigation of the environment lags behind the genome (Figure 1). To begin

closing the gap, we propose conducting many more EWAS beginning with

common, multifactorial diseases prioritized by the World Health Organization

(e.g., cardiovascular disease, T2D, hypertension, premature births, lung cancer,

asthma). For example, as of this writing, we are conducting EWAS on

hypertension on multiple European cohorts, on chronic kidney disease with

cohorts in the United States, on asthma with cohorts in the United States, and

on mothers with premature births in the United States.

As a result of multiple GWAS on many diseases, we are now able to provide

clinically relevant information to patients for disease prognosis and prevention

[23]. However, we have yet to combine this data with environmental

information. For example, how might specific environmental factors modify

our genetic risk for disease? Individuals are now quantifying toxicants in their

own tissue [322] and aggregated results from multiple EWAS will enable

investigators to accurately estimate personal environmental risk. Furthermore,

results from G-EWAS have implications for personal genomics whereby we

may stratify genetic risk for disease by levels of specific environmental factors.

This effort will help us attain truly personalized medicine, whereby specific

modifiable environmental attributes can be the new targets of therapeutics

based on individuals’ genetic profile.

In this dissertation, we have presented and applied new analytical paradigms to

comprehensively connect the environment to disease. Just as in the last 10

years we have witnessed the fruits of genome-wide studies, it is now time to

usher in envirome-wide studies for a more complete understanding of etiology

aligned towards therapeutics and prevention.

REFERENCES

1. Rappaport, S.M. and M.T. Smith, Environment and Disease Risks. Science, 2010. 330(6003): p. 460-461.

2. Schwartz, D. and F. Collins, Medicine. Environmental biology and human disease. Science, 2007. 316(5825): p. 695-6.

3. Klaassen, C.D., ed. Casarett and Doull's Toxicology - The Basic Science of Poisons (7th Edition). 7 ed., ed. C.D. Klaassen. 2008, McGraw-Hill.

4. Willett, W.C., Balancing life-style and genomics research for disease prevention. Science, 2002. 296(5568): p. 695-8.

5. Christiani, D.C., Combating Environmental Causes of Cancer. New Engl J Med, 2011. 364(9): p. 791-793.

6. Ramachandrappa, S. and I.S. Farooqi, Genetic approaches to understanding human obesity. J Clin Invest, 2011. 121(6): p. 2080-6.

7. The Wellcome Trust Case Control Consortium, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007. 447(7145): p. 661-678.

8. Pearson, T.A. and T.A. Manolio, How to interpret a genome-wide association study. J Am Med Assoc, 2008. 299(11): p. 1335-44.

9. Hindorff, L., et al. A Catalog of Published Genome-Wide Association Studies. 2009 [cited 2009 7/28/2009]; Available from: http://www.genome.gov/gwastudies.

10. Ioannidis, J., et al., Researching Genetic Versus Nongenetic Determinants of Disease: A Comparison and Proposed Unification. Sci. Transl. Med., 2009. 1(7): p. 8.

11. Baker, D. and M. Nieuwenhuijsen, eds. Environmental Epidemiology. 2008, Oxford University Press: Oxford.

12. Judson, R.S., et al., In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization -- The ToxCast Project. Environ Health Perspect, 2009. 118(4).

13. Committee on Toxicity Testing and Assessment of Environmental Agents and National Research Council, Toxicity Testing in the 21st Century: A Vision and a Strategy. 2007, Washington, D.C.: National Academies Press.

14. Hubal, E.A., Biologically relevant exposure science for 21st century toxicity testing. Toxicol Sci, 2009. 111(2): p. 226-32.

15. Krewski, D., et al., New directions in toxicity testing. Annu Rev Publ Health, 2011. 32: p. 161-78.

16. Wild, C.P., Complementing the genome with an "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev, 2005. 14(8): p. 1847-50.

17. World Health Organization. Global Health Observatory Data Repository. 2011 [cited 2011 8/9/2011]; Available from: http://apps.who.int/ghodata/.

18. Mullis, K., Process for amplifying nucleic acid sequences, Cetus Corporation: USA.

19. Illumina, I. 2011 [cited 7/19/2011 7/19/2011]; Available from: http://www.illumina.com/.

20. National Center for Biotechnology Information. National Center for Biotechnology Information. 2011 [cited 2011 7/18/2011]; Available from: http://www.ncbi.nlm.nih.gov/guide/.

21. Anthony, J.C., The promise of psychiatric enviromics. Br J Psychiatry Suppl, 2001. 40: p. s8-11.

22. Liu, Y.I., P.H. Wise, and A.J. Butte, The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 2009. 10 Suppl 2: p. S14.

23. Ashley, E.A., et al., Clinical assessment incorporating a personal genome. Lancet, 2010. 375(9725): p. 1525-1535.

24. Kawakami, N., et al., Effects of smoking on the incidence of non-insulin-dependent diabetes mellitus. Replication and extension in a Japanese cohort of male employees. Am J Epidemiol, 1997. 145(2): p. 103-9.

25. International HapMap, C., A haplotype map of the human genome. Nature, 2005. 437(7063): p. 1299-320.

26. Goh, K.I., et al., The human disease network. Proc Natl Acad Sci U S A, 2007. 104(21): p. 8685-90.

27. Alan D. Lopez, et al., eds. Global Burden of Disease and Risk Factors. 2006, The International Bank for Reconstruction and Development / The World Bank: Washington DC.

28. Lettre, G. and J.D. Rioux, Autoimmune diseases: insights from genome-wide association studies. Hum Mol Genet, 2008. 17(R2): p. R116-21.

29. Rutter, G.A., Think zinc: New roles for zinc in the control of insulin secretion. Islets, 2009. 2(1): p. 49-50.

30. Majithia, A.R. and J.C. Florez, Clinical translation of genetic predictors for type 2 diabetes. Curr Opin Endocrinol, 2009. 16(2): p. 100-6.

31. Lyssenko, V., et al., Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes. J Clin Invest, 2007. 117(8): p. 2155-2163.

32. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-753.

33. Goldstein, D.B., Common genetic variation and human traits. N Engl J Med, 2009. 360(17): p. 1696-8.

34. Rothman, K., S. Greenland, and T. Lash, eds. Modern Epidemiology, 3rd Ed. 3rd ed. 2008, Lippincott Williams & Wilkins: Philadelphia.

35. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.

36. Patel, C.J., et al., Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. In Review, 2011.

37. Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey. 2009 [cited 2009 9/1/2009]; Available from: http://www.cdc.gov/nchs/nhanes/.

38. Noble, W.S., How does multiple testing correction work? Nat Biotech, 2009. 27(12): p. 1135-1137.

39. Benjamini, Y. and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B, 1995.

40. Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100(16): p. 9440-5.

41. Ioannidis, J.P., R. Tarone, and J.K. McLaughlin, The False-positive to False-negative Ratio in Epidemiologic Studies. Epidemiology, 2011. 22(4): p. 450-6.

42. Ioannidis, J.P.A., Why Most Published Research Findings Are False. PLoS Med, 2005. 2(8): p. e124.

43. Miller, E.R., 3rd, et al., Meta-analysis: high-dosage vitamin E supplementation may increase all-cause mortality. Ann Intern Med, 2005. 142(1): p. 37-46.

44. Hunter, D.J., Gene-environment interactions in human diseases. Nat Rev Genet, 2005. 6(4): p. 287-98.

45. Rothman, N., et al., A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet, 2010. 42(11): p. 978-84.

46. Garcia-Closas, M., et al., NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet, 2005. 366(9486): p. 649-59.

47. Thomas, D., Gene-environment-wide association studies: emerging approaches. Nat Rev Genet, 2010. 11(4): p. 259-272.

48. Ioannidis, J.P., Genetic associations: false or true? Trends Mol Med, 2003. 9(4): p. 135-8.

49. Lemaire, K., et al., Insulin crystallization depends on zinc transporter ZnT8 expression, but is not required for normal glucose homeostasis in mice. Proc Natl Acad Sci USA, 2009. 106(35): p. 14872-7.

50. Nicolson, T.J., et al., Insulin Storage and Glucose Homeostasis in Mice Null for the Granule Zinc Transporter ZnT8 and Studies of the Type 2 Diabetes-Associated Variants. Diabetes, 2009. 58(9): p. 2070-2083.

51. Waters, M.D. and J.M. Fostel, Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet, 2004. 5(12): p. 936-48.

52. Lamb, J., et al., The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 2006. 313(5795): p. 1929-35.

53. Barrett, T., et al., NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res, 2007. 35(Database issue): p. D760-5.

54. Davis, A.P., et al., Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res, 2009. 37(Database issue): p. D786-92.

55. Lim, E., et al., T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res, 2010. 38(Database issue): p. D781-6.

56. Dix, D.J., et al., The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci, 2007. 95(1): p. 5-12.

57. Dudley, J.T., et al., Disease signatures are robust across tissues and experiments. Mol Syst Biol, 2009. 5.

58. Sirota, M., et al., Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Sci Transl Med, 2011. 3(96): p. 96ra77.

59. MAQC Consortium, The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotech, 2006. 24(9): p. 1151-1161.

60. MAQC Consortium, The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotech, 2010. 28(8): p. 827-838.

61. Patel, C. and A. Butte, Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics, 2010. 3(1): p. 17.

62. Wang, T.J., et al., Metabolite profiles and the risk of developing diabetes. Nat Med, 2011. 17(4): p. 448-53.

63. Dawber, T.R., G.F. Meadors, and F.E. Moore, Jr., Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health Nations Health, 1951. 41(3): p. 279-81.

64. Voight, B.F., et al., Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet, 2010. 42(7): p. 579-589.

65. Teslovich, T.M., et al., Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 2010. 466(7307): p. 707-713.

66. Hamilton, C.M., et al., The PhenX Toolkit: Get the Most From Your Measures. Am J Epidemiol, 2011. 174(3): p. 253-260.

67. NHGRI. PhenX. 2011; Available from: http://www.phenx.org.

68. Davey Smith, G., Use of genetic markers and gene-diet interactions for interrogating population-level causal influences of diet on health. Genes Nutr, 2010. 6(1): p. 27-43.

69. Davey Smith, G. and S. Ebrahim, 'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol, 2003. 32(1): p. 1-22.

70. Thorgeirsson, T.E., et al., A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 2008. 452(7187): p. 638-42.

71. Liu, J.Z., et al., Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet, 2010. 42(5): p. 436-440.

72. Wang, K.S., et al., A meta-analysis of two genome-wide association studies identifies 3 new loci for alcohol dependence. J Psychiatr Res, 2011.

73. Heath, A.C., et al., A Quantitative-Trait Genome-Wide Association Study of Alcoholism Risk in the Community: Findings and Implications. Biol Psychiatry, 2011.

74. Schumann, G., et al., Genome-wide association and genetic functional studies identify autism susceptibility candidate 2 gene (AUTS2) in the regulation of alcohol consumption. Proc Natl Acad Sci U S A, 2011. 108(17): p. 7119-24.

75. Rauch, A., et al., Genetic variation in IL28B is associated with chronic hepatitis C and treatment failure: a genome-wide association study. Gastroenterology, 2010. 138(4): p. 1338-45, 1345 e1-7.

76. Kamatani, Y., et al., A genome-wide association study identifies variants in the HLA-DP locus associated with chronic hepatitis B in Asians. Nat Genet, 2009. 41(5): p. 591-5.

77. Petrovski, S., et al., Common human genetic variants and HIV-1 susceptibility: a genome-wide survey in a homogeneous African population. AIDS, 2011. 25(4): p. 513-8.

78. Sulem, P., et al., Sequence variants at CYP1A1-CYP1A2 and AHR associate with coffee consumption. Hum Mol Genet, 2011. 20(10): p. 2071-7.

79. De Moor, M.H., et al., Genome-wide association study of exercise behavior in Dutch and American adults. Med Sci Sports Exerc, 2009. 41(10): p. 1887-95.

80. Tanaka, T., et al., Genome-wide association study of vitamin B6, vitamin B12, folate, and homocysteine blood concentrations. Am J Hum Genet, 2009. 84(4): p. 477-82.

81. Yang, J.J. and R.M. Plenge, Genomic Technology Applied to Pharmacological Traits. J Am Med Assoc, 2011. 306(6): p. 652-653.

82. Peters, L.L., et al., The mouse as a model for human biology: a resource guide for complex trait analysis. Nat Rev Genet, 2007. 8(1): p. 58-69.

83. Romanoski, C.E., et al., Systems Genetics Analysis of Gene-by-Environment Interactions in Human Cells. Am J Hum Genet, 2010.

84. Idaghdour, Y., et al., Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet, 2010. 42(1): p. 62-7.

85. Judson, R., et al., The toxicity data landscape for environmental chemicals. Environ Health Perspect, 2009. 117(5): p. 685-95.

86. Weis, B.K., et al., Personalized exposure assessment: promising approaches for human environmental health research. Environ Health Perspect, 2005. 113(7): p. 840-8.

87. Wang, Y., et al., PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res, 2009. 37(Web Server issue): p. W623-33.

88. Williams-DeVane, C.R., M.A. Wolf, and A.M. Richard, DSSTox chemical-index files for exposure-related experiments in ArrayExpress and Gene Expression Omnibus: enabling toxico-chemogenomics data linkages. Bioinformatics, 2009. 25(5): p. 692-694.

89. Andrew, A.S., et al., Drinking-water arsenic exposure modulates gene expression in human lymphocytes from a U.S. population. Environ Health Perspect, 2008. 116(4): p. 524-31.

90. Malard, V., et al., Global gene expression profiling in human lung cells exposed to cobalt. BMC Genomics, 2007. 8: p. 147.

91. Wang, W., et al., NDRG3 is an androgen regulated and prostate enriched gene that promotes in vitro and in vivo prostate cancer cell growth. Int J Cancer, 2009. 124(3): p. 521-30.

92. Gottipolu, R.R., et al., One-month diesel exhaust inhalation produces hypertensive gene expression pattern in healthy rats. Environ Health Perspect, 2009. 117(1): p. 38-46.

93. Bild, A.H., et al., Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 2006. 439(7074): p. 353-7.

94. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.

95. Gohlke, J.M., et al., Genetic and environmental pathways to complex diseases. BMC Syst Biol, 2009. 3: p. 46.

96. Becker, K.G., et al., The genetic association database. Nat Genet, 2004. 36(5): p. 431-2.

97. Mattingly, C.J., et al., The comparative toxicogenomics database: a cross-species resource for building chemical-gene interaction networks. Toxicol Sci, 2006. 92(2): p. 587-95.

98. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.

99. National Center for Biotechnology Information. Homologene. 2010 [cited 3/2008]; Available from: http://www.ncbi.nlm.nih.gov/homologene.

100. Zeeberg, B.R., et al., High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics, 2005. 6: p. 168.

101. R Core Team, R: A language and environment for statistical computing, 2008, R Foundation for Statistical Computing: Vienna, Austria.

102. Bossé, Y., K. Maghni, and T.J. Hudson, 1alpha,25-dihydroxy-vitamin D3 stimulation of bronchial smooth muscle cells induces autocrine, contractility, and remodeling processes. Physiol Genomics, 2007. 29(2): p. 161-8.

103. Tijet, N., et al., Aryl hydrocarbon receptor regulates distinct dioxin-dependent and dioxin-independent gene batteries. Mol Pharmacol, 2006. 69(1): p. 140-53.

104. Li, Z., et al., Discrimination of vanadium from zinc using gene profiling in human bronchial epithelial cells. Environ Health Perspect, 2005: p. 1747-54.

105. Selvaraj, V., et al., Gene expression profiling of 17beta-estradiol and genistein effects on mouse thymus. Toxicol Sci, 2005. 87(1): p. 97-112.

106. Lin, C.Y., et al., Whole-genome cartography of estrogen receptor alpha binding sites. PLoS Genet, 2007. 3(6): p. e87.

107. Chandran, U.R., et al., Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process. BMC Cancer, 2007. 7: p. 64.

108. Yu, Y.P., et al., Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J Clin Oncol, 2004. 22(14): p. 2790-9.

109. Landi, M.T., et al., Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE, 2008. 3(2): p. e1651.

110. Liu, R., et al., The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med, 2007. 356(3): p. 217-26.

111. Wang, Y., et al., An overview of the PubChem BioAssay resource. Nucleic Acids Res, 2010. 38(Database issue): p. D255-66.

112. Ho, S.M., et al., Developmental exposure to estradiol and bisphenol A increases susceptibility to prostate carcinogenesis and epigenetically regulates phosphodiesterase type 4 variant 4. Cancer Res, 2006. 66(11): p. 5624-32.

113. Shazer, R.L., et al., Raloxifene, an oestrogen-receptor-beta-targeted therapy, inhibits androgen-independent prostate cancer growth: results from preclinical studies and a pilot phase II clinical trial. BJU Int, 2006. 97(4): p. 691-7.

114. Benbrahim-Tallaa, L., et al., Molecular events associated with arsenic-induced malignant transformation of human prostatic epithelial cells: aberrant genomic DNA methylation and K-ras oncogene activation. Toxicol Appl Pharmacol, 2005. 206(3): p. 288-98.

115. Bertilaccio, M.T., et al., Vasculature-targeted tumor necrosis factor-alpha increases the therapeutic index of doxorubicin against prostate cancer. Prostate, 2008. 68(10): p. 1105-15.

116. Borden, L.S., Jr., et al., Vinorelbine, doxorubicin, and prednisone in androgen-independent prostate cancer. Cancer, 2006. 107(5): p. 1093-100.

117. Amato, R.J. and H. Sarao, A phase I study of paclitaxel/doxorubicin/thalidomide in patients with androgen-independent prostate cancer. Clin Genitourin Cancer, 2006. 4(4): p. 281-6.

118. Kang, J., et al., Subtoxic concentration of doxorubicin enhances TRAIL-induced apoptosis in human prostate cancer cell line LNCaP. Prostate Cancer Prostatic Dis, 2005. 8(3): p. 274-9.

119. Benbrahim-Tallaa, L., et al., Estrogen signaling and disruption of androgen metabolism in acquired androgen-independence during cadmium carcinogenesis in human prostate epithelial cells. Prostate, 2007. 67(2): p. 135-45.

120. Raschke, M., K. Wahala, and B.L. Pool-Zobel, Reduced isoflavone metabolites formed by the human gut microflora suppress growth but do not affect DNA integrity of human prostate cancer cells. Br J Nutr, 2006. 96(3): p. 426-34.

121. Takahashi, Y., et al., Using DNA microarray analyses to elucidate the effects of genistein in androgen-responsive prostate cancer cells: identification of novel targets. Mol Carcinog, 2004. 41(2): p. 108-119.

122. Li, Y., et al., Regulation of gene expression and inhibition of experimental prostate cancer bone metastasis by dietary genistein. Neoplasia, 2004. 6(4): p. 354-63.

123. Koike, H., et al., Insulin-like growth factor binding protein-6 inhibits prostate cancer cell proliferation: implication for anticancer effect of diethylstilbestrol in hormone refractory prostate cancer. Br J Cancer, 2005. 92(8): p. 1538-44.

124. Oh, W.K., The evolving role of estrogen therapy in prostate cancer. Clin Prostate Cancer, 2002. 1(2): p. 81-9.

125. Tokar, E.J., et al., Cholecalciferol (vitamin D3) and the retinoid N-(4-hydroxyphenyl)retinamide (4-HPR) are synergistic for chemoprevention of prostate cancer. J Exp Ther Oncol, 2006. 5(4): p. 323-33.

126. Costello, L.C. and R.B. Franklin, The clinical relevance of the metabolism of prostate cancer; zinc and tumor suppression: connecting the dots. Mol Cancer, 2006. 5: p. 17.

127. Uzzo, R.G., et al., Diverse effects of zinc on NF-kappaB and AP-1 transcription factors: implications for prostate cancer progression. Carcinogenesis, 2006. 27(10): p. 1980-90.

128. Michael, I.P., et al., Human tissue kallikrein 5 is a member of a proteolytic cascade pathway involved in seminal clot liquefaction and potentially in prostate cancer progression. J Biol Chem, 2006. 281(18): p. 12743-50.

129. Uzzo, R.G., et al., Zinc inhibits nuclear factor-kappa B activation and sensitizes prostate cancer cells to cytotoxic agents. Clin Cancer Res, 2002. 8(11): p. 3579-83.

130. Filyak, Y., O. Filyak, and R. Stoika, Transforming growth factor beta-1 enhances cytotoxic effect of doxorubicin in human lung adenocarcinoma cells of A549 line. Cell Biol Int, 2007. 31(8): p. 851-5.

131. Shen, J., et al., Fetal onset of aberrant gene expression relevant to pulmonary carcinogenesis in lung adenocarcinoma development induced by in utero arsenic exposure. Toxicol Sci, 2007. 95(2): p. 313-20.

132. Waalkes, M.P., et al., Enhanced urinary bladder and liver carcinogenesis in male CD1 mice exposed to transplacental inorganic arsenic and postnatal diethylstilbestrol or tamoxifen. Toxicol Appl Pharmacol, 2006. 215(3): p. 295-305.

133. Waalkes, M.P., et al., Animal models for arsenic carcinogenesis: inorganic arsenic is a transplacental carcinogen in mice. Toxicol Appl Pharmacol, 2004. 198(3): p. 377-84.

134. Devereux, T.R., et al., Map kinase activation correlates with K-ras mutation and loss of heterozygosity on chromosome 6 in alveolar bronchiolar carcinomas from B6C3F1 mice exposed to vanadium pentoxide for 2 years. Carcinogenesis, 2002. 23(10): p. 1737-43.

135. Zanesi, N., et al., Lung cancer susceptibility in Fhit-deficient mice is increased by Vhl haploinsufficiency. Cancer Res, 2005. 65(15): p. 6576-82.

136. Diament, M.J., et al., Inhibition of tumor progression and paraneoplastic syndrome development in a murine lung adenocarcinoma by medroxyprogesterone acetate and indomethacin. Cancer Invest, 2006. 24(2): p. 126-31.

137. Moody, T.W., et al., Indomethacin reduces lung adenoma number in A/J mice. Anticancer Res, 2001. 21(3B): p. 1749-55.

138. Levin, G., et al., Indomethacin inhibits the accumulation of tumor cells in mouse lungs and subsequent growth of lung metastases. Chemotherapy, 2000. 46(6): p. 429-37.

139. Meira, L.B., et al., Cancer predisposition in mutant mice defective in multiple genetic pathways: uncovering important genetic interactions. Mutat Res, 2001. 477(1-2): p. 51-8.

140. Fan, J.G., Q.E. Wang, and S.J. Liu, Chrysotile-induced cell transformation and transcriptional changes of c-myc oncogene in human embryo lung cells. Biomed Environ Sci, 2000. 13(3): p. 163-9.

141. Carvajal, A., et al., Progesterone pre-treatment potentiates EGF pathway signaling in the breast cancer cell line ZR-75. Breast Cancer Res Treat, 2005. 94(2): p. 171-83.

142. Kato, S., et al., Progesterone increases tissue factor gene expression, procoagulant activity, and invasion in the breast cancer cell line ZR-75-1. J Clin Endocrinol Metab, 2005. 90(2): p. 1181-8.

143. Verheus, M., et al., Plasma phytoestrogens and subsequent breast cancer risk. J Clin Oncol, 2007. 25(6): p. 648-55.

144. Nobert, G.S., M.M. Kraak, and S. Crawford, Estrogen dependent growth inhibitory effects of tamoxifen but not genistein in solid tumors derived from estrogen receptor positive (ER+) primary breast carcinoma MCF7: single agent and novel combined treatment approaches. Bull Cancer, 2006. 93(7): p. E59-66.

145. Seo, H.S., et al., Stimulatory effect of genistein and apigenin on the growth of breast cancer cells correlates with their ability to activate ER alpha. Breast Cancer Res Treat, 2006. 99(2): p. 121-34.

146. Lakshmanaswamy, R., R.C. Guzman, and S. Nandi, Hormonal prevention of breast cancer: significance of promotional environment. Adv Exp Med Biol, 2008. 617: p. 469-75.

147. Bergman Jungestrom, M., L.U. Thompson, and C. Dabrosin, Flaxseed and its lignans inhibit estradiol-induced growth, angiogenesis, and secretion of vascular endothelial growth factor in human breast cancer xenografts in vivo. Clin Cancer Res, 2007. 13(3): p. 1061-7.

148. Vogel, V.G., Recent results from clinical trials using SERMs to reduce the risk of breast cancer. Ann N Y Acad Sci, 2006. 1089: p. 127-42.

149. Eliassen, A.H., et al., Endogenous steroid hormone concentrations and risk of breast cancer among premenopausal women. J Natl Cancer Inst, 2006. 98(19): p. 1406-15.

150. Russo, J., et al., Estrogen and its metabolites are carcinogenic agents in human breast epithelial cells. J Steroid Biochem Mol Biol, 2003. 87(1): p. 1-25.

151. Ackerstaff, E., et al., Anti-inflammatory agent indomethacin reduces invasion and alters metabolism in a human breast cancer cell line. Neoplasia, 2007. 9(3): p. 222-35.

152. Green, M., et al., Diallyl sulfide induces the expression of estrogen metabolizing genes in the presence and/or absence of diethylstilbestrol in the breast of female ACI rats. Toxicol Lett, 2007. 168(1): p. 7-12.

153. Walter, G., R. Liebl, and E. von Angerer, Synthesis and biological evaluation of stilbene-based pure estrogen antagonists. Bioorg Med Chem Lett, 2004. 14(18): p. 4659-63.

154. Vegran, F., et al., Overexpression of caspase-3s splice variant in locally advanced breast carcinoma is associated with poor response to neoadjuvant chemotherapy. Clin Cancer Res, 2006. 12(19): p. 5794-800.

155. Untch, M., et al., Cardiac safety of trastuzumab in combination with epirubicin and cyclophosphamide in women with metastatic breast cancer: results of a phase I trial. Eur J Cancer, 2004. 40(7): p. 988-97.

156. Machiels, J.P., et al., Cyclophosphamide, doxorubicin, and paclitaxel enhance the antitumor immune response of granulocyte/macrophage-colony stimulating factor-secreting whole-cell vaccines in HER-2/neu tolerized mice. Cancer Res, 2001. 61(9): p. 3689-97.

157. Murray, T.J., et al., Induction of mammary gland ductal hyperplasias and carcinoma in situ following fetal bisphenol A exposure. Reprod Toxicol, 2007. 23(3): p. 383-90.

158. Uehara, T., et al., A toxicogenomics approach for early assessment of potential non-genotoxic hepatocarcinogenicity of chemicals in rats. Toxicology, 2008. 250(1): p. 15-26.

159. Yager, J.D. and N.E. Davidson, Estrogen carcinogenesis in breast cancer. N Engl J Med, 2006. 354(3): p. 270-82.

160. Dairkee, S.H., et al., Bisphenol A induces a profile of tumor aggressiveness in high-risk cells from breast cancer patients. Cancer Res, 2008. 68(7): p. 2076-80.

161. Buteau-Lozano, H., et al., Xenoestrogens modulate vascular endothelial growth factor secretion in breast cancer cells through an estrogen receptor-dependent mechanism. J Endocrinol, 2008. 196(2): p. 399-412.

162. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 2005. 102(43): p. 15545-50.

163. Salonen, J.T., et al., Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium. Am J Hum Genet, 2007. 81(2): p. 338-45.

164. Saxena, R., et al., Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 2007. 316(5829): p. 1331-6.

165. Sladek, R., et al., A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 2007. 445(7130): p. 881-5.

166. McClellan, J. and M.-C. King, Genetic Heterogeneity in Human Disease. Cell, 2010. 141(2): p. 210-217.

167. Hardy, J. and A. Singleton, Genomewide Association Studies and Human Disease. N Engl J Med, 2009. 360(17): p. 1759-1768.

168. Manolio, T.A., L.D. Brooks, and F.S. Collins, A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 2008. 118(5): p. 1590-605.

169. Wetterstrand, K. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. 2011 [cited 2011/08/12]; Available from: http://www.genome.gov/sequencingcosts.

170. Frayling, T.M., et al., A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science, 2007. 316(5826): p. 889-94.

171. McCarthy, M.I., et al., Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, 2008. 9(5): p. 356-369.

172. NCI-NHGRI Working Group on Replication in Association Studies, Replicating genotype-phenotype associations. Nature, 2007. 447(7145): p. 655-660.

173. Ioannidis, J.P., et al., Replication validity of genetic association studies. Nat Genet, 2001. 29(3): p. 306-9.

174. Christakis, N.A. and J.H. Fowler, The spread of obesity in a large social network over 32 years. N Engl J Med, 2007. 357(4): p. 370-9.

175. Pearson, J.F., et al., Association Between Fine Particulate Matter and Diabetes Prevalence in the U.S. Diabetes Care, 2010. 33(10): p. 2196-2201.

176. Butte, A.J. and I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 2000: p. 418-29.

177. Butte, A.J., et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci U S A, 2000. 97(22): p. 12182-6.

178. Austin, P.C., et al., Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol, 2006. 59(9): p. 964-9.

179. Young, S.S., Acknowledge and fix the multiple testing problem. Int J Epidemiol, 2010. 39(3): p. 934; author reply 934-5.

180. Young, S.S. and M. Yu, Association of bisphenol A with diabetes and other abnormalities. J Am Med Assoc, 2009. 301(7): p. 720-1; author reply 721-2.

181. Smith, G.D., et al., Clustered environments and randomized genes: a fundamental distinction between conventional and genetic epidemiology. PLoS Med, 2007. 4(12): p. e352.

182. Greenland, S., Randomization, Statistics, and Causal Inference. Epidemiology, 1990. 1(6): p. 421-429.

183. Greenland, S. and H. Morgenstern, Confounding in Health Research. Annu Rev Public Health, 2001. 22(1): p. 189-212.

184. Peto, R., et al., Can dietary beta-carotene materially reduce human cancer rates? Nature, 1981. 290(5803): p. 201-208.

185. Omenn, G.S., et al., Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med, 1996. 334(18): p. 1150-5.

186. Hooper, L., A.R. Ness, and G.D. Smith, Antioxidant strategy for cardiovascular diseases. Lancet, 2001. 357(9269): p. 1705-6.

187. Bartell, S.M., W.C. Griffith, and E.M. Faustman, Temporal error in biomarker-based mean exposure estimates for individuals. J Expo Anal Environ Epidemiol, 2004. 14(2): p. 173-179.

188. Manly, B.F., Randomization, Bootstrap and Monte Carlo Methods in Biology. 3rd ed. 2007, Boca Raton: Chapman and Hall/CRC.

189. Efron, B., Large-Scale Inference. 2010, Cambridge: Cambridge University Press.

190. Westfall, P.H. and S.S. Young, Resampling-based Multiple Testing. 1993, New York: Wiley.

191. Witten, D.M. and R. Tibshirani, Survival analysis with high-dimensional covariates. Stat Methods Med Res, 2010. 19(1): p. 29-51.

192. Tibshirani, R., Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996. 58(1): p. 267-288.

193. Zou, H. and T. Hastie, Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 2005. 67(2): p. 301-320.

194. Hastie, T., R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. 2009: Springer.

195. Vittinghoff, E., et al., Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2005, New York: Springer.

196. Todd, J.A., D'oh! Genes and Environment Cause Crohn's Disease. Cell, 2010. 141(7): p. 1114-1116.

197. Fallin, M.D. and W.H.L. Kao, Is 'X'-WAS the Future for All of Epidemiology? Epidemiology, 2011. 22(4): p. 457-459. doi: 10.1097/EDE.0b013e31821d3a9f.

198. Mak, H.C., Trends in computational biology - 2010. Nat Biotech, 2011. 29(1): p. 45-45.

199. Heard, E., et al., Ten years of genetics and genomics: what have we achieved and where are we heading? Nat Rev Genet, 2010. 11(10): p. 723-733.

200. Borrell, B., Epidemiology: Every bite you take. Nature, 2011. 470(7334): p. 320-2.

201. Mathers, C.D. and D. Loncar, Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med, 2006. 3(11): p. e442.

202. Zeggini, E., et al., Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet, 2008. 40(5): p. 638-45.

203. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA, 2009. 106(23): p. 9362-9367.

204. ADA. Diabetes Information -- All About Diabetes. 2009 [cited 6/1/2009]; Available from: http://www.diabetes.org/about-diabetes.jsp.

205. Lumley, T., survey: analysis of complex survey samples, 2009.

206. R Development Core Team, R: A language for statistical computing, 2009, R Foundation for Statistical Computing: Vienna, Austria.

207. CDC and National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Analytic Guidelines. 2003 [cited 2010 2/19/2010]; Available from: http://www.cdc.gov/nchs/data/nhanes/nhanes_03_04/nhanes_analytic_guidelines_dec_2005.pdf.

208. Cowie, C.C., et al., Prevalence of diabetes and impaired fasting glucose in adults in the U.S. population: National Health And Nutrition Examination Survey 1999-2002. Diabetes Care, 2006. 29(6): p. 1263-8.

209. Abahusain, M.A., et al., Retinol, alpha-tocopherol and carotenoids in diabetes. Eur J Clin Nutr, 1999. 53(8): p. 630-5.

210. Polidori, M.C., et al., Plasma levels of lipophilic antioxidants in very old patients with type 2 diabetes. Diabetes Metab Res Rev, 2000. 16(1): p. 15-9.

211. Arnlov, J., et al., Serum and dietary beta-carotene and alpha-tocopherol and incidence of type 2 diabetes mellitus in a community-based study of Swedish men: report from the Uppsala Longitudinal Study of Adult Men (ULSAM) study. Diabetologia, 2009. 52(1): p. 97-105.

212. Ford, E.S., et al., Diabetes mellitus and serum carotenoids: findings from the Third National Health and Nutrition Examination Survey. Am J Epidemiol, 1999. 149(2): p. 168-76.

213. Ylonen, K., et al., Dietary intakes and plasma concentrations of carotenoids and tocopherols in relation to glucose metabolism in subjects at high risk of type 2 diabetes: the Botnia Dietary Study. Am J Clin Nutr, 2003. 77(6): p. 1434-41.

214. Wang, L., et al., Plasma lycopene, other carotenoids, and the risk of type 2 diabetes in women. Am J Epidemiol, 2006. 164(6): p. 576-85.

215. Montonen, J., et al., Dietary antioxidant intake and risk of type 2 diabetes. Diabetes Care, 2004. 27(2): p. 362-6.

216. Song, Y., et al., Effects of vitamins C and E and beta-carotene on the risk of type 2 diabetes in women at high risk of cardiovascular disease: a randomized controlled trial. Am J Clin Nutr, 2009. 90(2): p. 429-37.

217. Kataja-Tuomola, M., et al., Effect of alpha-tocopherol and beta-carotene supplementation on the incidence of type 2 diabetes. Diabetologia, 2008. 51(1): p. 47-53.

218. Codru, N., et al., Diabetes in relation to serum levels of polychlorinated biphenyls and chlorinated pesticides in adult Native Americans. Environ Health Perspect, 2007. 115(10): p. 1442-7.

219. Uemura, H., et al., Associations of environmental exposure to dioxins with prevalent diabetes among general inhabitants in Japan. Environ Res, 2008. 108(1): p. 63-8.

220. Rignell-Hydbom, A., L. Rylander, and L. Hagmar, Exposure to persistent organochlorine pollutants and type 2 diabetes mellitus. Hum Exp Toxicol, 2007. 26(5): p. 447-52.

221. Wang, S.L., et al., Increased risk of diabetes and polychlorinated biphenyls and dioxins: a 24-year follow-up study of the Yucheng cohort. Diabetes Care, 2008. 31(8): p. 1574-9.

222. Jiang, Q., et al., gamma-tocopherol, the major form of vitamin E in the US diet, deserves more attention. Am J Clin Nutr, 2001. 74(6): p. 714-22.

223. Burton, G.W., et al., Human plasma and tissue alpha-tocopherol concentrations in response to supplementation with deuterated natural and synthetic vitamin E. Am J Clin Nutr, 1998. 67(4): p. 669-84.

224. Campbell, S., et al., Development of gamma (gamma)-tocopherol as a colorectal cancer chemopreventive agent. Crit Rev Oncol Hematol, 2003. 47(3): p. 249-59.

225. Agency for Toxic Substances and Disease Registry. Heptachlor and Heptachlor Epoxide. 2007 [cited 2009 8/1/2009]; Available from: http://www.atsdr.cdc.gov/tfacts12.html.

226. Office of Water Regulations and Standards, Ambient Water Quality Criteria for Heptachlor, ed. U.S. Environmental Protection Agency. Vol. EPA 440/5-80-052. 1980, Washington, DC: United States Environmental Protection Agency.

227. Montgomery, M.P., et al., Incident diabetes and pesticide exposure among licensed pesticide applicators: Agricultural Health Study, 1993-2003. Am J Epidemiol, 2008. 167(10): p. 1235-46.

228. Zeggini, E., et al., Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 2007. 316(5829): p. 1336-41.

229. Heller, D.A., et al., Genetic and environmental influences on serum lipid levels in twins. N Engl J Med, 1993. 328(16): p. 1150-6.

230. Costanza, M.C., et al., Relative Contributions of Genes, Environment, and Interactions to Blood Lipid Concentrations in a General Adult Population. Am J Epidemiol, 2005. 161(8): p. 714-724.

231. Kris-Etherton, P.M., et al., The effect of diet on plasma lipids, lipoproteins, and coronary heart disease. J Am Diet Assoc, 1988. 88(11): p. 1373-400.

232. Schaefer, E.J., Lipoproteins, nutrition, and heart disease. Am J Clin Nutr, 2002. 75(2): p. 191-212.

233. Varady, K.A. and P.J. Jones, Combination diet and exercise interventions for the treatment of dyslipidemia: an effective preliminary strategy to lower cholesterol levels? J Nutr, 2005. 135(8): p. 1829-35.

234. Craig, W.Y., G.E. Palomaki, and J.E. Haddow, Cigarette smoking and serum lipid and lipoprotein concentrations: an analysis of published data. BMJ, 1989. 298(6676): p. 784-8.

235. Kraus, W.E., et al., Effects of the amount and intensity of exercise on plasma lipoproteins. N Engl J Med, 2002. 347(19): p. 1483-92.

236. Brook, R.D., et al., Particulate Matter Air Pollution and Cardiovascular Disease: An Update to the Scientific Statement From the American Heart Association. Circulation, 2010. 121(21): p. 2331-2378.

237. Ezzati, T.M., et al., Sample design: Third National Health and Nutrition Examination Survey. Vital Health Stat 2, 1992(113): p. 1-35.

238. Storey, J.D., A Direct Approach to False Discovery Rates. J R Statist Soc B, 2002. 64: p. 479-498.

239. American Heart Association. Drug Therapy for Cholesterol. 2010 [cited 2010 10/5]; Available from: http://www.heart.org/HEARTORG/Conditions/Cholesterol/PreventionTreatmentofHighCholesterol/Drug-Therapy-for-Cholesterol_UCM_305632_Article.jsp.

240. Ainsworth, B.E., et al., Compendium of physical activities: an update of activity codes and MET intensities. Med Sci Sports Exerc, 2000. 32(9 Suppl): p. S498-504.

241. Nelson, M.E., et al., Physical activity and public health in older adults: recommendation from the American College of Sports Medicine and the American Heart Association. Med Sci Sports Exerc, 2007. 39(8): p. 1435-45.

242. Cohen, J., Statistical power analysis for the behavioral sciences. 2nd ed. 1988, Hillsdale, NJ: Erlbaum.

243. Fryar, C.D., et al., Hypertension, high serum total cholesterol, and diabetes: racial and ethnic prevalence differences in U.S. adults, 1999-2006. NCHS Data Brief, 2010(36): p. 1-8.

244. Ford, E.S., et al., Hypertriglyceridemia and Its Pharmacologic Treatment Among US Adults. Arch Intern Med, 2009. 169(6): p. 572-578.

245. Harrison, E.H., Mechanisms of digestion and absorption of dietary vitamin A. Annu Rev Nutr, 2005. 25: p. 87-103.

246. Willett, W.C., Nutritional Epidemiology. 1998, New York: Oxford University Press.

247. Yusuf, S., et al., Vitamin E supplementation and cardiovascular events in high-risk patients. The Heart Outcomes Prevention Evaluation Study Investigators. N Engl J Med, 2000. 342(3): p. 154-60.

248. Omenn, G.S., et al., Long-term vitamin A does not produce clinically significant hypertriglyceridemia: results from CARET, the beta-carotene and retinol efficacy trial. Cancer Epidemiol Biomarkers Prev, 1994. 3(8): p. 711-3.

249. Redlich, C.A., et al., Effect of long-term beta-carotene and vitamin A on serum cholesterol and triglyceride levels among participants in the Carotene and Retinol Efficacy Trial (CARET). Atherosclerosis, 1999. 145(2): p. 425-32.

250. Vivekananthan, D.P., et al., Use of antioxidant vitamins for the prevention of cardiovascular disease: meta-analysis of randomised trials. Lancet, 2003. 361(9374): p. 2017-2023.

251. Mente, A., et al., A Systematic Review of the Evidence Supporting a Causal Link Between Dietary Factors and Coronary Heart Disease. Arch Intern Med, 2009. 169(7): p. 659-669.

252. Willcox, B.J., J.D. Curb, and B.L. Rodriguez, Antioxidants in cardiovascular health and disease: key lessons from epidemiologic studies. Am J Cardiol, 2008. 101(10A): p. 75D-86D.

253. Bender, D., Nutritional Biochemistry of the Vitamins. 2003, Cambridge: Cambridge University Press.

254. Ogihara, T., et al., Distribution of tocopherol among human plasma lipoproteins. Clin Chim Acta, 1988. 174(3): p. 299-305.

255. Winbauer, A.N., S.S. Pingree, and K.L. Nuttall, Evaluating serum alpha-tocopherol (vitamin E) in terms of a lipid ratio. Ann Clin Lab Sci, 1999. 29(3): p. 185-91.

256. Semmler, A., et al., Plasma folate levels are associated with the lipoprotein profile: a retrospective database analysis. Nutrition Journal, 2010. 9(1): p. 31.

257. Jorde, R., et al., High serum 25-hydroxyvitamin D concentrations are associated with a favorable serum lipid profile. Eur J Clin Nutr, 2010.

258. Smith, K.M., et al., Relationship between fish intake, n-3 fatty acids, mercury and risk markers of CHD (National Health and Nutrition Examination Survey 1999-2002). Public Health Nutr, 2009. 12(8): p. 1261-9.

259. Hu, F.B. and W.C. Willett, Optimal Diets for Prevention of Coronary Heart Disease. J Am Med Assoc, 2002. 288(20): p. 2569-2578.

260. Joshipura, K.J., et al., The Effect of Fruit and Vegetable Intake on Risk for Coronary Heart Disease. Ann Intern Med, 2001. 134(12): p. 1106-1114.

261. Bassett, C.M., D. Rodriguez-Leyva, and G.N. Pierce, Experimental and clinical research findings on the cardiovascular benefits of consuming flaxseed. Appl Physiol Nutr Metab, 2009. 34(5): p. 965-74.

262. Pan, A., et al., Meta-analysis of the effects of flaxseed interventions on blood lipids. Am J Clin Nutr, 2009. 90(2): p. 288-97.

263. Park, D., T. Huang, and W.H. Frishman, Phytoestrogens as cardioprotective agents. Cardiol Rev, 2005. 13(1): p. 13-7.

264. Xu, X., et al., Studying associations between urinary metabolites of polycyclic aromatic hydrocarbons (PAHs) and cardiovascular diseases in the United States. Sci Total Environ, 2010. 408(21): p. 4943-4948.

265. Pope, C.A., III, et al., Cardiovascular Mortality and Long-Term Exposure to Particulate Air Pollution: Epidemiological Evidence of General Pathophysiological Pathways of Disease. Circulation, 2004. 109(1): p. 71-77.

266. Miller, K.A., et al., Long-term exposure to air pollution and incidence of cardiovascular events in women. N Engl J Med, 2007. 356(5): p. 447-58.

267. Wilson, P.W., et al., Factors associated with lipoprotein cholesterol levels. The Framingham study. Arteriosclerosis, 1983. 3(3): p. 273-81.

268. Njolstad, I., E. Arnesen, and P.G. Lund-Larsen, Smoking, serum lipids, blood pressure, and sex differences in myocardial infarction. A 12-year follow-up of the Finnmark Study. Circulation, 1996. 93(3): p. 450-6.

269. Moffatt, R.J., et al., Acute exposure to environmental tobacco smoke reduces HDL-C and HDL2-C. Prev Med, 2004. 38(5): p. 637-41.

270. Uemura, H., et al., Prevalence of metabolic syndrome associated with body burden levels of dioxin and related compounds among Japan's general population. Environ Health Perspect, 2009. 117(4): p. 568-73.

271. Dirinck, E., et al., Obesity and Persistent Organic Pollutants: Possible Obesogenic Effect of Organochlorine Pesticides and Polychlorinated Biphenyls. Obesity (Silver Spring), 2010.

272. Goncharov, A., et al., High serum PCBs are associated with elevation of serum lipids and cardiovascular disease in a Native American population. Environ Res, 2008. 106(2): p. 226-39.

273. Sergeev, A.V. and D.O. Carpenter, Hospitalization rates for coronary heart disease in relation to residence near areas contaminated with persistent organic pollutants and other pollutants. Environ Health Perspect, 2005. 113(6): p. 756-61.

274. Gustavsson, P. and C. Hogstedt, A cohort study of Swedish capacitor manufacturing workers exposed to polychlorinated biphenyls (PCBs). Am J Ind Med, 1997. 32(3): p. 234-9.

275. Morgan, T.M., et al., Nonvalidation of reported genetic risk factors for acute coronary syndrome in a large-scale replication study. J Am Med Assoc, 2007. 297(14): p. 1551-61.

276. Boffetta, P., et al., False-positive results in cancer epidemiology: a plea for epistemological modesty. J Natl Cancer Inst, 2008. 100(14): p. 988-95.

277. Lang, I.A., et al., Association of urinary bisphenol A concentration with medical disorders and laboratory abnormalities in adults. J Am Med Assoc, 2008. 300(11): p. 1303-10.

278. Navas-Acien, A., et al., Arsenic exposure and prevalence of type 2 diabetes in US adults. J Am Med Assoc, 2008. 300(7): p. 814-22.

279. Everett, C.J., et al., Association of a polychlorinated dibenzo-p-dioxin, a polychlorinated biphenyl, and DDT with diabetes in the 1999-2002 National Health and Nutrition Examination Survey. Environ Res, 2007. 103(3): p. 413-8.

280. Lee, D.H., et al., Association between serum concentrations of persistent organic pollutants and insulin resistance among nondiabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetes Care, 2007. 30(3): p. 622-8.

281. Lee, D.H., et al., Relationship between serum concentrations of persistent organic pollutants and the prevalence of metabolic syndrome among non-diabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetologia, 2007. 50(9): p. 1841-51.

282. Kitao, Y., et al., A contribution to genome-wide association studies: search for susceptibility loci for schizophrenia using DNA microsatellite markers on chromosomes 19, 20, 21 and 22. Psychiatr Genet, 2000. 10(3): p. 139-43.

283. Ohnishi, Y., et al., A high-throughput SNP typing system for genome-wide association studies. J Hum Genet, 2001. 46(8): p. 471-7.

284. Duncan, D.E., Experimental man: what one man's body reveals about his future, your health, and our toxic world. 2009, Hoboken, NJ: Wiley.

285. Ober, C. and D. Vercelli, Gene-environment interactions in human disease: nuisance or opportunity? Trends Genet, 2011. 27(3): p. 107-15.

286. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50.

287. National Institute of Child Health and Human Development. Phenylketonuria. 2010 3/24/2010 [cited 2010 8/18]; Available from: http://www.nichd.nih.gov/health/topics/phenylketonuria.cfm.

288. Crowley, J.J., P.F. Sullivan, and H.L. McLeod, Pharmacogenomic genome-wide association studies: lessons learned thus far. Pharmacogenomics, 2009. 10(2): p. 161-3.

289. Khoury, M.J., M.J. Adams, Jr., and W.D. Flanders, An epidemiologic approach to ecogenetics. Am J Hum Genet, 1988. 42(1): p. 89-95.

290. Garrod, A., Alkaptonuria. Lancet, 1902: p. 653-656.

291. Garrod, A., The Inborn Factors in Disease: An Essay. 1931, Oxford: Clarendon Press.

292. Motulsky, A.G., Drug reactions, enzymes, and biochemical genetics. J Am Med Assoc, 1957. 165(7): p. 835-837.

293. Khoury, M.J., T.H. Beaty, and B. Cohen, Fundamentals of Genetic Epidemiology. 1 ed. Vol. 1. 1993, New York: Oxford University Press.

294. Siemiatycki, J. and D.C. Thomas, Biological models and statistical interactions: an example from multistage carcinogenesis. Int J Epidemiol, 1981. 10(4): p. 383-7.

295. Wang, X., R.C. Elston, and X. Zhu, Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet, 2010. 12(1): p. 74.

296. Kellermann, G., C.R. Shaw, and M. Luyten-Kellerman, Aryl hydrocarbon hydroxylase inducibility and bronchogenic carcinoma. N Engl J Med, 1973. 289(18): p. 934-7.

297. Stern, M.C., et al., Polymorphisms in DNA repair genes, smoking, and bladder cancer risk: findings from the international consortium of bladder cancer. Cancer Res, 2009. 69(17): p. 6857-64.

298. Vineis, P., et al., Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: a pooled analysis of genotype-based studies. Cancer Epidemiol Biomarkers Prev, 2001. 10(12): p. 1249-52.

299. Grarup, N. and G. Andersen, Gene-environment interactions in the pathogenesis of type 2 diabetes and metabolism. Curr Opin Clin Nutr Metab Care, 2007. 10(4): p. 420-6.

300. Romao, I. and J. Roth, Genetic and environmental interactions in obesity and type 2 diabetes. J Am Diet Assoc, 2008. 108(4 Suppl 1): p. S24-8.

301. Khoury, M.J. and S. Wacholder, Invited commentary: from genome-wide association studies to gene-environment-wide interaction studies--challenges and opportunities. Am J Epidemiol, 2009. 169(2): p. 227-30; discussion 234-5.

302. Hetherington, M.M. and J.E. Cecil, Gene-environment interactions in obesity. Forum Nutr, 2010. 63: p. 195-203.

303. Memisoglu, A., et al., Interaction between a peroxisome proliferator-activated receptor gamma gene polymorphism and dietary fat intake in relation to body mass. Hum Mol Genet, 2003. 12(22): p. 2923-9.

304. Cornelis, M.C., et al., TCF7L2, dietary carbohydrate, and risk of type 2 diabetes in US women. Am J Clin Nutr, 2009. 89(4): p. 1256-1262.

305. Ioannidis, J.P., Why most discovered true associations are inflated. Epidemiology, 2008. 19(5): p. 640-8.

306. Omenn, G.S., Overview of the symposium on public health significance of genomics and eco-genetics. Annu Rev Public Health, 2010. 31: p. 1-8.

307. Ioannidis, J.P., Commentary: grading the credibility of molecular evidence for complex diseases. Int J Epidemiol, 2006. 35(3): p. 572-8; discussion 593-6.

308. Chen, R., et al., Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS One, 2010. 5(10): p. e13574.

309. Nyholt, D.R., A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet, 2004. 74(4): p. 765-9.

310. Bůžková, P., T. Lumley, and K. Rice, Permutation and Parametric Bootstrap Tests for Gene–Gene and Gene–Environment Interactions. Ann Hum Genet, 2011. 75(1): p. 36-45.

311. Wilson, P.W., et al., Prediction of incident diabetes mellitus in middle-aged adults: the Framingham Offspring Study. Arch Intern Med, 2007. 167(10): p. 1068-74.

312. Lyssenko, V., et al., Predictors of and Longitudinal Changes in Insulin Sensitivity and Secretion Preceding Onset of Type 2 Diabetes. Diabetes, 2005. 54(1): p. 166-174.

313. Gauderman, J. and J. Morrison, QUANTO - a program to compute power for G x E and G x G studies. 2009: Los Angeles.

314. Vineis, P., A self-fulfilling prophecy: are we underestimating the role of the environment in gene-environment interaction research? Int J Epidemiol, 2004. 33(5): p. 945-946.

315. Hayes, M.G., et al., Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies. Diabetes, 2007. 56(12): p. 3033-44.

316. Ioannidis, J.P., Population-wide generalizability of genome-wide discovered associations. J Natl Cancer Inst, 2009. 101(19): p. 1297-9.

317. Shu, X.O., et al., Identification of new genetic risk variants for type 2 diabetes. PLoS Genet, 2010. 6(9): p. e1001127.

318. Yamauchi, T., et al., A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet, 2010. 42(10): p. 864-8.

319. Tsai, F.J., et al., A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet, 2010. 6(2): p. e1000847.

320. Unoki, H., et al., SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat Genet, 2008. 40(9): p. 1098-102.

321. Mailman, M.D., et al., The NCBI dbGaP database of genotypes and phenotypes. Nat Genet, 2007. 39(10): p. 1181-6.

322. Environmental Working Group and Commonweal. EWG || Human Toxome Project. [cited 2009 11/11/2009]; Available from: http://www.ewg.org/sites/humantoxome/.
