search engine for e neu network science 080817

Building a search engine to find environmental factors associated with disease and health Chirag J Patel Center for Complex Network Research Northeastern University 8/8/17 [email protected] @chiragjp www.chiragjpgroup.org

Upload: chirag-patel

Post on 23-Jan-2018


Category:

Health & Medicine



TRANSCRIPT

Page 1: Search engine for E NEU network science 080817

Building a search engine to find environmental factors associated with

disease and health

Chirag J Patel

Center for Complex Network Research

Northeastern University

8/8/17

[email protected]

@chiragjp

www.chiragjpgroup.org

Page 2: Search engine for E NEU network science 080817

P = G + E

Phenotype (P): Type 2 Diabetes; Cancer; Alzheimer’s; Gene expression

Genome (G): Variants

Environment (E): Infectious agents; Diet + Nutrients; Pollutants; Drugs

Page 3: Search engine for E NEU network science 080817

We are great at G investigation!

2,940 (as of 6/1/17) 36,066 G-P associations

Genome-wide Association Studies (GWAS): https://www.ebi.ac.uk/gwas/

G

Page 4: Search engine for E NEU network science 080817

Nothing comparable to elucidate E influence!

E: ???

We lack high-throughput methods and data to discover new E in P…

Page 5: Search engine for E NEU network science 080817

A similar paradigm for discovery should exist for E!

Why?

Page 6: Search engine for E NEU network science 080817

$\sigma^2_P = \sigma^2_G + \sigma^2_E$

Page 7: Search engine for E NEU network science 080817

$H^2 = \sigma^2_G / \sigma^2_P$

Heritability ($H^2$) is the range of phenotypic variability attributed to genetic variability in a population

Indicator of the proportion of phenotypic differences attributed to G.
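As a worked illustration of the heritability ratio, the sketch below computes H^2 from variance components and the share left over for E, assuming the additive model in which phenotypic variance is the sum of the G and E variances (no interaction or covariance terms). The function name and the numbers are hypothetical and chosen only for illustration.

```python
def heritability(var_g: float, var_e: float) -> float:
    """Return H^2 = var_G / var_P, assuming var_P = var_G + var_E."""
    if var_g < 0 or var_e < 0:
        raise ValueError("variances must be non-negative")
    var_p = var_g + var_e  # additive model: no GxE interaction or covariance
    if var_p == 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_g / var_p


# Hypothetical variance components, not taken from any study.
h2 = heritability(var_g=1.0, var_e=3.0)
e_share = 1.0 - h2  # share of phenotypic variance attributed to E
print(f"H^2 = {h2:.2f}, share attributed to E = {e_share:.2f}")
```

Under this additive assumption, a trait with low heritability leaves most of its phenotypic variance to E, which is the motivation the following slides build on.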

Page 8: Search engine for E NEU network science 080817

[Figure: bar chart of heritability, Var(G)/Var(Phenotype), on a 0 to 100 axis, one bar per trait. Source: SNPedia.com]

Traits shown: Eye color; Hair curliness; Type-1 diabetes; Height; Schizophrenia; Epilepsy; Graves' disease; Celiac disease; Polycystic ovary syndrome; Attention deficit hyperactivity disorder; Bipolar disorder; Obesity; Alzheimer's disease; Anorexia nervosa; Psoriasis; Bone mineral density; Menarche, age at; Nicotine dependence; Sexual orientation; Alcoholism; Lupus; Rheumatoid arthritis; Crohn's disease; Migraine; Thyroid cancer; Autism; Blood pressure, diastolic; Body mass index; Depression; Coronary artery disease; Insomnia; Menopause, age at; Heart disease; Prostate cancer; QT interval; Breast cancer; Ovarian cancer; Hangover; Stroke; Asthma; Blood pressure, systolic; Hypertension; Osteoarthritis; Parkinson's disease; Longevity; Type-2 diabetes; Gallstone disease; Testicular cancer; Cervical cancer; Sciatica; Bladder cancer; Colon cancer; Lung cancer; Leukemia; Stomach cancer

G estimates for burdensome diseases are low and variable: massive opportunity for high-throughput E discovery

Type 2 Diabetes

Heart Disease

Autism (50%???)

Page 9: Search engine for E NEU network science 080817

[Figure: the heritability bar chart from Page 8, Var(G)/Var(Phenotype) on a 0 to 100 axis. Source: SNPedia.com]

G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery

σ2E : Exposome!

Page 10: Search engine for E NEU network science 080817

It took a new paradigm of GWAS for discovery: Human Genome Project to GWAS

Sequencing of the genome

2001

HapMap project: http://hapmap.ncbi.nlm.nih.gov/

Characterize common variation

2001-current day

High-throughput variant assay

< $99 for ~1M variants

Measurement tools

~2003 (ongoing)

ARTICLES

Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls

The Wellcome Trust Case Control Consortium*

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10^-7: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10^-5 and 5 × 10^-7) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

Despite extensive research efforts for more than a decade, the genetic basis of common human diseases remains largely unknown. Although there have been some notable successes [1], linkage and candidate gene association studies have often failed to deliver definitive results. Yet the identification of the variants, genes and pathways involved in particular diseases offers a potential route to new therapies, improved diagnosis and better disease prevention. For some time it has been hoped that the advent of genome-wide association (GWA) studies would provide a successful new tool for unlocking the genetic basis of many of these common causes of human morbidity and mortality [1].

Three recent advances mean that GWA studies that are powered to detect plausible effect sizes are now possible [2]. First, the International HapMap resource [3], which documents patterns of genome-wide variation and linkage disequilibrium in four population samples, greatly facilitates both the design and analysis of association studies. Second, the availability of dense genotyping chips, containing sets of hundreds of thousands of single nucleotide polymorphisms (SNPs) that provide good coverage of much of the human genome, means that for the first time GWA studies for thousands of cases and controls are technically and financially feasible. Third, appropriately large and well-characterized clinical samples have been assembled for many common diseases.

The Wellcome Trust Case Control Consortium (WTCCC) was formed with a view to exploring the utility, design and analyses of GWA studies. It brought together over 50 research groups from the UK that are active in researching the genetics of common human diseases, with expertise ranging from clinical, through genotyping, to informatics and statistical analysis. Here we describe the main experiment of the consortium: GWA studies of 2,000 cases and 3,000 shared controls for 7 complex human diseases of major public health importance: bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). Two further experiments undertaken by the consortium will be reported elsewhere: a GWA study for tuberculosis in 1,500 cases and 1,500 controls, sampled from The Gambia; and an association study of 1,500 common controls with 1,000 cases for each of breast cancer, multiple sclerosis, ankylosing spondylitis and autoimmune thyroid disease, all typed at around 15,000 mainly non-synonymous SNPs. By simultaneously studying seven diseases with differing aetiologies, we hoped to develop insights, not only into the specific genetic contributions to each of the diseases, but also into differences in allelic architecture across the diseases. A further major aim was to address important methodological issues of relevance to all GWA studies, such as quality control, design and analysis. In addition to our main association results, we address several of these issues below, including the choice of controls for genetic studies, the extent of population structure within Great Britain, sample sizes necessary to detect genetic effects of varying sizes, and improvements in genotype-calling algorithms and analytical methods.

Samples and experimental analyses

Individuals included in the study were living within England, Scotland and Wales (‘Great Britain’) and the vast majority had

*Lists of participants and affiliations appear at the end of the paper.

Vol 447 | 7 June 2007 | doi:10.1038/nature05911

661

©2007 Nature Publishing Group

WTCCC, Nature, 2007.

Comprehensive, high-throughput analyses

GWAS

Page 11: Search engine for E NEU network science 080817

Explaining the other 50%: A data-driven paradigm for robust discovery of E in disease via EWAS and the exposome

what to measure? how to measure?

www.sciencemag.org SCIENCE VOL 330 22 OCTOBER 2010 461

PERSPECTIVES

[Figure: the exposome. External environment: RADIATION; DIET; POLLUTION; INFECTIONS; DRUGS; LIFE-STYLE; STRESS. Internal chemical environment: Xenobiotics; Inflammation; Preexisting disease; Lipid peroxidation; Oxidative stress; Gut flora. Chemical classes: Reactive electrophiles; Metals; Endocrine disrupters; Immune modulators; Receptor-binding proteins]

critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly to the person’s internal chemical environment. Once important exposures have been identified in blood samples, additional testing could determine their sources and methods to reduce them.

To make the top-down approach feasible, the exposome would comprise a profile of the most prominent classes of toxicants that are known to cause disease, namely, reactive electrophiles, endocrine (hormone) disruptors, modulators of immune responses, agents that bind to cellular receptors, and metals. Exposures to these agents can be monitored in the blood either by direct measurement or by looking for their effects on physiological processes (such as metabolism). These processes generate products that serve as signatures and biomarkers in the blood. For example, reactive electrophiles, which constitute the largest class of toxic chemicals (6), cannot generally be measured in the blood. However, metabolites of electrophiles are detectable in serum (9), and products of their reactions with blood nucleophiles, like serum albumin, offer possible signatures (10). Estrogenic activity could be used to monitor the effect of endocrine disruptors and can be measured through serum biomarkers. Immune modulators trigger the production of cytokines and chemokines that also can be measured in serum. Chemicals that bind to cellular receptors stimulate the production of serum biomarkers that can be detected with high-throughput screens (11). Metals are readily measured in blood (12), as are hormones, antibodies to pathogens, and proteins released by cells in response to stress. The accumulation of biologically important exposures may also be detected as changes to lymphocyte gene expression or in chemical modifications of DNA (such as methylation) (13).

The environmental equivalent of a GWAS is possible when signatures and biomarkers of the exposome are characterized in humans with known health outcomes. Indeed, a relevant prototype for such a study examined associations between type 2 diabetes and 266 candidate chemicals measured in blood or urine (14). It determined that exposure to certain chemicals produced strong associations with the risk of type 2 diabetes, with effect sizes comparable to the strongest genetic loci reported in GWAS. In another study, chromosome (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).

Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.

With successful characterization of both exposomes and genomes, environmental and genetic determinants of chronic diseases can be united in high-resolution studies that examine gene-environment interactions. Such a union might even push the nature-versus-nurture debate toward resolution.

References and Notes

1. P. Lichtenstein et al., N. Engl. J. Med. 343, 78 (2000).
2. L. A. Hindorff et al., Proc. Natl. Acad. Sci. U.S.A. 106, 9362 (2009).
3. W. C. Willett, Science 296, 695 (2002).
4. P. Vineis, Int. J. Epidemiol. 33, 945 (2004).
5. I. Dalle-Donne et al., Clin. Chem. 52, 601 (2006).
6. D. C. Liebler, Chem. Res. Toxicol. 21, 117 (2008).
7. C. P. Wild, Cancer Epidemiol. Biomarkers Prev. 14, 1847 (2005).
8. http://dels.nas.edu/envirohealth/exposome.shtml
9. W. B. Dunn et al., Int. J. Epidemiol. 37 (suppl. 1), i23 (2008).
10. F. M. Rubino et al., Mass Spectrom Rev. 28, 725 (2009).
11. T. I. Halldorsson et al., Environ. Res. 109, 22 (2009).
12. S. Mounicou et al., Chem. Soc. Rev. 38, 1119 (2009).
13. C. M. McHale et al., Mutat. Res. 10.1016/j.mrrev.2010.04.001 (2010).
14. C. J. Patel et al., PLoS ONE 5, e10746 (2010).
15. E. S. Epel et al., Proc. Natl. Acad. Sci. U.S.A. 101, 17312 (2004).
16. F. S. Collins et al., Science 300, 286 (2003).
17. Supported by NIEHS through grants U54ES016115 and P42ES04705.

Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.

CREDIT: N. KEVITIYAGALA/SCIENCE; (PHOTO CREDITS) (LEFT, TOP FIVE IMAGES) THINKSTOCK.COM; (LEFT, TWO IMAGES FROM BOTTOM) ISTOCKPHOTO.COM; (RIGHT) THINKSTOCKPHOTOS.COM

10.1126/science.1192603

Published by AAAS


“A more comprehensive view of environmental exposure is needed ... to discover major causes of diseases...”

how to analyze in relation to health?

Wild, 2005, 2012

Rappaport and Smith, 2010, 2011

Buck-Louis and Sundaram, 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPA, 2014

Page 12: Search engine for E NEU network science 080817

Promises and Challenges in creating a search engine for E in P

High-throughput E = discovery!

systematic; reproducible

multiple hypothesis control

prioritization

[Figure: the heritability bar chart from Page 8, Var(G)/Var(Phenotype) on a 0 to 100 axis]

Arjun Manrai (Yuxia Cui, David Balshaw)

ARPH 2016; JAMA 2014; JECH 2014

σ2E : Exposome!

Page 13: Search engine for E NEU network science 080817

Examples of exposome-driven discovery machinery, or EWASs

Page 14: Search engine for E NEU network science 080817

Gold standard for breadth of human exposure information: National Health and Nutrition Examination Survey1

since the 1960s

now biennial: 1999 onwards

10,000 participants per survey

Introduction

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES program began in the early 1960s and has been conducted as a series of surveys focusing on different population groups or health topics. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.

Survey Content

As in past health examination surveys, data will be collected on the prevalence of chronic conditions in the population. Estimates for previously undiagnosed conditions, as well as those known to and reported by respondents, are produced through the survey. Such information is a particular strength of the NHANES program.

Risk factors, those aspects of a person’s lifestyle, constitution, heredity, or environment that may increase the chances of developing a certain disease or condition, will be examined. Smoking, alcohol consumption, sexual practices, drug use, physical fitness and activity, weight, and dietary intake will be studied. Data on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices, will also be collected.

The diseases, medical conditions, and health indicators to be studied include:

• Anemia
• Cardiovascular disease
• Diabetes
• Environmental exposures
• Eye diseases
• Hearing loss
• Infectious diseases
• Kidney disease
• Nutrition
• Obesity
• Oral health
• Osteoporosis
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision

The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations

Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.

Uses of the Data

Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey.

NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.

1 http://www.cdc.gov/nchs/nhanes.htm

>250 exposures (serum + urine)

GWAS chip

>85 quantitative clinical traits (e.g., serum glucose, lipids, body mass index)

Death index linkage (cause of death)

Page 15: Search engine for E NEU network science 080817

Gold standard for breadth of exposure & behavior data: National Health and Nutrition Examination Survey

Nutrients and Vitamins: vitamin D, carotenes

Infectious Agents: hepatitis, HIV, Staph. aureus

Plastics and consumables: phthalates, bisphenol A

Physical Activity: e.g., steps

Pesticides and pollutants: atrazine; cadmium; hydrocarbons

Drugs: statins; aspirin

Page 16: Search engine for E NEU network science 080817

What E are associated with aging: all-cause mortality and telomere length?

Int J Epidem 2013; Int J Epidem 2016

Page 17: Search engine for E NEU network science 080817

How does it work? Searching for exposures and behaviors associated with all-cause mortality.

NHANES: 1999-2004
National Death Index linked mortality

246 behaviors and exposures (serum/urine/self-report)

NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)

~5.5 years of followup

Cox proportional hazards
baseline exposure and time to death

False discovery rate < 5%

NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)

~2.8 years of followup

p < 0.05

Int J Epidem 2013
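The screening logic on this slide can be sketched as a two-stage filter: keep the exposures that pass a false discovery rate threshold in one set of cohorts, then keep those that also reach p < 0.05 in a later cohort. This is a minimal sketch under stated assumptions: the slide does not say which FDR procedure was used, so the Benjamini-Hochberg step-up procedure is swapped in as a common choice; reading the two cohort blocks as discovery followed by replication is an interpretation of the slide; and the exposure names, p-values, and function names are hypothetical. The Cox regression that would produce each p-value is not shown.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


def ewas_screen(discovery_p, replication_p, fdr=0.05, replication_alpha=0.05):
    """Keep exposures that pass FDR in discovery and p < replication_alpha in replication."""
    names = list(discovery_p)
    passed = benjamini_hochberg([discovery_p[n] for n in names], alpha=fdr)
    discovered = [n for n, ok in zip(names, passed) if ok]
    # An exposure with no replication p-value is treated as not replicated.
    return [n for n in discovered if replication_p.get(n, 1.0) < replication_alpha]


# Hypothetical p-values, one per exposure, as if taken from per-exposure survival models.
discovery = {"exposure_a": 0.0001, "exposure_b": 0.004, "exposure_c": 0.04,
             "exposure_d": 0.2, "exposure_e": 0.6}
replication = {"exposure_a": 0.01, "exposure_b": 0.3}

print(ewas_screen(discovery, replication))
```

Controlling the FDR across all tested exposures, rather than testing each at p < 0.05, is what keeps a screen of hundreds of factors from being dominated by chance findings.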

Page 18: Search engine for E NEU network science 080817

EWAS in all-cause mortality: 253 exposure/behavior associations in survival

[Figure: volcano plot; x-axis: Adjusted Hazard Ratio (0.4 to 2.8); y-axis: -log10(pvalue) (0 to 8). Legend: FDR < 5%; sociodemographics; replicated factor. Counts shown on the plot: (11), (69)]

Numbered exposures and behaviors: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene

Numbered sociodemographic factors: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic

Sociodemographics = age, sex, income, education, race/ethnicity, occupation [in red]

Int J Epidem 2013

Page 19: Search engine for E NEU network science 080817

EWAS identifies factors associated with all-cause mortality: Volcano plot of 200 associations

[Figure: volcano plot; x-axis: Adjusted Hazard Ratio (0.4 to 2.8); y-axis: -log10(pvalue) (0 to 8); same numbered point legend as on Page 18]

Annotated points: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black; any one smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1SD]; physical activity [low, moderate, high activity]*

Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in red])

*derived from METs per activity and categorized by Health.gov guidelines

R² ~ 2%

Page 20: Search engine for E NEU network science 080817

few more examples: https://paperpile.com/shared/PtvEae

diabetes; preterm birth; income; blood pressure; lipids; kidney disease; telomere length; mortality

Page 21: Search engine for E NEU network science 080817

Promises and Challenges in creating a search engine for E in P

High-throughput assays of E!
scalable and standard technologies

ARPH 2016; JAMA 2014; JECH 2014

Big data = big bias!
Confounding; reverse causality

Dense correlational web of E and P

Fragmented and small E-P associations

Influence of time and life-course

Page 22: Search engine for E NEU network science 080817

Challenge to scale absolute E due to heterogeneity and large dynamic range.

Rappaport et al, EHP 2015

Untargeted

Targeted

Page 23: Search engine for E NEU network science 080817

• Getting cheaper, but still not “at scale”
• relative not absolute
• identification of chemical analytes is an art
• detection limits not low enough for E

Page 24: Search engine for E NEU network science 080817

Promises and Challenges in creating a search engine for E in P

High-throughput assays of E!
scalable and standard technologies

ARPH 2016; JAMA 2014; JECH 2014

Big data = big bias!
Confounding; reverse causality

Dense correlational web of E and P

Fragmented and small E-P associations

Influence of time and life-course

Page 25: Search engine for E NEU network science 080817

Example of fragmentation: Is everything we eat associated with cancer?

Schoenfeld and Ioannidis, AJCN 2012

50 random ingredients from Boston Cooking School Cookbook

Any associated with cancer?

The effect estimates are shown in Figure 1 by malignancy type or by ingredient for the 20 ingredients for which ≥10 articles were identified. Gastrointestinal malignancies were the most commonly studied (45%), followed by genitourinary (14%), breast (14%), head and neck (9%), lung (5%), and gynecologic (5%) malignancies.

The distribution of standardized (z) scores associated with P values was bimodal, with peaks corresponding to nominally statistically significant results and a trough in the middle corresponding to the sparse nonsignificant results (Figure 2, left panel). The bimodal peaks and middle trough pattern were even more prominent for results reported in the abstracts: 62% of the nominally statistically significant effect estimates were reported in abstracts, whereas most (70%) of the nonsignificant results appeared only in the full text and not in the abstracts (P < 0.0001).

Meta-analyses

Thirty-six relevant effect estimates were obtained from meta-analyses (see Supplementary Table 2 under “Supplemental data” in the online issue). Author conclusions and the respective effect estimates are summarized in Table 1.

Thirty-three (92%) of the 36 estimates pertained to comparisons of the lowest with the highest levels of consumption, but most of these meta-analyses combined studies that had different exposure contrasts. For example, one meta-analysis (39) combined studies

FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies are shown. Three outliers are not shown (effect estimates >10).

4 of 8 SCHOENFELD AND IOANNIDIS

Of 50, 40 studied in cancer risk

Weak statistical evidence:
non-replicated
inconsistent effects
non-standardized

Page 26: Search engine for E NEU network science 080817

Are all the drugs we take associated with cancer?

Sci Reports 2016

Associated all (~500) drugs prescribed in the entire population of Sweden (N=9M) with time to cancer

Assessed 2 modeling techniques (Cox and case-crossover)

Page 27: Search engine for E NEU network science 080817

any cancer: 141 (26%)
prostate: 56 (10%)
breast: 41 (7%)
colon: 14 (3%)

What drugs are associated with time to cancer? Too many to be plausible (up to 26%!)

Sci Reports 2016

Modest concordance between Cox and case-crossover: 12 out of 141!

Most correlations small (HR < 1.1); residual confounding?

Page 28: Search engine for E NEU network science 080817

Distribution of associations and p-values due to model choice: Estimating the Vibration of Effects (or Risk)

Variable of Interest: e.g., 1 SD of log(serum Vitamin D)

Adjusting Variable Set (n=13):
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity

All-subsets Cox regression: 2^13 + 1 = 8,193 models

Data Source: NHANES 1999-2004; 417 variables of interest; time to death; N ≥ 1000 (≥ 100 deaths)

effect sizes, p-values

Vibration of Effects:
Relative Hazard Ratio (RHR) = HR99 / HR1
Range of P-value (RP) = -log10(p-value1) + log10(p-value99)
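The two vibration-of-effects summaries above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code behind the slide: it assumes HR1/HR99 and p-value1/p-value99 are the 1st and 99th percentiles across all fitted models (the figure's percentile indicators suggest this), it takes the per-model hazard ratios and p-values as already computed, and the names `all_subsets`, `vibration_of_effects`, and the placeholder adjusters `v1`..`v13` are hypothetical. It enumerates the 2^13 subsets of adjusters; the slide counts one additional model, which is not specified here.

```python
import math
from itertools import combinations
from statistics import quantiles


def all_subsets(adjusters):
    """Yield every subset of the adjusting variables, including the empty set."""
    for k in range(len(adjusters) + 1):
        yield from combinations(adjusters, k)


def vibration_of_effects(hazard_ratios, p_values):
    """Return (RHR, RP) from per-model hazard ratios and p-values.

    Assumption: HR1/HR99 and p1/p99 are the 1st and 99th percentiles
    taken across all models.
    """
    hr_pct = quantiles(hazard_ratios, n=100, method="inclusive")
    p_pct = quantiles(p_values, n=100, method="inclusive")
    hr1, hr99 = hr_pct[0], hr_pct[98]
    p1, p99 = p_pct[0], p_pct[98]
    rhr = hr99 / hr1                         # Relative Hazard Ratio
    rp = -math.log10(p1) + math.log10(p99)   # Range of P-value
    return rhr, rp


# Hypothetical placeholder names standing in for the 13 adjusting variables.
adjusters = [f"v{i}" for i in range(1, 14)]
n_models = sum(1 for _ in all_subsets(adjusters))  # 2**13 subsets
```

In a real analysis each subset would be passed to a Cox regression routine to produce one hazard ratio and one p-value for the variable of interest; that fitting step is deliberately left out of the sketch.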

Figure: −log10(p-value) versus hazard ratio across all models, with 1st, 50th, and 99th percentile indicators on both axes and the median p-value/HR for each number k of adjusting variables (k = 0 to 13).

Vitamin D (1SD(log)): RHR = 1.14, RP = 4.68

Thyroxine (1SD(log)): RHR = 1.15, RP = 2.90

JCE, 2015

http://bit.ly/effectvibration

Page 29: Search engine for E NEU network science 080817

Promises and Challenges in creating a search engine for E in P

High-throughput assays of E! Scalable and standard technologies

ARPH 2016, JAMA 2014, JECH 2014

Big data = big bias! Confounding; reverse causality

Dense correlational web of E and P

Fragmented and small E-P associations

Influence of time and life-course

Page 30: Search engine for E NEU network science 080817

Interdependencies of the exposome: Correlation globes paint a complex view of exposure

Red: positive ρ
Blue: negative ρ
thickness: |ρ|

for each pair of E: Spearman ρ (575 factors: 81,937 correlations)

permuted data to produce “null ρ”

sought replication in > 1 cohort

Pac Symp Biocomput. 2015; JECH. 2015
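The recipe on this slide (a Spearman ρ for each pair of exposures, compared against a "null ρ" obtained by permuting the data) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the permutation scheme (shuffling one variable and recomputing ρ) is one common way to build a null and may differ from the published pipeline, and inputs must not be constant.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for Spearman rho (y is shuffled)."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(spearman(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Running `spearman` over every pair of exposure columns and keeping the pairs whose permutation p-value survives multiplicity correction would give the edges of a correlation globe; replication in a second cohort is a separate step not shown here.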

Page 31: Search engine for E NEU network science 080817

Interdependencies of the exposome: Correlation globes paint a complex view of exposure

Red: positive ρ
Blue: negative ρ
thickness: |ρ|

for each pair of E: Spearman ρ (575 factors: 81,937 correlations)

permuted data to produce “null ρ”

sought replication in > 1 cohort

Pac Symp Biocomput. 2015; JECH. 2015

Effective number of variables: 500 (10% decrease)

Page 32: Search engine for E NEU network science 080817

Does my single association between E and P matter?

Page 33: Search engine for E NEU network science 080817

Does my association between E and P matter in the entire possible space of associations?

ARPH 2017 Hum Genet 2012

JECH 2014 Curr Epidemiol Rep 2017 Curr Env Health Rep 2016

Figure: grid of P phenotypic factors (rows, 1 to p) by E exposure factors (columns, 1 to e).

which ones to test? all?

the ones in blue?

E times P possibilities! how to detect signal from noise?

Page 34: Search engine for E NEU network science 080817

Scaling up the search in multiple (m=157) phenotypes: does my single association between E and P matter?

Body Measures: Body Mass Index, Height

Blood pressure & fitness: Systolic BP, Diastolic BP, Pulse rate, VO2 Max

Metabolic: Glucose, LDL-Cholesterol, Triglycerides

Inflammation: C-reactive protein, white blood cell count

Kidney function: Creatinine, Sodium, Uric Acid

Liver function: Aspartate aminotransferase, Gamma glutamyltransferase

Aging: Telomere length, Time to death

Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier

Page 35: Search engine for E NEU network science 080817

Creation of a phenotype-exposure association map: A 2-D view of 209 phenotype by 514 exposure associations

Association Size: > 0, < 0

504 E exposure and diet indicators × 209 clinical trait phenotypes

NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)

Median N: 150-5000 per survey

~83,092 E-P associations!

significant associations (FDR < 5%)

adjusted by age, age², sex, race, income
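The "FDR < 5%" filter above can be illustrated with a short sketch. The slide does not say which false discovery rate procedure was used, so the Benjamini-Hochberg step-up procedure is swapped in here as a common choice; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up: find the largest rank k such that the
    k-th smallest p-value is <= k * fdr / m, then reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for ten E-P associations, for illustration only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(example, fdr=0.05)
```

Applied to the full set of E-P associations, the same function would take one p-value per exposure-phenotype pair after adjustment for the covariates listed above.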

Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier

Figure axes: 209 phenotypes by 514 exposures

Page 36: Search engine for E NEU network science 080817

Figure: heatmap of phenotypes (rows) by exposures (columns). Exposure labels run from dietary nutrients (e.g., Alpha-carotene) through smoking, alcohol, hydrocarbons, phthalates, metals, volatile compounds, dioxins, pesticides, PCBs, perfluorinated compounds, and furans to infectious agents (e.g., Herpes II); phenotype labels run from urine albumin to Body Mass Index. Color key and histogram: value from -0.4 to 0.4, count from 0 to 150. Annotated regions: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; blood parameters; hydrocarbons.

EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E

R2: ~1-40% (average of 20%)

Page 37: Search engine for E NEU network science 080817

83,092 total associations between E and P

12,237 significant associations (6%, in yellow):

Average association size: 0.6% for 1SD change in E

Figure axis: percent change for 1 SD increase (7%, -6%)

Page 38: Search engine for E NEU network science 080817

Figure: heatmap of phenotypes (rows) by exposures (columns); color key and histogram with value from -0.4 to 0.4 and count from 0 to 150.

EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E:

does my correlation matter?

Page 39: Search engine for E NEU network science 080817

EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E:

does my correlation matter?

Figure: heatmap of phenotypes (rows) by exposures (columns); color key and histogram with value from -0.4 to 0.4 and count from 0 to 150.

Page 40: Search engine for E NEU network science 080817

Figure: heatmap of phenotypes (rows) by exposures (columns); color key and histogram with value from -0.4 to 0.4 and count from 0 to 150.

EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E:

does my correlation matter?

Page 41: Search engine for E NEU network science 080817

Figure: heatmap of phenotypes (rows) by exposures (columns); color key and histogram with value from -0.4 to 0.4 and count from 0 to 150.

EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E:

does my correlation matter?

total sugar

Polyunsaturatedfats

Vitamin D

Cotinine

Page 42: Search engine for E NEU network science 080817

High-throughput data analytics to mitigate analytical challenges of exposome-based research:

Consider multiplicity of hypotheses and correlational web!

Does my correlation matter? How does my new correlation (ρ) compare to the family of correlations? What is the total variance explained (σ2E)?

Saturated fatty acids and HA1C: 0.5%. Does it matter? (i.e., 1.2% is average!)

ARPH 2016; JAMA 2014; JECH 2015

Be explicit about the number of hypotheses tested

False discovery rate; family-wise error rate; report database size!
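The two error rates named above can be illustrated with a minimal sketch. It assumes a Bonferroni correction for the family-wise error rate and the Benjamini-Hochberg step-up procedure for the false discovery rate; the slides do not say which procedures were used, and the p-values below are made up for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Family-wise error rate control: reject when p <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """False discovery rate control (Benjamini-Hochberg step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for 8 exposure-phenotype tests (not from the talk).
pvals = [0.001, 0.008, 0.012, 0.02, 0.04, 0.2, 0.5, 0.9]

print("hypotheses tested:", len(pvals))  # report the size of the search
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvals)))
```

Printing the number of hypotheses alongside the rejections follows the slide's point that the size of the search should be reported with the findings.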

[Figure: grid of E exposure factors (labeled 1, 2, …, e-2, e-1, e) against P phenotypic factors (labeled 1, 2, …, p-2, p-1, p).]
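One way to ask how a new correlation compares to the family of correlations is to rank its variance explained within the full set of exposure-phenotype results. The sketch below is an illustration only: the family values are invented (chosen so that their mean is 1.2%, the average quoted on the slide), and the helper names are hypothetical.

```python
def variance_explained_percent(rho):
    """Percent of variance explained by a correlation coefficient rho."""
    return 100.0 * rho ** 2


def percentile_in_family(value, family):
    """Fraction of the family with variance explained <= value."""
    return sum(1 for v in family if v <= value) / len(family)


# Hypothetical family of variance-explained values, in percent (made up).
family_r2 = [0.1, 0.3, 0.5, 0.8, 1.2, 1.5, 2.0, 3.2]
new_r2 = 0.5  # the 0.5% finding quoted on the slide

family_mean = sum(family_r2) / len(family_r2)
rank = percentile_in_family(new_r2, family_r2)

print(f"family mean: {family_mean:.2f}%")
print(f"new finding sits at the {100 * rank:.1f}th percentile of the family")
```

Under these made-up numbers the new finding falls below the family average, which is the kind of context the slide asks for before a single correlation is interpreted.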

Page 43: Search engine for E NEU network science 080817

http://chiragjpgroup.org/exposome-analytics-course

Nam Pho

Please contact me for help or project ideas!

Page 44: Search engine for E NEU network science 080817

[Figure: chart of heritability, Var(G)/Var(Phenotype), on a scale of 0 to 100, for traits listed from eye color to stomach cancer. Source: SNPedia.com]

Using high-throughput tools and data (e.g., the exposome) will enhance discovery of the role of E (and G) in P.

Page 45: Search engine for E NEU network science 080817

In conclusion:

Data science-inspired approaches to ascertain the exposome and genome will enable biomedical discovery

Dense correlations, confounding, reverse causality: how to assess at high dimension?

Understand interacting G and E for causation

Mitigate fragmented literature of associations.

September 2011, p. 119

Multiple modelling

This problem is akin to – but less well recognised and more poorly understood than – multiple testing. For example, consider the use of linear regression to adjust the risk levels of two treatments to the same background level of risk. There can be many covariates, and each set of covariates can be in or out of the model. With ten covariates, there are over 1000 possible models. Consider a maze as a metaphor for modelling (Figure 3). The red line traces the correct path out of the maze. The path through the maze looks simple, once it is known. Returning to a linear regression model, terms can be put into and taken out of a regression model. Once you get a p-value smaller than 0.05, the model can be frozen and the model selection justified after the fact. It is easy to justify each turn.
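The count of possible models follows from each covariate being either in or out of the model. A minimal sketch that enumerates every subset of ten covariates (the covariate names are placeholders):

```python
from itertools import combinations


def all_models(covariates):
    """Yield every subset of covariates, from the empty model to the full one."""
    for k in range(len(covariates) + 1):
        for subset in combinations(covariates, k):
            yield subset


# Ten placeholder covariate names: x1 ... x10.
covariates = [f"x{i}" for i in range(1, 11)]

n_models = sum(1 for _ in all_models(covariates))
print(n_models)  # 2 ** 10 possible models
```

Each added covariate doubles the count, so the search space grows quickly even before multiple outcomes or exposures are considered.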

The combination of multiple testing and multiple modelling can lead to a very large search space, as the example of bisphenol A in Box 3 shows. Such large search spaces can give small, false positive p-values somewhere within them. Unfortunately, authors and consumers are often like a deer caught in the headlights and take a small p-value as indicating a real effect.

How can it be fixed? A new, combined strategy

It should be clear by now that more than small-scale remedies are needed. The entire system of observational studies and the claims that are made from them is no longer functional, nor is it fit for purpose. What can be done to fix this broken system? There are no principled ways in the literature for dealing with model selection, so we propose a new, composite strategy. Following Deming, it is based not upon the workers – the researchers – but on the production system managers – the funding agencies and the editors of the journals where the claims are reported.

We propose a multi-step strategy to help bring observational studies under control (see Table 2). The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set. We now cover the steps, one by one.

1 The data collection and clean-up should be done by a group separate from the analysis group. There can be a temptation on the part of the analyst to do some exploratory data analysis during the data clean-up. Exploratory analysis could lead to model selection bias.

Box 2. Publication bias

There is general recognition that a paper has a much better chance of acceptance if something new is found. This means that, for publication, the claim in the paper has to be based on a p-value less than 0.05. From Deming’s point of view5, this is quality by inspection. The journals are placing heavy reliance on a statistical test rather than examination of the methods and steps that lead to a conclusion. As to having a p-value less than 0.05, some might be tempted to game the system10 through multiple testing, multiple modelling or unfair treatment of bias, or some combination of the three that leads to a small p-value. Researchers can be quite creative in devising a plausible story to fit the statistical finding.

2 The data cleaning team creates a modelling data set and a holdout set and gives the modelling data set, less the item to be predicted, to the analyst for examination.

P < 0.05

Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia

Table 2. Steps 0–7 can be used to help bring the observational study process into control. Currently researchers analysing observational data sets are under no effective oversight.

Step  Process / Action
0     Data are made publicly available
1     Data cleaning and analysis separate
2     Split sample: A, modelling; and B, holdout (testing)
3     Analysis plan is written, based on modelling data only
4     Written protocol, based on viewing predictor variables of A
5     Analysis of A only data set
6     Journal accepts paper based on A only
7     Analysis of B data set gives Addendum
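The split-sample step can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the article: the 50/50 split, the fixed random seed, the record layout, and the helper names are all hypothetical.

```python
import random


def split_sample(records, holdout_fraction=0.5, seed=0):
    """Randomly split records into a modelling set (A) and a holdout set (B)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def blind_outcome(records, outcome_key):
    """Drop the item to be predicted before the analyst sees the data."""
    return [{k: v for k, v in r.items() if k != outcome_key} for r in records]


# Made-up records: an id, one exposure, and the outcome to be predicted.
records = [{"id": i, "exposure": i % 3, "outcome": i % 2} for i in range(10)]

modelling, holdout = split_sample(records)
for_analyst = blind_outcome(modelling, "outcome")

print(len(modelling), len(holdout))
print(sorted(for_analyst[0].keys()))
```

The holdout set stays with the data cleaning team until the paper has been accepted on the modelling set alone, which is the point of steps 6 and 7 in the table.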

EWASs in aging: mortality and quantitative traits

[Figure: -log10(pvalue) (0 to 8) plotted against adjusted hazard ratio (0.4 to 2.8), with two sets of numbered points.
First legend: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene.
Second legend: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic.
Other annotations: (11), (69), ?]

Page 46: Search engine for E NEU network science 080817

Acknowledgements

RagGroup: Nam Pho, Jake Chung, Kajal Claypool, Arjun Manrai, Chirag Lakhani, Adam Brown, Danielle Rasooly, Alan LeGoallec, Sivateja Tangirala

Harvard DBMI: Susanne Churchill, Nathan Palmer, Sophia Mamousette, Sunny Alvear, Michal Preminger

Amar Dhand Center for Complex Networks

Mentioned Collaborators: Isaac Kohane, John Ioannidis, Dennis Bier, Hugo Aschard

NIH Common Fund
Big Data to Knowledge

Chirag J Patel
[email protected]
@chiragjp
www.chiragjpgroup.org