spiral.imperial.ac.uk  · web view2019. 4. 11. · high throughput sequencing in respiratory,...

73
High Throughput Sequencing in Respiratory, Critical Care and Sleep Medicine Research: An Official American Thoracic Society Workshop Report Craig P. Hersh 1 , Ian M. Adcock 2 , Juan C. Celedón 3 , Michael H. Cho 1 , David C. Christiani 4 , Blanca E. Himes 5 , Naftali Kaminski 6 , Rasika A. Mathias 7 , Deborah A. Meyers 8 , John Quackenbush 9 , Susan Redline 10 , Katrina A. Steiling 11 , Holly K. Tabor 12 , Martin D. Tobin 13, 14 , Mark M. Wurfel 15 , Ivana V. Yang 16 , and Gerard H. Koppelman 17, 18 on behalf of the American Thoracic Society Section on Genetics and Genomics 1. Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA; Harvard Medical School, Boston, MA, USA 2. Airways Disease Section, National Heart and Lung Institute, Imperial College London, UK 3. Division of Pediatric Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, PA, USA 4. Harvard T.H. Chan School of Public Health, Departments of Environmental Health and Epidemiology, and the Pulmonary and Critical Care Division, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA. 1

Upload: others

Post on 28-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

High Throughput Sequencing in Respiratory, Critical Care and Sleep Medicine Research: An Official American Thoracic Society Workshop Report

Craig P. Hersh 1, Ian M. Adcock 2, Juan C. Celedón 3, Michael H. Cho 1, David C. Christiani 4,

Blanca E. Himes 5, Naftali Kaminski 6, Rasika A. Mathias 7, Deborah A. Meyers 8, John Quackenbush 9, Susan Redline 10, Katrina A. Steiling 11, Holly K. Tabor 12, Martin D. Tobin 13, 14, Mark M. Wurfel 15, Ivana V. Yang16, and Gerard H. Koppelman 17, 18 on behalf of the American Thoracic Society Section on Genetics and Genomics

1. Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA; Harvard Medical School, Boston, MA, USA

2. Airways Disease Section, National Heart and Lung Institute, Imperial College London, UK

3. Division of Pediatric Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, PA, USA

4. Harvard T.H. Chan School of Public Health, Departments of Environmental Health and Epidemiology, and the Pulmonary and Critical Care Division, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.

5. Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA

6. Pulmonary, Critical Care and Sleep Medicine, Yale University School of Medicine, New Haven, CT, USA

7. Division of Allergy and Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD, USA

8. University of Arizona College of Medicine, Tucson, AZ, USA.

9. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA

10. Departments of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA.

11. Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA.

12. Stanford Center for Biomedical Ethics, Department of Medicine, Stanford University, School of Medicine, Stanford, CA, USA

13. Department of Health Sciences, University of Leicester, Leicester, UK

14. National Institute for Health Research Leicester Respiratory Biomedical Research Centre, Glenfield Hospital, Leicester, UK

15. Division of Pulmonary, Critical Care and Sleep Medicine, University of Washington, Seattle, WA, USA.

16. Department of Medicine, University of Colorado, Denver, CO, USA

17. University of Groningen, University Medical Center Groningen, Beatrix Children’s Hospital, Department of Pediatric Pulmonology and Pediatric Allergology,

18. University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD

Corresponding author:

Craig P. Hersh, MD, MPH

Channing Division of Network Medicine

Brigham and Women's Hospital,

181 Longwood Ave.

Boston MA 02115

Phone +1-617-525-0729

Fax +1-617-525-0958

[email protected]

Key words: bioinformatics, functional genomics, genetic epidemiology, RNA sequencing, whole genome sequencing

Word count: 4478/4500

Abstract

High throughput, “next generation” sequencing methods are now being broadly applied across all fields of biomedical research, including respiratory disease, critical care, and sleep medicine. While there are numerous review articles and best practice guidelines related to sequencing methods and data analysis, there are fewer resources summarizing issues related to study design and interpretation, especially as applied to common, complex, non-malignant diseases. To address these gaps, a single day workshop was held at the American Thoracic Society meeting in May 2017, led by the ATS Section on Genetics and Genomics. The aim of this workshop was to review the design, analysis, interpretation, and functional follow-up of high throughput sequencing studies in respiratory, critical care, and sleep medicine research. This workshop brought together experts in multiple fields, including genetic epidemiology, biobanking, bioinformatics, and research ethics, along with physician-scientists with expertise in a range of relevant diseases. The workshop focused on application of DNA and RNA sequencing research in common, chronic diseases, and did not cover sequencing studies in lung cancer, monogenic diseases (e.g. cystic fibrosis), or microbiome sequencing. Participants reviewed and discussed study design, data analysis and presentation, interpretation, functional follow-up and reporting of results. This report summarizes the main conclusions of the workshop, specifically addressing the application of these methods in respiratory, critical care, and sleep medicine research. This workshop report may serve as a resource for our research community, as well as for journal editors and reviewers of sequencing-based manuscript submissions in our research field.

Word count: 246/250

Contents

Overview

Workshop Agenda

Principles of study design

Study designs and phenotyping for genetic epidemiology

Biobanks

Health equity

Research ethics

Next generation DNA sequencing

Study design

Identifying causal variants

Next generation RNA sequencing

Study design

Sequencing methods

Small RNA sequencing

Reporting for RNA sequencing experiments

Data analysis

Bioinformatics

Quality control

Statistical analysis

Data sharing

Cell type heterogeneity

Single cell sequencing

Multi-omics integration

Functional validation

Conclusions

Overview

High throughput, next generation, sequencing technologies (NGS) (see Box 1 for a glossary of terms) are becoming increasingly used in studies of common, complex diseases, including respiratory diseases such as asthma, chronic obstructive pulmonary disease, and idiopathic pulmonary fibrosis; critical illnesses; and sleep disorders (Table 1) (Figure 1) (1-15). RNA sequencing is now more cost-effective than microarrays, and exome and whole genome DNA sequencing are rapidly replacing genotyping arrays. Given the widespread application of these techniques in respiratory, critical care and sleep medicine research, a workshop was organized at the ATS International Conference in Washington, DC in May 2017. The aim of this workshop was to review the design, analysis, interpretation, and functional follow-up of high throughput sequencing studies in respiratory, critical care, and sleep medicine research. Whereas reviews and best practices guidelines for DNA and RNA sequencing have been published (16, 17), this workshop focused on the application of DNA and RNA sequencing to common, complex diseases in human populations, but not on epigenome or microbiome studies, or cancer genetics.

Workshop Agenda

The workshop participants focused on five topics, each of which concluded with a panel discussion. Areas of emphasis included study design, ethical considerations and health inequalities, applications of DNA and RNA sequencing, cell type heterogeneity, and functional studies. Biomedical literature searches were conducted by the speakers and co-chairs. The co-chairs collected summaries from speakers, and a writing group prepared the document for review by the workshop participants. Recommendations were formulated by discussion and consensus (Box 2).

Principles of Study Design

Study Designs and Phenotyping for Genetic Epidemiology

The general principles of epidemiology study design remain true for genetic epidemiology studies, including subject ascertainment, phenotype definition, and sample size considerations, as summarized in the STrengthening the REporting of Genetic Association Studies (STREGA) Guidelines (18). There are several possible designs for genetics studies of respiratory disease. In the past, most studies enrolled subjects ascertained for a specific condition. These studies usually employ careful phenotyping to define the disease of interest using endotypes, such as methacholine challenge testing or polysomnography (19-21). General population (cohort) studies may offer the advantage of large sample sizes and the ability to study multiple outcomes, though the phenotyping may not be as precise. Questionnaires may be the primary source of respiratory disease diagnosis, though several large cohorts have included spirometry (22, 23). For common diseases, the large numbers may offset the potential for phenotypic heterogeneity (e.g. childhood vs. adult-onset asthma) or even misclassification (e.g. COPD misdiagnosed as asthma, especially in women) (24). Recent studies have linked genetics to the electronic medical record (25). General population cohorts have limited utility in studies of critical illness, where subjects are enrolled in the hospital (26). Although most recent studies have been case-control or cohort studies, family-based studies still play a role, especially in the analysis of rare variants, where transmission can be followed through a pedigree (27).

As genomic data sharing has become the norm, secondary analysis for respiratory diseases is now routinely performed in general population studies, such as the Framingham Heart Study (22). When secondary data will be used or when multiple studies will be combined in a meta-analysis, investigators must carefully review the phenotyping methods, including the specific questionnaire items, to be sure that similar traits are being compared. In case-control studies, control subjects should have comparable exposures, such as smokers with normal lung function in COPD studies or patients with multi-trauma, pneumonia, or sepsis who did not develop ARDS (28-30).

Biobanks

Sequencing studies undertaken at the scale required for well-powered association testing require organized efforts, with coordinated biobanking.  Rare diseases and phenotypes may be due to genetic variants with high penetrance, which may be detectable in relatively modest numbers of samples. Even in this situation, and in the absence of many different mutations causing similar phenotypes, it is helpful to draw upon very large numbers of sequenced individuals. Genomics England (the “100,000 Genomes Project”) (31), the NHLBI Trans-Omics in Precision Medicine (TOPMed)(32) and the Genome Aggregation Database (33) are initiatives that will enhance such comparisons, which are critical to confirm whether variants are causal for the disease in question (Table 2). The Genomics England project is recruiting patients with cancers, infectious diseases such as tuberculosis, and rare diseases, including primary ciliary dyskinesia, spontaneous pneumothorax, familial pulmonary fibrosis and familial multiple pulmonary arteriovenous malformations.

Complementing efforts that specifically recruit individuals with particular diseases are population biobanks that are agnostic to health status, many of which have extensive longitudinal follow-up. The UK Biobank recruited 500,000 participants aged 40-69 years (34). In addition to a baseline assessment, subsequent health status is evaluated via linked electronic health care records. Beginning with respiratory studies in 50,000 participants (35), genome-wide genotyping has now been extended to all participants. Whole exome sequencing (WES) is underway, and whole genome sequencing (WGS) has recently been announced in collaboration with industry partners. The sequence data will be made available to the research community. While 95% of the UK Biobank is of European ancestry, similar efforts are in progress in China in the Kadoorie Biobank (36).

Health Equity

Many respiratory, critical care, and sleep disorders have substantial differences in disease susceptibility, prevalence, and burden according to race and ethnicity (37). Genetic and genomic factors, along with their interplay with the environment, contribute to these differences. For example, African ancestry is a strong predictor of lung function (38). Some disease susceptibility or pharmacogenetic variants that have been identified in diseases such as asthma, emphysema, and sleep apnea (39-41) are population-specific, and may be absent or low frequency in other populations (42, 43). While most of these studies have been performed using common variants, there is even more ethnic diversity for rare variants (44). Individuals of African ancestry harbor a larger number of rare variants than whites, which may have important clinical implications. For example, rare variants identified as causing cardiomyopathy in whites were so common in African Americans as to indicate that they were unlikely to be pathogenic (45). To date, there has been a large gap in research studies involving non-whites, for reasons including convenience, access, and genetic heterogeneity (46). In addition, there have also been disparities in funding minority investigators or diseases that predominately affect minorities, such as sickle cell disease (47). By 2060, only 44% of the USA will be non-Hispanic white (48). There is a strong scientific and moral justification for expanding sequencing studies into other ancestries. Efforts such as TOPMed are performing WGS in a large number of non-European samples. These and other efforts can help ensure that sequencing efforts improve health equity for the benefit of all.

Research Ethics

Several ethical challenges have emerged related to next-generation sequencing studies. Due to the ability to assay numerous sites and the requirements for data sharing by NIH and other funders (49), genome-wide association studies (GWAS) and sequencing data are often used for secondary studies, which may be unrelated to the initial trait or disease proposed. To allow for these studies, investigators must request and subjects must provide broad consent for secondary data analysis, as opposed to narrow consent for a specific disease. There is no consensus about what is required for broad consent for databases such as the NCBI database of Genotypes and Phenotypes (dbGaP) (50); these decisions are frequently left to local Institutional Review Boards.

In addition to the genes of interest, WES and WGS studies will identify other genetic variants that may be clinically significant for the subject or their family members. The American College of Medical Genetics has provided recommendations for the reporting of secondary results from clinical sequencing (51), providing a list of 59 actionable genes, which interestingly does not include the genes for cystic fibrosis or alpha-1 antitrypsin deficiency. However, it is not clear how these recommendations would apply to research studies, where the sequencing is not performed in a Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory. There is usually no mechanism or resources for confirmatory clinical sequencing or genetic counseling. Sequencing studies of DNA of patients with a critical illness require consent from patient proxies, as well as the designation of an individual to receive study results if the patient remains incapacitated or dies (52).

Next generation DNA sequencing

Study Design

Humans carry an extraordinary amount of genetic diversity. While most variants in a given individual are common, most genetic variants in a population are rare (i.e. present in less than 1 % of the population). Genotyping microarrays, together with methods of genotype imputation, efficiently allow testing of more prevalent genetic variants. However, comprehensive assessment of DNA variation, including discovery of rare or novel variants, requires a test that assays each base pair. The exponential decrease in sequencing costs has led to the ability to perform WGS studies to identify the contributions of rare variants to disease.

The design of a DNA sequencing study should consider several factors. Sequencing depth determines the accuracy of variant calling in an individual. For example, 30x coverage leads to high accuracy over most of the genome. However, one can sequence more samples for the same cost using lower coverage (e.g. 3-6x). While less accurate, lower coverage still provides suitable variant information for genetic association testing and may be superior to genotyping (53). Another consideration is targeted (e.g. exome) versus WGS. The exome harbors most rare, highly deleterious mutations; sequencing only these regions reduces costs, though these savings are offset somewhat by the additional costs and inefficiencies of library preparation. Some coding regions may still be poorly covered by WES (54).

Identifying Causal Variants

NGS has led to breakthroughs in the discovery of genes for Mendelian diseases and other rare variants of strong effect (55, 56). Several groups have provided guidelines for identifying causal variants for Mendelian disease (57) and for identifying genetic association in WES (58) and WGS (59). However, the interpretation of WES and WGS data has specific challenges. Very small error rates over billions of base pairs has the potential to generate many false positives (60), though advances in technology, approaches, and bioinformatics methods have vastly improved data quality. Similarly, the large number of variants carried by any individual (27) can also lead to false positives, and caution must be used to avoid inflated estimates of pathogenicity (61). In addition, most studies of rare variants are likely underpowered (5, 58, 62). The growing availability of population-specific high-quality reference genomes will aid comparison of diseased study populations to these reference datasets. Coordinated efforts are providing population-based reference data important for filtering causal variants (33), and performing WGS in large numbers of subjects such as the NIH Center for Common Disease Genetics and TOPMed. These efforts will lead to new rare variant discoveries, as well as improved reference panels for genotyping studies and fine mapping. Table 3 details recommendations for reporting the results of a WES or WGS study.

Next generation RNA sequencing

Study Design

In contrast to genome sequence, gene expression is dependent on cell, tissue and disease state. Although this context dependence makes sample collection and assays more challenging, gene expression may be more closely reflective of disease pathophysiology. Gene expression microarrays have been largely replaced by RNA-seq, which can assay a broader range of RNA types with increased sensitivity and lower costs. RNA-seq studies can be grouped into two broad categories, based on their study designs (63). Annotation studies aim to define the transcriptome of a specific cell type or organism, including novel transcripts. In comparison, quantification or differential expression studies compare transcript levels across experimental conditions or diseases. An introduction to RNA-seq for bench science has been recently published (64).

Several factors are important in designing an RNA-seq study in human populations. Since sequencing costs depend on the number of reads, there is an inherent trade-off between sample size and sequencing depth, similar to DNA sequencing. As low as 1-10 million reads per sample may be adequate for differential expression (65, 66), whereas up to 200 million reads may be required to define all isoforms (67, 68). For most human disease studies, the number of samples may be more important than the number of reads (69).

Sequencing methods

Sequencing technology has been reviewed elsewhere (70, 71). Library construction follows one of three general protocols (72). Poly-A capture is best for selecting coding transcripts, but requires the highest quality input RNA. Total RNA libraries include more RNA species; ribosomal RNA depletion is used to improve yield. Methods to capture specific target sequences can be used for lower quality or fragmented RNA, including formalin-fixed paraffin embedded tissue. Globin depletion is a common step for whole blood samples. Small RNA sequencing (i.e. microRNA) requires a specific library construction protocol.

Investigators must determine the minimum quality for input RNA, based on RNA integrity number (RIN). Other variables in sequencing include stranded vs. unstranded protocols, single vs. paired end reads and insert length. Single end short reads (≤75bp) are acceptable for standard differential expression (73). Paired end longer reads are preferable for annotating splicing events. In larger studies, samples should be randomized across batches to minimize confounding; analysis should account for batch effects. Gene expression in human tissues can be quite variable, depending on sampling, technical factors, and cellular heterogeneity (74). The latter can be addressed by deconvolution methods, which require additional information, either cell counts or cell specific reference transcriptomes.

A review of the specific steps and analytic tools for RNA-seq data analysis has been recently published (16). However, there remains controversy regarding the optimal methods for normalization of RNA-seq data and for detecting differential expression. A few studies have examined these questions systematically (75-79). Furthermore, selection of an analytic approach is also dependent on the experimental design; for example, not all available tools support inclusion of covariates. Caution is advised regarding the interpretation of results where the number of biologic replicates an experiment is small or where transcripts are expressed at very low levels.

Reporting for RNA-seq experiments

We propose minimum elements to be reported for RNA-seq experiments (Table 3), which extend the Minimal Information About Microarray Experiments (MIAME) guidelines (80). These minimal reporting requirements are important due to the rapidly evolving landscape of methods and tools for the analysis of RNA-seq data, the need to ensure RNA-seq data are both interpretable and reproducible, and the need to facilitate access to and integration of RNA-seq experiments across a spectrum of biologic and experimental conditions. Given the lack of consensus on the optimal methods for mapping versus assembling RNA-seq reads, normalizing RNA-seq data, and assessing for differential expression, we encourage investigators to repeat key analyses using more than one approach.

Data Analysis

Bioinformatics

Sequencing technologies have improved to the point where the greatest barrier to obtaining scientific insights is more related to data storage, analysis and interpretation than its generation (81). The first critical component is an interdisciplinary team with expertise encompassing both the design and use of specialized methods on sophisticated computational resources (82, 83). Institutional infrastructure or external service providers that offer high-performance computing environments, including cloud computing and core facilities, are important to facilitate the generation and analysis of high-throughput sequence data. One significant advantage to these solutions is that they distribute costs over many users. Additionally, these resources can help ensure high quality data and results, as they are generated by devoted personnel who are more familiar with NGS approaches than occasional users. On the other hand, some analytic steps are best performed with feedback from those familiar with experimental design, rather than by pipelines that may overlook important issues. Ideally, researchers with backgrounds in (1) chemistry and molecular biology involved in generating the data, (2) in-depth familiarity with analysis of sequencing data, and (3) design of a particular experiment, will communicate intermediate results and tailor analytical steps as needed.

Quality Control

Quality control (QC) in a rigorous and standardized matter is critical. After raw reads are de-multiplexed following their generation by the sequencing instrument, QC steps to ensure sequencing worked appropriately provide output including number of reads per sample, quality score of reads at each base, and over-represented sequences (e.g., primers). Subject level QC includes checking for gender consistencies, duplicates, and related subjects. After reads are aligned to a reference genome, additional parameters assess mapping quality. For example, software tools (Tables 4 and 5) can be used to summarize the number of mapped reads, including junction spanning reads for RNA-seq, and can compute the number of bases assigned to various classes of DNA/RNA according to a reference file. For RNA-seq, use of Ambion External RNA Controls Consortium (ERCC) spike-ins offers an additional QC measure. Scripts that provide standardized and reproducible reports based on output from these various programs, such as those that are part of the GATK best-practices pipeline for DNA-Seq (84) or taffeta scripts for RNA-seq (85) facilitate the assessment of QC and decisions about which samples should be excluded or what kinds of bias might be present. Mapped read files that are in bam format can be converted to bigwig or other compressed format for display in a browser such as the UCSC genome browser to verify that mapping of particular genes looks as expected (e.g., that full lengths of genes are covered vs. highly irregular portions).

Statistical Analysis

Following alignment of reads, DNA-Seq data is analyzed to search for variation relative to a reference genome and Variant Call Format (vcf/bcf) files are obtained. There are a variety of statistical tests for association within the WGS framework. The simplest consideration is based on frequency of the variant and/or a prior probability of being associated with disease (e.g., known pathogenic, deleterious, functionally relevant). Common variants, defined as those with minor allele frequency (MAF) >0.01-0.05 or minor allele count >10-20, depending on the sample size and factors such as case-control imbalance, are usually analyzed with a GWAS approach of single SNP association tests (86). Rare and/or deleterious variants are generally analyzed using a window-based approach, where windows consist of SNPs in or near a gene or are based on sliding genome regions of 5-50kb. Statistics for each window are obtained using different approaches: burden tests collapse many variants into a single risk score but assume all variants are similar in effect (i.e. risk alleles); adaptive burden tests collapse many variants but allow for risk, protective and neutral; and variance component tests apply random effects modeling (87). There is also a class of window-based rare variant tests that combine the variance component and burden test framework to take advantage of strengths of both; SKAT-O is the most commonly used (88). It is routine to apply multiple rare variant tests and use alternative options of windows, resulting in increased multiple testing complexity. In addition to Bonferroni correction, permutation and Bayesian approaches are employed to adjust for multiple comparisons.

Controversy about the best way to analyze RNA-seq data still exists, and methods development is ongoing (89). For samples passing initial QC, the next step involves quantification of levels of genes or transcripts. In most cases, reference gene or transcript files are obtained from Ensembl (90) or RefSeq (91). The output of the quantification process is then used with an appropriate software package to measure differential expression and assess related QC. Regardless of program used, it is important to report false-discovery rate adjusted p-values and fold changes.

Data Sharing

To ensure that high-throughput sequence results are reproducible and that costly data can benefit all stakeholders, data sharing resources have grown significantly. Both raw data and results generated from projects sponsored by major funders are required to be deposited into publicly available databases. DNA-Seq data, including that of TOPMed, are deposited in dbGaP, along with individual-level phenotype and association results (50). RNA-seq and other sequencing data is available in the Sequence Read Archive and can be discovered via the Gene Expression Omnibus (92).

Cell type heterogeneity

An important issue for consideration in many omics studies of the lung is cell heterogeneity. The lung is a complex organ comprising approximately forty resident cell types (93), a growing number of cell subpopulations that are present either transiently during development or in adult lung (94), as well as many types of inflammatory cells that infiltrate the airways and alveoli during periods of injury or disease. Thus, a signal measured by omics technologies in the whole lung can reflect a change in the pattern of expression of the molecules measured within a certain cell type, a change in the cellular composition of the lung, or both. There are three main approaches to deal with cellular heterogeneity. One approach is to perform statistical deconvolution of omic profiles by relying on cell specific features from reference datasets. This approach has been used widely in peripheral blood profiling studies (95) and more recently on complex tissues (96), but is highly dependent on known markers and difficult to implement for lung cell populations due to the limitations of appropriate reference datasets. The second approach is to isolate cell types based on cell surface markers using flow cytometry or specific areas of the lung by laser capture microdissection (LCM). While cell sorting is often used in immunological studies and has facilitated major contributions to the field (97), it is limited by the need for known cell markers and antibodies for cell populations of interest, as well as concerns that stress from cell sorting may affect gene expression patterns. LCM can be technically challenging on human lung tissue, but has had some success in conjunction with sequencing technologies (98). However, in most benign tissues, the resolution of LCM allows for enriching for a regional microenvironment, but not for dissecting between different cell types.

Single Cell Sequencing

The recently developed single cell technologies provide the best solution to identification of all relevant cell populations, though technical limitations remain (99-101). The reproducibility and success of such studies depend greatly on availability of high quality human tissue, on cellular susceptibility to stress, and on the platforms used(102). Despite these limitations, single cell RNA-seq studies have shed light on lung cell population heterogeneity during lung development (103) and in lung diseases such as IPF (15). The recent development of single-nucleus RNA-seq may address some of the issues with tissue quality (104). To address cellular heterogeneity in human lung disease, the field needs a “lung (disease) atlas” such as the one proposed by the human cell atlas (105), a large collaborative set of studies that will systematically profile all cells in diseased and healthy human lungs.

Multi-omics Integration

DNA-seq offers the opportunity to detect different sources of DNA variation, including common and rare single nucleotide variant and small and large deletions and insertions, whereas RNA-seq provides full assessment of the cell or tissue transcriptome at a given point in time, with a high dynamic range of these transcripts, different mRNA transcript isoforms as well as different classes of mRNA and ncRNAs. Integrative genomics methods evaluate the functional significance of DNA variation on gene expression. Firstly, these address the relation of genetic variation with transcript abundance (expression Quantitative Trait Locus, eQTL). Since SNP variants can be called in RNA-seq, this offers a direct assessment of an eQTL effect if a SNP is present in its heterozygous form; comparison of the allelic expression demonstrates the relation of a SNP with transcript abundance (allelic imbalance). Since many disease SNPs exert their function through eQTLs, well-powered lung-relevant eQTL datasets are needed. Until now, many studies have been performed in mixed cell populations of white blood cells (106) or whole lung (107). Since eQTLs may be cell specific, lung cell eQTL maps are needed to increase our understanding of lung specific disease SNPs. Secondly, RNA-seq enables assessment of the relation of genetic variation with splicing events leading to alternative isoform expression (splice QTL). Finally, the association of genetic variation with transcripts induced by disease or specific stimuli (context dependent eQTLs) can be investigated either in paired observational human studies or ex vivo in laboratory settings (Table 6). Since most respiratory diseases develop as a result from environmental exposures and genetic background, the study of inducible eQTLs may offer a good model to understand disease development by integrating genetic variation with induced gene expression.

Using study-specific subsets of RNA-seq data or external reference datasets such as the GTEx consortium (108) allows for the ability to impute gene expression in large numbers of individuals on whom only genetic variant data are available (109). These integrative approaches can expand the value of smaller sample sizes with transcript data to the larger datasets to identify gene expression correlated with phenotype (Table 6).

We anticipate that further integration of other omics data in bulk tissue or at the single cell level, though efforts such as the Lung Map (110), will markedly increase our understanding of respiratory disease. The efficiency and speed of these types of analysis may be improved by the implementation of composite measures (111) in future research programs in respiratory medicine, particularly as the approach can be used for all types of omics analysis including transcriptomics, proteomics, metabolomics and epigenetics. Omics integration analysis has advanced considerably, and many machine learning methods are now being used, including Bayesian and Network-based approaches (112) and more recently, deep learning and neural networks (113).

Functional Validation

In functional studies, gene expression may also be used as outcome, either in vivo when human subjects or patients are exposed to an environmental stimulus or drugs, or in human samples or animal models. Comparative analysis of humans and mouse models through RNA-seq may enable swift validation of downstream targets and provide insight in the validity of the animal model (114). Additional sequencing methods such as DNA-Seq, ATAC-Seq, and ChIP-Seq can identify functional regions affected by genetic variants. Gene editing techniques including Crispr-Cas9 that enable knockdown of genes or SNP-specific editing may be followed up by a readout on the effects of gene regulatory networks (115) (Table 6).

Conclusions

With large efforts such as TOPMed and the UK Biobank, in addition to specific disease studies, there is an ever-increasing amount of sequencing data available for studies of respiratory disease, critical care and sleep medicine. Due to the complex nature of these studies, it is critical to include researchers with multiple backgrounds at the outset of study design, including clinician-scientists and epidemiologists who can enroll and phenotype subjects; laboratory personnel with skills in biobanking, sample management, and high-throughput sequencing; bioinformaticians, statisticians, and computational biologists who can manage and analyze data; and molecular biologists who can conduct functional validation studies. All of these experts must collaborate to design studies, interpret data, and present results. This should not discourage new investigators from participating in omics studies, as each person can provide complementary expertise. Specific recommendations regarding study design, analysis, and follow-up (Box 1) should serve as guides for starting a new sequencing study or for the critical appraisal of a completed study. Genomics is a rapidly-evolving field, and researchers must keep abreast of best practices. However, general principles of study design and data reporting are likely to remain valid in the future.

Acknowledgements:

This workshop report was prepared by a subcommittee of the American Thoracic Society Section on Genetics and Genomics in the Assembly on Allergy, Immunology and Inflammation.

Workshop Participants:

Craig P. Hersh, M.D., M.P.H. (Co-Chair)

Gerard H. Koppelman, M.D., Ph.D. (Co-Chair)

Ian M. Adcock, Ph.D.

Juan C. Celedón, M.D., Dr.P.H.

Michael H. Cho, M.D., M.P.H.

David C. Christiani, M.D., M.P.H.

Blanca E. Himes, Ph.D.

Naftali Kaminski, M.D.

Rasika A. Mathias, Sc.D.

Deborah A. Meyers, Ph.D.

John Quackenbush, Ph.D.

Susan Redline, M.D., M.P.H.

Katrina A. Steiling, M.D., M.Sc.

Holly K. Tabor, Ph.D.

Martin Tobin, Ph.D., M.B.Ch.B.

Mark M. Wurfel, M.D., Ph.D.

Ivana V. Yang, Ph.D.

Disclosures

Dr. Hersh reports personal fees from Mylan, personal fees from AstraZeneca, personal fees from Concert Pharmaceuticals, personal fees from 23andMe, grants from Novartis, grants from Boehringer-Ingelheim, grants from National Institutes of Health, outside the submitted work. Dr. Adcock reports grants from Boehringer Ingelhein, grants from Novartis, outside the submitted work. Dr. Celedón has received research materials from Pharmavite (vitamin D and placebo capsules), and Merck and GSK (inhaled steroids) to provide medications free of cost to participants in NIH-funded studies, unrelated to this work. Dr. Cho has received grant support from GSK. Dr. Kaminski reports personal fees from Biogen Idec, Boehringer Ingelheim, Third Rock, MMI, personal fees from Pliant, Samumed, NuMedii, Indaloo, non-financial support from miRagen , outside the submitted work; In addition, Dr. Kaminski has a patent on New Threapies in Pulmonary Fibrosis with royalties paid by Biotech, and a patent on Peripheral Blood Gene Expression issue. He is a Member of the Scientific Advisory Committee, the Research Advisory Forum and serves as Deputy Editor of Thorax, BMJ. Dr. Redline reports grants from NIH, during the conduct of the study. Dr. Steiling reports other from UpToDate, other from Academic Research Coalition LLC, grants from Lungevity Foundation, from null, outside the submitted work; In addition, Dr. Steiling has a patent US Patent 9,677,138 issued to Boston University. Dr. Tobin reports grants from GSK, grants from Pfizer, outside the submitted work. Dr. Yang reports grants from NIH, outside the submitted work. Dr. Koppelman reports grants from Lung Foundation of the Netherlands, TEVA the Netherlands, GSK, TETRI Foundation, UBBO Emmius Foundation, VERTEX, outside the submitted work.

Drs. Christiani, Himes, Mathias, Meyers, Quackenbush, Tabor, and Wurfel have nothing to disclose.

Funding

Dr. Hersh reports funding from NIH grants R01HL125583 and R01HL130512. Dr. Cho is supported by NHLBI grants R01HL113264, R01HL137927, and R01HL135142. Dr. Tobin has received support from a Wellcome Trust Investigator Award (WT202849/Z/16/Z) and from the UK Medical Research Council (MR/N011317/1). The research was partially supported by the NIHR Leicester Biomedical Research Centre; the views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. Dr. Himes if funded by NIH R01 HL133433 and R01 HL141992. S Redline has received support from NIH R35HL135818 and HL113338.

Figure Legends

Figure 1: Next Generation Sequencing methodology (Illumina)

Genomic DNA is fragmented and sequencing adaptors are attached. The genomic library is then hybridized to complementary oligo-nucleotide probes in the flow cell chamber. Since there are adaptors on both ends, hybridization results in a bridge. Amplification leads to clusters of fragments with the same sequence. Clusters are denatured, then sequencing-by-synthesis involves the addition of fluorescently-labeled nucleotides, with serial imaging after the incorporation of each nucleotide. Reprinted from Tucker T, Marra M, Friedman JM. Massively parallel sequencing: the next big thing in genetic medicine. Am J Hum Genet 2009;85:142-152 with permission from Elsevier.(116).

Figure 2: Workflow for a next generation sequencing study in human disease

Box 1. Definitions of commonly used terms in sequencing studies

Batch effect: In a large study, library construction and sequencing is done in batches (e.g. 96 well plate), which is a source of technical variation that should be addressed in the data analysis.

Complex trait: A disease or phenotype that does not follow Mendelian inheritance. Complex traits are likely influenced by multiple genes and environmental factors. Most common human diseases would be considered complex diseases.

Deconvolution: RNA sequencing is frequently performed in tissues such as blood or lung, which are composed of multiple cell types. Deconvolution methods aim to estimate the cell type proportions and/or identify the cell type(s) responsible for the expression of specific genes.

Exome: The portion of the human genome (approximately 1%) that encodes for proteins. Whole exome sequencing (WES) specifically targets these sequences.

Expression Quantitative Trait Locus (eQTL): A genetic variant, usually a SNP (see below), that affects the expression of a gene. eQTLs can be located near the gene of interest (cis-eQTL) or distant (> 1 MB away) (trans-eQTL).

Genome-wide association study (GWAS): A study which assays hundreds of thousands to millions of SNPs across the genome and tests each variant for association with a disease or trait of interest.

Mendelian disorder: A disease determined by variation in a single gene, e.g. cystic fibrosis or sickle cell disease.

Next generation sequencing: Highly automated parallel sequencing technique of small fragments of DNA or RNA. Millions or even billions of nucleotides, up to a whole genome, can be determined in one day.

RNA Integrity Number (RIN): A proprietary algorithm that quantifies RNA degradation based on an electropherogram.

Single nucleotide polymorphism: A single base pair change in DNA sequence (e.g. C to T) which is prevalent (>1 %) in the general populations. SNPs are the most common type of genetic variation.

Sequencing coverage/depth: For DNA sequencing, the number of reads that include a specific nucleotide in the sequencing experiment. This can be averaged across the genome, e.g. 30x. For RNA sequencing, sequencing depth is usually presented as the total number of sequencing reads.

Variant calling: The process of identifying genetic variants (usually SNPs) in an individual exome or genome sequence.

21

Box 2: Recommendations for design and analysis of next generation sequencing studies

· General principles of genetic epidemiology study design are especially important in next generation sequencing studies. Disease specific studies may have more detailed phenotyping, whereas population studies may allow for larger sample sizes.

· Investigators must clearly define the phenotypes of both cases and control subjects, including consideration of relevant exposures, such as smoking.

· Relying only on datasets largely composed on individuals of European ancestry will limit discoveries, especially in diseases which may be more prevalent in other racial or ethnic groups. Therefore, we suggest expanding sequencing efforts in subjects of different ancestries.

· Studies should consider broad consent for secondary data use.

· Researchers should employ standardized methods of experimental design and data analysis for exome and whole genome sequencing studies.

· Methods for design and analysis in RNA sequencing studies are more variable. Researchers should clearly document their methods, including software versions and input parameters, and consider validating key results with a different analysis method.

· Multidisciplinary teams should include bioinformaticians, statisticians and computational biologists to assist in the management and analysis of large datasets.

· Quality control is the responsibility of the investigators. It should not be assumed that the core sequencing facility has performed all the necessary QC steps.

· Data sharing is required by many funders and should be the default.

· Due to the extreme cellular heterogeneity of the lung, future studies should address cellular heterogeneity in the design and analysis, by any of the proposed methods.

· We anticipate an important role of single cell sequencing in the future. For wider acceptance, single cell sequencing has to address specific hypotheses and should be held to the same rigorous standards as other study designs, even while the laboratory and statistical methods are under development.

· Given the pervasive influence of circadian biology on multiple cellular processes including gene expression, studies should time stamp sample collection.

· Integrating different omics datasets, both at the single cell or at the tissue level, will likely increase our understanding of the complex diseases in respiratory, critical care and sleep medicine.

· Laboratory validation is an important next step towards the eventual translation of results.

Table 1: Examples of human next generation sequencing studies in respiratory, critical care and sleep medicine

Technology

Disease/trait

Study Design/subjects

Main Findings

Validation

Reference

Whole Exome Sequencing

Narcolepsy

18 families

8 missense variants in P2Y11 

Resequencing in 250 cases, 150 controls; in vitro P2Y11 signaling assays

(10)

Whole Exome Sequencing

Bronchopulmonary Dysplasia

50 twin pairs, including 51 BPD cases

258 genes with rare non-synonymous mutations

Lung gene expression in published human data and rat BPD model, mouse phenotype database

(11)

Whole Exome Sequencing

Airflow obstruction

100 heavy smokers with normal lung function

Non-synonymous SNP in CCDC38

Association testing in 2 additional studies. Immunohistochemistry in bronchial epithelial cells.

(6)

Whole Exome Sequencing

Idiopathic Pulmonary Fibrosis

79 probands with familial pulmonary fibrosis, 2816 controls

Mutations in PARN found in cases, not controls. Mutations in RTEL1 more common in cases vs. controls

Mutations segregated in families. Shorter leukocyte telomeres in mutation carriers.

(8)

Whole Genome Sequencing

Pulmonary vascular disease

864 PAH,

16 PVOD/ PCH,

7134 controls

EIF2AK4 mutations in 19 PAH patients

Phenotype association with younger age, reduced KCO, shorter survival

(12)

Whole Genome Sequencing

Asthma

WGS in 8,453 Icelanders, imputation in >150K

Rare variant in IL33 associated with lower eosinophil count, reduced asthma risk

Genotyping in 6465 cases, >300K controls; IL33 gene expression;

in vitro assay of receptor binding

(3)

RNA-sequencing

Smoking

Blood samples from 229 current, 286 former smokers

171 DE genes, including 7 lncRNAs, 8 genes with differential exon use

Published microarray study

(13)

RNA-sequencing

COPD

Lung tissue from 98 cases, 91 controls

2312 DE genes

qPCR for 7 genes

(2)

Single cell RNA-sequencing

Idiopathic Pulmonary Fibrosis

FACS sorted lung epithelial cells from 6 IPF, 3 controls

4 cell clusters: AT2, basal, goblet, and indeterminate

Immunofluorescence confocal microscopy for epithelial cell markers

(15)

miRNA-sequencing

Sepsis

Plasma from 29 sepsis, 44 non-infective SIRS, 16 controls

6 miRNAs distinguish sepsis from SIRS

qPCR, correlation with inflammatory cytokines

(9)

miRNA-sequencing

Exercise physiology

Plasma before/after treadmill exercise test, n=26

miR-181b increased with exercise

qPCR in separate cohort (n=59),

Skeletal muscle expression in mouse exercise model

(14)

PMVEC = pulmonary microvascular endothelial cells, PVOD = pulmonary veno-occlusive disease, PCH = pulmonary capillary hemangiomatosis

Table 2: Biobanks and commonly used databases for next generation sequencing research

Biobanks and other large sequencing studies

URL

Center for Common Disease Genetics

www.genome.gov/27563570

China Kadoorie Biobank

www.ckbiobank.org

Genomics England (“100,000 Genomes Project”)

www.genomicsengland.co.uk

Trans-Omics in Precision Medicine (TOPMed)

www.nhlbiwgs.org

United Kingdom Biobank

www.biobank.ac.uk

Databases

Database of Genotypes and Phenotypes (dbGaP)

www.ncbi.nlm.nih.gov/gap

Ensembl genome browser

www.ensembl.org

Gene Expression Omnibus (GEO)

www.ncbi.nlm.nih.gov/geo

Genome Aggregation Database (gnomAD)

www.gnomad.broadinstitute.org

Genotype-Tissue Expression project (GTEx)

www.gtexportal.org

Human Cell Atlas

www.humancellatlas.org

Lung Map

www.lungmap.net

Reference Sequence Database (RefSeq)

www.ncbi.nlm.nih.gov/refseq

Sequence Read Archive (SRA)

www.ncbi.nlm.nih.gov/sra

University of California Santa Cruz (UCSC) Genome Browser

www.genome.ucsc.edu

Table 3: Minimal elements required in the reporting of high throughput sequencing studies. Required elements should also include the package or software name, version number, and settings used for the analysis.

Analytic Step

Required elements

Pre-processing and pre-analysis quality control

Randomization of samples

Target design, when applicable (e.g. WES)

Methods for quality assessment of:

· Raw reads

· Aligned reads and coverage

· Global data quality

· Ancestry of samples (comparison with study and to reference genomes)

Core analytics

Method of read alignment

Method of variant calling

Method of association analyses

Advanced analytics

Methods for integration with other data types

A. Whole exome and genome sequencing

B. RNA Sequencing

Analytic Step

Required elements

Pre-processing and pre-analysis quality control

Spike-in use

Randomization of samples

Number of raw reads

Methods for quality assessment of:

· Raw reads

· Aligned reads

· Quantification of reads

· Reproducibility of replicates

· Global data quality

Core analytics

Method of transcript/gene identification

Method of transcript/gene quantification

Method of normalization

Method of batch correction

Method of detection of differential expression

Advanced analytics

Method of transcript/isoform discovery

Method of indel detection

Method of gene fusion detection

Method of variant detection

Method for single cell analyses

Methods for integration with other data types

Table 4: Software for DNA sequencing studies. This table provides an overview of commonly used software tools for performing analysis of next generation sequencing data. Because the field continues to evolve rapidly, additional tools not listed in this table may also be useful to researchers.

Task

Tools

URL

Alignment

BWA-MEM (117)

Bowtie2 (118)

http://bio-bwa.sourceforge.net

http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Quality Control

Raw reads:

FastQC

FASTX-Toolkit

Mapping:

BAMtools (119)

Picard Tools

Variants: GATK (120)

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://hannonlab.cshl.edu/fastx_toolkit/

https://github.com/pezmaster31/bamtools/

https://broadinstitute.github.io/picard/

https://software.broadinstitute.org/gatk/

Variant Calling

SAMtools (121)

GATK unified genotyper, haplotype caller, variant quality score recalibration (122)

http://www.htslib.org

https://software.broadinstitute.org/gatk/

Visualization

Integrative Genomics Viewer (IGV) (123)

UCSC Genome Browser (124)

http://software.broadinstitute.org/software/igv/

http://www.genome.ucsc.edu

Association Analysis

PLINK 2 (common variants) (125)

SKAT-O (rare variants) (88)

GENESIS (rare variants) (126)

BOLT-LMM (127)

https://www.cog-genomics.org/plink2/

https://www.hsph.harvard.edu/skat/

https://bioconductor.org/packages/release/bioc/html/GENESIS.html

https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

Table 5: Software for RNA sequencing studies. This table provides an overview of commonly used software tools for performing analysis of RNA sequencing data. Because the field continues to evolve rapidly, additional tools not listed in this table may also be useful to researchers.

Task

Tools

URL

Alignment

Bowtie (128)

STAR (129)

TopHat (130)

http://bowtie-bio.sourceforge.net/index.shtml

https://github.com/alexdobin/STAR/

https://ccb.jhu.edu/software/tophat/index.shtml

Transcript quantification

Cufflinks (130)

eXpress (131)

HTSeq-count (132)

Kallisto (133)

RSEM(134)

http://cole-trapnell-lab.github.io/cufflinks/

https://pachterlab.github.io/eXpress/

http://htseq.readthedocs.io/en/master/count.html

https://pachterlab.github.io/kallisto/

https://github.com/deweylab/RSEM

Quality Control

Raw reads:

FastQC

FASTX-Toolkit

Mapping:

BAMtools (119)

Picard Tools

RSeQC (135)

Quantification:

NOISeq (136)

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://hannonlab.cshl.edu/fastx_toolkit/

https://github.com/pezmaster31/bamtools/

https://broadinstitute.github.io/picard/

http://rseqc.sourceforge.net

https://bioconductor.org/packages/release/bioc/html/NOISeq.html

Differential expression

DEGseq (137)

DESeq2 (138)

edgeR (139)

limma/voom (140)

PoissonSeq (141)

NOISeq (136)

Sleuth (142)

https://bioconductor.org/packages/release/bioc/html/DEGseq.html

https://bioconductor.org/packages/release/bioc/html/DESeq2.html

http://bioconductor.org/packages/release/bioc/html/edgeR.html

https://bioconductor.org/packages/release/bioc/html/limma.html

https://cran.r-project.org/web/packages/PoissonSeq/index.html

https://bioconductor.org/packages/release/bioc/html/NOISeq.html

https://pachterlab.github.io/sleuth/

Alternative splicing

CuffDiff2 (143)

DEX-Seq (144)

DSG-Seq (145)

MISO (146)

rSeqDiff (147)

Leafcutter (148)

http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/

http://bioconductor.org/packages/release/bioc/html/DEXSeq.html

http://bioinfo.au.tsinghua.edu.cn/software/DSGseq/

http://genes.mit.edu/burgelab/miso/

http://www-personal.umich.edu/~jianghui/rseqdiff/

https://github.com/davidaknowles/leafcutter

Visualization

CummeRbund (130)

Integrative Genomics Viewer (IGV) (123)

RNASeqViewer (149)

SplicePlot (150)

SpliceSeq (151)

SplicingViewer (152)

UCSC Genome Browser (124)

http://compbio.mit.edu/cummeRbund/

http://software.broadinstitute.org/software/igv/

http://bioinfo.au.tsinghua.edu.cn/software/RNAseqViewer/

http://montgomerylab.stanford.edu/spliceplot/index.html

https://bioinformatics.mdanderson.org/main/SpliceSeq:Overview

http://bioinformatics.zj.cn/splicingviewer/

http://www.genome.ucsc.edu

Table 6: Selected examples of omics integration and using omics for functional validation studies.

Technique

Example

Context-dependent eQTLs

Li and coworkers showed that cytokine production by peripheral blood mononuclear cells upon stimulation depends on six specific SNPs (153). One inducible cytokine QTL at the NAA35-GOLM1 locus markedly modulated interleukin (IL)-6 production in response to multiple pathogens and also showed association with susceptibility to candidemia.

Imputed gene expression (PrediXcan)

Ferreira et al tested for associations between asthma and 17,190 genes found to have cis- and/or trans-eQTLs across 12 cell types relevant to asthma (154). They confirmed 37 genes where the association was driven by eQTLs located in established risk loci for allergic disease, and discovered 11 novel genetic associations.

Gene knockdown

Dixit et al investigated the effect of gene knockdown by CRISPR/Cas9 on RNA-seq expression in human LPS stimulated bone marrow dendritic cells, a method they called Perturb-seq (155). By analyzing the transcriptional consequences of perturbations of transcription factors in these cells, they were able to interpret the functional consequences of these transcription factors, as well as their interaction, uncovering their molecular mechanisms.

Figure 1:

Figure 2:

Formulation of study aims

Consideration of ethical aspects and health disparities

Subject consent and phenotyping

Sample collection

Study design

Enrollment

Nucleic acid isolation

Library construction

Next-gen sequencing

Sequencing

Sequence reads

Genome alignment

Data analysis

Quality control

Variant calling

Association analysis

Quality control

Quantification

Differential expression

Downstream analyses

Replication studies

Data integration

Functional validation

Manuscript

Shared datasets

Research products

RNA

DNA

References

1. Campbell CD, Mohajeri K, Malig M, Hormozdiari F, Nelson B, Du G, Patterson KM, Eng C, Torgerson DG, Hu D, Herman C, Chong JX, Ko A, O'Roak BJ, Krumm N, Vives L, Lee C, Roth LA, Rodriguez-Cintron W, Rodriguez-Santana J, Brigino-Buenaventura E, Davis A, Meade K, LeNoir MA, Thyne S, Jackson DJ, Gern JE, Lemanske RF, Jr., Shendure J, Abney M, Burchard EG, Ober C, Eichler EE. Whole-genome sequencing of individuals from a founder population identifies candidate genes for asthma. PLoS One 2014; 9: e104396.

2. Kim WJ, Lim JH, Lee JS, Lee SD, Kim JH, Oh YM. Comprehensive Analysis of Transcriptome Sequencing Data in the Lung Tissues of COPD Subjects. Int J Genomics 2015; 2015: 206937.

3. Smith D, Helgason H, Sulem P, Bjornsdottir US, Lim AC, Sveinbjornsson G, Hasegawa H, Brown M, Ketchem RR, Gavala M, Garrett L, Jonasdottir A, Jonasdottir A, Sigurdsson A, Magnusson OT, Eyjolfsson GI, Olafsson I, Onundarson PT, Sigurdardottir O, Gislason D, Gislason T, Ludviksson BR, Ludviksdottir D, Boezen HM, Heinzmann A, Krueger M, Porsbjerg C, Ahluwalia TS, Waage J, Backer V, Deichmann KA, Koppelman GH, Bonnelykke K, Bisgaard H, Masson G, Thorsteinsdottir U, Gudbjartsson DF, Johnston JA, Jonsdottir I, Stefansson K. A rare IL33 loss-of-function mutation reduces blood eosinophil counts and protects from asthma. PLoS Genet 2017; 13: e1006659.

4. Radder JE, Zhang Y, Gregory AD, Yu S, Kelly NJ, Leader JK, Kaminski N, Sciurba FC, Shapiro SD. Extreme Trait Whole-Genome Sequencing Identifies PTPRO as a Novel Candidate Gene in Emphysema with Severe Airflow Obstruction. Am J Respir Crit Care Med 2017; 196: 159-171.

5. Qiao D, Lange C, Beaty TH, Crapo JD, Barnes KC, Bamshad M, Hersh CP, Morrow J, Pinto-Plata VM, Marchetti N, Bueno R, Celli BR, Criner GJ, Silverman EK, Cho MH, Lung Go, COPDGene Investigators, NHLBI Exome Sequencing Project. Exome Sequencing Analysis in Severe, Early-Onset Chronic Obstructive Pulmonary Disease. Am J Respir Crit Care Med 2016; 193: 1353-1363.

6. Wain LV, Sayers I, Soler Artigas M, Portelli MA, Zeggini E, Obeidat M, Sin DD, Bosse Y, Nickle D, Brandsma CA, Malarstig A, Vangjeli C, Jelinsky SA, John S, Kilty I, McKeever T, Shrine NR, Cook JP, Patel S, Spector TD, Hollox EJ, Hall IP, Tobin MD. Whole exome re-sequencing implicates CCDC38 and cilia structure and function in resistance to smoking related airflow obstruction. PLoS Genet 2014; 10: e1004314.

7. Petrovski S, Todd JL, Durheim MT, Wang Q, Chien JW, Kelly FL, Frankel C, Mebane CM, Ren Z, Bridgers J, Urban TJ, Malone CD, Finlen Copeland A, Brinkley C, Allen AS, O'Riordan T, McHutchison JG, Palmer SM, Goldstein DB. An Exome Sequencing Study to Assess the Role of Rare Genetic Variation in Pulmonary Fibrosis. Am J Respir Crit Care Med 2017; 196: 82-93.

8. Stuart BD, Choi J, Zaidi S, Xing C, Holohan B, Chen R, Choi M, Dharwadkar P, Torres F, Girod CE, Weissler J, Fitzgerald J, Kershaw C, Klesney-Tait J, Mageto Y, Shay JW, Ji W, Bilguvar K, Mane S, Lifton RP, Garcia CK. Exome sequencing links mutations in PARN and RTEL1 with familial pulmonary fibrosis and telomere shortening. Nat Genet 2015; 47: 512-517.

9. Caserta S, Kern F, Cohen J, Drage S, Newbury SF, Llewelyn MJ. Circulating Plasma microRNAs can differentiate Human Sepsis and Systemic Inflammatory Response Syndrome (SIRS). Scientific reports 2016; 6: 28006.

10. Degn M, Dauvilliers Y, Dreisig K, Lopez R, Pfister C, Pradervand S, Rahbek Kornum B, Tafti M. Rare missense mutations in P2RY11 in narcolepsy with cataplexy. Brain 2017; 140: 1657-1668.

11. Li J, Yu KH, Oehlert J, Jeliffe-Pawlowski LL, Gould JB, Stevenson DK, Snyder M, Shaw GM, O'Brodovich HM. Exome Sequencing of Neonatal Blood Spots and the Identification of Genes Implicated in Bronchopulmonary Dysplasia. Am J Respir Crit Care Med 2015; 192: 589-596.

12. Hadinnapola C, Bleda M, Haimel M, Screaton N, Swift A, Dorfmuller P, Preston SD, Southwood M, Hernandez-Sanchez J, Martin J, Treacy C, Yates K, Bogaard H, Church C, Coghlan G, Condliffe R, Corris PA, Gibbs S, Girerd B, Holden S, Humbert M, Kiely DG, Lawrie A, Machado R, MacKenzie Ross R, Moledina S, Montani D, Newnham M, Peacock A, Pepke-Zaba J, Rayner-Matthews P, Shamardina O, Soubrier F, Southgate L, Suntharalingam J, Toshner M, Trembath R, Vonk Noordegraaf A, Wilkins MR, Wort SJ, Wharton J, NIHR BioResource-Rare Diseases Consortium, UK National Cohort Study of Idiopathic and Heritable PAH, Graf S, Morrell NW. Phenotypic Characterization of EIF2AK4 Mutation Carriers in a Large Cohort of Patients Diagnosed Clinically With Pulmonary Arterial Hypertension. Circulation 2017; 136: 2022-2033.

13. Parker MM, Chase RP, Lamb A, Reyes A, Saferali A, Yun JH, Himes BE, Silverman EK, Hersh CP, Castaldi PJ. RNA sequencing identifies novel non-coding RNA and exon-specific effects associated with cigarette smoking. BMC Med Genomics 2017; 10: 58.

14. Shah R, Yeri A, Das A, Courtright-Lim A, Ziegler O, Gervino E, Ocel J, Quintero-Pinzon P, Wooster L, Bailey CS, Tanriverdi K, Beaulieu LM, Freedman JE, Ghiran I, Lewis GD, Van Keuren-Jensen K, Das S. Small RNA-seq during acute maximal exercise reveal RNAs involved in vascular inflammation and cardiometabolic health: brief report. Am J Physiol Heart Circ Physiol 2017; 313: H1162-H1167.

15. Xu Y, Mizuno T, Sridharan A, Du Y, Guo M, Tang J, Wikenheiser-Brokamp KA, Perl AT, Funari VA, Gokey JJ, Stripp BR, Whitsett JA. Single-cell RNA sequencing identifies diverse roles of epithelial cells in idiopathic pulmonary fibrosis. JCI Insight 2016; 1: e90558.

16. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. A survey of best practices for RNA-seq data analysis. Genome Biol 2016; 17: 13.

17. Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, Sunyaev S. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet 2013; 14: 460-470.

18. Little J, Higgins JP, Ioannidis JP, Moher D, Gagnon F, von Elm E, Khoury MJ, Cohen B, Davey-Smith G, Grimshaw J, Scheet P, Gwinn M, Williamson RE, Zou GY, Hutchings K, Johnson CY, Tait V, Wiens M, Golding J, van Duijn C, McLaughlin J, Paterson A, Wells G, Fortier I, Freedman M, Zecevic M, King R, Infante-Rivard C, Stewart A, Birkett N, STrengthening the REporting of Genetic Association Studies. STrengthening the REporting of Genetic Association Studies (STREGA): an extension of the STROBE statement. PLoS Med 2009; 6: e22.

19. Himes BE, Hunninghake GM, Baurley JW, Rafaels NM, Sleiman P, Strachan DP, Wilk JB, Willis-Owen SA, Klanderman B, Lasky-Su J, Lazarus R, Murphy AJ, Soto-Quiros ME, Avila L, Beaty T, Mathias RA, Ruczinski I, Barnes KC, Celedon JC, Cookson WO, Gauderman WJ, Gilliland FD, Hakonarson H, Lange C, Moffatt MF, O'Connor GT, Raby BA, Silverman EK, Weiss ST. Genome-wide association analysis identifies PDE4D as an asthma-susceptibility gene. Am J Hum Genet 2009; 84: 581-593.

20. Nieuwenhuis MA, Vonk JM, Himes BE, Sarnowski C, Minelli C, Jarvis D, Bouzigon E, Nickle DC, Laviolette M, Sin D, Weiss ST, van den Berge M, Koppelman GH, Postma DS. PTTG1IP and MAML3, novel genomewide association study genes for severity of hyperresponsiveness in adult asthma. Allergy 2017; 72: 792-801.

21. Chen H, Cade BE, Gleason KJ, Bjonnes AC, Stilp AM, Sofer T, Conomos MP, Ancoli-Israel S, Arens R, Azarbarzin A, Bell GI, Below JE, Chun S, Evans DS, Ewert R, Frazier-Wood AC, Gharib SA, Haba-Rubio J, Hagen EW, Heinzer R, Hillman DR, Johnson WC, Kutalik Z, Lane JM, Larkin EK, Lee SK, Liang J, Loredo JS, Mukherjee S, Palmer LJ, Papanicolaou GJ, Penzel T, Peppard PE, Post WS, Ramos AR, Rice K, Rotter JI, Sands SA, Shah NA, Shin C, Stone KL, Stubbe B, Sul JH, Tafti M, Taylor KD, Teumer A, Thornton TA, Tranah GJ, Wang C, Wang H, Warby SC, Wellman DA, Zee PC, Hanis CL, Laurie CC, Gottlieb DJ, Patel SR, Zhu X, Sunyaev SR, Saxena R, Lin X, Redline S. Multiethnic Meta-Analysis Identifies RAI1 as a Possible Obstructive Sleep Apnea-related Quantitative Trait Locus in Men. Am J Respir Cell Mol Biol 2018; 58: 391-401.

22. Wilk JB, Chen TH, Gottlieb DJ, Walter RE, Nagle MW, Brandler BJ, Myers RH, Borecki IB, Silverman EK, Weiss ST, O'Connor GT. A genome-wide association study of pulmonary function measures in the Framingham Heart Study. PLoS Genet 2009; 5: e1000429.

23. Hobbs BD, de Jong K, Lamontagne M, Bosse Y, Shrine N, Artigas MS, Wain LV, Hall IP, Jackson VE, Wyss AB, London SJ, North KE, Franceschini N, Strachan DP, Beaty TH, Hokanson JE, Crapo JD, Castaldi PJ, Chase RP, Bartz TM, Heckbert SR, Psaty BM, Gharib SA, Zanen P, Lammers JW, Oudkerk M, Groen HJ, Locantore N, Tal-Singer R, Rennard SI, Vestbo J, Timens W, Pare PD, Latourelle JC, Dupuis J, O'Connor GT, Wilk JB, Kim WJ, Lee MK, Oh YM, Vonk JM, de Koning HJ, Leng S, Belinsky SA, Tesfaigzi Y, Manichaikul A, Wang XQ, Rich SS, Barr RG, Sparrow D, Litonjua AA, Bakke P, Gulsvik A, Lahousse L, Brusselle GG, Stricker BH, Uitterlinden AG, Ampleford EJ, Bleecker ER, Woodruff PG, Meyers DA, Qiao D, Lomas DA, Yim JJ, Kim DK, Hawrylkiewicz I, Sliwinski P, Hardin M, Fingerlin TE, Schwartz DA, Postma DS, MacNee W, Tobin MD, Silverman EK, Boezen HM, Cho MH, COPDGene Investigators, ECLIPSE Investigators, LifeLines Investigators, Spiromics Research Group, International COPD Genetics Network Investigators, UK BiLEVE Investigators, International COPD Genetics Consortium. Genetic loci associated with chronic obstructive pulmonary disease overlap with loci for lung function and pulmonary fibrosis. Nat Genet 2017; 49: 426-432.

24. Chapman KR, Tashkin DP, Pye DJ. Gender bias in the diagnosis of COPD. Chest 2001; 119: 1691-1695.

25. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA, Sanderson SC, Kannry J, Zinberg R, Basford MA, Brilliant M, Carey DJ, Chisholm RL, Chute CG, Connolly JJ, Crosslin D, Denny JC, Gallego CJ, Haines JL, Hakonarson H, Harley J, Jarvik GP, Kohane I, Kullo IJ, Larson EB, McCarty C, Ritchie MD, Roden DM, Smith ME, Bottinger EP, Williams MS, e MN. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med 2013; 15: 761-771.

26. Rogers AJ. Genome-Wide Association Study in Acute Respiratory Distress Syndrome. Finding the Needle in the Haystack to Advance Our Understanding of Acute Respiratory Distress Syndrome. Am J Respir Crit Care Med 2018; 197: 1373-1374.

27. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 2011; 12: 745-755.

28. Cho MH, McDonald ML, Zhou X, Mattheisen M, Castaldi PJ, Hersh CP, Demeo DL, Sylvia JS, Ziniti J, Laird NM, Lange C, Litonjua AA, Sparrow D, Casaburi R, Barr RG, Regan EA, Make BJ, Hokanson JE, Lutz S, Dudenkov TM, Farzadegan H, Hetmanski JB, Tal-Singer R, Lomas DA, Bakke P, Gulsvik A, Crapo JD, Silverman EK, Beaty TH. Risk loci for chronic obstructive pulmonary disease: a genome-wide association study and meta-analysis. Lancet Respir Med 2014; 2: 214-225.

29. Bime C, Pouladi N, Sammani S, Batai K, Casanova N, Zhou T, Kempf CL, Sun X, Camp SM, Wang T, Kittles RA, Lussier YA, Jones TK, Reilly JP, Meyer NJ, Christie JD, Karnes J, Gonzalez-Garay M, Christiani DC, Yates CR, Wurfel MM, Meduri GU, Garcia JGN. Genome Wide Association Study in African Americans with Acute Respiratory Distress Syndrome Identifies the Selectin P Ligand Gene as a Risk Factor. Am J Respir Crit Care Med 2018; 197: 1421-1432.

30. Christie JD, Wurfel MM, Feng R, O'Keefe GE, Bradfield J, Ware LB, Christiani DC, Calfee CS, Cohen MJ, Matthay M, Meyer NJ, Kim C, Li M, Akey J, Barnes KC, Sevransky J, Lanken PN, May AK, Aplenc R, Maloney JP, Hakonarson H, Trauma ALI SNP Consortium investigators. Genome wide association identifies PPFIA1 as a candidate gene for acute lung injury risk following major trauma. PLoS One 2012; 7: e28268.

31. Turnbull C. Introducing whole-genome sequencing into routine cancer care: the Genomics England 100 000 Genomes Project. Ann Oncol 2018; 29: 784-787.

32. Brody JA, Morrison AC, Bis JC, O'Connell JR, Brown MR, Huffman JE, Ames DC, Carroll A, Conomos MP, Gabriel S, Gibbs RA, Gogarten SM, Gupta N, Jaquish CE, Johnson AD, Lewis JP, Liu X, Manning AK, Papanicolaou GJ, Pitsillides AN, Rice KM, Salerno W, Sitlani CM, Smith NL, NHLBI Trans-Omics for Precision Medicine Consortium, Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium, TOPMed Hematology and Hemostasis Working Group, CHARGE Analysis and Bioinformatics Working Group, Heckbert SR, Laurie CC, Mitchell BD, Vasan RS, Rich SS, Rotter JI, Wilson JG, Boerwinkle E, Psaty BM, Cupples LA. Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology. Nat Genet 2017; 49: 1560-1563.

33. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016; 536: 285-291.

34. Collins R. What makes UK Biobank special? Lancet 2012; 379: 1173-1174.

35. Wain LV, Shrine N, Miller S, Jackson VE, Ntalla I, Artigas MS, Billington CK, Kheirallah AK, Allen R, Cook JP, Probert K, Obeidat M, Bosse Y, Hao K, Postma DS, Pare PD, Ramasamy A, U. K. Brain Expression Consortium, Magi R, Mihailov E, Reinmaa E, Melen E, O'Connell J, Frangou E, Delaneau O, Ox G. S. K. Consortium, Freeman C, Petkova D, McCarthy M, Sayers I, Deloukas P, Hubbard R, Pavord I, Hansell AL, Thomson NC, Zeggini E, Morris AP, Marchini J, Strachan DP, Tobin MD, Hall IP. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med 2015; 3: 769-781.

36. Chen KD, Chang PT, Ping YH, Lee HC, Yeh CW, Wang PN. Gene expression profiling of peripheral blood leukocytes identifies and validates ABCB1 as a novel biomarker for Alzheimer's disease. Neurobiol Dis 2011; 43: 698-705.

37. Celedon JC, Burchard EG, Schraufnagel D, Castillo-Salgado C, Schenker M, Balmes J, Neptune E, Cummings KJ, Holguin F, Riekert KA, Wisnivesky JP, Garcia JGN, Roman J, Kittles R, Ortega VE, Redline S, Mathias R, Thomas A, Samet J, Ford JG, American Thoracic Society, the National Heart Lung and Blood Institute. An American Thoracic Society/National Heart, Lung, and Blood Institute Workshop Report: Addressing Respiratory Health Equality in the United States. Ann Am Thorac Soc 2017; 14: 814-826.

38. Kumar R, Seibold MA, Aldrich MC, Williams LK, Reiner AP, Colangelo L, Galanter J, Gignoux C, Hu D, Sen S, Choudhry S, Peterson EL, Rodriguez-Santana J, Rodriguez-Cintron W, Nalls MA, Leak TS, O'Meara E, Meibohm B, Kritchevsky SB, Li R, Harris TB, Nickerson DA, Fornage M, Enright P, Ziv E, Smith LJ, Liu K, Burchard EG. Genetic ancestry in lung-function predictions. N Engl J Med 2010; 363: 321-330.

39. Torgerson DG, Ampleford EJ, Chiu GY, Gauderman WJ, Gignoux CR, Graves PE, Himes BE, Levin AM, Mathias RA, Hancock DB, Baurley JW, Eng C, Stern DA, Celedon JC, Rafaels N, Capurso D, Conti DV, Roth LA, Soto-Quiros M, Togias A, Li X, Myers RA, Romieu I, Van Den Berg DJ, Hu D, Hansel NN, Hernandez RD, Israel E, Salam MT, Galanter J, Avila PC, Avila L, Rodriquez-Santana JR, Chapela R, Rodriguez-Cintron W, Diette GB, Adkinson NF, Abel RA, Ross KD, Shi M, Faruque MU, Dunston GM, Watson HR, Mantese VJ, Ezurum SC, Liang L, Ruczinski I, Ford JG, Huntsman S, Chung KF, Vora H, Li X, Calhoun WJ, Castro M, Sienra-Monge JJ, del Rio-Navarro B, Deichmann KA, Heinzmann A, Wenzel SE, Busse WW, Gern JE, Lemanske RF, Jr., Beaty TH, Bleecker ER, Raby BA, Meyers DA, London SJ, Mexico City Childhood Asthma Study, Gilliland FD, Children's Health Study and Harbors study, Burchard EG, Genetics of Asthma in Latino Americans Study, Study of Genes-Environment and Admixture in Latino Americans, Study of African Americans Asthma Genes and Environments, Martinez FD, Childhood Asthma Research and Education Network, Weiss ST, Childhood Asthma Management Program, Williams LK, Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity, Barnes KC, Genetic Research on Asthma in African Diaspora Study, Ober C, Nicolae DL. Meta-analysis of genome-wide association studies of asthma in ethnically diverse North American populations. Nat Genet 2011; 43: 887-892.

40. Manichaikul A, Hoffman EA, Smolonska J, Gao W, Cho MH, Baumhauer H, Budoff M, Austin JH, Washko GR, Carr JJ, Kaufman JD, Pottinger T, Powell CA, Wijmenga C, Zanen P, Groen HJ, Postma DS, Wanner A, Rouhani FN, Brantly ML, Powell R, Smith BM, Rabinowitz D, Raffel LJ, Hinckley Stukovsky KD, Crapo JD, Beaty TH, Hokanson JE, Silverman EK, Dupuis J, O'Connor GT, Boezen HM, Rich SS, Barr RG. Genome-wide study of percent emphysema on computed tomography in the general population. The Multi-Ethnic Study of Atherosclerosis Lung/SNP Health Association Resource Study. Am J Respir Crit Care Med 2014; 189: 408-418.

41. Cade BE, Chen H, Stilp AM, Gleason KJ, Sofer T, Ancoli-Israel S, Arens R, Bell GI, Below JE, Bjonnes AC, Chun S, Conomos MP, Evans DS, Johnson WC, Frazier-Wood AC, Lane JM, Larkin EK, Loredo JS, Post WS, Ramos AR, Rice K, Rotter JI, Shah NA, Stone KL, Taylor KD, Thornton TA, Tranah GJ, Wang C, Zee PC, Hanis CL, Sunyaev SR, Patel SR, Laurie CC, Zhu X, Saxena R, Lin X, Redline S. Genetic Associations with Obstructive Sleep Apnea Traits in Hispanic/Latino Americans. Am J Respir Crit Care Med 2016; 194: 886-897.

42. Yasuda K, Miyake K, Horikawa Y, Hara K, Osawa H, Furuta H, Hirota Y, Mori H, Jonsson A, Sato Y, Yamagata K, Hinokio Y, Wang HY, Tanahashi T, Nakamura N, Oka Y, Iwasaki N, Iwamoto Y, Yamada Y, Seino Y, Maegawa H, Kashiwagi A, Takeda J, Maeda E, Shin HD, Cho YM, Park KS, Lee HK, Ng MC, Ma RC, So WY, Chan JC, Lyssenko V, Tuomi T, Nilsson P, Groop L, Kamatani N, Sekine A, Nakamura Y, Yamamoto K, Yoshida T, Tokunaga K, Itakura M, Makino H, Nanjo K, Kadowaki T, Kasuga M. Variants in KCNQ1 are associated with susceptibility to type 2 diabetes mellitus. Nat Genet 2008; 40: 1092-1097.

43. Chung WH, Hung SI, Hong HS, Hsih MS, Yang LC, Ho HC, Wu JY, Chen YT. Medical genetics: a marker for Stevens-Johnson syndrome. Nature 2004; 428: 486.

44. Mathias RA, Taub MA, Gignoux CR, Fu W, Musharoff S, O'Connor TD, Vergara C, Torgerson DG, Pino-Yanes M, Shringarpure SS, Huang L, Rafaels N, Boorgula MP, Johnston HR, Ortega VE, Levin AM, Song W, Torres R, Padhukasahasram B, Eng C, Mejia-Mejia DA, Ferguson T, Qin ZS, Scott AF, Yazdanbakhsh M, Wilson JG, Marrugo J, Lange LA, Kumar R, Avila PC, Williams LK, Watson H, Ware LB, Olopade C, Olopade O, Oliveira R, Ober C, Nicolae DL, Meyers D, Mayorga A, Knight-Madden J, Hartert T, Hansel NN, Foreman MG, Ford JG, Faruque MU, Dunston GM, Caraballo L, Burchard EG, Bleecker E, Araujo MI, Herrera-Paz EF, Gietzen K, Grus WE, Bamshad M, Bustamante CD, Kenny EE, Hernandez RD, Beaty TH, Ruczinski I, Akey J, Caapa, Barnes KC. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat Commun 2016; 7: 12522.

45. Manrai AK, Funke BH, Rehm HL, Olesen MS, Maron BA, Szolovits P, Margulies DM, Loscalzo J, Kohane IS. Genetic Misdiagnoses and the Potential for Health Disparities. N Engl J Med 2016; 375: 655-665.

46. Bustamante CD, Burchard EG, De la Vega FM. Genomics for the world. Nature 2011; 475: 163-165.

47. Strouse JJ, Lobner K, Lanzkron S, Haywood C. NIH and National Foundation Expenditures For Sickle Cell Disease and Cystic Fibrosis Are Associated With Pubmed Publications and FDA Approvals [abstract]. Blood 2013; 122: 1739.

48. Colby SL, Ortman JM. Projections of the Size and Composition of the U.S. Population: 2014 to 2060. Washington, D.C.: U.S. Census Bureau; 2015. p. P25-1143.

49. Paltoo DN, Rodriguez LL, Feolo M, Gillanders E, Ramos EM, Rutter JL, Sherry S, Wang VO, Bailey A, Baker R, Caulder M, Harris EL, Langlais K, Leeds H, Luetkemeier E, Paine T, Roomian T, Tryka K, Patterson A, Green ED, National Institutes of Health Genomic Data Sharing Governance Committees. Data use under the NIH GWAS data sharing policy and future directions. Nat Genet 2014; 46: 934-938.

50. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007; 39: 1181-1186.

51. Kalia SS, Adelman K, Bale SJ, Chung WK, Eng C, Evans JP, Herman GE, Hufnagel SB, Klein TE, Korf BR, McKelvey KD, Ormond KE, Richards CS, Vlangos CN, Watson M, Martin CL, Miller DT. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet Med 2017; 19: 249-255.

52. Goodman JL, Amendola LM, Horike-Pyne M, Trinidad SB, Fullerton SM, Burke W, Jarvik GP. Discordance in selected designee for return of genomic findings in the event of participant death and estate executor. Mol Genet Genomic Med 2017; 5: 172-176.

53. Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, Gupta N, Neale BM, Daly MJ, Sklar P, Sullivan PF, Bergen S, Moran JL, Hultman CM, Lichtenstein P, Magnusson P, Purcell SM, Haas DW, Liang L, Sunyaev S, Patterson N, de Bakker PI, Reich D, Price AL. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 2012; 44: 631-635.

54. Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, Shang L, Boisson B, Casanova JL, Abel L. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A 2015; 112: 5473-5478.

55. Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, Harrell TM, McMillin MJ, Wiszniewski W, Gambin T, Coban Akdemir ZH, Doheny K, Scott AF, Avramopoulos D, Chakravarti A, Hoover-Fong J, Mathews D, Witmer PD, Ling H, Hetrick K, Watkins L, Patterson KE, Reinier F, Blue E, Muzny D, Kircher M, Bilguvar K, Lopez-Giraldez F, Sutton VR, Tabor HK, Leal SM, Gunel M, Mane S, Gibbs RA, Boerwinkle E, Hamosh A, Shendure J, Lupski JR, Lifton RP, Valle D, Nickerson DA, Centers for Mendelian Genomics, Bamshad MJ. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am J Hum Genet 2015; 97: 199-215.

56. Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC, Lee C, Turner EH, Smith JD, Rieder MJ, Yoshiura K, Matsumoto N, Ohta T, Niikawa N, Nickerson DA, Bamshad MJ, Shendure J. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 2010; 42: 790-793.

57. MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, Adams DR, Altman RB, Antonarakis SE, Ashley EA, Barrett JC, Biesecker LG, Conrad DF, Cooper GM, Cox NJ, Daly MJ, Gerstein MB, Goldstein DB, Hirschhorn JN, Leal SM, Pennacchio LA, Stamatoyannopoulos JA, Sunyaev SR, Valle D, Voight BF, Winckler W, Gunter C. Guidelines for investigating causality of sequence variants in human disease. Nature 2014; 508: 469-476.

58. Auer PL, Reiner AP, Wang G, Kang HM, Abecasis GR, Altshuler D, Bamshad MJ, Nickerson DA, Tracy RP, Rich SS, NHLBI GO Exome Sequencing Project, Leal SM. Guidelines for Large-Scale Sequence-Based Complex Trait Association Studies: Lessons Learned from the NHLBI Exome Sequencing Project. Am J Hum Genet 2016; 99: 791-801.

59. Morrison AC, Voorman A, Johnson AD, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, Bis J, Heiss G, O'Donnell CJ, Psaty BM, Cupples LA, Gibbs R, Boerwinkle E, Cohorts for H, Aging Research in Genetic Epidemiology C. Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nat Genet 2013; 45: 899-901.

60. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C. A systematic survey of loss-of-function variants in human protein-coding genes. Science 2012; 335: 823-828.

61. Wang Z, Sadovnick AD, Traboulsee AL, Ross JP, Bernales CQ, Encarnacion M, Yee IM, de Lemos M, Greenwood T, Lee JD, Wright G, Ross CJ, Zhang S, Song W, Vilarino-Guell C. Nuclear Receptor NR1H3 in Familial Multiple Sclerosis. Neuron 2016; 90: 948-954.

62. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A 2012; 109: 1193-1198.

63. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 2011; 8: 469-477.

64. Koch CM, Chiu SF, Akbarpour M, Bharat A, Ridge KM, Bartom ET, Winter DR. A Beginner's Guide to Analysis of RNA-seq Data. Am J Respir Cell Mol Biol 2018; 59: 145-157.

65. Lei R, Ye K, Gu Z, Sun X. Diminishing returns in next-generation sequencing (NGS) transcriptome data. Gene 2015; 557: 82-87.

66. Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H. Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics 2011; 12 Suppl 10: S5.

67. Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res 2011; 21: 2213-2223.

68. Liu Y, Ferguson JF, Xue C, Silverman IM, Gregory B, Reilly MP, Li M. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 2013; 8: e66883.

69. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 2014; 30: 301-304.

70. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet 2010; 11: 31-46.

71. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016; 17: 333-351.

72. Raz T, Kapranov P, Lipson D, Letovsky S, Milos PM, Thompson JF. Protocol dependence of sequencing-based gene expression measurements. PLoS One 2011; 6: e19287.

73. Head SR, Komori HK, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR, Ordoukhanian P. Library construction for next-generation sequencing: overviews and challenges. Biotechniques 2014; 56: 61-64, 66, 68, passim.

74. McCall MN, Illei PB, Halushka MK. Complex Sources of Variation in Tissue Expression Data: Analysis of the GTEx Lung Transcriptome. Am J Hum Genet 2016; 99: 624-635.

75. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013; 14: 91.

76. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 2012; 99: 248-256.

77. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics 2012; 13: 484.

78. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol 2013; 14: R95.

79. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 2014; 32: 903-914.

80. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29: 365-371.

81. Mardis ER. The $1,000 genome, the $100,000 analysis? Genome Med 2010; 2: 84.

82. Brookes AJ, Robinson PN. Human genotype-phenotype databases: aims, challenges and opportunities. Nat Rev Genet 2015; 16: 702-715.

83. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big Data: Astronomical or Genomical? PLoS Biol 2015; 13: e1002195.

84. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011; 43: 491-498.

85. Himes BE, Koziol-White C, Johnson M, Nikolos C, Jester W, Klanderman B, Litonjua AA, Tantisira KG, Truskowski K, MacDonald K, Panettieri RA, Jr., Weiss ST. Vitamin D Modulates Expression of the Airway Smooth Muscle Transcriptome in Fatal Asthma. PLoS One 2015; 10: e0134057.

86. Spencer CC, Su Z, Donnelly P, Marchini J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 2009; 5: e1000477.

87. Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 2014; 95: 5-23.

88. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, NHLBI GO Exome Sequencing Project-ESP Lung Project Team, Christiani DC, Wurfel MM, Lin