The Causes of Variation. Lindon Eaves and Tim York. Boulder, CO, March 2001



One Issue (Among Many!)
- Identifying genes that cause complex diseases and genes that contribute to variation in quantitative traits

Quantitative Trait Locus (QTL)
- Any gene whose contribution to variation in a quantitative trait is large enough to stand out against the background noise of other genetic and environmental factors

Quantitative Trait
- A continuously variable trait (in which variation may be caused by multiple genetic and/or environmental factors)
- Any categorical trait in which differences between categories may be mapped onto variation in a continuous trait

Common diseases
- Estimated lifetime risk c. 60%
- Substantial genetic component
- Non-Mendelian inheritance
- Non-genetic risk factors
- Multiple interacting pathways
- Most genes still not mapped

Examples
- Ischaemic heart disease (30-50%, F-M)
- Breast cancer (12%, F)
- Colorectal cancer (5%)
- Recurrent major depression (10%)
- ADHD (5%)
- Non-insulin-dependent diabetes (5%)
- Essential hypertension (10-25%)

Even for simple diseases: number of alleles is large (Wright et al., 1999)
- Ischaemic heart disease (LDLR): >190
- Breast cancer (BRCA1): >300
- Colorectal cancer (MLH1): >140

Definitions
- Locus: one of c ,000 genes
- Allele: one of several variants of a specific gene
- Gene: a sequence of DNA that codes for a specific function
- Base pair: chemical "letter" of the genome (a gene has many 1000s of base pairs)
- Genome: all the genes considered together

Finding QTLs
- Linkage
- Association

Linkage
- Finds QTLs by correlating phenotypic similarity with genetic similarity (IBD) in specific parts of the genome

Linkage
- Doesn't depend on guessing the gene
- Works over broad regions (good for getting in the right ball-park) and the whole genome (genome scan)
- Only detects large effects (>10%)
- Requires large samples (10,000s?)
- Can't guarantee close to gene

Association
- Looks for correlation between specific alleles and phenotype (trait value, disease risk)

Association
- More sensitive to small effects
- Need to guess gene/alleles (candidate gene) or be close enough for linkage disequilibrium with nearby loci
- May get spurious association (stratification); need to have genetic controls to be convinced

Reality: for complex disorders and quantitative traits
- Large number of alleles at large number of genes

Defining the Haystack
- 3 x 10^9 base pairs
- Markers every 6-10 kb for association in populations with no recent bottleneck history
- 1 SNP per 721 b.p. (Wang et al., 1998)
- c. 14 SNPs per 10 kb = 1000s haplotypes/alleles
- O ( ) genes

Problems
- Large number of loci and alleles/haplotypes
- Possible interactions between genes
- Possible interactions between genes and environment
- Relatively low frequencies of individual risk factors
- Functional form of genotype-phenotype relations not known
- Sorting out signal from noise: minimizing errors within budget
- Scaling of phenotype (continuous, discontinuous)
- Spurious association (stratification)

Prepare for the worst
- Need statistical approaches that can screen enormous numbers of loci and alleles to identify reliably those that have an impact on risk of disease

System Chosen for Study
- 100 loci
- 20 loci affect outcome, 80 nuisance genes
- 257 alleles/locus
- Allele frequencies c %
- Disease genes each explain 2.5% variance in risk (c. 2-fold risk increase)
- 40% rarest alleles increase risk
- 50% variance non-genetic

It's a Mess!
- Don't know which genes (might have clues)
- Don't know which alleles (unordered categories)
- > locus/allele combinations
- More predictor combinations than people (curse of dimensionality)
- Reality worse

Problems
- Informatics: large volume of data
- Computational: large number of combinations
- Statistical: large number of chance associations
- Genetic-epidemiological: secondary associations

How are we going to figure it out?
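
The figures on the "Defining the Haystack" slide can be checked with a few lines of arithmetic. The sketch below is only a rough illustration: it uses the numbers quoted on the slide and adds one assumption that is not in the slides, namely that every SNP has two alleles and that all combinations can occur. Real haplotype counts are far lower because of linkage disequilibrium, so the last figure is an upper bound only.

```python
# Figures quoted on the "Defining the Haystack" slide.
GENOME_BP = 3 * 10**9   # base pairs in the genome
BP_PER_SNP = 721        # one SNP per 721 b.p. (Wang et al., 1998)
WINDOW_BP = 10_000      # a 10 kb window

# Expected number of SNPs in a 10 kb window and in the whole genome.
snps_per_window = WINDOW_BP / BP_PER_SNP
snps_in_genome = GENOME_BP / BP_PER_SNP

# ASSUMPTION (not from the slides): each SNP is biallelic and all
# combinations are possible, so n SNPs give at most 2**n haplotypes.
n_snps = round(snps_per_window)
max_haplotypes = 2 ** n_snps

print(f"SNPs per 10 kb window: {snps_per_window:.1f}")
print(f"SNPs in the genome:    {snps_in_genome:,.0f}")
print(f"Upper bound on haplotypes from {n_snps} SNPs: {max_haplotypes:,}")
```

The window count comes out close to the slide's "c. 14 SNPs per 10 kb", and the upper bound on haplotypes is in the thousands, which is consistent with "1000s haplotypes/alleles".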
Data Mining (Steinberg and Cartel)
- Attempt to discover possibly very complex structure in huge databases (large number of records and large number of variables)
- Problems include classification, regression, clustering, association (market analysis)
- Need tools to partially or fully automate the discovery process
- Large databases support search for rare but important patterns and interactions (epistasis, GxE)

Some Approaches to DM
- Logistic regression
- Neural networks
- CART (Breiman et al., 1984)
- MARS (Friedman, 1991)

MARS
- Multivariate Adaptive Regression Splines

Key references
- Friedman, J.H. (1991) Multivariate Adaptive Regression Splines (with discussion), Annals of Statistics, 19:
- Steinberg, D., Bernstein, B., Colla, P., Martin, K., Friedman, J.H. (1999) MARS User Guide. San Diego, CA: Salford Systems

The MARS Advantage
- Allows large number of predictors (loci/alleles/environments) to be screened
- Non-parametric
- Continuous and discontinuous outcomes
- Systematic search for detailed interactions
- Testing and cross-validation
- Continuous and categorical predictors
- Decides best form of relationship

Example Regression Spline: Impact of Non-Retail Business on Median Boston House Prices
[Figure: median house price against industrial business, with the knot marked]

Fitting functions with Splines
- Piece-wise linear regression: the simplest form; allows the regression to bend.
- Knots define where the function changes behavior.
- Local fit vs. global fit.
[Figure: actual data; spline with 3 knots]

One predictor example
- True knots at 20 and 45 (left)
- Best single knot at about 35 (right)
[Figure: two panels, Y against X]

Re-express variables as basis functions
- Done to generalize the search for knots.
- Difficult to illustrate splines with > one dimension.
- Core building block of MARS model: max(0, X - c); example: BF1 = max(0, ENV - 5); BF2 = max(0, ENV - 8); 0 for ENV
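
The basis functions BF1 and BF2 named above can be sketched in a few lines of Python. This is a minimal illustration, not the MARS software itself: the knots 5 and 8 come from the slide, while the helper names (`hinge`, `bf1`, `bf2`, `predict`) and the coefficients `B0`, `B1`, `B2` are hypothetical values chosen only to show how the fitted line bends at each knot.

```python
def hinge(x, knot):
    """Basis function of the form max(0, x - knot)."""
    return max(0.0, x - knot)


def bf1(env):
    # BF1 = max(0, ENV - 5): zero up to the knot at 5, then rises linearly.
    return hinge(env, 5)


def bf2(env):
    # BF2 = max(0, ENV - 8): zero up to the knot at 8, then rises linearly.
    return hinge(env, 8)


# HYPOTHETICAL coefficients, for illustration only (not from the slides).
B0, B1, B2 = 1.0, 2.0, -3.0


def predict(env):
    """Piece-wise linear model built from the two basis functions."""
    return B0 + B1 * bf1(env) + B2 * bf2(env)


for env in (0, 5, 6, 8, 10):
    print(env, bf1(env), bf2(env), predict(env))
```

With these illustrative coefficients the prediction is flat below 5, has slope `B1` between 5 and 8, and slope `B1 + B2` above 8, so each knot marks a point where the regression changes direction.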