introduction to gene mapping

ASSOCIATION MAPPING

INTRODUCTION TO GENE MAPPING: Association Mapping

- Seeks to identify specific functional variants (i.e. loci, alleles) linked to phenotypic differences in a trait, to facilitate detection of trait-causing DNA sequence polymorphisms and/or selection of genotypes that closely resemble the phenotype.
- Also known as Linkage Disequilibrium (LD) mapping or Association Genetics; it is a population-based survey used to identify trait-marker relationships based on LD.

Leilani Nora, Assistant Scientist

ASSOCIATION VS. QTL MAPPING

Attribute: Detection goal
- QTL mapping: Quantitative trait locus, a wide region within specific pedigrees
- Association genetics: Quantitative trait nucleotide, as physically close as possible to causative sequences

Attribute: Resolution of causative trait polymorphism
- QTL mapping: Low; moderate-density linkage maps only required
- Association genetics: High; disequilibrium within small physical regions, requiring many markers

Attribute: Experimental populations
- QTL mapping: Defined pedigrees, e.g. backcross, F2, RI, three- and two-generation pedigrees
- Association genetics: LD experiments; unrelated individuals (unstructured populations), large numbers of small unrelated families

Attribute: Marker discovery costs
- QTL mapping: Moderate
- Association genetics: Moderate for few traits, high for many traits

Attribute: Extent of inference
- QTL mapping: Pedigree specific, except where the species has high extant LD
- Association genetics: Species or subspecies wide

Attribute: Number of markers required for genome coverage
- QTL mapping: 10^2 to low 10^3
- Association genetics: 10^5 for small genomes, ~10^9 for large genomes

WHY ASSOCIATION GENETICS?

- The higher resolution afforded by the use of unstructured populations allows the intriguing possibility of identifying the genes, or even the specific nucleotides, underpinning trait variation.
- It offers the opportunity to use molecular markers to enhance rates of genetic gain, including the utilization of specific genes from non-elite germplasm in a more directed and efficient manner.

HARDY WEINBERG EQUILIBRIUM

HARDY WEINBERG EQUILIBRIUM (HWE)

- The Hardy-Weinberg model describes and predicts genotype and allele frequencies in a non-evolving population.
- It expresses the notion of a population in genetic equilibrium and is a basic principle of population genetics.

HWE ASSUMPTIONS

Assumption: Random mating; No migration; No mutation; No selection; Infinite population

Exception: Inbreeding / Outbreeding; Migration; Source of all variation; Directional / Disruptive / Stabilizing selection; Genetic drift

Effect: Decrease or increase in heterozygosity; Homogenize different populations; Increase heterozygosity; Reduce variation; Increase variation; Increase variation; Reduce variation

HARDY WEINBERG EQUILIBRIUM
1st Generation: Genotype and Allele Frequencies

- Consider a locus with two alleles: A and a.
- We can use the Punnett square to produce all possible combinations of these gametes (Table 1).
- Assume that in the first generation the alleles are not in HWE and the genotype frequencies are as shown in Table 2.

Table 1
Male \ Female    A     a
A                AA    Aa
a                Aa    aa

Table 2
Genotype    AA     Aa    aa
Freq        p^2    pq    q^2

where: $p^2 + 2pq + q^2 = 1$

Allele frequencies for a population not in HWE (next generation):

$P(A) = p^2 + \tfrac{1}{2}pq$

$P(a) = q^2 + \tfrac{1}{2}pq$


HARDY WEINBERG EQUILIBRIUM

- When a population is in HWE, the next generation will have the same genotype frequencies as well as the same allele frequencies.
- Genotype frequencies for a population in HWE:
  $P(AA) = P(A)P(A) = p^2$
  $P(Aa) = 2P(A)P(a) = 2pq$
  $P(aa) = P(a)P(a) = q^2$
- For example, consider a diallelic locus with alleles A and a with frequencies 0.85 and 0.15, respectively. If the locus is in HWE, calculate the genotype frequencies.
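The example with allele frequencies 0.85 and 0.15 can be worked through in a few lines of base R. This is a minimal sketch; `hwe_genotype_freq()` is an illustrative helper name, not a function from any package used in these slides.

```r
# Expected genotype frequencies under HWE for a diallelic locus.
# hwe_genotype_freq() is a hypothetical helper written for this example.
hwe_genotype_freq <- function(p) {
  stopifnot(p >= 0, p <= 1)
  q <- 1 - p
  c(AA = p^2, Aa = 2 * p * q, aa = q^2)
}

freq <- hwe_genotype_freq(0.85)
print(round(freq, 4))
#     AA     Aa     aa
# 0.7225 0.2550 0.0225
```

The three frequencies sum to 1, as required by $p^2 + 2pq + q^2 = 1$.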

HARDY WEINBERG EQUILIBRIUM
Violation of HWE Assumptions

- When a locus is not in HWE, this suggests that one or more of the Hardy-Weinberg assumptions is false.
- Departure from HWE has been used to infer the existence of natural selection, to argue for the existence of assortative (non-random) mating, and to infer genotyping errors.
- It is therefore of interest to test whether a population is in HWE at a locus.
- The two most popular ways of testing HWE:
  - Chi-square test
  - Exact test

CHI-SQUARE GOODNESS OF FIT

- Compares observed genotype counts with the values expected under Hardy-Weinberg.
- For a locus with two alleles, we might construct a table as follows:

Genotype              AA      Aa      aa
Observed              n_AA    n_Aa    n_aa
Expected under HWE    np^2    2npq    nq^2

CHI-SQUARE GOODNESS OF FIT

Test statistic for allelic association:

$$\chi^2 = \sum_{\text{genotypes}} \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}} \sim \chi^2_{(1\ \mathrm{df})} \quad \text{under } H_0$$

where:
- n is the number of individuals in the sample
- p is the probability that a random allele in the population is of type A
- q is the probability that a random allele in the population is of type a
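The goodness-of-fit computation can be reproduced by hand in base R. The sketch below uses the Snp4 genotype counts reported elsewhere in this deck (109, 105, 29); the variable names are illustrative, and the p-value here is the asymptotic one with 1 df, so it need not match a simulated p-value exactly.

```r
# Chi-square goodness-of-fit test for HWE, computed by hand.
# Counts are the Snp4 genotype counts shown in this deck (A/A, A/B, B/B).
obs <- c(AA = 109, Aa = 105, aa = 29)
n <- sum(obs)

# Estimate the allele frequency of A from the genotype counts.
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)
q <- 1 - p

# Expected genotype counts under HWE: np^2, 2npq, nq^2.
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

chi_sq <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

cat(sprintf("X-squared = %.4f, df = 1\n", chi_sq))
# X-squared = 0.2298, df = 1
```

The statistic agrees with the X-squared value that HWE.chisq() reports for Snp4; a non-significant result means there is no evidence against HWE at this locus.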

DATAFRAME: ge03d1p1.csv

Dataframe with 250 observations and 7 variables.

Read data file ge03d1p1.csv
> assoc1 [...]
> library(genetics)
> summary(Snp4)
Number of samples typed: 243 (97.2%)
Allele Frequency: (2 alleles)
   Count Proportion
A    323       0.66
B    163       0.34
NA    14         NA
Heterozygosity (Hu)  = 0.4467269
Poly. Inf. Content   = 0.3464355
Genotype Frequency:
    Count Proportion
A/A   109       0.45
A/B   105       0.43
B/B    29       0.12
NA      7         NA

> table(Snp4)
A/A A/B B/B
109 105  29

PACKAGE genetics : HWE.chisq()

Tests the null hypothesis (Ho) that Hardy-Weinberg equilibrium holds, using the chi-square method.
> HWE.chisq(x, ...)
# x    genotype or haplotype object

Illustration
> HWE.chisq(Snp4)
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: tab
X-squared = 0.2298, df = NA, p-value = 0.6657

PACKAGE genetics : HWE.exact()

Exact test of HWE for 2-allele markers.
> HWE.exact(x)
# x    genotype or haplotype object

Illustration
> HWE.exact(Snp4)
Exact Test for Hardy-Weinberg Equilibrium
data: snp4
N11 = 109, N12 = 105, N22 = 29, N1 = 323, N2 = 163, p-value = 0.666

LINKAGE DISEQUILIBRIUM

- Also known as gametic phase disequilibrium, gametic disequilibrium and allelic association.
- Non-random association of alleles at different loci.

LINKAGE DISEQUILIBRIUM

- It is the correlation between polymorphisms (e.g. SNPs) that is caused by their shared history of mutation and recombination.
- LD and linkage are related, but they are distinctly different.

LINKAGE DISEQUILIBRIUM

- Two loci, A and B, are said to be in linkage (or gametic) disequilibrium if their respective alleles do not associate independently in the studied population.
- Occurs when genotypes at the two loci are not independent of one another.
- If all polymorphisms were independent at the population level, association studies would have to examine every one of them.
- Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies.

MEASURE OF LINKAGE DISEQUILIBRIUM

- Consider two loci (A and B), each segregating for two alleles (A, a; B, b).
- There are four possible gametes (or haplotypes) present in the population:

                Locus B
Locus A     B       b       Total
A           X_AB    X_Ab    p_A
a           X_aB    X_ab    q_a
Totals      p_B     q_b     1.0

Gamete:    AB, Ab, aB, ab
Frequency: X_AB, X_Ab, X_aB, X_ab

Allele frequencies can be expressed in terms of gamete frequencies: p_A, q_a, p_B, q_b

MEASURE OF LINKAGE DISEQUILIBRIUM

If the alleles at the two loci are randomly associated with one another, then the frequencies of the four gametes are equal to the products of the allele frequencies:

                Locus B
Locus A     B                             b                                Total
A           p_AB = p_A p_B                p_A q_b = p_A (1 - p_B)          p_A
a           q_a p_B = (1 - p_A) p_B       q_a q_b = (1 - p_A)(1 - p_B)     q_a
Totals      p_B                           q_b                              1.0

In this situation there is no linkage disequilibrium, and gamete frequencies can be accurately followed using allele frequencies.

COEFFICIENT OF LD

If alleles at the two loci are not randomly associated, then there will be a deviation (D) from the expected frequencies:

                Locus B
Locus A     B                              b                              Total
A           p_AB = p_A p_B + D_AB          p_Ab = p_A q_b - D_AB          p_A
a           p_aB = q_a p_B - D_AB          p_ab = q_a q_b + D_AB          q_a
Totals      p_B                            q_b                            1.0

This parameter D is the coefficient of linkage disequilibrium, first proposed by Lewontin and Kojima (1960). The most common expression of D is:

$D_{ij} = p_{ij} - p_i p_j$ or $D_{AB} = p_{AB} - p_A p_B$

MEASURE OF LINKAGE DISEQUILIBRIUM
Normalized measure of Lewontin, D'

$$D' = \frac{D_{AB}}{D_{\max}}$$

where:

$$D_{\max} = \begin{cases} \min[p_A p_B,\ q_a q_b], & \text{if } D_{AB} < 0 \\ \min[p_A q_b,\ q_a p_B], & \text{if } D_{AB} > 0 \end{cases}$$

- In absolute value, D' varies between 0 and 1, and allows one to assess the extent of linkage disequilibrium relative to the maximum possible value it can take.
- D' will only be less than one if all four possible haplotypes are observed.

TEST OF LD
Chi-square test of linkage disequilibrium (D)

$$\chi^2 = \frac{2nD^2}{p_A(1 - p_A)\,p_B(1 - p_B)} \sim \chi^2_{(1)}$$

- The statistic is compared with the threshold value obtained from the chi-square table with 1 df at a given level of significance.
- n is the number of individuals in the population.
- If significant, D is significantly different from 0 and the population under study is in linkage disequilibrium.
- If not significant, D is not significantly different from 0 and the population under study is in linkage equilibrium.
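The definitions of D, D' and the chi-square test of LD can be put together in a short base R sketch. The function name `ld_from_haplotypes()`, the haplotype frequencies and the sample size n = 100 are all assumptions made for illustration; they do not come from the slides or from any package.

```r
# D, D' and the chi-square test of LD from four haplotype (gamete) frequencies.
# ld_from_haplotypes() is a hypothetical helper; n is the number of individuals.
ld_from_haplotypes <- function(x_AB, x_Ab, x_aB, x_ab, n) {
  stopifnot(isTRUE(all.equal(x_AB + x_Ab + x_aB + x_ab, 1)))
  p_A <- x_AB + x_Ab; q_a <- 1 - p_A
  p_B <- x_AB + x_aB; q_b <- 1 - p_B

  D <- x_AB - p_A * p_B                       # D_AB = p_AB - p_A p_B
  D_max <- if (D < 0) min(p_A * p_B, q_a * q_b) else min(p_A * q_b, q_a * p_B)
  D_prime <- if (D_max > 0) D / D_max else NA_real_

  chi_sq <- 2 * n * D^2 / (p_A * q_a * p_B * q_b)
  list(D = D, D_prime = D_prime, chi_sq = chi_sq,
       p_value = pchisq(chi_sq, df = 1, lower.tail = FALSE))
}

# Made-up frequencies in which all four haplotypes are observed.
res <- ld_from_haplotypes(x_AB = 0.4, x_Ab = 0.1, x_aB = 0.2, x_ab = 0.3, n = 100)
print(unlist(res[c("D", "D_prime", "chi_sq")]))
```

With these assumed inputs D is 0.1 and D' is 0.5; because all four haplotypes are present, D' stays below 1.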

MEASURE OF LINKAGE DISEQUILIBRIUM
Correlation between A and B alleles, Δ² or r²

$$r^2 = \frac{D_{AB}^2}{p_A(1 - p_A)\,p_B(1 - p_B)} = \frac{\chi^2}{2n}$$

- If allele frequencies are equal, then r² varies between 0 and 1:
  - 1 when the two markers provide identical information
  - 0 when they are in perfect equilibrium
- As for D, the maximum value of r² depends on the allele frequencies, and one can determine an r' value in a manner analogous to D'.

ILLUSTRATION OF LD
Figure 1: Completely Correlated

- Shows an example of LD where the two polymorphisms are completely correlated with one another.
- Two linked mutations occur at a similar point in time and no recombination has occurred between the sites.
- In this case, the history of mutation and recombination for the sites is the same.

ILLUSTRATION OF LD
Figure 2: Not Completely Correlated

- Polymorphisms are not completely correlated, but there is no evidence of recombination.
- This type of LD structure develops when mutations occur on different allelic lineages.
- This is the situation in which r² and D' act differently, with D' still equal to 1, but where r² can be much smaller.

LD ANALYSIS IN R: LD()

Computes pairwise linkage disequilibrium between genetic markers.

Usage
> LD(g1, g2, ...)
# g1    genotype object or dataframe containing genotype objects
# g2    genotype object (ignored if g1 is a dataframe)
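A small worked example may make the contrast between D' and r² concrete. The haplotype frequencies below are assumed purely for illustration (they are not from the slides): three haplotypes are present and aB is absent, as in the no-recombination situation of Figure 2.

```latex
\begin{align*}
x_{AB} &= 0.5,\quad x_{Ab} = 0.3,\quad x_{aB} = 0,\quad x_{ab} = 0.2 \\
p_A &= 0.8,\quad q_a = 0.2,\quad p_B = 0.5,\quad q_b = 0.5 \\
D_{AB} &= x_{AB} - p_A p_B = 0.5 - 0.4 = 0.1 \\
D_{\max} &= \min[p_A q_b,\ q_a p_B] = \min[0.4,\ 0.1] = 0.1 \qquad (D_{AB} > 0) \\
D' &= \frac{D_{AB}}{D_{\max}} = 1 \\
r^2 &= \frac{D_{AB}^2}{p_A q_a\, p_B q_b} = \frac{0.01}{0.04} = 0.25
\end{align*}
```

So D' reaches its maximum because one haplotype is missing, while r² stays well below 1 because the two loci have different allele frequencies.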

Sample: LD()
> library(genetics)
> Snp4 [...]

> modelQT2 [...]
> summary(modelQT2)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7433  -0.7204  -0.6715  -0.6715   1.7890

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.3749     0.2386  -5.761 8.35e-09 ***
snp4A/B        0.1585     0.3331   0.476    0.634
snp4B/B        0.2297     0.4952   0.464    0.643
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 254.91 on 242 degrees of freedom
Residual deviance: 254.58 on 240 degrees of freedom
(7 observations deleted due to missingness)
AIC: 260.58
Number of Fisher Scoring iterations: 4
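The coefficient names (snp4A/B, snp4B/B) and the binomial family in the output above are what glm() produces for a logistic regression of a binary trait on a genotype factor. The original call is not recoverable from the slide, so the following is only a sketch of that kind of model on simulated data; the data frame `dat`, its columns and the simulated effect (none) are assumptions.

```r
# Logistic regression of a binary trait on a three-level genotype factor.
# All data here are simulated; this is not the model fitted in the slides.
set.seed(1)
n <- 300
dat <- data.frame(
  snp = factor(sample(c("A/A", "A/B", "B/B"), n, replace = TRUE,
                      prob = c(0.45, 0.43, 0.12)),
               levels = c("A/A", "A/B", "B/B")),
  trait = rbinom(n, size = 1, prob = 0.2)   # trait simulated independently of snp
)

fit <- glm(trait ~ snp, family = binomial, data = dat)

# Wald tests per genotype class (A/A is the reference level).
print(summary(fit)$coefficients)

# Overall 2-df likelihood ratio test of genotype effect.
lrt <- anova(fit, test = "Chisq")
print(lrt)
```

Each non-reference coefficient is a log odds ratio relative to A/A, which is how the snp4A/B and snp4B/B rows in the slide output would be read.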

DETECTING POPULATION STRUCTURE

Non-parametric methods
- Cluster Analysis
- Multi-Dimensional Scaling
- Principal Component Analysis

Parametric method
- STRUCTURE

CLUSTER ANALYSIS

- Exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description.
- No assumptions are made concerning the number of groups or the group structure.
- Grouping is done on the basis of similarities or distances (dissimilarities).
- There are various techniques for doing this, which may give different results. Thus the researcher should consider the validity of the clusters found.

HIERARCHICAL CLUSTER ANALYSIS

Steps in Performing Agglomerative Hierarchical Clustering
1. Obtain the data matrix
2. Standardize the data matrix if needed
3. Generate the resemblance or distance matrix
4. Execute the clustering method

DISTANCE / DISSIMILARITY MATRIX

CLUSTERING METHOD

Types of Hierarchical Agglomerative Clustering
1. Single Linkage (SLINK)
2. Complete Linkage (CLINK)
3. Average Linkage (ALINK)
4. Ward's Method - minimizes the error SS
5. Centroid Method

Partitioning Methods
1. K-means clustering
2. K-centroids
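The four steps of agglomerative hierarchical clustering map directly onto base R functions. The sketch below uses simulated data (two well-separated groups of 10 individuals, an assumption for illustration) and average linkage; any of the linkage methods listed above could be passed to `hclust()` instead.

```r
# Agglomerative hierarchical clustering in four steps, on simulated data.
set.seed(42)

# 1. Obtain the data matrix (20 individuals x 4 variables, two simulated groups).
x <- rbind(matrix(rnorm(10 * 4, mean = 0), ncol = 4),
           matrix(rnorm(10 * 4, mean = 5), ncol = 4))
rownames(x) <- paste0("ind", 1:20)

# 2. Standardize the data matrix (zero mean, unit variance per column).
x_std <- scale(x)

# 3. Generate the distance matrix.
d <- dist(x_std, method = "euclidean")

# 4. Execute the clustering method (average linkage here).
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and tabulate cluster sizes.
clusters <- cutree(hc, k = 2)
print(table(clusters))
```

Other choices for `method` in `hclust()` include "single", "complete", "ward.D2" and "centroid"; different methods may give different groupings, so the resulting clusters should be checked for validity.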

DATAFRAME: AMP2009.csv

Read data file AMP2009.csv
> AMP09 [...]
> abline(v=0, lty=2)
> abline(h=0, lty=2)

PRINCIPAL COMPONENT ANALYSIS

- Data-analytic method which provides a specific set of projections that represent a given data set in fewer dimensions.
- Used to transform correlated variables into uncorrelated ones, in other words to sphere the data.
- The final rationale for this technique is that it finds the linear combinations of the data which have relatively large (or relatively small) variability.

PRINCIPAL COMPONENT ANALYSIS : prcomp()

Performs a principal component analysis on the given data matrix and returns the results as an object of class prcomp.

> prcomp(x, center=TRUE, scale.=FALSE, ...)
# x         a numeric or complex matrix (or dataframe) which provides the data for PCA
# center    a logical value indicating whether the variables should be shifted to be zero centered
# scale.    a logical value indicating whether the variables should be scaled to have unit variance before the analysis

PRINCIPAL COMPONENT ANALYSIS : predict()

A generic function for predictions from the results of various model fitting functions.
> predict(object, ...)
# object    a model object for which prediction is desired

DATA FRAME: GenoNum

Convert SNPs data to numeric
> GenoNum [...]
> str(GenoNum)
 int [1:207, 1:184] 3 3 3 2 3 2 2 2 2 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:207] "AMP001" "AMP002" "AMP003" "AMP004" ...
  ..$ : chr [1:184] "SNP1" "SNP2" "SNP3" "SNP4" ...

PRINCIPAL COMPONENT ANALYSIS : prcomp()
> PCAMP [...]
> scores [...]
> plot(PCAMP$"x"[,1], PCAMP$"x"[,2], xlab="PC1", ylab="PC2", type="p")
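Since the original commands on this slide are incomplete, here is a self-contained sketch of the same workflow with base R's `prcomp()` and `predict()`. The genotype matrix is simulated (40 individuals, 50 SNPs coded 1 to 3, mimicking the numeric coding of GenoNum); the object names `geno_num`, `pca` and `scores` are assumptions, not the names used in the slides.

```r
# PCA of a numerically coded genotype matrix (simulated for illustration).
set.seed(7)
n_ind <- 40
n_snp <- 50
geno_num <- matrix(sample(1:3, n_ind * n_snp, replace = TRUE),
                   nrow = n_ind,
                   dimnames = list(paste0("ind", seq_len(n_ind)),
                                   paste0("SNP", seq_len(n_snp))))

# Centered, unscaled PCA, as in the prcomp() usage shown in this deck.
pca <- prcomp(geno_num, center = TRUE, scale. = FALSE)

# With no new data, predict() returns the scores of the original individuals.
scores <- predict(pca)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:5], 3))

# Scatter plot of the first two components (uncomment in an interactive session):
# plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2", type = "p")
```

In real data, clusters of individuals in the PC1 versus PC2 plot are one informal indication of population structure; the simulated matrix here has none by construction.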

STRUCTURE VERSION 2.3

- Implements a model-based clustering method for inferring population structure using genotype data of unlinked markers.
- Based on Bayesian statistics, so it can accommodate prior knowledge about the population structure.
- Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed.
- To download the software, use the link below:
  http://pritch.bsd.uchicago.edu/structure.html
- For more details on how to use the software, you can contact Dr. Ken McNally.

SAMPLE OUTPUT USING STRUCTURE

Figure 1. Bar plot of estimates of Q. Each individual is represented by a single vertical line broken into K colored segments, with lengths proportional to each of the K inferred clusters. The numbers 1 to 4 correspond to the predefined populations.

SAMPLE OUTPUT USING STRUCTURE

Figure 2. Triangle plot of the Q-matrix. Each individual is represented by a colored point. The colors correspond to the prior population labels.

When K=3, the ancestry vectors can be plotted onto a triangle, as shown. For a given point, each of the three components is given by the distance to one edge of the triangle. Individuals who are in one of the corners are therefore assigned completely to one population or another.

ASSOCIATION ANALYSIS Using GenABEL

FEATURES OF PACKAGE GenABEL

- Specifically designed for GWAS
- Provides specific facilities for storage and manipulation of large data
- Very fast tests for GWAS
- Specific functions to analyze and display the results
- More efficient than the package genetics

DESCRIPTION OF gwaa.data-class

- In GenABEL, a special data class, gwaa.data-class, is used to store GWA data.
- It includes the phenotypic and genotypic data, and the chromosome and location of every SNP.
- An object of a class has slots, which may contain actual data or objects of other classes.
- At the first level, a gwaa.data-class object has the slot phdata, which contains all the phenotypic information in a dataframe.
- The other slot is gtdata, which contains all GWA genetic information in an object of class snp.data.
- For every SNP it is desirable to know the details of coding and strand (+, -, top, bot).

EXPLORING gwaa.data-class
> library(GenABEL)
> data(ge03d2ex)
> str(ge03d2ex)
Formal class 'gwaa.data' [package "GenABEL"] with 2 slots
  ..@ phdata:'data.frame': 136 obs. of 8 variables:
  .. ..$ id : chr [1:136] "id199" "id287" "id300"...
  .. ..$ sex : int [1:136] 1 0 1 0 0 1 1 0 0 1 ...
  .....
  ..@ gtdata:Formal class 'snp.data' [package "GenABEL"] with 11 slots
  .. .. ..@ nbytes : num 34
  .. .. ..@ nids : int 136
  .. .. ..@ nsnps : int 4000
  .. .. ..@ idnames : chr [1:136] "id199" "id287"...
  .. .. ..@ snpnames : chr [1:4000] "rs7435137"...
  .. .. ..@ chromosome: Factor w/ 4 levels "1","2","3","X ....

STRUCTURE OF gwaa.data-class

EXPLORING gwaa.data-class
# Summary of phenotype data
> summary(ge03d2ex@phdata)
# No. of people in the study
> ge03d2ex@gtdata@nids
# No. of SNPs
> ge03d2ex@gtdata@nsnps
# SNP names
> ge03d2ex@gtdata@snpnames[1:10]
# Chromosome labels
> ge03d2ex@gtdata@chromosome[1:10]
# SNP map position/location
> ge03d2ex@gtdata@map[1:10]

IMPORT DATA TO GenABEL

To import data to GenABEL, you need to prepare two files:
- Phenotypic data
- Genotypic data

Description of the phenotypic data file:
- The first line must consist of the variable names.
- The first column must contain the unique ID, named id.
- The second column should be named sex (0=female, 1=male).
- The other columns in the file should contain phenotypic information.
- Missing values should be coded as NA.

IMPORT DATA TO GenABEL

Example of a phenotypic file: Pheno.csv

Save this file as Pheno.dat

IMPORT DATA TO GenABEL

Description of the genotypic data file:
- For every SNP, information on map position, chromosome, and strand should be provided.
- For every individual, every SNP genotype should be provided.
- GenABEL provides a number of functions to convert these data from different formats to the internal GenABEL raw format:
> convert.snp.illumina()
> convert.snp.tped()
> convert.snp.ped()
> convert.snp.text()

IMPORT DATA TO GenABEL: convert.snp.text()

Converts a genotypic data file to a file in the raw internal data format.
> convert.snp.text(infile, outfile, ...)
# infile     input data file
#            1st line: contains IDs
#            2nd line: names of all SNPs
#            3rd line: list of chromosomes the SNPs belong to
#            4th line: genomic position of the SNPs
#            5th line: genetic data
# outfile    output data file

IMPORT DATA TO GenABEL: convert.snp.text()

Sample genotypic data file: Geno.csv

Save this file as Geno.dat

IMPORT DATA TO GenABEL: load.gwaa.data()

Loads data (genotypes and phenotypes) from files into a gwaa.data object.
> load.gwaa.data(phenofile="pheno.dat", genofile="geno.raw", sort=TRUE)
# phenofile    data table with phenotypes
# genofile     internally formatted genotypic data file, produced using convert.snp.text
# sort         logical value indicating whether SNPs should be sorted in ascending order according to chromosome and position

IMPORT DATA TO GenABEL: load.gwaa.data()
> convert.snp.text("Geno.dat", "Geno.raw")
> genphen [...]

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(data, by.var)
# data      an object of snp.data-class or gwaa.data-class
# by.var    a binary trait; a separate analysis is done for each group

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex)
          No     Mean      SD
id       136       NA      NA
sex      136    0.529   0.501
age      136   49.069  12.926
dm2      136    0.632   0.484
height   135  169.440   9.814
weight   135   87.397  25.510
diet     136    0.059   0.236
bmi      135   30.301   8.082

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex, by=ge03d2ex@phdata$dm2)
        No(by.var=0)     Mean      SD  No(by.var=1)     Mean      SD    Ptt    Pkw  Pexact
id                50       NA      NA            86       NA      NA     NA     NA      NA
sex               50    0.420   0.499            86    0.593   0.494  0.053  0.052   0.074
age               50   47.038  13.971            86   50.250  12.206  0.179  0.205      NA
dm2               50       NA      NA            86       NA      NA     NA     NA      NA
height            49  167.671   8.586            86  170.448  10.362  0.097  0.141      NA
weight            49   76.534  17.441            86   93.587  27.337  0.000  0.000      NA
diet              50    0.060   0.240            86    0.058   0.235  0.965  0.965   1.000
bmi               49   27.304   6.463            86   32.008   8.441  0.000  0.001      NA

DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()

Generates descriptive summary tables for genotypic data.
> descriptives.marker(data, digits)
# data      an object of snp.data-class or gwaa.data-class
# digits    number of digits to be printed

DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()
> descriptives.marker(ge03d2ex)
$'Cumulative distr. of different alpha'
X pop pop[1] [26] [51] [76] [101] [126] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

STRATIFIED ASSOCIATION
> data1.sa [...]
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,4))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(origdata, col="red", cex=1.2)

STRATIFIED ASSOCIATION Comparison of Structured Association analysis

PCA USING PRICE'S METHOD: egscore()

Fast score test for association (FASTA) between a trait and genetic polymorphism, adjusted for possible stratification by principal components.

> egscore(formula, data, kin)
# formula    formula describing fixed effects; (y ~ a + b) means the outcome y depends on two covariates, a and b
# data       an object of gwaa.data-class
# kin        kinship matrix as returned by ibs

PCA USING PRICE'S METHOD: egscore()
> data1.eg [...]
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,5))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(data1.eg, col="red", cex=1.3)

[Figure: association plot; x-axis: Chromosome (1, 2, 3, X)]

THANK YOU!
