Introduction to QTL mapping in experimental crosses
Karl W Broman
Department of Biostatistics
The Johns Hopkins University
http://biosun01.biostat.jhsph.edu/~kbroman
Outline
• Experiments and data
• Models
• ANOVA at marker loci
• Interval mapping
• LOD thresholds
• LOD support intervals
• Power to detect QTLs
• How many markers/mice?
• Selection bias
• Errors in the map
• Genotyping errors
• Selective genotyping
• Covariates
• Non-normal traits
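Backcross experiment
[Figure: backcross breeding scheme. Parental strains P1 (AA) and P2 (BB) give the F1 (AB), followed by the backcross (BC) generation.]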
Intercross experiment
[Figure: intercross breeding scheme. Parental strains P1 (AA) and P2 (BB) give the F1 (AB), followed by the F2 generation.]
Phenotype distributions
• Within each of the parental and F1 strains, individuals are genetically identical.
• Environmental variation may or may not be constant with genotype.
• The backcross generation exhibits genetic as well as environmental variation.
[Figure: phenotype distributions in three panels (Parental strains, F1 generation, Backcross generation). x-axis: Phenotype, 0 to 100.]
Data and Goals
Phenotypes: $y_i$ = phenotype for mouse $i$
Genotypes: $x_{ij}$ = 1/0 if mouse $i$ is BB/AB at marker $j$ (for a backcross)
Genetic map: Locations of markers
Goals:
• Identify the genomic regions (QTLs), or at least one of them, that contribute to variation in the phenotype.
• Form confidence intervals for QTL locations.
• Estimate QTL effects.
[Figure: Genetic map. x-axis: Chromosome (1 to 19, X); y-axis: Location (cM), 0 to 80.]
[Figure: Missing genotypes. x-axis: Markers, grouped by chromosome (1 to 19, X); y-axis: Individuals.]
Models: Recombination
We assume: Mendel's rules; no crossover interference.
$\Pr(x_{ij} = 0) = \Pr(x_{ij} = 1) = 1/2$
Locations of crossovers are according to a Poisson process.
The $\{x_{ij}\}$ form a Markov chain with transition probabilities:
$\Pr(x_{i,j+1} = 0 \mid x_{ij} = 1) = \Pr(x_{i,j+1} = 1 \mid x_{ij} = 0) = r$ = recombination fraction $= (1 - e^{-2d})/2$
$d$ is the genetic distance in Morgans.
Markov chain: $\Pr(x_{i,j+1} \mid x_{ij}, x_{i,j-1}, \ldots, x_{i1}) = \Pr(x_{i,j+1} \mid x_{ij})$
Models: Genotype → Phenotype
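Let $y$ = phenotype
$g$ = whole genome genotype
Imagine a small number of QTLs with genotypes $g_1, \ldots, g_p$. ($2^p$ distinct genotypes)
$E(y \mid g) = \mu_{g_1,\ldots,g_p}$    $\mathrm{var}(y \mid g) = \sigma^2_{g_1,\ldots,g_p}$
Homoscedasticity (constant variance): $\sigma^2_g \equiv \sigma^2$
Normally distributed residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$
Additivity: $\mu_{g_1,\ldots,g_p} = \mu + \sum_{j=1}^{p} \Delta_j g_j$  ($g_j$ = 0 or 1)
Epistasis: Any deviations from additivity.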
The simplest method: ANOVA
• Also known as marker regression.
• Split mice into groups according to genotype at a marker.
• Do a t-test / ANOVA.
• Repeat for each marker.
[Figure: phenotype (y-axis, 40 to 80) by genotype group (BB, AB) at two markers, D1M30 and D2M99.]
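The marker-by-marker procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the 1/0/None genotype coding, and the toy data are all assumptions made for the example, and the test is a pooled-variance two-sample t statistic.

```python
import math
import random

def two_sample_t(group_a, group_b):
    """Pooled-variance two-sample t statistic (assumes equal variances)."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in group_a)
    ss_b = sum((v - mean_b) ** 2 for v in group_b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def marker_scan(phenotypes, genotypes):
    """Return one t statistic per marker.

    genotypes[i][j] is 1 (BB), 0 (AB), or None (missing) for mouse i at marker j.
    Mice with a missing genotype at a marker are left out of that marker's test.
    """
    stats = []
    for j in range(len(genotypes[0])):
        bb = [y for y, g in zip(phenotypes, genotypes) if g[j] == 1]
        ab = [y for y, g in zip(phenotypes, genotypes) if g[j] == 0]
        stats.append(two_sample_t(bb, ab))
    return stats

# Hypothetical toy backcross: 200 mice, 2 unlinked markers;
# only marker 0 affects the trait (effect 8, residual SD 10).
rng = random.Random(1)
genotypes = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(200)]
phenotypes = [50 + 8 * g[0] + rng.gauss(0, 10) for g in genotypes]
t_stats = marker_scan(phenotypes, genotypes)
print([round(t, 2) for t in t_stats])
```

Dropping mice with a missing genotype at each marker keeps the sketch simple, at the cost of discarding data.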
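Effect at a marker
Consider the case of a single QTL with effect $\Delta = \mu_{BB} - \mu_{AB}$.
Consider a marker linked to the QTL, with $r$ = recombination fraction.
Of individuals with marker genotype BB, mean phenotype is:
$\mu_{BB}(1 - r) + \mu_{AB}\,r = \mu_{BB} - r\Delta$
Of individuals with marker genotype AB, mean phenotype is:
$\mu_{AB}(1 - r) + \mu_{BB}\,r = \mu_{AB} + r\Delta$
Difference: $(\mu_{BB} - r\Delta) - (\mu_{AB} + r\Delta) = \Delta(1 - 2r)$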
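A quick numerical check of the attenuation factor $(1 - 2r)$ may help. The genotype means below are hypothetical values chosen only for illustration; the function simply evaluates the two marker-group means given above.

```python
def marker_group_means(mu_bb, mu_ab, r):
    """Mean phenotype among mice with marker genotype BB and AB, respectively,
    when the marker is at recombination fraction r from the QTL."""
    mean_marker_bb = mu_bb * (1 - r) + mu_ab * r
    mean_marker_ab = mu_ab * (1 - r) + mu_bb * r
    return mean_marker_bb, mean_marker_ab

mu_bb, mu_ab = 60.0, 50.0          # hypothetical QTL genotype means
delta = mu_bb - mu_ab              # QTL effect
for r in (0.0, 0.1, 0.25, 0.5):
    m_bb, m_ab = marker_group_means(mu_bb, mu_ab, r)
    # Observed marker difference next to the closed form delta * (1 - 2r)
    print(r, m_bb - m_ab, delta * (1 - 2 * r))
```

At $r = 0$ the marker shows the full effect; at $r = 1/2$ (an unlinked marker) the difference vanishes.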
ANOVA at marker loci
Advantages
• Simple.
• Easily incorporates covariates.
• Easily extended to more complex models.
• Doesn't require a genetic map.
Disadvantages
• Must exclude individuals with missing genotype data.
• Imperfect information about QTL location.
• Suffers in low density scans.
• Only considers one QTL at a time.
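Interval mapping (IM)
Lander & Botstein (1989)
• Assume a single QTL model.
• Each position in the genome, one at a time, is posited as the putative QTL.
• Let $z$ = 1/0 if the (unobserved) QTL genotype is BB/AB. Assume $y = \mu + \Delta z + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$.
• Given genotypes at linked markers, $y \sim$ mixture of normal dist'ns with mixing proportion $\Pr(z = 1 \mid \text{marker data})$:

Marker genotypes    QTL genotype
M1    M2            BB                          AB
BB    BB            $(1-r_L)(1-r_R)/(1-r)$      $r_L r_R/(1-r)$
BB    AB            $(1-r_L)\,r_R/r$            $r_L(1-r_R)/r$
AB    BB            $r_L(1-r_R)/r$              $(1-r_L)\,r_R/r$
AB    AB            $r_L r_R/(1-r)$             $(1-r_L)(1-r_R)/(1-r)$

Here $r_L$ and $r_R$ are the recombination fractions between the QTL and the left and right flanking markers, and $r$ is the recombination fraction between the two markers.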
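The conditional QTL genotype probabilities can be computed directly from the two flanking recombination fractions. The sketch below assumes no crossover interference (so map distance converts to recombination fraction as $(1 - e^{-2d})/2$); the function names and the example distances of 7 cM and 13 cM are illustrative choices.

```python
import math

def haldane_rf(d_cm):
    """Recombination fraction for a map distance in cM, assuming no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_qtl_bb(left_is_bb, right_is_bb, r_left, r_right):
    """Pr(QTL genotype is BB | flanking marker genotypes) in a backcross.

    left_is_bb / right_is_bb: True if that marker is BB, False if AB.
    r_left / r_right: recombination fractions between the QTL and each marker.
    """
    # Joint weight of (QTL = BB, observed markers); the prior 1/2 cancels.
    w_bb = ((1 - r_left) if left_is_bb else r_left) * \
           ((1 - r_right) if right_is_bb else r_right)
    # Joint weight of (QTL = AB, observed markers).
    w_ab = (r_left if left_is_bb else (1 - r_left)) * \
           (r_right if right_is_bb else (1 - r_right))
    return w_bb / (w_bb + w_ab)

# Example: markers 20 cM apart, putative QTL 7 cM from the left marker.
r_l, r_r = haldane_rf(7.0), haldane_rf(13.0)
for left, right in [(True, True), (True, False), (False, True), (False, False)]:
    label = ("BB" if left else "AB") + "/" + ("BB" if right else "AB")
    print(label, round(prob_qtl_bb(left, right, r_l, r_r), 4))
```

Normalising the two joint weights gives the same values as the table, because without interference the weights for BB/BB flanking genotypes sum to $1 - r$ and those for recombinant flanking genotypes sum to $r$.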
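The normal mixtures
[Diagram: markers M1 and M2 with the QTL between them, 7 cM from M1 and 13 cM from M2.]
• Two markers separated by 20 cM, with the QTL closer to the left marker.
• The figure at right shows the distributions of the phenotype conditional on the genotypes at the two markers.
• The dashed curves correspond to the components of the mixtures.
[Figure: four panels, one per flanking-marker genotype pair (BB/BB, BB/AB, AB/BB, AB/AB), each marking the component means µ and µ+∆. x-axis: Phenotype, 20 to 100.]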
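Interval mapping (continued)
Let $p_i = \Pr(z_i = 1 \mid \text{marker data})$
$y_i \mid z_i \sim N(\mu + \Delta z_i, \sigma^2)$
$\Pr(y_i \mid \text{marker data}, \mu, \Delta, \sigma) = p_i\, f(y_i; \mu + \Delta, \sigma) + (1 - p_i)\, f(y_i; \mu, \sigma)$
where $f(y; \mu, \sigma) = \exp[-(y - \mu)^2 / (2\sigma^2)] / \sqrt{2\pi\sigma^2}$
Log likelihood: $l(\mu, \Delta, \sigma) = \sum_i \log \Pr(y_i \mid \text{marker data}, \mu, \Delta, \sigma)$
Maximum likelihood estimates (MLEs) of $\mu$, $\Delta$, $\sigma$: values for which $l(\mu, \Delta, \sigma)$ is maximized.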
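EM algorithm
Dempster et al. (1977)
E step:
Let $w_i^{(k+1)} = \Pr(z_i = 1 \mid y_i, \text{marker data}, \hat\mu^{(k)}, \hat\Delta^{(k)}, \hat\sigma^{(k)})$
$= \dfrac{p_i\, f(y_i; \hat\mu^{(k)} + \hat\Delta^{(k)}, \hat\sigma^{(k)})}{p_i\, f(y_i; \hat\mu^{(k)} + \hat\Delta^{(k)}, \hat\sigma^{(k)}) + (1 - p_i)\, f(y_i; \hat\mu^{(k)}, \hat\sigma^{(k)})}$
M step:
Let $\hat\mu^{(k+1)} = \sum_i y_i (1 - w_i^{(k+1)}) / \sum_i (1 - w_i^{(k+1)})$
$\hat\Delta^{(k+1)} = \sum_i y_i w_i^{(k+1)} / \sum_i w_i^{(k+1)} - \hat\mu^{(k+1)}$
$\hat\sigma^{(k+1)}$ = a bit complicated
The algorithm:
Start with $w_i^{(1)} = p_i$; iterate the E & M steps until convergence.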
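The E and M steps can be sketched as follows for a single putative QTL position. The updates for $\mu$ and $\Delta$ follow the formulas above. The update for $\sigma$ is not given on the slide, so the weighted residual variance used here is an assumption (the usual form for a two-component normal mixture with common variance). Function names, the convergence rule, and the simulated data are also illustrative.

```python
import math
import random

def normal_pdf(y, mean, sd):
    return math.exp(-((y - mean) ** 2) / (2 * sd ** 2)) / math.sqrt(2 * math.pi * sd ** 2)

def em_single_position(y, p, tol=1e-8, max_iter=1000):
    """EM for the two-component mixture at one putative QTL position.

    y: phenotypes; p: p_i = Pr(z_i = 1 | marker data), not all 0 and not all 1.
    Returns (mu, delta, sigma, natural-log likelihood).
    """
    n = len(y)
    w = list(p)                      # start with w_i = p_i
    loglik_old = -math.inf
    for _ in range(max_iter):
        # M step: weighted means for the two QTL genotype groups.
        sw = sum(w)
        mu = sum(yi * (1 - wi) for yi, wi in zip(y, w)) / (n - sw)
        delta = sum(yi * wi for yi, wi in zip(y, w)) / sw - mu
        # Assumed sigma update: weighted residual variance around each group mean.
        var = sum(wi * (yi - mu - delta) ** 2 + (1 - wi) * (yi - mu) ** 2
                  for yi, wi in zip(y, w)) / n
        sigma = math.sqrt(var)
        # E step: posterior probability that each mouse has QTL genotype BB,
        # accumulating the log likelihood at the current estimates.
        loglik = 0.0
        for i in range(n):
            a = p[i] * normal_pdf(y[i], mu + delta, sigma)
            b = (1 - p[i]) * normal_pdf(y[i], mu, sigma)
            w[i] = a / (a + b)
            loglik += math.log(a + b)
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return mu, delta, sigma, loglik

# Hypothetical data: 400 mice, fairly informative markers (p_i is 0.1 or 0.9).
rng = random.Random(2)
p = [0.1 if i % 2 == 0 else 0.9 for i in range(400)]
z = [1 if rng.random() < pi else 0 for pi in p]
y = [50 + 10 * zi + rng.gauss(0, 4) for zi in z]
mu_hat, delta_hat, sigma_hat, ll = em_single_position(y, p)
print(round(mu_hat, 2), round(delta_hat, 2), round(sigma_hat, 2))
```

When every $p_i$ is exactly 0 or 1 (the position sits on a fully typed marker), the weights never change and the estimates reduce to the two group means, which is the marker-regression answer.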
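LOD scores
The LOD score is a measure of the strength of evidence for the presence of a QTL at a particular location.
LOD$(z)$ = $\log_{10}$ likelihood ratio comparing the hypothesis of a QTL at position $z$ versus that of no QTL
$= \log_{10}\left\{ \dfrac{\Pr(y \mid \text{QTL at } z, \hat\mu_z, \hat\Delta_z, \hat\sigma_z)}{\Pr(y \mid \text{no QTL}, \hat\mu, \hat\sigma)} \right\}$
$\hat\mu_z, \hat\Delta_z, \hat\sigma_z$ are the MLEs, assuming a single QTL at position $z$.
No QTL model: The phenotypes are independent and identically distributed (iid) $N(\mu, \sigma^2)$.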
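At a fully genotyped marker the QTL genotype is observed, and under the normal model the LOD score reduces to $(n/2)\log_{10}(\mathrm{RSS}_0/\mathrm{RSS}_1)$, where $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ are the residual sums of squares without and with the genotype groups. The sketch below covers only that special case; the function name and 1/0 genotype coding are assumptions for illustration.

```python
import math

def lod_at_marker(phenotypes, genotypes):
    """LOD score at a marker with complete genotype data (1 = BB, 0 = AB).

    Uses LOD = (n/2) * log10(RSS0 / RSS1), which follows from plugging the
    MLEs of the normal model into the likelihood ratio.
    """
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)   # no-QTL model
    rss1 = 0.0                                              # one mean per genotype
    for g in (0, 1):
        group = [y for y, x in zip(phenotypes, genotypes) if x == g]
        group_mean = sum(group) / len(group)
        rss1 += sum((y - group_mean) ** 2 for y in group)
    return (n / 2) * math.log10(rss0 / rss1)

print(lod_at_marker([1, 2, 3, 11, 12, 13], [0, 0, 0, 1, 1, 1]))
```

Between markers, or with missing genotypes, the likelihood is the mixture form and needs an iterative fit such as EM.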
An example LOD curve
[Figure: LOD (y-axis, 0 to 12) against chromosome position in cM (x-axis, 0 to 100).]
[Figure: LOD curves across chromosomes 1 to 19. x-axis: Map position (cM), 0 to 2500; y-axis: lod, 0 to 3.]
Interval mapping
Advantages
• Takes proper account of missing data.
• Allows examination of positions between markers.
• Gives improved estimates of QTL effects.
• Provides pretty graphs.
Disadvantages
• Increased computation time.
• Requires specialized software.
• Difficult to generalize.
• Only considers one QTL at a time.
Multiple QTL methods
Why consider multiple QTLs at once?
• Reduce residual variation.
• Separate linked QTLs.
• Investigate interactions between QTLs (epistasis).
LOD thresholds
Large LOD scores indicate evidence for the presence of a QTL.
Q: How large is large?
→ We consider the distribution of the LOD score under the null hypothesis of no QTL.
Key point: We must make some adjustment for our examination of multiple putative QTL locations.
→ We seek the distribution of the maximum LOD score, genome-wide. The 95th %ile of this distribution serves as a genome-wide LOD threshold.
Estimating the threshold: simulations, analytical calculations, permutation (randomization) tests.
Null distribution of the LOD score
• Null distribution derived by computer simulation of backcross with genome of typical size.
• Solid curve: distribution of LOD score at any one point.
• Dashed curve: distribution of maximum LOD score, genome-wide.
[Figure: the two null distributions. x-axis: LOD score, 0 to 4.]
Permutation tests
[Diagram: genotype data (mice × markers) and phenotypes → LOD$(z)$ (a set of curves) → $M = \max_z \mathrm{LOD}(z)$]
• Permute/shuffle the phenotypes; keep the genotype data intact.
• Calculate $\mathrm{LOD}^\star(z)$ → $M^\star = \max_z \mathrm{LOD}^\star(z)$
• We wish to compare the observed $M$ to the distribution of $M^\star$.
• $\Pr(M^\star \ge M)$ is a genome-wide P-value.
• The 95th %ile of $M^\star$ is a genome-wide LOD threshold.
• We can't look at all $n!$ possible permutations, but a random set of 1000 is feasible and provides reasonable estimates of P-values and thresholds.
• Value: conditions on observed phenotypes, marker density, and pattern of missing data; doesn't rely on normality assumptions or asymptotics.
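The permutation procedure can be sketched as below. To stay self-contained, the scan is restricted to LOD scores at fully typed markers (a simplification, not interval mapping), and the toy data use unlinked markers. Function names, the seed, and the number of permutations are illustrative assumptions.

```python
import math
import random

def lod_at_marker(y, x):
    """LOD at a fully typed marker: (n/2) * log10(RSS0 / RSS1)."""
    n = len(y)
    grand_mean = sum(y) / n
    rss0 = sum((v - grand_mean) ** 2 for v in y)
    rss1 = 0.0
    for g in (0, 1):
        group = [v for v, xi in zip(y, x) if xi == g]
        m = sum(group) / len(group)
        rss1 += sum((v - m) ** 2 for v in group)
    return (n / 2) * math.log10(rss0 / rss1)

def max_lod(y, geno):
    """Genome-wide maximum LOD over all markers; geno[i][j] is 1 (BB) or 0 (AB)."""
    return max(lod_at_marker(y, [row[j] for row in geno])
               for j in range(len(geno[0])))

def permutation_test(y, geno, n_perm=1000, seed=0):
    """Return (observed max LOD, genome-wide P-value, 95th percentile threshold)."""
    rng = random.Random(seed)
    observed = max_lod(y, geno)
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # shuffle phenotypes; genotypes stay intact
        null_max.append(max_lod(shuffled, geno))
    null_max.sort()
    p_value = sum(m >= observed for m in null_max) / n_perm
    threshold = null_max[math.ceil(0.95 * n_perm) - 1]
    return observed, p_value, threshold

# Hypothetical toy data: 100 mice, 10 unlinked markers, QTL at marker 3.
rng = random.Random(3)
geno = [[rng.randint(0, 1) for _ in range(10)] for _ in range(100)]
y = [50 + 15 * row[3] + rng.gauss(0, 10) for row in geno]
observed, p_value, threshold = permutation_test(y, geno, n_perm=200)
print(round(observed, 2), p_value, round(threshold, 2))
```

Shuffling whole phenotypes while leaving each mouse's genotype row intact preserves the marker correlation structure and missing-data pattern, which is what lets the threshold adapt to the data at hand.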
LOD support intervals
$s$-LOD support interval
• Chromosomal region for which the LOD score is within $s$ of its maximum.
• Generally $s$ = 1 or 2; I prefer $s$ = 1.5.
• Plot of LOD − max LOD depicts evidence for QTL location.
[Figure: LOD − max(LOD) (y-axis, −2.0 to 0.0) against chromosome position in cM (x-axis, 20 to 60).]
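Given a LOD curve evaluated on a grid of positions, a support interval can be read off as sketched below. This version takes the connected stretch of grid points around the peak that stay within the chosen drop, with no interpolation between grid points; both choices, like the function name and the example curve, are simplifying assumptions.

```python
def lod_support_interval(positions, lod, drop=1.5):
    """Connected region around the LOD peak where LOD >= max(LOD) - drop.

    positions and lod are parallel lists along one chromosome.
    Returns (left position, right position) on the supplied grid.
    """
    peak = max(lod)
    peak_idx = lod.index(peak)
    lo = peak_idx
    while lo > 0 and lod[lo - 1] >= peak - drop:
        lo -= 1
    hi = peak_idx
    while hi < len(lod) - 1 and lod[hi + 1] >= peak - drop:
        hi += 1
    return positions[lo], positions[hi]

# Hypothetical LOD curve on a 10 cM grid.
positions = [0, 10, 20, 30, 40, 50, 60]
lod = [0.5, 1.0, 3.2, 4.0, 3.0, 1.2, 0.3]
print(lod_support_interval(positions, lod, drop=1.5))
```

A finer grid (or interpolation) would give endpoints that are less tied to the spacing of the evaluated positions.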
Power to detect QTLs
• The power to detect a QTL is the chance that its LOD score exceeds the genome-wide threshold.
• Power depends on
  – Size of the QTL effect.
  – Number of progeny.
  – Type of cross.
  – Density of markers.
  – Stringency of the LOD threshold.
• At right:
  – Dashed curve: dist'n of max LOD under null hypothesis.
  – Solid curve: dist'n of LOD score at QTL, for one sample size $n$ (value illegible in the source).
  – Dotted curve: dist'n of LOD score at QTL, for a second sample size $n$ (value illegible in the source).
[Figure: three panels. x-axis: LOD score, 0 to 9. Panel titles: QTL effect = 5, Power = 2%; QTL effect = 8, Power = 41%; QTL effect = 11, Power = 97%.]
How many markers/mice?
• At right:
  – Top: $n$ = 100
  – Bottom: $n$ = 200
  – Solid: 10 cM spacing
  – Dashed: 1 cM spacing
• More mice:
  – More recombination breakpoints.
  – Reduced sampling variation.
• More markers:
  – More detailed genotype information.
  – Not necessarily increased precision (depends on number of mice, size of QTL effect, and luck).
• Note: The figures should be taken with a grain of salt.
[Figure: two panels, "100 mice" and "200 mice". x-axis: Chromosome position (cM), 20 to 50; y-axis: LOD − max(LOD), −2.0 to 0.0.]
Selection bias
• The estimated effect of a QTL will vary somewhat from its true effect.
• Only when the estimated effect is large will the QTL be detected.
• Among those experiments in which the QTL is detected, the estimated QTL effect will be, on average, larger than its true effect.
• This is selection bias.
• Selection bias is largest in QTLs with small or moderate effects.
• The true effects of QTLs that we identify are likely smaller than was observed.
[Figure: three panels. x-axis: Estimated QTL effect, 0 to 15. Panel titles: QTL effect = 5, Bias = 79%; QTL effect = 8, Bias = 18%; QTL effect = 11, Bias = 1%.]
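A small simulation illustrates the mechanism. It is a deliberately simplified sketch, not a reconstruction of the figure: the QTL sits exactly at a marker, genotype groups are balanced, detection uses a fixed t-statistic cutoff as a stand-in for a genome-wide LOD threshold, and the residual SD, cutoff, and sample size are hypothetical. Its output should not be expected to match the bias percentages in the panels.

```python
import math
import random

def simulate_selection_bias(true_effect, n_mice=100, sigma=10.0,
                            t_threshold=3.0, n_sim=2000, seed=0):
    """Return (mean estimate over all experiments,
               mean estimate over experiments where the QTL is 'detected',
               proportion detected)."""
    rng = random.Random(seed)
    half = n_mice // 2
    all_est, detected_est = [], []
    for _ in range(n_sim):
        ab = [rng.gauss(0.0, sigma) for _ in range(half)]
        bb = [rng.gauss(true_effect, sigma) for _ in range(half)]
        mean_ab, mean_bb = sum(ab) / half, sum(bb) / half
        est = mean_bb - mean_ab                      # estimated QTL effect
        ss = (sum((v - mean_ab) ** 2 for v in ab) +
              sum((v - mean_bb) ** 2 for v in bb))
        se = math.sqrt(ss / (2 * half - 2) * (2 / half))
        all_est.append(est)
        if abs(est / se) > t_threshold:              # assumed detection rule
            detected_est.append(est)
    mean_all = sum(all_est) / n_sim
    mean_detected = (sum(detected_est) / len(detected_est)
                     if detected_est else float("nan"))
    return mean_all, mean_detected, len(detected_est) / n_sim

for effect in (5.0, 8.0, 11.0):
    mean_all, mean_detected, power = simulate_selection_bias(effect)
    print(effect, round(mean_all, 2), round(mean_detected, 2), round(power, 2))
```

Averaged over all simulated experiments the estimate is close to the true effect; the inflation appears only after conditioning on detection.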
The genetic map: effects of errors
• Marker order
  – Causes wiggly LOD curves.
  – Shouldn't completely eliminate a signal.
• Map distances
  – Doesn't seem to make much difference.
  – Makes a big difference in perceived length of LOD support intervals.
• Greater effects of errors with appreciable missing data.
[Figure: two panels for chromosome 1, "Error in marker order" (x-axis: Map position (cM), 0 to 100) and "Error in marker spacing" (x-axis: Map position (cM), 0 to 200); y-axis: lod, 0 to 10.]
The genetic map: finding problems
• Consider:
  – Estimated genetic map
  – Pairwise recombination fractions
• Misplaced markers cause:
  – Big gaps in the map
  – Large recombination fractions
[Figure: Comparison of genetic maps, chromosome 1. x-axis: Chromosome; y-axis: Location (cM), 0 to 1200.]
[Figure: Pairwise recombination fractions and LOD scores, chromosome 1. Both axes: Markers, 5 to 20.]
Genotyping errors: effects
• With genotyping errors, individuals are placed in the wrong genotype group.
• With widely spaced markers, there is little effect.
• With dense markers, errors make the LOD curve have more dips.
[Figure: two panels for chromosome 1, "Genotyping errors; 10 cM spacing" (lod, 0 to 6) and "Genotyping errors; 2 cM spacing" (lod, 0 to 8). x-axis: Map position (cM), 0 to 100.]
Identifying genotyping errors
• Look for tight double crossovers. (Crossover interference is often strong.)
• Error LOD scores (Lincoln & Lander 1992)
  – Model for genotyping errors.
  – Assumed error rate.
  – Assumption of no interference.
  – At marker $j$ in mouse $i$, LOD = $\log_{10}\left\{ \Pr(\text{genotype in error} \mid \text{marker data}) / \Pr(\text{genotype correct} \mid \text{marker data}) \right\}$
[Figure: Chromosome 5 genotypes. Axes: Individual, 0 to 40; Position (cM), 0 to 70.]
[Figure: Genotyping error LOD scores (panel labels 5 and 13). x-axis: Markers, 5 to 25; y-axis: Individuals, 10 to 40.]
Selective genotyping
• Save effort by only typing the most informative individuals (say, top & bottom 10%).
• Useful in context of a single, inexpensive trait.
• Tricky to estimate the effects of QTLs: use IM with all phenotypes.
• Can't get at interactions.
• Likely better to also genotype some random portion of the rest of the individuals.
[Figure: phenotype (y-axis, 40 to 80) by genotype (BB, AB) in two panels, "All genotypes" and "Top and bottom 10%".]
Covariates
• Examples: treatment, sex, litter, lab, age.
• Control residual variation.
• Avoid confounding.
• Look for QTL × environment interactions.
• Adjust before interval mapping (IM) versus adjust within IM.
[Figure: two panels for chromosome 7, "Intercross, split by sex" and "Reverse intercross". x-axis: Map position (cM), 0 to 120; y-axis: lod, 0 to 4.]
Non-normal traits
• Standard interval mapping assumes normally distributed residual variation. (Thus the phenotype distribution is a mixture of normals.)
• In reality: we see dichotomous traits, counts, skewed distributions, outliers, and all sorts of odd things.
• Interval mapping, with LOD thresholds derived from permutation tests, generally performs just fine anyway.
• Alternatives to consider:
  – Nonparametric approaches (Kruglyak & Lander 1995)
  – Transformations (e.g., log, square root)
  – Specially-tailored models (e.g., a generalized linear model, the Cox proportional hazard model, and the model in Broman et al. 2000)
Summary I
• ANOVA at marker loci (aka marker regression) is simple and easily extended to include covariates or accommodate complex models.
• Interval mapping improves on ANOVA by allowing inference of QTLs to positions between markers and taking proper account of missing genotype data.
• ANOVA and IM consider only single-QTL models. Multiple QTL methods allow the better separation of linked QTLs and are necessary for the investigation of epistasis.
• Statistical significance of LOD peaks requires consideration of the maximum LOD score, genome-wide, under the null hypothesis of no QTLs. Permutation tests are extremely useful for this.
• 1.5-LOD support intervals indicate the plausible location of a QTL. A plot of the LOD curve, re-centered so that its maximum is at 0, is a valuable tool for depicting evidence for QTL location.
• Once you've achieved a 10 cM marker spacing, more mice will probably be more important than more markers. But this depends on the number of mice, the size of the QTL effect, and luck.
Summary II
• Estimates of QTL effects are subject to selection bias. Such estimated effects are often too large.
• Study your data. Look for errors in the genetic map, genotyping errors and phenotype outliers. But don't worry about them too much.
• Selective genotyping can save you time and money, but proceed with caution.
• Study your data. The consideration of covariates may reveal extremely interesting phenomena.
• Interval mapping works reasonably well even with non-normal traits. But consider transformations or specially-tailored models. If interval mapping software is not available for your preferred model, start with some version of ANOVA.