genome projects - biological computing · genome projects shifra ben ... taken place, are organized...
TRANSCRIPT
Outline
Whataregenomes,andhowweretheysequenced?
Whatcanwedowithgenomes?
Howdowelookatgenomes?
Howdowechooseabrowser?
Howdothebrowserswork?
WhyStudyGenomes?
Understandbiologicalprocesses
Understandpathologicalprocesses
Diagnose,prevent,andcurediseases
Sowhatisagenome?
Agenomeisthefullcollec5onofgene5cmaterial(DNA)ofanorganism(includingnon‐nuclearDNA,suchasmitochondrialorchloroplastDNA)
Itismorethanthegenes(whichareonly~1.5%ofthehumangenome)
Inhumansthereare3,000,000,000basepairsofDNA
WhatelseisintheDNA?(asidefromgenes)
Thereareareasthatdon’tcodeforgenesatall
Thereareareasthatregulatethegenes
– Whenshouldaproteinbeexpressed?night/day,fetus/adult
– Whereshouldaproteinbeexpressed?eye,lung,muscle,brain
AGGTTAGATTATGCCCCGAGGGCGCCCCAGCCGAAATTTTTAATGCAGGTTTAATAGTTTAGAGCCTGTGGGCTTCCATGGCTTGGTTCTGCTGTTCTTCACTGGGGACTTGGGGGACCCTGGGAGCTTGTGATGGGGCCTGTCTCCACCTCTGTAAATCCAAGGAGTCAGATGACAAATCTGTCATTTCGGGCGACACACTCCCCTGAGGAAAGGGCCTTGCAGGAGGGCAGAGCAGCTTGCTGGGCATGGCAGGGAGTGGAGAAGGGCAGGGGGCGCAGAGCAGGAGCAGCTTCCTGCCTCTGGGTGGGGACAGTGATCCCCACTGGGGACTGGCAAAGCCCCATGCTCTCTGTTCACCCTGGATGGGTGGCACCTGGGGGCAGGCATGGGGCCTGCAGGAGCCCCTGTGTGCCAGCCCTCCCCTGCCAGCATCCCATCTCCCAGGAGGCCCCCAGGGCAGGTAAGTGCCAGGTCCCCCCTCAGCTCACCGTTGTCCTTCCCCTTGACGAACGCCTCCCACTCCCGGAACCACTGCATGCTGATGCAGTAGATGACGCCCGGCGACTCCTCGGCCTGGAAGGCCTTGTTCAACTGGCAGGCGGGCAGGGACGGGAGGAGACAGAGGGCAGGTCAGTCTCTATTTACCTTCAGCAAGGATTTCCCAAATGCCCCCGCCCCAGTCCCTCACCCAAGAGTCTTACAAAAACACCAGATACTCAAGGTGAAAAATCAAACCCCCAGACAAGCTCAAACAAAATGAACACACCCATCACTCAGAGAGGACGGTGGTAACATTTTGTGGGTCTCTTCAATAACTGTTTGACCCAACTGAGACCACAAGGGAGATTCTACTTTTTGAGAAGGAATCTCACTCTGTCACCCAGGCTGGAGTGCAGTGGCGCGATCT CGGCTTACTGCAACCTCTGCCTCCCAGGTTCAAGCGATTCTCCCACCTCAGCCTCTTGAGTAGCTGGGATTACAGGTGTGTGCCACCACCTGGCTACTTTTTGTGTTTTTAGTAGAGACGGGGTTTCGCCATGTTGGCCAGGCTGGTCTTGAACTCCTGACCTCAGGTGATCCGCCCACCTCGACCTCTCAAAAGTGCTGGGATAACAGGCATGAACCACTGCGCCCGGCCTGGGAGATGCTAATTTTCTCCGGTTGAATAGAATGTGCCTATCTGCTCAGAGAGGCAGCTCTCCTTCTGACAGGAGCATTTTCTTTTTCGAGATGGGGGGTGGTCTCACTCTGTGCCCAGGCGGGAGTGCAGCGGCGCAATCATGGCTCACTGCAGCCTTGACCTCCTGGGCTCAAGTGATCCTCCCACCTCAGCCTCCTGAGTAGGTTGGACCACAGGTGCATACCACTAGGCCCAGCCCTGACAGTCTCTTTTTCGTTTGTGTTCTGAGACAGGGTCTCACTCTATTGCCCAGGCTGCGGTGCAGTGGCATGATCACGGCTCACTGCAGCCTCAACCTCCCAGGCTTAGGTGATCCTCCCAACTCACTCAGCCCTCCAGGTAGCGGGGACTACAGGTACACATCACCATGCCTGGCTAATTTTTGTATTGTTTGTAGAGATGGGGTTTCGCCATGTTGGCCAAGTTGGTCTTGAACTCCTGG
ChallengesofDNA
DNAhasonlyfourleZers
TheyarestrungtogetherwithNOobviouspunctua5on
Therearenosignalstosay“agenestarts(orends)here”
Howdowemakesenseofsomuchinforma5on?
Ifcompiledinbooks,thedatawouldfillan
es5mated200volumesthesizeofa
ManhaZantelephonebook(at1000pages
each),andreadingitwouldrequire26
yearsworkingaroundtheclock.
TheHumanGenomeProject
Wasplannedin1988,startedin1990
Originallyplannedtotake15years,inthreefiveyearstages
Objec5onstothegenomeproject
1)Fearthatfundingwillbedivertedfromotherareasofresearch
2)Whatisthevalueofsequencingacompletegenome,giventhehighpropor5onofnongenicsequences(“JunkDNA”).
Twoaltera5onsintheoriginalplanhelped:
1)Focusshifedfromlarge‐scalesequencingtomappingthegenome,whichwouldhastenthesearchfordiseasegenes
2)Simultaneouslydeterminethenucleo5desequenceofthegenomesofotherorganisms;thisprovidescomparisonsandpointsofreferenceforthehumansequence
Stage11991‐1995
Crea5ngaGene5cMapofthegenome
Crea5ngaPhysicalMapofthegenome
Crea5ngasetofoverlappingclones
Createfaster/cheapermethodsofsequencing
Createsofware/databasesthatcandealwiththedata
Stage21994‐1998
Finishmapping(bothgene5candphysical)
Startsequencing
Startannota5on‐genefindingandplacementonmaps
Stage 3 1998-2003 Area Goals 1993-98 Status as of Oct. 1998 Goals 1998-2003 Genetic map Average 2- 1 cm map published Completed to 5-cm Sept. 1994
Physical map Map 30,000 STSs 52,000 STSs mapped Completed
DNA sequence Complete 80 Mb 180 Mb human plus Finish 1/3 of human for all organisms 111 Mb non-human sequence by end of 2001 by 1998 Working draft of
remainder end of 2001 Complete human sequence by end of 2003
Human sequence Not a goal - 100,000 mapped SNPs variation
Gene Develop 30,000 ESTs mapped Full-length cDNAs Identification Technology
Functional Not a goal - Develop genomic-scale analysis technologies
Techniques
ThemajorbreakthroughthatallowedtheHumanGenomeProjecttotakeoffweretwotechniques:
1) Improvementsinsequencing:CycleSequencingAutomatedSequencersFlourescentDyesHighThroughput‐lowercosts
2)PCR
Markers
Therearemanydifferenttypesofmarkers:
RFLP:Restric5onFragmentLengthPolymorphism
Microsatellite
VNTR:VariableNumberTandemRepeat
STS:SequenceTaggedSite
1 2
100 200
400
500
Restric5onFragmentLengthPolymorphism
700basepairPCRfragment
Allele1
Allele2
GAATTC
GAATTC
GAATTC
GAATTT
Microsatellites
Microsatellitesaresmallrepe55vestretches
ofDNA,usuallyrepeatsofdi,triandtetra
nucleo5des.
ForexampleCACACA,orCAGCAGCAG…..
Becausethesestretchesrepeat,whenthe
DNArecombines,itsveryeasyforthe
machineryto“slip”andaddorsubtractafew
copies
1 2
4 2
Allele
# of repeats
AAAACAGCAGCAGTTTTT AAAACAGCAGCAGTTTTT
AAAACAGCAGCAGTTTTT AAAACAGCAGCAGTTTTT
Parent chromosomes Recombination
AAAACAGCAGCAGCAGTTTTT
AAAACAGCAGTTTTT
Daughter chromosomes Allele 1
Allele 2
VariableNumberTandemRepeat
RegionsofDNAthatrepeat‐forexample,microsatellites,althoughVNTR’scanhavemorecomplexsequence.
ThisisalsodetectedbyperformingPCRandlookingatthenumberofrepeatsonagel
primer 1
Primer 2
PCR product
A genomic region
Sequence Tagged Site
STS is defined as:
primer 1 primer 2
PCR product size STS is designed to be unique in the genome
Gene5cMapsAgene5cmapmeasuresrecombina5ondistanceandanswerstheques5on,“Howofenaretwomarkersfoundtogether?”
Twomarkersaresaidtobe1cen5Morgan(cM)apartiftheyareseparatedbyrecombina5on1%ofthe5me.Agene5cdistanceof1cMisroughlyequaltoaphysicaldistanceof1millionbasepairs(1MegabaseorMb)
Gene5cMaps
5 2 4 4
2 4 2 4
2 3 1 4
3 2 2 4
2 4 2 4
2 5 1 4
2 4 2 4
2 3 1 4
2 4 2 4
2 5 2 4
2 4 2 4
2 3 1 4
2 4 2 4
2 5 2 4
3 2 2 4
2 5 2 4
3 2 2 4
2 5 2 4
PhysicalMaps
Physicalmapsvarygreatly.Thelowestresolu5onmaparethechromosomebandingpaZerns(ideogram).
Thehighestresolu5onmapistheactualsequence.
Whatweactuallyuseisinbetween.Onetypeofmapdevelopedaspartofthegenomeprojectistheradia5onhybridmap(RHmap)
Firstrarecuongenzymeswereused
togeneratepieces~150,000basepairs
long.Differentenzymeswereusedto
getoverlappingsegments.
Howwasthegenomecloned?
BACs(BacterialAr5ficialChromosomes)wereplacedalongthemapwiththehelpofmarkers,andtheleastredundantpathwaschosen.
TheneachBACwasbrokendownthesameway,thefragmentswereplacedinorderusingrestric5onenzymemapping,andsequenced.
CeleraCorp.saidthatitwouldsequencethegenomefasterandcheaperthanthepublicprojectcould.
Theyusedamethodknownasrandomshotgunsequencing.
Theybrokethegenomeupinto2000and10,000bppieces,sequencedthem,andwroteacomputerprogramtoputitallbacktogether.
ThencameCelera…..
WholeGenomeShotgun
Wholegenomeshotgundidn’tactuallysequencewholepiecesofDNA,butjusttheirends.
TheWGSreadsaverage~500bp.
Theydohowevergiveinforma5onintermsofmatepairsastowhichpiecefallswhere,andthatgivesamethodoforderingandameasureofthesizeoftheholes
Itisimportanttonotethatnotallgenome
sequencesareorganizedintochromosomes.
Thosethatarebeingsequencedbywhole
genomeshotgun,wheremappinghasnot
takenplace,areorganizedinto“linkage
groups”
Theproblemswithshotgun…
Thereareseveralproblemswithshotgunsequencing:
‐ repe55veelements
‐ genefamilies
‐needmoresequence
“x”coverage
Whatdoes5xcoverageofthegenomemean?
That55mesthenumberofbasesinthatgenomeweresequenced(soforhuman5x3x109bases)
ItdoesNOTmeanthatthewholegenomewascovered55mes
Thepublicreac5on….
ThepublicprojectwasconcernedthatCelerawouldfinishfirst,andasacommercialcompany,trytopatentsignificantpartsofthegenome.
Tomakethepubliceffortfaster,theydecidedtoshotguntheexis5ngBACs.
ThisleadtoHTGs
HighThroughputGenomesequence
Theoutputofthepublicprojecthadbeenwellorderedwellfinishedsequence.
Asaresultofshotgunsequencing,weendedupwith“draf”sequence‐sequencewhosegeneralloca5onisknown,buttheexactorderordirec5onofthepiecesisnot.
HTGS
Status Loca-on Defini-on
Phase0 HTGdivision single‐fewpassreadsofa singleclone(notcon5gs).
Phase1 HTGdivision Unfinished,maybe unordered,unorientedcon5gs, withgaps.
Phase2 HTGdivision Unfinished,ordered,oriented con5gs,withorwithoutgaps.
Phase3 Primarydivision Finished,nogaps(withor withoutannota5ons).
Dra4Sequence‐Spring2000p q
GapOrderedcon5gs,
finishedandunfinished
Orienta5onUnknown
?
?
UnfinishedClone
?
?
UnmappedCon5gs
p q
SequencingProblems
PhysicalHoles(noclones) SequenceHoles(nosequence)
AssemblyProblems
Repe55veelements Genefamilies
Pseudogenes
Duplica5on
Take
n fro
m: C
hurc
h D
M e
t al (
2009
) Lin
eage
-Spe
cific
Bio
logy
Rev
eale
d by
a F
inis
hed
Gen
ome
Ass
embl
y of
the
Mou
se. P
loS
Bio
l 7(5
): e1
0001
12
Thepost‐genomicera…..
Nowthatwehavemostofthegenomesequence,effortsarebeingturnedtounderstandingthewealthofinforma5onproduced:
Definingthegenes,whenandwheretheyareexpressed,whatcontrolsthem,whotheyinteractwith,andhowtheyaremutated,par5cularlyindisease.
AdvantagesofGenomeSequence
Previously,itwas“onegene,onepostdoc”
NowthatwehaveabeZerpictureofthings,wecanstudysystems,andgeneinterac5ons
Previously,ittookyearstocloneandsequenceagene
Now,allweneedisaliZlebitofsequence,
andwecanlookuptherestinthegenome
AdvantagesofGenomeSequence
Previously,aferyearsspentcloning,more5mewasneededtosequencethesurroundingareaofthegene,tostartlookingintoregulatoryelements
Now,wehavethesurroundingsequenceandcanstartlookingfortheregulatoryelementsdirectly
Sowestartedwithgenomics….
Nowwehave“Omics”:
Transcriptome
Proteome
Regulome…….
Ands5llplentyofwork…….
N50
Acommonmeasureforhowwellput
togetheragenomeis
Themarkwhere50%oftheclonesare
aboveorbelowthatlength
N50human:46.4Mb(build37)
N50mouse:39.3Mb(build37)
“TheYchromosomeinthisassemblycontainstwopseudoautosomalregions(PARs)thatweretakenfromthecorrespondingregionsintheXchromosomeandareexactduplicates:
chrY:10001-2649520 and chrY:59034050-59363566 chrX:60001-2699520 and chrX:154931044-155260560
UCSCgenomebrowser,humangenomepage
YChromosome
MouseGenome
NCBIMouseBuild36representsahighlypolishedproduct.Itiscomposedlargelyoffinishedsequence(clickhereforagraphofsequencecomposi5on).HTGSphase3sequenceisassembledbyhandintonon‐redundantcon5gsandthenon‐redundantsequenceisusedasanimputtotheassembly.AhandcuratedcombinedTilingPathfilewasusedtoguidetheassembly.Allnon‐finishedpor5onsoftheassemblyhavebeenhandcuratedtodeterminewhethertoincludeWGS,HTGSsequenceortoleaveagap.
Mousethingstoremember
Strains– ThereferencesequenceisfromC57Bl/6J
– Thereissequenceavailablefrom:129,A/J,B6/CBAF1J,Balb/c,C3H,NOD
Ychromosome‐theoriginalreferencemousewasfemale!
MouseGenome
TheYchromosomewasbuiltbyhandincollabora5onwiththeWashingtonUniversityGenomeSequencingCenterandthePagelabattheWhiteheadIns5tuteforBiomedicalResearch.Currently,onlytheshortarmoftheYhasreliablemappingdata,somostofthecon5gsontheYchromosomeareunplaced.
Release5:2006
Table1:Release5Sta5sicsArm Gaps MajordifferencecomparedtoRelease4X 3 8kbaddedtothedistalend, gapsfilledinregions1‐112L 2 591kbaddedtotheproximalendofthearm.2R 1 380kbaddedtotheproximalend.3L 1 16kbaddedondistalend,
718kbaddedtoproximalend,othergapsfilled.3R 0 None4 1 70kbpaddedtothedistalend
Gapsofunknownsizearedenotedby100N'sinthefastafiles.TherearetwosizedgapsonXthathavees5matesfortheirsize.Thereare6othergapsinthegenomewhicharenotsized.Inaddi5ontothemajorchanges,allarmsexceptfor3Rhadminorsequencechangesinregionswherethefingerprintdigestshowedevidenceofmissassembly,orinregionsthatrequiredqualityimprovements.
Arabidopsis–TAIR10
Thesequencedregionscover119.1megabasesofthe125‐megabasegenomeandextendintocentromericregions.Majorreleasein2003,minorupdatesin2008,2009.Hasalistofknowngapsinthegenome.
TheTAIR10releasecontains27,416proteincodinggenes,4827pseudogenesortransposableelementgenesand1359ncRNAs(33,602genesinall,41,671genemodels).Atotalof126newlociand2099newgenemodelswereadded.
TAIRGenomeMonitoringSeptember6,2002
WeregularlycomparemRNAandESTsequencesfromGenBankandTIGRwiththeArabidopsisgenomesequenceandpulloutArabidopsissequencesforwhichthereisnogoodmatch(eitherbecausethesequencequalityistoolow,theyarefromanotherorganism,ortheyfallintooneofthefewsmallunsequencedregionsremaining).GotoourFTPsitetodownloadfastafilescontainingtheseunmatchedsequences.
TotalmRNAstested:17,441Numberthatdidnotmatchthegenomesequence:19(0.0011%).TotalESTstested:174,625Numberthatdidnotmatchthegenomesequence:1600(0.009%).