Genome Characterization
Assembly/sequencing
BIO520 Bioinformatics Jim Lund
Assigned reading: Ch 9
The (original) genome sequencing process
Organism Selection
Library Creation
Sequencing
Assembly
Gap Closure
Finishing
Annotation
The (current) genome sequencing process
Organism Selection
Sequencing
Assembly
Annotation
Next-generation random sequencing allows library creation to be skipped.
Gap closure and finishing are often skipped, at least for now.
Contigs, Islands
[Figure: diagram with labels "contigs" and "island"]
Assembly pipeline
1. Sequence reads.
2. Phred: base calling.
3. cross_match: screen out vector, E. coli sequence.
4. Phrap: assemble contigs.
5. Consed: view assembly, correct problems.
6. Finishing.
Assembly Methods
• Strip out vector (or contaminant)
• Mask known repeats
• Trim off unreliable data
• Find matches (n seq x n seq comparisons)
– how long (what ktuple [10 common])
– how perfect (reliability index)
– where to look? (ends only vs entire)
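The "find matches" step can be illustrated with a minimal sketch that indexes every k-tuple (k = 10, as on the slide) and counts how many distinct k-tuples each pair of reads shares. This is not how phrap or any particular assembler does it; the function name `shared_ktuples` and the toy reads are hypothetical, and real assemblers also score alignment quality and orientation.

```python
from collections import defaultdict
from itertools import combinations

def shared_ktuples(reads, k=10):
    """Return {(i, j): count} of distinct k-tuples shared by reads i and j."""
    index = defaultdict(set)  # k-tuple -> indices of reads that contain it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)

    pair_counts = defaultdict(int)
    for read_ids in index.values():
        # Every pair of reads containing this k-tuple is a candidate overlap.
        for a, b in combinations(sorted(read_ids), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

# Toy reads (made up): the end of read 0 overlaps the start of read 1.
reads = [
    "ACGTACGTTAGCCGATAGGCTA",
    "TAGCCGATAGGCTAACCGGTTA",
    "GGGGGGGGGGCCCCCCCCCCAA",
]
candidates = shared_ktuples(reads, k=10)
print(candidates)
```

Indexing k-tuples avoids aligning every read against every other read in full; only pairs that share at least one k-tuple need a closer look.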
Assembly Programs
• PHRAP FAMILY
– phred/phrap/consed/cross_match
– Developed by Phil Green, University of Washington
• Other assemblers
– phrap, kangaroo, phrapo,
– CAP, TIGRAssembler,...
http://www.phrap.org/
Assembly
• Phred - reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base.
– The quality value is a log-transformed error probability, specifically: $Q = -10 \log_{10}(P_e)$
– Q = quality value, P_e = error probability.
– Q = 20 -> 1% chance of miscall, Q = 30 -> 0.1% chance of miscall.
• Phrap - assembles shotgun DNA sequence data.
• Consed/Autofinish - view, edit, and finish sequence assemblies created with phrap.
– Allows the user to pick primers and templates
– Suggests additional sequencing reactions
– Suggests digests and forward/reverse pair information to check accuracy of assembly.
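The quality-value formula can be checked with a short sketch. The function names below are illustrative only and are not part of phred itself.

```python
import math

def phred_to_error_prob(q):
    """Convert a quality value Q to an error probability Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(pe):
    """Convert an error probability Pe to a quality value Q = -10 log10(Pe)."""
    return -10 * math.log10(pe)

for q in (20, 30):
    print(f"Q = {q}: {100 * phred_to_error_prob(q):g}% chance of miscall")
```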
Poisson statistics for sequencing completion
$P_0 = e^{-LN/G}$ (probability that a given base is not sequenced)
L = read length
N = # reads
G = genome size
E. coli 15kb
H. sapiens 900kb
Coverage (1 = 1-fold = 1X)    % not sequenced
1                             37
3                             5
8                             0.03
10                            0.005
50                            < 1e-20
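The table follows directly from the Poisson formula, since fold coverage is c = LN/G. A minimal sketch, with illustrative function names:

```python
import math

def coverage(read_length, n_reads, genome_size):
    """Fold coverage c = L * N / G."""
    return read_length * n_reads / genome_size

def fraction_unsequenced(c):
    """Poisson probability that a base is covered by zero reads: P0 = e^(-c)."""
    return math.exp(-c)

for c in (1, 3, 8, 10, 50):
    print(f"{c}X: {100 * fraction_unsequenced(c):.3g}% not sequenced")
```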
Gaps
Number of Gaps = $N e^{-c}$
N = # of reads
c = fold coverage
150kb Target Clone, 500 bp reads
Coverage    Reads    Gaps
1           300      111
5           1500     10
8           2400     1
10          3000     0
50          15000    0
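A small sketch of the gap estimate, assuming coverage is computed as c = N × read length / target size. The function name `expected_gaps` is illustrative.

```python
import math

def expected_gaps(n_reads, read_length, target_size):
    """Expected number of gaps, N * e^(-c), with c = N * read_length / target_size."""
    c = n_reads * read_length / target_size
    return n_reads * math.exp(-c)

# 150 kb target clone, 500 bp reads
for n_reads in (300, 1500, 2400, 3000, 15000):
    gaps = expected_gaps(n_reads, 500, 150_000)
    print(f"{n_reads} reads: about {gaps:.1f} gaps")
```

The results should land close to the table values, with small differences due to rounding. The same function can be applied to other target sizes and read lengths.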
Gaps
Number of Gaps = $N e^{-c}$
N = # of reads
c = fold coverage
Human genome, 3Gb, 1,000 bp reads
Coverage    Reads     Gaps
1           3e6       1,000,000
5           1.5e7     100,000
8           2.4e7     8,000
10          3e7       1,400
50          3.75e8    7e-14    (454 Seq, 400bp reads)
Contigs, Islands
[Figure: diagram with labels "contigs" and "island"]
Finishing
• GOALS
– >95% coverage on BOTH strands
– every base covered 3X
– resolve ambiguities
• Finish when random sequencing is no longer productive (~8X range)
Sequence finishing. How?
• Identify gaps, ambiguities
– Captured gaps: the gap is contained in a clone
• Extend from end of contigs
– Resequencing, new chemistry.
– Specific primers
– Subcloning and sequencing.
• Uncaptured gaps.
– New specific primers
– PCR across gap, sequence PCR product.
• Resolve ambiguities
– Consensus or resequence
  • Specific primers, different chemistry
Large clone sequencing process
• Phase 1: Unfinished, may be unordered/unoriented contigs, with gaps.
• Phase 2: Unfinished, fully oriented and ordered sequence, may contain gaps and low quality sequences
• Phase 3: Finished, no gaps.
Genome assembly after initial contigs are made
• Order clones/contig sequences:
– Sequence overlaps.
  • Clone/contig end sequences.
– Clone fingerprints.
– Anchor using other maps
  • Sequence based markers on genetic or physical maps.
  • Conserved synteny to other genomes.
• Easiest when re-sequencing, e.g., another human genome!
Process Control
• LIMS
– Laboratory information management system
• AIMS
– Analysis information management system
Hard genome sequencing problems
• Repeats
• Complex genome structures
Where does a clone from a repetitive region map?
Approaches to sequence repeat problems
• Multiple fragment sizes in 1 project
• Use length/distance info
• New assemblers, e.g., ARACHNE
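One way to picture "use length/distance info" is a check that paired end reads from the same clone sit roughly one insert length apart in the assembly. Pairs that do not are a hint that a repeat may have been collapsed or misjoined. The sketch below is a simplified illustration, not the ARACHNE algorithm. The function name, input layout, and tolerance are all assumptions, and it ignores read orientation and pairs spanning different contigs.

```python
def inconsistent_pairs(placements, expected_insert, tolerance):
    """Flag read pairs whose assembled span disagrees with the clone insert size.

    placements: list of (pair_id, forward_start, reverse_end) positions
                within one contig (hypothetical input format).
    """
    flagged = []
    for pair_id, forward_start, reverse_end in placements:
        span = reverse_end - forward_start
        if abs(span - expected_insert) > tolerance:
            flagged.append(pair_id)
    return flagged

# Toy example: 2 kb inserts, allow +/- 500 bp
placements = [("cloneA", 100, 2100), ("cloneB", 500, 9500)]
print(inconsistent_pairs(placements, expected_insert=2000, tolerance=500))
```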
Results of Multi-length Fragment Assembly
• Contigs
• “Supercontigs”
• Clone links for finishing
• Clone map
DOE Joint Genome Institute (JGI) Prokaryote Finishing Standards
• All low-quality areas (<Q30) are reviewed and resequenced.
• The final error rate must be less than 0.2 per 10 Kb.
• No single-clone coverage is permitted (minimum of 2x depth everywhere).
• Single-stranded regions are manually inspected and quantified.
• All positions where an aligned high-quality read (>Q29) disagrees with the consensus base are checked.
• All strings of xxxx are resolved in the final sequence.
• All repeats are verified.
• The ends of final contigs (chromosomes, plasmids) are checked.
• The final assembly is given a manual QC check.
Completed genomes
23 complete, 329 in assembly, 389 in progress
Plants, Animals, Protists, Fungi
Arabidopsis thaliana
Caenorhabditis elegans
Candida glabrata
Cryptococcus neoformans
Cyanidioschyzon merolae
Debaryomyces hansenii
Drosophila melanogaster
Encephalitozoon cuniculi
Entamoeba histolytica
Eremothecium gossypii
Homo sapiens
Kluyveromyces lactis
Leishmania major Friedlin
Mus musculus
Oryza sativa
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Trypanosoma cruzi
Yarrowia lipolytica
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
Genomes Complete
• Eukaryotes: 23 complete, 329 in assembly, 389 in progress
– Human, mouse, rat, zebrafish
– Homo sapiens neanderthalensis
– Drosophila, Anopheles, Caenorhabditis
– Arabidopsis, oat, corn, barley, rice, tomato
– Saccharomyces, Schizosaccharomyces, Magnaporthe, Cryptococcus, Candida…
– Encephalitozoon cuniculi, Guillardia theta
– Toxoplasma, Plasmodium
– And many more…
Eubacteria and Archaea genomes
• 608 Bacteria and 48 Archaea completed
• Comprehensive Microbial Resource
– http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
• Joint Genome Institute
– http://www.jgi.doe.gov/genome-projects/
– 2065 genome projects underway or completed!
• NCBI Genomes
Genome Centers
• Joint Genome Institute (DOE)
• Whitehead Institute (MIT)
• TIGR
• Washington University (St. Louis)
• Celera
• Sanger Institute (the other UK)
• RIKEN (Japan)
• Beijing Genomics Institute (China)
• Max Planck (Germany)…
Where do you find genomic data?
• NCBI
– Entrez (by clone, by RefSeq)
– Genome (view and search map)
• Genome center sites
• Organism genome project sites
• Annotation projects
– UCSC Genome Browser
– Ensembl Genome Browser
Arabidopsis
http://mips.helmholtz-muenchen.de/plant/athal/index.jsp
C. elegans (nematode)
http://wormbase.org