de novo genome assembly - t.seemann - imb winter school 2016 - brisbane, au - 4 july 2016
TRANSCRIPT
![Page 1: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/1.jpg)
De novo Genome Assembly
A/Prof Torsten Seemann
Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
![Page 2: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/2.jpg)
Introduction
![Page 3: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/3.jpg)
The human genome has 47 pieces
MT
(or XY)
![Page 4: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/4.jpg)
The shortest chromosome is about 48,000,000 bp
Total haploid size: 3,200,000,000 bp (3.2 Gbp)
Total diploid size: 6,400,000,000 bp (6.4 Gbp)
∑ = 6,400,000,000 bp
![Page 5: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/5.jpg)
Human DNA
iSequencer™
46 chromosomal and 1 mitochondrial sequences
In an ideal world ...
AGTCTAGGATTCGCTATAGATTCAGGCTCTGATATATTTCGCGGCATTAGCTAGAGATCTCGAGATTCGTCCCAGTCTAGGATTCGCTATAAGTCTAAGATTC...
![Page 6: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/6.jpg)
The real world (for now)
Short fragments
Reads are stored in “FASTQ” files
Genome
Sequencing
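For readers who have not met FASTQ before: each read occupies four lines (an identifier line starting with `@`, the bases, a `+` separator, and one quality character per base). The sketch below is a minimal reader under that assumption. The name `parse_fastq` and the two example records are made up for illustration and are not from the slides; real data is better handled with an established library.

```python
import io
from typing import Iterator, TextIO, Tuple


def parse_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), sequence, quality


# Hypothetical two-read example (made-up data, not from the talk).
example = io.StringIO(
    "@read1\nAGTCTAGG\n+\nIIIIIIII\n"
    "@read2\nATTCGCTA\n+\nIIIIHHHH\n"
)
records = list(parse_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, quality)
```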
![Page 7: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/7.jpg)
![Page 8: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/8.jpg)
Assemble Compare
![Page 9: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/9.jpg)
Genome assembly (the red pill)
![Page 10: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/10.jpg)
De novo genome assembly
Ideally, one sequence per replicon.
Millions of short sequences (reads)
A few long sequences (contigs)
Reconstruct the original genome from the sequence reads only
![Page 11: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/11.jpg)
De novo genome assembly
“From scratch”
![Page 12: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/12.jpg)
An example
![Page 13: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/13.jpg)
A small “genome”
Friends, Romans, countrymen, lend me your ears;
I’ll return them
tomorrow!
![Page 14: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/14.jpg)
Shakespearomics
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
Whoops! I dropped them.
![Page 15: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/15.jpg)
Shakespearomics
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
I am good with words.
![Page 16: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/16.jpg)
Shakespearomics
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
• Majority consensus
Friends, Romans, countrymen, lend me your ears; (1 contig)
We have reached a consensus!
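The majority-consensus step on this slide can be sketched in Python as a column-by-column vote. This is only an illustration: the layout offsets below were worked out by hand from the overlaps shown (an assumption, since the slide does not give them), and `majority_consensus` is a hypothetical helper name. Two of the reads carry a deliberate error ("crymen", "send"), which the vote outnumbers.

```python
from collections import Counter

# (offset, read) pairs. The offsets are hand-derived assumptions that place
# each read at its position along the original sentence.
layout = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),      # error: 'c' where the others say 't'
    (29, "send me your ears;"),   # error: 's' where the others say 'l'
]


def majority_consensus(layout):
    """Vote column by column over reads placed at their layout offsets."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    # Keep the most frequent character in each column.
    return "".join(column.most_common(1)[0][0] for column in columns)


contig = majority_consensus(layout)
print(contig)
```

The sketch assumes every column is covered by at least one read, which holds for this layout; a real assembler also has to handle gaps and ties.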
![Page 17: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/17.jpg)
Overlap - Layout - Consensus
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
![Page 18: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/18.jpg)
Assembly graphs
![Page 19: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/19.jpg)
Overlap graph
![Page 20: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/20.jpg)
Another example is this
![Page 21: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/21.jpg)
Overlaps find
![Page 22: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/22.jpg)
Do the graph traverse
Size matters not. Look at me. Judge me by my size, do you?
Size matters not. Look at me. Judge me by my soze, do you?
2 supporting reads
1 supporting read
![Page 23: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/23.jpg)
So far, so good.
![Page 24: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/24.jpg)
![Page 25: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/25.jpg)
Why is it so hard?
![Page 26: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/26.jpg)
What makes a jigsaw puzzle hard?
Repetitive regions
Lots of pieces
Missing pieces
Multiple copies
No corners (circular genomes)
No box
Dirty pieces
Frayed pieces
![Page 27: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/27.jpg)
What makes genome assembly hard
1. Many pieces (read length is very short compared to the genome)
Size of the human genome = 3.2 × 10⁹ bp (3,200,000,000)
Typical short read length = 10² bp (100)
A puzzle with millions to billions of pieces
Storing in RAM is a challenge ⇢ “succinct data structure”
![Page 28: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/28.jpg)
What makes genome assembly hard
2. Lots of overlaps
Finding overlaps means examining every pair of reads
Comparisons = N×(N-1)/2 ~ N²
Lots of smart tricks to reduce this close to ~N
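To make the pair count concrete, here is a naive Python sketch that tests every pair of reads for a suffix-prefix overlap and counts the comparisons. The function names, the `min_len` threshold, and the three toy reads are assumptions for illustration; this is the brute-force baseline, not one of the smart tricks mentioned on the slide.

```python
from itertools import combinations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_len=3):
    """Check every unordered pair of reads, in both directions."""
    comparisons = 0
    edges = {}
    for a, b in combinations(reads, 2):
        comparisons += 1
        for src, dst in ((a, b), (b, a)):
            n = overlap(src, dst, min_len)
            if n:
                edges[(src, dst)] = n
    return edges, comparisons


# Toy reads (made up) cut from a short sentence.
reads = ["Size matters", "matters not.", "not. Look at me."]
edges, comparisons = all_overlaps(reads)
print(comparisons, edges)
```

With N reads the loop runs N×(N-1)/2 times, which is why this approach does not scale to millions of reads without indexing.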
![Page 29: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/29.jpg)
What makes genome assembly hard
3. Lots of sky (short repeats)
ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA
How could we possibly assemble this segment of DNA? All the reads are the same!
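A quick Python sketch shows why such a region carries no positional information. The repeat length (32 copies of "AT") and the read length (8) are arbitrary choices for illustration, not values from the slide.

```python
from collections import Counter

genome = "AT" * 32      # tandem repeat; the length is an arbitrary assumption
read_length = 8

# Take one read starting at every possible position.
reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]
counts = Counter(reads)
print(counts)
```

However many reads are drawn, only two distinct sequences ever appear, so nothing distinguishes a read from the start of the repeat from one near its end.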
![Page 30: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/30.jpg)
What makes genome assembly hard
4. Dirty pieces (sequencing errors)
Read 1: GGAACCTTTGGCCCTGT
Read 2: GGCGCTGTCCATTTTAGAAACC
1. The error in Read 2 might prevent us from seeing that it overlaps with Read 1
2. How would we know which is the correct sequence?
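The first point can be checked directly on the two reads from the slide. The sketch below looks for the longest suffix of Read 1 that matches a prefix of Read 2, first exactly and then allowing one mismatch. The function name and the `min_len` and `max_mismatches` parameters are illustrative assumptions, not a description of how any particular assembler works.

```python
def overlap_with_mismatches(a: str, b: str, min_len: int = 5,
                            max_mismatches: int = 0) -> int:
    """Longest suffix of a matching a prefix of b with a bounded number of mismatches."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatches:
            return length
    return 0


read1 = "GGAACCTTTGGCCCTGT"
read2 = "GGCGCTGTCCATTTTAGAAACC"

exact = overlap_with_mismatches(read1, read2, max_mismatches=0)
tolerant = overlap_with_mismatches(read1, read2, max_mismatches=1)
print(exact, tolerant)
```

Exact matching finds no overlap, while tolerating a single mismatch recovers an 8-base overlap. The code cannot say which read holds the error; that needs more reads covering the same position.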
![Page 31: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/31.jpg)
What makes genome assembly hard
5. Multiple copies (long repeats)
Our old nemesis the REPEAT !
Gene Copy 1 Gene Copy 2
![Page 32: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/32.jpg)
Repeats
![Page 33: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/33.jpg)
What is a repeat?
A segment of DNA that occurs more than once
in the genome
![Page 34: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/34.jpg)
Major classes of repeats in the human genome
| Repeat Class | Arrangement | Coverage (Hg) | Length (bp) |
|---|---|---|---|
| Satellite (micro, mini) | Tandem | 3% | 2-100 |
| SINE | Interspersed | 15% | 100-300 |
| Transposable elements | Interspersed | 12% | 200-5k |
| LINE | Interspersed | 21% | 500-8k |
| rDNA | Tandem | 0.01% | 2k-43k |
| Segmental Duplications | Tandem or Interspersed | 0.2% | 1k-100k |
![Page 35: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/35.jpg)
Repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs
![Page 36: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/36.jpg)
Repeats are hubs in the graph
![Page 37: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/37.jpg)
Long reads can span repeats
Repeat copy 1 Repeat copy 2
long reads
1 contig
![Page 38: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/38.jpg)
Draft vs Finished genomes (bacteria)
250 bp - Illumina - $250
12,000 bp - PacBio - $2500
![Page 39: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/39.jpg)
The two laws of repeats
1. It is impossible to resolve repeats of length L unless you have reads longer than L.
2. It is impossible to resolve repeats of length L unless you have reads longer than L.
![Page 40: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/40.jpg)
How much data do we need?
![Page 41: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/41.jpg)
Coverage vs Depth
Coverage
Fraction of the genome sequenced by at least one read
Depth
Average number of reads that cover any given region
Intuitively, more reads should increase both coverage and depth
![Page 42: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/42.jpg)
Example: Read length = ⅛ Genome Length
Coverage = ⅜
Depth = ⅜
Genome
Reads
![Page 43: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/43.jpg)
Example: Read length = ⅛ Genome Length
Coverage = 4.5/8
Depth = 5/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
![Page 44: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/44.jpg)
Example: Read length = ⅛ Genome Length
Coverage = 5.2/8
Depth = 7/8
Genome
Reads
Newly Covered Regions = 0.7 Reads
![Page 45: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/45.jpg)
Example: Read length = ⅛ Genome Length
Coverage = 6.7/8
Depth = 10/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
![Page 46: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/46.jpg)
Depth is easy to calculate
Depth = N × L / G
Depth = Y / G
N = Number of reads
L = Length of a read
G = Genome length
Y = Sequence yield (N x L)
![Page 47: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/47.jpg)
Example: Coverage of the Tarantula genome
“The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata”
Sanggaard et al. 2014. Nature Comm (5) p1-11
Depth = N × L / G
40 = N × 100 / (6 billion)
N = 40 × 6 billion / 100
N = 2.4 billion
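The same arithmetic as a small Python sketch. The function names are illustrative, and the 100 bp read length is the value used in the slide's calculation.

```python
def depth(num_reads: float, read_length: float, genome_length: float) -> float:
    """Depth = N × L / G."""
    return num_reads * read_length / genome_length


def reads_needed(target_depth: float, read_length: float, genome_length: float) -> float:
    """Rearranged for N: N = Depth × G / L."""
    return target_depth * genome_length / read_length


# Tarantula example: 6 Gb genome, 40x depth, 100 bp reads.
n = reads_needed(target_depth=40, read_length=100, genome_length=6e9)
print(f"{n:.2e} reads")
print(depth(n, 100, 6e9))
```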
![Page 48: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/48.jpg)
Coverage and depth are related
Approximate formula for coverage assuming random reads
Coverage = 1 - e^(-Depth)
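A short Python sketch of this approximation, tabulating expected coverage at a few depths. The depths chosen are arbitrary, and the formula assumes reads land uniformly at random, as stated on the slide.

```python
import math


def expected_coverage(depth: float) -> float:
    """Expected fraction of the genome covered by at least one read."""
    return 1.0 - math.exp(-depth)


for d in (0.5, 1, 2, 5, 10):
    print(f"depth {d:>4}: coverage {expected_coverage(d):.4f}")
```

Coverage climbs quickly at first and then flattens: each extra unit of depth mostly lands on regions that are already covered.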
![Page 49: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/49.jpg)
Much more sequencing needed in reality
● Sequencing is not random
  ○ GC and AT rich regions are under-represented
  ○ Other chemistry quirks
● More depth needed for:
  ○ sequencing errors
  ○ polymorphisms
![Page 50: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/50.jpg)
Assessing assemblies
![Page 51: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/51.jpg)
Contiguity
Completeness
Correctness
![Page 52: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/52.jpg)
Contiguity
● Desire
  ○ Fewer contigs
  ○ Longer contigs
● Metrics
  ○ Number of contigs
  ○ Average contig length
  ○ Median contig length
  ○ Maximum contig length
  ○ “N50”, “NG50”, “D50”
![Page 53: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/53.jpg)
Contiguity: the N50 statistic
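The slide's figure is not reproduced in this transcript, so as a reminder: N50 is commonly defined as the contig length such that contigs of that length or longer contain at least half of the total assembled bases. A minimal Python sketch under that definition follows; the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total first reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly of six contigs, 290 bases in total.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))
```

Here the two longest contigs (80 + 70 = 150) already pass half of 290, so the N50 is 70.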
![Page 54: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/54.jpg)
Completeness: total size
Proportion of the original genome represented by the assembly
Can be between 0 and 1
Proportion of estimated genome size … but estimates are not perfect
![Page 55: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/55.jpg)
Completeness: core genes
Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms.
Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes.
Number of Core Genes in Assembly / Number of Core Genes in Database
In the past this was done with a tool called CEGMA
There is a new tool for this called BUSCO
![Page 56: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/56.jpg)
Correctness
Proportion of the assembly that is free from errors
Errors include
1. Mis-joins
2. Repeat compressions
3. Unnecessary duplications
4. Indels / SNPs caused by assembler
![Page 57: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/57.jpg)
Correctness: check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Original Reads
Align
Assembly
Mapped Read
![Page 58: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/58.jpg)
Assemble ALL the things
![Page 59: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/59.jpg)
Why assemble genomes
● Produce a reference for new species
● Genomic variation can be studied by comparing against the reference
● A wide variety of molecular biological tools become available or more effective
● Coding sequences can be studied in the context of their related non-coding (e.g. regulatory) sequences
● High level genome structure (number, arrangement of genes and repeats) can be studied
![Page 60: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/60.jpg)
Vast majority of genomes remain unsequenced
Eukaryotes
Estimated number of species ~ 9 million
Whole genome sequences (including incomplete drafts) ~ 3000
Bacteria & Archaea
Estimated number of species: millions to billions
Genomic sequences ~ 150,000
Every individual has a different genome
Some diseases such as cancer give rise to massive genome level variation
![Page 61: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/61.jpg)
![Page 62: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/62.jpg)
Not just genomes
● Transcriptomes
  ○ One contig for every isoform
  ○ Do not expect uniform coverage
● Metagenomes
  ○ Mixture of different organisms
  ○ Host, bacteria, virus, fungi all at once
  ○ All different depths
● Metatranscriptomes
  ○ Combination of above!
![Page 63: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/63.jpg)
Conclusions
![Page 64: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/64.jpg)
Take home points
● De novo assembly is the process of reconstructing a long sequence from many short ones
● Represented as a mathematical “overlap graph”
● Assembly is very challenging (“impossible”) because
  ○ Sequencing bias under-represents certain regions
  ○ Reads are short relative to genome size
  ○ Repeats create tangled hubs in the assembly graph
  ○ Sequencing errors cause detours and bubbles in the assembly graph
![Page 66: De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016](https://reader034.vdocuments.net/reader034/viewer/2022042907/5879a6a21a28ab082c8b7109/html5/thumbnails/66.jpg)
The End