Compartmentalized Shotgun Assembly


Upload: magee

Post on 02-Feb-2016

TRANSCRIPT

Page 1: Compartmentalized Shotgun Assembly

Compartmentalized Shotgun Assembly

CSA: Two stated motivations?

Page 2: Compartmentalized Shotgun Assembly

Matcher matched…

…matched Celera reads with PFP bactigs,

– 20.76 million Celera reads matched (76%),

– 0.62 million had a mate pair that matched,

• 2.97 million Celera reads were unique and unscreened,

– 1.189 Gbp of unique DNA sequence, at 5.11X, yields a predicted 240 Mbp of unique Celera sequence.
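The last figure on this slide follows from dividing read bases by coverage depth. A minimal check of that arithmetic, using only the numbers quoted above (the variable names are illustrative, not from the source):

```python
# Figures quoted on the slide.
unique_read_bases_gbp = 1.189   # bases in the unmatched, unscreened Celera reads
unique_reads = 2.97e6           # number of those reads
coverage = 5.11                 # shotgun coverage depth (X)

# Unique sequence is roughly total read bases divided by coverage depth.
unique_mbp = unique_read_bases_gbp * 1000 / coverage

# Implied mean read length, as a sanity check on the read count.
mean_read_bp = unique_read_bases_gbp * 1e9 / unique_reads

print(f"{unique_mbp:.0f} Mbp unique, {mean_read_bp:.0f} bp mean read length")
```

The plain division gives about 233 Mbp, slightly below the 240 Mbp quoted; the slide does not say how the difference arises, so treat the quoted value as the source's own estimate.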

Page 3: Compartmentalized Shotgun Assembly

Combining Assembler assembles…

“…Celera and PFP sequence for a transient assembly”

…first, Celera reads,

– are checked for over-collapsed regions,

• sequences with mate pairs that match the region are kept,

• more mate pair matches = higher value assembly,

…then Celera reads are combined with PFP reads,

• “Greedy” program recognizes highest value assemblies first in order to build contigged sequence,

…then “Stones” to fill the gaps.
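The "highest value first" idea can be illustrated with a toy greedy selection. This is only a sketch under stated assumptions, not the Combining Assembler itself: the `Candidate` type, its coordinates, and the non-overlap rule are hypothetical, and "value" is taken to be the number of confirming mate pairs as the slide suggests.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    start: int       # hypothetical start coordinate in the BAC region
    end: int         # hypothetical end coordinate (exclusive)
    mate_pairs: int  # confirming mate pairs, used here as the "value"

def greedy_select(candidates):
    """Accept candidates in order of decreasing value, skipping overlaps."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.mate_pairs, reverse=True):
        # Keep a candidate only if it does not overlap anything already kept.
        if all(cand.end <= c.start or cand.start >= c.end for c in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)

candidates = [
    Candidate("a", 0, 500, 3),
    Candidate("b", 400, 900, 7),
    Candidate("c", 900, 1200, 2),
]
layout = greedy_select(candidates)
print([c.name for c in layout])
```

Here "b" wins over the overlapping "a" because it has more mate-pair support, and "c" is kept because it does not conflict with "b".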

Page 4: Compartmentalized Shotgun Assembly

Results…PFP vs. CSA

• The GenBank (PFP) data for the Phase 1 and 2 BACs yielded an average of 19.8 bactigs per BAC, of average size 8099 bp,

• Application of the Combining Assembler resulted in individual Celera/BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) per BAC region consisting of an average of 8.57 contigs of average size 18,973 bp.

p. 1313, 1st column, last paragraph
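A rough way to read these averages: multiply pieces per BAC by average piece size. Multiplying two averages is only an approximation of the true per-BAC total, so the sketch below is indicative rather than exact; the variable names are illustrative.

```python
# Per-BAC averages quoted on the slide.
pfp_bactigs, pfp_bactig_bp = 19.8, 8099    # GenBank (PFP) data
csa_contigs, csa_contig_bp = 8.57, 18973   # after the Combining Assembler

# Approximate sequence per BAC (product of averages, an assumption).
pfp_total_bp = pfp_bactigs * pfp_bactig_bp
csa_total_bp = csa_contigs * csa_contig_bp

# How much longer the average piece became.
size_ratio = csa_contig_bp / pfp_bactig_bp

print(f"PFP ~{pfp_total_bp:,.0f} bp, CSA ~{csa_total_bp:,.0f} bp, "
      f"pieces {size_ratio:.1f}x longer")
```

On these figures the approximate sequence per BAC is similar (about 160 kbp in both cases), while the pieces are fewer and roughly 2.3 times longer.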

Page 5: Compartmentalized Shotgun Assembly

Compartmentalized Shotgun Assembly

Page 6: Compartmentalized Shotgun Assembly

Celera Unique Scaffolds: WGA

• The 5.89 million Celera fragments not matching the GenBank data were assembled with the whole-genome assembler.

• The Celera assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence.

Page 7: Compartmentalized Shotgun Assembly

Compartmentalized Shotgun Assembly

Page 8: Compartmentalized Shotgun Assembly

Tiler tiles…

• Scaffolds into larger components using

– Mate End Pairs,

– BAC-end pairs,

– STS,

• Heuristic: a rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood. Unlike algorithms, heuristics do not guarantee optimal (or even feasible) solutions and are often used with no theoretical guarantee.
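As an illustration of a heuristic in this setting, the sketch below groups scaffolds whenever enough independent links connect them. It is not the Tiler's actual procedure: the `min_links` threshold, the function name, and the treatment of mate pairs, BAC-end pairs, and STS hits as interchangeable "links" are all assumptions made for the example.

```python
from collections import Counter

def tile(scaffolds, links, min_links=2):
    """Group scaffolds joined by at least `min_links` links (assumed rule)."""
    parent = {s: s for s in scaffolds}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Count links per unordered scaffold pair.
    support = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in support.items():
        if n >= min_links:          # weakly supported joins are ignored
            parent[find(a)] = find(b)

    groups = {}
    for s in scaffolds:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

links = [("s1", "s2"), ("s2", "s1"), ("s2", "s3"),
         ("s3", "s4"), ("s4", "s3"), ("s3", "s4")]
components = tile(["s1", "s2", "s3", "s4"], links)
print(components)
```

The single link between "s2" and "s3" is not trusted, so the result is two components. Like any heuristic, the threshold trades missed joins against false ones and offers no guarantee of the best tiling.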

Page 9: Compartmentalized Shotgun Assembly

Compartmentalized Shotgun Assembly

• 3,845 components,

• shredded, WGA

Page 10: Compartmentalized Shotgun Assembly

• > 100 kbp scaffolds:

– 92% sequence, 8% gaps,

– 105,264 gaps, 1,935 scaffolds,

– 1.3 Mbp scaffold size, 23,242 bp contig size.

– > 49% gaps < 500 bp,

– > 62% gaps < 1 kb,

– No gap larger than 100 kbp.

93%

Page 11: Compartmentalized Shotgun Assembly

How do you compare assemblies?

Page 12: Compartmentalized Shotgun Assembly

WGA vs. CSA

• This gives some measure of consistent coverage:

– 1.982 Gbp (95.00%) of the WGA is covered by the CSA,

– 2.169 Gbp (87.69%) of the CSA is covered by the WGA.

• Only 31 scaffolds were ~unique to an assembly,

• 295 kb (0.012%) of the CSA inconsistent with the WGA,

• 2.108 Mb (0.11%) of the WGA inconsistent with the CSA,

small regions

Overall, CSA slightly better than WGA…

Why?

How does the CSA compare with the Clone-by-Clone approach?
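A coverage figure of the kind quoted on this slide can be thought of as the fraction of one assembly spanned by alignments to the other. The sketch below shows that calculation on made-up intervals; how the original comparison produced its alignments is not described in the slides, and the function name and coordinates are hypothetical.

```python
def covered_fraction(intervals, total_length):
    """Fraction of `total_length` covered by the union of half-open intervals."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged run before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / total_length

# Toy example: matches on a 100 bp assembly (coordinates are invented).
frac = covered_fraction([(0, 40), (30, 60), (80, 100)], 100)
print(frac)
```

Running the same calculation in both directions gives two numbers, as on the slide, because each assembly has its own total length and its own unmatched regions.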

Page 13: Compartmentalized Shotgun Assembly

Hierarchical Clone-by-Clone: map first, then sequence

Whole Genome Assembly: sequence first, then map

Page 14: Compartmentalized Shotgun Assembly

Mapping Scaffolder: GM99 and fingerprint maps

Page 15: Compartmentalized Shotgun Assembly

Mapping Scaffolder: GM99 and fingerprint maps

Page 16: Compartmentalized Shotgun Assembly

Tab. 4

Page 17: Compartmentalized Shotgun Assembly

Assembly and Validation Analysis…did it really work?

• Completeness: % of euchromatic sequence in the assembly,

– estimate the size and # of gaps (Table 3),

CSA: 92.2% sequence, 7.8% gaps; 116,442 gaps

WGA: 91% sequence, 9% gaps; 102,068 gaps

PFP: 92.5% sequence, 12.9% gaps; small gaps (554 bp) = 145,514 gaps, large gaps (35 kb) = 4,076 gaps.
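To see how gap counts translate into missing sequence, the PFP figures can be multiplied out. This assumes that 554 bp and 35 kb are average gap sizes for the two classes, which the slide implies but does not state outright.

```python
# PFP gap figures quoted on the slide; the sizes are ASSUMED to be averages.
small_gaps, small_gap_bp = 145_514, 554
large_gaps, large_gap_bp = 4_076, 35_000

small_total_bp = small_gaps * small_gap_bp
large_total_bp = large_gaps * large_gap_bp
total_gap_bp = small_total_bp + large_total_bp

print(f"small: {small_total_bp / 1e6:.1f} Mbp, "
      f"large: {large_total_bp / 1e6:.1f} Mbp, "
      f"total: {total_gap_bp / 1e6:.1f} Mbp")
```

Under that assumption the few large gaps account for more missing sequence than the many small ones, which is why gap size matters as much as gap count when judging completeness.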

Page 18: Compartmentalized Shotgun Assembly

Assembly and Validation Analysis…did it really work?

• Completeness: % of euchromatic sequence in the assembly,

– estimate the size and # of gaps (Table 3),

– compare to “finished” sequences of chromosomes 21 and 22,

• 3.4 Mb gaps, 75% of gaps are repeats,

– match with STS data (ePCR, BLAST),

• 93.4% tested found assembled, 5.5% in “chaff” = 98.9%,

• Correctness:

– Mate-Pair analysis.

Page 19: Compartmentalized Shotgun Assembly

Mate Pair Analysis

Valid: correct orientation and correct distance (within ± 3 SD)

2.7% were found to be invalid.
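The validity rule on this slide can be written as a small predicate. The function name, its parameters, and the example library (mean 2,000 bp, SD 200 bp) are hypothetical; only the orientation requirement and the 3 SD window come from the slide.

```python
def is_valid_mate_pair(orientation_ok, observed_distance,
                       library_mean, library_sd, n_sd=3.0):
    """Valid = correct orientation AND distance within n_sd SDs of the mean."""
    if not orientation_ok:
        return False
    return abs(observed_distance - library_mean) <= n_sd * library_sd

# Hypothetical 2 kb library with a 200 bp standard deviation.
print(is_valid_mate_pair(True, 2500, 2000, 200))   # inside the 3 SD window
print(is_valid_mate_pair(True, 2700, 2000, 200))   # too far apart
print(is_valid_mate_pair(False, 2000, 2000, 200))  # misoriented
```

The fraction of pairs failing this test is the kind of figure reported above as "invalid".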

Page 20: Compartmentalized Shotgun Assembly

CSA vs. PFP

What does this show?

Page 21: Compartmentalized Shotgun Assembly

Chromosome 21: PFP vs. CSA

Green: Same Order, Orientation

Yellow: Same Orientation

Red: Out of Order, Orientation

Blue: Gaps

Violations:

Red: misoriented

Yellow: distance

Page 22: Compartmentalized Shotgun Assembly

Chromosome 8: PFP vs. CSA

Page 23: Compartmentalized Shotgun Assembly

PFP vs. CSA

Page 24: Compartmentalized Shotgun Assembly

What’s the take home message?

Page 25: Compartmentalized Shotgun Assembly

Fig. 7, key

Blue: breaks

Red: gaps > 10 kb

PFP

CSA

Page 26: Compartmentalized Shotgun Assembly

Fig. 7

Page 27: Compartmentalized Shotgun Assembly

Gene Prediction and Annotation

Why’s it So Hard to Find Genes?

• Exons/Introns,

• Alternative Splicing/Termination,

• Alternate transcription start/stop sites,

• Tandem Repeats, Pseudogenes, etc.

• We don’t really understand all there is to know about gene and genome structure,

• etc.

Page 28: Compartmentalized Shotgun Assembly

Gene Number Predictions?

…before PFP, WGA or CSA

Textbooks: ~100,000

Upgraded to 142,634? EST data

“…counts [that] fall far short…”

EST Data --> 35,000

35,000 genes based on the density of Chromosome 22

28,000 - 34,000: Humans vs. pufferfish

Page 29: Compartmentalized Shotgun Assembly

Automated Gene Annotation: OTTO

Tell me how it works.

How was it validated, including Table 7.

…if necessary, use the Online Primer and other NCBI resources to broaden your understanding,

– cDNAs, ESTs, RefSeq, Protein Sequence Databases, BLAST, etc. are described in appropriate detail on the Web.

Page 30: Compartmentalized Shotgun Assembly

Questions?

Page 31: Compartmentalized Shotgun Assembly
Page 32: Compartmentalized Shotgun Assembly

Repeat Resolver ...most of the remaining gaps were due to repeats.

“Rocks”

Use “low Discriminator Value” contig sets to fill gaps,

- find two or more mate pairs with unambiguous matches in the scaffold near the gap (2 kb, 10 kb or 50 kb), (1 in 10^7),

“Stones”

- find mate pair matches 2 kb, 10 kb, and 50 kb from gap, place the mate in the gap, check to see if it’s consistent with other “placed” sequences.
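The two gap-filling tiers can be summarized as a decision rule. This is a simplified sketch, not the Repeat Resolver's code: the function and its arguments are hypothetical, and treating a "stone" as a placement backed by a single mate link is an assumption, since the slide only says the mate is placed and then checked for consistency.

```python
def classify_gap_fill(mate_links, consistent_with_placed=True):
    """Label a candidate contig for a gap as 'rock', 'stone', or None.

    mate_links: number of mate pairs unambiguously anchoring the contig
                in the scaffold near the gap.
    consistent_with_placed: whether the placement agrees with sequences
                already placed in the gap.
    """
    if mate_links >= 2:
        return "rock"    # two or more unambiguous mate pairs (from the slide)
    if mate_links == 1 and consistent_with_placed:
        return "stone"   # ASSUMPTION: a single link plus a consistency check
    return None          # not enough evidence to place the contig

print(classify_gap_fill(3))
print(classify_gap_fill(1, consistent_with_placed=True))
print(classify_gap_fill(1, consistent_with_placed=False))
```

Ordering the tiers from strongest to weakest evidence means the most reliable placements go in first and the riskier ones are checked against them.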
