1
ISQuest: Finding Insertion Sequences in Prokaryotic Sequence
Fragment Data & Applications
Presented By: Abhishek Biswas, Department Of Computer Science, ODU
Presentation For: Hampton University Interview, 2nd round
Date: 06/18/2015
Advisors:
David Gauthier, Department of Biological Sciences
Desh Ranjan, Department Of Computer Science
Mohammad Zubair, Department of Computer Science
2
Presentation Outline
Biological Preliminaries
Repeat Structures, Mobile Elements and Insertion Sequences(IS)
ISQuest: Finding Insertion Sequences[1]
Applications of ISQuest Tool
Comparative Genomics
Correlative Algorithm for Repeat Placement (CARP)
Algorithm, Experiment & Results
[1] Biswas A., Gauthier D., Ranjan D., Zubair M., ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data, Bioinformatics, Accepted, June 2015
What is a genome?
A genome is an organism’s complete DNA
Collection of genes
Polypeptide codes
Genes, Operons
Non-coding region
Transcription
mRNA
Translation
Protein
A C T G: Adenine, Cytosine, Thymine, Guanine
Start Codon: AUG/GUG
Stop Codon: UAA/UAG/UGA
The DNA Sequencing Process
Multiple Copies of the Genome
Randomly Cut Pieces, Size ~3-4 Kbp
DNA Circularization (Linker)
Mate pairs: ~400 bp and ~400 bp, known distance
Single reads: ~400 bp (orientation unknown)
Depth of Coverage
Genome Assembly Process
Correctly ordering the short sequence fragments
Overlap information
Mate-pair links
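A minimal sketch of the overlap information an assembler relies on, assuming exact suffix-to-prefix matches between two reads. The function names and the `min_overlap` parameter are illustrative and are not taken from any particular assembler.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical reads sharing the four bases "TGCA".
print(merge_reads("ACGTTGCA", "TGCAGGT"))
```

Real assemblers tolerate sequencing errors in the overlap and use mate-pair links to order the merged pieces; this sketch covers only the exact-match case.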
6
Genome Assembly Output
Ideally the complete genome sequence
Most assemblies are incomplete
Repeat sequences (e.g. Insertion Sequences)
Error in sequencing process
Contiguous sequences (Contigs)/scaffolds returned
Assemblers terminate at ambiguous points
7
Assembly Validation
Often requires manual validation
View depth of coverage
Locate areas where mate pairs are stretched or compressed
Design primers and validate the joins
Requires manual effort and is time consuming
Design of primers may be problematic
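The check for stretched or compressed mate pairs can be sketched as a comparison of the observed distance with the expected library distance. The 3000 bp default and the 25% tolerance below are assumptions chosen for illustration, not values given in the slides.

```python
def classify_mate_pair(observed_bp: int, expected_bp: int = 3000,
                       tolerance: float = 0.25) -> str:
    """Label a mate pair by how its observed distance compares with the library size."""
    low = expected_bp * (1 - tolerance)
    high = expected_bp * (1 + tolerance)
    if observed_bp < low:
        return "compressed"
    if observed_bp > high:
        return "stretched"
    return "consistent"


# Hypothetical distances measured on an assembled scaffold.
observed_distances = [3100, 5200, 1500]
print([classify_mate_pair(d) for d in observed_distances])
```

Clusters of stretched or compressed pairs in one region are the places a reviewer would inspect first.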
8
Hawkeye Manual Assembly Validation
M. Schatz et al., 2007
9
Repeat Structures or Mobile Genetic Elements
A repeat structure is a repeating segment of a DNA sequence
Biologically significant
Copies not identical
Examples
Insertion Sequences
Transposase (pseudogenes)
Ribosomal RNA is coded by a large number of identical genes that are tandemly repeated to form one or more clusters.
Mobile genetic sequences
Create close copies at different locations
Mobile Elements and Insertion Sequences (IS)
[Figure: three genome diagrams. Intragenic insertion (pseudogene): the insertion sequence lies inside interrupted Gene X_1. Intergenic insertion: the insertion sequence lies between Gene X_1 and Gene X_2. Insertion replacing parts of two genes: the insertion sequence overlaps Gene X_1 and Gene X_2.]
10
11
Why are MGEs important?
Horizontal Gene Transfer
Cause of interesting evolutionary traits not explained by reproduction
Comparative genomics
Genome Assembly
Most assemblers generate incomplete assembly
Repeat sequences (e.g. Insertion Sequences)
12
Contribution of ISQuest
Annotate partial repeat structures
MGEs often degenerate during transposition
Requires no prior assembly or annotation of ORFs
Though using a draft assembly improves assembly time
Available on SourceForge
https://sourceforge.net/projects/isquest/
13
ISQuest Algorithm (1): Obtaining Seed Sequences
Input sequences (Reads/Contigs)
MegaBLAST against local “nt” Database
Select BLAST query sequences whose hits carry IS annotations in their GenBank files
Generate seed sequence library
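The seed selection step can be sketched as a filter over search hits. This sketch assumes each hit has already been reduced to a tab-separated row of query ID, subject ID, percent identity, and the subject's annotation text; that row layout, the keyword pattern, and the 90% identity cutoff are all illustrative assumptions, not ISQuest's actual parsing or settings.

```python
import re

# Hypothetical keywords that mark a hit as IS-related.
IS_PATTERN = re.compile(r"insertion sequence|transposase|\bIS\d+", re.IGNORECASE)


def select_seed_queries(hit_rows, min_identity=90.0):
    """Return query IDs (in first-seen order) that hit an IS-annotated subject."""
    seeds = []
    for row in hit_rows:
        query_id, _subject_id, identity, annotation = row.split("\t")
        if float(identity) < min_identity:
            continue
        if IS_PATTERN.search(annotation) and query_id not in seeds:
            seeds.append(query_id)
    return seeds


# Hypothetical hit rows.
example_hits = [
    "read_1\tsubj_A\t98.5\tinsertion sequence element, complete sequence",
    "read_2\tsubj_B\t99.0\t16S ribosomal RNA gene",
    "read_3\tsubj_C\t95.2\tIS256 family transposase",
    "read_4\tsubj_D\t71.0\tputative transposase",
    "read_1\tsubj_E\t93.0\ttransposase",
]
print(select_seed_queries(example_hits))
```

The selected queries would then form the seed sequence library.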
14
ISQuest Algorithm (2): Extending Seed Sequences
Assemble raw reads to ends
Find boundary
Generate new seed sequence library
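One way to picture the extension step is a greedy walk: the seed grows one base at a time while every read that overlaps its end agrees on the next base, and it stops where the reads disagree, which is where the different genomic copies of the IS run into different flanking sequence. This is a simplified stand-in for the assembly ISQuest performs, not its implementation; the function name, `k`, and `max_length` are assumptions.

```python
def extend_seed_right(seed, reads, k=4, max_length=10_000):
    """Extend `seed` to the right until the overlapping reads disagree."""
    contig = seed
    while len(contig) < max_length:
        suffix = contig[-k:]
        next_bases = set()
        for read in reads:
            start = read.find(suffix)
            while start != -1:
                after = start + k
                if after < len(read):
                    next_bases.add(read[after])
                start = read.find(suffix, start + 1)
        if len(next_bases) != 1:
            break  # no support, or conflicting support: treat as the boundary
        contig += next_bases.pop()
    return contig


# Hypothetical reads from two copies of one IS with different flanks.
reads = ["CGTGCATGACCAA", "GTGCATGAATAT"]
print(extend_seed_right("ACGTGC", reads))
```

The sketch extends in one direction only and ignores sequencing errors; the same idea applied to the reverse complement would extend the other end.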
17
ISQuest : Output Sequences
18
ISQuest : Example
19
Experimental Setup
All sequenced bacterial genomes
3810 Genomic Sequences
We simulated the DNA fragmentation process using the ART[1] simulator
Assembled all read libraries into draft assemblies
Applied ISQuest to find Insertion Sequences
Verified against GenBank annotations
[1]Huang, W., et al. (2012) ART: a next-generation sequencing read simulator, Bioinformatics, 28, 593-594.
Performance of Repeat Quest for ISs
Robinson DG, Lee M-C, and Marx CJ. OASIS: an automated program for global investigation of bacterial and archaeal insertion sequences. Nucleic Acids Research, 2012
70% Length Match
80% Length Match
90% Length Match
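The slides do not define how a length match is computed. One plausible reading, assumed here, is the fraction of an annotated IS interval that a predicted interval covers; the half-open coordinates and function name are illustrative.

```python
def length_match(predicted, annotated):
    """Fraction of the annotated interval covered by the predicted interval.

    Both intervals are (start, end) pairs, half-open, on the same sequence.
    """
    overlap = min(predicted[1], annotated[1]) - max(predicted[0], annotated[0])
    return max(0, overlap) / (annotated[1] - annotated[0])


# Hypothetical annotated IS of 1400 bp and a prediction missing its first 100 bp.
fraction = length_match(predicted=(1100, 2400), annotated=(1000, 2400))
passed = [t for t in (0.70, 0.80, 0.90) if fraction >= t]
print(round(fraction, 3), passed)
```

Under this reading, a prediction counts toward the 70%, 80%, or 90% panel when its fraction reaches that threshold.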
21
Applications : Comparative Genomics
22
Phylogenetic Tree of M. marinum
Built using kSNP[5]
[5] Gardner SN, Hall BG (2013) When Whole-Genome Alignments Just Won't Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes.
23
Orthogonal Clustering
Phylogenetic tree of 42 Mycobacterium marinum strains
Clustering based on IS sequences
ISQuest to generate IS sequences for each strain
Clustering using CD-HIT[4]
Find core IS elements
[4] "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9.
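Finding core IS elements can be sketched as a set operation once the clusters are known. The sketch assumes the clustering output has already been parsed into a mapping from cluster ID to the strains contributing at least one IS to that cluster; the mapping, names, and data are hypothetical.

```python
def core_is_clusters(clusters, strains):
    """Return IDs of clusters that contain an IS from every strain."""
    all_strains = set(strains)
    return sorted(cid for cid, members in clusters.items()
                  if all_strains <= set(members))


# Hypothetical strains and cluster membership.
strains = ["strain_A", "strain_B", "strain_C"]
clusters = {
    "cluster_1": {"strain_A", "strain_B", "strain_C"},
    "cluster_2": {"strain_A", "strain_C"},
    "cluster_3": {"strain_B", "strain_C", "strain_A"},
}
print(core_is_clusters(clusters, strains))
```

Clusters missing from some strains, such as `cluster_2` here, are the ones that can separate groups of strains.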
24
Application : Genome Assembly
25
Genome Assembly Process
Ordering the overlapping short sequence fragments
Reconstruct the source genome
26
Computationally Guided Draft Assembly
Correlative Algorithm for Repeat Placement (CARP)
Correct repeat placement
Repeat elements identified manually/computationally
Currently use only repeating Insertion Sequences
Adding lines of evidence to joins
Matched repeat elements
Mate-pair evidence
Synteny (gene organization) with reference genomes
User can look at all the evidence for a join
27
Correlative Algorithm for Repeat Placement(CARP)
Current version of CARP
Works with repeating insertion sequences
Identifies the insertion points
Joins contigs based on insertion sequences and gene synteny
Input to CARP
A set of high confidence contigs
A library of insertion sequences
One or more reference organisms
28
Step 1:Annotating the Contig Ends
Find the partial repeating IS at the contig ends
Assemblers terminate as repeat cannot be resolved
Therefore, contigs end in repeat regions
Annotation uses MegaBLAST for matching
[Figure: contigs C1, C2, C3, C4, ..., Cn and insertion sequences IS1, IS2, IS3, ..., ISn. Unknown partial repeats at the contig ends are matched against the insertion sequences, giving end-annotated contigs C1, ..., Ck whose ends carry annotated partial repeats (IS1 to IS4).]
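The end annotation can be sketched with exact string matching in place of MegaBLAST, which is a deliberate simplification. The sketch looks only at the right end of a contig and only at one orientation: it asks which IS has a prefix equal to the contig's final bases. The function name and `min_len` are assumptions.

```python
def annotate_right_end(contig, insertion_sequences, min_len=5):
    """Return (IS name, matched length) for the longest contig suffix that
    equals a prefix of an insertion sequence, or None if nothing matches."""
    best = None
    for name, is_seq in insertion_sequences.items():
        for size in range(min(len(contig), len(is_seq)), min_len - 1, -1):
            if contig[-size:] == is_seq[:size]:
                if best is None or size > best[1]:
                    best = (name, size)
                break  # longest match for this IS found
    return best


# Hypothetical IS library and a contig that runs 7 bases into IS1.
is_library = {"IS1": "ACGTGCATGA", "IS2": "TTGACCGGTA"}
print(annotate_right_end("GGGCCCAAATTTACGTGCA", is_library))
```

A fuller version would also check the left end and the reverse complement, and would allow mismatches as an alignment tool does.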
29
Mobile genetic sequences
Create close copies at different locations
Insertion Sequences (IS) and Insertion
[Figure: three genome diagrams. Intragenic insertion (pseudogene): the insertion sequence lies inside an interrupted gene. Intergenic insertion: the insertion sequence lies between Gene X_1 and Gene X_2. Insertion replacing parts of two genes: the insertion sequence overlaps Gene X_1 and Gene X_3.]
30
Step 2: Identifying Insertion Type
Classify based on two types of insertions possible
Intragenic: insertion within a gene
Intergenic: insertion not within a gene
Database of genes from reference organism
Intergenic: closest neighboring gene
[Figure: Contig 1 with IS4 and IS2 at its ends; the 200 bp regions next to them are matched with MegaBLAST against the database of genes. Labels: Gene X match, intragenic insertion; Gene Y match, intergenic insertion; no match.]
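A sketch of the classification, taking intragenic to mean an insertion inside a gene and intergenic to mean an insertion between genes. The rule used here is an assumption: a 200 bp flank whose gene match runs right up to the IS junction and stops in the interior of the gene indicates an interrupted gene. The `GeneHit` fields and the tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneHit:
    gene: str
    flank_end: int             # last aligned flank position; flank_len is next to the IS
    gene_pos_at_flank_end: int  # gene coordinate aligned to flank_end (1-based)
    gene_length: int


def classify_insertion(hit: Optional[GeneHit], flank_len: int = 200,
                       tolerance: int = 5) -> str:
    """Classify an IS insertion from the gene match of its flanking region."""
    if hit is None:
        return "no match"
    touches_junction = hit.flank_end >= flank_len - tolerance
    inside_gene = tolerance < hit.gene_pos_at_flank_end <= hit.gene_length - tolerance
    if touches_junction and inside_gene:
        return "intragenic"  # the gene is cut off at the IS
    return "intergenic"      # the matched gene is the closest neighbor


print(classify_insertion(GeneHit("geneX", 200, 450, 1200)))
print(classify_insertion(GeneHit("geneY", 150, 900, 900)))
```

In the second call the gene match ends 50 bp before the junction, so the gene is treated as the closest neighbor of an intergenic insertion.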
31
Step 3: Matching contigs
Match contigs to be joined
IS and orientation of IS must match
Intragenic: Interrupted gene must be same
Intergenic: Gene synteny with reference organism
Matches are accepted if genes are within threshold
[Figure: intragenic insertion sequence match. Contig 1 and Contig 4 both end in IS1; the two ends together give a complete gene.]
[Figure: intergenic insertion sequence match. Contig 1 and Contig 4 both end in IS1; close BLAST hit in the reference organism.]
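The matching rules can be sketched as a predicate over two annotated contig ends. The slides do not say how the threshold is measured, so this sketch assumes it is a maximum gap in reference gene order; the `ContigEnd` fields, the strand encoding, and `max_gene_gap` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContigEnd:
    contig: str
    is_name: str
    is_strand: str   # "+" or "-"
    insertion: str   # "intragenic" or "intergenic"
    gene: str        # interrupted gene, or closest neighboring gene


def ends_match(a: ContigEnd, b: ContigEnd, gene_order: dict,
               max_gene_gap: int = 2) -> bool:
    """Decide whether two contig ends are candidates for a join."""
    if a.contig == b.contig:
        return False
    if (a.is_name, a.is_strand) != (b.is_name, b.is_strand):
        return False
    if a.insertion != b.insertion:
        return False
    if a.insertion == "intragenic":
        return a.gene == b.gene  # both ends must carry the same interrupted gene
    if a.gene not in gene_order or b.gene not in gene_order:
        return False
    return abs(gene_order[a.gene] - gene_order[b.gene]) <= max_gene_gap


# Hypothetical gene order in a reference organism.
gene_order = {"geneA": 0, "geneB": 1, "geneC": 7}
end_1 = ContigEnd("Contig 1", "IS1", "+", "intergenic", "geneA")
end_4 = ContigEnd("Contig 4", "IS1", "+", "intergenic", "geneB")
print(ends_match(end_1, end_4, gene_order))
```

Each accepted pair is only a candidate join; mate-pair evidence is applied afterwards.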
32
Step 4: Applying Mate-Pair Evidence
Mate-pair evidence to validate or discard joins
Priority to strong mate-pair evidence
Major rearrangements must not be missed
Adjustable threshold values

Valid Mate-pairs   CARP Match               Join
>20                N.A. (Added Evidence)    Accepted
<20                Y                        Accepted
<3                 Y                        Further Review
None               Y                        Rejected*
* Unless the mate-pair distances are too small to cover IS gap in which case the join should be reviewed further.
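The decision table can be written as a small function. The table leaves some cases open, so two assumptions are made here: exactly 20 valid mate pairs with a CARP match is treated as accepted, and a candidate with 20 or fewer mate pairs and no CARP match is reported as not covered. The function name and parameter names are illustrative.

```python
def join_decision(valid_mate_pairs: int, carp_match: bool,
                  mate_pairs_can_span_gap: bool = True,
                  strong: int = 20, weak: int = 3) -> str:
    """Apply the mate-pair evidence table to one candidate join."""
    if valid_mate_pairs > strong:
        return "accepted"  # strong mate-pair evidence takes priority
    if not carp_match:
        return "not covered by the table"
    if valid_mate_pairs >= weak:
        return "accepted"
    if valid_mate_pairs > 0:
        return "further review"
    # No mate pairs: rejected, unless the library is too short to span the IS gap.
    return "rejected" if mate_pairs_can_span_gap else "further review"


for pairs, match in [(25, False), (10, True), (2, True), (0, True)]:
    print(pairs, match, join_decision(pairs, match))
```

The `strong` and `weak` parameters correspond to the adjustable threshold values mentioned on the slide.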
33
Viewer Showing Joins
34
Experiment Setup
We selected two bacteria with a large number of repeating ISs
Bacillus halodurans
Mycobacterium marinum M
Simulated sequencing read libraries
3Kbp mate-pair libraries
Mean 450bp read length
30x coverage
Assembled using Celera WGS[2, 3] assembler
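As a worked example of what 30x coverage means, the usual relation is coverage = (number of reads × read length) / genome length. The 5 Mbp genome size below is a hypothetical round number, not the size of either organism; the function names are illustrative.

```python
def reads_for_coverage(genome_bp: int, read_bp: int, coverage: int) -> int:
    """Smallest number of reads whose total length reaches the target coverage."""
    total_bases = coverage * genome_bp
    return (total_bases + read_bp - 1) // read_bp  # integer ceiling


def depth_of_coverage(n_reads: int, read_bp: int, genome_bp: int) -> float:
    """Average number of reads covering each genome position."""
    return n_reads * read_bp / genome_bp


# Hypothetical 5 Mbp genome, 450 bp mean read length, 30x target.
n_reads = reads_for_coverage(5_000_000, 450, 30)
print(n_reads, depth_of_coverage(n_reads, 450, 5_000_000))
```

This is an average; actual depth varies along the genome, which is why depth of coverage is inspected during assembly validation.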
35
Results
De novo assembly of M. shottsii
Three 454 read libraries (3,8 kbp paired-end)
Celera WGS assembler generated 42 scaffolds
6 repeating Insertion Sequences identified
CARP reduced number of scaffolds to 17
Organism        Contigs   Celera Scaffolds   CARP Scaffolds (a)   CARP Scaffolds (b)   Incorrect Joins (Celera, CARP)
B. halodurans   857       92                 38                   17                   (16,12,15)
M. marinum M    773       48                 30                   27                   (5,3,3)
(a) strict CARP thresholds; (b) weaker CARP thresholds
36
Summary
Computationally Guided Draft Assembly
Joins based on 3 kinds of evidence
Matched insertion sequences
Gene synteny with one or more reference organisms
Mate pairs information
Generates list of all joins
Each join annotated with evidence
User can assess confidence of join
We show that CARP provides valuable assembly validation
37
Future Work
ISQuest Algorithm
Metagenomics read handling
Incorporate capability to correctly place any repeat element
RepeatQuest
38
References
1. Michael Schatz, Adam Phillippy, Ben Shneiderman, Steven Salzberg, Hawkeye: an interactive visual analytics tool for genome assemblies Genome Biology, Vol. 8, No. 3. (09 March 2007), R34, doi:10.1186/gb-2007-8-3-r34
2. Celera Assembler - http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page
3. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, "A Whole-Genome Assembly of Drosophila," Science, vol. 287, pp. 2196-2204, March 24, 2000.
39
Desh Ranjan and Mohammad Zubair, Department of Computer Science
ISQuest: Finding Insertion Sequences
David Gauthier, Department of Biological Sciences
40
• Collaboration
• Department of Computer Science at ODU (Jing He, Desh Ranjan, M. Zubair)
• Support
• NSF-DBI-135662
• ODU startup fund, MSF fund and M&S Fellowship
• NSF HRD-0420407
• Students
• Dong Si (PhD, 2015)
• Lin Chen (current PhD student)
• Maryam Arab (current MS student)
Protein Structure Research Group
41
Thank You