

ISQuest: Finding Insertion Sequences in Prokaryotic Sequence

Fragment Data & Applications

Presented By: Abhishek Biswas, Department Of Computer Science, ODU

Presentation For: Hampton University Interview, 2nd round

Date: 06/18/2015

Advisors:

David Gauthier, Department of Biological Sciences

Desh Ranjan, Department Of Computer Science

Mohammad Zubair, Department of Computer Science


Presentation Outline

Biological Preliminaries

Repeat Structures, Mobile Elements and Insertion Sequences (IS)

ISQuest: Finding Insertion Sequences [1]

Applications of ISQuest Tool

Comparative Genomics

Correlative Algorithm for Repeat Placement (CARP)

Algorithm, Experiment & Results

[1] Biswas A., Gauthier D., Ranjan D., Zubair M., ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data, Bioinformatics, Accepted, June 2015


What is a genome?

A genome is an organism’s complete DNA

Collection of genes

Polypeptide codes

Genes, Operons

Non-coding region

Transcription

mRNA

Translation

Protein

A: Adenine, C: Cytosine, T: Thymine, G: Guanine

Start Codon: AUG/GUG

Stop Codon: UAA/UAG/UGA
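The start and stop codons on this slide are enough to sketch how a coding stretch can be located in an mRNA string. This is a minimal illustration, assuming a single reading frame and an uppercase RNA alphabet; the function name `find_orf` is hypothetical and not part of any tool named in this talk.

```python
START_CODONS = {"AUG", "GUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orf(mrna, frame=0):
    """Return (start, end) of the first start-to-stop stretch in one frame.

    `end` is exclusive and includes the stop codon; None if no ORF is found.
    """
    start = None
    for i in range(frame, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i
        elif start is not None and codon in STOP_CODONS:
            return start, i + 3
    return None
```

Calling `find_orf` once per frame (0, 1, 2) covers all three reading frames of one strand.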


The DNA Sequencing Process

Multiple copies of the genome

Randomly cut pieces, size ~3-4 Kbp

DNA circularization (linker)

Mate pairs: ~400 bp reads at a known distance

Single reads: ~400 bp (orientation unknown)

Depth of coverage


Genome Assembly Process

Correctly ordering the short sequence fragments

Overlap information

Mate-pair links
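The overlap information used for ordering fragments can be illustrated with a suffix-prefix comparison. This is a simplified sketch, assuming exact matches and error-free fragments; real assemblers tolerate mismatches, and the names `suffix_prefix_overlap` and `greedy_merge` are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_merge(a, b, min_len=3):
    """Merge two fragments on their best overlap, or return None."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:] if k else None
```

When a repeat is longer than the fragments, several fragments share the same overlap and the order becomes ambiguous, which is where mate-pair links help.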


Genome Assembly Output

Ideally the complete genome sequence

Most assemblies are incomplete:

Repeat sequences (e.g. Insertion Sequences)

Error in sequencing process

Contiguous sequences (Contigs)/scaffolds returned

Assemblers terminate at ambiguous points


Assembly Validation

Often requires manual validation

View depth of coverage

Locate areas where mate pairs are stretched or compressed

Design primers and validate the joins

Requires manual effort and is time consuming

Design of primers may be problematic
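Locating stretched or compressed mate pairs amounts to comparing each mapped distance with the library's expected distance. The sketch below assumes a 3000 bp expected distance and a 600 bp tolerance; both values and the function names are illustrative assumptions, not settings from the talk.

```python
def classify_mate_pair(observed, expected, tolerance):
    """Label one mate pair by comparing its mapped distance to the library size."""
    if observed > expected + tolerance:
        return "stretched"
    if observed < expected - tolerance:
        return "compressed"
    return "ok"


def suspicious_pairs(distances, expected=3000, tolerance=600):
    """Return (index, label) for every pair outside the accepted window."""
    flagged = []
    for i, d in enumerate(distances):
        label = classify_mate_pair(d, expected, tolerance)
        if label != "ok":
            flagged.append((i, label))
    return flagged
```

A cluster of flagged pairs in one region is the kind of signal a reviewer would then inspect by hand.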


Hawkeye Manual Assembly Validation

M. Schatz et al., 2007


Repeat Structures or Mobile Genetic Elements

Repeat structure is a repeating segment of DNA sequence

Biologically significant

Copies not identical

Examples

Insertion Sequences

Transposase (pseudogenes)

Ribosomal RNA is coded by a large number of identical genes that are tandemly repeated to form one or more clusters.


Mobile Elements and Insertion Sequences (IS)

Mobile genetic sequences

Create close copies at different locations

[Figure: three ways an insertion sequence can sit in the genome]

Intragenic insertion (pseudogene): the IS interrupts Gene X_1

Intergenic insertion: the IS lies between Gene X_1 and Gene X_2

Insertion replacing parts of two genes: the IS replaces parts of Gene X_1 and Gene X_2


Why are MGEs important?

Horizontal Gene Transfer

Cause of interesting evolutionary traits not explained by reproduction

Comparative genomics

Genome Assembly

Most assemblers generate incomplete assembly

Repeat sequences (e.g. Insertion Sequences)


Contribution of ISQuest

Annotate partial repeat structures

MGEs often degenerate during transposition

Requires no prior assembly or annotation of ORFs

Though using a draft assembly improves assembly time

Available on SourceForge

https://sourceforge.net/projects/isquest/


ISQuest Algorithm (1): Obtaining Seed Sequences

Input sequences (Reads/Contigs)

MegaBLAST against local “nt” Database

Select BLAST query sequences whose hits have IS annotations in their GenBank files

Generate seed sequence library
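The seed-selection step can be sketched as a filter over BLAST tabular output. This is an illustration under stated assumptions: rows use the default 12 tab-separated columns of BLAST's tabular format, the IS annotations have already been parsed from GenBank into `{subject_id: [(start, end), ...]}`, and the 90% identity cutoff is arbitrary. The function name and the data layout are hypothetical, not ISQuest's actual code.

```python
def select_seed_queries(blast_lines, is_regions, min_identity=90.0):
    """Return query IDs whose hits overlap an IS-annotated subject region.

    blast_lines: rows in BLAST tabular format (default 12 columns).
    is_regions:  {subject_id: [(start, end), ...]} taken from IS annotations
                 (hypothetical pre-parsed form of the GenBank features).
    """
    seeds = []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        # Subject coordinates are reversed for minus-strand hits.
        sstart, send = sorted((int(fields[8]), int(fields[9])))
        if pident < min_identity:
            continue
        for start, end in is_regions.get(sseqid, []):
            if sstart <= end and start <= send and qseqid not in seeds:
                seeds.append(qseqid)
                break
    return seeds
```

The returned IDs identify the reads or contigs that would go into the seed sequence library.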


ISQuest Algorithm (2): Extending Seed Sequences

Assemble raw reads to ends

Find boundary

Generate new seed sequence library
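One way to picture seed extension and boundary finding is a base-by-base vote among reads that cover the current end of the seed. The sketch below is an assumption about how such a step could work, not the published ISQuest procedure: it extends to the right while reads agree on the next base and reports a boundary when they diverge, as reads from different IS copies would once they run into different flanking sequence. The name `extend_right` and the parameters `k` and `min_agreement` are hypothetical.

```python
from collections import Counter


def extend_right(seed, reads, k=8, min_agreement=0.9, max_steps=10000):
    """Extend `seed` to the right until the supporting reads disagree.

    Returns (extended_sequence, reason) where reason is "boundary" when the
    next base is ambiguous and "no_reads" when no read covers the end.
    """
    seq = seed
    for _ in range(max_steps):
        anchor = seq[-k:]
        votes = Counter()
        for read in reads:
            pos = read.find(anchor)
            while pos != -1:
                nxt = pos + len(anchor)
                if nxt < len(read):
                    votes[read[nxt]] += 1
                pos = read.find(anchor, pos + 1)
        if not votes:
            return seq, "no_reads"
        base, count = votes.most_common(1)[0]
        if count / sum(votes.values()) < min_agreement:
            return seq, "boundary"
        seq += base
    return seq, "max_steps"
```

Running the same routine on the reverse complement would extend the seed to the left.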



ISQuest: Output Sequences

ISQuest: Example

Experimental Setup

All sequenced bacterial genomes

3810 Genomic Sequences

We simulated the DNA fragmentation process using the ART [1] simulator

Assembled all read libraries into draft assemblies

Applied ISQuest to find Insertion Sequences

Verified against GenBank annotations

[1]Huang, W., et al. (2012) ART: a next-generation sequencing read simulator, Bioinformatics, 28, 593-594.


Performance of Repeat Quest for ISs

[Figure panels: 70% Length Match, 80% Length Match, 90% Length Match]

Robinson DG, Lee M-C, and Marx CJ. OASIS: an automated program for global investigation of bacterial and archaeal insertion sequences. Nucleic Acids Research, 2012
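The 70%, 80% and 90% length-match thresholds can be read as the fraction of an annotated IS that a predicted IS covers. That reading is an assumption, since the slide does not define the measure; the sketch below uses 1-based inclusive coordinates and hypothetical function names.

```python
def length_match(predicted, annotated):
    """Fraction of the annotated interval covered by the predicted one.

    Both arguments are (start, end) pairs, 1-based and inclusive.
    """
    overlap = min(predicted[1], annotated[1]) - max(predicted[0], annotated[0]) + 1
    annotated_len = annotated[1] - annotated[0] + 1
    return max(overlap, 0) / annotated_len


def count_recovered(predictions, annotations, threshold):
    """Count annotated IS elements matched by some prediction at `threshold`."""
    return sum(
        any(length_match(p, a) >= threshold for p in predictions)
        for a in annotations
    )
```

Evaluating `count_recovered` at 0.7, 0.8 and 0.9 gives one count per threshold for the same set of predictions.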


Applications: Comparative Genomics


Phylogenetic Tree of M. marinum

Built using kSNP[5]

[5] Gardner SN, Hall BG (2013) When Whole-Genome Alignments Just Won't Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes.


Orthogonal Clustering

Phylogenetic tree of 42 Mycobacterium marinum strains

Clustering based on IS sequences

ISQuest to generate IS sequences for each strain

Clustering using CD-HIT [4]

Find core IS elements

[4] "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9.
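Finding core IS elements can be sketched as a membership check over the clusters. The sketch assumes that "core" means a cluster with at least one member from every strain and that the clustering output has already been parsed into a dictionary; both the definition and the data layout are assumptions made for illustration.

```python
def core_clusters(clusters, strains):
    """Return IDs of clusters that contain a member from every strain.

    clusters: {cluster_id: [(strain, sequence_id), ...]}, a hypothetical
              pre-parsed form of the clustering output.
    """
    wanted = set(strains)
    core = []
    for cluster_id, members in clusters.items():
        present = {strain for strain, _ in members}
        if wanted <= present:
            core.append(cluster_id)
    return sorted(core)
```

Clusters that miss one or more strains are the ones that distinguish strains from each other.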


Application: Genome Assembly


Genome Assembly Process

Ordering the overlapping short sequence fragments

Reconstruct the source genome


Computationally Guided Draft Assembly

Correlative Algorithm for Repeat Placement (CARP)

Correct repeat placement:

Repeat elements identified manually/computationally

Currently use only repeating Insertion Sequences

Adding lines of evidence to joins:

Matched repeat elements

Mate-pair evidence

Synteny (gene organization) with reference genomes

User can look at all the evidence for a join


Correlative Algorithm for Repeat Placement (CARP)

Current version of CARP

Works with repeating insertion sequences

Identifies the insertion points

Joins contigs based on insertion sequences and gene synteny

Input to CARP

A set of high confidence contigs

A library of insertion sequences

One or more reference organisms


Step 1: Annotating the Contig Ends

Find the partial repeating IS at the contig ends

Assemblers terminate as repeat cannot be resolved

Therefore, contigs end in repeat regions

Annotation uses MegaBLAST for matching

[Figure: contigs C1 ... Cn with unknown partial repeats at their ends are matched against insertion sequences IS1 ... ISn, giving end-annotated contigs C1 ... Ck with annotated partial repeats]
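The end-annotation step can be pictured as checking whether a contig end runs into the start or the tail of a known IS, on either strand. In this sketch exact string matching stands in for MegaBLAST, which is a deliberate simplification; the function names and the 5-base minimum are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def overlap_len(left, right, min_len):
    """Longest suffix of `left` equal to a prefix of `right` (>= min_len)."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def annotate_ends(contig, is_library, min_len=5):
    """Return {"left": (name, strand) or None, "right": ...} for one contig."""
    ends = {"left": None, "right": None}
    for name, seq in is_library.items():
        for strand, s in (("+", seq), ("-", revcomp(seq))):
            # Right end: the contig runs into the beginning of the IS.
            if ends["right"] is None and overlap_len(contig, s, min_len):
                ends["right"] = (name, strand)
            # Left end: the contig starts with the tail of the IS.
            if ends["left"] is None and overlap_len(s, contig, min_len):
                ends["left"] = (name, strand)
    return ends
```

Each contig then carries an IS name and orientation on the ends where a partial repeat was recognized.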


Insertion Sequences (IS) and Insertion

Mobile genetic sequences

Create close copies at different locations

[Figure: three ways an insertion sequence can sit in the genome]

Intragenic insertion (pseudogene): interrupted gene

Intergenic insertion: Gene X_1, Gene X_2

Insertion replacing parts of two genes: Gene X_1, Gene X_3


Step 2: Identifying Insertion Type

Classify based on two types of insertions possible

Intragenic: insertion within a gene

Intergenic: insertion not within a gene

Database of genes from reference organism

Intergenic: closest neighboring gene

[Figure: the 200 bp flanking IS4 and IS2 at the ends of Contig 1 are searched with MegaBLAST against the database of genes. Labels shown: Gene X match, intragenic insertion; Gene Y match, intergenic insertion; no match]
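The classification can be sketched with reference coordinates in place of a BLAST search. The sketch assumes the 200 bp flank has already been placed on the reference and that genes are given as `(name, start, end)`; a flank overlapping a gene is called intragenic, anything else intergenic with the closest neighboring gene reported. The function name and this coordinate-based shortcut are illustrative assumptions.

```python
def classify_insertion(flank, genes):
    """Classify an insertion from the reference position of its flank.

    flank: (start, end) where the flanking 200 bp maps in the reference.
    genes: list of (name, start, end) from the reference annotation.
    Returns ("intragenic", gene) or ("intergenic", closest_gene).
    """
    f_start, f_end = flank
    for name, g_start, g_end in genes:
        if f_start <= g_end and g_start <= f_end:
            return "intragenic", name

    def distance(gene):
        _, g_start, g_end = gene
        return g_start - f_end if g_start > f_end else f_start - g_end

    closest = min(genes, key=distance, default=None)
    return "intergenic", closest[0] if closest else None
```

The reported gene is what the matching step later compares between two contig ends.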


Step 3: Matching Contigs

Match contigs to be joined:

IS and orientation of IS must match

Intragenic: interrupted gene must be same

Intergenic: gene synteny with reference organism

Matches are accepted if genes are within threshold

[Figure: intragenic insertion sequence match. Contig 1 and Contig 4 both end in IS1 and together give a complete gene]

[Figure: intergenic insertion sequence match. Contig 1 and Contig 4 both end in IS1 and have a close BLAST hit in the reference organism]
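The matching rules can be written as a predicate over two annotated contig ends. This is a sketch under assumptions: the `ContigEnd` record, the use of a gene's index in the reference gene order as the synteny measure, and the default gap of 2 genes are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContigEnd:
    contig: str
    is_name: str
    orientation: str      # "+" or "-"
    insertion: str        # "intragenic" or "intergenic"
    gene: str             # interrupted gene, or closest neighboring gene
    gene_index: int       # position of `gene` in the reference gene order


def ends_match(a, b, max_gene_gap=2):
    """Decide whether two annotated contig ends are candidates for a join."""
    if a.contig == b.contig:
        return False
    if (a.is_name, a.orientation) != (b.is_name, b.orientation):
        return False
    if a.insertion != b.insertion:
        return False
    if a.insertion == "intragenic":
        return a.gene == b.gene
    # Intergenic: neighboring genes must be close in the reference order.
    return abs(a.gene_index - b.gene_index) <= max_gene_gap
```

Pairs that pass this test are only candidates; mate-pair evidence is applied to them afterwards.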


Step 4: Applying Mate-Pair Evidence

Mate-pair evidence to validate or discard joins

Priority to strong mate-pair evidence

Major rearrangements must not be missed

Adjustable threshold values

Valid Mate-pairs    CARP Match               Join
>20                 N.A. (Added Evidence)    Accepted
<20                 Y                        Accepted
<3                  Y                        Further Review
None                Y                        Rejected*

* Unless the mate-pair distances are too small to cover IS gap in which case the join should be reviewed further.
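The decision table can be expressed as a small function. The thresholds 20 and 3 come from the table; how exactly 20 mate pairs are treated and what happens when there is no CARP match and fewer than 20 mate pairs are not stated on the slide, so the sketch marks the latter as "not covered". The function name and the `pairs_can_span_gap` flag are hypothetical.

```python
def decide_join(valid_mate_pairs, carp_match, pairs_can_span_gap=True):
    """Map mate-pair support and CARP evidence to a decision for one join."""
    if valid_mate_pairs > 20:
        return "accepted"            # strong mate-pair evidence on its own
    if not carp_match:
        return "not covered"         # the table has no row for this case
    if valid_mate_pairs >= 3:
        return "accepted"
    if valid_mate_pairs > 0:
        return "further review"
    # No valid mate pairs: reject unless the library cannot span the IS gap.
    return "rejected" if pairs_can_span_gap else "further review"
```

Since the slide calls the thresholds adjustable, a fuller version would take them as parameters.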


Viewer Showing Joins


Experiment Setup

We selected 2 bacteria with a large number of repeating IS

Bacillus halodurans

Mycobacterium marinum M

Simulated sequencing read libraries

3Kbp mate-pair libraries

Mean 450bp read length

30x coverage

Assembled using Celera WGS[2, 3] assembler
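The read-library parameters relate through the usual depth-of-coverage estimate, coverage = reads x read length / genome length. The sketch below applies it to the 450 bp mean read length and 30x coverage from this slide; the 4,000,000 bp genome length in the usage note is a placeholder, not a figure from the talk, and the function names are hypothetical.

```python
import math


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed so that reads * read_length / genome = coverage."""
    return math.ceil(coverage * genome_length / read_length)


def depth_of_coverage(n_reads, read_length, genome_length):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_length
```

For a placeholder genome of 4,000,000 bp, `reads_for_coverage(4_000_000, 450, 30)` gives the number of simulated reads to request.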


Results

De novo assembly of M. shottsii

Three 454 read libraries (3,8 kbp paired-end)

Celera WGS assembler generated 42 scaffolds

6 repeating Insertion Sequences identified

CARP reduced number of scaffolds to 17

Organism         Contigs   Celera Scaffolds   CARP Scaffolds (a)   CARP Scaffolds (b)   Incorrect Joins (Celera, CARP)
B. halodurans    857       92                 38                   17                   (16,12,15)
M. marinum M     773       48                 30                   27                   (5,3,3)

(a) strict CARP thresholds; (b) weaker CARP thresholds


Summary

Computationally Guided Draft Assembly

Joins based on 3 kinds of evidence:

Matched insertion sequences

Gene synteny with one or more reference organisms

Mate pairs information

Generates list of all joins:

Each join annotated with evidence

User can assess confidence of join

We show that CARP provides valuable assembly validation


Future Work

ISQuest Algorithm

Metagenomics read handling

Incorporate capability to correctly place any repeat element

RepeatQuest


References

1. Michael Schatz, Adam Phillippy, Ben Shneiderman, Steven Salzberg, Hawkeye: an interactive visual analytics tool for genome assemblies, Genome Biology, Vol. 8, No. 3 (09 March 2007), R34, doi:10.1186/gb-2007-8-3-r34

2. Celera Assembler - http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page

3. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, "A Whole-Genome Assembly of Drosophila," Science, vol. 287, pp. 2196-2204, March 24, 2000.


ISQuest: Finding Insertion Sequences

Desh Ranjan and Mohammad Zubair, Department of Computer Science

David Gauthier, Department of Biological Sciences


Protein Structure Research Group

Collaboration

Department of Computer Science at ODU (Jing He, Desh Ranjan, M. Zubair)

Support

NSF-DBI-135662

ODU startup fund, MSF fund and M&S Fellowship

NSF HRD-0420407

Students

Dong Si (PhD, 2015)

Lin Chen (current PhD student)

Maryam Arab (current MS student)


Thank You
