Clustering Method for Repeat Analysis in DNA sequences

GNANA SUNDAR RAJENDIRAN JOYESH MISHRA RISHI MISHRA FALL 2008 BIOINFORMATICS Clustering Method for Repeat Analysis in DNA sequences

Post on 24-Feb-2016

Page 1: Clustering Method for Repeat Analysis in DNA sequences

GNANA SUNDAR RAJENDIRAN, JOYESH MISHRA, RISHI MISHRA

FALL 2008 BIOINFORMATICS

Clustering Method for Repeat Analysis in DNA sequences

Page 2: Clustering Method for Repeat Analysis in DNA sequences

Abstract

Implement a Proposed Clustering Technique by Volfovsky et al. for Repeat Analysis

Distribute Merging Repeats into Classes/Clusters based on Similarity Measures

Why analyze repeats?

Page 3: Clustering Method for Repeat Analysis in DNA sequences

Algorithm Description

Selection using RepeatMatch

Steps

1. Preprocessing

2. Merging and Repeat Map Generation

3. Classification

4. BLAST searches & further merging of Clusters

Results

Page 4: Clustering Method for Repeat Analysis in DNA sequences

RepeatMatch

Uses a suffix tree algorithm to find all the exact repeats in a given sequence

It is part of the MUMmer package, which is used for the rapid alignment of very large DNA and amino acid sequences.

Page 5: Clustering Method for Repeat Analysis in DNA sequences

Example…

Forward & Reverse Complement Repeats
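A minimal sketch of what distinguishes the two kinds of repeat, assuming 0-based start positions and plain Python strings. The function names are hypothetical and are not part of RepeatMatch.

```python
# Illustrative helpers only; names and 0-based coordinates are assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_forward_repeat(genome: str, a1: int, a2: int, length: int) -> bool:
    """True if the two copies are identical when read in the same direction."""
    return genome[a1:a1 + length] == genome[a2:a2 + length]


def is_reverse_complement_repeat(genome: str, a1: int, a2: int, length: int) -> bool:
    """True if the second copy is the reverse complement of the first."""
    return genome[a1:a1 + length] == reverse_complement(genome[a2:a2 + length])
```

For example, in the string AACGTTCGTT the subsequence AACG at position 0 has its reverse complement CGTT at position 6.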

Page 6: Clustering Method for Repeat Analysis in DNA sequences

Definitions

Exact Repeat: Subsequence that occurs in DNA at least twice. An exact repeat is represented by a pair of co-ordinates (A1, A2) delimiting its location in the genome sequence, and by the repeat length.

Maximal Repeat: Repeat that cannot be extended in either direction without incurring a mismatch

Initial Repeat Set: Set of repeats chosen initially, from which the repeat classes will be constructed
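The definitions of exact and maximal repeats can be checked directly on a sequence. The sketch below assumes forward repeats only, 0-based start positions, and treats A1 and A2 as the start positions of the two copies; the function names are hypothetical.

```python
def is_exact_repeat(genome: str, a1: int, a2: int, length: int) -> bool:
    """True if two distinct positions carry the same subsequence of this length."""
    in_bounds = (
        0 <= a1 and 0 <= a2
        and a1 + length <= len(genome)
        and a2 + length <= len(genome)
    )
    return (
        in_bounds
        and a1 != a2
        and length > 0
        and genome[a1:a1 + length] == genome[a2:a2 + length]
    )


def is_maximal_repeat(genome: str, a1: int, a2: int, length: int) -> bool:
    """True if the repeat cannot grow left or right without a mismatch."""
    if not is_exact_repeat(genome, a1, a2, length):
        return False
    # Extending left is possible only if both copies have an equal preceding base.
    extends_left = a1 > 0 and a2 > 0 and genome[a1 - 1] == genome[a2 - 1]
    # Extending right is possible only if both copies have an equal following base.
    end1, end2 = a1 + length, a2 + length
    extends_right = (
        end1 < len(genome) and end2 < len(genome) and genome[end1] == genome[end2]
    )
    return not (extends_left or extends_right)
```

In TACGTAGACGTC, the repeat ACGT at positions 1 and 7 is maximal, while CGT at positions 2 and 8 is exact but not maximal because it can be extended one base to the left.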

Page 8: Clustering Method for Repeat Analysis in DNA sequences

Preprocessing

Output from RepeatMatch is used to partition the original genome sequence

Each partition point has a reference to the pair of co-ordinates (A1, A2) and the repeat length l

For each repeat starting at co-ordinates A1 and A2, with length l, the list of partition points will include both (A1, A2, l) and (A2, A1, l)
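The list construction described on this slide might be sketched as follows. The input is assumed to be (A1, A2, l) tuples already parsed from the RepeatMatch output; sorting by the first co-ordinate is an assumption made here so that neighbouring points can be compared during merging, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

PartitionPoint = Tuple[int, int, int]  # (start of this copy, start of other copy, length)


def build_partition_points(repeats: Iterable[PartitionPoint]) -> List[PartitionPoint]:
    """Emit both (A1, A2, l) and (A2, A1, l) for every repeat, ordered by position."""
    points: List[PartitionPoint] = []
    for a1, a2, length in repeats:
        points.append((a1, a2, length))
        points.append((a2, a1, length))
    # Sorting by the first co-ordinate is an assumption, not stated on the slide.
    points.sort()
    return points
```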

Page 9: Clustering Method for Repeat Analysis in DNA sequences

Merging & Repeat Map Generation

This procedure works by repeatedly merging together two exact repeats that either overlap or that occur within a limited distance (a gap) of each other.

Significant subsequences of the new merging repeats appear at least twice in the genome sequence.

Merging with Gap

Merging with Overlap

Page 10: Clustering Method for Repeat Analysis in DNA sequences

Merging with Gap

Given 2 partition points: p1 = (A1, A2, lA) and p2 = (B1, B2, lB) [A1 < B1]

Compute the distance between the non-overlapping repeats as

d(p1, p2) = max(0, B1 - A1 - lA + 1)

Given a maximum Gap size G > 0

Sequences corresponding to p1 and p2 are merged if d(p1, p2) < G
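A small sketch of the gap test, transcribing the distance formula from this slide as written. Partition points are assumed to be (start, partner start, length) tuples, and the function names are hypothetical.

```python
from typing import Tuple

PartitionPoint = Tuple[int, int, int]  # (A1, A2, lA)


def gap_distance(p1: PartitionPoint, p2: PartitionPoint) -> int:
    """d(p1, p2) = max(0, B1 - A1 - lA + 1), with the points ordered so A1 <= B1."""
    if p1[0] > p2[0]:
        p1, p2 = p2, p1
    a1, _, len_a = p1
    b1 = p2[0]
    return max(0, b1 - a1 - len_a + 1)


def should_merge_by_gap(p1: PartitionPoint, p2: PartitionPoint, max_gap: int) -> bool:
    """Merge when the distance is strictly below the maximum gap size G."""
    return gap_distance(p1, p2) < max_gap
```

With p1 = (100, 500, 20) and p2 = (125, 530, 10), the distance is 125 - 100 - 20 + 1 = 6, so the two are merged for G = 10 but not for G = 5.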

Page 11: Clustering Method for Repeat Analysis in DNA sequences

Merging with Overlap

Merges sequences which are partially identical

Overlap of 2 sequences is denoted as:

o(p1, p2) = max(0, A1 + lA - B1 + 1) for A1 < B1

Given a minimum overlap proportion op, where 0 <= op <= 1, repeat points (A1, A2, lA) and (B1, B2, lB) are merged if at least one of the four repeats has overlap satisfying o(p1, p2) > op * min(lA, lB)

op: Interpreted as a fraction of the shorter of the 2 repeats. E.g. if op = 0.75, the two overlapped sequences are merged if the length of their overlap is more than 75% of the length of the shorter sequence.
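The overlap test can be sketched in the same way, again transcribing the slide's formula as written. As a simplification, this sketch compares only the first co-ordinates of the two points; it assumes the mirrored entries (A2, A1, lA) and (B2, B1, lB) are tested separately. The function names are hypothetical.

```python
from typing import Tuple

PartitionPoint = Tuple[int, int, int]  # (A1, A2, lA)


def overlap_length(p1: PartitionPoint, p2: PartitionPoint) -> int:
    """o(p1, p2) = max(0, A1 + lA - B1 + 1), with the points ordered so A1 <= B1."""
    if p1[0] > p2[0]:
        p1, p2 = p2, p1
    a1, _, len_a = p1
    b1 = p2[0]
    return max(0, a1 + len_a - b1 + 1)


def should_merge_by_overlap(
    p1: PartitionPoint, p2: PartitionPoint, min_overlap_fraction: float
) -> bool:
    """Merge when the overlap exceeds op times the length of the shorter repeat."""
    shorter = min(p1[2], p2[2])
    return overlap_length(p1, p2) > min_overlap_fraction * shorter
```

With op = 0.75, p1 = (100, 500, 20) and p2 = (105, 600, 16) give an overlap of 16 against a threshold of 12, so they are merged; moving p2 to start at 118 gives an overlap of 3 and no merge.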

Page 12: Clustering Method for Repeat Analysis in DNA sequences

Output: Merging Repeat

The new sequence is defined as a merging repeat with starting position M = A1 and with length lM = max(A1 + lA, B1 + lB) - A1

A data structure stores with each merging repeat its start co-ordinate, its length (lM), and a list of references to itself and to other repeats.
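One possible shape for that data structure, as a sketch only. Storing the references as the start co-ordinates of the four underlying copies is an assumption made for illustration; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PartitionPoint = Tuple[int, int, int]  # (A1, A2, lA)


@dataclass
class MergingRepeat:
    start: int                      # M = A1
    length: int                     # lM
    references: List[int] = field(default_factory=list)  # assumed: copy start positions


def merge(p1: PartitionPoint, p2: PartitionPoint) -> MergingRepeat:
    """Build the merging repeat covering both points, with A1 <= B1."""
    if p1[0] > p2[0]:
        p1, p2 = p2, p1
    a1, a2, len_a = p1
    b1, b2, len_b = p2
    # lM = max(A1 + lA, B1 + lB) - A1
    merged_length = max(a1 + len_a, b1 + len_b) - a1
    return MergingRepeat(start=a1, length=merged_length, references=[a1, a2, b1, b2])
```

Merging (100, 500, 20) with (125, 530, 10) gives a merging repeat starting at 100 with length max(120, 135) - 100 = 35.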

Page 13: Clustering Method for Repeat Analysis in DNA sequences

Classification

If two merging repeats have at least one reference in common, then they belong to the same class

If a merging repeat has references that belong to multiple distinct classes, then those classes are combined into one.

If a merging repeat contains no reference to an existing class, then the merging repeat forms a new class.
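The three classification rules behave like a union-find (disjoint-set) grouping, which is one way to sketch them. Union-find is a technique chosen here for illustration and is not named on the slide. Each merging repeat is assumed to be given as a list of hashable references; the function name is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def classify(reference_lists: Sequence[Sequence[Hashable]]) -> List[List[int]]:
    """Group merging repeats (by index) into classes that share references."""
    parent = list(range(len(reference_lists)))

    def find(i: int) -> int:
        # Follow parent links to the class representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_owner: Dict[Hashable, int] = {}
    for index, references in enumerate(reference_lists):
        for ref in references:
            if ref in first_owner:
                # Shared reference: same class; this also combines distinct classes.
                union(index, first_owner[ref])
            else:
                first_owner[ref] = index

    classes: Dict[int, List[int]] = {}
    for index in range(len(reference_lists)):
        # A repeat with no shared reference remains alone and forms a new class.
        classes.setdefault(find(index), []).append(index)
    return sorted(classes.values())
```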

Page 14: Clustering Method for Repeat Analysis in DNA sequences

BLAST searches and further merging

To merge similar but non-exact repeats

Search all merging repeats against all others

A local BLAST database is created with all repeat sequences using formatdb

Classes are merged if any of their underlying sequences have a BLAST E-value less than a User-Specified Threshold when compared to any sequence in another class

If a class appears in multiple similarity pairs, all these similar classes are merged with the original class
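The class-merging rule can be sketched independently of how BLAST is run. The sketch assumes the hits have already been reduced to (query id, subject id, E-value) tuples and that each sequence id maps to a class id; these input shapes and the function name are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query sequence id, subject sequence id, E-value)


def merge_classes_by_blast(
    class_of: Dict[str, int], hits: Iterable[Hit], evalue_threshold: float
) -> Dict[str, int]:
    """Return a new sequence-to-class mapping after merging similar classes."""
    parent = {class_id: class_id for class_id in set(class_of.values())}

    def find(class_id: int) -> int:
        while parent[class_id] != class_id:
            parent[class_id] = parent[parent[class_id]]
            class_id = parent[class_id]
        return class_id

    for query, subject, evalue in hits:
        # Ignore self-hits and hits that are not below the user-specified threshold.
        if query == subject or evalue >= evalue_threshold:
            continue
        root_q, root_s = find(class_of[query]), find(class_of[subject])
        if root_q != root_s:
            # Chained similarity pairs end up in one class via the shared root.
            parent[max(root_q, root_s)] = min(root_q, root_s)

    return {seq: find(class_id) for seq, class_id in class_of.items()}
```

Because merged classes share a representative, a class that appears in several similarity pairs pulls all of its partners into the same final class.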

Page 15: Clustering Method for Repeat Analysis in DNA sequences

Results

Page 22: Clustering Method for Repeat Analysis in DNA sequences

What’s Next?

The results collected help to cluster similar repeats and generate a database.

In future, this database can be used to find similar repeats faster.

Also, if future classification reveals common traits among repeats, those traits can be analyzed within the relevant cluster alone.

Page 23: Clustering Method for Repeat Analysis in DNA sequences

Resources

Software Used: RepeatMatch (MUMmer 3.0), BLAST (formatdb, blastall), MS Excel 2007

Programming Languages Used: C++, UNIX Shell Script

Source for Project & Presentation: www.cise.ufl.edu/~jmishra/bioinformatics.html

Page 24: Clustering Method for Repeat Analysis in DNA sequences

References

Volfovsky N., Haas B.J. and Salzberg S.L. A clustering method for repeat analysis in DNA sequences, 2001

MUMmer software: http://mummer.sourceforge.net/

blastall, formatdb: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

www.ncbi.nlm.nih.gov/BLAST/

Page 25: Clustering Method for Repeat Analysis in DNA sequences

References

http://www.tigr.org

http://www.animalgenome.org/blast/docs/blast_faq.html

Of course, Wikipedia …

Data obtained from: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

Page 26: Clustering Method for Repeat Analysis in DNA sequences

Thank You