Suffix Trees and Derived Applications Carl Bergenhem and Michael Smith

Post on 18-Dec-2015

SimpleScalar Suite

• Linux-based cache simulator: allows simulation of predefined cache environments

• Cross-compiles code for simulation: through GCC on Linux, Fortran or C code can be compiled specifically for SimpleScalar, which then executes the code in full while keeping statistics

Sim-cache

• General sim-cache: code run through sim-cache is simulated with the following cache parameters

– Number of sets in the structure

– Block size

– Associativity

– Replacement policy

• What this lets us do: simulate how well a program will perform on different types of CPUs with regard to cache behavior.
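To make the four parameters concrete, here is a minimal sketch of a set-associative cache model. It is not SimpleScalar code: the class name `Cache` and its methods are hypothetical, and the replacement policy is fixed to least-recently-used (LRU) as a simplifying assumption.

```python
from collections import OrderedDict


class Cache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, nsets, block_size, assoc):
        self.nsets = nsets            # number of sets in the structure
        self.block_size = block_size  # bytes per block
        self.assoc = assoc            # blocks per set (associativity)
        # One ordered dict per set; order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(nsets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Simulate one memory access; return True on a hit."""
        block = address // self.block_size
        index = block % self.nsets
        tag = block // self.nsets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.assoc:
            lines.popitem(last=False)   # evict the least recently used block
        lines[tag] = True
        return False


def run_trace(cache, addresses):
    """Feed a list of addresses through the cache and return (hits, misses)."""
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses
```

Running the same address trace against caches built with different parameters shows how associativity alone can change the hit count, which is the kind of comparison the simulator automates.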

Idea of a Suffix Tree

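•A suffix tree is a data structure that contains a path from the root to a leaf for each suffix of the input string.

•Ex: A seven-letter string will have seven leaves, provided its last character occurs nowhere else in the string (a terminator such as “$” is commonly appended to guarantee this)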

Idea of a Suffix Tree

•The internal nodes of the tree are created when two suffixes start with the same characters

•Ex: In “banana”, the suffixes “anana” and “ana” both start with “ana”, so they share the same path from the root until the point where they diverge
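The shared-path idea can be checked without building a tree at all. The sketch below simply lists the suffixes of a string and measures how far two of them agree; the helper names `suffixes` and `shared_prefix` are made up for illustration.

```python
def suffixes(text):
    """Return every suffix of text, longest first."""
    return [text[i:] for i in range(len(text))]


def shared_prefix(a, b):
    """Return the longest string that both a and b start with."""
    length = 0
    while length < len(a) and length < len(b) and a[length] == b[length]:
        length += 1
    return a[:length]


if __name__ == "__main__":
    print(suffixes("banana"))
    print(shared_prefix("anana", "ana"))
```

In a suffix tree, the shared prefix of two suffixes is exactly the part of the path they have in common before an internal node splits them.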

Building a Tree

•Starting from an empty root, and building the suffix tree for “banana”

•The first step...

[Figures: successive steps of building the suffix tree for “banana”]

Recap

•As seen, the suffix tree is created by a simple process whose number of iterations equals the length of the input string
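The iteration-per-suffix process can be sketched as a naive construction that inserts one suffix at a time and splits an edge wherever a new suffix diverges from an existing path. This is an illustrative sketch, not necessarily the construction shown on the slides: the `Node` class, the function names, and the appended “$” terminator are assumptions. Each iteration may scan a whole suffix, so this version takes quadratic time in the worst case; linear-time constructions such as Ukkonen's algorithm exist but are more involved.

```python
class Node:
    """Suffix tree node; children maps a first character to (edge label, child)."""

    def __init__(self):
        self.children = {}


def build_suffix_tree(text, terminator="$"):
    """Insert the suffixes of text + terminator one per iteration."""
    text += terminator
    root = Node()
    for start in range(len(text)):           # one iteration per suffix
        node, rest = root, text[start:]
        while rest:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this character: hang a new leaf here.
                node.children[rest[0]] = (rest, Node())
                break
            label, child = entry
            # Count how many characters the edge label and the suffix share.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # The suffix diverges inside the edge: split it with a new node.
                middle = Node()
                middle.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], middle)
                child = middle
            node, rest = child, rest[k:]
    return root


def count_leaves(node):
    """Count the leaves below node (a node with no children is a leaf)."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())
```

For “banana” with the terminator appended, the resulting tree has one leaf per suffix, and the root's outgoing edges are labelled “$”, “a”, “banana$”, and “na”.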

Use

•Fast string comparisons: a second string can be compared against the indexed string in a number of character comparisons at most equal to the length of that second string.
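The comparison bound can be seen in a small sketch. For brevity it uses an uncompressed suffix trie stored as nested dictionaries rather than a true suffix tree (an assumption made here, with hypothetical function names), but the lookup walks one character of the query per step in the same way.

```python
def build_suffix_trie(text):
    """Build an uncompressed trie of all suffixes of text as nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return (found, comparisons) for a substring query against the trie."""
    node = trie
    comparisons = 0
    for ch in pattern:
        comparisons += 1          # one lookup per character of the pattern
        if ch not in node:
            return False, comparisons
        node = node[ch]
    return True, comparisons


if __name__ == "__main__":
    trie = build_suffix_trie("banana")
    print(contains(trie, "nan"))
    print(contains(trie, "nab"))
```

The number of steps never exceeds the length of the query, regardless of how long the indexed string is.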

Example

REPuter

•REPuter is an algorithm for genome analysis that uses the suffix tree to efficiently find maximal repeats

Maximal Repeats

•A repeat is a substring that occurs at least twice within a string and whose length is at least a set threshold. It is maximal if extending it by one character in either direction loses an occurrence.

Example

•With a threshold value of 2, the word “banana” has the following repeats

•“ana” appears twice and is maximal

•“an” appears twice (both occurrences extend to “ana”)

•“na” appears twice (both occurrences extend to “ana”)
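The example can be checked by brute force. The sketch below lists every substring of at least the threshold length that occurs two or more times (overlapping occurrences count), and separately tests whether the occurrences of a repeat differ in both the preceding and the following character, which is the usual stricter test for maximality. The function names are hypothetical, and this is a verification aid rather than the REPuter method.

```python
from collections import defaultdict


def repeats(text, threshold):
    """Map each substring of length >= threshold occurring twice or more to its start positions."""
    positions = defaultdict(list)
    n = len(text)
    for start in range(n):
        for end in range(start + threshold, n + 1):
            positions[text[start:end]].append(start)
    return {sub: pos for sub, pos in positions.items() if len(pos) >= 2}


def is_maximal(text, sub, starts):
    """True if the occurrences disagree on both the left and the right neighbour.

    A string boundary is treated as its own distinct neighbour (None).
    """
    left = {text[s - 1] if s > 0 else None for s in starts}
    right = {
        text[s + len(sub)] if s + len(sub) < len(text) else None
        for s in starts
    }
    return len(left) > 1 and len(right) > 1


if __name__ == "__main__":
    found = repeats("banana", 2)
    for sub, starts in sorted(found.items()):
        print(sub, starts, is_maximal("banana", sub, starts))
```

For “banana” with threshold 2 this finds “an”, “ana”, and “na”, each occurring twice; only “ana” passes the stricter test.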

Use

•Scientists use the REPuter algorithm to find repeated substrings of at least a certain length within a genome sequence.

•A useful extension of this algorithm finds similar, rather than identical, substrings, which accounts for mutations in the DNA
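One simple way to picture "similar substrings" is to allow a bounded number of mismatched characters between two windows of the same length. The brute-force sketch below does that with Hamming distance; it is an assumption made for illustration, not a description of how REPuter implements its extension, and the function names and parameters are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_repeats(text, length, max_mismatches):
    """Return (i, j) start pairs whose windows of the given length differ in at most max_mismatches positions."""
    windows = [(i, text[i:i + length]) for i in range(len(text) - length + 1)]
    pairs = []
    for idx, (i, a) in enumerate(windows):
        for j, b in windows[idx + 1:]:
            if hamming(a, b) <= max_mismatches:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(approximate_repeats("ACGTTACGAT", 4, 1))
```

With `max_mismatches=0` the function reduces to finding exact repeats of the given length; raising it tolerates point mutations at the cost of comparing every pair of windows.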

How It Works

•The REPuter algorithm uses the suffix tree by traversing the entire tree: whenever it reaches a node that has two or more child nodes and represents a string at least as long as the threshold, that string is a repeat that cannot be extended to the right; it is a maximal repeat if its occurrences also differ in the preceding character
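The traversal described above can be sketched as follows. This is not the REPuter implementation: it uses an uncompressed suffix trie (nested dictionaries) with an assumed “$” terminator, in which the nodes with two or more children correspond to the internal nodes of the suffix tree, and it treats the threshold as a minimum length. The function names are hypothetical.

```python
def build_suffix_trie(text, terminator="$"):
    """Build an uncompressed trie of all suffixes of text + terminator."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def branching_repeats(text, threshold):
    """Return the strings spelled by nodes with >= 2 children and depth >= threshold."""
    found = []
    stack = [(build_suffix_trie(text), "")]
    while stack:                          # iterative depth-first traversal
        node, path = stack.pop()
        if len(node) >= 2 and len(path) >= threshold:
            found.append(path)
        for ch, child in node.items():
            stack.append((child, path + ch))
    return sorted(found)


if __name__ == "__main__":
    print(branching_repeats("banana", 2))
```

For “banana” with threshold 2 the traversal reports “ana” and “na”. The string “an” is not reported because its node has a single child: every occurrence of “an” continues with “a”.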

Example

PSP Algorithm

• Probe Selection Problem (PSP) Algorithm

– Relies upon the suffix tree to function.

– Operates on a set S of genomic sequences.

– In order to find an oligonucleotide (probe) for each sequence, a suffix tree of all the sequences is used.

– Allows the probe to be identified in such a way that hybridization can occur for a specific sequence and that sequence only

– Also yields the temperature at which the hybridization can occur
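The core requirement, a probe that matches one sequence and no other, can be illustrated with a much simplified sketch. It looks for a fixed-length substring that occurs in exactly one sequence of the set, using plain Python sets instead of a suffix tree over all sequences, and it ignores near matches and hybridization temperature entirely. The function name and the fixed probe length are assumptions for illustration.

```python
def unique_probes(sequences, length):
    """For each sequence, return its first substring of the given length found in no other sequence, or None."""
    kmers = [
        {seq[i:i + length] for i in range(len(seq) - length + 1)}
        for seq in sequences
    ]
    probes = []
    for idx, seq in enumerate(sequences):
        # Every substring of this length that appears in any other sequence.
        others = set()
        for j, other_kmers in enumerate(kmers):
            if j != idx:
                others |= other_kmers
        probe = None
        for i in range(len(seq) - length + 1):
            candidate = seq[i:i + length]
            if candidate not in others:
                probe = candidate
                break
        probes.append(probe)
    return probes


if __name__ == "__main__":
    print(unique_probes(["ACGTAC", "GGTACC", "ACGGTT"], 3))
```

A suffix tree built over all the sequences answers the same "which sequences contain this substring" question without enumerating every substring separately, which is why the algorithm depends on it for genome-scale inputs.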