2016/1/27 Summer Course: Pattern Search Problems, Part I: Fundamental Concepts

Upload: melinda-richard

Post on 17-Jan-2018

TRANSCRIPT

2016/1/27 Summer Course

Pattern Search Problems
Part I: Fundamental Concepts

FASTA: FastP and FastA

FastA is an algorithm that attempts to speed up string matching relative to the standard optimal alignment. The FastA algorithm is implemented in the following 6 stages:
1. Locate hot spots
2. Find the 10 best regions in the matrix
3. Score using a substitution matrix
4. Combine initial regions from different diagonals
5. Optimal alignment
6. Presentation

BLAST

The BLAST database consists of three files for every FastA file input. The first contains all of the sequence headers, that is, textual information about the amino acid or nucleotide sequence. The second contains the compressed sequences (2 bits for each nucleotide, 5 bits for each amino acid). The third contains an index of the compressed sequences so that they can be matched with the corresponding headers. The program runs in 3 rounds:
1. Database scanning (table search or finite state machine)
2. Seed growing
3. Combining alignments

Pattern matching
(Character-to-character comparison)

(Under a preprocessing, path)

Sliding Window Comparison

Sliding Windows
- Coding the sequence
  DNA/RNA: A: 00, T: 01, G: 10, C: 11
  Protein: 20 amino acids
- K-tuple overlapping sliding windows
- Sorting: bucket sort

Table Search
- Indexing table: overlapping or non-overlapping
- Indexing for the text or the patterns
- How to reduce the table size?
- How to do the search?
- How to do the filtration?

Approximate string matching? (It is still very hard to do.)

Bio-Problems
- SNP finding?
- ESTs aligned to the whole genome?
- Genome assembly?
- Consensus and signature pattern finding?
- Motif finding?
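
The Sliding Windows and Table Search slides can be illustrated with a small sketch. It uses the 2-bit codes listed above (A: 00, T: 01, G: 10, C: 11) to pack each k-tuple into an integer, builds an indexing table over the text, and answers an exact query by looking up the pattern's first k-tuple and verifying each candidate. The function names (`encode_kmer`, `build_index`, `find_exact`), the `step` parameter, and the example text are illustrative assumptions, not part of the course material, and the sketch assumes the input contains only A, T, G and C.

```python
from collections import defaultdict

# 2-bit codes as listed on the slide: A: 00, T: 01, G: 10, C: 11
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def encode_kmer(kmer):
    """Pack a DNA k-tuple into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]  # KeyError on anything but A/T/G/C
    return value


def build_index(text, k, step=1):
    """Map each encoded k-tuple to the positions where it starts in text.

    step=1 gives overlapping windows; step=k gives non-overlapping windows.
    """
    index = defaultdict(list)
    for pos in range(0, len(text) - k + 1, step):
        index[encode_kmer(text[pos:pos + k])].append(pos)
    return index


def find_exact(text, pattern, k, index):
    """Exact search against an overlapping (step=1) index.

    Look up the pattern's first k-tuple, then verify every candidate position.
    """
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    candidates = index.get(encode_kmer(pattern[:k]), [])
    return [p for p in candidates if text.startswith(pattern, p)]


# Hypothetical example text, chosen only for illustration.
text = "GATTACAGATTACC"
index = build_index(text, k=3)
hits = find_exact(text, "ATTAC", 3, index)
print(hits)
```

With k-tuples packed as integers, the table has at most 4^k distinct keys, which is one way to think about the "how to reduce the table size" question: a smaller k shrinks the table but produces more candidate positions to verify. A non-overlapping table (`step=k`) stores roughly 1/k as many positions, but the simple lookup shown here would then have to try the pattern's k-tuples at several offsets.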
Part II: Advanced Concepts
Indexing Methods for Pattern Search and Motif Finding Problems

BLAT
- Non-overlapping indexing table
- Exact and approximate match (by statistical method)
- Order concept

Using single UMs for the indexing table

Multiple-Unique Marker

Sandwich DP

MEME
(not the traditional motif definition)

Degenerate motif discovery problem

Given a set of sequences S = {S_1, S_2, ..., S_m | S_i belongs to {A, G, C, T}* for all i} and three nonnegative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S. A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. (A degenerate position is a position occupied by a character other than A, G, C or T.) For example, ARATTYT is a degenerate (7, 2)-motif.

New Challenge
- Solexa and 454 short reads
- New hardware support
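
The degenerate (l, d)-motif definition above can be made concrete with a short sketch. It only checks a given candidate pattern: whether it is a valid degenerate (l, d)-motif, and in how many sequences it occurs. It does not enumerate all motifs, which is the actual discovery problem. The helper names and the three example sequences are hypothetical; the IUPAC table is the standard nucleotide ambiguity code.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_positions(motif):
    """Count positions occupied by a character other than A, G, C or T."""
    return sum(1 for c in motif if c not in "ACGT")


def is_ld_motif(motif, l, d):
    """True if motif has length l, uses only IUPAC codes, and has <= d degenerate positions."""
    return (
        len(motif) == l
        and all(c in IUPAC for c in motif)
        and degenerate_positions(motif) <= d
    )


def occurs_in(motif, seq):
    """True if some window of seq matches the motif position by position."""
    m = len(motif)
    return any(
        all(seq[i + j] in IUPAC[motif[j]] for j in range(m))
        for i in range(len(seq) - m + 1)
    )


def support(motif, seqs):
    """Number of sequences that contain at least one occurrence of the motif."""
    return sum(1 for s in seqs if occurs_in(motif, s))


# Hypothetical input sequences, chosen only for illustration.
seqs = ["CCAAATTCTGG", "AGATTTTA", "ACATTCT"]
motif = "ARATTYT"
print(is_ld_motif(motif, 7, 2), support(motif, seqs))
```

Here ARATTYT has two degenerate positions (R and Y), so it qualifies as a (7, 2)-motif; with k = 2 it would be reported for these example sequences, since it occurs in the first two but not in the third.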