
Comp602 Bioinformatics Algorithms Labs Spring 2011

Comp602 Bioinformatics Algorithms – Laboratory Exercises – Spring 2011 (draft)

Laboratory Rules

Teams:

Scoring: 10 points when correct and submitted on time. Lose 2 points for each week late.

Laboratory Submissions: Please submit labs by Blackboard as a single Word or PDF document. Keep a copy for yourself. The document contains:

a) Problem statement. You can usually copy this from the lab assignment.

b) Algorithm used: You may refer to the textbook or some other source. The algorithm should be stated in standard algorithmic pseudocode as used in the Levitin book.

c) Big-O complexity information (if specified).

d) Program code. Submit only code you have written. Never submit generated code, object code or executables. Acceptable languages are C++ or Java. Some labs may allow submission in other languages including Perl and Matlab. All files should have a comment header including your name(s), the lab number, a brief description of the code and the date written. If the assignment is to modify code given to you, be sure to credit the original author in the header. The header should also contain a brief description of the modifications you made. Throughout the code, clearly mark the modifications you made with comments. For example, use your initials as in: //@mw …

e) Sample runs. Include at least two. Some problems will be run against data sets consisting of fragments of genetic code or amino acid sequences. Clearly identify the data and state how it was obtained. Show the output from each run. Also give timing information. Use the system clock time or some more sophisticated method such as the query performance counter to show the time used.

f) Your comments (optional). You may wish to make suggestions on how to improve the algorithm.
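One way to collect the timing information asked for in item (e) is the standard C++ clock library. The following is a minimal sketch, not a required method; the helper name timeMillis is hypothetical, and it assumes a C++11 (or later) compiler.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with the monotonic steady_clock.
double timeMillis(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // an empty std::function cannot be called
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as timeMillis([&]() { runSearch(); }) would time one run, where runSearch stands for whatever function performs your lab's work. For very short runs, repeating the work several times inside the lambda gives a more stable figure.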

Lab 1 Brute Force Pattern Matching

In this lab you will search for patterns in a string downloaded from a public DNA database. You will use a brute force algorithm such as the one described in Levitin p. 104. You are not allowed to use the built-in regular expression matching found in languages like Perl.

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found or all placements of the pattern have been tried. Brute-force pattern matching runs in time O(nm), where n is the length of the text and m is the length of the pattern.

Prof. M Werner 1 5/8/2023


Example of worst case:
T = aaa … ah
P = aaah

BruteForceMatch(T, P, m, n)
//Input: text T of size n and pattern P of size m
//Output: starting index of a substring of T equal to P, or -1
//if no such substring exists
for i ← 0 to n-m do
    /* test shift i of the pattern */
    j ← 0
    while (j < m && T[i + j] = P[j])
        j ← j + 1
    if (j = m)
        return i    /* match at i */
return -1    /* no match */
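The pseudocode above can be sketched in C++ roughly as follows. This is an illustration under stated assumptions rather than a model solution: the function name bruteForceMatch and the use of std::string are choices made here, and the loop stops at the last shift where the pattern still fits inside the text.

```cpp
#include <cassert>
#include <string>

// Returns the index of the first occurrence of `pattern` in `text`, or -1.
int bruteForceMatch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int i = 0; i + m <= n; ++i) {   // try every shift i that fits
        int j = 0;
        while (j < m && text[i + j] == pattern[j]) ++j;
        if (j == m) return i;            // match at shift i
    }
    return -1;                           // no match
}
```

On the worst-case style of input shown above (a long run of a's followed by h), every shift compares almost the whole pattern before failing, which is where the O(nm) bound comes from.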

Part 1) Implement the BruteForceMatch algorithm in C++ or Java. Do at least 2 sample runs searching for the pattern AAACTGAAAAAGAACGAAACTGTC in different chromosomes. The database resource is GenBank: http://www.ncbi.nlm.nih.gov/Genbank/index.html. You can search Nucleotide for NC_004353.

For the first sample run use the genome for chromosome 4 of Drosophila: http://myweb.wit.edu/wernerm//DrosophilaNC_004353_ffn.txt

Part 2) Searching for Metalloenzyme Motifs

a) Go to NCBI and search the protein database for “zinc-dependent metalloprotease”.

b) Narrow the search results by pressing “Homo sapiens” under top Organisms.

c) Under the second result: metalloprotease [Echis pyramidum] 617 aa protein, choose FASTA.

d) Use the “Find in this Sequence” tool to locate the pattern “HExxHxxGxxH”.

e) Try this search on some related species such as “Group III snake venom metalloproteinase [Echis ocellatus]”.

f) Modify the brute force pattern matching program from Part 1 so that the letter ‘x’ in the pattern matches any letter in the text.

g) Compare the results of your program with those from the “Find in this Sequence” tool on several sequences.
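Step f) amounts to one extra condition in the inner comparison. A minimal sketch follows, assuming that only the lowercase letter 'x' acts as a wildcard and that all match positions should be reported; the function name findAllWithWildcard is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns every start index where `pattern` matches `text`. A lowercase 'x'
// in the pattern is treated as a wildcard that matches any single letter.
std::vector<int> findAllWithWildcard(const std::string& text,
                                     const std::string& pattern) {
    assert(!pattern.empty());
    std::vector<int> hits;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int i = 0; i + m <= n; ++i) {
        int j = 0;
        while (j < m && (pattern[j] == 'x' || text[i + j] == pattern[j])) ++j;
        if (j == m) hits.push_back(i);   // record the match, keep searching
    }
    return hits;
}
```

Because protein sequences in FASTA files are normally upper case, a lowercase wildcard does not collide with a real residue letter; if your data are lower case you would need a different wildcard convention.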


http://www.ncbi.nlm.nih.gov/protein/CAA55565.1

http://www.ncbi.nlm.nih.gov/guide/



Lab 1A (Extra Credit) Boyer-Moore-Horspool Pattern Matching

Pattern matching is useful in working with DNA sequences and also in searching medical and other literature. Read the research article Fast Exact String Pattern-matching Algorithms Adapted to the Characteristics of the Medical Language. Then repeat the pattern searches in Lab 1 using the Boyer-Moore-Horspool algorithm.

Brief description of Boyer-Moore-Horspool:

Brute force slides the pattern alongside the text, matching letters left-to-right until either:
all match: returns the index of the start of the pattern in the text
mismatch: slides the pattern rightwards by 1

Boyer-Moore and similar algorithms do better by sliding the pattern more than 1 square at a time. Letters are matched right-to-left, starting with the text letter aligned with the last letter of the pattern. How much to slide depends on that text letter. If it is not in the pattern at all, then slide by the length of the pattern. If it is in the pattern, slide by the distance from the right end of the pattern to the rightmost occurrence of the letter, not counting the pattern’s last position.

For example, consider the pattern TCGCT. If the text letter is anything but T, C or G then slide by 5. If the text and the pattern agree, don’t slide but continue to match right-to-left. If they don’t agree, slide by 1 if the text letter is C, by 2 if G, by 4 if T.

Compare the time taken by brute force and Boyer-Moore-Horspool on the same data.

To achieve this easily, you need to preprocess the pattern to make a skip table, with entries for all possible letters, and the amount to slide for each. For example:

Table for “TCGCT” over the alphabet {A,C,G,T}

A C G T
5 1 2 4

Here is the algorithm copied roughly from Levitin for preparing the skip table.

SkipTable(P[0 .. m-1])
//Input: Pattern P with length m
//Output: Table[0 .. n-1] with skip values for all n letters in alphabet
for i ← 0 to n-1 do
    Table[i] ← m
for j ← 0 to m-2 do
    Table[P[j]] ← m – 1 – j
return Table

To make the lookup fast, you can use the ASCII values of the letters. Make an array with entries indexed from 0 to 127, set all the values to the length of the pattern (in this case 5) then modify the values for the letters appearing in the pattern. In this case the entry for 65 is 5, for 67 is 1, etc. Then when deciding to slide, look up the text char in the array to see how much to slide by.

http://www.ncbi.nlm.nih.gov/pubmed/10887166
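The skip table and the search loop can be sketched together as below. This is one possible reading of the description, assuming 7-bit ASCII input (other bytes are masked into the 0 to 127 range, which can only make a slide shorter, never wrong); the names buildSkipTable and horspoolMatch are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <string>

// Builds the 128-entry skip table indexed by ASCII code.
std::array<int, 128> buildSkipTable(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::array<int, 128> table;
    table.fill(m);                                  // default: slide by m
    for (int j = 0; j + 1 < m; ++j)                 // all letters but the last
        table[static_cast<unsigned char>(pattern[j]) & 0x7F] = m - 1 - j;
    return table;
}

// Horspool search: returns the index of the first match, or -1.
int horspoolMatch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    const std::array<int, 128> table = buildSkipTable(pattern);
    int i = m - 1;                  // text index under the pattern's last letter
    while (i < n) {
        int k = 0;                  // letters matched so far, right to left
        while (k < m && pattern[m - 1 - k] == text[i - k]) ++k;
        if (k == m) return i - m + 1;
        i += table[static_cast<unsigned char>(text[i]) & 0x7F];
    }
    return -1;
}
```

With a 4-letter DNA alphabet the slides are usually short, so the speed-up over brute force tends to be smaller than on English text; the timing comparison requested above is a good place to observe this.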

Lab 1B (Extra Credit) PROSITE pattern notation

a. Acquaint yourself with the PROSITE pattern notation. Start with the Sequence motif article in Wikipedia (http://en.wikipedia.org/wiki/Sequence_motif). It is excerpted here:

_____________________________________________________________________

The PROSITE notation uses the IUPAC one-letter codes and conforms to the above description with the exception that a concatenation symbol, '-', is used between pattern elements, but it is often dropped between letters of the pattern alphabet.

PROSITE allows the following pattern elements in addition to those described previously:

The lower case letter 'x' can be used as a pattern element to denote any amino acid.

A string of characters drawn from the alphabet and enclosed in braces (curly brackets) denotes any amino acid except for those in the string. For example, {ST} denotes any amino acid other than S or T.

If a pattern is restricted to the N-terminal of a sequence, the pattern is prefixed with '<'.

If a pattern is restricted to the C-terminal of a sequence, the pattern is suffixed with '>'.

The character '>' can also occur inside a terminating square bracket pattern, so that S[T>] matches both "ST" and "S>".

If e is a pattern element, and m and n are two decimal integers with m <= n, then:
o e(m) is equivalent to the repetition of e exactly m times;
o e(m,n) is equivalent to the repetition of e exactly k times for any integer k satisfying: m <= k <= n.

Some examples:

x(3) is equivalent to x-x-x.
x(2,4) matches any sequence that matches x-x or x-x-x or x-x-x-x.

________________________________________________________________________

b. Read some lecture notes on regular expressions and automata (http://www.cs.princeton.edu/courses/archive/fall09/cos126/lectures/18Theory-2x2.pdf). Also on Pattern Matching (http://www.cwr.cl/publicaciones/jstat04.pdf).

Express the pattern signature of the C2H2-type zinc finger (http://en.wikipedia.org/wiki/Zinc_finger):

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

in the form of a finite state automaton.

c. Write a program to match this pattern in any protein sequence. The program should have a number of states and transitions corresponding to the automaton in Part b.

d. Is it possible to write a more general program? This program would take in any PROSITE pattern and generate the matching states and transitions. It could then be used against any sequence.
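One way to cross-check the automaton from parts c and d is to translate the PROSITE pattern into a regular expression and let the standard library match it. The sketch below is only a checking aid, not the state-machine program the lab asks for. It assumes the pattern uses just the elements excerpted above, it does not handle '>' inside square brackets, and the name prositeToRegex is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Translates a PROSITE pattern into ECMAScript regex syntax, one character
// at a time. Assumption: '>' never appears inside square brackets.
std::string prositeToRegex(const std::string& pattern) {
    assert(!pattern.empty());
    std::string out;
    for (char c : pattern) {
        switch (c) {
            case '-': break;               // concatenation is implicit
            case 'x': out += '.';  break;  // any amino acid
            case '{': out += "[^"; break;  // "any except" set opens
            case '}': out += ']';  break;
            case '(': out += '{';  break;  // repetition count opens
            case ')': out += '}';  break;
            case '<': out += '^';  break;  // N-terminal anchor
            case '>': out += '$';  break;  // C-terminal anchor
            default:  out += c;            // letters, digits, ',', '[' and ']'
        }
    }
    return out;
}
```

For the zinc finger signature this produces C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H, which can be passed to std::regex and std::regex_search to compare against the output of your own automaton.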

Lab 2 Keyword Trees

1. Read Sections 9.3, 9.4, 9.5.

2. Write a program to implement multiple pattern matching by use of keyword trees. Proceed as follows:

3. The space delimited list of patterns and the text are stored in two separate files. You need to be able to open the files and read the strings.

4. Construct the keyword tree by adding patterns one at a time. Each node may have multiple children labeled by single letters. A flag should indicate if a node terminates one of the keywords. So you need to write a function like addChild(char c) which creates a new child node labeled by ‘c’ only if one doesn’t already exist.

5. You may choose to simplify the problem by assuming that no pattern is a prefix of any other in the set. This makes sense since if the longer pattern is matched, the prefix is matched too. This way all patterns terminate at leaves.

6. As described on page 319, traverse the keyword tree using letters from the text. If a terminal node (leaf) is found, output the pattern matched and the position of the first letter in the text matching the pattern. Back up the text to the position following the last starting position whenever there is a mismatch or a leaf node has been reached. This way you can continue searching for other matches. For example, suppose pattern “that” is matched on a terminal leaf. Back up the text to the ‘h’ to match patterns such as “hatter”.

7. Test your code on some contrived examples as in Figure 9.4.

8. Further test using the set of patterns matching the amino acid arg (see p. 66) and a long string of DNA.
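The tree construction in step 4 and the traversal in step 6 might be sketched as follows. The names TrieNode, addPattern and searchAll are hypothetical, and this version keeps walking past terminal nodes, so it also reports patterns that are prefixes of other patterns (a slight generalization of the simplification in step 5).

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool terminal = false;   // true if a keyword ends at this node

    // Creates the child labeled c only if it does not already exist.
    TrieNode* addChild(char c) {
        std::unique_ptr<TrieNode>& slot = children[c];
        if (!slot) slot.reset(new TrieNode());
        return slot.get();
    }
};

void addPattern(TrieNode& root, const std::string& pattern) {
    assert(!pattern.empty());
    TrieNode* node = &root;
    for (char c : pattern) node = node->addChild(c);
    node->terminal = true;
}

// Returns (start position, keyword) pairs for every keyword found in text.
std::vector<std::pair<int, std::string>> searchAll(const TrieNode& root,
                                                   const std::string& text) {
    std::vector<std::pair<int, std::string>> hits;
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;   // mismatch: back up to start + 1
            node = it->second.get();
            if (node->terminal)
                hits.emplace_back(static_cast<int>(start),
                                  text.substr(start, i - start + 1));
        }
    }
    return hits;
}
```

Backing up to start + 1 after each attempt is what lets “hatter” be found after “that” has matched, at the cost of rescanning letters; the textbook sections discuss how to avoid that rescanning.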

Lab 3 – Frequencies from the data due

The goal is to gain familiarity in working with DNA data sets.

1. You will divide a long DNA sequence into blocks, say 1000 letters long. For each block you will compute 4 counts, namely the numbers of A, C, G, T in the block. The user will enter the file name for the sequence and the desired block size. The output will be in the form of a text file, formatted so that it can be read into an Excel spreadsheet, as in this example.



>gi|17981852|ref|NC_001807.4| Homo sapiens mitochondrion, complete genome
Block size: 1000
Block   Start   A     C     G     T
0       0       309   311   149   231
1       1000    353   253   188   206
2       2000    347   259   171   223
3       3000    283   329   149   239
4       4000    316   312   115   257
5       5000    312   317   122   249
6       6000    260   306   168   266
7       7000    299   300   138   263
8       8000    324   319   115   242
9       9000    271   322   139   268
10      10000   314   294   102   290
11      11000   297   337   111   255
12      12000   309   313   116   262
13      13000   299   359   112   230
14      14000   349   354   81    216
15      15000   297   322   124   257

2. Open the file you created in Part 1 in Excel. Use the chart wizard to create a chart. It should look like this.

[Chart: frequency (vertical axis, 0 to 400) of A, C, G and T for each block (horizontal axis, blocks 1 to 16)]

3. You will now focus on transitions from one letter to the next, i.e. count the number of transitions from A to C, from A to A, etc. With a 4-letter DNA alphabet there are 16 possible transitions. As in Part 1, allow the user to input a block size and output a text file with the frequencies for each transition.



4. Focus on GC and AT transitions. Obtain the DNA for the virus Bacteriophage lambda, which is 48502 bases long. Create an Excel chart showing the frequency of GC and AT transitions. What do you notice from examining the chart?
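The per-block letter counts of Part 1 and the 16 transition counts of Part 3 can be gathered in a single pass. The sketch below makes two assumptions that the lab leaves open: letters other than A, C, G, T are skipped, and a transition is credited to the block that contains its first letter. The names baseIndex, BlockStats and analyzeBlocks are hypothetical, and file input and output are left out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Maps a base to 0..3 (A, C, G, T); any other letter yields -1.
int baseIndex(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

struct BlockStats {
    std::size_t start = 0;
    std::array<int, 4> counts{};                      // A, C, G, T
    std::array<std::array<int, 4>, 4> transitions{};  // transitions[from][to]
};

std::vector<BlockStats> analyzeBlocks(const std::string& dna, std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<BlockStats> blocks;
    for (std::size_t start = 0; start < dna.size(); start += blockSize) {
        BlockStats b;
        b.start = start;
        const std::size_t end = std::min(dna.size(), start + blockSize);
        for (std::size_t i = start; i < end; ++i) {
            const int from = baseIndex(dna[i]);
            if (from < 0) continue;                   // skip unknown letters
            ++b.counts[from];
            if (i + 1 < dna.size()) {                 // pair (i, i+1), if any
                const int to = baseIndex(dna[i + 1]);
                if (to >= 0) ++b.transitions[from][to];
            }
        }
        blocks.push_back(b);
    }
    return blocks;
}
```

Writing each BlockStats as one tab-separated line gives a text file that Excel opens directly; the GC and AT transitions of Part 4 are the entries transitions[2][1] and transitions[0][3].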

Lab 4 Open Reading Frames due

An Open Reading Frame (ORF) begins with a START codon and ends with a STOP codon. In this lab you will examine a strand of RNA and find all possible ORF’s. Even though it is RNA, we will use the DNA alphabet ACTG instead of the RNA alphabet ACUG. An ORF may or may not code for a protein. Usually, we establish a threshold, say 70 residues long, and claim that ORFs above the threshold are unlikely to occur by chance alone (null hypothesis) and therefore represent a protein.

1. Write a program that reads a file of RNA and creates two arrays, the first containing the locations of all START (ATG) codons, and the second containing the locations of all STOP (TAA, TAG, TGA) codons.

2. Allow the user to input a threshold. Output a report of all ORF’s above the threshold. The report should simply list the START position in the sequence and the length (in residues) of the ORF.

3. Using the genetic code, translate the ORFs into proteins. Print a report showing all proteins thus obtained.

Test your code on the SARS virus. You can find it in GenBank, accession number AY274119.3.
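Steps 1 and 2 can be combined by scanning forward, in frame, from each START codon to the first STOP codon. The following is a sketch under several assumptions: only the given strand is scanned, length is counted in codons from START up to but not including STOP, and "above the threshold" is read as "at least the threshold". The names Orf, isStop and findOrfs are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Orf {
    int start;      // index of the A in the START codon
    int residues;   // codons from START up to, but not including, STOP
};

bool isStop(const std::string& s, std::size_t i) {
    return s.compare(i, 3, "TAA") == 0 || s.compare(i, 3, "TAG") == 0 ||
           s.compare(i, 3, "TGA") == 0;
}

// Lists every ORF whose length in residues is at least `threshold`.
std::vector<Orf> findOrfs(const std::string& rna, int threshold) {
    assert(threshold >= 0);
    std::vector<Orf> orfs;
    for (std::size_t s = 0; s + 3 <= rna.size(); ++s) {
        if (rna.compare(s, 3, "ATG") != 0) continue;
        // Walk codon by codon in the reading frame set by this START.
        for (std::size_t i = s + 3; i + 3 <= rna.size(); i += 3) {
            if (!isStop(rna, i)) continue;
            const int residues = static_cast<int>((i - s) / 3);
            if (residues >= threshold)
                orfs.push_back({static_cast<int>(s), residues});
            break;   // the first in-frame STOP ends this ORF
        }
    }
    return orfs;
}
```

Note that two START codons in the same frame produce two overlapping ORFs ending at the same STOP; whether to report both or only the longest is a design decision to state in your write-up.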



Lab 5 Partial Digest Problem due

Write a program to implement the partial digest problem using the algorithm on page 90 in the Pevzner book. Test your program on at least 2 data sets.

1. The data set in Figure 4.1 on p. 85, namely: 2 2 3 3 4 5 6 7 8 10. Your algorithm should produce 2 solutions, the one shown in Figure 4.1, namely {0,2,4,7,10}, and also its homometric twin: {0,3,6,8,10}.

2. The dataset: 1 3 3 3 3 4 4 4 6 7 7 7 7 10 10 10 11 13 14 14 17. Be careful to get this right since I know the answers.

Hints: Create a class named Multiset. I did this in C++. You need to have a way to store the values. I simply used an array. The array index is the value. The number stored at that index is the count. So if a[5] is 2 that means there are 2 fives in the multiset. This is a bit crude since many array slots aren’t used but it gets the job done. I also used variables to hold the current count of values and the current largest value.

My class has the following operations:

Multiset(void);                      // builds an empty multiset
~Multiset(void);
void addValue(int value);            // increments the count for that value
void removeValue(int value);         // decrements the count for that value
void printMultiset(ostream& out);    // prints the nonzero values and counts
int readValues(istream& in);         // builds a multiset by reading a file of space delimited integers
bool subset(Multiset superset);      // returns true if all values in this multiset are also in superset
Multiset* delta(int y);              // given a point y and set X, returns the multiset of distances (made positive) between y and elements in X
void subtract(Multiset* other);      // subtracts values in other multiset from this one
void add(Multiset* other);           // adds values in other multiset to this one
int computeLargest(int oldLargest);  // if the oldLargest is removed a new one is computed

The original sequence X that you are trying to recover is actually a set so you should make a set class too; I was lazy and used the Multiset class. Once you have created the Multiset class you can implement the PartialDigest(L) and Place(L,X) functions according to the algorithm. I made these stand-alone functions. Good Luck.
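For comparison with the array-based class above, here is a sketch of the same branch-and-backtrack idea built on std::multiset and std::set instead. It is a swapped-in data structure, not the class described in the hints, and the names Multi, delta, removeAll, place and partialDigest are choices made for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <iterator>
#include <set>
#include <vector>

using Multi = std::multiset<int>;

// Distances between point y and every element of X.
std::vector<int> delta(int y, const std::set<int>& X) {
    std::vector<int> d;
    for (int x : X) d.push_back(std::abs(y - x));
    return d;
}

// Removes every distance in d from L; on failure restores L and returns false.
bool removeAll(Multi& L, const std::vector<int>& d) {
    for (std::size_t i = 0; i < d.size(); ++i) {
        Multi::iterator it = L.find(d[i]);
        if (it == L.end()) {
            L.insert(d.begin(), d.begin() + i);   // undo earlier removals
            return false;
        }
        L.erase(it);
    }
    return true;
}

void place(Multi& L, std::set<int>& X, int width, std::vector<std::set<int>>& out) {
    if (L.empty()) { out.push_back(X); return; }
    const int y = *L.rbegin();                    // largest remaining distance
    const int candidates[2] = {y, width - y};
    const int tries = (candidates[0] == candidates[1]) ? 1 : 2;
    for (int c = 0; c < tries; ++c) {
        const std::vector<int> d = delta(candidates[c], X);
        if (!removeAll(L, d)) continue;           // delta is not contained in L
        X.insert(candidates[c]);
        place(L, X, width, out);
        X.erase(candidates[c]);                   // backtrack
        L.insert(d.begin(), d.end());
    }
}

std::vector<std::set<int>> partialDigest(Multi L) {
    assert(!L.empty());
    const int width = *L.rbegin();
    L.erase(std::prev(L.end()));                  // remove one copy of the width
    std::set<int> X = {0, width};
    std::vector<std::set<int>> out;
    place(L, X, width, out);
    return out;
}
```

The array-of-counts representation in the hints makes removal and membership tests constant time, so it is likely to be faster than this tree-based version on large inputs.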

Lab 5a (Extra Credit) Working with public databases in biology

This lab is meant to familiarize you with some of the important public databases used in biology and medicine. The lab is taken from the textbook site http://www.bioalgorithms.info/problems/00_asmt.htm and is reproduced here for your convenience.

Practical Problem Set #1: Working with Databases

Part A: Green Fluorescent Protein

One of the most well studied proteins in molecular biology is the green fluorescent protein, or GFP. For this problem, you will visit popular online biological databases websites and gather information on GFP.

Genbank:

1. Genbank is a database of nucleotide sequences. It can be accessed at the NCBI website (National Center for Biotechnology Information) at http://www.ncbi.nlm.nih.gov/. In the search pull down menu at the top, make sure "nucleotide" is selected. In the text box at the top of the screen where it solicits input for searching, type "GFP" and hit the Go button.

2. This search will bring up over 1000 results. To narrow the search, click on "Limits" just below the box where you typed "GFP". Limit the search to "gene name" (in the dropdown box) and click the "Go" button again. You will now have approximately 50 results. Go to the end of the list (you will have to click "next" one time; the "next" link appears to the right).

3. The last two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653 (the last entry), look over the Genbank record, and answer the following questions:

   1. How long is the nucleotide sequence?
   2. How many "guanines" appear in the gene's DNA sequence? Is there a bias towards any particular nucleotide?
   3. What is the Latin name of the organism whose DNA was sequenced for this GFP?

Swissprot:

1. Swissprot is a database of amino acid sequences that can be accessed at http://us.expasy.org/sprot/. At the Swissprot homepage, type GFP and click the Search button. The last link in the Swissprot section (not the trembl section) should be GFP_AEQVI (P42212).

2. Examine the web page for this protein, and answer the following:

   1. How many references are cited?
   2. This Swissprot record has links to other databases. Pfam (Protein Families) is a database of multiple alignments. Pfam accession numbers begin with the letters PF, followed by five numbers (e.g. PF12345). What is the Pfam accession number for GFP_AEQVI? (NOTE: An accession number is simply a tag that you can use to refer to a particular item in a database. Many of the databases you will use will have accession numbers. There is no standard formatting for accession numbers across databases.)

3. The Swissprot database is available via ftp. To see the data in its textual format (i.e. what you get when you ftp), scroll down to the bottom of the GFP_AEQVI web page, and click the link that says "View entry in raw text format (no links)." Answer the following questions:

   1. The first two letters on each line identify what kind of line it is (e.g. ID = Identifier, DT = date, etc.). Find the line that has the Latin name for the species. What two letters appear at the beginning of the line?
   2. What two symbols, which appear on a line by themselves at the bottom of the file, indicate the end of the record for GFP_AEQVI? (In the ftp file that you can download, these symbols are the "record separators".)

Protein Data Bank:

1. The PDB (Protein Data Bank) is a database of protein structures at http://www.rcsb.org/pdb. From that page, click the “SearchLite” link. On the resulting page, type “GFP” into the text box and click the “Search” button. Look at the first result (1EMB) and click the “Explore” link to the right. Then click on the “Download/Display File” link (on the left). Then, click on the link to display the structure file in PDB file format complete with coordinates as HTML.

2. In this file the majority of lines are “ATOM” lines. Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1908). Answer the following questions:

   1. What kind of atom is #16? (3rd column)
   2. What kind of amino acid is atom #16 in? (4th column)
   3. What are the (x,y,z) coordinates of atom #16?

ENSEMBL:

1. ENSEMBL is a web-based genomic resource available at http://www.ensembl.org/. The first website is the ENSEMBL home page. How many species are available on this website?

2. Search for anything with “zinc finger” (a structural motif in proteins). Find the first mouse GENE, and browse to that page. Please note that mouse is Mus musculus, and that the results are grouped in several ways. You are looking for GENE index. Follow the first link. Record the following:

   1. The ensembl mouse gene id for this first link.
   2. The genomic location for this mouse gene.
   3. The cDNA transcript for this mouse gene - found by following link to view gene in genomic location and looking over the basepair view. Move the pointer slowly.
   4. The ensembl human gene id for homologous protein (back to the mouse gene specific page and look down to homology).
   5. The genomic location for this human gene.
   6. How many cDNA transcripts are given? Record for ENSESTT00000205818.
   7. What is the Hamming distance between the first ten nucleotides?
   8. Human and mouse genomes can be partitioned into a large number of synteny blocks, with each human synteny block corresponding to a mouse synteny block. What mouse synteny block is the mouse gene located on? What human synteny block does the human homolog belong on? What is the correspondence between these two synteny blocks?

Part B: Searching A Nucleotide Sequence Database

1. Go to the following web page: http://nh-brin.unh.edu/Bioinformatics/Tutorials/DinoDNA/

2. Copy the DNA sequence marked JurassicPark DinoDNA from the book Jurassic Park. (Read the text to learn the story behind this particular DNA).

3. Go to the NCBI Blast home page at http://www.ncbi.nlm.nih.gov/BLAST/. Go to the link that says Nucleotide-nucleotide BLAST [blastn].

4. Paste the DinoDNA DNA sequence into the text box and hit the Blast! button.

   1. What is the gi number of the first result?
   2. What is the length of the match of the first result?
   3. What is the e-value of the first result?
   4. Is the DNA sequence in Jurassic park fictional (i.e. made up / random) or “borrowed” (i.e. copied from real DNA)?

Lab 6 Dynamic programming and the Manhattan tourist problem due

Write a program to solve the Manhattan tourist problem. It should hard-code a graph consisting of nodes (street intersections) and edges (city blocks). Edges containing a tourist attraction should be weighted 1 (or n if there are n attractions). Other edges are weighted 0. A node has an int-valued field named score and an enum-valued field named from. The enum values are {WEST, NORTH, DIAGONAL}. The from field indicates the last direction used to reach the node. The idea is to fill in the node values starting from the northwest corner and ending at the southeast corner. When filling in a node, consider the maximum score obtainable by reaching the node from the north, the west or from a diagonal (if there is one). At the end, the southeastern-most node contains the highest possible score. A path can be recovered by following the from pointers back to the northwest node.
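The fill-in order described above can be sketched as follows. To stay short, this illustration leaves out diagonal edges and passes the edge weights in as two matrices rather than hard-coding them; the names GridNode, Weights and fillScores, and the NONE value for the starting corner, are assumptions made here.

```cpp
#include <cassert>
#include <vector>

enum class From { NONE, WEST, NORTH };

struct GridNode {
    int score = 0;
    From from = From::NONE;
};

using Weights = std::vector<std::vector<int>>;

// down[i][j]  : weight of the edge from node (i, j) to node (i+1, j)
// right[i][j] : weight of the edge from node (i, j) to node (i, j+1)
std::vector<std::vector<GridNode>> fillScores(const Weights& down, const Weights& right) {
    const std::size_t rows = right.size();               // number of node rows
    assert(rows >= 1 && down.size() + 1 == rows);
    const std::size_t cols = right[0].size() + 1;        // number of node columns
    std::vector<std::vector<GridNode>> node(rows, std::vector<GridNode>(cols));
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            if (i > 0) {                                  // arrive from the north
                node[i][j].score = node[i - 1][j].score + down[i - 1][j];
                node[i][j].from = From::NORTH;
            }
            if (j > 0) {                                  // arrive from the west
                const int viaWest = node[i][j - 1].score + right[i][j - 1];
                if (i == 0 || viaWest > node[i][j].score) {
                    node[i][j].score = viaWest;
                    node[i][j].from = From::WEST;
                }
            }
        }
    }
    return node;
}
```

The best score is the score field of the last node in the last row, and the path is recovered by following the from fields back to node (0, 0). Ties are resolved in favor of NORTH here; any consistent rule works.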

Lab 7 Dynamic Programming and sequence alignment due

Extend the techniques from Lab 6 to finding optimal alignment between two sequences of DNA.

1. Find the longest common subsequence given two sequences. Follow the method described in Section 6.5. Use the LCS and PrintLCS algorithms on p. 176. Test your work on short sequences including the examples in Section 6.5. Then try your algorithm on longer sequences, say of length 1000.

2. Use the Smith-Waterman local alignment algorithm to find substrings of two sequences, v and w, which are maximally aligned among all substrings. To do this use a scoring matrix that rewards matches with +2 and penalizes indels with -1. The minimum score at any vertex of the alignment graph is 0. Follow the methods in Section 6.8. Test your work on short and long sequences.
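The scoring part of item 2 can be sketched as below. The lab fixes +2 for matches and -1 for indels but does not give a mismatch score, so the -1 mismatch default is an assumption; the function name smithWatermanScore is hypothetical, and backtracking to print the aligned substrings is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the best local alignment score between v and w.
int smithWatermanScore(const std::string& v, const std::string& w,
                       int match = 2, int mismatch = -1, int indel = -1) {
    assert(match > 0);
    std::vector<std::vector<int>> s(v.size() + 1, std::vector<int>(w.size() + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= v.size(); ++i) {
        for (std::size_t j = 1; j <= w.size(); ++j) {
            const int diag = s[i - 1][j - 1] + (v[i - 1] == w[j - 1] ? match : mismatch);
            const int up = s[i - 1][j] + indel;
            const int left = s[i][j - 1] + indel;
            s[i][j] = std::max({0, diag, up, left});   // scores never drop below 0
            best = std::max(best, s[i][j]);
        }
    }
    return best;
}
```

The 0 inside std::max is what makes the alignment local: any prefix with a negative running score is discarded and the alignment restarts at that vertex.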

Lab 8 Hierarchical Clustering

1. Read Jones & Pevzner Chapter 10.1 – 10.3

2. Write a program that implements a variation of the Hierarchical Clustering algorithm on p. 345. Your program is meant to cluster similar genes. Similarity is based on the levels of gene expression in a microarray experiment. There are n genes and m samples. So each gene g is represented as an m-dimensional vector, g = <g1, g2, ..., gm>, where gi is the level of observed expression in sample i.

Your program should allow for different distance functions to be used for finding the closest clusters. Three such functions are shown on p. 345.

The program should print out intermediate results at each level of clustering. Alternatively, you could keep the results in a data structure and allow the user to request the clusters at a certain stage, say when there are 10 clusters. The intermediate results could take the form of a parenthesized string, i.e.

(11 (3 5 2) 10 (6 4 (1 8)) (7 9))

3. Test your program on this tiny dataset taken from http://www-server.bcc.ac.uk/oncology/MicroCore/HTML_resource/Hier_Example.htm

          sample 1   sample 2   sample 3
p53       9          3          7
mdm2      10         2          9
bcl2      1          9          4
cyclinE   6          5          5
caspase   1          10         3

4. Test your program by clustering cities using the city mileage chart below:

Distance between some major USA cities

Miles above the diagonal, kilometers below it. Columns follow the same city order as the rows.

                     ATL    CHI    DEN    HOU    KC     LA     MSP    MIA    NY     SF     SEA
Atlanta, GA          -      715    1405   800    805    2185   1135   665    865    2495   2785
Chicago, IL          1150   -      1000   1085   525    2020   410    1380   795    2135   2070
Denver, CO           2260   1615   -      1120   600    1025   915    2065   1780   1270   1335
Houston, TX          1285   1750   1805   -      795    1550   1230   1190   1635   1930   2450
Kansas City, MO      1295   850    965    1280   -      1625   440    1470   1195   1865   1900
Los Angeles, CA      3515   3250   1650   2495   2610   -      1935   2740   2800   385    1140
Minneapolis, MN      1825   665    1470   1980   680    3110   -      1795   1200   2010   2015
Miami, FL            1070   2220   3320   1915   2365   4405   2885   -      1280   3115   3365
New York, NY         1390   1275   2865   2630   1925   4505   1935   2060   -      3055   2860
San Francisco, CA    4015   3435   2040   3105   3000   615    3240   5015   4915   -      810
Seattle, WA          4485   3330   2140   3940   3060   1835   2675   5415   4600   1305   -
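Every variant of the clustering algorithm needs a distance between individual items before it can compare clusters. As a starting point for the gene data in item 3, the sketch below uses Euclidean distance between expression vectors, which is an assumption (the lab leaves the choice of distance open), and finds the closest pair, which is the first merge the algorithm would make. The names euclidean and closestPair are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Euclidean distance between two expression vectors of equal length.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Returns the indices (i < j) of the two closest genes.
std::pair<int, int> closestPair(const std::vector<std::vector<double>>& genes) {
    assert(genes.size() >= 2);
    std::pair<int, int> best(0, 1);
    double bestDist = euclidean(genes[0], genes[1]);
    for (std::size_t i = 0; i < genes.size(); ++i) {
        for (std::size_t j = i + 1; j < genes.size(); ++j) {
            const double d = euclidean(genes[i], genes[j]);
            if (d < bestDist) {
                bestDist = d;
                best = std::make_pair(static_cast<int>(i), static_cast<int>(j));
            }
        }
    }
    return best;
}
```

For the city chart in item 4 the distances are already given, so the same closest-pair search can read them straight from the table instead of calling euclidean.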

Lab 8A (Extra Credit) Test your program with microarray data from a public source.

Try your program with some real microarray data. There are public repositories:
Broad Institute: http://www.broad.mit.edu/tools/data.html
Stanford: http://genome-www5.stanford.edu/

You will need to write some code to translate between the formats used in these public databases and the data structures you will use to store the data in your program. Alternatively, you may find code you can use on the Internet.

Since these datasets are very large, please submit timing results as well as answers.

Lab 8B K-Means Clustering July 14 (Extra Credit)

1. Read Jones & Pevzner Section 10.3.

2. Write a program to implement the algorithm on page 348.

3. Test your program with k = 4, using the city mileage chart in Lab 8.




Lab 9 Shortest Superstring

1. Carefully read Pevzner, Section 8.4.

2. Write a program that reads a file containing a space delimited list of short strings. It creates an overlap graph from this file. Each vertex contains one of the short strings. It has weighted directed edges to every other vertex as in Figure 8.15. Write a function that determines the edge weights as follows: For vertices u,v the edge (u,v) has weight equal to the length of the longest overlap formed by a suffix of u and a prefix of v. For example, if u contains “WENTWORTH” and v contains “THREW”, then weight(u,v) = 2 and weight(v,u) = 1.

3. Finally write a function that traverses the graph visiting all the vertices. At each vertex it prints out its string except for the duplicated prefix. i.e. in traversing the (u,v) edge above the function would print “WENTWORTHREW”. Each such traversal prints a superstring. Unfortunately there are (n-1)! different possible traversals making it infeasible to test all of them to see which produces the shortest superstring. Instead, follow a greedy strategy. From each vertex choose an outgoing edge with the highest weight of all outgoing edges.

4. Test your program with the examples shown in Figures 8.14 and 8.15. You may not get the identical superstring, but its length should be the same.

5. Test your program again using 10 randomly generated 5-mers of {A,T,C,G}.
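The edge weight function from step 2 and the greedy traversal from step 3 might look like the sketch below. It assumes the traversal starts at the first string in the list and breaks ties in favor of the string that appears earlier in the list; the names overlap and greedySuperstring are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Length of the longest suffix of u that is also a prefix of v.
int overlap(const std::string& u, const std::string& v) {
    const std::size_t maxLen = std::min(u.size(), v.size());
    for (std::size_t len = maxLen; len > 0; --len)
        if (u.compare(u.size() - len, len, v, 0, len) == 0)
            return static_cast<int>(len);
    return 0;
}

// Greedy traversal: from the current vertex, always follow the heaviest edge
// to an unused vertex, appending only the non-overlapping part of its string.
std::string greedySuperstring(const std::vector<std::string>& strings) {
    assert(!strings.empty());
    std::vector<bool> used(strings.size(), false);
    std::size_t current = 0;
    used[0] = true;
    std::string result = strings[0];
    for (std::size_t step = 1; step < strings.size(); ++step) {
        int bestWeight = -1;
        std::size_t best = 0;
        for (std::size_t v = 0; v < strings.size(); ++v) {
            if (used[v]) continue;
            const int w = overlap(strings[current], strings[v]);
            if (w > bestWeight) { bestWeight = w; best = v; }
        }
        used[best] = true;
        result += strings[best].substr(bestWeight);
        current = best;
    }
    return result;
}
```

Different starting vertices or tie-breaking rules can give superstrings of different lengths, so it is worth trying each vertex as the start and keeping the shortest result.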

Hints:

I did this using an array of vertices. Each vertex object had a function which was able to compute the overlap between a suffix of its own string and the prefix of another string. It stored these results in an array. After all the vertices were constructed they were commanded to compute all their overlaps.

I then programmed a traversal for printing the overlaps and superstring starting with the first vertex and proceeding in a greedy fashion to other unused vertices until it returned to the first vertex. Here were my results on the problem in Figure 8.14.

Original Strings as read from file bin3.txt:
001 000 010 011 100 101 110 111

Superstring (strings in the order visited):
001 010 100 000 011 110 101 111

001000110111

length = 12

Notice that the superstring produced was not optimal. The string 0001110100 is 2 letters shorter. This is not surprising since we know that the greedy strategy gives only an approximate solution. -m werner

Lab 9A (Extra Credit) Sequencing by Hybridization due July 21
(Note: This lab needs some polishing @mw)

1. Read Sections 8.6 – 8.8.

2. Solve the SBH problem.

SBH
//Input: Set S of l-mers from string s
//Output: String s such that Spectrum(s,l) = S

Where Spectrum(s,l) = multiset of n-l+1 l-mers in s, and n is the length of s.

3. Work with l-mers over a 4 letter alphabet. Construct a graph as follows: The vertices are all 4^l possible l-mers. Store a collection of l-mers from an unknown string in a file. For each such l-mer m, make an edge from vertex p to q where the first l-1 letters of m match the last l-1 letters of p and the last l-1 letters of m match the first l-1 letters of q.

4. Print the string s by traversing the Eulerian circuit of edges.
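Spectrum(s,l) as defined above is easy to compute directly, which is handy for generating test inputs and for checking a reconstructed string. A minimal sketch, with the function name spectrum chosen here:

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns the multiset of all n-l+1 substrings of length l in s.
std::multiset<std::string> spectrum(const std::string& s, std::size_t l) {
    assert(l > 0 && l <= s.size());
    std::multiset<std::string> out;
    for (std::size_t i = 0; i + l <= s.size(); ++i) out.insert(s.substr(i, l));
    return out;
}
```

If your program reconstructs a string r from a set S, comparing spectrum(r, l) with S confirms that r is a valid answer even when it differs from the string you started with.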

Lab 9B (Extra Credit) Greedy Approach to Motif Finding due Jul 21

This lab is meant to familiarize you with some of the important public motif finding tools. The lab is taken from the textbook site http://www.bioalgorithms.info/problems/01_asmt.htm reproduced here for your convenience.

Motifs and Profiles. Answer briefly using your own words.

1. Give a precise definition of "motif".

2. Table 1 (below) contains a set of 10 patterns representing a motif of length l=8.

   1. Construct an alignment of all instances of the motif shown.
   2. Do you think multiple alignment tools like Clustal would be able to find this motif?
   3. Construct a profile.
   4. Construct a consensus sequence.
   5. For the alignment of patterns from a, compute the consensus score of your profile from question b.
   6. For the alignment of patterns from a, compute the entropy score of your profile.
   7. Compute the total distance from your consensus string from question 5(c) to each of the patterns in table 1.
   8. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to find the consensus sequence with up to 1 mismatch in the genome of length 10^6?
   9. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to find the consensus sequence with up to k mismatches in the genome of length 10^6?
   10. Assuming a uniform nucleotide distribution in the genome, how many times would you expect to see your consensus sequence in a text of length 1,000,000?

GAACTCATGGTG

AAAAGCACGGTC

TCAAAGCAAGGC

CCTAATCAGGGC

AAGTATGGACTC

ACTAAGCAGGGT

TCTCACGGCCCA

CCTCGTGGTGGG

TACCGTATGGTT

ACCACTCGTCGA



Motif Finding Tools

A biologist at your university has found 15 target genes that she thinks are co-regulated. She gives you 15 upstream regions of length 50 base pairs in FASTA format, file DNASample50.txt, and asks you to identify the motif, and if possible the potential regulating protein. She tells you the sequences are from Homo sapiens, and by intuition feels the motif is of length 8. She wants you to suggest only the best possible candidate motif.

Part A: Instructions

Attach ALL output files with results. Record all your parameters and collect all output files. For each program, make a decision regarding the one motif that you think is best.

1. Run Consensus (use advanced version).

2. Run MITRA.

3. Run Gibbs Sampler with all options available. Do not invoke the recursive sampler or provide a background model.

4. Run MEME.

Consider all motifs generated, select the best motif and perform the following:

1. Using an alignment of the binding sites identified by the motif finders, generate a representative sequence logo.

2. Determine a potential DNA binding protein using TRANSFAC, a database of eukaryotic DNA binding proteins. Identify the potential regulating protein by either generating a small number of plausible patterns from your sequence logo or using the binding sites. Do not search using the full length sequences.

After you have run all the programs, your biologist friend confesses that she is not sure if her intuition about the motif length was correct. Re-run all the tools above without knowledge of the motif length. Do you get the same results?

Part B: Questions

1. Did all tools generate the same motif?

2. Which tool would you run if the length of the motif is unknown?

3. If you increased the queue size for Consensus, what would you expect to happen?

4. What would you expect to happen if you increased the number of iterations for the Gibbs Sampler?

5. TRANSFAC contains several tools that can search a set of sequences using known profiles. Imagine you are studying a set of sequences that are regulated by an unknown protein that is very similar to a protein in TRANSFAC. What might happen in this case if you were only to search using TRANSFAC profiles?

6. In this case you were given a very narrow upstream region to search. Often, you are instead asked to search upstream regions many base pairs in length. Using only Consensus, search the files DNASample300.txt, DNASample1000.txt, and DNASample3000.txt with sequences of length 300, 1000, and 3000 base pairs respectively. At what point is Consensus no longer able to identify your regulatory motif?

7. Perform the same experiment with MITRA. Where does MITRA break?

8. Did you search the reverse strand for motif occurrences? Why?

Tool websites:

Consensus: http://ural.wustl.edu/~jhc1/consensus/

MITRA: http://fluff.cs.columbia.edu:8080/domain/mitra.html

Gibbs Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html

MEME: http://meme.sdsc.edu/meme/website/intro.html

Sequence Logo: http://www.bio.cam.ac.uk/seqlogo/

TRANSFAC: http://www.gene-regulation.com/

Part C: Experimental Verification of Found Motifs

Describe a biological experiment to validate your hypothesis. How many hours would you estimate the experiment requires?

Recent Developments in Motif Finding

Part A: Chromatin Immunoprecipitation

A popular experimental technique to confirm motif binding and determine protein-DNA interaction is chromatin immunoprecipitation (ChIP). A high-throughput variant of ChIP, ChIP on chip, was developed by Iyer et al. (2001) and Ren et al. (2000), and is reviewed by Nal et al. (2001).

1. Briefly describe the methodology for ChIP.

2. Describe, in general, the modifications of ChIP on chip to the standard ChIP protocol.

3. How could you modify ChIP on chip to detect binding of protein complexes, as opposed to a single protein?

4. Reformulate the Motif Finding Problem for ChIP on chip experiments. What are the differences?

Part B: The assumption of independence

A simplifying assumption for motif finders is that nucleotide positions are independent. Several groups have developed approaches that do not require that independence. Refer to Barash et al (2003) and Keich et al (2002) for computational approaches to handle dependencies.

1. Consider the consensus and profile representations for a motif. How could you modify them to account for dependence?

2. Give a rigorous formulation of the Motif Finding Problem that waives the independence assumption. What is the objective function you want to optimize?

3. Imagine you are given the crystal structure of a DNA binding protein bound to its binding site. Would you expect the structure to provide information regarding dependencies?

Suggested Reading


Bailey T. and Elkan C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36. http://citeseer.nj.nec.com/bailey94fitting.html

Barash Y., Elidan G., Friedman N., and Kaplan T. (2003) Modeling Dependencies in Protein-DNA Binding Sites. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, 28-37.

Eskin E. and Pevzner P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18, S354-63.

Hertz G.Z. and Stormo G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563-77.

Iyer V.R., Horak C.E., Scafe C.S., Botstein D., Snyder M., and Brown P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533-8.

Keich U. and Pevzner P.A. (2002) Finding motifs in the twilight zone. Bioinformatics, 18, 1374-81.

Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., and Wootton J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-14. (available via JSTOR)

Mandel-Gutfreund Y., Baron A., and Margalit H. (2001) A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac Symp Biocomput., 139-50.

Matys V., Fricke E., Geffers R., Gossling E., Haubrock M., Hehl R., Hornischer K., Karas D., Kel A.E., Kel-Margoulis O.V., Kloos D.U., Land S., Lewicki-Potapov B., Michael H., Munch R., Reuter I., Rotert S., Saxel H., Scheer M., Thiele S., and Wingender E. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31, 374-8.

Nal B., Mohr E., and Ferrier P. (2001) Location analysis of DNA-bound proteins at the whole-genome level: untangling transcriptional regulatory networks. Bioessays, 23, 473-6.

Orlando V. (2000) Mapping chromosomal proteins in vivo by formaldehyde-crosslinked-chromatin immunoprecipitation. Trends Biochem Sci., 25, 99-104.

Ren B., Robert F., Wyrick J.J., Aparicio O., Jennings E.G., Simon I., Zeitlinger J., Schreiber J., Hannett N., Kanin E., Volkert T.L., Wilson C.J., Bell S.P., and Young R.A. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306-9.

Schneider T.D. and Stephens R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18, 6097-100. http://citeseer.nj.nec.com/schneider90sequence.html

Stormo G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 15-23.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12376382&dopt=Abstract

http://citeseer.nj.nec.com/560686.html

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=10694875&dopt=Abstract

Lab 10 Hidden Markov Models

This lab is meant to familiarize you with hidden Markov models. The lab is taken from the textbook site http://www.bioalgorithms.info/problems/06_asmt.htm, reproduced here for your convenience.

Using the HMM approach, you will now revisit the dishonest casino mentioned in the class handout. With a computer implementation of the Viterbi algorithm, we can now decode longer sequences of coin tossing.

Fortunately, the umdhmm package, a set of C programs that implement the Viterbi and forward-backward algorithms, is available from http://www.cfar.umd.edu/~kanungo/software/software.html. You can uncompress and compile it under Linux/Unix/Windows/Macintosh. The README file in the package explains how to specify an HMM and how to run the standard HMM algorithms. There is also a tutorial (hmmtut.pdf) in the zip file for additional background information on HMMs.
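To see what Viterbi decoding does before running umdhmm, here is a small stand-alone Python sketch. It is independent of umdhmm and is not its API. The parameter values (fair coin F, biased coin B showing heads 3/4 of the time, a switching probability of 1/10, and equal start probabilities) are assumptions chosen for illustration; replace them with the values given in Figure 11.1.

```python
from math import log

# Assumed example parameters; substitute the values from Figure 11.1.
STATES = "FB"                                   # F = fair coin, B = biased coin
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1},
         "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5},
        "B": {"H": 0.75, "T": 0.25}}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (most probable state path, its log probability) for `obs`."""
    # score[i][s]: best log probability of any state path ending in s at position i.
    score = [{s: log(start[s]) + log(emit[s][obs[0]]) for s in states}]
    back = []                                   # back[i][s]: best predecessor of s
    for x in obs[1:]:
        prev, col, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            col[s] = prev[best] + log(trans[best][s]) + log(emit[s][x])
            ptr[s] = best
        score.append(col)
        back.append(ptr)

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), score[-1][last]


if __name__ == "__main__":
    print(viterbi("HHHHHTTTTT"))
```

The `score` list corresponds to the dynamic programming table with one row per state and one column per toss. Working in log space avoids numerical underflow on long toss sequences; passing a different `trans` dictionary lets you compare transition settings.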

1. Using the Hidden Markov Model (HMM) in Figure 11.1 in the handout (also shown above, representing a dishonest casino), decode the following sequence of coin tosses, i.e., compute the most probable sequence of states that generates it. You should fill in a 2-by-10 dynamic programming table.

HHHHHTTTTT

What if the transition probabilities from F to B and from B to F were 3/10? How would you interpret these two results? (You may want to use Excel to speed up your calculations.)

2. With the HMM in Figure 11.1, what is the probability that the "T" at the seventh position is generated by the biased coin?

3. Compare the two probabilistic models we have learned: profile HMMs and PSSMs. Which one is more general? Can you devise a profile HMM to emulate a PSSM? Can a PSSM emulate any profile HMM?

4. Write a text file specifying the HMM in the dishonest casino, Figure 11.1, following the format explained in the umdhmm README file so that your HMM is understandable by the programs. Next, use the HMM to find the most probable sequences of hidden states that generate the following sequences of coin tosses:

o HHHHHTTTTT

o HHHHHHTTTT

o HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH

o THHHHHHHHHHHHTTHTTHTHTHTHHTTHHHHHHHHHHHHHHHHH

5. We will use the umdhmm package to help us identify CG-islands in a genomic sequence.

The dinucleotide transition probabilities in CG-islands differ from those in non-CG-islands. The following transition probability tables (page 50 in Durbin et al's book) are obtained based on the statistics of annotated genomic sequences:

Transition probabilities inside a CG-island (rows: current nucleotide, columns: next nucleotide).

+    A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

Transition probabilities outside a CG-island (rows: current nucleotide, columns: next nucleotide).

-    A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  0.292

Your HMM would have a group of 4 states A+, C+, G+, and T+, which emit A, C, G, and T respectively in CG-islands, and a group of another 4 states A-, C-, G-, and T-, which emit the corresponding symbols in normal genomic regions. The above 2 tables specify the transition probabilities within each group. Now it is your task to design the transition probabilities between the states across groups.

o Describe how you want to design the transition probabilities between the states across groups.

o Turn your design into the format understandable by the umdhmm programs. You may use Microsoft Excel to ease the pain of organizing tables as large as 8-by-8.

o Use your HMM to search the following stretch of genomic sequence, chr22_10k.fa (http://bioinf.ucsd.edu/~cbenner/be202/hwk6/chr22_10k.fa), for CG-islands. The number of CG-islands can vary with different parameter settings for probabilities of switching between CG-islands and non-CG-islands. With reasonable parameter settings 4-5 CG-islands will be found.

o Your colleague asks why you don't build an intuitive HMM that consists of just 2 states, one emitting symbols in CG-islands and the other emitting symbols in non-CG-islands. How do you answer this question?
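One possible way to assemble the 8-by-8 transition table is sketched below in Python. It is only one design among many and is not prescribed by the assignment: each row of a within-group table is scaled by the probability of staying in the group, and the remaining probability mass is spread over the other group in proportion to that group's own table. The names `build_transitions`, `p_leave_island`, and `p_enter_island` are hypothetical, and the two switching probabilities are free parameters you would have to choose and justify.

```python
BASES = "ACGT"

# Within-group tables from the handout (rows: current base, columns: next base).
PLUS = {"A": [0.180, 0.274, 0.426, 0.120],
        "C": [0.171, 0.368, 0.274, 0.188],
        "G": [0.161, 0.339, 0.375, 0.125],
        "T": [0.079, 0.355, 0.384, 0.182]}
MINUS = {"A": [0.300, 0.205, 0.285, 0.210],
         "C": [0.322, 0.298, 0.078, 0.302],
         "G": [0.248, 0.246, 0.298, 0.208],
         "T": [0.177, 0.239, 0.292, 0.292]}


def build_transitions(p_leave_island, p_enter_island):
    """Return (states, trans) for the 8 states A+, C+, G+, T+, A-, C-, G-, T-.

    Assumed design: stay in the current group with probability 1 - p_switch and
    follow that group's table; otherwise switch and follow the other group's table.
    """
    states = [b + "+" for b in BASES] + [b + "-" for b in BASES]
    trans = {s: {} for s in states}
    for a in BASES:
        # Renormalise each row so that rounding in the printed tables cannot
        # make a row sum differ from 1.
        plus_row = [v / sum(PLUS[a]) for v in PLUS[a]]
        minus_row = [v / sum(MINUS[a]) for v in MINUS[a]]
        for j, b in enumerate(BASES):
            trans[a + "+"][b + "+"] = (1 - p_leave_island) * plus_row[j]
            trans[a + "+"][b + "-"] = p_leave_island * minus_row[j]
            trans[a + "-"][b + "-"] = (1 - p_enter_island) * minus_row[j]
            trans[a + "-"][b + "+"] = p_enter_island * plus_row[j]
    return states, trans


if __name__ == "__main__":
    states, trans = build_transitions(0.01, 0.001)   # illustrative values only
    for s in states:
        print(s, " ".join(f"{trans[s][t]:.5f}" for t in states))
```

Printing the rows in a fixed state order makes it easy to paste the matrix into the umdhmm model file once you have checked the format in its README. Every row sums to 1 by construction, which is worth verifying for any alternative design as well.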

Multiple sequence based search

In the C. elegans (worm) genome, several large paralogous gene families that were first thought to be nematode specific have since been classified as putative G-protein coupled receptors (GPCRs). Detecting similarity between these nematode sequences and known GPCRs in other organisms is a nontrivial sequence analysis task. Here we arbitrarily choose the putative GPCR gene sra-4 (Wormpep AH6.8; SWISS-PROT Q09206; 329 aa) as an example. The task is to find a significant similarity between AH6.8 and a protein of known function in another organism. (Please note the Resources section after the Procedures which has links and information about how to use the tools necessary to complete this part)

1. Obtaining the sequence. Go to SwissProt and retrieve the amino acid sequence for Q09206. Be sure to read the annotations of the sequence before you do any further searches. Is there any experimental evidence that this protein is a GPCR?

2. Initial BLAST: To quickly find any proteins similar to AH6.8, you can run a BLASTP search against the nr database using the WWW BLAST at NCBI. How many hits do you find with a significant E-value (<0.01)? How many among them are non-worm genes? What do you think of the significance of these non-worm hits? What conclusions can you draw from this initial BLAST result?

3. Sequence gathering: The first step for further analysis is to more carefully define a nonredundant set of sequences that belong to the same family as AH6.8. You could have used the collection of significant hits returned from your BLASTP search. To be more careful, you want to use the Wormpep database, an authoritative nonredundant source of nematode predicted protein sequences. BLAST AH6.8 against Wormpep. How many hits are significant (this time you want to be more careful, so you use the E-value cutoff of 10^-6)? As a crude protection against erroneous computational gene predictions, you want to exclude sequences that are too long or too short. How many sequences are shorter than 200 aa or longer than 500 aa? Removing these sequences, you can proceed to the next step with a clean set of sequences.

4. Multiple sequence alignment. The next step is to produce a multiple alignment. You will use ClustalW, a popular program for multiple alignment. You can take a quick look at the alignment in the graphical display of ClustalX (part of the ClustalW package), just to make sure the result makes sense. Save your result as worm.aln and proceed to the next step.

5. Profile searches. Construct a profile HMM of the multiple alignment, and search the sequence database with it. You will use the HMMER program.

The hmmbuild command builds a profile "worm.hmm" from the alignment, taking a few seconds. The hmmcalibrate command automatically estimates some parameters needed for calculating accurate E-values in database searches, taking several minutes. The hmmsearch command searches Swissprot using the profile, which can take several hours. The output is a ranked list of hits, giving E-values. Go through your hmmsearch result. Can you find any non-worm GPCRs (with significant E-values)? The E-value in hmmsearch is different from that in the BLAST search: usually an E-value of 0.05 is already a marginal but significant result.

You can also run all the above programs in the www version of HMMER. But this is not encouraged because the server can only process 1-2 external jobs at one time.

6. PSI-BLAST. As a comparison, also run PSI-BLAST to search AH6.8 against nr. Do you find non-worm GPCR genes? Do you find any non-GPCR genes in your result; what are they? Can you explain why PSI-BLAST includes genes of different functions? Do you think this is an artifact of PSI-BLAST or do these non-GPCR genes suggest an alternative function of AH6.8?
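The length filter in step 3 (dropping sequences shorter than 200 aa or longer than 500 aa) is easy to script. The Python sketch below assumes the significant hits have already been saved as FASTA text; the function names `read_fasta` and `filter_by_length` and the file name in the usage example are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=200, max_len=500):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


def write_fasta(records, width=60):
    """Format records as FASTA text with sequences wrapped at `width` characters."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # "wormpep_hits.fa" is a placeholder name for your saved BLAST hits.
    with open("wormpep_hits.fa") as handle:
        records = list(read_fasta(handle))
    kept = filter_by_length(records)
    print(len(records) - len(kept), "sequences removed;", len(kept), "kept")
```

Comparing the record count before and after filtering answers the question of how many sequences fall outside the length range.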

Resources

1. HMMER: Sean Eddy's HMMER website http://hmmer.wustl.edu/. If you don't have an account on bioinf, you can download it to your local machine and run it.

2. Web version of HMMER at http://bioweb.pasteur.fr/seqanal/motif/hmmer-uk.html. Easy to use, but it can be very slow, so start your search early.

3. CLUSTALW: available at http://www.ebi.ac.uk/clustalw/. SDSC biology workbench (http://workbench.sdsc.edu/) also has it.

4. SwissProt: http://us.expasy.org/sprot/. A local copy of SwissProt on bioinf can be found at /software/bioinf/BLAST/data/swissprot.

5. BLASTP and PSI-BLAST are available at http://www.ncbi.nlm.nih.gov/BLAST/

6. Wormpep: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/

Suggested Reading

Chapters 3-6. Durbin et al. Biological Sequence Analysis. 1998.

Lab 11 (Extra Credit) Sorting by Reversals

Write a program to implement the SimpleReversalSort in Pevzner Section 5.2, p. 129. Your main() program should ask the user to enter file names for both the input and output files. It then opens the files, reads the numbers into an array and calls the function SimpleReversalSort. For the first test hand-code a sample data file consisting of the numbers 1 .. 10 permuted in some way. For the second test create a data file of 10,000 entries. Simply create an array with the numbers 1 .. 10000 in order. Then randomly choose pairs of indices in the range 0 .. 9999 and swap the array values for those indices. Do this at least 2000 times to get a fairly random permutation.

We are interested in counting the number of reversals required to do the sort, for example to measure the evolutionary distance between two species. The simple reversal sort does not yield the minimum but at least provides an upper bound for it. Your program should count the number of reversals made.
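The core of this lab, without the file handling, might look like the following Python sketch. It is meant to show the idea only (the lab itself asks for C++ or Java plus file input and output): at step i, the value that belongs at position i is located and brought into place with a single reversal. The names `simple_reversal_sort` and `random_permutation` are hypothetical.

```python
import random


def simple_reversal_sort(perm):
    """Sort a permutation of 1..n by reversals; return (sorted list, reversal count)."""
    pi = list(perm)
    reversals = 0
    for i in range(len(pi) - 1):
        # Positions before i already hold 1..i, so value i+1 sits at index >= i.
        j = pi.index(i + 1, i)
        if j != i:
            pi[i:j + 1] = pi[i:j + 1][::-1]   # reversal rho(i, j)
            reversals += 1
    return pi, reversals


def random_permutation(n, swaps, seed=None):
    """Start from 1..n and swap `swaps` randomly chosen pairs of positions."""
    rng = random.Random(seed)
    a = list(range(1, n + 1))
    for _ in range(swaps):
        i, j = rng.randrange(n), rng.randrange(n)
        a[i], a[j] = a[j], a[i]
    return a


if __name__ == "__main__":
    data = random_permutation(10_000, 2_000, seed=602)
    result, count = simple_reversal_sort(data)
    print("sorted:", result == sorted(data), "reversals:", count)
```

Because each position needs at most one reversal, the count can never exceed n-1, which is the upper bound mentioned above.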

Lab 11B (Extra Credit) Improved Breakpoint Reversal Sort

Repeat Lab 11 using the ImprovedBreakpointReversalSort in Pevzner 5.4, p. 135. Note that the example on the bottom of p. 134 is a bit misleading. You should work this out yourself by hand before trying to program it.
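For checking hand-worked examples, a brute-force Python sketch of the breakpoint-based approach is given below. It follows the usual textbook outline: extend the permutation with 0 and n+1, and while breakpoints remain, either pick the reversal that leaves the fewest breakpoints (if a decreasing strip exists) or flip an increasing strip. This version tries every possible reversal at each step, so it is only practical for small inputs and is not a substitute for an efficient implementation. All function names are hypothetical, and single-element strips are treated as decreasing unless they contain 0 or n+1.

```python
def breakpoints(ext):
    """Number of adjacent pairs in the extended permutation that are not consecutive."""
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def strips(ext):
    """Maximal runs without internal breakpoints, as (start, end) index pairs."""
    out, start = [], 0
    for i in range(1, len(ext)):
        if abs(ext[i] - ext[i - 1]) != 1:
            out.append((start, i - 1))
            start = i
    out.append((start, len(ext) - 1))
    return out


def is_decreasing(ext, s, e):
    """Strips containing 0 or n+1 count as increasing; other singletons as decreasing."""
    if s == 0 or e == len(ext) - 1:
        return False
    return s == e or ext[s] > ext[e]


def reverse(ext, i, j):
    """Return a copy of ext with positions i..j (inclusive) reversed."""
    return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]


def improved_breakpoint_reversal_sort(perm):
    """Return (sorted list, number of reversals used)."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    count = 0
    while breakpoints(ext) > 0:
        current = strips(ext)
        if any(is_decreasing(ext, s, e) for s, e in current):
            # Brute force: try every reversal and keep one leaving fewest breakpoints.
            candidates = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
            i, j = min(candidates, key=lambda r: breakpoints(reverse(ext, r[0], r[1])))
        else:
            # No decreasing strip: flip an increasing strip that avoids both ends.
            i, j = next((s, e) for s, e in current if s > 0 and e < n + 1)
        ext = reverse(ext, i, j)
        count += 1
    return ext[1:-1], count


if __name__ == "__main__":
    print(improved_breakpoint_reversal_sort([6, 1, 2, 3, 4, 5]))
```

Comparing the reversal counts of this routine and the simple reversal sort on the same small permutation is a quick way to see the improvement.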

Lab 11C (Extra Credit) Sorting by translocations via reversals

1) Read the articles by M. Ozery-Flato and R. Shamir:

M. Ozery-Flato and R. Shamir. Sorting by translocations via reversals theory. Proc. 4th RECOMB Satellite on Comparative Genomics, Lecture Notes in Computer Science Vol. 4205, pp. 87-98, Springer, Berlin (2006). http://www.math.tau.ac.il/%7Ershamir/papers/srt_recomb_sat.pdf

M. Ozery-Flato and R. Shamir. An O(n^(3/2) √log(n)) algorithm for sorting by reciprocal translocations. Proceedings of CPM 2006, LNCS Vol. 4009, pp. 258-269 (2006). http://www.math.tau.ac.il/%7Ershamir/papers/srt_cpm06.pdf

2) See if you can write a program to carry out the translocation algorithm.
