bioinformatics made easy

Introduction to Bioinformatics

Lecture 1: Overview of Bioinformatics and Molecular Biology

What is Bioinformatics?

Defining the terms bioinformatics and computational biology is not an easy task, as the many definitions available on the web show. A recent Google search for "definition of bioinformatics" returned over 43,000 results! In the past few years, as both areas have grown, confusion over the two terms has grown with them. For some, the terms bioinformatics and computational biology have become completely interchangeable, while for others there is a great distinction. I'll throw in my two cents, based on what I have seen of the consensus use of these two terms.

Computational biology and bioinformatics are multidisciplinary fields, involving researchers from different areas of specialty, including (but by no means limited to) statistics, computer science, physics, biochemistry, genetics, molecular biology and mathematics. The goals of these two fields are as follows:

• Bioinformatics: Typically refers to the field concerned with the collection and storage of biological information. All matters concerned with biological databases are considered bioinformatics.

• Computational biology: Refers to the aspect of developing algorithms and statistical models necessary to analyze biological data through the aid of computers.

In this respect, my understanding of bioinformatics and computational biology follows the NIH definitions listed below:

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Others have offered various opinions into these definitions as well: http://kbrin.kwing.louisville.edu/~rouchka/definition.html

Image Source: http://ccb.wustl.edu/

Bioinformatics = Hot Field

Smart Money: #1 among next hot jobs http://smartmoney.com/consumer/index.cfm?story=working-june02

Business Week: Among 50 Masters of Innovation http://www.businessweek.com/bw50/content/mar2001/bf20010323_198.htm

So why is bioinformatics a hot field? One answer to this question is that it is tied to the human genome project, which has generated a lot of popular interest. Various advances in molecular biology techniques (such as genome sequencing and microarrays) have led to a large amount of data that needs to be analyzed. Now that we are close to having the human genome finished, what does it all mean? That’s where bioinformatics steps in. Bioinformatics can lead to important discoveries as well as help companies save time and money in the long run. In addition, there need to be methods to manage large amounts of data.

One of the biggest reasons for bioinformatics being a hot field is the old supply and demand adage. There are simply too few people adequately trained in both biology and computer science to solve the problems that biologists need to have solved.

Introduction to Molecular Biology

(For a good overview of this topic, please read: http://www.ebi.ac.uk/microarray/biology_intro.html)

In order to be a good computational biologist, it is important to understand the terminology and basic processes behind the biological problems. Many interesting problems arise out of sequence analysis. There are two different types of biological sequences studied in this class: nucleic acid sequences (DNA/RNA) and amino acid sequences. But first, let’s make sure the basics are covered.

Cells

Every organism is made up of tiny structures called cells. Often these cells are too small to be seen with the naked eye. Each cell is in itself a complex system enclosed in a membrane. Some organisms, such as bacteria and baker’s yeast, are composed of only a single cell (i.e. they are unicellular). Other organisms are made up of many different cells (i.e. they are multicellular). For instance, the human body is composed of around 60 trillion cells. Humans have about 320 different cell types, each having a different type of function or structural property.

Structure of an animal cell. Image source: www.ebi.ac.uk/microarray/biology_intro.htm

There are two types of organisms: eukaryotes and prokaryotes. Eukaryotes (or as Bruce Roe from the University of Oklahoma calls them, the “You and I” Karyotes) represent most of the organisms which we can see, including plants and animals. Prokaryotes (such as bacteria) are smaller than eukaryotic cells and have a simpler structure. Prokaryotes are single-celled organisms (but not all single-celled organisms are prokaryotes!).

So what is the difference between the two types of cells? A eukaryotic cell has a nucleus, which is separated from the rest of the cell by a membrane. Inside the nucleus are the chromosomes, where all of the genetic information for the organism is stored. In addition, eukaryotic cells contain organelles with various functions, many of them membrane-bound, including centrioles, lysosomes, mitochondria, ribosomes, etc.

Each chromosome in the nucleus is a long double-stranded DNA molecule. For humans, there are 22 pairs of autosomes, as well as one pair of sex chromosomes. One copy of each pair is inherited from each parent.

Karyotype showing the 23 pairs of human chromosomes. Image source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html

Image source: www.biotec.or.th/Genome/ whatGenome.html

DNA

Deoxyribonucleic Acid (DNA) is the molecule that encodes the information of life. A single-stranded DNA molecule, called a polynucleotide or oligomer, is a chain of small molecules called nucleotides. There are four different nucleotides, distinguished by their bases: adenine (A), cytosine (C), guanine (G) and thymine (T). The bases can be separated into two different types: purines (A and G) and pyrimidines (C and T). The difference between purines and pyrimidines is in the base structure. By stringing together a simple alphabet of four characters we can get enough information to create a complex organism!

Different nucleotides can be strung together to form a polynucleotide. However, the two ends of the polynucleotide are different, meaning that each polynucleotide sequence has a directionality. The ends of the polynucleotide are marked either 3’ or 5’. The general convention is to label the coding strand from 5’ to 3’ (left to right).

For instance, the following is a polynucleotide:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

DNA can be either single-stranded or double-stranded. When DNA is double-stranded, the second strand is referred to as the reverse complement strand. This name is derived from the fact that the second strand runs in the opposite direction to the first, and the fact that the bases in the second strand are complementary to the bases in the first. Complementary bases are determined by which pairs of nucleotides can form bonds between them. In the case of DNA, A binds to T, and C binds to G. For the polynucleotide given above, the double-stranded polynucleotide is as follows:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’
   | | | | | | | | | | | | | | | |
3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’

Two complementary polynucleotide chains form a stable structure known as the DNA double helix. This spring represents the 50th anniversary of the discovery of the double helix structure of DNA by Watson, Crick and Franklin.
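The reverse complement relationship described above can be sketched in a few lines of Python. This is only an illustration: the names `COMPLEMENT` and `reverse_complement` are made up for this example and do not come from any particular library.

```python
# Complementary DNA bases: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the second strand of the double helix, written 5' to 3'."""
    # Walk the input strand backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


top = "GTAAAGTCCCGTTAGC"           # 5' -> 3', the example polynucleotide
bottom = reverse_complement(top)   # second strand, also written 5' -> 3'
print(bottom)        # GCTAACGGGACTTTAC
print(bottom[::-1])  # CATTTCAGGGCAATCG (the 3' -> 5' reading drawn under the first strand)
```

Reading the result right to left gives the lower strand exactly as it is drawn in the double-stranded example.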

DNA double helix structure.

Image source: www.genecrc.org/site/lc/lc2b.htm

Note that in this image, there appear to be two types of grooves: a larger one, which is called the major groove, and a smaller one, known as the minor groove. In addition, there are roughly 10.5 base pairs in one complete turn of the helix.

RNA

Ribonucleic Acid (RNA) is similar to DNA in that it is constructed from nucleotides. However, instead of thymine (T), an alternative base, uracil (U), is found in RNA. RNA can be found as double-stranded or single-stranded, and can also be part of a hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is generally found as a single-stranded molecule that may form secondary or tertiary structures due to complementary bases between parts of the same strand. RNA folding will be discussed in detail during a later class period.

RNA is important in the cell and contributes in a variety of ways. One of the most important roles of RNA is in protein synthesis. Two of the major RNA molecules involved in protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA).

Secondary structure for E. coli RNase P RNA. Image source: www.mbio.ncsu.edu/JWB/MB409/lecture/lecture05/lecture05.htm

mRNA

mRNA encodes the genetic information as copied from the DNA molecules. Transcription is the process in which DNA is copied into an RNA molecule. The resulting linear molecule is an mRNA transcript. In eukaryotic cells, before the mRNA can be translated into a protein, it needs to be modified. Most eukaryotic genes are organized in pieces, where coding regions, called exons, are interspersed with noncoding regions, called introns. One of the steps in processing the mRNA is to remove the intronic regions and to splice together the coding, or exonic, regions. The processed mRNA can then be transported from the nucleus and translated into a protein sequence.

mRNA processing. Image source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm

tRNA

tRNA molecules fold into a well-defined three-dimensional structure which is critical in the creation of proteins. Attached to each tRNA molecule is an amino acid (which will be discussed momentarily). The amino acid to be attached is determined by a three-base sequence called an anticodon, which is complementary to a three-base codon in the mRNA. Translation is the process in which the nucleotide base sequence of the processed mRNA is used to order and join the amino acids into a protein with the help of ribosomes and tRNA.

tRNA secondary structure. Image Source: http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm

tRNA tertiary structure. Image source: www.biology.ucsc.edu/people/areslab/BMB100A/11-26.html

Genetic Code

Since there are 4 possible bases (A, C, G, U) and 3 bases in a codon, there are 4 * 4 * 4 = 64 possible codon sequences. The codon AUG is used as a signal to initiate translation, while the codons UAA, UAG, and UGA are terminal codons signaling the end of translation. That leaves 61 codon sequences that can code for amino acids (AUG also codes for an amino acid). However, there are only 20 amino acids. Therefore the genetic code is redundant, meaning that a single amino acid can be coded for by several different codons.

                       Second Position of Codon

First        U             C             A             G         Third
Position                                                         Position

U        UUU Phe [F]   UCU Ser [S]   UAU Tyr [Y]   UGU Cys [C]      U
         UUC Phe [F]   UCC Ser [S]   UAC Tyr [Y]   UGC Cys [C]      C
         UUA Leu [L]   UCA Ser [S]   UAA STOP      UGA STOP         A
         UUG Leu [L]   UCG Ser [S]   UAG STOP      UGG Trp [W]      G

C        CUU Leu [L]   CCU Pro [P]   CAU His [H]   CGU Arg [R]      U
         CUC Leu [L]   CCC Pro [P]   CAC His [H]   CGC Arg [R]      C
         CUA Leu [L]   CCA Pro [P]   CAA Gln [Q]   CGA Arg [R]      A
         CUG Leu [L]   CCG Pro [P]   CAG Gln [Q]   CGG Arg [R]      G

A        AUU Ile [I]   ACU Thr [T]   AAU Asn [N]   AGU Ser [S]      U
         AUC Ile [I]   ACC Thr [T]   AAC Asn [N]   AGC Ser [S]      C
         AUA Ile [I]   ACA Thr [T]   AAA Lys [K]   AGA Arg [R]      A
         AUG Met [M]   ACG Thr [T]   AAG Lys [K]   AGG Arg [R]      G

G        GUU Val [V]   GCU Ala [A]   GAU Asp [D]   GGU Gly [G]      U
         GUC Val [V]   GCC Ala [A]   GAC Asp [D]   GGC Gly [G]      C
         GUA Val [V]   GCA Ala [A]   GAA Glu [E]   GGA Gly [G]      A
         GUG Val [V]   GCG Ala [A]   GAG Glu [E]   GGG Gly [G]      G

Genetic Code. The initiator codon is AUG (Met), and the terminal codons are marked STOP. In each entry, the first item is the triplet of bases, the second is the three-letter amino acid label, and the third is the one-letter amino acid label. Adapted from: http://psyche.uthct.edu/shaun/SBlack/geneticd.html

Amino Acids

Amino acids are the building blocks from which proteins are made. There are 20 different amino acids that vary from each other by their side chain groups. Amino acids can be classified into different groups based on their solubility in water. Hydrophilic amino acids are water soluble, while hydrophobic amino acids are not. This property becomes important when a protein sequence folds. Amino acids are linked to one another via a single chemical bond, called a peptide bond. A linear chain of amino acids can be referred to as a peptide (if it is short – less than 30 a.a. long) or polypeptide (which can be upwards of 4000 residues long).

One-letter   Three-letter   Full name

G            GLY            Glycine
A            ALA            Alanine
V            VAL            Valine
L            LEU            Leucine
I            ILE            Isoleucine
F            PHE            Phenylalanine
P            PRO            Proline
S            SER            Serine
T            THR            Threonine
C            CYS            Cysteine
M            MET            Methionine
W            TRP            Tryptophan
Y            TYR            Tyrosine
N            ASN            Asparagine
Q            GLN            Glutamine
D            ASP            Aspartic acid
E            GLU            Glutamic acid
K            LYS            Lysine
R            ARG            Arginine
H            HIS            Histidine

Amino Acid Codes.
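The genetic code table and the one-letter amino acid codes can be combined into a small lookup, sketched here in Python. The compact 64-character string lists the one-letter codes in the same U, C, A, G order as the table (with `*` standing for STOP). The names `CODON_TABLE` and `translate` are illustrative, and the rule "start at the first AUG, stop at the first terminal codon" is a simplification assumed for this example.

```python
from itertools import product

BASES = "UCAG"  # order of the first, second and third codon positions in the table
# One-letter amino acid codes in table order (UUU, UUC, UUA, UUG, UCU, ...).
# '*' marks the three terminal (STOP) codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every one of the 4 * 4 * 4 = 64 codons to its amino acid (or '*').
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first STOP codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiator codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


coding = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(CODON_TABLE), len(coding))   # 64 61
print(translate("GGAUGGCCUGGUAAGC"))   # MAW
```

The counts printed on the second-to-last line match the text: 64 possible codons, of which 61 code for the 20 amino acids.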

Proteins

Proteins are polypeptides that have a three-dimensional structure. They can be described through four different hierarchical levels:

• Primary structure – the sequence of amino acids constituting the polypeptide chain.

• Secondary structure – the local organization of the parts of the polypeptide chain into secondary structures such as α helices and β sheets.

• Tertiary structure – the three dimensional arrangements of the amino acids as they react to one another due to the polarity and resulting interactions between their side chains.

• Quaternary structure – if a protein consists of several protein subunits held together, then the protein can be described as well by the number and relative positions of the subunits.

Visualization of Protein Structures.

Magenta: alpha helix Gold: Beta Sheets

Blue: Monomer A Orange: Monomer B

Image source: http://www.ebi.ac.uk/microarray/biology_intro.html

Calculating the secondary and tertiary structure of a protein given its primary structure is not an easy task. Protein folding prediction will be covered at some point close to the end of the semester.

• Monomer – Any small molecule that can be linked with others of the same type to form a polymer. For the purpose of this class, the molecules could be nucleic acids, amino acids, or proteins.

• Dimer – Two small molecules of the same type linked together.

• Trimer – Three small molecules of the same type linked together.

• Oligomer – General term for a short polymer most commonly consisting of nucleic acids or amino acids.

• Polymer – Any large molecule consisting of multiple identical or similar subunits linked by covalent bonds.

Putting it all together, we get the flow of genetic information. That is, DNA directs the synthesis of RNA, and RNA then in turn directs the synthesis of protein. This flow of genetic information from nucleic acids to protein has been called the Central Dogma of Molecular Biology.

Central Dogma of Molecular Biology Image Source: http://www.people.virginia.edu/~rjh9u/dnaprot.html

DNA ↓

RNA ↓

PROTEIN

What is a Gene?

Aaah, the million dollar question. In short, a gene can be described as the physical and functional unit of heredity that carries information from one generation to the next. A gene can be thought of as the DNA sequence necessary for the synthesis of a functional protein or RNA molecule.

Genome, Transcriptome, Proteome

Whenever the term genome is used, it typically refers to the chromosomal DNA of an organism, or as far as sequencing is concerned, the euchromatic regions of the chromosomal DNA. The number of chromosomes and the genome size vary quite significantly from one organism to another. An example list of genome sizes is given below. Don’t be fooled by this table into thinking that the size of the genome or the number of genes determines the complexity of an organism. In fact, many plant genomes are much greater in size than the human genome!

ORGANISM                              CHROMOSOMES   GENOME SIZE     GENES

Homo sapiens (Humans)                 23            3,200,000,000   ~30,000
Mus musculus (Mouse)                  20            2,600,000,000   ~30,000
Drosophila melanogaster (Fruit Fly)   4             180,000,000     ~18,000
Saccharomyces cerevisiae (Yeast)      16            14,000,000      ~6,000
Zea mays (Corn)                       10            2,400,000,000   ???

The term transcriptome refers to the complete collection of all possible mRNAs (including splice variants) of an organism. This can be thought of as the regions of an organism’s genome that get transcribed into messenger RNA. In some cases, the transcriptome can be extended to include all transcribed elements, including non-coding RNAs used for structural and regulatory purposes.

The term proteome refers to the complete collection of proteins that can be produced by an organism. The proteome can be studied either as a static (sum of all proteins possible) or a dynamic (all proteins found at a specific time point) entity.

Molecular Biology Reference Books

• Lewin, B (1999), Genes VII (published by Oxford University Press) ISBN: 019879276X
• Lodish et al (1995), Molecular Cell Biology, 3rd edition (published by Scientific American Books, Freeman and Company, New York) ISBN 0 7167 2380 8
• Gonick, L & Wheelis, M (1991), The Cartoon Guide to Genetics (published by Harper Perennial, New York) ISBN 0 06 273099 1

Online tutorials

• The Learning Center: http://www.genecrc.org/site/lc/lc1a.htm
• On-Line Biology Book: http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
• EMBL-EBI Introduction to Biology: http://www.ebi.ac.uk/microarray/biology_intro.html
• One site you will be intimately familiar with by the end of the semester: http://www.ncbi.nlm.nih.gov

Reading assignment

• http://www.ebi.ac.uk/microarray/biology_intro.html
• Chapters 1 & 2 (Durbin, et al.)
• Chapters 1 & 3 (Mount)


Lecture 2: Pairwise Sequence Alignment

In molecular biology, a common question is whether or not two sequences are related. The most common way to tell whether or not they are related is to compare them to one another to see if they are similar. If we look at the English language, we note that two words that are spelled similarly may mean two completely different things, such as the words pear and tear.

Biological sequences that are similar (but not exact) provide useful information to help discover functional, structural, and evolutionary information. One common mistake is to describe two sequences as having some sort of homology or a percent homology based on their sequence similarity. This is a misuse of the biological term. Two sequences in different organisms are homologous if they have been derived from a common ancestor sequence. Two sequences may or may not be homologous regardless of their sequence similarity. However, the greater the sequence similarity, the greater chance there is that they share similar function and/or structure.

SEQUENCE SIMILARITY ≠ HOMOLOGY!

Biological Definitions for Related Sequences

• Homologs are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologs can be described as either orthologous or paralogous.

• Orthologs are similar sequences in two different organisms that have arisen due to a speciation event. Orthologs typically retain their functionality throughout evolution.

• Paralogs are similar sequences within a single organism that have arisen due to a gene duplication event.

• Xenologs are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Hamming or edit distance

One method of determining sequence similarity is to determine the edit distance between two sequences. If we take the example of pear and tear, how similar are these two words? We notice that if we change the p to a t, and keep the ear, then we can change pear to tear. Thus, there is a mismatch in the first letter, and matches in the last three. An alignment of these two is as follows:

    P E A R
      | | |
    T E A R

One way to score this alignment is to calculate the Hamming distance, which is the number of positions at which two words of equal length differ. In this example, the Hamming distance would be 1. The Hamming distance is calculated by summing up the number of mismatches when two words are aligned to one another.
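As a minimal sketch, the Hamming distance calculation can be written in Python as follows. The function name `hamming_distance` is chosen for this example, and the two words are assumed to have the same length.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs two words of equal length")
    # Compare the words position by position and count the mismatches.
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("PEAR", "TEAR"))  # 1
```

The length check is there because this measure has no notion of gaps; words of different lengths need the gapped alignments introduced next.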

With biological sequences, it is often necessary to align two sequences that are of different lengths, or that have regions that have been inserted or deleted over time. Thus, the notion of gaps needs to be introduced. Consider the words alignment and ligament. One alignment of these two words is as follows:

    A L I G N M E N T
      | | |   | | | |
    - L I G A M E N T

In this case, a gap is denoted in the alignment by a ‘-’ character. Now an alignment can produce one of the following: a match between two characters, a mismatch between two characters (also called a substitution or mutation), a gap in the first sequence (which can be thought of as the deletion of a character in the first sequence), or a gap in the second sequence (which can be thought of as the insertion of a character in the first sequence).

Consider the following two nucleic acid sequences: ACGGACT and ATCGGATCT. The following are two valid alignments:

    A - C - G G - A C T
    |   |   |       | |
    A T C G G A T - C T

    A T C G G A T C T
    |   | | |     | |
    A - C G G - A C T

Alignment scoring schemes

Which alignment is the better alignment? One way to judge this is to assign a positive score for each match, a negative score for each mismatch, and a negative score for each insertion/deletion (collectively referred to as indels). One scoring scheme might assign the following values:

    match:    +2
    mismatch: -1
    indel:    -2

Using this scoring scheme, the first alignment has 5 matches, 1 mismatch, and 4 indels. The score for this alignment is: 5 * 2 – 1(1) – 4(2) = 10 – 1 – 8 = 1. The second alignment has 6 matches, 1 mismatch, and 2 indels. The score for the second alignment is 6 * 2 – 1(1) – 2(2) = 12 – 1 – 4 = 7.
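The scoring of the two example alignments can be checked with a short Python sketch. The function name `score_alignment` and its parameter names are illustrative; the default values are the match, mismatch and indel scores of the scheme above, and each alignment is passed as two gapped rows of equal length.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 2, mismatch: int = -1, indel: int = -2) -> int:
    """Score a gapped alignment given as two rows of equal length ('-' is a gap)."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # a gap in either row counts as an indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(score_alignment("A-C-GG-ACT", "ATCGGAT-CT"))  # 1
print(score_alignment("ATCGGATCT", "A-CGG-ACT"))    # 7
```

Changing the three score parameters is an easy way to see how the choice of scoring scheme can change which alignment ranks higher.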

Therefore, using the above scoring scheme, the second alignment is the better alignment, since it produces a higher alignment score.

Visual Alignments -- Dot Plots

One of the more basic, yet important, techniques for determining the alignment between two sequences is a visual alignment known as a dot plot. Dot plots of sequence similarity are created using a matrix where the rows correspond to the characters in the first sequence and the columns correspond to the characters in the second sequence. The dot plot is created as follows: loop through each row. For the current row, take the character in that row and compare it to the character in each column. If they are equal, place a dot in the matrix. Continue until all cells in the matrix have been considered.
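The dot plot construction described above can be sketched in Python as a matrix of true/false values. The helper names `dot_plot` and `render` are made up for this example, and the text rendering with `*` and `.` is just one convenient way to look at the result without a plotting package.

```python
def dot_plot(seq1: str, seq2: str) -> list[list[bool]]:
    """Rows follow seq1, columns follow seq2; True marks a dot (equal characters)."""
    return [[a == b for b in seq2] for a in seq1]


def render(plot: list[list[bool]]) -> str:
    """Draw the matrix as text, one line per row."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)


seq = "ACCTGAGCTCACCTGAGTTA"
print(render(dot_plot(seq, seq)))
```

Comparing a sequence to itself always gives a full main diagonal; the other dots are a mix of real repeats and chance matches.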

Results for aligning ACCTGAGCTCACCTGAGTTA to itself using the Dot Matrix option of the AlignX feature of Informax’s Vector NTI program.

When a dot plot is created to compare nucleic acids, there will be a lot of noise, since one out of every four positions will match at random if there are equal numbers of A, C, G, and T in the sequence. Therefore, dot plots can be filtered for stringency by requiring that a certain percentage of nucleotides match in a given window size. With the example above, if we filter the sequences to only show matches of two or more consecutive nucleotides, the dot plot looks as follows:
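One way to sketch the stringency filter in Python is shown below. It implements only the simple rule used in the example (keep a dot if it lies on a diagonal run of at least `min_run` consecutive matches), not the more general percentage-in-a-window filter; the function name and parameter are assumptions made for this illustration.

```python
def filtered_dot_plot(seq1: str, seq2: str, min_run: int = 2) -> list[list[bool]]:
    """Keep only dots that belong to a diagonal run of at least min_run matches."""
    rows, cols = len(seq1), len(seq2)
    keep = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells in the middle of a run; each run is handled from its start.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            # Measure how far the run of matches extends down the diagonal.
            length = 0
            while (i + length < rows and j + length < cols
                   and seq1[i + length] == seq2[j + length]):
                length += 1
            if length >= min_run:
                for k in range(length):
                    keep[i + k][j + k] = True
    return keep


seq = "ACCTGAGCTCACCTGAGTTA"
plot = filtered_dot_plot(seq, seq)
print("\n".join("".join("*" if c else "." for c in row) for row in plot))
```

In the printed result, isolated single-base matches disappear while the main diagonal and the repeated ACCTGAG segment remain visible as diagonal lines.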

Information within Dot Plots

Dot plots are useful as a first-level filter for determining an alignment between two sequences. Regions of similarity will show up as diagonals within the dot plot matrix. Regions containing insertions/deletions can be readily determined. One potential application is to determine the number of coding regions (exons) contained within a processed mRNA.

Example of a Dot Plot showing insertion/deletion events.

Regions of genomic DNA can contain repetitive regions. For instance, approximately 50 percent of the human genome is composed of repetitive elements, which can be on the order of a few hundred bases (SINEs – Alu elements) or a few thousand (LINEs). In addition, regions of low complexity are present as well. Repetitive elements and methods to filter them out will be discussed during a later class period.

In addition to repetitive elements, regions of a genome can be duplicated. The duplicated region can be found either as a direct repeat, meaning that it occurs in the same direction, or as an inverted repeat, meaning that the sequence of the duplicated region is found in the reverse complement direction. Dot plots can readily show regions of direct and inverted repeats. Dot plots show all possible matches of residues between two sequences given a certain threshold level. Thus, the researcher can decide which alignments are the most significant.

Example dot plots showing the presence of direct and inverted repeats.

Dot plots can also be used to compare two different assemblies of the same sequence. Below are three dot plots of various chromosomes. The first shows two separate assemblies of human chromosome 5 compared against each other. The second shows one assembly of chromosome 5 compared against itself, indicating the presence of repetitive regions. The final dot plot shows chromosome Y compared against itself, indicating the presence of inverted repeats.

Comparison of two assemblies of chromosome 5. The figure to the left indicates the alignment of two separate assemblies, while the figure to the right indicates the alignment of a single assembly against itself.

Self plot of chromosome Y. Indicated are several regions of both direct and inverted repeats.

Available Dot Plot Packages

• Vector NTI software package (under AlignX)
• Dotlet Java applet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Dotter: http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• GCG software package:
  Compare http://www.hku.hk/bruhk/gcgdoc/compare.html
  DotPlot+ http://www.hku.hk/bruhk/gcgdoc/dotplot.html
• EMBOSS software package: Dotmatcher (http://www.hku.hk/bruhk/emboss/dotmatcher.html), Dotpath, Dotup
• DNA Strider
• PipMaker: http://bio.cse.psu.edu/pipmaker/ (returns a PDF of the alignment)

Overview of dot plot techniques: http://imagebeat.com/dotplot/overview.html

Dot Plot Articles

Gibbs & McIntyre, 1970
Gibbs, A. J. & McIntyre, G. A. (1970). The diagram method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

Staden, 1982
Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and amino-acid sequences. Nucl. Acids Res. 10 (9), 2951-2961.

The shortcoming of visual methods is that they do not yield a direct measure of the similarity between two sequences. In order to get a measure of sequence similarity, dynamic programming can be employed.

Finding an optimal alignment of two sequences

Suppose there are two sequences X and Z to be aligned, where |X| = m and |Z| = n. If gaps are allowed in the sequences, then the potential length of both the first and second sequences is m+n. Several methods will be discussed to align these sequences.

Brute Force Method

If we are interested in determining the optimal alignment (either global or local), then we note that there are $2^{m+n}$ subsequences with spaces for the sequence X, and $2^{m+n}$ subsequences with spaces for the sequence Z, using the power set rules. Thus, a brute force method of comparing these two sequences for the optimal alignment would require $2^{m+n} \cdot 2^{m+n} = 2^{2(m+n)} = 4^{m+n}$ comparisons. It doesn’t take long for this to be an impossible search!

Dynamic Programming

Luckily, sequence alignment has an optimal-substructure property, and therefore there is a much easier way to consider all of the possible alignments using what is called dynamic programming (DP). Dynamic programming techniques are used in many different aspects of computer science. DP algorithms solve optimization problems by dividing the problem into overlapping subproblems. Each subproblem is then solved only once, and the answer is stored in a table, thus avoiding the work of recomputing the solution.
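To get a feel for the numbers, the following Python sketch compares the brute force estimate of 4^(m+n) comparisons with the size of a dynamic programming table. The table size of (m+1) * (n+1) cells, one per pair of prefixes including the empty prefix, is an assumption stated here for illustration, and the function names are made up for this example.

```python
def brute_force_comparisons(m: int, n: int) -> int:
    """Brute force estimate from the text: 2^(m+n) * 2^(m+n) = 4^(m+n)."""
    return 4 ** (m + n)


def dp_table_cells(m: int, n: int) -> int:
    """Assumed DP table size: one cell per pair of prefixes, empty prefix included."""
    return (m + 1) * (n + 1)


m, n = 11, 7  # lengths of two short example sequences
print(brute_force_comparisons(m, n))  # 68719476736
print(dp_table_cells(m, n))           # 96
```

Even for sequences this short, the brute force count is in the tens of billions, while the table has fewer than a hundred cells.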

With sequence alignment, the subproblems can be thought of as the alignment of the “prefixes” of the two sequences up to a certain point. Therefore, a dynamic programming matrix is computed. The optimal alignment score for any particular point in the matrix is built upon the optimal alignment that has been computed up to that point.

Dynamic programming techniques align two sequences by beginning at the ends of the two sequences and attempting to align all possible pairs of characters (one from each sequence) using a scoring scheme for matches, mismatches, and gaps. The highest set of scores defines the optimal alignment between the two sequences. We will first consider dynamic programming in terms of DNA, where only exact matches are considered for a match score. Later we will discuss how substitution matrices can be used to score amino acid matches and mismatches.

Dynamic programming approaches are guaranteed to provide the optimal alignment given a particular scoring scheme. For large sequences, dynamic programming can be slow and memory intensive. Discuss the time and space necessary for microarray analysis.

Setting up the Dynamic Programming Matrix

Now we are ready to start creating the dynamic programming matrix. The first step is to place one of the sequences across the columns of the matrix, and the other sequence across the rows. Note that an alignment can also begin with a gap in one of the sequences, so that has to be taken care of as well.

Let’s assume that we want to align the sequence GAATTCAGTTA to GGATCGA. The length of the first sequence is 11 residues, and the length of the second is 7. Since it is possible to begin an alignment with a gap, the size of the matrix should be 8 x 12. Row 0 and column 0 will represent gaps. Rows 1-7 will be labeled with the corresponding residue of the sequence GGATCGA, while columns 1-11 will be labeled with the corresponding residue of the sequence GAATTCAGTTA. The initial matrix, S, is as follows:

       -    G    A    A    T    T    C    A    G    T    T    A
  -
  G
  G
  A
  T
  C
  G
  A

Now we need to decide upon the scoring scheme to be used. This requires parameters for a match score, a mismatch score, and a gap score. The match and mismatch scores will be combined into a single match/mismatch score, $s(a_i, b_j)$. We’ll see later how this can be used with a substitution matrix. There will also be a single linear gap penalty score, w. For our first example, we have the following parameters:

Sequence #1: GAATTCAGTTA; M = 11
Sequence #2: GGATCGA; N = 7

• $s(a_i, b_j) = +5$ if $a_i = b_j$ (match score)
• $s(a_i, b_j) = -3$ if $a_i \neq b_j$ (mismatch score)
• $w = -4$ (gap penalty)

Three steps in dynamic programming

Once you have the scoring functions set and the sequences to align, there are three steps involved in calculating the optimal scoring alignment. The methods used to finish these three steps are dependent upon whether global or local sequence alignment is desired. The three steps are as follows:

• Initialization
• Matrix Fill (scoring)
• Traceback (alignment)

Global Alignment: Needleman-Wunsch Algorithm

In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment.

Initialization Step. In the initialization step of global alignment, each cell $S_{i,0}$ in the first column is set to w * i. In addition, each cell $S_{0,j}$ in the first row is set to w * j. Remember that w is the gap penalty. Using the scoring scheme described above, the initialization step results in the following:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4
G  -8
A -12
T -16
C -20
G -24
A -28
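As a minimal sketch of the initialization step, the first row and column can be filled as multiples of the gap penalty. The function name init_global and the list-of-lists representation are our own choices for illustration.

```python
def init_global(n_rows, n_cols, w):
    """Return an (n_rows+1) x (n_cols+1) matrix with the gap row and column set."""
    S = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(n_rows + 1):
        S[i][0] = w * i  # column 0: leading gaps against the row sequence
    for j in range(n_cols + 1):
        S[0][j] = w * j  # row 0: leading gaps against the column sequence
    return S

# 7 rows for GGATCGA, 11 columns for GAATTCAGTTA, gap penalty w = -4
S = init_global(7, 11, -4)
```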

Matrix Fill Step. One possible solution of the matrix fill step finds the maximum global alignment score by starting in the upper left hand corner in the matrix and finding the maximal score Si,j for each position in the matrix. In order to find Si,j for any i,j it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to i,j. In terms of matrix positions, it is necessary to know Si-1,j, Si,j-1 and Si-1,j-1.

For each position, Si,j is defined to be the maximum score at position i,j; i.e.

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2)]

Note that in the example, Si-1,j-1 will be red, Si,j-1 will be green and Si-1,j will be blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, s(a1,b1) = 5, and by the assumptions stated earlier, w = -4. Thus, S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8] = 5.

A value of 5 is then placed in position 1,1 of the scoring matrix. Note that there is also an arrow placed back into the cell that resulted in the maximum score, S0,0.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5
G  -8
A -12
T -16
C -20
G -24
A -28

Now we proceed to S1,2. Since a1 = G and b2 = A, there is a mismatch. Therefore, s(a1,b2) = -3 and by the assumptions stated earlier, w = -4. Thus, S1,2 = MAX[S0,1 - 3, S1,1 - 4, S0,2 - 4] = MAX[-4 - 3, 5 - 4, -8 - 4] = MAX[-7, 1, -12] = 1. An arrow is placed back into the cell that resulted in the maximum score, which is the cell S1,1.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1
G  -8
A -12
T -16
C -20
G -24
A -28

We can proceed to fill in the rest of the first row in a similar fashion, resulting in the following matrix:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1  -3  -7 -11 -15 -19 -23 -27 -31 -35
G  -8
A -12
T -16
C -20
G -24
A -28

Now we can start to fill in the second row, beginning with S2,1. Note that a2 = G and b1 = G, so s(a2,b1) = 5 and by the assumptions stated earlier, w = -4. Thus, S2,1 = MAX[S1,0 + 5, S2,0 - 4, S1,1 - 4] = MAX[-4 + 5, -8 - 4, 5 - 4] = MAX[1, -12, 1] = 1. Note that in this case, there are two possible paths to the maximum value. Therefore, an arrow is placed back into each cell resulting in the maximum score, which are cells S1,0 and S1,1.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1  -3  -7 -11 -15 -19 -23 -27 -31 -35
G  -8   1
A -12
T -16
C -20
G -24
A -28

We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting matrix is as follows:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1  -3  -7 -11 -15 -19 -23 -27 -31 -35
G  -8   1   2  -2  -6 -10 -14 -18 -14 -18 -22 -26
A -12  -3   6   7   3  -1  -5  -9 -13 -17 -21 -17
T -16  -7   2   3  12   8   4   0  -4  -8 -12 -16
C -20 -11  -2  -1   8   9  13   9   5   1  -3  -7
G -24 -15  -6  -5   4   5   9  10  14  10   6   2
A -28 -19 -10  -1   0   1   5  14  10  11   7  11
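The fill procedure just described can be sketched in Python as follows. This is a minimal illustration under the scoring scheme of the example (+5, -3, -4); the function name needleman_wunsch_fill is our own, and the sketch stores only the scores, not the arrows.

```python
def needleman_wunsch_fill(rows_seq, cols_seq, match=5, mismatch=-3, w=-4):
    """Fill the global alignment score matrix S (rows: rows_seq, columns: cols_seq)."""
    n, m = len(rows_seq), len(cols_seq)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: first row and column are multiples of the gap penalty.
    for i in range(1, n + 1):
        S[i][0] = w * i
    for j in range(1, m + 1):
        S[0][j] = w * j
    # Matrix fill: each cell depends on its diagonal, left and upper neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # match/mismatch on the diagonal
                          S[i][j - 1] + w,        # move from the left
                          S[i - 1][j] + w)        # move from above
    return S

S = needleman_wunsch_fill("GGATCGA", "GAATTCAGTTA")
global_score = S[-1][-1]  # lower right hand cell
```

The two nested loops visit every cell once, which is where the O(MN) time and space cost of the method comes from.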

Each cell has one to three arrows indicating from which cell the maximum score was obtained. The matrix fill step is now complete.

Traceback Step. After the matrix fill step, the maximum global alignment score for the two sequences is 11 (the value in the lower right hand cell). The traceback step will obtain the actual alignment(s) that result in the maximum score. The traceback begins in position SN,M = S7,11 (row N, column M); i.e. the position where both sequences are globally aligned. Since pointers have been kept back to all possible predecessors, the traceback is simple. At each cell, we look to see where we move next according to the pointers. To begin, the only possible predecessor is the diagonal match.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1  -3  -7 -11 -15 -19 -23 -27 -31 -35
G  -8   1   2  -2  -6 -10 -14 -18 -14 -18 -22 -26
A -12  -3   6   7   3  -1  -5  -9 -13 -17 -21 -17
T -16  -7   2   3  12   8   4   0  -4  -8 -12 -16
C -20 -11  -2  -1   8   9  13   9   5   1  -3  -7
G -24 -15  -6  -5   4   5   9  10  14  10   6   2
A -28 -19 -10  -1   0   1   5  14  10  11   7  11

This gives us an alignment of

A
|
A

Note that the blue letters and gold arrows indicate the path leading to the maximum score. We can continue to follow the path until we get to the following situation:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1  -3  -7 -11 -15 -19 -23 -27 -31 -35
G  -8   1   2  -2  -6 -10 -14 -18 -14 -18 -22 -26
A -12  -3   6   7   3  -1  -5  -9 -13 -17 -21 -17
T -16  -7   2   3  12   8   4   0  -4  -8 -12 -16
C -20 -11  -2  -1   8   9  13   9   5   1  -3  -7
G -24 -15  -6  -5   4   5   9  10  14  10   6   2
A -28 -19 -10  -1   0   1   5  14  10  11   7  11
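Following the pointers back can also be sketched in code. The version below is an illustration only: instead of storing arrows it recomputes which predecessor produced each score, and when several predecessors tie it tries the diagonal first, then the left cell, then the cell above. That tie-breaking order and the function name global_align are our own assumptions.

```python
def global_align(rows_seq, cols_seq, match=5, mismatch=-3, w=-4):
    """Return (score, aligned column sequence, aligned row sequence)."""
    n, m = len(rows_seq), len(cols_seq)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = w * i
    for j in range(1, m + 1):
        S[0][j] = w * j

    def sub(i, j):
        return match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j), S[i][j - 1] + w, S[i - 1][j] + w)

    # Traceback from the lower right hand cell back to S[0][0].
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(cols_seq[j - 1]); bottom.append(rows_seq[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and S[i][j] == S[i][j - 1] + w:
            top.append(cols_seq[j - 1]); bottom.append("-")  # gap in the row sequence
            j -= 1
        else:
            top.append("-"); bottom.append(rows_seq[i - 1])  # gap in the column sequence
            i -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = global_align("GGATCGA", "GAATTCAGTTA")
```

With the example sequences and this tie-breaking order, the sketch returns a score of 11 with GAATTCAGTTA aligned over GGA-TC-G--A. A different tie-breaking order may return a different alignment with the same score.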

The resulting global alignment is as follows:

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

Remembering that the scoring scheme used was +5 for a match, -3 for a mismatch, and -4 for a gap, we can double check the score of the alignment:

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A
+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5

5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11

so this alignment results in a global alignment score of 11. It is possible for multiple alignments to yield the same maximal score, as evidenced by cells in the scoring matrix that have more than one arrow. In this example, for instance, the cell pairing the second T of sequence #1 with the T of sequence #2 (score 8) can be reached either diagonally (3 + 5) or from the left (12 - 4), so placing the gap under the second T instead of the first also gives a score of 11. In such a case, the traceback can be accomplished in any manner desired, as long as the same set of rules is used consistently, for the sake of reproducibility.

Local Alignment: Smith-Waterman Algorithm

In 1981, Temple Smith and Mike Waterman proposed a modification to the Needleman-Wunsch algorithm in order to obtain a local sequence alignment resulting in the highest-scoring local match between two sequences. Why choose a local alignment algorithm?

• More meaningful: points out conserved regions between two sequences
• Allows two sequences of different lengths to be matched
• Aligns two partially overlapping sequences
• Aligns two sequences where one is a subsequence of another

There are only two slight modifications that need to be made to the Needleman-Wunsch Algorithm in order to make it a local alignment algorithm. The first modification requires negative scores for mismatches. The second modification requires that when the dynamic programming scoring matrix value becomes negative, the value is set to zero, which has the effect of terminating any alignment up to that point. This has the effect of changing the matrix score to:

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2),

0]

The local alignments are then produced by starting at the highest-scoring positions in the scoring matrix and following a trace path from those positions up to a box that scores zero.

Initialization Step. In the initialization step of local alignment, each row Si,0 is set to 0. In addition, each column S0,j is set to 0. Using the scoring scheme described above, the initialization step results in the following:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0
G   0
A   0

Matrix Fill Step. One possible solution of the matrix fill step finds the maximum local alignment score by starting in the upper left hand corner in the matrix and finding the maximal score Si,j for each position in the matrix. In order to find Si,j for any i,j it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to i,j. In terms of matrix positions, it is necessary to know Si-1,j, Si,j-1 and Si-1,j-1.

For each position, Si,j is defined to be the maximum score at position i,j; i.e.

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2),

0]

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5
G   0
A   0
T   0
C   0
G   0
A   0

Note that in the example, Si-1,j-1 will be red, Si,j-1 will be green and Si-1,j will be blue. Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, s(a1,b1) = 5, and by the assumptions stated earlier, w = -4. Thus, S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4, 0] = MAX[5, -4, -4, 0] = 5.

Now we proceed to S1,2. Since a1 = G and b2 = A, there is a mismatch. Therefore, s(a1,b2) = -3 and by the assumptions stated earlier, w = -4. Thus, S1,2 = MAX[S0,1 - 3, S1,1 - 4, S0,2 - 4, 0] = MAX[0 - 3, 5 - 4, 0 - 4, 0] = MAX[-3, 1, -4, 0] = 1. An arrow is placed back into the cell that resulted in the maximum score, which is the cell S1,1.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1
G   0
A   0
T   0
C   0
G   0
A   0

Now we proceed to S1,3. Since a1 = G and b3 = A, there is a mismatch. Therefore, s(a1,b3) = -3 and by the assumptions stated earlier, w = -4. Thus, S1,3 = MAX[S0,2 - 3, S1,2 - 4, S0,3 - 4, 0] = MAX[0 - 3, 1 - 4, 0 - 4, 0] = MAX[-3, -3, -4, 0] = 0. Since the maximum score is 0 (all other possible scores are negative), no arrow is drawn back from this location.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1   0
G   0
A   0
T   0
C   0
G   0
A   0

We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting matrix is as follows:

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1   0   0   0   0   0   5   1   0   0
G   0   5   2   0   0   0   0   0   5   2   0   0
A   0   1  10   7   3   0   0   5   1   2   0   5
T   0   0   6   7  12   8   4   1   2   6   7   3
C   0   0   2   3   8   9  13   9   5   2   3   4
G   0   5   1   0   4   5   9  10  14  10   6   2
A   0   1  10   6   2   1   5  14  10  11   7  11
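The local matrix fill differs from the global one only in the zero initialization and the extra 0 in the maximum. A minimal Python sketch, assuming the same scoring scheme (+5, -3, -4), is given below; the function name smith_waterman_fill is our own.

```python
def smith_waterman_fill(rows_seq, cols_seq, match=5, mismatch=-3, w=-4):
    """Fill the local alignment score matrix S (rows: rows_seq, columns: cols_seq)."""
    n, m = len(rows_seq), len(cols_seq)
    # Initialization: row 0 and column 0 are all zeros.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # match/mismatch on the diagonal
                          S[i][j - 1] + w,        # move from the left
                          S[i - 1][j] + w,        # move from above
                          0)                      # never let a score go negative
    return S

S = smith_waterman_fill("GGATCGA", "GAATTCAGTTA")
best = max(max(row) for row in S)
# All (row, column) positions holding the best score: starting points for traceback.
best_cells = [(i, j) for i, row in enumerate(S) for j, v in enumerate(row) if v == best]
```

For the example sequences, best is 14 and it occurs in two cells, (6, 8) and (7, 7), which are the two starting points for the traceback.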

Each nonzero cell has one to three arrows indicating from which cell the maximum score was obtained. The matrix fill step is now complete.

Traceback Step. After the matrix fill step, the maximum local alignment score for the two sequences is 14, which can be found by locating the highest values in the score matrix. Note that 14 is found in two separate cells, indicating there are multiple alignments producing the maximal alignment score. The traceback step will find the actual local alignments resulting in the maximum score. The traceback begins in the position with the highest value. Since pointers have been kept back to all possible predecessors, the traceback is simple. At each cell, we look to see where we move next according to the pointers. When we reach a cell where there is not a pointer to a previous cell, then we have reached the beginning of the local alignment. First, consider the case where the 14 is in the last row.

    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1   0   0   0   0   0   5   1   0   0
G   0   5   2   0   0   0   0   0   5   2   0   0
A   0   1  10   7   3   0   0   5   1   2   0   5
T   0   0   6   7  12   8   4   1   2   6   7   3
C   0   0   2   3   8   9  13   9   5   2   3   4
G   0   5   1   0   4   5   9  10  14  10   6   2
A   0   1  10   6   2   1   5  14  10  11   7  11

Note that the blue letters and gold arrows indicate the path leading to the maximum score. We can continue to follow the path until we get to the following situation:


At this point, our alignment (which is built starting at the end of the alignment) is as follows:

C - A
|   |
C G A

Now the current cell gets its score either from a match of the T’s or a gap in the second sequence. We’ll consider both as possibilities: match of the T’s (1) and gap in the second sequence (2).


Once we reach the node with 0 and there are no pointers from this node, we are finished. The two local alignments resulting in a score of 14 in the final row are:

G A A T T C - A
|   | |   |   |
G G A T - C G A
+ - + + - + - +
5 3 5 5 4 5 4 5

G A A T T C - A
|   |   | |   |
G G A - T C G A
+ - + - + + - +
5 3 5 4 5 5 4 5

As you can see, each of these has 5 matches, 1 mismatch, and 2 gaps, so the score is 5(5) - 1(3) - 2(4) = 25 - 3 - 8 = 14. This coincides with the maximum local alignment score calculated in the matrix.

Incorporation of Scoring Matrices

Amino Acids

Certain amino acid substitutions commonly occur in related proteins from different species. Since the proteins in all of the species are functional, the substitutions maintain protein structure and function. Often the substitutions result in a chemically similar amino acid. Other substitutions are relatively rare. Thus, rather than create a dynamic programming matrix with a match/mismatch score, it would be better to weight a matching score for two residues dependent upon the likelihood that such a substitution would be observed in nature. In a substitution matrix (whether it is for amino acids or nucleic acids), the residues are listed both as column and row headings. Each position in the matrix is filled with a score reflecting how often one residue would be paired with another in an alignment of related sequences.

Percent Accepted Mutation (PAM) Matrices

Margaret Dayhoff pioneered the research into amino acid substitutions found through the alignment of common protein sequences. The resulting Percent Accepted Mutation (PAM) Matrices give the changes expected for a given period of evolutionary time. The assumption with this evolutionary model is that amino acid substitutions over short periods of evolutionary history can be extrapolated to longer distances.

Assumptions in Creating PAM matrices

Each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site.

Calculation of PAM matrices

• Amino acid substitutions of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences at least 85% similar. Since the proteins have similar functions, the mutations are called “accepted” mutations, meaning they are accepted by natural selection without negatively affecting a protein’s fitness.
• Similar sequences were organized into phylogenetic trees.
• The number of changes of each amino acid into every other amino acid was counted.
• Relative mutabilities were evaluated by counting the number of changes of each amino acid divided by a normalization factor. This normalized the data for variations in amino acid composition, mutation rate, and sequence length.
• The amino acid exchange counts and mutability values were used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes.

A detailed example of calculating the PAM matrix is located in Mount, p50. Since the changes are independent of previous mutational events, the PAM1 matrix can be multiplied by itself N times to give the transition matrices for sequences that have undergone N mutations. Thus, the PAM250 matrix can be used for sequences that are 20% similar, while the PAM 120, PAM80, and PAM60 matrices represent 40%, 50%, and 60% similarity. Note that PAM1 is 1 accepted mutation per 100 amino acids; PAM10 is 10 accepted mutations per 100 amino acids; PAM250 is 250 accepted mutations per 100 amino acids and so on. Thus, the substitution matrix chosen when aligning two sequences should take into account the divergence between the two sequences.
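The idea of multiplying the PAM1 matrix by itself can be illustrated with a small sketch. The 3 x 3 matrix below is a made-up toy example for a three-letter alphabet, not Dayhoff data, and the helper names mat_mul and mat_pow are our own; the real calculation would use the 20 x 20 PAM1 probabilities.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, power):
    """Multiply M by itself `power` times (power = 0 gives the identity)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, M)
    return result

# Hypothetical 1-PAM-like matrix: each row sums to 1 and about 1% of
# residues change per step (values invented for illustration only).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]
toy_pam250 = mat_pow(toy_pam1, 250)
```

After 250 steps the diagonal entries are much smaller than in the 1-step matrix, which reflects how little of the original sequence is expected to remain unchanged at large PAM distances.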

Example PAM1 matrix (normalized probabilities multiplied by 10000)

         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Taken from: http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#page7

Log Odds matrices

PAM matrices are usually converted into another form, called log odds matrices. Odds ratios are converted into logarithms in order that the scores may be added, rather than multiplied. Each cell of the log-odds matrix is calculated by first finding the odds ratio for each substitution. The odds ratio is calculated by taking the scores in the above matrix, which is the probability of one amino acid mutating to another given amino acid, and dividing it by the frequency of the first amino acid. Such a ratio gives the relative frequency of change. The ratio is then converted to a log10, so that the scores are additive, and it is multiplied by 10. The log odds for converting from the first amino acid to the second is added to the log odds for converting from the second amino acid to the first, and the average is taken to produce a symmetric matrix, since the direction of mutation cannot necessarily be inferred. An example of how the log odds score for changes between Phe and Tyr is calculated is given in Mount, pp 80-81. Make sure to look at this to see if you have any questions.

Figure: Log-odds form of the PAM250 scoring matrix. Image source: http://www.blc.arizona.edu/courses/bioinformatics/dayhoff.html (Image in Mount, p82)

Blocks Amino Acid Substitution Matrices (BLOSUM)

One of the arguments against the Dayhoff PAM matrices is that they represent only a small number of families, and therefore may not truly reflect amino acid distributions that one is likely to encounter. Therefore, another set of substitution matrices, called BLOSUM matrices, was developed using a much larger number of protein families. The BLOSUM matrices were developed by Steven and Jorja Henikoff by looking at a large set of approximately 2000 amino acid patterns organized into blocks, which are conserved regions within protein families as identified by the protein database, Prosite. The blocks that were studied were also signatures of a protein family, indicating that members of the family could be found by searching for these blocks. In order to deal with overrepresentation of amino acid substitutions occurring in the most closely related members of the family, a consensus sequence of the block is formed. Sequences that were 60% identical to the consensus were grouped together to form the BLOSUM60 matrix; sequences 80% identical were grouped together to form the BLOSUM80 matrix, etc.

Nucleic Acid Scoring Matrices

In addition to using a match/mismatch scoring scheme for DNA sequences, nucleotide mutation matrices can be constructed as well. These matrices are based upon two different models of nucleotide evolution: the first, the Jukes-Cantor model, assumes there are uniform mutation rates among nucleotides, while the second, the Kimura model, assumes that there are two separate mutation rates: one for transitions (where a purine is replaced by a purine, or a pyrimidine by a pyrimidine), and one for transversions (where a purine is exchanged for a pyrimidine, or vice versa). Generally, the rate of transitions is thought to be higher than the rate of transversions.

Jukes-Cantor Model of evolution: α = common rate of base substitution

Kimura Model of Evolution: α = rate of transitions; β = rate of transversions

[Figure: substitution diagram over the four nucleotides A, C, G, and T]

Purines: A, G
Pyrimidines: C, T
Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T

http://www.cs.man.ac.uk/~jowh6/phase/node26.html

Tables 3.4 and 3.5 indicate nucleotide substitution matrices with the equivalent distance of 1 PAM.

Table 3.4 PAM1 Odds Matrices

A. Model of uniform mutation rates among nucleotides.

          A        G        T        C
A      0.99
G   0.00333     0.99
T   0.00333  0.00333     0.99
C   0.00333  0.00333  0.00333     0.99

B. Model of 3-fold higher transitions than transversions.

          A        G        T        C
A      0.99
G     0.006     0.99
T     0.002    0.002     0.99
C     0.002    0.002    0.006     0.99

Table 3.5 PAM1 Log-Odds Matrices

A. Model of uniform mutation rates among nucleotides.

    A   G   T   C
A   2
G  -6   2
T  -6  -6   2
C  -6  -6  -6   2

B. Model of 3-fold higher transitions than transversions.

    A   G   T   C
A   2
G  -5   2
T  -7  -7   2
C  -7  -7  -5   2
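The relationship between Tables 3.4 and 3.5 can be checked with a short sketch. We assume here that each nucleotide has a background frequency of 0.25 and that the log-odds scores are log base 2 of the odds ratio, rounded to the nearest integer; these assumptions are ours, chosen because they reproduce the values in Table 3.5. All function names are illustrative.

```python
import math

PURINES = {"A", "G"}
BASES = "AGTC"  # same order as in the tables

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ((a in PURINES) == (b in PURINES))

def pam1_probabilities(transition_p, transversion_p, identity_p=0.99):
    """Build a PAM1 matrix as a dict keyed by (base, base)."""
    return {(a, b): identity_p if a == b
            else transition_p if is_transition(a, b)
            else transversion_p
            for a in BASES for b in BASES}

def log_odds(probs, background=0.25):
    """Assumed conversion: round(log2(probability / background frequency))."""
    return {pair: round(math.log2(p / background)) for pair, p in probs.items()}

uniform = log_odds(pam1_probabilities(0.00333, 0.00333))  # Table 3.4A -> 3.5A
biased = log_odds(pam1_probabilities(0.006, 0.002))       # Table 3.4B -> 3.5B
```

Under these assumptions a match scores 2 in both models, while in the second model a transition scores -5 and a transversion -7.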

Gap Penalties

The scoring matrices used to this point assume a linear gap penalty where each gap is given the same penalty score. However, over evolutionary time, it is more likely that a contiguous block of residues has become inserted/deleted in a certain region (for example, it is more likely to have 1 gap of length k than k gaps of length 1). Therefore, a better scoring scheme to use is an initial higher penalty for opening a gap, and a smaller penalty for extending the gap. The affine gap penalty can then be formulated as follows:

\[ w_x = g + r(x-1) \]

where w_x is the total gap penalty, g is the gap open penalty, r is the gap extend penalty, and x is the length of the gap. The gap penalty needs to be chosen relative to the score matrix, so that gaps will not be excluded from the alignment, or propagate throughout the alignment. Typical values are -12 for gap opening, and -4 for gap extension. Affine gap penalties increase the number of matrices (or at least storage space) to be filled out. The information to be processed is now:

\[
M_{i,j} = \max \left\{ \begin{array}{l}
D_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\
M_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\
I_{i-1,j-1} + \mathrm{subst}(A_i, B_j)
\end{array} \right.
\]

\[
D_{i,j} = \max \left\{ \begin{array}{l}
D_{i,j-1} - \mathrm{extend} \\
M_{i,j-1} - \mathrm{open}
\end{array} \right.
\]

\[
I_{i,j} = \max \left\{ \begin{array}{l}
M_{i-1,j} - \mathrm{open} \\
I_{i-1,j} - \mathrm{extend}
\end{array} \right.
\]

where M is the match matrix, D is the delete matrix, and I is the insert matrix.

Assessing the significance of sequence alignments

When two sequences of length m and n are not obviously similar but show an alignment, it becomes necessary to assess the significance of the alignment. The scores of alignments of random sequences have been shown to follow a Gumbel extreme value distribution.

[Figure: Gumbel extreme value distribution] Image source: http://roso.epfl.ch/mbi/papers/discretechoice/node11.html

Using a Gumbel extreme value distribution, the expected number of alignments with a score at least S (E-value) is:

\[ E = K m n \, e^{-\lambda S} \]

where:

m, n: lengths of the sequences
K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

Recall that the log-odds scoring schemes examined to this point normally use an S = 10*log10(x) scoring system. We can normalize the raw scores obtained using these non-gapped scoring systems to obtain the amount of bits of information contained in a score, or the amount of nats of information contained within a score.

Converting to bit scores

A raw score can be normalized to a bit score using the formula:

\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]

The E-value corresponding to a given bit score can then be calculated as:

\[ E = m n \, 2^{-S'} \]

Converting to nats is similar. However, we just substitute e for 2 in the above equations. Converting scores to either bits or nats gives a standardized unit by which the scores can be compared.

P-values

P-values can be calculated as the probability of obtaining a given score at random. P-values can be estimated as:

\[ P = 1 - e^{-E} \]

which is approximately E when E is small.

A quick determination of significance

If a scoring matrix has been scaled to bit scores, then it can quickly be determined whether or not an alignment is significant. For a typical amino acid scoring matrix, K = 0.1 and lambda depends on the values of the scoring matrix. If a PAM or BLOSUM matrix is used, then lambda is precomputed. For instance, if the log odds matrix is in units of bits, then lambda = ln 2, and the significance cutoff can be calculated as log2(mn).

Example (p110 Mount)

Suppose we have two sequences, each approximately 250 amino acids long, that are aligned using a Smith-Waterman approach and the PAM250 matrix. The following local alignment occurs:

F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G

Using the PAM250 matrix (p82), the score for this local alignment can be calculated as:

S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

S is in units of 10 * log10(x), so this should be converted to a bit score:

\[ S = 10 \log_{10} x \;\Rightarrow\; \frac{S}{10} = \log_{10} x \;\Rightarrow\; \frac{S}{10} \log_2 10 = \log_{10} x \cdot \log_2 10 = \log_2 x \]

\[ S' = \log_2 x = \frac{\log_2 10}{10} S \approx \frac{1}{3} S \]

In this case, S’ = 1/3 * 73 = 24.3 bits. The significance cutoff is:

log2(mn) = log2(250 * 250) = 16 bits

Since the alignment score is above the significance cutoff, this is a significant local alignment.

Estimation of P and E

When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while lambda is estimated to be 0.229. Using equations 30 and 31 (Mount), we can convert the score to a bit score:

\[ S' = 0.229 \times 73 - \ln(0.09 \times 250 \times 250) = 16.72 - 8.63 = 8.09 \text{ bits} \]

\[ P(S' \ge 8.09) = 1 - e^{-e^{-8.09}} = 3.1 \times 10^{-4} \]

Therefore, we see that the probability of observing an alignment with a bit score of at least 8.09 is about 3 in 10,000.
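The arithmetic of this example can be reproduced with a few lines of Python. This is a sketch of the calculations shown above, using the values from the example (S = 73, m = n = 250, K = 0.09, lambda = 0.229); the function names are our own.

```python
import math

def bits_from_tenlog10(raw_score):
    """Convert a score in units of 10*log10(x) into bits (log2(x))."""
    return raw_score * math.log2(10) / 10

def significance_cutoff_bits(m, n):
    """Quick significance cutoff log2(mn) for a matrix scaled in bits."""
    return math.log2(m * n)

def normalized_score(raw_score, lam, K, m, n):
    """S' = lambda*S - ln(K*m*n), as used in the example."""
    return lam * raw_score - math.log(K * m * n)

def p_value(s_prime):
    """P(S' >= x) = 1 - exp(-exp(-x))."""
    return 1 - math.exp(-math.exp(-s_prime))

raw = 73
bits = bits_from_tenlog10(raw)               # roughly raw / 3
cutoff = significance_cutoff_bits(250, 250)  # about 16 bits
s_prime = normalized_score(raw, 0.229, 0.09, 250, 250)
p = p_value(s_prime)
```

Using the exact factor log2(10)/10 instead of the rounded 1/3 gives a bit score slightly below 24.3, which does not change the conclusion that the score is above the cutoff.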

Significance of Gapped Alignments

Gapped alignments make use of the same statistics as ungapped alignments in determining the statistical significance. However, in gapped alignments, the values for lambda and K cannot be easily estimated. Empirical estimates for given gap scores have been determined by looking at the alignments of randomized sequences.

Bayesian Statistics

Bayesian statistics are built upon conditional probabilities, which are used to derive the joint probability of two events or conditions. P(B|A) is the probability of B given that condition A is true. P(B) is the probability of condition B occurring, regardless of condition A. Suppose that A can have two states, A1 and A2, and B can have two states, B1 and B2. Suppose that P(B1) = 0.3 is known. Therefore, P(B2) = 1 - 0.3 = 0.7. These probabilities are known as marginal probabilities. Now we would like to determine the probability of A1 and B1 occurring together, which is denoted as P(A1, B1) and is called the joint probability. Note that in this case the marginal probabilities of A1 and A2 are missing. Thus, there is not enough information at this point to calculate the joint probability. However, if more information about the joint occurrence of A1 and B1 is given, then the joint probabilities may be derived using Bayes Rule:

P(A1, B1) = P(B1)P(A1|B1)
P(A1, B1) = P(A1)P(B1|A1)

Suppose that we are given P(A1|B1) = 0.8. Then, since there are only two different possible states for A, P(A2|B1) = 1 - 0.8 = 0.2. If we are also given P(A2|B2) = 0.7, then P(A1|B2) = 0.3. Using Bayes Rule, the joint probability of having states A1 and B1 occurring at the same time is P(B1)P(A1|B1) = 0.3 * 0.8 = 0.24 and P(A2, B2) = P(B2)P(A2|B2) = 0.7 * 0.7 = 0.49. The other joint probabilities can be calculated from these as well. The calculation of the joint probabilities results in posterior probabilities, since they are not known initially, but are calculated using prior probabilities and initial information.
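The numbers in this example can be worked through in a short sketch. The dictionary layout and variable names are our own; the probabilities are the ones given above. The last two steps (the marginals of A and the value of P(B1|A1)) are not computed in the text and are included only to show how the two forms of Bayes Rule fit together.

```python
# Marginal probabilities of B and conditional probabilities of A given B.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {
    ("A1", "B1"): 0.8, ("A2", "B1"): 0.2,
    ("A1", "B2"): 0.3, ("A2", "B2"): 0.7,
}

# Joint probabilities: P(A, B) = P(B) * P(A | B)
joint = {(a, b): p_b[b] * p for (a, b), p in p_a_given_b.items()}

# Marginals of A, obtained by summing the joint probabilities over B.
p_a = {a: sum(v for (a2, _), v in joint.items() if a2 == a) for a in ("A1", "A2")}

# Rearranging P(A1, B1) = P(A1) * P(B1 | A1) gives the reverse conditional.
p_b1_given_a1 = joint[("A1", "B1")] / p_a["A1"]
```

The four joint probabilities come out to 0.24, 0.06, 0.21 and 0.49, which sum to 1 as they should.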
Applications of Bayesian Statistics

Bayesian statistics have many applications in bioinformatics. One application is in determining the evolutionary distance between two sequences (Agarwal and States, 1996; covered in Mount, pp 122-124). Another is in sequence alignment algorithms (Zhu et al, 1998; Mount pp 124-134). The significance of an alignment can also be computed using a Bayesian framework (Durbin, et al, pp 36-38). More applications using Bayesian statistics will be examined when the Gibbs Sampling algorithm is discussed during a later class period.

Drawbacks to Dynamic Programming Approach

Dynamic programming approaches are guaranteed to give the optimal alignment between two sequences given a scoring scheme. However, the two main drawbacks to DP approaches are that they are compute and memory intensive, in the cases discussed to this point taking at least O(n^2) time and space. Linear space algorithms have been used in order to deal with one drawback to dynamic programming. The basic idea is to concentrate only on those areas of the matrix more likely to contain the maximum alignment. The most well-known of these linear space algorithms is the Myers-Miller algorithm.

Available pairwise sequence alignment programs

FASTA suite of programs
LALIGN
BESTFIT
SIM GAP NAP LAP2 GAP2
http://genome.cs.mtu.edu/align/align.html
EMBOSS APPLICATIONS
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.html
WEB FORMS FOR EMBOSS APPLICATIONS
http://bioweb.pasteur.fr/seqanal/alignment/intro-uk.html#EMBOSS
http://bioinfo.pbi.nrc.ca:8090/EMBOSS/index.html
BAYESIAN TUTORIAL
http://www.wadsworth.org/resnres/bioinfo/tut1/index.htm

Expressed Sequences to Genomes

Sim4
est2genome
spidey

######################################## # Program: needle # Rundate: Wed Jan 22 20:09:50 2003 # Report_file: outfile.align ######################################## #======================================= # # Aligned_sequences: 2 # 1: gi # 2: gi # Matrix: EDNAFULL # Gap_penalty: 12.0 # Extend_penalty: 4.0 # # Length: 1030 # Identity: 537/1030 (52.1%) # Similarity: 537/1030 (52.1%) # Gaps: 493/1030 (47.9%) # Score: 1649.0 # # #======================================= gi 1 0 gi 1 ATACAAAATTTACGTGACTGGAGGGTGAAAGGGAATGTGGGAGGTCAGTG 50 gi 1 GGCAATAATGATACAATGTATCATGCCTCT 30 |||||||||||||||||||||||||||||| gi 51 CATTTAAAACATAAAGAAATGGCAATAATGATACAATGTATCATGCCTCT 100 gi 31 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 80 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 101 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 150 gi 81 CAA---------------------------ATAAATTGTAACTGATGTAA 103 ||| |||||||||||||||||||| gi 151 CAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAA 200 gi 104 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 153 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 201 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 250 gi 154 TATTTTA---------------------------------------TGGT 164 ||||||| |||| gi 251 TATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGCTTGGT 300 gi 165 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 214 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 301 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 350 gi 215 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 264 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 351 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 400 gi 265 CTGTGTGC--------------------------------CCAGTGCAGG 282 |||||||| |||||||||| gi 401 CTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGG 450 gi 283 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 332 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 451 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 500 gi 333 
TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 382 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 501 TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 550 gi 383 GTTCCCTAAGTCCAACTACTAAAC-------------------------- 406 |||||||||||||||||||||||| gi 551 GTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAATCATGT 600

gi 407 -----------------------TGGGGGATATTATGAAGGGCCTTGAGC 433 ||||||||||||||||||||||||||| gi 601 TCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGC 650 gi 434 ATCTGGATTCTGCCTAATAAAA---------------------------- 455 |||||||||||||||||||||| gi 651 ATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCATCTGCATAT 700 gi 456 -----------------------------------------TATTTCTGA 464 ||||||||| gi 701 AAATATTTCTGCATATAAATTGTAACATGATGTATTTAAATTATTTCTGA 750 gi 465 ATA-------------------------------TTTTACTAAAAAGGGA 483 ||| |||||||||||||||| gi 751 ATAAGAAATCTTACCACGTTTCTCCGTACTATGTTTTTACTAAAAAGGGA 800 gi 484 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 533 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 801 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 850 gi 534 AACC 537 |||| gi 851 AACCACTTACATCAGTTACAATTTATATGCAGAAATATTTATATGCAGAG 900 gi 538 537 gi 901 ATATTGCTTTAGGTCGGAATAGGGTTGGTATTTTATTTTCGTCTTACCAT 950 gi 538 537 gi 951 CGACCTAACATCGACGATAATAGCAGCTACAATCCAGCTACCATTCTGCT 1000 gi 538 537 gi 1001 TTTATTTTATGGTTGGGATAAGGCTGGATT 1030 #--------------------------------------- #---------------------------------------

water results ######################################## # Program: water # Rundate: Wed Jan 22 20:11:48 2003 # Report_file: outfile.align ######################################## #======================================= # # Aligned_sequences: 2 # 1: gi # 2: gi # Matrix: EDNAFULL # Gap_penalty: 12.0 # Extend_penalty: 4.0 # # Length: 660 # Identity: 484/660 (73.3%) # Similarity: 484/660 (73.3%) # Gaps: 152/660 (23.0%) # Score: 1660.0 # # #======================================= gi 1 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 50 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 71 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 120 gi 51 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAA----------------- 83 ||||||||||||||||||||||||||||||||| gi 121 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAAT 170 gi 84 ----------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 123 |||||||||||||||||||||||||||||||||||||||| gi 171 ATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 220 gi 124 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA------------- 160 ||||||||||||||||||||||||||||||||||||| gi 221 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAG 270 gi 161 --------------------------TGGTTGGGATAAGGCTGGATTATT 184 |||||||||||||||||||||||| gi 271 AAATATTTATATGCAGAGATATTGCTTGGTTGGGATAAGGCTGGATTATT 320 gi 185 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 234 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 321 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 370 gi 235 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC------------ 272 |||||||||||||||||||||||||||||||||||||| gi 371 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACT 420 gi 273 --------------------CCAGTGCAGGCTGCCTATCAGAAAGTGGTG 302 |||||||||||||||||||||||||||||| gi 421 TTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTG 470 gi 303 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 352 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 471 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 520 gi 353 
TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 402 |||||||||||||||||||||||||||||||||||||||||||||||||| gi 521 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 570 gi 403 AAAC---------------------------------------------- 406 |||| gi 571 AAACAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTTCCTC 620 gi 407 ---TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 453 |||||||||||||||||||||||||||||||||||||||||||||||

gi 621 CCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 670 gi 454 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATT 503 ||.|..| | ||||||..|...|...|.||.| ..|..|...|||||. gi 671 AAAACAT-T---TATTTTCATTGCATCTGCATAT-AAATATTTCTGCATA 715 gi 504 TAAAACATAA 513 ||||...||| gi 716 TAAATTGTAA 725 #--------------------------------------- #---------------------------------------

Blast 2 sequences

Score = 258 bits (134), Expect = 7e-66
Identities = 134/134 (100%)
Strand = Plus / Plus

Query: 273 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 332
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 441 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 500

Query: 333 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 392
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 501 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 560

Query: 393 tccaactactaaac 406
           ||||||||||||||
Sbjct: 561 tccaactactaaac 574

Score = 216 bits (112), Expect = 4e-53
Identities = 112/112 (100%)
Strand = Plus / Plus

Query: 161 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 220
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 297 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 356

Query: 221 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 272
           ||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 357 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 408

….

LALIGN /seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @ resetting to DNA matrix resetting to DNA matrix LALIGN finds the best local alignments between two sequences version 2.1u03 April 2000 Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381 resetting to DNA matrix alignments < E( 0.05):score: 75 (50 max) Comparison of: (A) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 537 nt (B) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 1058 nt using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05 73.3% identity in 660 nt overlap (1-513:99-753); score: 1660 E(10000): 3.5e-130 10 20 30 40 50 60 70 gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC 100 110 120 130 140 150 160 170 80 90 100 110 120 gi|227 AATAGCAA---------------------------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA :::::::: :::::::::::::::::::::::::::::::::::::::: gi|227 AATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 180 190 200 210 220 230 240 130 140 150 160 gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA-------------------------------------- ::::::::::::::::::::::::::::::::::::: gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGC 250 260 270 280 290 300 310 320 170 180 190 200 210 220 230 gi|227 -TGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: gi|227 TTGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 330 340 350 360 370 380 390 240 250 260 270 gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC--------------------------------CCAGT :::::::::::::::::::::::::::::::::::::: ::::: gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGT 400 410 420 430 440 450 460 
470 280 290 300 310 320 330 340 350 gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 480 490 500 510 520 530 540 360 370 380 390 400 gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAAC--------------------- :::::::::::::::::::::::::::::::::::::::::::::::::::::: gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAAT 550 560 570 580 590 600 610 620 410 420 430 440 450 gi|227 ----------------------------TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA ::::::::::::::::::::::::::::::::::::::::::::::: gi|227 CATGTTCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 630 640 650 660 670 680 690 460 470 480 490 500 510 gi|227 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAA :: : : : :::::: : : : :: : : : ::::: :::: ::: gi|227 AAAACAT-T---TATTTTCATTGCATCTGCATATAA-ATATTTCTGCATATAAATTGTAA 700 710 720 730 740 750

CECS 694-02 Introduction to Bioinformatics

Lecture 3: Multiple Sequence Alignment

Two Issues with the Programming Project
1. Amino Acid Sequence Alignment
2. Calculating alignment score using affine gap penalties

Amino Acid Sequence Alignment
With amino acid sequence alignment, there is no longer a simple match/mismatch score as there is with DNA sequence alignment, since one amino acid can be replaced by certain others while the protein still maintains its function. Therefore, when aligning two amino acid sequences, it is necessary to use a lookup table to find the match score between two amino acids. This lookup table is the scoring matrix described in the previous class, such as a PAM or BLOSUM matrix. Given two residues, you look up their pair in this matrix to determine their match score. You can use the symmetric PAM250 matrix on page 82 for amino acid sequence alignments in this project.

Pam250 Matrix, P 82 (Mount)

Calculating alignments using affine gap penalties
(Don't worry about this for this project; it will be part of the second programming assignment.)

In order to calculate an alignment using affine gap penalties, it is necessary to consider the possibility of either extending an existing gap or opening a new gap. To calculate the maximum alignment score matrix, V, it is necessary to consider three separate matrices: a match matrix (M), an insertion matrix (I), and a deletion matrix (D). The scores for these matrices are calculated as follows, where s(x_i, y_j) is the substitution score for aligning residue x_i with residue y_j, g is the gap open penalty, and r is the gap extend penalty:

$$M_{i,j} = \max\{\, M_{i-1,j-1} + s(x_i, y_j),\; I_{i-1,j-1} + s(x_i, y_j),\; D_{i-1,j-1} + s(x_i, y_j) \,\}$$

$$I_{i,j} = \max\{\, M_{i-1,j} - g,\; I_{i-1,j} - r \,\}$$

$$D_{i,j} = \max\{\, M_{i,j-1} - g,\; D_{i,j-1} - r \,\}$$

$$V_{i,j} = \max\{\, M_{i,j},\; I_{i,j},\; D_{i,j} \,\}$$

In the I and D recurrences, the first term opens a new gap and the second term extends an existing gap.
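The three-matrix recurrences can be sketched in a few lines of Python. This is only an illustrative sketch under some assumptions that are not in the notes: the function name affine_gap_score, the boundary conditions (leading gaps are charged g for the first position and r for each further position), and the simple match/mismatch scorer in the usage lines are all made up for the example. It returns only the final score V for a global alignment and does not perform a traceback.

```python
NEG_INF = float("-inf")

def affine_gap_score(x, y, score, gap_open, gap_extend):
    """Global alignment score with affine gaps (hypothetical helper).

    score(a, b) returns the substitution score s(a, b); gap_open is g
    and gap_extend is r, both given as positive penalties.
    """
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]; I: gap in y (x[i-1] vs '-'); D: gap in x.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    # Assumed boundary conditions: a leading gap of length k costs g + (k-1)*r.
    for i in range(1, n + 1):
        I[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        D[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            I[i][j] = max(M[i - 1][j] - gap_open, I[i - 1][j] - gap_extend)
            D[i][j] = max(M[i][j - 1] - gap_open, D[i][j - 1] - gap_extend)
    return max(M[n][m], I[n][m], D[n][m])

# Usage with a toy DNA scorer (match +1, mismatch -1), g = 3, r = 1.
toy_score = lambda a, b: 1 if a == b else -1
print(affine_gap_score("ACGT", "AT", toy_score, 3, 1))
```

For amino acid sequences, the score argument would instead look up the pair in a matrix such as PAM250.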

Multiple Sequence Alignment

Description
Genes that perform similar or identical functions can be conserved across species, and many genes are represented in highly conserved forms across organisms.

Unique human and mouse genes

By performing a simultaneous alignment of multiple sequences having similar or identical functions, we can gain information about which regions have been subject to mutations over evolutionary time and which are evolutionarily conserved. Such knowledge tells us which regions or domains of a gene are critical to its functionality. Sometimes genes that are similar in sequence can be mutated or rearranged to perform an altered function. By looking at multiple alignments of such sequences, we can tell which changes in the sequence have caused a change in functionality.

Multiple sequence alignment yields information concerning the structure and function of proteins, and can help lead to the discovery of important sequence domains or motifs with biological significance, while at the same time uncovering evolutionary relationships among genes. In multiple sequence alignment, the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment. The difficulty with multiple sequence alignment is that there are now many different combinations of matches, insertions, and deletions that must be considered across the several sequences. Methods that guarantee the highest scoring alignment are not computationally feasible for more than a few sequences. Therefore, approximation methods are put to use in multiple sequence alignment.

Example multiple alignment of 8 immunoglobulin sequences.

There are four approaches to multiple sequence alignment we will consider:
Dynamic programming approach
Progressive alignment
Iterative alignment
Statistical modeling

Extension of Dynamic Programming Approach
The attractiveness of dynamic programming with two sequences is that it is guaranteed to give the optimal alignment of the sequences under a specific scoring scheme. In addition, it is a relatively easy method to implement.

Dynamic programming approaches can be extended to multiple alignment as well. Consider the example where we have three amino acid sequences VSNS, SNA, and AS to align. Instead of filling a two dimensional matrix as we did with two sequences, we now fill a three dimensional space.

Figure source: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node2.html#SECTION00020000000000000000

Suppose the length of each sequence is n residues. If there are two such sequences, then the number of comparisons needed to fill in the scoring matrix is n^2, since it is a two-dimensional matrix. The number of comparisons needed to fill in the scoring cube when three sequences are aligned is n^3, and when four sequences are aligned, the number of comparisons needed is n^4. Thus the number of comparisons grows exponentially with the number of sequences: it is n^N, where n is the length of the sequences and N is the number of sequences. Without any changes to the dynamic programming approach, this quickly becomes impractical, even for a small number of short sequences.

Carrillo and Lipman (1988): Sum of Pairs
Lipman et al. (1989): MSA
Gupta et al. (1995): Substantial reduction in memory and number of required steps

Idea for reduction of memory and computations:

Multiple sequence alignment imposes an alignment on each of the pairs of sequences. The alignments found for each of the pairs of sequences can therefore impose bounds on the location of the MSA within the cube (three sequences) or N-dimensional space (N sequences).

Step 1: Find the alignment for each pair of sequences.
Step 2: Produce a trial msa by first predicting a phylogenetic tree for the sequences.
Step 3: Multiply align the sequences in the order of their relationship on the tree.

While this is a heuristic alignment (and is therefore not guaranteed to be optimal), it does provide a limit to the search space within which optimal alignments are likely to be found. Figures 4.2 and 4.3 (Mount) describe how the two-dimensional search spaces can be projected into a three-dimensional volume that can be searched. MSA calculates the multiple alignment score within the lattice by adding the scores of the corresponding pairwise alignments in the multiple sequence alignment. This measure is known as the sum of pairs (SP) measure. The optimal alignment is based on the best SP score.

Scoring Multiple Sequence Alignments using sum of pairs method
The sum of pairs method scores all possible combinations of pairs of residues in a column of a multiple sequence alignment. For instance, consider the alignment

ECSQ (1)
SNSG (2)
SWKN (3)
SCSN (4)

Since there are four sequences, there are six different pairs to consider for each column. The pairs, listed by sequence number, are as follows: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4

Pair   Column 1          Column 2          Column 3          Column 4
       Residues  Score   Residues  Score   Residues  Score   Residues  Score
1-2    E-S         0     C-N        -4     S-S         2     Q-G        -1
1-3    E-S         0     C-W        -8     S-K         0     Q-N         1
1-4    E-S         0     C-C        12     S-S         2     Q-N         1
2-3    S-S         2     N-W        -4     S-K         0     G-N         0
2-4    S-S         2     N-C        -4     S-S         2     G-N         0
3-4    S-S         2     W-C        -8     K-S         0     N-N         2
Sum                6               -16                 6                 3

Using PAM250 matrix, p250

Problem with this approach: more closely related sequences will have a higher weight. The MSA program gets around this by calculating weights to associate with each sequence alignment pair. The weights are assigned based on the predicted tree of the aligned sequences.

In summary, the steps of MSA are as follows:

1. Calculate all pairwise alignment scores.
2. Use the scores to predict a tree.
3. Calculate pair weights based on the tree.
4. Produce a heuristic msa based on the tree.
5. Calculate the maximum weight for each sequence pair.
6. Determine the spatial positions that must be calculated to obtain the optimal alignment.
7. Perform the optimal alignment.
8. Report the weight found compared to the maximum weight previously found.

MSA calculates an e value for each pair of sequences, which acts as the weight for that sequence pair. Sequences that are more divergent will have a higher e value.
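The unweighted sum of pairs calculation from the four-sequence example can be checked with a short Python sketch. The function names are made up for illustration, and the score table contains only the PAM250 values that appear in the worked table in these notes, not the full matrix. The pair weights that MSA applies are not modeled here.

```python
from itertools import combinations

# PAM250 scores for the residue pairs used in the worked example only
# (copied from the table in these notes; this is not the full matrix).
PAM250_SUBSET = {
    ("E", "S"): 0, ("C", "N"): -4, ("C", "W"): -8, ("C", "C"): 12,
    ("S", "S"): 2, ("S", "K"): 0, ("Q", "G"): -1, ("Q", "N"): 1,
    ("N", "W"): -4, ("G", "N"): 0, ("N", "N"): 2,
}

def pair_score(a, b, table=PAM250_SUBSET):
    """Look up a symmetric substitution score in either key order."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]

def column_sp_scores(alignment):
    """Return the sum of pairs score of every column of an ungapped alignment."""
    scores = []
    for column in zip(*alignment):
        scores.append(sum(pair_score(a, b) for a, b in combinations(column, 2)))
    return scores

alignment = ["ECSQ", "SNSG", "SWKN", "SCSN"]
per_column = column_sp_scores(alignment)
print(per_column)       # [6, -16, 6, 3]
print(sum(per_column))  # -1
```

The column scores match the sums in the table, and adding them gives the SP score of the whole alignment.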

Carrillo, H. and Lipman, D. (1988). The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5):1073-1082.

Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (June 1989). A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412-4415.

Description of Carrillo-Lipman technique:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node3.html
http://www.psc.edu/biomed/genedoc/whygd.htm

Progressive alignment methods
The approach of progressive alignment is to begin with an alignment of the most similar sequences, and then build upon that alignment using the other sequences. Progressive alignments work by first aligning the most similar sequences using dynamic programming, and then progressively adding less related sequences to the initial alignment.

CLUSTALW and CLUSTALX are progressive alignment programs that follow these steps:

1) Perform pairwise alignments of all of the sequences.
2) Use the alignment scores to produce a phylogenetic tree using neighbor-joining methods.
3) Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.

The initial pairwise alignments are calculated using an enhanced dynamic programming algorithm, and the genetic distances used to create the phylogenetic tree are calculated by dividing the total number of mismatched positions by the total number of matched positions. Alignments are assigned a weight based on their distance from the root node. In a progressive alignment, gaps are added to a profile of an existing multiple sequence alignment. Statistical tests have been devised so that gaps accumulate between secondary structure elements, which models what is found in nature. Such information is incorporated into scoring an alignment using CLUSTALW.

PILEUP is the multiple sequence alignment program that is part of the Genetics Computer Group (GCG) package developed at the University of Wisconsin. In PILEUP, multiple sequence alignment is performed by first aligning each of the sequences in a pair-wise fashion using a Needleman-Wunsch approach. The resulting scores are used to produce a tree by the unweighted pair-group method using arithmetic averages (UPGMA). The resulting tree is then used to guide the alignment of the most closely related sequences and groups of sequences.
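To make the guide-tree step concrete, here is a minimal UPGMA sketch in Python. It is not PILEUP's code: the function name upgma, the nested-tuple tree representation, and the distance values in the usage lines are assumptions made for illustration. It repeatedly merges the two closest clusters and averages their distances to every other cluster, weighting by cluster size.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree as nested tuples (hypothetical helper).

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)                        # working copy; new clusters are added
    sizes = {name: 1 for name in names}   # cluster -> number of sequences in it
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted arithmetic average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

# Usage with made-up distances: A and B are close, C and D are close.
pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 10,
         ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10}
distances = {frozenset(k): float(v) for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], distances))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest sequences first and then groups of sequences.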

Problems with progressive alignments
The difficulty with progressive alignments is that they depend upon the initial pair-wise sequence alignments. If the sequences are closely related, then the likelihood is good that the initial alignment contains relatively few errors. However, if the initial sequences are distantly related, then there will be more errors in the alignment, which will propagate through the rest of the alignments. The second issue is that suitable scoring matrices and gap penalties must be chosen to apply to the sequences as a set.

Iterative alignment methods
Iterative alignment methods begin by making an initial alignment of the sequences. These alignments are then revised to give a more reasonable result. The objective of this approach is to improve the overall alignment score. Programs that take this approach include MultAlin, PRRP, and DIALIGN.

Genetic Algorithms
The goal of genetic algorithms used in sequence alignment is to generate as many different multiple sequence alignments as possible by rearrangements that simulate gaps and genetic recombination events. SAGA (Sequence Alignment by Genetic Algorithm) is one such approach that yields very promising results, but becomes slow when more than 20 sequences are used.

Steps of SAGA (Genetic Algorithm)

1) Up to 20 different sequences are written in a row, allowing for overlaps of a random length. The ends of these sequences are then padded with gaps. Typically, upwards of 100 initial alignments are made.

2) The initial alignments are scored by the sum of pairs method. Standard amino acid scoring matrices and gap open, gap extension penalties are used.

3) Initial alignments are replaced to give another generation of multiple sequence alignments. One half of the multiple sequence alignments are chosen to proceed to the next generation unchanged (natural selection). This half is chosen by assigning probabilities to each alignment based on an inverse proportion of their SP scores (the best alignments, since the SP scores are weighted according to their distance from the parent). The other half of the alignments are sent to the next generation, but are first subject to mutation.

4) In the mutation process, gaps are inserted into the sequences subject to mutation and rearranged in an attempt to create a better scoring alignment. In this step, the sequences are split into two sets based on an estimated phylogenetic tree, and gaps of random lengths are inserted into random positions in the alignment.

5) Recombination of two parent alignments is accomplished ???

6) The next generation is evaluated going back to step 2, and steps 2-5 are repeated a number of times (100-1000). The best scoring multiple sequence alignment is then obtained (note that it may not be the optimal scoring alignment).

7) The entire process is repeated several times, starting from a different initial alignment each time. The best scoring multiple sequence alignment is then chosen and reported to the user.

Simulated Annealing
Another approach to sequence alignment that works in a manner similar to genetic algorithms is simulated annealing. In this approach, you begin with a heuristically determined multiple sequence alignment, which is then changed using probabilistic models that identify changes in the alignment that increase the alignment score. The drawback of simulated annealing approaches is that you can get stuck at a locally optimal alignment rather than finding the alignment whose score is globally optimal.
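The notes do not say which rule is used to decide whether a proposed change to the alignment is kept. A common choice in simulated annealing generally is the Metropolis criterion, sketched below in Python as an assumption rather than as the method of any particular alignment program. The names accept_change, delta, and temperature are made up for the example.

```python
import math
import random

def accept_change(delta, temperature, rng=random):
    """Metropolis acceptance rule (assumed here, not taken from the notes).

    delta is the change in alignment score caused by a proposed edit
    (new score minus old score). Improvements are always accepted;
    a worse alignment is accepted with probability exp(delta / temperature),
    which shrinks as the temperature is lowered.
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

# Usage: at a high temperature a slightly worse alignment is often kept,
# at a low temperature it is almost never kept.
rng = random.Random(0)
kept_hot = sum(accept_change(-1.0, 10.0, rng) for _ in range(1000))
kept_cold = sum(accept_change(-1.0, 0.1, rng) for _ in range(1000))
print(kept_hot, kept_cold)
```

Occasionally accepting a worse alignment is what gives the search a chance of leaving a local optimum, although it does not remove that risk.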

Other methods for multiple sequence alignment

Group approach
In the group approach, sequences are clustered into related groups. A consensus sequence is produced to make alignments between the groups. Examples of programs implementing the group approach are PIMA and MULTAL.

Tree approach
The tree method uses the distance method of phylogenetic analysis to arrange the sequences. The two closest sequences are then aligned, and the consensus of these two is aligned with the next best sequence (or group of sequences) until an alignment is produced that includes all of the sequences. This is a popular approach, used by PILEUP, CLUSTALW and ALIGN. TREEALIGN is a program that uses the tree approach, but rearranges the tree as sequences are aligned in order to produce a maximum parsimony tree.

Localized Alignments in sequences
Just like with pairwise alignments, we may not be interested in the global alignment of multiple sequences, but rather only specific regions that are conserved. For instance, given regions of genomic DNA occurring upstream or before a certain gene, there might be sequences where transcription factors bind to the DNA so that the gene can be transcribed. Thus, if we are interested in determining if there is any signal in the regions upstream of a certain family of genes across several different organisms, it would be important to only find the conserved region, and not try to align all of the genomic DNA. Localized alignments of protein sequences can yield information about conserved domains found in otherwise unrelated proteins.

Programs to detect localized alignments typically use one of the following three approaches: profile analysis; block analysis; pattern-searching or statistical methods.

Profile analysis
Profiles are found by first multiply aligning the sequences, determining which regions are the most highly conserved, and then creating a scoring matrix for the alignment of the highly conserved region. The profile is composed of columns, and may include matches, mismatches, insertions, and deletions found in a particular column. Once a profile is created, it can be used to search a target sequence or database for possible matches to the profile, using the profile's scores to evaluate the likelihood at each position. The drawback of profiles is that the profile is only as representative as the variation in the sequences used to construct it. Thus, there is a bias in the profile towards the training data.

For each position in a profile, there is a column for each amino acid, plus a column for an unknown amino acid (z), and a column for gap opening and gap extension. There is a row for each position in the multiple alignment.

Block Analysis
Expectation-Maximization
Gibbs Sampling
Hidden Markov Models
Position Specific Scoring Matrix
Sequence Logos

Profile: Scores for substitutions and gaps in each column
Blocks: Ungapped aligned regions
Alignments based on locally conserved patterns found in the same order in the sequences (synteny)
Use of statistical methods and probabilistic models of the sequences

Multiple sequence alignments yield information about the evolutionary history of the sequences: sequences that are most similar are likely to be recently derived from a common ancestor sequence.

If the sequences in a multiple alignment have quite a bit of variation, then it is difficult to create a multiple sequence alignment due to the different combinations of substitutions, insertions, and deletions that can be used.

Local Alignment of proteins
ECSQ
SNSG
SWKN
SCSN

Profiles and Position-Specific scoring matrices
Motif-Based Approaches
Gibbs Sampling Algorithm (describe this; hand out papers)
MEME
Meta-MEME
Hidden Markov Models
Scoring Multiple Alignments

Programs for multiple sequence alignment
Progressive Alignment programs: CLUSTALW, CLUSTALX
MSA
PRALINE
Iterative Alignment Programs: DIALIGN, MULTALIGN, PRRP, SAGA
Local Alignment of proteins: Asset, BLOCKS, eMOTIF, Gibbs Sampler, HMMER, MACAW, MEME, SAM
MSA
ClustalX
ClustalW
Viewing Multiple Alignments: sequence logos
Once an alignment is made, they can be compared using Hidden Markov Models

IBM's MUSCA: http://cbcsrv.watson.ibm.com/Tmsa.html
ClustalW: http://www.ebi.ac.uk/clustalw/ http://clustalw.genome.ad.jp/
DIALIGN: http://bibiserv.techfak.uni-bielefeld.de/cgi-bin/dialign_submit
Web Logo: http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi

DNA Sequence Data sets
Fasta File Format
GenBank File Format
ASN.1 File Format
XML File Format


Lecture 4: Multiple Sequence Alignment

Localized Alignments in sequences

Just like with pairwise alignments, we may not be interested in the global alignment of multiple sequences, but rather only specific regions that are conserved. For instance, given regions of genomic DNA occurring upstream or before a certain gene, there might be sequences where transcription factors bind to the DNA so that the gene can be transcribed. Thus, if we are interested in determining if there is any signal in the regions upstream of a certain family of genes across several different organisms, it would be important to only find the conserved region, and not try to align all of the genomic DNA. Localized alignments of protein sequences can yield information about conserved domains found in otherwise unrelated proteins.

Programs to detect localized alignments typically use one of the following three approaches: Profile Analysis; block analysis; pattern-searching or statistical methods

Profile analysis
Profiles are found by first multiply aligning the sequences, determining which regions are the most highly conserved, and then creating a scoring matrix for the alignment of the highly conserved region. The profile is composed of columns, and may include matches, mismatches, insertions, and deletions found in a particular column. Once a profile is created, it can be used to search a target sequence or database for possible matches to the profile, using the profile's scores to evaluate the likelihood at each position. The drawback of profiles is that the profile is only as representative as the variation in the sequences used to construct it. Thus, there is a bias in the profile towards the training data.

For each position in a profile, there is a column for each amino acid, plus a column for an unknown amino acid (z), and a column for gap opening and gap extension. There is a row for each position in the multiple alignment.

Calculating profiles
How are the values in a profile created? The value of an individual cell is calculated as a log odds score: the logarithm of the probability of finding a particular residue at a particular location in the alignment, divided by the probability of aligning the two amino acids by random chance, using a particular scoring scheme (such as PAM250, BLOSUM80, …). Additional penalties must be calculated for gap opening and gap extension in the profile as well. Where a gap exists in the multiple alignment, the penalties for gaps are reduced. One method (the average method) uses the proportion of each amino acid found in a particular column to weight the score of matching that residue against the consensus residue at that position.

Shannon Entropy
One method to calculate the observed column variation given the expected variation in the evolutionary model is to use an information measure known as entropy. The smaller the entropy, the more conserved a column is. Entropy for a single column is calculated by the following formula:

$$H = -\sum_{\text{residues } (a)} f_a \log(p_a)$$

Here f_a is the observed proportion of each residue a in the msa column and p_a is the expected frequency of the residue when derived from a given ancestor residue. With an amino acid msa, the entropy measure can be used with several different evolutionary distances to determine which one minimizes entropy. (See page 164 of Mount for the discussion.)

Another way of creating a profile is to use log-odds scores. In this method, the log2 of the ratio of observed to background frequencies is calculated for each position. The result is the amount of information available in an alignment, given in bits. A new sequence can then be searched to see if it possibly contains the motif.
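Both column measures can be sketched in Python. The function names and the uniform DNA background in the usage lines are assumptions made for illustration, and base-2 logarithms are used for the entropy as well so that both results are in bits; the notes do not fix the base of the logarithm in the entropy formula.

```python
import math
from collections import Counter

def column_frequencies(column):
    """Observed proportion f_a of each residue a in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return {a: c / total for a, c in counts.items()}

def column_entropy(column, expected):
    """H = -sum over residues a of f_a * log2(p_a), with p_a taken from `expected`."""
    freqs = column_frequencies(column)
    return -sum(f * math.log2(expected[a]) for a, f in freqs.items())

def column_log_odds(column, background):
    """log2(observed / background) for each residue seen in the column, in bits."""
    freqs = column_frequencies(column)
    return {a: math.log2(f / background[a]) for a, f in freqs.items()}

# Usage with a toy DNA column and an assumed uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(column_entropy("AACC", uniform))
print(column_log_odds("AACC", uniform))
```

Residues that never occur in a column are simply skipped here; a practical profile would usually add pseudocounts so that unseen residues still receive a finite score.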

Block Analysis
Blocks are similar to profiles in the sense that they represent locally conserved regions within a multiple sequence alignment. The difference is that blocks lack indels. Blocks can be determined either by performing a multiple sequence alignment, or by searching a database for similar sequences of the same length. Algorithms for searching for a BLOCK were initially developed by Henikoff and Henikoff (1991). Statistical approaches to finding the most alike sequences have been proposed, such as the Expectation-Maximization algorithms and the Gibbs sampler. In any case, once a set of blocks has been determined, the information contained within the block alignment can be displayed as a sequence profile.

Extraction of blocks from a multiple sequence alignment
A global sequence alignment will usually contain ungapped regions that are aligned between multiple sequences. These regions can be extracted to produce blocks. Two programs that allow for the extraction of a block from a multiple sequence alignment are BLOCKS and eMOTIF, both of which can read in a multiple sequence alignment in one of many different file formats and perform the extraction of the blocks. The websites for these programs are:
http://www.blocks.fhcrc.org/blocks/process_blocks.html
http://dna.stanford.edu/emotif/

Below is a set of 10 truncated kinases:

>D28 CD28 S. CEREVISIAE CELL CYCLE CONTROL PROTEIN KINASE
ANYKRLEKVGEGTYGVVYKALDLRPGQGQR
VVALKKIRLESEDEGVPSTAIREISLLKEL
>SKH SKH HELA MYSTERY PUTATIVE PROTEIN KINASE
AKYDIKALIGRGSFSRVVRVEHRATRQPYA
IKMIETKYREGREVCESELRVLRRVRHANI
>APK CAPK BOVINE CARDIAC MUSCLE CYCLIC AMP-DEPENDENT (ALPHA)
DQFERIKTLGTGSFGRVMLVKHMETGNHYA
MKILDKQKVVKLKQIEHTLNEKRILQAVNF
>EE1 WEE1 S. POMBE MITOTIC INHIBITOR
TRFRNVTLLGSGEFSEVFQVEDPVEKTLKY
AVKKLKVKFSGPKERNRLLQEVSIQRALKG
>GFR EGFR HUMAN EPIDERMAL GROWTH FACTOR RECEPTOR
TEFKKIKVLGSGAFGTVYKGLWIPEGEKVK
IPVAIKELREATSPKANKEILDEAYVMASV
>DGM PDGF RECEPTOR, MOUSE KINASE REGION
DQLVLGRTLGSGAFGQVVEATAHGLSHSQA
TMKVAVKMLKSTARSSEKQALMSELYGDLV
>FES THIS IS VFES TYROSINE KINASE
VLNRAVPKDKWVLNHEDLVLGEQIGRGNFG
EVFSGRLRADNTLVAVKSCRETLPPDIKAK
>AF1 RAF1 HUMAN C-RAF-1 ONCOGENE
SEVMLSTRIGSGSFGTVYKGKWHGDVAVKI
LKVVDPTPEQFQAFRNEVAVLRKTRHVNIL
>MOS CMOS HUMAN C-MOS ONCOGENE
EQVCLLQRLGAGGFGSVYKATYRGVPVAIK
QVNKCTKNRLASRRSFWAELNVARLRHDNI
>SVK HSVK HERPES SIMPLEX VIRUS PUTATIVE PROTEIN KINASE
MGFTIHGALTPGSEGCVFDSSHPDYPQRVI
VKAGWYTSTSHEARLLRRLDHPAILPLLDL

(Multiple Alignment created using ClustalW; Colors Added using BoxShade)

AF1  1 -SEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTPEQFQAFRNEVAVLRKT--RHVNIL
MOS  1 -EQVCLLQRLGAGGFGSVYKATYRG-VPVAIKQVNKCTKNRLASRRSFWAELNVARLRHDNI-
DGM  1 -DQLVLGRTLGSGAFGQVVEATAHG-LSHSQATMKVAVKMLKSTARSSEKQALMSELYGDLV-
GFR  1 -TEFKKIKVLGSGAFGTVYKGLWIP-EGEKVKIPVAIKELREATSPKANKEILDEAYVMASV-
D28  1 -ANYKRLEKVGEGTYGVVYKALDLR--PGQGQRVVALKKIRLESEDEGVPSTAIREISLLKEL
SKH  1 -AKYDIKALIGRGSFSRVVRVEHRA-TRQPYAIKMIETKYREGREVCESELRVLRRVRHANI-
APK  1 -DQFERIKTLGTGSFGRVMLVKHME-TGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNF-
EE1  1 -TRFRNVTLLGSGEFSEVFQVEDPVEKTLKYAVKKLKVKFSGPKERNRLLQEVSIQRALKG--
FES  1 VLNRAVPKDKWVLNHEDLVLGEQIG-RGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKAK--
SVK  1 -MGFTIHGALTPGSEGCVFDSSHPD-YPQRVIVKAGWYTSTSHEARLLRRLDHPAILPLLDL-
cons 1 qf ll lgsgsfg vykg g k i v k r v l i

Taking this alignment, we can generate blocks using the BLOCKS server:

ID x6676xbli; BLOCK

AC x6676xbliA; distance from previous blocks=(1,1)

DE ../tmp/6676.blin

BL UNK motif; width=24; seqs=10; 99.5%=0; strength=0

AF1 ( 1) SEVMLSTRIGSGSFGTVYKGKWHG 41

MOS ( 1) EQVCLLQRLGAGGFGSVYKATYRG 48

DGM ( 1) DQLVLGRTLGSGAFGQVVEATAHG 49

GFR ( 1) TEFKKIKVLGSGAFGTVYKGLWIP 41

D28 ( 1) ANYKRLEKVGEGTYGVVYKALDLR 61

SKH ( 1) AKYDIKALIGRGSFSRVVRVEHRA 54

APK ( 1) DQFERIKTLGTGSFGRVMLVKHME 46

EE1 ( 1) TRFRNVTLLGSGEFSEVFQVEDPV 55

FES ( 1) LNRAVPKDKWVLNHEDLVLGEQIG 100

SVK ( 1) MGFTIHGALTPGSEGCVFDSSHPD 73

//

ID x6676xbli; BLOCK

AC x6676xbliB; distance from previous blocks=(2,2)

DE ../tmp/6676.blin

BL UNK motif; width=28; seqs=10; 99.5%=0; strength=0

AF1 ( 27) AVKILKVVDPTPEQFQAFRNEVAVLRKT 87

MOS ( 27) PVAIKQVNKCTKNRLASRRSFWAELNVA 75

DGM ( 27) SHSQATMKVAVKMLKSTARSSEKQALMS 92

GFR ( 27) GEKVKIPVAIKELREATSPKANKEILDE 83

D28 ( 27) PGQGQRVVALKKIRLESEDEGVPSTAIR 83

SKH ( 27) RQPYAIKMIETKYREGREVCESELRVLR 74

APK ( 27) GNHYAMKILDKQKVVKLKQIEHTLNEKR 85

EE1 ( 27) TLKYAVKKLKVKFSGPKERNRLLQEVSI 77

FES ( 27) GNFGEVFSGRLRADNTLVAVKSCRETLP 100

SVK ( 27) PQRVIVKAGWYTSTSHEARLLRRLDHPA 92

//
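The idea of pulling ungapped regions out of a multiple alignment can be sketched as follows. This is a simplified illustration, not the algorithm used by the BLOCKS or eMOTIF servers: the function name ungapped_blocks, the min_width parameter, and the toy alignment are assumptions. It reports every run of consecutive columns in which no sequence has a gap.

```python
def ungapped_blocks(alignment, min_width=1, gap="-"):
    """Return (start, end) column ranges, end exclusive, that contain no gaps.

    alignment is a list of equal-length aligned sequences.
    """
    blocks = []
    start = None
    ncols = len(alignment[0])
    # Iterate one column past the end so that a block reaching the last
    # column is closed as well.
    for col in range(ncols + 1):
        gap_free = col < ncols and all(seq[col] != gap for seq in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks

# Usage on a toy alignment.
toy = ["AC-GT",
       "ACTGT",
       "A-TGT"]
for begin, end in ungapped_blocks(toy):
    print(begin, end, [seq[begin:end] for seq in toy])
```

The same function could be applied to the ClustalW rows of the kinase alignment, using only the aligned sequence strings without the name and position columns.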

Expectation-Maximization
In expectation-maximization algorithms, the starting point is a set of sequences expected to have a common sequence pattern that may not be easily detectable. An initial guess is made as to the location and size of the site of interest in each of the sequences. These initial sites are then aligned.

Expectation Step
In the expectation step, background residue frequencies are calculated based on those residues that are not in the initially aligned sites. Column-specific residue frequencies are calculated for each position in the initial motif alignment. Using this information, the probability of finding the site at any position in the sequences can then be calculated.

Maximization Step
In the maximization step, the counts of residues for each position in the site as found in the expectation step are used to calculate the location within each sequence that maximally aligns to the motif pattern calculated in the expectation step. This is done for each of the sequences. Once a new motif location has been calculated, the expectation step is repeated. This cycle continues until the solution converges.

Example of EM: begin with an initial, random alignment:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

From this alignment, the frequency of each base occurring is calculated. In this case, the motif we are searching for is six bases wide. Therefore, we need to calculate seven different sets of frequencies: one for the background, and one for each of the columns in the motif. Calculating the total counts, we get:

After calculating the observed counts for each of the positions, we can convert these to observed frequencies:

In the expectation step, the residue frequencies for the motif are used to estimate the composition of the motif site. The expectation step attempts to maximally discriminate between sequence within and not within the site. For each sequence, each possible motif location is considered in order to find the most probable location given the current motif. Consider the first sequence:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

There are a total of 41 residues and the motif is six bases wide, so there are 41 - 6 + 1 = 36 potential sites to consider:

Site    1     2     3     4     5     6     1*2*3*4*5*6  RANDOM    ODDS
TCAGAA  .241  .230  .256  .226  .289  .263  0.000244     0.000274  0.89
CAGAAC  .263  .296  .246  .256  .289  .256  0.000363     0.000362  1.00
AGAACC  .256  .233  .256  .256  .256  .256  0.000256     0.000362  0.71
GAACCA  .240  .296  .256  .256  .256  .263  0.000313     0.000362  0.87
AACCAG  .256  .296  .243  .256  .289  .233  0.000317     0.000362  0.88
ACCAGT  .256  .230  .243  .256  .213  .248  0.000193     0.000274  0.71
CCAGTT  .263  .230  .256  .226  .241  .248  0.000209     0.000257  0.81
CAGTTA  .263  .296  .246  .261  .241  .263  0.000317     0.000257  1.23
AGTTAT  .256  .233  .254  .261  .289  .248  0.000283     0.000241  1.18
GTTATA  .240  .241  .254  .256  .241  .263  0.000238     0.000241  0.99
TTATAA  .241  .241  .256  .261  .289  .263  0.000295     0.000297  0.99
TATAAA  .241  .296  .254  .256  .289  .263  0.000353     0.000297  1.19
ATAAAT  .256  .241  .256  .256  .289  .248  0.000290     0.000318  0.91
TAAATT  .241  .296  .256  .256  .241  .248  0.000279     0.000297  0.94
AAATTT  .256  .296  .256  .261  .241  .248  0.000303     0.000297  1.02
AATTTA  .256  .296  .254  .261  .241  .263  0.000318     0.000297  1.07
ATTTAT  .256  .241  .254  .261  .289  .248  0.000293     0.000278  1.05
TTTATC  .241  .241  .254  .256  .241  .256  0.000233     0.000278  0.84
TTATCA  .241  .241  .256  .261  .256  .263  0.000261     0.000297  0.88
TATCAT  .241  .296  .254  .256  .289  .248  0.000332     0.000297  1.12
ATCATT  .256  .241  .243  .256  .241  .248  0.000229     0.000297  0.77
TCATTT  .241  .230  .256  .261  .241  .248  0.000221     0.000278  0.80
CATTTC  .263  .296  .254  .261  .241  .256  0.000318     0.000297  1.07
ATTTCC  .256  .241  .254  .261  .256  .256  0.000268     0.000297  0.90
TTTCCT  .241  .241  .254  .256  .256  .248  0.000240     0.000278  0.86
TTCCTT  .241  .241  .243  .256  .241  .248  0.000216     0.000278  0.78
TCCTTC  .241  .230  .243  .261  .241  .256  0.000217     0.000297  0.73
CCTTCT  .263  .230  .254  .261  .256  .248  0.000255     0.000297  0.86
CTTCTC  .263  .241  .254  .256  .241  .256  0.000254     0.000297  0.86
TTCTCC  .241  .241  .243  .261  .256  .256  0.000241     0.000297  0.81
TCTCCA  .241  .230  .254  .256  .256  .263  0.000243     0.000318  0.76
CTCCAC  .263  .241  .243  .256  .289  .256  0.000292     0.000339  0.86
TCCACT  .241  .230  .243  .256  .256  .248  0.000219     0.000318  0.69
CCACTC  .263  .230  .256  .256  .241  .256  0.000245     0.000339  0.72
CACTCC  .263  .296  .243  .261  .256  .256  0.000324     0.000339  0.95
ACTCCT  .256  .230  .254  .256  .256  .248  0.000243     0.000318  0.76

The six-base site CAGTTA beginning at base 8 is calculated to have the highest odds ratio (1.23). Therefore, it is chosen as the new site in sequence 1. This is repeated for each of the sequences. In the maximization step, the newly chosen sites for each of the sequences are used to recalculate the frequency table. The expectation/maximization cycle is then repeated until the results converge on a set of motifs.

Multiple EM for Motif Elicitation (MEME)

MEME is a program that uses the expectation-maximization method described previously. ParaMEME searches for blocks using the EM algorithm, while MetaMEME searches for profiles using Hidden Markov Models (HMMs). MEME locates one or more ungapped patterns in a single DNA or protein sequence, or in a series of sequences. A search is conducted over a variety of motif widths in order to determine the most likely width for the profile. This likelihood is based on the log likelihood score calculated after the EM algorithm. One of three types of motif models can be chosen:

• OOPS: One expected occurrence per sequence
• ZOOPS: Zero or one expected occurrence per sequence
• TCM: Any number of occurrences of the motif

Various prior knowledge can be supplied to MEME, including the expected number of motifs, the expected length of the motif, and whether or not the motif is palindromic (only applicable for DNA sequences).
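The site-scoring table in the EM example above can be reproduced in outline with a few lines of Python. This is only a minimal sketch, not MEME's implementation: the function names are illustrative, and the frequencies used in the usage lines are hypothetical placeholders, since the count and frequency tables of the worked example are not reproduced in these notes.

```python
from math import prod

def score_sites(seq, motif_freqs, background):
    """Score every window of the motif width in seq.

    motif_freqs is a list with one dict per motif column (base -> frequency);
    background is a single dict (base -> frequency).
    Returns a list of (1-based start, site, odds) tuples.
    """
    width = len(motif_freqs)
    scores = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        # Probability of the site under the motif model (product over columns).
        p_motif = prod(col[base] for col, base in zip(motif_freqs, site))
        # Probability of the same residues under the background model.
        p_background = prod(background[base] for base in site)
        scores.append((start + 1, site, p_motif / p_background))
    return scores

def best_site(seq, motif_freqs, background):
    """Return the (start, site, odds) tuple with the highest odds."""
    return max(score_sites(seq, motif_freqs, background), key=lambda s: s[2])

# Hypothetical frequencies, chosen only to illustrate the calculation.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def peaked(base):
    """A column that strongly prefers one base (assumed values)."""
    return {b: (0.7 if b == base else 0.1) for b in "ACGT"}

motif_freqs = [peaked(b) for b in "TATC"]
start, site, odds = best_site("GGTATCAA", motif_freqs, background)
print(start, site, round(odds, 2))
```

Repeating `best_site` for every sequence and then recounting the residues in the chosen sites gives one full cycle of the procedure described above.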

Gibbs Sampling

Gibbs sampling is another statistical method similar in nature to the EM algorithms. Gibbs sampling combines both EM and simulated annealing techniques in order to determine a maximal local alignment of multiple sequences.

The idea behind Gibbs sampling is to determine the most probable pattern common to all of the sequences by sliding them back and forth until the ratio of the motif probability to the background probability is a maximum.

In the first step of Gibbs sampling, the predictive update step, a random start position for the motif is chosen in all of the sequences except one, which is set aside and is selected in either a random or a specified order. The alignment of the randomly assigned motifs is used to calculate the residue frequencies in each position of the motif, and the background frequencies. This is done in a manner similar to the EM algorithm.

In the sampling step, the ratio of the model probability to the background probability is used as the weight for each of the possible motif starting positions in the sequence that was set aside. These weights are normalized by dividing by their sum, resulting in a probability for each motif position. A motif start position is then chosen by random sampling with the given weights. This process is repeated until the residue frequencies in each column do not change. The sampling is then repeated for a different initial random alignment.

Since the Gibbs sampler samples based on the probability rather than taking the maximum probability, it has a way to escape local maxima that might hinder EM algorithms. In order to improve the performance of the Bayesian approach to Gibbs sampling, Dirichlet priors (pseudocounts) are added into the nucleotide counts. Gibbs sampling also employs a shifting routine that takes the current multiple motif alignment and shifts it a few bases to the left or the right, in order to see if only part of the motif is being found. A range of motif sizes can be explored in Gibbs sampling as well. In addition, Gibbs sampling can be extended to search for multiple motifs in the same set of sequences, and to find a pattern in only a fraction of the sequences. Certain model-specific parameters can also be enforced, such as palindromic sequences.

Gibbs Sampler web interface: http://bayesweb.wadsworth.org/gibbs/gibbs.html
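The sampling step described above (weights, normalization, then a random draw) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Gibbs Sampler program's own code: the function names and the frequencies in the usage lines are hypothetical, and pseudocounts and the shifting routine are left out.

```python
import random

def site_probabilities(seq, motif_freqs, background):
    """Normalized model:background weights for every possible start position."""
    width = len(motif_freqs)
    weights = []
    for start in range(len(seq) - width + 1):
        weight = 1.0
        for col, base in zip(motif_freqs, seq[start:start + width]):
            weight *= col[base] / background[base]
        weights.append(weight)
    total = sum(weights)
    # Dividing by the sum turns the weights into probabilities.
    return [w / total for w in weights]

def sample_start(seq, motif_freqs, background, rng=random):
    """Draw a 0-based start position at random, in proportion to its weight."""
    probs = site_probabilities(seq, motif_freqs, background)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical frequencies, used only to exercise the functions.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif_freqs = [{b: (0.7 if b == best else 0.1) for b in "ACGT"} for best in "TATC"]

probs = site_probabilities("GGTATCAA", motif_freqs, background)
chosen = sample_start("GGTATCAA", motif_freqs, background, rng=random.Random(0))
```

Because the start position is drawn at random rather than taken as the maximum, a lower-weight position is occasionally chosen, which is what lets the sampler leave a local maximum.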

Hidden Markov Models

Hidden Markov models are statistical models that can take into account various probabilities. We will come back and talk about Hidden Markov models in greater detail later in the class.

Position Specific Scoring Matrix (PSSM)

Position Specific Scoring Matrices incorporate information theory in order to measure how much information is contained within each column of a multiple alignment. The information contained within a PSSM is a logarithmic transformation of the frequency of each residue in the motif.

Pseudocounts

One problem with creating a model of a sequence alignment that is then used to search databases is that there is a bias towards the training data. For instance, one column in a motif may contain a completely conserved residue. However, such an occurrence will make it highly unlikely to detect a new member of the family that doesn’t have the same residue in that position. In addition, the residues found in a specific column may not be highly representative of the family as a whole, especially if a small training set is used. In order to get around these problems, the idea of pseudocounts is introduced in order to estimate the probabilities. So now the estimated probability is changed from a frequency of counts in the data to the following form:

$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$

where $P_{ca}$ is the probability of seeing residue a in column c; $n_{ca}$ is the count of residue a in column c; $b_{ca}$ is the pseudocount for residue a in column c; $N_c$ is the number of residues in column c; and $B_c$ is the number of pseudocounts in column c. These probabilities are then converted into a log-odds form (usually $\log_2$, so the information can be reported in bits) and placed in the PSSM.

In order to search a sequence against a PSSM, the value for the first residue in the sequence occurring in the first column is looked up in the PSSM. Similarly, the value for the residue occurring in each of the following columns is looked up. These values are added (since they are logarithms) to produce a summed log-odds score, S. This score can be converted to an odds score using the formula $2^S$. The odds scores for the motif beginning at each position can be summed together and normalized to produce a probability of the motif occurring at each location.

Information theory can give an appreciation for the amount of information contained within each sequence. When there is no information contained within a column, the amount of uncertainty can be measured as $\log_2 20 = 4.32$ bits for amino acids, since there are 20 amino acids. For nucleic acid sequences, the amount of uncertainty can be measured as $\log_2 4 = 2$ bits. PSSMs can be used in order to reduce the uncertainty. If only one amino acid is found in a particular column, then the uncertainty is 0, since there is only one choice. If there are two amino acids occurring with equal probability, then there is some uncertainty in deciding which residue it is. The amount of uncertainty for a particular column is measured as the entropy, as introduced previously:

$$H_c = -\sum_{\text{residues } a} f_{ca} \log_2(p_{ca})$$

the uncertainty for the whole PSSM can be calculated as a sum over all columns:

$$H = \sum_{\text{all columns } c} H_c$$
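The pseudocount estimate and the log-odds scoring described above can be sketched in Python. This is a minimal illustration under stated assumptions: the same pseudocount is used for every residue (the notes allow $b_{ca}$ to differ by residue and column), the alphabet is DNA, and the function names and example sites are hypothetical.

```python
from math import log2

ALPHABET = "ACGT"

def column_probabilities(column, pseudocount=1.0):
    """P_ca = (n_ca + b_ca) / (N_c + B_c), with b_ca equal for every residue."""
    n_c = len(column)                      # N_c: residues observed in the column
    b_c = pseudocount * len(ALPHABET)      # B_c: total pseudocounts in the column
    return {a: (column.count(a) + pseudocount) / (n_c + b_c) for a in ALPHABET}

def build_pssm(sites, background, pseudocount=1.0):
    """One dict of log2-odds scores per motif column."""
    pssm = []
    for c in range(len(sites[0])):
        probs = column_probabilities([s[c] for s in sites], pseudocount)
        pssm.append({a: log2(probs[a] / background[a]) for a in ALPHABET})
    return pssm

def score(pssm, window):
    """Summed log-odds score S for a window as wide as the PSSM."""
    return sum(col[res] for col, res in zip(pssm, window))

# Hypothetical aligned sites and a uniform background.
sites = ["TATC", "TATC", "TATC", "GATC"]
background = {a: 0.25 for a in ALPHABET}
pssm = build_pssm(sites, background)

s = score(pssm, "TATC")
odds = 2 ** s   # convert the log-odds score back to an odds score
```

Thanks to the pseudocounts, a residue never seen in a column (for example C in the first column) still receives a finite, negative score instead of an infinitely bad one.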

Sequence Logos

One way to look at a particular PSSM is to view it visually. Sequence logos are one way to do so, by illustrating the information in each column of a motif. Such a graph can indicate which residues and which columns are the most important as far as sequence conservation is concerned. The height of the logo is calculated as the amount by which uncertainty has been decreased. In addition to the entropy measure given before, a relative entropy measure could be calculated as well. Relative entropy takes into account not only the data in the columns of the motif, but also the overall composition of the organism being studied. Relative entropy can be measured as:

$$R_c = \sum_{\text{residues } a} f_{ca} \log_2\left(\frac{p_{ca}}{b_a}\right)$$

where $b_a$ is the background frequency of residue a in the organism. If the frequency in the column is less than the frequency in the background, then a negative information value can be computed, which is shown by an inverted character in the logo.
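The entropy and relative entropy measures can be checked numerically with a short sketch. As a simplifying assumption, a single distribution is used for each column (that is, the observed frequency and the estimated probability of a residue are taken to be the same value), and the function names are illustrative only.

```python
from math import log2

def column_entropy(probs):
    """Uncertainty of one column in bits; zero-probability residues contribute 0."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)

def relative_entropy(probs, background):
    """Relative entropy of one column against the background composition."""
    return sum(p * log2(p / background[a]) for a, p in probs.items() if p > 0)

def information_content(probs, alphabet_size):
    """Decrease in uncertainty: maximum uncertainty minus column entropy."""
    return log2(alphabet_size) - column_entropy(probs)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_way = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}

print(column_entropy(uniform))               # no information: 2 bits of uncertainty
print(column_entropy(conserved))             # one choice: no uncertainty
print(information_content(two_way, 4))       # one bit gained
print(relative_entropy(conserved, uniform))  # fully conserved vs. uniform background
```

With a uniform background the relative entropy equals the information content; the two measures differ only when the organism's base composition is skewed.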

Sequence Editors and formatters

Sequence editors allow the user to take a given multiple alignment and manually fix it. Examples of sequence editors include:

• CINEMA: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/kit.html
• GeneDoc: http://www.psc.edu/biomed/genedoc/
• MACAW: http://ncbi.nlm.nih.gov/pub/schuler/macaw
• BoxShade: http://www.ch.embnet.org/software/BOX_form.html

Biological sequence data formats

IUPAC Codes

In order to standardize sequence data, the Nomenclature Committee of the International Union of Biochemistry and the International Union of Pure and Applied Chemistry has established a standard code to represent bases, including those that are uncertain or ambiguous. The code, often referred to as the IUPAC code, is as follows:

A = adenine

C = cytosine

G = guanine

T = thymine

U = uracil

R = G A (purine)

Y = T C (pyrimidine)

K = G T (keto)

M = A C (amino)

S = G C

W = A T

B = G T C

D = G A T

H = A C T

V = G C A

N = A G C T (any)

Any other character besides the ones listed above (with the exception of the gap character ‘-’) represents an error that nearly all sequence analysis programs will not tolerate.

Standard Amino Acid Code

In addition to the nucleic acid codes, standard single-letter and three-letter amino acid codes have been formulated by IUPAC as well. The table for this code is as follows:

1-letter  3-letter    Description
A         Ala         Alanine
R         Arg         Arginine
N         Asn         Asparagine
D         Asp         Aspartic acid
C         Cys         Cysteine
Q         Gln         Glutamine
E         Glu         Glutamic acid
G         Gly         Glycine
H         His         Histidine
I         Ile         Isoleucine
L         Leu         Leucine
K         Lys         Lysine
M         Met         Methionine
F         Phe         Phenylalanine
P         Pro         Proline
S         Ser         Serine
T         Thr         Threonine
W         Trp         Tryptophan
Y         Tyr         Tyrosine
V         Val         Valine
B         Asx         Aspartic acid or Asparagine
Z         Glx         Glutamine or Glutamic acid
X         Xaa or Xxx  Any amino acid
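Since most programs reject characters outside these codes, it can be useful to check a sequence before submitting it. The sketch below is one possible way to do so; the function name and the choice to report 1-based positions are assumptions made for illustration, and the two character sets are taken directly from the nucleotide list and the amino acid table above.

```python
# Codes from the IUPAC nucleotide list and the amino acid table above.
NUCLEOTIDE_CODES = frozenset("ACGTURYKMSWBDHVN")
AMINO_ACID_CODES = frozenset("ARNDCQEGHILKMFPSTWYVBZX")

def invalid_characters(sequence, allowed, gap="-"):
    """Return (1-based position, character) for every character that is
    neither an allowed code nor the gap character. Case is ignored."""
    return [
        (pos, ch)
        for pos, ch in enumerate(sequence.upper(), start=1)
        if ch != gap and ch not in allowed
    ]

print(invalid_characters("ACGTN-RY", NUCLEOTIDE_CODES))   # clean sequence
print(invalid_characters("ACGTJ", NUCLEOTIDE_CODES))      # J is not a nucleotide code
print(invalid_characters("MLTAEEKO", AMINO_ACID_CODES))   # O is not in the table above
```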

FASTA

Fasta sequence format is one of the most basic and widespread sequence formats. A sequence in fasta format has as its first line a descriptor beginning with a ‘>’ character. The following lines contain the sequence (either nucleotide or amino acid) using standard one-letter symbols. This format is extremely useful for sequence analysis programs, since it is devoid of numerical and nonsequence characters (with the exception of the newline character).

Example Fasta Sequence:

>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]

MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF

SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL

AHRYH

Note the first line begins with ‘>’, which in this case is followed by gi, indicating that the next field surrounded by ‘|’ will be the GenBank identifier. Following the GenBank identifier is the keyword ‘ref’, indicating the next field will be the reference for the version of this sequence. The final field is the description. Note that nearly all sequence based programs will treat anything following the ‘>’ as a comment and disregard it (or only use it as a sequence descriptor). There are, however, a few sequence analysis programs that expect the sequences to be in a strict fasta format.

GenBank

GenBank is the National Center for Biotechnology Information’s nucleic acid and protein sequence database. It is the most widely used source of biological sequence data. GenBank file format contains information about the sequence, including literature references, functions of the sequence, locations of various features, etc. The information in GenBank records is organized into fields, each with an identifier that is left-justified in the first column. Some identifiers have additional subfields. The actual sequence data lies between the identifier ORIGIN and the ‘//’ which signals the end of a GenBank record.

Example GenBank Sequence:

LOCUS HBB 145 aa linear MAM 22-JAN-2003

DEFINITION hemoglobin, beta [beta globin] [Bos taurus].

ACCESSION NP_776342

VERSION NP_776342.1 GI:27819608

DBSOURCE REFSEQ: accession NM_173917.1

KEYWORDS .

SOURCE Bos taurus (cow)

ORGANISM Bos taurus

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea;

Bovidae; Bovinae; Bos.

REFERENCE 1 (residues 1 to 145)

AUTHORS Duncan,C.H.

JOURNAL Unpublished (1991)

COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final

NCBI review. The reference sequence was derived from M63453.1.

FEATURES Location/Qualifiers

source 1..145

/organism="Bos taurus"

/db_xref="taxon:9913"

/chromosome="15"

/map="15q22-q27"

/tissue_type="thymus"

/dev_stage="newborn"

Protein 1..145

/product="hemoglobin, beta [beta globin]"

Region 3..145

/region_name="Globin"

/note="globin"

/db_xref="CDD:pfam00042"

CDS 1..145

/gene="HBB"

/coded_by="NM_173917.1:53..490"

/db_xref="LocusID:280813"

ORIGIN

1 mltaeekaav tafwgkvkvd evggealgrl lvvypwtqrf fesfgdlsta davmnnpkvk

61 ahgkkvldsf sngmkhlddl kgtfaalsel hcdklhvdpe nfkllgnvlv vvlarnfgke

121 ftpvlqadfq kvvagvanal ahryh
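Because the sequence always lies between ORIGIN and ‘//’, it can be pulled out of a record with very little code. The following is a minimal sketch, not a full GenBank parser: the function name is hypothetical, the sample text is an abbreviated copy of the record above with the terminating ‘//’ added, and all other fields are simply skipped.

```python
def sequence_from_genbank(text):
    """Return the sequence found between the ORIGIN line and the '//' line."""
    residues = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True          # sequence data starts on the next line
            continue
        if line.startswith("//"):
            break                     # end of the GenBank record
        if in_origin:
            # Keep letters only; this drops the position numbers and spaces.
            residues.extend(ch for ch in line if ch.isalpha())
    return "".join(residues).upper()

# Abbreviated copy of the record shown above (most fields omitted).
GENBANK_SAMPLE = """\
LOCUS       HBB 145 aa linear MAM 22-JAN-2003
DEFINITION  hemoglobin, beta [beta globin] [Bos taurus].
ORIGIN
        1 mltaeekaav tafwgkvkvd evggealgrl lvvypwtqrf fesfgdlsta davmnnpkvk
       61 ahgkkvldsf sngmkhlddl kgtfaalsel hcdklhvdpe nfkllgnvlv vvlarnfgke
      121 ftpvlqadfq kvvagvanal ahryh
//
"""

protein = sequence_from_genbank(GENBANK_SAMPLE)
print(len(protein), protein[:10])
```

The extracted sequence is the same 145 residues shown in the Fasta example for this protein.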

ASN.1

Abstract Syntax Notation One (ASN.1) is a formal description language that has been developed to encode various data such that it can be easily exchanged across computer systems. ASN.1 format is highly structured and detailed, and contains all of the information found in the other formats.

Seq-entry ::= set {

level 1 ,

class nuc-prot ,

descr {

source {

genome genomic ,

org {

taxname "Bos taurus" ,

common "cow" ,

db {

{

db "taxon" ,

tag

id 9913 } } ,

orgname {

name

binomial {

genus "Bos" ,

species "taurus" } ,

lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora;

Bovoidea; Bovidae; Bovinae; Bos" ,

gcode 1 ,

mgcode 2 ,

div "MAM" } } ,

subtype {

{

subtype chromosome ,

name "15" } ,

{

subtype map ,

name "15q22-q27" } ,

{

subtype tissue-type ,

name "thymus" } ,

{

subtype dev-stage ,

name "newborn" } } } ,

user {

type

str "RefGeneTracking" ,

data {

{

label

str "Status" ,

data

str "Provisional" } ,

{

label

str "Assembly" ,

data

fields {

{

label

id 0 ,

data

fields {

{

label

str "accession" ,

data

str "M63453.1" } ,

{

label

str "gi" ,

data

int 162741 } } } } } ,

{

label

str "Related" ,

data

fields {

{

label

id 0 ,

data

fields {

{

label

str "accession" ,

data

str "X00376.1" } ,

{

label

str "gi" ,

data

int 395 } } } } } ,

{

label

str "Unknown" ,

data

fields {

{

label

id 0 ,

data

fields {

{

label

str "accession" ,

data

str "X03248.1" } ,

{

label

str "gi" ,

data

int 319 } } } } } } } ,

pub {

pub {

gen {

cit "Unpublished" ,

authors {

names

std {

{

name

name {

last "Duncan" ,

initials "C.H." } } } } ,

date

std {

year 1991 } } } ,

comment "simple staff_entry" } ,

update-date

std {

year 2003 ,

month 1 ,

day 22 } } ,

seq-set {

seq {

id {

other {

accession "NM_173917" ,

version 1 } ,

gi 27819607 } ,

descr {

molinfo {

biomol mRNA } ,

title "Bos taurus hemoglobin, beta [beta globin] (HBB), mRNA" ,

create-date

std {

year 2003 ,

month 1 ,

day 22 } } ,

inst {

repr raw ,

mol rna ,

length 821 ,

strand ss ,

seq-data

ncbi2na '11F9F784416EF47241C440484539E1E78A20A796D165FFAA42B80BA382F

AEB8A57A929E7AFB7157A1D22BDFE2D7FAA1FB51E78E7BCE0415C2B82953A420AE723D7F2C3A4E

09376385D0A917F9E678B89E47B8C2793BA35E207D09D7A906E72EBEE7A7643FE90A0F455AE792

9E1FD20AEBA7AEE94395E9512334F09D5FD79FD4A02BFFD35D225408F833A003CE0BBFE24DE977

970C084FCFF4F91EBB3F03CFD1EDDF1D23A913A8A900478213020382A72D885F8803334B37E855

38492EBEC0C9E3BCE80129FE75F25F1DD5F020F40'H } ,

annot {

{

data

ftable {

{

data

gene {

locus "HBB" ,

db {

{

db "LocusID" ,

tag

id 280813 } } } ,

location

int {

from 0 ,

to 820 ,

strand plus ,

id

gi 27819607 } } } } } } ,

seq {

id {

other {

accession "NP_776342" ,

version 1 } ,

gi 27819608 } ,

descr {

molinfo {

biomol peptide } ,

title "hemoglobin, beta [beta globin] [Bos taurus]" ,

create-date

std {

year 2003 ,

month 1 ,

day 22 } } ,

inst {

repr raw ,

mol aa ,

length 145 ,

seq-data

ncbieaa "MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKV

KAHGKKVLDSFSNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVA

NALAHRYH" } ,

annot {

{

data

ftable {

{

data

prot {

name {

"hemoglobin, beta [beta globin]" } } ,

location

whole

gi 27819608 } ,

{

data

region "Globin" ,

comment "globin" ,

location

int {

from 2 ,

to 144 ,

id

gi 27819608 } ,

ext {

type

str "cddScoreData" ,

data {

{

label

str "definition" ,

data

str "Globin" } ,

{

label

str "short_name" ,

data

str "globin" } ,

{

label

str "score" ,

data

int 327 } ,

{

label

str "evalue" ,

data

real { 813255, 10, -37 } } ,

{

label

str "bit_score" ,

data

real { 130091, 10, -3 } } } } ,

dbxref {

{

db "CDD" ,

tag

str "pfam00042" } } } } } } } } ,

annot {

{

data

ftable {

{

data

cdregion {

frame one ,

code {

id 1 } } ,

product

whole

gi 27819608 ,

location

int {

from 52 ,

to 489 ,

id

gi 27819607 } } } } } }

Sample ASN.1 file

SwissProt

XML File Format

Databases

• GenBank
• DDBJ
• EMBL
• SwissProt
• BLOCKS
• PFAM

Using Entrez

Complete Process:

1) Determine sequences to align (Globins)

>sp|P02023|HBB_HUMAN Hemoglobin beta chain - Homo sapiens (Human), Pan troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>sp|P02062|HBB_HORSE Hemoglobin beta chain - Equus caballus (Horse).
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>sp|P01922|HBA_HUMAN Hemoglobin alpha chain - Homo sapiens (Human), Pan troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>sp|P01958|HBA_HORSE Hemoglobin alpha chains (Slow and fast) - Equus caballus (Horse).
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>sp|P02185|MYG_PHYCA Myoglobin - Physeter catodon (Sperm whale) (Physeter macrocephalus).
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA Globin V - Petromyzon marinus (Sea lamprey).
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>sp|P02240|LGB2_LUPLU Leghemoglobin II - Lupinus luteus (Yellow lupine).
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

2) Determine multiple alignment (ClustalW)

>sp|P02023|HBB_HUMAN

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQR

FFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTF

ATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA

GVANALAHKYH------

>sp|P02062|HBB_HORSE

--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQR

FFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTF

AALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVA

GVANALAHKYH------

>sp|P01922|HBA_HUMAN

---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKT

YFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNAL

SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA

SVSTVLTSKYR------

>sp|P01958|HBA_HORSE

---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKT

YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGAL

SNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLS

SVSTVLTSKYR------

>sp|P02185|MYG_PHYCA

---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLE

KFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAEL

KPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALE

LFRKDIAAKYKELGYQG

>sp|P02208|GLB5_PETMA

PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE

FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKL

RDLSGKHAKSFQVDPQYFKVLAAVIADTVAAG---------DAGFEKLMS

MICILLRSAY-------

>sp|P02240|LGB2_LUPLU

--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKD

LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

KNLGSVHVSKGVAD-AHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYD

ELAIVIKKEMNDAA---

3) View alignment using various methods

4) Find Blocks in the alignment (BLOCKS)

• Profile: scores for substitutions and gaps in each column
• Blocks: ungapped aligned regions
• Alignments based on locally conserved patterns found in the same order in the sequences (synteny)
• Use of statistical methods and probabilistic models of the sequences
• Multiple sequence alignments yield information about the evolutionary history of the sequences: sequences that are most similar are likely to be recently derived from a common ancestor sequence
• If the sequences in a multiple alignment have quite a bit of variation, then it is difficult to create a multiple sequence alignment due to the different combinations of substitutions, insertions, and deletions that can be used

READ MOUNT, Chapters 2 and 7

CECS 694-02 Introduction to Bioinformatics

Lecture 5: Searching Sequence Databases

Multiple Alignment format

In addition to storing individual sequences in a specified format, the results from a multiple sequence alignment can be stored in a specified format as well. Various programs (including the BLOCKS server) can then read in these multiple sequence alignments and perform analysis on them. The most widely used multiple sequence alignment file formats are: FASTA, GCG Multiple Sequence Format, and ALN.

FASTA Format

In Fasta Format, each sequence in the multiple alignment starts with a Fasta description line (beginning with a ‘>’). Following the description line is the sequence data. The gap character ‘-’ is found in locations corresponding to gaps in the sequence when the multiple alignment was created.

>JC2395

NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----

-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD

IAEEIQAM

>KPEL_DROME

MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----

-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN

AMRLIKDY

>FASA_MOUSE

NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----

-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR

TLDKFQDM
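A gapped Fasta alignment like the one above can be read with a short parser. This is a minimal sketch under stated assumptions: the function name is hypothetical, the record name is taken to be the first word after ‘>’, and the toy alignment is invented for illustration rather than copied from the example.

```python
def parse_fasta(text):
    """Return {name: sequence}, joining the lines that follow each '>' line."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith(">"):
            name = line[1:].split()[0]    # first word of the description line
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

# Invented toy alignment (not taken from the example above).
TOY_ALIGNMENT = """\
>seq1 toy record
AC-GT
AC
>seq2
ACGGT
-C
"""

alignment = parse_fasta(TOY_ALIGNMENT)
# Removing the gap characters recovers the original, unaligned sequences.
ungapped = {name: seq.replace("-", "") for name, seq in alignment.items()}
print(alignment)
print(ungapped)
```

In a valid alignment file every sequence, gaps included, has the same length, which makes a convenient sanity check after parsing.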

Stockholm Format

The Stockholm format is described at http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html

# STOCKHOLM 1.0

#=GF ID CBS

#=GF AC PF00571

#=GF DE CBS domain

#=GF AU Bateman A

#=GF CC CBS domains are small intracellular modules mostly found

#=GF CC in 2 or four copies within a protein.

#=GF SQ 67

#=GS O31698/18-71 AC O31698

#=GS O83071/192-246 AC O83071

#=GS O83071/259-312 AC O83071

#=GS O31698/88-139 AC O31698

#=GS O31698/88-139 OS Bacillus subtilis

O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS

#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777

O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY

#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE

O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS

#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH

O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE

#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH

#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH

O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE

#=GR O31699/88-139 AS *

#=GR O31699/88-139 IN

GCG Multiple Sequence Format

!!AA_MULTIPLE_ALIGNMENT 1.0

msf MSF: 131 Type: P 22/01/02 CompCheck: 3003 ..

Name: IXI_234 Len: 131 Check: 6808 Weight: 1.00




//

1 50

IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT

IXI_235 TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT

IXI_236 TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT

IXI_237 TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT

51 100

IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG

IXI_235 GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG

IXI_236 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G

IXI_237 GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G

101 131

IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE

IXI_235 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE

IXI_236 SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE

IXI_237 SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE

PileUp

MSF: 92 Type: P Check: 1886 ..

Name: JC2395 oo Len: 92 Check: 8870 Weight: 35.3

Name: FASA_MOUSE oo Len: 92 Check: 527 Weight: 64.6

Name: KPEL_DROME oo Len: 92 Check: 2489 Weight: 41.2

//

JC2395 .NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA

FASA_MOUSE .NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA

KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

JC2395 EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM

FASA_MOUSE EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM

KPEL_DROME ASN.EFLNIW GGQYNHT..V QTLFALFKKL KLHNAMRLIK DY

ClustalW ALN Format

CLUSTAL W (1.82) multiple sequence alignment

JC2395 -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59

FASA_MOUSE -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59

KPEL_DROME MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59

: * *. : :::* :: .::::* :. :. : .: ::* *

JC2395 YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91

FASA_MOUSE YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91

KPEL_DROME GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89

.:.:: * *: ::* : ::

CLUSTAL W(1.4) multiple sequence alignment

IXI_234 TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT

IXI_235 TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT

IXI_236 TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT

IXI_237 TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT

IXI_234 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG

IXI_235 GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG

IXI_236 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G

IXI_237 GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G

IXI_234 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E

IXI_235 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E

IXI_236 SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E

IXI_237 SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E

Phylip

3 92

JC2395 -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA

FASA_MOUSE -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA

KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM

EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM

ASN-EFLNIW GGQYNHT--V QTLFALFKKL KLHNAMRLIK DY

PIR Format

>P1;JC2395

-NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW

YQSHGKTGACQALIQGLRKANRCDIAEEIQAM

*

>P1;FASA_MOUSE

-NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW

YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM

*

>P1;KPEL_DROME

MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW

GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY

*

GDE

%JC2395

nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy

qshgktgacqaliqglrkanrcdiaeeiqam

%FASA_MOUSE

nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy

qshgksdayqdlikglkkaecrrtldkfqdm

%KPEL_DROME

--mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni

wggqynhtvqtlfalfkklklhnamrlikdy

Nexus

#NEXUS

BEGIN DATA;

dimensions ntax=3 nchar=91;

format missing=?

symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"

interleave datatype=PROTEIN gap= -;

matrix

JC2395 NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE

FASA_MOUSE NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE

KPEL_DROME --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG

JC2395 QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM

FASA_MOUSE QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM

KPEL_DROME RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY

;

end;

General Feature Format (GFF)

The general feature format was developed so that annotations could be readily parsed by a number of programs to quickly determine the location of various features. Example uses of GFF include importing data into ACE formats for quick feature viewing, and creating sequence images complete with features. The format is described at http://www.sanger.ac.uk/Software/formats/GFF/

A description of multiple alignment formats is given on the BLOCKS server page: http://www.blocks.fhcrc.org/blocks/help/blocks_format.html

The sequence formats used by EMBL are found at: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html

Sequence Conversion Programs

• SeqIO
• ReadSeq

Searching Sequence Databases

Sequence Similarity Searches

The most common type of search is to compare a single query sequence against a database. Such a search is typically performed to gather information on the potential function of a gene. This is done by comparing the search results, and the functions of the sequences that are related on a sequence similarity level. Such a search can be expanded to find sequences more distantly related (at least on the sequence level) to the query sequence. Such sequence similarity searches can yield information concerning related proteins that may lead to the discovery of a family that can then be characterized, and perhaps multiply aligned and profiled.

With all sequence searches, it is important to consider the sensitivity and the selectivity of the algorithms. Sensitivity refers to the ability to find most of the related members (reduction of false negatives), while selectivity refers to the ability to detect only members of the family you are interested in studying (reduction of false positives). This is important to keep in mind when interpreting alignment results and assigning a function to a sequence, since this assignment may be given through transitive relationships.

DNA versus protein searches

It is much easier to determine patterns of sequence similarity between protein sequences than DNA sequences, because DNA sequences have only four potential characters per position, while amino acid sequences have 20. To illustrate, consider a sequence of length four. With DNA sequences, such a sequence has a chance of $1/4^4 = 1/256$ of aligning at random. With protein sequences, this would be $1/20^4 = 1/160{,}000$. In addition, since multiple codon sequences code for the same amino acid, it is possible that the translated amino acid sequences could be identical, yet the underlying nucleic acids could be different.
For instance, consider the following sequences:

ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA

blast2sequences reports that no significant alignment is found! If we look at an ungapped alignment between these two sequences, we get:

ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
|||||  | || ||    ||  | || || || | |
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA

which gives 21 identical residues out of 36, for a percent identity of 58%. However, translation of both of these sequences yields the same protein sequence:

MELVISALIVE

The two translated sequences are 100% identical. Therefore, if the nucleotide region we are searching with is known to lie in a protein coding region, it is beneficial to translate the DNA sequence into a protein sequence. Generally, both the target (or database) and the query sequences are translated in all six reading frames and compared to one another. (Recall that there are three reading frames in the forward direction, and three in the reverse complement.) Rather than four comparisons (target forward and reverse complement against query forward and reverse complement), there are now thirty-six comparisons to be made. Therefore, while translating the sequences into proteins will lead to better results, the search will take approximately nine times as long.

Scoring matrices

Most database searching utilities allow the user to choose among various scoring systems. At one time, the default scoring matrix for amino acid searches was the Dayhoff PAM250 matrix. In most instances, PAM250 has been replaced by the BLOSUM62 matrix, since the BLOSUM matrices were based on more sequence data.
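The DNA-versus-protein example above can be checked with a short script. This is only a sketch: the codon table is the standard genetic code written in the conventional TCAG order, U is treated as T, and the function names translate and identity are made up for this illustration.

```python
# Standard genetic code, indexed by codon in TCAG order (an assumption of
# this sketch: nuclear standard code, no alternative translation tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def normalize(seq):
    """Upper-case the sequence and treat RNA U as DNA T."""
    return seq.upper().replace("U", "T")


def translate(seq):
    """Translate reading frame 1, stopping at the first stop codon."""
    seq = normalize(seq)
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def identity(a, b):
    """Return (matches, length) for an ungapped alignment of equal-length sequences."""
    a, b = normalize(a), normalize(b)
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)), len(a)


SEQ_A = "ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA"
SEQ_B = "ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA"

matches, length = identity(SEQ_A, SEQ_B)
print(f"DNA identity: {matches}/{length}")
print("Protein A:", translate(SEQ_A))
print("Protein B:", translate(SEQ_B))
```

The nucleotide comparison counts identical columns only, while the protein comparison sees through synonymous codons, which is why the translated search is the more sensitive of the two in coding regions.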

FASTA

FASTA was the first rapid search method developed for database searching. It uses a heuristic algorithm to speed up the process of locating similar regions. Unlike dynamic programming (DP), FASTA is not guaranteed to find the optimal solution. However, the search is roughly 50 times faster than DP solutions.

FASTA Algorithm

Step 1: In the initial stage of searching for regions of similarity, FASTA uses a hashing approach. For each of the sequences being compared, a table is constructed showing the positions of each word of length k, or k-tuple. The relative position (offset) of each word shared by the two sequences is calculated by subtracting its position in the first sequence from its position in the second. Words having the same offset are in phase and reveal a longer region of alignment between the two sequences.

Step 2: The ten regions with the highest density of identities are identified. The ends of each region are trimmed to include only residues contributing to the highest score. Each resulting region is now a partial alignment without gaps, and each is given a score (the init1 score).

Step 3: If there are several initial regions with scores greater than a cutoff value, check whether the trimmed initial regions can be joined to form an approximate alignment with gaps. A similarity score is calculated as the sum of the init1 scores of the joined regions minus a penalty for each gap (the initn score).

Step 4: Construct a Needleman-Wunsch optimal alignment of the query sequence and the library sequence, considering only those residues that lie in a band 32 residues wide, centered on the best initial region found in step 2 (the opt score).

In summary, after locating the k-tuples and grouping the ones with the same offset together, an optimization step is invoked to piece together k-tuple alignments allowing gaps. Using this approach, the search time increases linearly with the size of the query and target sequences. Compared to the polynomial (quadratic) increase with dynamic programming, FASTA is a much faster alternative, particularly as the sequence size increases. For DNA and RNA sequences, the typical size of the k-tuple in the FASTA algorithm is 4-6, while for protein sequences it is 1 or 2. The larger the k-tuple, the faster FASTA will run, but the less thorough it will be in finding regions of similarity.
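The hashing stage (the k-tuple table and the offset calculation) can be sketched in a few lines. This is a simplified illustration, not the FASTA source code: it only counts shared k-tuples per offset and does not trim, score, or join regions. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(first, second, k):
    """Count shared k-tuples per offset.

    The offset is (position in second) - (position in first), as in the
    text; k-tuples with the same offset lie on the same diagonal.
    """
    first_table = ktuple_positions(first, k)
    offsets = Counter()
    for j in range(len(second) - k + 1):
        for i in first_table.get(second[j:j + k], ()):
            offsets[j - i] += 1
    return offsets


hits = diagonal_hits("GATTACA", "TTGATTACAGG", k=3)
# The offset with the most hits marks the most promising diagonal.
print(hits.most_common(1))
```

Because the table is built once and each position of the second sequence is looked up directly, the work grows roughly linearly with sequence length, which is the source of the speed advantage over full dynamic programming.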
Significance of FASTA scores

In order to determine the significance of an alignment between a query sequence and a target database, FASTA calculates the u and lambda parameters of the extreme value distribution, which vary with the length and the composition of the sequences being compared. The z-score for each score is calculated as follows:

1) The average score for database sequences in the same length range is determined.
2) The average score is plotted against the logarithm of the average sequence length in each length range.
3) The points are then fitted to a straight line by linear regression.
4) A z-score, the number of standard deviations from the fitted line, is calculated for each score.
5) High-scoring, presumably related sequences, and also very low scoring alignments that do not fit the straight line, are removed from consideration.
6) Steps 1-5 are repeated one or more times.
7) The known statistical distribution of alignment scores is used to calculate the probability that a Z score between unrelated or random sequences of the same lengths as the query and database sequence could be greater than z. These scores follow an extreme value distribution such that (Pearson, 2000 ISMB):

$$P(Z > z) = 1 - \exp\left(-e^{-(1.2825 z + 0.5772)}\right)$$

The expectation of observing a Z score greater than z in a database of D sequences is:

$$E(Z > z) = D \cdot P(Z > z)$$

8) Z scores are then normalized to z' = 50 + 10z, so that an alignment score 5 standard deviations above the mean has a normalized score of 100.
9) The significance of the alignment score between a sequence and a database can be further analyzed by aligning the sequence with a shuffled library.

HISTOGRAM OF FASTA DATA

One of the items reported in the FASTA output is a histogram showing the distribution of the normalized scores obtained when the library is matched with the query sequence. Scores of unrelated sequences are expected to follow the fitted distribution approximately, and any significant matches will fall outside the expected curve. The first column listed in the FASTA score distribution is the z' score, which is a z-score normalized to a mean of 50 and a standard deviation of 10. The second column lists the number of optimized scores found in that range. The third column lists the number of sequences expected to lie within that range, given an extreme value distribution and the calculated values of u and lambda. The "=" signs give an approximate curve for the actual distribution, while the "*" indicates the expected score distribution. The z' scores greater than 120 are considered to be high-scoring alignments.

       opt     E()
< 20   188     0:==
  22     0     0:           one = represents 109 library sequences
  24     0     0:
  26     2     1:*
  28     7    15:*
  30    28    91:*
  32   200   353:== *
  34   841   958:========*
  36  2217  1968:==================*==
  38  3746  3253:=============================*=====
  40  5360  4538:=========================================*========
  42  6055  5547:==================================================*=====
  44  6496  6119:========================================================*===
  46  5820  6232:====================================================== *
  48  5469  5966:=================================================== *
  50  4820  5444:============================================= *
  52  4202  4787:======================================= *
  54  3815  4089:=================================== *
  56  3271  3415:===============================*
  58  2755  2804:=========================*
  60  2268  2271:====================*
  62  1813  1821:================*
  64  1500  1448:=============*
  66  1233  1145:==========*=
  68   951   900:========*
  70   746   706:======*
  72   699   551:=====*=
  74   460   430:===*=
  76   337   335:===*
  78   287   260:==*
  80   244   202:=*=
  82   185   154:=*
  84   115   122:=*
  86   114    95:*=
  88    75    73:*          inset = represents 1 library sequences
  90    70    57:*
  92    48    44:*          :=======================================*
  94    26    34:*          :========================== *
  96    33    26:*          :=========================*=======
  98    14    20:*          :============== *
 100    10    16:*          :========== *
 102     7    12:*          :======= *
 104     6     9:*          :====== *
 106     5     7:*          :===== *
 108     2     6:*          :== *
 110     2     4:*          :== *
 112     1     3:*          := *
 114     0     3:*          : *
 116     0     2:*          : *
 118     0     2:*          : *
>120    27     1:*          :*==========================
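The conversion from a z-score to a probability and an expectation (steps 7 and 8 of the z-score procedure) can be sketched as follows. The constants 1.2825 and 0.5772 are those quoted in the text; the function names are hypothetical, and the library size of 66345 is taken from the E(66345) column heading of the hit list that follows the histogram in the FASTA output.

```python
import math


def z_prob(z):
    """P(Z > z) = 1 - exp(-exp(-(1.2825 z + 0.5772))).

    expm1 is used so that very small probabilities are not rounded to zero.
    """
    return -math.expm1(-math.exp(-(1.2825 * z + 0.5772)))


def z_expect(z, db_size):
    """Expected number of unrelated library sequences scoring above z."""
    return db_size * z_prob(z)


def normalized(z):
    """Normalized score z' = 50 + 10 z."""
    return 50.0 + 10.0 * z


def from_normalized(z_prime):
    """Invert the normalization: z = (z' - 50) / 10."""
    return (z_prime - 50.0) / 10.0


# A hit reported with z' = 373.6 in a library of 66345 sequences.
z = from_normalized(373.6)
print(f"z = {z:.2f}, E = {z_expect(z, 66345):.1e}")
```

Under these formulas a z' of 373.6 against 66345 library sequences gives an expectation of about 3.5e-14, which agrees with the E value listed for MERR_STAAU in the hit list.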

Following the histogram, FASTA reports the Kolmogorov-Smirnov statistic, which gives some information on the deviation between the observed and expected distributions. If the deviation is significant enough, then the alignment should be performed again with different gap penalties.

After the statistics is a list of the best scoring hits. Note that FASTA presents at most one highest scoring hit per sequence, whereas other alignment programs may present many. Listed in the hits section are the description of the sequence, the z' score, the initn, init1, and opt scores, and the E score (calculated as an estimate of the likelihood of a match occurring by chance). Note that the initn score is the extended hit score; the init1 score is the initial hit score; the opt score is the score calculated by stringing together regions with gaps (see Figure 7.2 of Mount for a more in-depth explanation).

The best scores are:                                 initn init1  opt    z-sc E(66345)
MERR_PSEAE mercuric resistance operon regu   ( 144)   928   928   928  1129.8       0
MERR_SHIFL mercuric resistance operon regu   ( 144)   871   871   871  1061.3       0
MERR_SERMA mercuric resistance operon regu   ( 144)   810   810   810   988.1       0
MERR_STAAU mercuric resistance operon regu   ( 135)   292   172   298   373.6 3.5e-14
MERR_BACSR (strain rc607). mercuric resist   ( 132)   241   198   289   363.0 1.4e-13
YHDM_ECOLI hypothetical transcriptional re   ( 141)   175   175   276   347.0 1.1e-12

After the list of the highest scoring hits are the Smith-Waterman alignments between the query and the highest scoring hits. A ':' marks conservation; '.' denotes a conservative substitution:

>>MERR_STAAU mercuric resistance operon regulatory protei (135 aa)
initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14
Smith-Waterman score: 298; 36.923% identity in 130 aa overlap

       10 20 30 40 50 60
MerR   MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA
       . :. .::: :: ::.:.:.::::. : . .. : :.: . ::::.:
MERR_S      MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM
       10 20 30 40 50

       70 80 90 100 110
MerR   QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH
       ..: ::: :: :. . .:: .:.. ... .: :....:. : :.. .: :: :
MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP
       60 70 80 90 100 110

       120 130 140
MerR   ARRGNVSCPLIASLQGGASLAGSAMP
       ... .::.: .:.::
MERR_S DEKAMYTCPIIETLMGGPDK
       120 130

FASTA Programs

FASTA: compares a query protein sequence to a protein sequence library, or a DNA sequence to a DNA sequence library.
TFASTA: compares a query protein sequence to a DNA sequence library, after the DNA sequence library has been translated in all six reading frames.
FASTF: compares a set of ordered peptide fragments, obtained from analysis of a protein by cleavage and sequencing of protein bands resolved by electrophoresis, against a protein database.
TFASTF: compares a set of ordered peptide fragments, obtained from analysis of a protein by cleavage and sequencing of protein bands resolved by electrophoresis, against a DNA database.
FASTS: compares a set of ordered peptide fragments, obtained from mass spectrometry analysis of a protein, against a protein database.
TFASTS: compares a set of ordered peptide fragments, obtained from mass spectrometry analysis of a protein, against a DNA database. Example:
>mgstm1
MGCEN,MIDYP,MLLAY,MLLGY
FASTX, FASTY: compare a query DNA sequence to a protein sequence database, translating the DNA sequence in all six reading frames and allowing frameshifts.
TFASTX, TFASTY: compare a protein sequence to a DNA sequence or DNA sequence library, such that the DNA sequence is translated in all six reading frames, and the protein query sequence is compared to each of the six derived protein sequences. The DNA sequence is translated from one end to the other; termination codons are translated into unknown amino acids.
LALIGN, LFASTA: same as the FASTA program, except that multiple aligning regions may be reported for each sequence.
PLALIGN: dot plot algorithm available through the FASTA suite.
FAST-pat, FAST-swap: compare a sequence to a pattern database.

BLAST

Basic Local Alignment Search Tool

BLAST has supplanted FASTA as the most commonly used database search tool. BLAST was developed as an improvement in speed over the FASTA suite without a sacrifice in sensitivity. The first step of the BLAST algorithm is to locate common words or k-tuples in the query sequence and the target database sequences. However, BLAST does not search for every possible k-tuple; it only considers those that are most significant. For the NCBI BLAST programs, the default word length is 3 for proteins and 11 for nucleic acids. This k-tuple size is referred to as the word length. It is chosen to be long enough that a word score can be high enough to be significant, but not so long that short but significant patterns are missed.

MSP (Maximal Segment Pair): the highest scoring pair of identical length segments chosen from two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP can be of any length.

The number of MSPs with a score greater than a cutoff score S is reported. BLAST minimizes the time spent on sequence regions where the score is unlikely to exceed this cutoff. The main strategy of BLAST is to seek only segment pairs that contain a word pair with a score of at least T. Any such hit is extended to determine whether it is contained within a segment pair whose score is greater than or equal to the cutoff score S.

The scanning phase of BLAST locates the words within the sequences in linear time. One method is to map each possible word to an integer so that it can be used as an index into an array. For instance, if the word size is 4 and amino acids are used, there are 20^4 = 160,000 entries in the array. A second approach is to use a deterministic finite state automaton.

Hit Extension

Initial hits are then examined and extended in either direction until the score falls below a certain threshold. In order to get around the problem of uninformative hits, BLAST stores a list of words that are found much more often than expected at random. Hits to these words are discarded from consideration.

Steps Used by BLAST

1) The sequence is optionally filtered to remove low-complexity regions that will not lead to meaningful sequence alignments.

2) A list of words of the predefined word length (3 for amino acids; 11 for DNA sequences) in the query sequence is made.

3) The query words are evaluated for an exact match with a word in any database sequence, using substitution scores for amino acids, and a +5/-4 scoring scheme for DNA.

4) A cutoff score, called the neighborhood word score threshold (T), is selected to pare down the list of possible matching words to those most likely to result in significant alignments.

5) The procedure is repeated for each word in the query sequence.

6) The remaining high-scoring words for each possible match to a word are organized into an efficient search tree.

7) Each database sequence is scanned for an exact match to one of the words in the search tree, one position at a time. If a match is found, it is used to seed a possible ungapped alignment between the query and database sequences.

8) (UNGAPPED BLAST, VERSION 1.0) In the original BLAST suite of programs, an attempt is made to extend an alignment from the matching words in each direction along the sequences, as long as the score does not drop below a certain threshold. At this point, a larger stretch of sequence, called the HSP (high-scoring segment pair), which has a larger score than the original word, may have been found.

(GAPPED BLAST, VERSION 2.0) In the newer version of BLAST, the neighborhood word threshold T is reduced in order to find shorter matching word hits that can be aligned along the same diagonal.

9) The score of each HSP is compared against a cutoff score S, which is empirically determined.

10) The statistical significance of each HSP is calculated using Karlin-Altschul statistics and the extreme value distribution, as previously discussed with sequence alignments. Recall that the probability of observing a score S greater than or equal to x is given by the equation:

$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$$

where

$$u = \frac{\ln(K m' n')}{\lambda}$$

and m' and n' are the effective lengths of the query and database, such that

$$m' \approx m - \frac{\ln(K m n)}{H}$$

$$n' \approx n - \frac{\ln(K m n)}{H}$$

where H is the average expected score per aligned pair of residues in an alignment of two random sequences; m and n are the lengths of the query and database; K and lambda are parameters calculated based on the sequences and the scoring scheme. These effective, or reduced, lengths are used as a correction factor in order to allow alignments starting near the end of one of the sequences to be detected.

The expectation, E, of seeing a score S >= x in a database of D sequences is approximately given by the Poisson distribution,

$$E \approx 1 - e^{-P(S \ge x)\, D}$$

11) Two or more HSP regions may be combined into a longer alignment region, even though the individual HSPs may result in a lower score.

12) Smith-Waterman type alignments are shown for the query sequence with each of the matched sequences in the database. BLAST-2 can produce alignments with gaps, while BLAST-1 cannot.

13) When the expected score for a given database sequence satisfies the threshold for E, the match score is reported.

BLAST Programs

BLASTP: Compares a protein query sequence against a protein database, allowing for gaps.
BLASTN: Compares a DNA query sequence against a DNA database, allowing for gaps.
BLASTX: Compares a DNA query sequence, translated into all six reading frames, against a protein database, allowing for gaps.
TBLASTN: Compares a protein query sequence against a DNA database, translated into all six reading frames, allowing for gaps.
TBLASTX: Compares a DNA query sequence, translated into all six reading frames, against a DNA sequence database, translated into all six reading frames. TBLASTX does not allow for gaps.

There are a number of different BLAST options. One list of these options and a description of them is available through the WU-BLAST home page: http://blast.wustl.edu/blast/README.html

BAYES BLOCK ALIGNER

Another approach to searching databases is the Bayes Block Aligner. The methodology behind the Bayes Block Aligner is to find all possible blocks located within two sequences. A large number of possible alignments between the two sequences are generated by aligning combinations of blocks. Gaps will be present between the blocks. The Bayes Block Aligner uses Bayesian statistics to derive the posterior probabilities of each alignment, assuming various scoring models and different numbers of blocks. This approach has been shown to locate some weak, yet real, similarities between sequences.

SSAHA

SSAHA stands for Sequence Search and Alignment by Hashing Algorithm. It can align DNA sequences by converting the sequence information into a 'hash table' data structure that can then be searched very rapidly for matches. SSAHA is best suited to problems that involve locating identical or near identical matches. The hash word length is 10 bases by default. Example applications include SNP detection, rapid sequence assembly, and detecting the order and orientation of contigs.

SSEARCH

While dynamic programming algorithms can be painfully slow when searching against large databases, they are more likely to discover sequences that are distantly related to one another. SSEARCH is one program that implements the Smith-Waterman approach to sequence alignment. SSEARCH is part of the FASTA suite of programs (ftp.virginia.edu/pub/fasta). It compares a protein sequence to another protein sequence or sequence database (or a DNA sequence to a DNA sequence or database) using enhanced Smith-Waterman local sequence alignments.
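Returning briefly to the Karlin-Altschul statistics from step 10 of the BLAST procedure, the calculation can be made concrete with a short sketch. The formulas are the ones given in that step (effective lengths, the location parameter u, and the extreme value tail probability). The parameter values for K, lambda, and H, the sequence lengths, and the function names are illustrative assumptions only, not values from the text, and natural logarithms are assumed throughout.

```python
import math


def effective_lengths(m, n, K, H):
    """m' = m - ln(K m n) / H and n' = n - ln(K m n) / H."""
    correction = math.log(K * m * n) / H
    return m - correction, n - correction


def evd_location(m_eff, n_eff, K, lam):
    """u = ln(K m' n') / lambda."""
    return math.log(K * m_eff * n_eff) / lam


def p_value(x, u, lam):
    """P(S >= x) = 1 - exp(-exp(-lambda (x - u)))."""
    return -math.expm1(-math.exp(-lam * (x - u)))


def expectation(p, db_sequences):
    """E = 1 - exp(-p D), the form given in the text."""
    return -math.expm1(-p * db_sequences)


# Illustrative (assumed) parameters: a 250-residue query, a database of
# five million residues, and hypothetical K, lambda, H values.
K, LAM, H = 0.041, 0.267, 0.14
m_eff, n_eff = effective_lengths(250, 5_000_000, K, H)
u = evd_location(m_eff, n_eff, K, LAM)
for score in (40, 60, 80):
    print(score, f"{p_value(score, u, LAM):.3g}")
```

Raising the score x by a fixed amount lowers the tail probability roughly exponentially, which is why a modest increase in HSP score can change a marginal hit into a highly significant one.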

BLAT

BLAT (BLAST-Like Alignment Tool), developed by Jim Kent at UCSC, is used to locate smaller regions of higher identity within genomic assemblies. BLAT on nucleic acids will quickly identify regions at least 95% similar consisting of 40 bases or more. More divergent and shorter sequence alignments may be missed. BLAT on amino acids will find sequences at least 80% similar consisting of at least 20 amino acids.

DNA BLAT works by keeping an index of an entire genome in memory, where the index consists of all non-overlapping 11-mers except those involved in repetitive elements. For the human genome, this corresponds to a little less than a gigabyte of RAM. Protein BLAT works in the same fashion, except that 4-mers are used. The protein index is slightly larger than 2 gigabytes for humans.

BLAT is a very fast tool for localizing highly similar regions. However, distant homologies are not detected. The typical use for BLAT is to localize a specific sequence on a genome. This can be very useful, since the BLAT web interface ties directly to the UCSC GoldenPath genome browser. The BLAT web server is: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human

SEQUENCE FILTERING

Low-Complexity Regions

Low-complexity regions are amino acid or DNA sequence regions that offer very little information due to their highly biased content. Examples of low-complexity regions include histidine-rich domains in amino acids, poly-A tails in DNA sequences, poly-G tails in nucleotides, runs of purines, runs of pyrimidines, runs of a single amino acid, etc. The complexity of a window of size L can be calculated as:

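$$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{\text{all } i} n_i!} \right)$$

where N is the size of the alphabet (4 for DNA, 20 for amino acids) and n_i is the number of times character i occurs in the window.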

A K value of 0 indicates a region of very low complexity; a value near 1 indicates high complexity.

Consider the sequence AAAA:
L is 4, so L! = 4*3*2*1 = 24
nA = 4; nC = nG = nT = 0
So the product of the factorials is 4!*0!*0!*0! = 24
K = 1/4 log4(24/24) = 0, so this is a low-complexity region.

Now consider the sequence ACTG:
L is 4, so L! = 4*3*2*1 = 24
nA = nC = nG = nT = 1
So the product of the factorials is 1!*1!*1!*1! = 1
K = 1/4 log4(24/1) = 0.573
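The two worked examples can be reproduced with a few lines of code. This is a minimal sketch of the complexity formula as stated above; the function name and the default alphabet size of 4 (DNA) are assumptions of the example.

```python
import math
from collections import Counter


def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for one window.

    window        -- the sequence window to score
    alphabet_size -- N in the formula (4 for DNA, 20 for amino acids)
    """
    length = len(window)
    counts = Counter(window.upper())
    denominator = 1
    for n in counts.values():
        denominator *= math.factorial(n)
    # The ratio is a multinomial coefficient, so integer division is exact.
    arrangements = math.factorial(length) // denominator
    return math.log(arrangements, alphabet_size) / length


for window in ("AAAA", "ACTG", "AACC"):
    print(window, round(complexity(window), 3))
```

A masking program built on this idea would slide the window along the sequence and mask positions whose K falls below a chosen threshold. Note that for very short windows K stays well below 1 even when every character is different, as the ACTG example shows.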

Short, periodic repeats

Another possible source of low information is regions of DNA or amino acid sequence containing repeats with a short periodicity (such as 10 bases long). Examples of such sequences commonly found in DNA are short tandem repeats.

Fortunately, there are programs that remove such sequences from a query and target database before it is searched. SEG and PSEG are programs developed at NCBI that mask out low-complexity regions in amino acid sequences. NSEG is the NCBI program that masks out low-complexity regions in nucleic acid sequences. DUST is another program that removes low-complexity regions in DNA sequences. Each of these programs calculates the complexity for a given window using the algorithm defined above and masks out regions based on a given complexity threshold. XNU is a program that locates internal repeats with a short periodicity.

Interspersed Repeats

In addition to short, periodic repeats, genomes are filled with longer interspersed repetitive elements. These can be in the form of short interspersed elements (SINEs), on the order of 300 bases long, or long interspersed elements (LINEs), on the order of 1-2 kb long. There are other classes of interspersed repeats, including mammalian interspersed repeats (MIRs) and other elements that have been transposed and fixed into genomes through viral-like events. Transposable elements are numerous in many plant species, which leads to large genome sizes. Consider the human genome: somewhere around 50% of its composition comes from interspersed repeats. Thus, these regions should be masked out as well.

RepeatMasker is a program that takes a query sequence and compares it against a set of target repetitive element databases. Those regions in the query which match a repetitive element are masked out. Run in its native mode, RepeatMasker calls cross_match, which implements a Smith-Waterman dynamic programming algorithm to locate instances of repetitive elements. Improvements have been made to the RepeatMasker software so that it can use BLAST to reduce the time it takes to locate repeats. This updated version is called MaskerAid. The libraries of repeats that RepeatMasker uses are maintained by the Genetic Information Research Institute in their Repbase database.

Soft versus Hard masking

There are generally two approaches to masking out repetitive elements. In the first approach, called hard masking, the repetitive element sequences are replaced by either N or X characters. The second approach, called soft masking, is becoming more popular: it keeps the sequence data but denotes repetitive portions of sequences with lowercase letters. Soft masking is often preferred because no sequence information is lost. However, with either approach it is important to understand how the sequence analysis software you are using will treat masked regions.

Removal of Vector Sequence

Searching Databases with PSSMs

The method to search a database with a PSSM is very similar to testing whether or not a sequence belongs to the family that the PSSM defines. Every possible position in each database sequence is evaluated as a possible match by sliding the PSSM along one sequence at a time. Positions with high scores are the best matches, and can be quickly identified. EXAMPLES: BLOCKS Server; MAST Server (p323 for more)

Searching Databases with Regular Expressions

Certain databases (such as ProSite) allow the database to be searched using a regular expression.

PSI-BLAST

PSI-BLAST (position specific iterated BLAST) is a newer version of BLAST designed to take an initial query sequence and find similar sequences, which can then be multiply aligned to create a scoring matrix that is used to search the database for even more matches. Any additional sequences found can then be added to the multiple alignment. This process of iteratively building the multiple alignment continues until the user is satisfied with the search results. Of course, caution should be used with PSI-BLAST, since the algorithm is greedy in the sense that the most recently added sequences influence which sequences are found in the next round.
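The sliding-window PSSM search described in the section on searching databases with PSSMs can be sketched as follows. This is an illustration under simple assumptions: the PSSM is represented as a list of per-column dictionaries mapping a residue to its score, residues missing from a column score 0, and the toy matrix values are invented for the example.

```python
def scan_pssm(pssm, seq):
    """Score every window of seq against the PSSM.

    pssm -- list of dicts, one per column, mapping residue -> score
    seq  -- sequence to scan
    Returns a list of (start_position, score) pairs.
    """
    width = len(pssm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Residues absent from a column score 0 (an assumption of this sketch).
        score = sum(column.get(residue, 0) for column, residue in zip(pssm, window))
        hits.append((start, score))
    return hits


def best_hit(pssm, seq):
    """Return the (start_position, score) pair with the highest score."""
    return max(scan_pssm(pssm, seq), key=lambda hit: hit[1])


# Toy three-column matrix with made-up scores.
TOY_PSSM = [
    {"A": 2, "C": -1},
    {"C": 2, "A": -1},
    {"G": 2},
]

print(best_hit(TOY_PSSM, "TTACGTT"))
```

Searching a database amounts to repeating this scan for each database sequence and keeping the positions whose scores exceed a chosen threshold.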

PHI-BLAST

PHI-BLAST (pattern hit initiated BLAST) functions in the same manner as PSI-BLAST, except that the query sequence is first searched for a complex pattern, or regular expression, provided by the user. The subsequent search for similar sequences is then focused on regions containing the pattern. One example of a regular expression that might be used is:

[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]

Reference Books: BLAST, by Ian Korf, Mark Yandell, and Joseph Bedell (2003).
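The PHI-BLAST pattern shown above is written in PROSITE-style syntax, which can be converted into an ordinary regular expression. The converter below is a sketch that handles only a subset of that syntax (single residues, x wildcards, bracketed alternatives, braced exclusions, and repeat counts); the function name and the test peptide are made up for the illustration.

```python
import re

# One pattern element: a residue, x, [ABC] or {ABC}, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PATTERN = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
regex = prosite_to_regex(PATTERN)
print(regex)

# A made-up peptide that satisfies the pattern.
hit = re.search(regex, "MKLGEAGLAAAAARSAALASQ")
print(hit.start(), hit.group())
```

Once converted, the pattern can be applied to any sequence with the standard regular expression tools, which is essentially what a pattern-based database search does before the alignment step.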

Sequence Databases

Major Sequence Repositories

Many of the applications in computational biology and bioinformatics are based on the analysis of nucleotide and protein sequences. There are three major repositories that contain all of the known nucleotide and protein sequences. They all share their information with each other through the International Nucleotide Sequence Database Collaboration. These three repositories are:

DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk.embl.html
GenBank: http://www.ncbi.nlm.nih.gov/

Currently, GenBank contains over 28 billion nucleotide bases, representing over 22 million sequences in over 100,000 species. This represents a large amount of data to be stored! Looking at the growth of GenBank over the past 20 years, we can see the explosion of sequence data, particularly in the last five years.

Image source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Genome Databases

Nucleotide sequence information has also been organized into genome databases. One of the most widely used resources of genomic data is the UCSC Genome Browser, which contains genome assemblies and annotation for the rat, mouse and human genomes. Another widely used resource is the Ensembl genome browser. Other genome databases are listed below, along with many other genomic resources not shown here.

Ensembl Genome Browser: http://www.ensembl.org
UCSC Genome Browser: http://genome.ucsc.edu/
WormBase (C. elegans and C. briggsae worm genomes): http://www.wormbase.org/
AceDB (C. elegans, S. pombe, and H. sapiens genomes): http://www.acedb.org/
Comprehensive Microbial Resource (95 completed microbial genomes): http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
FlyBase (Drosophila melanogaster genome sequence): http://flybase.bio.indiana.edu/
HIV Sequence Database: http://hiv-web.lanl.gov/
MOsDB (rice genome database): http://mips.gsf.de/gams/rice/index.jsp
MGD (Mouse Genome Database): http://www.informatics.jax.org/
Rat Genome Database: http://rgd.mcw.edu/
Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
ArkDB (genome databases for animals): http://thearkdb.org/

Gene Databases

Once a genome is in place, it is desirable to study the regions that make a particular organism what it is. Much of this information is located in the genic regions of the organism. Several databases of genes and related structures exist. Perhaps the largest such database is the RefSeq database curated at NCBI. This data set contains a non-redundant collection of naturally occurring molecules. These are typically given as mRNA sequences for which various information is known. For instance, an mRNA could be studied and annotated well enough that it is known to correspond to a genic region. Alternatively, it could be a predicted mRNA, where the prediction is based either on computational methods or on the mapping of EST sequences onto the region.

Other gene and gene structure databases include: AllGenes: Human and mouse gene index integrating gene, transcript and protein annotation; ASAP: Alternatively Splicesd isoforms of genes; ExInt: exon-intron structures of genes; IDB/IEDB: intron sequence and evolution; SpliceDB: Canonical and non-canonical mammalian splice sites; GDB and GenAtlas: Human genes and geonomic maps; HS3D: Human exon, intron and splice regions; RefSeq: NCBI Reference Sequence Project http://www.ncbi.nlm.nih.gov/RefSeq/ AllGenes: http://www.allgenes.org GDB http://www.gdb.org/ GenAtlas: http://www.citi2.fr/GENATLAS/ Genew (Approved gene names): http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl ASAP: Alternatively spliced genes http://www.bioinformatics.ucla.edu/ASAP ExInt: http://intron.bic.nus.edu/sg/exint/exint.html IDB/IEDB: http://nutmeg.bio.indiana.edu/intron/index.html SpliceDB: http://genomic.sanger.ac.uk/spldb/SpliceDB.html HS3D: http://www.sci.unisannio.it/docenti/rampone/ SNP Resources In human sequences, single base changes are thought to occur approximately once every 2000 bases between individuals. While this may not seem like a lot, that still leads to over 1.6 million SNPs in the human population. SNPs play an important role in differentiation, but can also be the cause of disease (one example is sickle-cell anemia). Databases to locate and characterize single nucleotide polymorphisms are available for use. These include dbSNP; SNP Consortium database; rSNP Guide: Single nucleotide polymorphisms in regulatory gene regions; dbSNP: database of single nucleotide polymorphisms http://www.ncbi.nlm.nih.gov/SNP/ SNP Consortium database: http://snp.cshl.org/ rSNP Guide: http://util.bionet/nsc.ru/databases/rsnp.html EST Resources ESTs are expressed sequence tags, which are partial copies of mRNA found within a particular cell. Information from ESTs can be used to tell the splicing patterns of genes, the occurrence of genes, etc. 
• dbEST http://www.ncbi.nlm.nih.gov/dbEST/
• Gene Resource Locator (Alignment of ESTs with finished human sequence) http://grl.gi.k.u-tokyo.ac.jp
• HUNT: Annotated human full-length cDNA sequences http://www.hri.co.jp/HUNT/
• Sputnik: Annotation of clustered plant ESTs: http://mips.gsf.de/proj/sputnik
• STACK: non-redundant, gene-oriented clusters: http://www.sanbi.ac.za/Dbases.html
• TIGR Gene Indices: non-redundant EST clusters: http://www.tigr.org/tdb/tgi.shtml

• UniGene: non-redundant EST clusters: http://www.ncbi.nlm.nih.gov/UniGene/

Binding Sites, Promoters, etc.

Besides locating genes within the genome, it is important to understand the signaling mechanisms that an organism employs in order to turn a gene on or off. Databases of various factors such as promoters and transcription factor binding sites are available. Various databases include:

• DBTBS: Bacillus subtilis binding factors and promoters http://elmo.ims.u-tokyo.ac.jp/dbtbs/
• EPD: Eukaryotic POL II Promoters http://www.epd.isb-sib.ch/
• PromEC: E. coli mRNA promoters http://bioinfo.md.huji.ac.il/marg/promec
• TRANSFAC: Transcription factors and binding sites http://transfac.gbf.de/TRANSFAC/index.html

Protein Databases

The central dogma states that DNA is transcribed into RNA, which in turn is translated into protein. Since genes code for proteins, it is important to store known information about proteins in databases. There are many different protein databases, many of them dealing with specific protein families. Databases for curated proteins include:

• InterPro: Protein families and domains http://www.ebi.ac.uk/interpro
• EXProt: proteins with experimentally verified functions: http://www.cmbi.nl/exprot
• Protein Information Resource (PIR): http://pir.georgetown.edu/
• SWISS-PROT/TrEMBL curated protein sequences: http://www.expasy.ch/sprot

Protein Sequence Motifs (Domains)

In addition to individual proteins, we can have families of proteins defined by conserved regions called motifs or domains. Databases that store this information include:

• BLOCKS (Multiple alignments of conserved regions) http://blocks.fhcrc.org/
• CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• eMOTIF: http://motif.stanford.edu/emotif/
• Pfam: http://www.sanger.ac.uk/Software/Pfam/
• PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
• ProDom: http://www.toulouse.inra.fr/prodom.html
• PROSITE: http://www.expasy.org/prosite
• ProtoMap: http://protomap.cornell.edu

Structure Databases

After a protein sequence has been created, it takes on a three dimensional structure. Various structure databases exist that contain proteins where the structure is known, typically through NMR and X-ray crystallography. Some of the larger structure databases include:

• ASTRAL http://astral.stanford.edu/
• PDB http://www.pdb.org/
• SCOP http://scop.mrc-lmb.cam.ac.uk/scop
• MMDB http://www.ncbi.nlm.nih.gov/Structure/

Gene Expression Databases (Microarray experiments; etc)

Once the location and sequence of genes is known, the next step is to determine their function. Various biological experiments can be performed on gene data, including the newer microarray technology which we will cover in class. Databases containing the results of this experimental data are available. Included might be experimental images, analysis of results, etc. Examples of experimental Gene Expression and Metabolic pathway databases are:

• ArrayExpress http://www.ebi.ac.uk/arrayexpress
• BodyMap http://bodymap.ims.u-tokyo.ac.jp/
• HugeIndex http://hugeindex.org/
• Mouse Atlas and Gene Expression Database: http://genex.hgu.mrc.ac.uk/
• NetAffx http://www.affymetrix.com/
• Stanford Microarray Database http://genome-www.stanford.edu/microarray/
• KEGG http://www.genome.ad.jp/kegg/
• Klotho http://www.ibc.wustl.edu/klotho/
• MetaCyc http://ecocyc.org/

Disease Databases

After the function of genes is known, those genes involved in disease are classified. Mutational databases include:

• OMIM: http://www.ncbi.nlm.nih.gov/Omim/
• OMIA: http://www.angis.org.au/omia/
• HGMD: http://www.hgmd.org/
• Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Literature References

After all this work has been done, there needs to be a way to search the literature for references to a specific gene, disorder, organism, sequencing project, etc. The most widely used resource in this regard is the PubMed database: http://www.ncbi.nlm.nih.gov/PubMed/

Factors in Considering Biological Databases

What is important as far as databases are concerned:

• Fast retrieval of information
• Ability to store large amounts of data
• Ability to update data – databases provide a moving target
• Choice of paradigm – object oriented or relational?

Storage of Data

Next week – Dr. Chang will talk about storage of data

• GenBank – flat file format
• Ensembl – MySQL ports
• XML ports of databases as well

DISCUSS ORACLE PAPER: http://otn.oracle.com/oramag/oracle/03-jan/o13science.html

Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) are probabilistic models for studying sequences of symbols. In particular, HMMs can model matches, mismatches, insertions and deletions of symbols. Hidden Markov Models are deeply rooted in speech recognition. In speech recognition, the problem is to determine which phonemes (or words) have been spoken in a particular time frame. Consider the difficulty. Everyone you meet has a different voice. Everyone speaks with a slight variation, which might be caused by an accent, the person having a cold, or differences in physiological development. However, humans are able to distinguish what the speaker is saying. The idea behind speech recognition is to take in a spoken word and to try to fit it to a specific model of possible words. This may in fact be close to what the brain does – just think about the Sprint PCS commercials!

Problems in sequence analysis are similar. For instance, given an amino acid sequence, we may want to determine the protein family to which it belongs. Here the amino acid sequence can be treated like the speech signal in a given frame, and the amino acids can be treated like the phonemes.

Markov Chain

A Markov chain is a probabilistic model that generates a sequence in which the probability of a symbol depends upon the previous symbol. A traffic light is an example of a Markov chain. A Markov chain can be used to model a random DNA sequence, with four states, A, C, G, and T, one for each letter in the alphabet. From any given state, there is a transition to another state with an associated probability called a transition probability. An example Markov chain can be drawn as follows:

[Figure: Markov chain with one state for each nucleotide (A, C, G, T), plus start and END states]

The key property of a Markov chain is that the probability of a symbol S at position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not on the entire previous sequence. Since the probability of a symbol is dependent upon the previous symbol, a prime example for the use of Markov chains is in the detection of CpG islands, which are rich in the dinucleotide CG. The process of methylation in biological systems will typically convert the nucleotide C to a T with a high probability when a CG dinucleotide is encountered. As a result, there will be an overabundance of the dinucleotide TG, and an underabundance of the dinucleotide CG. If we ignore the start and end states for now, we can see that there are sixteen different transitions. A study of regions of genomic DNA has determined normal genomic transition probabilities to be the following, where the FROM node is labeled along the rows to the left, and the TO node is labeled along the columns above:

        A       C       G       T
A     0.300   0.205   0.285   0.210
C     0.322   0.298   0.078   0.302
G     0.248   0.246   0.298   0.208
T     0.177   0.239   0.292   0.292

The model shown above can then assign these weights to the edges of the graph. In some regions of the genome, such as the promoter region of genes, methylation is suppressed. In these regions, the dinucleotide CG is found in greater quantities. In fact, the nucleotides C and G are found to a greater degree than elsewhere in the genome. A study of regions of genomic DNA where CpG islands exist has determined the transition probabilities to be the following:

        A       C       G       T
A     0.180   0.274   0.426   0.120
C     0.171   0.368   0.274   0.188
G     0.161   0.339   0.375   0.125
T     0.079   0.355   0.384   0.182

A new model just like the one above can have its transition probabilities assigned according to the new table. Now we have two different models: the first where CpG islands are absent, and the second where CpG islands are present.

Let’s call the first model the non-CpG model and the second model the CpG model. Given a new sequence, how would we determine whether it belongs to the non-CpG model or the CpG model? Remember, the key property of a Markov chain is that the probability of a symbol S at position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not on the entire previous sequence. Therefore, to find the probability that a sequence fits a model, you would multiply all of the conditional probabilities:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)$$

which can be rewritten as:

$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$

where $a_{x_{i-1} x_i}$ is the transition probability from the residue at position i-1 to the residue at position i.

Let’s consider for now that in the non-CpG model, P(A) = P(T) = 0.3 and P(C) = P(G) = 0.2, so that A and T are more probable. In the CpG model, consider P(A) = P(C) = P(G) = P(T) = 0.25. Now consider the sequence:

GGCGACG

The probability for this sequence is as follows:

P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)

For the non-CpG model, this can be calculated as:

(0.20)(0.298)(0.246)(0.078)(0.248)(0.205)(0.078) = 0.00000453499

For the CpG model, this can be calculated as:

(0.25)(0.375)(0.339)(0.274)(0.161)(0.274)(0.274) = 0.00010526

Given this information, it is more likely that this sequence fits the CpG model. One thing to note is how quickly the probability approaches zero. This shows the importance of working with log probabilities.

Using Markov models for discrimination

One question that might arise is how different the non-CpG and CpG models are in relation to each other. If they are not different enough, then there is not enough information to determine from which model a particular sequence is derived. In order to test whether we are able to discriminate between the two models, a log ratio is taken for each pair of corresponding entries in the two previous tables to create a third table, where each entry, x, in the new table is equal to: log2(P(x|CpG model) / P(x|non-CpG model)). The resulting table is as follows:

        A        C        G        T
A    -0.740    0.419    0.580   -0.803
C    -0.913    0.302    1.812   -0.685
G    -0.624    0.461    0.331   -0.730
T    -1.169    0.573    0.393   -0.679
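The chain probabilities and the log-odds comparison discussed above can be checked with a short script. This is a minimal sketch, not part of the original lecture: the function names are illustrative, the transition tables and initial probabilities are the ones given in the text, and the score here also includes the ratio of the initial probabilities, which the log-odds table leaves out. Working in log2 avoids the rapid shrinking of the raw product.

```python
import math

# Transition tables from the text: TABLE[from][to]
NON_CPG = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}
CPG = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
# Initial probabilities P(x1) assumed in the text for each model
NON_CPG_START = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
CPG_START = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def chain_log2_prob(seq, start, trans):
    """log2 P(x) = log2 P(x1) + sum over i >= 2 of log2 a_{x_{i-1} x_i}."""
    total = math.log2(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log2(trans[prev][cur])
    return total


def log_odds(seq):
    """Positive values favor the CpG model, negative values the non-CpG model."""
    return (chain_log2_prob(seq, CPG_START, CPG)
            - chain_log2_prob(seq, NON_CPG_START, NON_CPG))


seq = "GGCGACG"
p_non = 2 ** chain_log2_prob(seq, NON_CPG_START, NON_CPG)
p_cpg = 2 ** chain_log2_prob(seq, CPG_START, CPG)
print(f"non-CpG: {p_non:.6e}  CpG: {p_cpg:.6e}  log-odds: {log_odds(seq):.2f} bits")
```

Because the tables are rounded to three decimals, log ratios recomputed from them can differ from the printed log-odds table in the third decimal place.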

Using this log-odds ratio table as the scores, summed over the transitions in a sequence, we can then see that a sequence with a negative total score is more likely to belong to the non-CpG model, while a sequence with a positive total score is more likely to belong to the CpG model.

Hidden Markov Models

The two models we have created can now be used to test which of the two models a sequence belongs to. However, consider the case where we are given a long genomic piece of DNA. How do we determine where the CpG islands are located? Our models cannot answer this question as they currently exist. The solution to this problem is to combine both of our models into a single model where there are small probabilities of switching back and forth between the two models. The problem now becomes more complicated, because there are now two states corresponding to each nucleotide symbol. The difference between a Markov chain and a hidden Markov model is that a hidden Markov model does not have a one-to-one correspondence between the states and the symbols, and therefore it is no longer possible to tell what state the model was in when a particular symbol was emitted. Therefore, the state is “hidden” from us.

In the example of the CpG islands used thus far, only one symbol is emitted at each state. However, consider the example of multiple alignments where any one of a number of amino acids is likely in any given column. In this case, a state of a hidden Markov model could emit a symbol from a given distribution. The probability of emitting a symbol given that you are in a specific state is referred to as the emission probability. Using emission probabilities, the joint probability of seeing an observed sequence x and a path π through the hidden Markov model is equal to:

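$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

where state 0 is both the begin and the end state, so that $\pi_{L+1} = 0$. Note that in this case, $e_{\pi_i}(x_i)$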

is the probability of emitting the residue x found at position i in the sequence when the path is in state πi. This equation is not all that useful in practice, since it is often the case that the path is not known. However, it is important to be able to calculate the most probable path.

Viterbi Algorithm – Most Probable Path

The Viterbi algorithm is a dynamic programming algorithm. The most probable path through the HMM can be calculated recursively. Suppose that $v_k(i)$, the probability of the most probable path ending in state k at observation i, is known for all states k. Then for the next observation $x_{i+1}$, the probability of the most probable path ending in state l is equal to the probability of emitting the symbol $x_{i+1}$ while in state l, multiplied by the maximum, over all previous states k, of $v_k(i)$ times the probability of the transition from k into l:

$$v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$$

Initialization

In the initialization step, before you begin emitting any characters, set the probability of a path of length 0 ending at state 0 to 1, and the probability of all other paths of length 0 ending at all other states equal to 0:

$$v_0(0) = 1; \quad v_k(0) = 0 \text{ for } k > 0$$

Recursion

The recursion step goes through the whole length of the input sequence, one symbol at a time, and calculates the maximum probability for being in a state, l, given the current input i:

$$v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$$

In addition, the recursion step keeps track of a pointer for getting to each state, so that a traceback can be performed to reconstruct the path with the maximum probability:

$$\mathrm{ptr}_i(l) = \operatorname{argmax}_k \left( v_k(i-1)\, a_{kl} \right)$$

Termination

In the termination step, the probability of the sequence and the maximum path is set to be the maximum value at the final position, and the pointer for the maximal path is set at that point, similar to the recursive step, except that a terminating transition, $a_{k0}$, is introduced:

$$P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$$

$$\pi_L^* = \operatorname{argmax}_k \left( v_k(L)\, a_{k0} \right)$$

Traceback

Since pointers were kept for the path with the maximum probability using a recursive dynamic programming approach, traceback continues in a similar fashion to the sequence alignment. For the path with the maximum probability, we start at the final state, and trace back through the set of transitions that led to that state. Then we will recurse back until we get to the first state:

$$\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*) \quad \text{for } i = L, \ldots, 1$$
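The four steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the lecture's own implementation: it works in log space to avoid underflow, folds the begin state into a `start` dictionary (so `start[l]` plays the role of $a_{0l}$), and treats the terminating transition $a_{k0}$ as 1 unless an `end` dictionary is supplied. The two-state model ("+" for CpG island, "-" for background) and all of its probabilities are hypothetical values chosen for the example.

```python
import math


def viterbi(obs, states, start, trans, emit, end=None):
    """Return (log probability, state list) of the most probable path."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    end = end or {k: 1.0 for k in states}  # assumption: no explicit end state
    # Initialization, folded into the first column: v_l(1) = a_0l * e_l(x_1)
    v = [{l: log(start[l]) + log(emit[l][obs[0]]) for l in states}]
    ptr = [{}]
    # Recursion: v_l(i) = e_l(x_i) * max_k(v_k(i-1) * a_kl), in log space
    for x in obs[1:]:
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[-1][k] + log(trans[k][l]))
            back[l] = best
            col[l] = log(emit[l][x]) + v[-1][best] + log(trans[best][l])
        v.append(col)
        ptr.append(back)
    # Termination: include the transition to the end state
    last = max(states, key=lambda k: v[-1][k] + log(end[k]))
    best_logp = v[-1][last] + log(end[last])
    # Traceback: follow the stored pointers from the last position to the first
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return best_logp, path


# Hypothetical two-state model: "+" = CpG island, "-" = background
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}
OBS = "ATTACGCGCG"
logp, path = viterbi(OBS, STATES, START, TRANS, EMIT)
print("".join(path), logp)
```

The pointer table costs one entry per state per position, which is what makes the traceback a simple backward walk instead of a second search.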

Forward Algorithm

In addition to determining the path with the highest probability, it is also necessary to determine the probability of a sequence given a particular HMM. This could be done by summing the probability over all possible paths. However, the number of potential paths increases exponentially with the length of the sequence, so a brute force method is not practical. Luckily, the probability of the observed sequence up to and including a point xi, where the path ends at state πi = k, can be calculated using a dynamic programming approach, similar to the Viterbi algorithm:

$$f_k(i) = P(x_1 \ldots x_i,\ \pi_i = k)$$

The forward algorithm can be described as follows, with an initialization, recursion, and termination step.

Initialization

The initialization step is identical to the Viterbi algorithm, except now the v’s (maximum probable path) are replaced by f’s (overall probability):

$$f_0(0) = 1; \quad f_k(0) = 0 \text{ for } k > 0$$

Recursion

The recursion step goes through the whole length of the input sequence, one symbol at a time, and calculates the overall probability for being in a state, l, given the current input i. In the Viterbi algorithm, this recursive step takes the maximum; in the forward algorithm, we take the sum over all possibilities:

$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination

The termination step is an extension of the recursion step with the difference that a terminating transition is used. The overall probability of the sequence being described by the HMM is then given in the final cell:

$$P(x) = \sum_k f_k(L)\, a_{k0}$$
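A minimal Python sketch of the forward recursion follows, using the same conventions as assumptions: `start[l]` stands for $a_{0l}$, the terminating transition defaults to 1 when no `end` dictionary is given, and the two-state model is hypothetical. Plain probabilities are used for readability; for long sequences the values underflow, so a real implementation would rescale each column or work with logs.

```python
def forward(obs, states, start, trans, emit, end=None):
    """Return (P(x), table) with table[i][k] = f_k(i + 1) = P(x_1..x_{i+1}, pi_{i+1} = k)."""
    end = end or {k: 1.0 for k in states}  # assumption: no explicit end state
    # First column: f_l(1) = a_0l * e_l(x_1)
    f = [{l: start[l] * emit[l][obs[0]] for l in states}]
    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for x in obs[1:]:
        prev = f[-1]
        f.append({l: emit[l][x] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    # Termination: P(x) = sum_k f_k(L) * a_k0
    return sum(f[-1][k] * end[k] for k in states), f


# Hypothetical two-state model: "+" = CpG island, "-" = background
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}
OBS = "ATTACGCGCG"
px, table = forward(OBS, STATES, START, TRANS, EMIT)
print(f"P(x) = {px:.6e}")
```

The only change from the Viterbi recursion is that `max` becomes `sum`, so the cost stays proportional to the sequence length times the square of the number of states.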

Backward Algorithm

While the Viterbi algorithm finds the most probable path through the model, and the forward algorithm finds the probability of fitting the sequence to the model, we might also be interested in calculating the posterior probability that the value emitted at position i came from a particular state k given the observed sequence. Formally, this is written as: P(πi = k | x). Computing it requires the backward variable $b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$, the probability of the remainder of the sequence given that the path is in state k at position i. The posterior is then $P(\pi_i = k \mid x) = f_k(i)\, b_k(i) / P(x)$. The backward algorithm is very similar to the traceback step in pairwise sequence alignment using dynamic programming. We start at the final position and work our way back to the beginning.

Initialization

In the initialization step, the backward value for each state at the final position is assigned the value of the final transition probability:

$$b_k(L) = a_{k0} \text{ for all } k$$

Recursion

The recursion works backwards (i = L-1, ..., 1). Thus, the recursive step is:

$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Termination

The termination step reports back the same value as the forward algorithm, calculated in the reverse direction:

$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
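The backward recursion and the posterior state probabilities it is used for can be sketched as follows. As before, this is a sketch under assumptions: `start[l]` stands for $a_{0l}$, the terminating transition defaults to 1, and the two-state model is hypothetical. The posterior is computed with the standard combination $P(\pi_i = k \mid x) = f_k(i)\, b_k(i) / P(x)$.

```python
def forward(obs, states, start, trans, emit, end=None):
    """Forward table f[i][k] and total probability P(x)."""
    end = end or {k: 1.0 for k in states}
    f = [{l: start[l] * emit[l][obs[0]] for l in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({l: emit[l][x] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    return sum(f[-1][k] * end[k] for k in states), f


def backward(obs, states, trans, emit, end=None):
    """Backward table b[i][k] = P(x_{i+2}..x_L | state k at 0-based position i)."""
    end = end or {k: 1.0 for k in states}  # assumption: no explicit end state
    n = len(obs)
    b = [None] * n
    b[n - 1] = {k: end[k] for k in states}            # initialization
    for i in range(n - 2, -1, -1):                    # recursion, right to left
        b[i] = {k: sum(trans[k][l] * emit[l][obs[i + 1]] * b[i + 1][l]
                       for l in states)
                for k in states}
    return b


def posterior(obs, states, start, trans, emit):
    """P(pi_i = k | x) for every position i and state k."""
    px, f = forward(obs, states, start, trans, emit)
    b = backward(obs, states, trans, emit)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(obs))]


# Hypothetical two-state model: "+" = CpG island, "-" = background
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}
OBS = "ATTACGCGCG"
post = posterior(OBS, STATES, START, TRANS, EMIT)
for symbol, column in zip(OBS, post):
    print(symbol, f"P(+|x) = {column['+']:.3f}")
```

Unlike the Viterbi path, the posterior gives a per-position confidence, which is useful when several paths have nearly the same probability.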

HIDDEN MARKOV MODELS Parameter Estimation

So now the question becomes, how do the model and the associated probabilities get specified in the first place? There are two parts to HMM parameter estimation: the design of the structure (what states there are and how they are connected) and the assignment of the transition and emission probabilities.

Calculation of Probabilities

Estimation when the state sequence is known

The estimation of probabilities when the state sequence is known is a trivial task. Consider a case where we have a multiple alignment from a protein family we want to describe as a Hidden Markov Model. In this case, the transition probabilities and emission probabilities can be calculated using maximum likelihood estimation by taking the observed frequencies of residues in a column, and the observed frequencies of transitioning from one column to the next. In order to deal with insufficient data and overtraining of models, pseudocounts should be incorporated into these maximum likelihood estimations.

Estimation when the Paths are unknown – Baum-Welch

When the state sequence is unknown, it becomes trickier to calculate the probabilities. This is usually done in an iterative fashion, until some stop criterion is reached. In the Baum-Welch algorithm, the transition and emission probabilities are calculated from the expected number of times each transition or emission is used, given the training data. Once a model is in place, its overall log likelihood is computed, and the transition and emission probabilities are calculated again based on the current values.
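The Baum-Welch re-estimation loop described above can be sketched compactly in pure Python. This is an illustrative sketch under several assumptions that are not from the lecture: the begin-state probabilities are held fixed, no end state is modeled, no rescaling is done (so it only suits short sequences), and the two-state starting model and the training sequences are hypothetical.

```python
import math


def forward(obs, S, start, trans, emit):
    f = [{k: start[k] * emit[k][obs[0]] for k in S}]
    for x in obs[1:]:
        f.append({l: emit[l][x] * sum(f[-1][k] * trans[k][l] for k in S) for l in S})
    return f


def backward(obs, S, trans, emit):
    b = [{k: 1.0 for k in S}]  # assumption: no explicit end state, b_k(L) = 1
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * b[0][l] for l in S) for k in S})
    return b


def em_step(seqs, S, alphabet, start, trans, emit, pseudo=0.0):
    """One re-estimation pass; returns new parameters and the old log likelihood."""
    A = {k: {l: pseudo for l in S} for k in S}          # expected transition counts
    E = {k: {a: pseudo for a in alphabet} for k in S}   # expected emission counts
    loglik = 0.0
    for obs in seqs:
        f, b = forward(obs, S, start, trans, emit), backward(obs, S, trans, emit)
        px = sum(f[-1][k] for k in S)
        loglik += math.log(px)
        for i, x in enumerate(obs):
            for k in S:
                E[k][x] += f[i][k] * b[i][k] / px
                if i + 1 < len(obs):
                    for l in S:
                        A[k][l] += (f[i][k] * trans[k][l] * emit[l][obs[i + 1]]
                                    * b[i + 1][l] / px)
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in S} for k in S}
    new_emit = {k: {a: E[k][a] / sum(E[k].values()) for a in alphabet} for k in S}
    return new_trans, new_emit, loglik


def baum_welch(seqs, S, alphabet, start, trans, emit, tol=1e-6, max_iter=50):
    """Iterate until the log likelihood changes by less than tol."""
    history = []
    for _ in range(max_iter):
        trans, emit, loglik = em_step(seqs, S, alphabet, start, trans, emit)
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return trans, emit, history


# Hypothetical starting model and training sequences
S = ["+", "-"]
ALPHABET = "ACGT"
START = {"+": 0.5, "-": 0.5}
trans0 = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.3, "-": 0.7}}
emit0 = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
         "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
SEQS = ["CGCGCGATATAT", "ATTATACGCGCG", "GCGCGCGCTATA"]
trans, emit, history = baum_welch(SEQS, S, ALPHABET, START, trans0, emit0)
print(f"log likelihood: {history[0]:.3f} -> {history[-1]:.3f} "
      f"after {len(history)} passes")
```

The starting parameters are deliberately asymmetric: if both states begin with identical probabilities, the expected counts stay identical and the procedure cannot separate them. Baum-Welch only finds a local maximum, so different starting points can give different models.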

The Baum-Welch algorithm is summarized as follows:

Initialization:
    Pick arbitrary model parameters
Recurrence:
    Set all the transition and emission variables to their pseudocount values
    For each sequence j = 1..n:
        Calculate the forward value for sequence j
        Calculate the backward value for sequence j
        Add the contribution of sequence j to the transition and emission variables
    Calculate the new model parameters using maximum likelihood
    Calculate the new log likelihood of the model
Termination:
    Stop if the change in log likelihood is less than some predefined threshold

HMM Model Structure

Duration Modeling

Complex length distributions can be modeled by introducing several states with transitions between one another:

[Figure: HMM resulting in one or more characters]

[Figure: HMM resulting in six or more sequences]

[Figure: HMM with 2 to 7 emissions]

Silent States

In some cases, it would be nice to be able to skip over states. This is done by the introduction of silent states. Silent states allow for arbitrary deletions of a chain of states. In order to make the model less cluttered, silent states are drawn as circles in the model. This achieves the same effect as forward-connecting transitions from each state to every later state. Using the above HMM model, we can now skip some of the emitting states by traveling through the silent states. The net effect would be a deletion in the emitted sequence.

Insertion States

In addition to skipping over states, it may be necessary to emit residues between matching states. This is done by introducing insert states, which are drawn as diamonds in the HMMs:

HMMs have many different applications in computational biology. Among them are:

• Multiple sequence alignments
• Sequence profiles
• Gene prediction
• Protein structure prediction
• Protein family classification

There are two main programs used to calculate Hidden Markov Models:

• HMMER home page: http://hmmer.wustl.edu
• SAM home page: http://www.cse.ucsc.edu/research/compbio/sam.html

Pairwise Alignments Using HMMs A finite state machine can be created to show the calculation of a pairwise alignment with five states: A start state, a stop state, a match/mismatch state, an insertion state (for sequence 1), and an insertion state (for sequence 2). The finite state machine can be drawn as follows:

[Figure: finite state machine for pairwise alignment with states BEG, M, X, Y, and END]

By assigning probabilities for the transitions, and for the emissions at states M, X, and Y, this finite state machine can be converted into a pair HMM. A pair HMM differs from a standard HMM in the sense that a pair of sequences is emitted in this case. This pair HMM will generate an aligned pair of sequences. Chapter 4 discusses using HMMs to find pairwise alignments; look it over, but we will not discuss it further in class. Rather, we will concentrate on where the real power of HMMs lies: in determining families of sequences based on multiple alignments.

Profile HMMs for sequence families

The power of HMMs in computational biology is to create a model of a sequence family that will allow the determination of the relationship of an individual sequence to that sequence family. Such a relationship would allow you to concentrate on conserved features in the family. Profile HMMs are similar to the sequence profiles already discussed with the multiple alignment programs. The assumption with profile HMMs is that we begin with a multiple alignment that has been correctly calculated. This multiple alignment can then be used to build a model to score potential matches to new sequences. For example, consider the following multiple alignment of globin sequences:

                       10        20        30        40        50        60
               ....*....|....*....|....*....|....*....|....*....|....*....|
consensus    1 SAAQKALVKASWGKVKG------NREELGAEILARLFK-------AYPDTKAYFPKFg-D 46
1ASH         1 ANKTRELCMKSLEHAKVdt--snEARQDGIDLYKHMFE-------NYPPLRKYFKSR--E 49
1ITH_A       3 TAAQIKAIQDHWFLNIKg-----CLQAAADSIFFKYLT-------AYPGDLAFFHKF--S 48
gi 1065933 162 DKESCEVVADSWRLVESrssaaeTSACFGLFVFQRVFS-------KIPMLRPLFG-Ls-E 212
gi 3877400  71 nvyEKELLRRTWSDEFD------NLYELGSAIYCYIFD-------HNPNCKQLFP-Fi-S 115
gi 3877381  15 TDEEVTAIRDVWRRA--------KTDNVGKKILQTLIE-------KRPKFAEYFG-IqsE 58
gi 3874505 230 TCAQIHLVRALWRQVYTt----kGPTVIGASIYHRLCFknvmvkeQMKQVE-LPPKFq-N 283
gi 4098133  39 EDRDALRVLQNAFKL--------DDPELVRRFYAHWFA-------LDASVRDLFP-P--- 79
gi 1707914  18 SPADVK--KHTVESMKAvpv-grDKAQNGIDFYKFFFT-------HHKDLRKFFKGA--E 65
gi 2494780   3 TKDEFDSLLHELDPKIDte---eHRMELGLGAYTELFA-------AHPEYIKKFSRL--Q 50

                       70        80        90       100       110       120
               ....*....|....*....|....*....|....*....|....*....|....*....|
consensus   47 LSTAAALKSSPKFKAHGKKVLGALDEAVKHL---DDDGNLKAALAKLGAR-HAKRG---H 99
1ASH        50 EYTAEDVQNDPFFAKQGQKILLACHVLCATY---DDRETFNAYTRELLDR-HARDHv--H 103
1ITH_A      49 SVPLYGLRSNPAYKAQTLTVINYLDKVVDAL-----GGNAGALMKAKVPS-HDAMG---- 98
gi 1065933 213 SDDVFDLPDNHPVRRHARLFTSILHISVKNVd--ELEAQVAPTVFKYGER-HYRPDitpH 269
gi 3877400 116 KYQGDEWKESKEFRSQALKFVQTLAQVVKNIyhmERTESFLYMVGQKHVK-FADRG---- 170
gi 3877381  59 SLDIRALNQSKEFHLQAHRIQNFLDTAVGSLg-fCPISSVFDMAHRIGQI-HFYRGv--N 114
gi 3874505 284 --------RDNFIKAHCKAVAELIDQVVENL---DHLDNVTGELMRIGRV-HAKVL---- 327
gi 4098133  80 -------DMGAQRAAFGQALHWVYGELVAQr-----AEEPVAFLAQLGRD-HRKYG---- 122
gi 1707914  66 NFGADDVQKSKRFEKQGTALLLAVHVLANVY---DNQAVFHGFVRELMNR-HEKRGvdpK 121
gi 2494780  51 EATPANVMAQDGAKYYAKTLINDLVELLKAS---TDEATLNTAIARTATKdHKPRN---- 103

                      130       140       150       160       170
               ....*....|....*....|....*....|....*....|....*....|....
consensus  100 VDPANFKLFGEALL---VVLAEHLg---DFTPEVKAAWDKALDVVADALKSGYR 147
1ASH       104 MPPEVWTDFWKLFE---EYLGKKT----TLDEPTKQAWHEIGREFAKEINKHGR 150
1ITH_A      99 ITPKHFGQLLKLVG---GVFQEEFs---ADPTTVAAWGDAAGVLVAAMK----- 141
gi 1065933 270 MTEENVRVFCAQIV---CTVFDFLrd-tEATPKCAESWIELMRYLGQKLLDGFD 319
gi 3877400 171 FKHEYWDIFQDAME---FALEHRLsimtDLDDNQKRDAVTVWRTLALYTTVHMR 221
gi 3877381 115 FGADNWLVFKKVTV---DQVTTGTt---DSSKEKEDtnsngtangkvdtdasli 162
gi 3874505 328 RGELTGKLWNTVAEtiiDCTLEWGdr-rCRSETVRKAWALIVAFVIEKIKAGHH 380
gi 4098133 123 VLPTQYDTLRRALY---TTLRDYLg------HPSRGAWTDAVDEAAGQSLNLII 167
gi 1707914 122 LWKIFFDDVWVPFL---ESKGAKLs------GDAKAAWKELNKNFNSEAQHQLE 166
gi 2494780 104 VSGAEFQTGEPIFI---KYFSHVL-----TTPANQAFMEKLLTKIFTGVAGQL- 148

First, consider only the BLOCKS, where there are no insertion and deletion events. For example, we can consider the block of width 5 that is highlighted above (alignment columns 112 to 116):

HAKRG
HARDH
HDAMG
HYRPD
FADRG
HFYRG
HAKVL
HRKYG
HEKRG
HKPRN

As we have shown with the multiple alignment algorithms, a position specific scoring matrix (PSSM) can be calculated for the BLOCK above. (REVIEW CALCULATIONS FOR PSSMs)

Converting a PSSM to a HMM

Converting a PSSM for a block to a HMM is trivial, due to the absence of insertions and deletions. There will be a begin state, and five match states, one for each column. Match states are represented by squares in a diagram of a profile HMM. The transition probabilities between match states in the HMM would be 1 (we can only transition to the next match state), and the emission probabilities for each match state would be calculated based on the PSSM. The diagram for the resulting HMM is shown below:

[Figure: HMM with begin (B) and end (E) states and five match states]

Alignment to this HMM is also trivial, since there is no choice of transitions.

Adding Insert and Delete States to Obtain Profile HMMs

Insertions

For a profile HMM, insertions and deletions are treated separately. In order to handle insertions, where a portion of a new sequence does not match anything else in the model, a new set of insert states is added, denoted by diamonds. Whenever an insertion is possible, a transition is needed from the last match state in a block to the insert state, from the insert state to itself (to allow for insertions of any length), and from the insert state to the first match state of the next block. For instance, if our alignment had been:

HAKVPRG
HAR--DH
HDAV-MG
HYR--PD
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN

We now have an insertion after the third match state. Thus, the HMM now looks like the following:

In this case, the score of a gap of length k is equal to the score of the transition from the match state to the insert state, plus (k-1) times the score of the transition remaining in the insert state, plus the score of the transition from the insert state to the next match state. This can be rewritten as:

$$\log a_{M_j I_j} + (k-1) \log a_{I_j I_j} + \log a_{I_j M_{j+1}}$$
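The insertion score above is linear in the gap length k, which is the same shape as an affine gap penalty. A small sketch makes this explicit. The function name and the three transition probabilities are hypothetical values chosen only for illustration.

```python
import math


def insertion_log_score(k, a_mi, a_ii, a_im):
    """log a_MI + (k - 1) * log a_II + log a_IM for an insertion of k >= 1 residues."""
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)


# Hypothetical transition probabilities for one insert state
A_MI, A_II, A_IM = 0.1, 0.5, 0.5

for k in (1, 2, 3, 4):
    print(k, round(insertion_log_score(k, A_MI, A_II, A_IM), 3))
```

Each additional inserted residue adds the same amount, log a_II, so the self-transition of the insert state acts as the extension cost while the match-to-insert transition acts as the opening cost.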

Deletions Deletions are handled in HMMs by introducing silent states between matches. Deletion states do not emit a residue (thus, the name “silent” state). These states are denoted by circles in a HMM. An example of a HMM emitting a sequence between 0 and 3 residues long is as follows:

[Figure: HMM diagrams with begin (B) and end (E) states]

HMM With Matches, Insertions, and Deletions

A HMM with match states, insertion states, and delete states is referred to as a profile HMM (first introduced by David Haussler, et al and Anders Krogh, et al in the mid-1990s). An example structure of a profile HMM is as follows:

[Figure: example structure of a profile HMM]

Use of Profile HMMs to Generate Pairwise Alignments

Profile HMMs can be used to perform generalized pairwise alignments, similar to the dynamic programming approach, where the emission probabilities are the match/mismatch probabilities or the probabilities of matching two amino acids in a scoring matrix. The transition probability from a match state to an insert or delete state is equivalent to a gap-open score, while the transition probability within an insertion state or between delete states is equivalent to a gap extension penalty.

Deriving profile HMMs from multiple alignments

Multiple sequence alignments can be used to create profile HMMs that act as models describing the consensus sequence of a sequence family.

Basic Profile HMM parameterization

Choosing the length of the model

The choice of the length of the model corresponds to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states. Considering the alignment below, we might consider columns 1, 2, 3, 6, and 7 to be match states, and columns 4 and 5 to be insert states. A simple rule is that columns that are more than half gap characters should be modeled by inserts.

HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN

Assigning the probability parameters

Once the length of the model is chosen, the next problem is to assign the transition and emission probability parameters. The simplest manner of assigning probabilities is to take the frequencies. First, consider the transition frequencies. Note that each state has three possible transitions leading from it. The transition probability from state k to state l can be calculated as:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

where $A_{kl}$ is the number of observed transitions from state k to state l. Similarly, the emission probability of residue a at state k can be calculated as:

$$e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

where $E_k(a)$ is the number of times residue a is observed at state k.
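The column rule and the emission estimate above can be tried directly on the ten-sequence alignment. This is a minimal sketch with illustrative function names. The optional `pseudocount` argument is an addition of the sketch: with 0 it gives the straight frequencies of the formula, and with 1 it adds one to every count, which is Laplace's rule.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The alignment from the text, one sequence per row
ALIGNMENT = [
    "HAKVPRG", "HAR--DH", "HDAV-MG", "HYR---D", "FAD--RG",
    "HFY--RG", "HAK-PVL", "HRKG-YG", "HEKGGRG", "HKP--RN",
]


def match_columns(alignment):
    """1-based columns with at most half gap characters (modeled as match states)."""
    n = len(alignment)
    return [c + 1 for c in range(len(alignment[0]))
            if sum(row[c] == "-" for row in alignment) <= n / 2]


def emission_probs(alignment, column, pseudocount=0):
    """e_k(a) = E_k(a) / sum over a' of E_k(a'), with an optional pseudocount per residue."""
    counts = Counter(row[column - 1] for row in alignment if row[column - 1] != "-")
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


print("match columns:", match_columns(ALIGNMENT))
raw = emission_probs(ALIGNMENT, 1)
smoothed = emission_probs(ALIGNMENT, 1, pseudocount=1)
print("column 1, straight frequencies: H =", raw["H"], " F =", raw["F"])
print("column 1, pseudocount of 1:     H =", smoothed["H"], " F =", smoothed["F"])
```

With straight frequencies, every residue other than H and F gets probability 0 in column 1, so a single unseen residue would make a whole sequence impossible under the model. The pseudocount keeps every emission probability positive.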

As we have discussed before, there is a difficulty in using straight frequencies when considering multiple alignments. The problem is that if not enough sequences are used in the multiple alignment, then some residues will be underrepresented (or not represented at all) and others will be overrepresented. In order to overcome this difficulty, pseudocounts are introduced (much like in problem 3 of the homework). The easiest form of pseudocounts is Laplace’s rule, which adds one to each count. Let’s consider our alignment again:

HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN

First, consider the transitions. From column 1 to column 2, there are 10 transitions to the next match state, 0 transitions to an insertion state, and 0 transitions to a delete state. Using Laplace’s rule, the probabilities would be 11/13, 1/13, and 1/13 for aM1M2, aM1I1, and aM1D2 respectively. The probabilities are the same for the transitions from the second column to the third column. Note that the fourth and fifth columns are insertion columns. Therefore, the next set of match transitions will be from column three to column six (match state 3 to match state 4). There are four matches where columns four and five have gaps (probability of matching: 5/13); there is one where columns 4, 5 and 6 have gaps (probability of deletion: 2/13), and five with an insertion in column 4 and/or column 5 (probability of insertion: 6/13). Remembering that we have a total of five match states, the probabilities can be stored in the following table:

Match state     1       2       3       4       5
MATCH         11/13   11/13    5/13   10/13   11/13
INSERTION      1/13    1/13    6/13    1/13    1/13
DELETION       1/13    1/13    2/13    2/13    1/13

The emission probabilities can be calculated for the match states in a similar fashion. In the first column, there are 9 H’s, and 1 F. Using Laplace’s rule, this becomes 10 H’s, 2 F’s, and 1 each of the remaining 18 amino acids. Therefore, the probabilities are 10/30 H; 2/30 F; 1/30 for each of the remaining 18 amino acids. (Go through this example on the board)

Searching with Profile HMMs

Once the profile HMM is in place, sequences can be searched against the HMM to detect whether or not they belong to the particular family of sequences described by the profile HMM. Using a global alignment, the probability of the most probable alignment and sequence can be determined using the Viterbi algorithm (yielding P(x, π*|M)). In addition, the full probability of a sequence aligning to the profile HMM can be determined using the forward algorithm (yielding P(x|M)). The Viterbi equations specifically designed for profile HMMs are given on page 109, Durbin et al. The forward equations are given on page 110.

Local Alignments to HMMs

To be discussed next week

CECS 694-02 Introduction to Bioinformatics
Lecture 8
Protein Family Classification and Gene Prediction

Hidden Markov Model Programs

HMMER

HMMER (“hammer”) is a profile-HMM package used for protein sequences, currently in version 2.2. Various programs in the HMMER suite can take in a multiple alignment as input and produce a profile HMM based on that alignment. Once an HMM is in place (either a user-defined HMM or one of the Pfam HMMs to be described later), sequences can be searched against the HMM to check for membership in a particular family. In addition, sequences can be emitted from an HMM based on the transition and emission probabilities. The source code for HMMER is freely available for download from the following site: http://hmmer.wustl.edu/

SAM

“Sequence Alignment and Modeling System” is a profile-HMM package used for protein sequences as well. It is very similar to HMMER as far as the functionality is concerned. http://www.cse.ucsc.edu/research/compbio/sam.html

Meta-meme

Meta-meme is a program that creates hidden Markov models of ungapped alignments. The benefit is that there are fewer parameters to be learned in creating the HMMs. Thus, Meta-meme is a motif-based hidden Markov approach. http://metameme.sdsc.edu/

HMMPro

HMMPro is a commercial software package (free academic license) that adds further HMM features, including graphical interfaces, multiple topologies, and multiple training methods. http://www.netid.com/html/hmmpro.html

Protein Family Classification

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Over 73% of all known protein sequences have at least one match to one of 5,193 different protein families. Pfam families are extensively hand curated to ensure greater reliability in results. Profile HMMs have been created to describe each protein family. In general, the HMMs are seeded with a training data set of 50 or more sequences that are multiply aligned. Generally, this alignment is first produced using a multiple alignment program such as Clustal. After the automatic multiple alignment is generated, it is scrutinized by eye and adjusted. This makes it more likely that the seed alignments that produce the profile HMM are correct. After the HMM has been created (using the HMMER suite), additional sequences are added to the family by comparing the HMM against sequence databases. The resulting full alignments with additional family members may look worse than the initial seed alignments.

Pfam families can be broken down into four basic types:

• family – default classification, stating members are related
• domain – structural unit found in multiple protein contexts
• repeat – a domain that in itself is not stable, but when combined with multiple tandem repeats forms a domain or structure
• motif – shorter sequence units found outside of domains

Links to the Pfam software: http://pfam.wustl.edu/ http://www.sanger.ac.uk/Software/Pfam/index.shtml The best way to look at the Pfam families is to jump right into the Pfam program and view some examples: http://pfam.wustl.edu/ Pfam also contains more information concerning the three-dimensional structures of proteins. We will revisit these properties closer to the end of the semester.

ProDom ProDom is a database of all protein domain families automatically generated from the SWISS-PROT and TrEMBL databases. ProDom incorporates Pfam-A families as well as generating new ProDom alignments using the PSI-BLAST program. Let’s look at the example with entry PD000039 http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php

ProSite ProSite is a database of protein domains that can be searched by either regular expression patterns or sequence profiles. The ProSite data can be accessed at: http://www.expasy.org/prosite/

InterPro

InterPro is an integrated resource of protein families, domains, and functional sites, created to combine the data from various protein family sites such as PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs into a single, comprehensive resource.

Current release of InterPro:
5629 Entries
4280 Families
1239 Domains
95 Repeats
15 post-translational modifications

Represented in InterPro are 74% of all proteins within the SWISS-PROT and TrEMBL databases.

Computational Gene Prediction

Locating open reading frames (ORFs)

The simplest method of predicting genic regions is to search for open reading frames (ORFs). We have already discussed open reading frames, which begin with a start (AUG) codon and end with one of three stop codons. In prokaryotic organisms, DNA sequences coding for proteins are generally transcribed into mRNA, which is translated into protein with very little modification. Therefore, in prokaryotes, locating an open reading frame from a start codon to a stop codon gives a strong indication of a protein-coding region. Longer ORFs are more likely to predict protein-coding regions than shorter ORFs. In eukaryotic organisms, the mRNA undergoes processing to remove intronic regions before the protein is translated. Therefore, the genomic sequence corresponding to a gene in a eukaryotic organism may contain stop codons within its intronic regions. This posttranscriptional modification makes gene prediction more difficult.

Locating Homologous Genes

Once a new genomic DNA sequence has been obtained, the first step in gene prediction is to locate homologous genes. This is accomplished by taking the new DNA sequence, translating it in all six reading frames, and comparing it to protein sequence databases. This step will locate known open reading frames, where the translation of the proteins is known. We have shown various algorithms for searching databases, and in particular, options that allow translation in all possible reading frames before searching. (Examples are BLASTX, FASTX, TBLASTN, TFASTX/TFASTY.) It is thought that only about half of the genes can be found by homology searches. The remaining 50% need to be found using other mechanisms.

Locating ESTs/cDNAs

If the organism from which the genomic DNA has been extracted has EST or cDNA sequences available, the next step is to perform a similarity search between the genomic DNA and these ESTs or cDNAs. In addition, it can be helpful to perform similarity searches with ESTs and cDNAs from other organisms as well. This step locates potential exons within genomic sequences. Programs that take into account not only sequence similarity but also intron/exon boundary information (such as acceptor/donor sites and branch points) are listed as follows:

EST-GENOME (http://www.hgmp.mrc.ac.uk/Registered/Option/est_genome.html)
SIM4 (http://pbil.univ-lyon1.fr/sim4.html)
SPIDEY (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/)
Other programs: TAP (Transcript Analysis Program)

Computationally Predicting Genes

The third step is to take the genomic DNA and run it through gene prediction programs to try to locate genes. Various programs exist to predict genes in different organisms, usually basing the methodology on the observed characteristics of known exons, introns, splice sites, and other regulatory sites in known genes. One important aspect to consider is that gene structure varies from one organism to the next, so a program trained on one organism is not generally useful for finding genes in another organism. Methods for computationally predicting genes are generally error prone. We will discuss the important statistics to consider when comparing gene prediction programs.

Gene prediction in prokaryotic organisms

In general, predicting genes in prokaryotic organisms is much easier. This is because prokaryotic genes generally lack introns, and several highly conserved promoter regions are found around the start sites of transcription and translation.

Example: E. coli lexA gene

As an example, consider the E. coli lexA gene, which plays a role in… There are two promoter sites (at positions -10 and -35 from the transcription start site) that mark positions of interaction with the RNA polymerase (which copies the DNA into RNA). There is also a ribosomal binding site on the mRNA product that is complementary to ribosomal RNA. (Note that the ribosome is used to translate the RNA into amino acids.) There are also three potential binding sites for the lexA product in the promoter region, indicating that lexA provides a self-feedback loop. When enough lexA has been produced, these sites will be bound by lexA, signaling the cell to stop creating this protein. In addition, there is an open reading frame that is devoid of introns. Many other prokaryotic genes operate in this same manner, and it is therefore fairly straightforward to locate genes in prokaryotic organisms.

The highly conserved features of prokaryotic genes have made computational gene identification a possibility. One method to detect these genes is to create HMMs based on the gene structures. One such HMM is given in Mount, p50. It suggests that a model can be constructed based on each of the 61 codons in the genetic code, as well as for the start codon and 3 stop codons. Since codon usage and intergenic sequences vary from one organism to the next, a model trained on the genes of one organism may not be useful in detecting genes in a second. The reliability of the model depends on the accuracy of the information used to train the initial model. One program that uses a fifth-order HMM (such that hexamers are important) in modeling E. coli genes is GeneMark.hmm.

GeneMark (http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi)

Interpolated Markov Models (IMMs) address the problem of underrepresented sequences by also looking at shorter patterns (such as pentamers instead of hexamers). Of course, the longer the pattern, the more accurate the prediction. One example of a program using the IMM method is Glimmer.

Glimmer (http://www.cs.jhu.edu/labs/compbio/glimmer.html)

Prediction of Genes in Eukaryotes

The commonly used methods for eukaryotic gene prediction train a computer program to recognize sequences that are characteristic of known exons in genomic DNA sequences. These programs are then used to predict exons in unknown genomic sequences and to connect these exons to produce a gene structure. The patterns used include intron-exon boundaries and upstream promoter sequences. However, in eukaryotes, the signals for these are poorly defined, and therefore cannot be found by the simple pattern-matching techniques used with prokaryotes.

Splice sites

(Image source: http://genio.informatik.uni-stuttgart.de/GENIO/splice/)

Canonical: GT->AG
Alternative: GC->AG
U12 Introns: AT->AC

Splice site prediction:
SPLICEVIEW
SplicePredictor
BCM-SPL
http://www.fruitfly.org/seq_tools/splice.html
http://www.softberry.com/spldb/SpliceDB.html

Gene Prediction Using Neural Networks

Neural networks provide a framework for finding complex yet subtle patterns and relationships among sequences. Grail II provides analyses of protein-coding regions, poly(A) sites, and promoters; constructs gene models; and predicts encoded protein sequences. The underlying algorithm makes a list of the most likely exon candidates, and these are further evaluated using a neural network. Dynamic programming is then applied to define the most probable gene models. Input for Grail II includes several indicators of sequence patterns, including a Markov model for gene recognition and inputs from two additional neural networks that evaluate the region for potential splice sites. One important indicator in Grail II (and in other gene prediction programs as well) is the in-frame 6-mer preference score, since the occurrence of codon pairs in coding regions is not random, while in noncoding regions it is more likely to be random. A higher frequency of 6-mers that are more commonly found in coding regions can be an indicator of the presence of an exon. The neural network used for Grail II is trained using a set of known coding sequences. A schematic for the Grail II system is as follows (Mount, p53):

Gene Prediction Using Dynamic Programming

GeneParser is a program that predicts the most likely combination of exons and introns in a genomic sequence using a combined neural network and DP approach. SEE PAGE 355 FOR A DESCRIPTION

Pattern Discrimination Methods

Discrimination methods are statistical methods used to classify a sequence based on a number of observed sequence patterns, including a 6-mer exon preference score (EPS) and a 3’-flanking splice site score (FSS). The idea behind discrimination analysis is to plot a pair of scores for known sequences against one another, labeling each point as either intron or exon. Based on these data, a discriminating curve is drawn to separate the introns from the exons. The scores are then calculated for the new genomic sequence, and depending on where the scores fall, the region is labeled as either an intron or an exon. (See page 356 for an example.)

Example pattern discrimination methods include:
HEXON
FGENEH
MZEF

Hidden Markov Models

Genscan

Twinscan

Assessing Gene Prediction Programs

Burset and Guigo (1996) published an important paper describing methods for testing gene prediction programs. In this paper, they describe a set of known coding sequences that should be used as data to train the models. In addition, a set of known coding sequences is provided to evaluate the success of the model. The important statistics to look at include:

True Positives (TP): Number of correctly predicted coding regions
False Positives (FP): Number of incorrectly predicted coding regions
True Negatives (TN): Number of correctly predicted non-coding regions
False Negatives (FN): Number of incorrectly predicted non-coding regions

Using these measures, the following can be calculated:

Actual Positives (AP) = TP + FN
Actual Negatives (AN) = TN + FP
Predicted Positives (PP) = TP + FP
Predicted Negatives (PN) = TN + FN
Sensitivity (SN) = TP / (TP + FN) (percentage of coding regions found)
Specificity (SP) = TP / (TP + FP) (percentage of positive predictions that are correct)

These measures can be combined to form a single correlation coefficient:

$$CC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(AN)(PP)(AP)(PN)}}$$
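The measures above are simple enough to compute directly. The following is a minimal sketch, with the correlation coefficient taken in its usual form CC = (TP·TN − FP·FN) / sqrt(AP·AN·PP·PN); the function name prediction_stats and the example counts are hypothetical and are not taken from Burset and Guigo.

```python
import math


def prediction_stats(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    ap = tp + fn  # actual positives
    an = tn + fp  # actual negatives
    pp = tp + fp  # predicted positives
    pn = tn + fn  # predicted negatives
    sn = tp / ap  # fraction of coding regions found
    sp = tp / pp  # fraction of positive predictions that are correct
    cc = (tp * tn - fp * fn) / math.sqrt(ap * an * pp * pn)
    return sn, sp, cc


# Hypothetical counts, chosen only to illustrate the arithmetic.
sn, sp, cc = prediction_stats(tp=80, fp=20, tn=880, fn=20)
print(f"SN = {sn:.3f}, SP = {sp:.3f}, CC = {cc:.3f}")
```

Note that the sketch does not guard against a zero denominator, which occurs when any of AP, AN, PP, or PN is zero.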

Gene Prediction Programs:
GeneFinder http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GeneMark http://genemark.biology.gatech.edu/GeneMark/
GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html
GeneScan http://202.41.10.146/GS.html
Genie http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie
GenScan http://genes.mit.edu/GENSCAN.html
Grail http://compbio.ornl.gov/
MZEF http://argon.cshl.org/genefinder/
MetaGene http://rgd.mcw.edu/METAGENE/
BCM Web Launcher for Gene Predictions http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
Genie http://www.fruitfly.org/seq_tools/genie.html
Grail EXP http://grail.lsd.ornl.gov/grailexp/
HMMGene http://www.cbs.dtu.dk/services/HMMgene/
NetGene2 http://www.cbs.dtu.dk/services/NetGene2/
geneid http://www1.imim.es/geneid.html
Procrustes http://hto-13.usc.edu/software/procrustes/
Genewise
Twinscan http://genes.cs.wustl.edu/

Assessing gene prediction programs:
Burset and Guigo http://www1.imim.es/GeneIdentification/Evaluation/Index.html
Rogic et al., Genome Research 2001

Promoter prediction programs

Burge CB, Karlin S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354.

Burge C, Karlin S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.

Burset M, Guigo R. (1996) Evaluation of gene structure prediction programs. Genomics 34, 353-357.

Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967-974.

Kulp D, Haussler D, Reese MG, Eeckman FH. (1996) A generalized Hidden Markov Model for the recognition of human genes in DNA. ISMB-96, St. Louis, MO, AAAI/MIT Press.

Lukashin A, Borodovsky M. (1998) GeneMark.hmm: new solutions for gene finding. Nucl. Acids Res. 26(4), 1107-1115.

Reese MG, Eeckman FH, Kulp D, Haussler D. (1997) Improved splice site detection in Genie. Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB) 1997, Santa Fe, NM, ACM Press, New York.

Snyder EE, Stormo GD. (1995) Identification of Coding Regions in Genomic DNA. J. Mol. Biol. 248, 1-18.

Snyder EE, Stormo GD. (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucl. Acids Res. 21(3), 607-613.

CECS694-02

Introduction to Bioinformatics Lecture 9

Phylogenetic Prediction

Overview of phylogenetics

Phylogenetic analysis gives insight into how a family of related sequences has been derived during evolution. The evolutionary relationships among the sequences are shown as branches of a tree. The length and nesting of these branches reflect the degree of similarity between any two given sequences. The objective of phylogenetic analysis is to determine the length of the branches and to figure out how the tree should be drawn. Sequences that are the most closely related are drawn as neighboring branches on a tree. Phylogenetic analysis is dependent upon good multiple sequence alignment programs. Given a multiple sequence alignment, phylogenetic analysis tries to group sequences with similar patterns of substitutions in order to reconstruct a phylogenetic tree. For instance, consider two sequences that are related. Given these two sequences, an ancestral sequence can be (partially) derived. With more similar sequences, more information can be gathered toward a correct derivation and evolutionary history.

Uses of phylogenetic analysis

Given a set of genes (such as a family of genes), phylogenetic analysis can help determine which genes are likely to have equivalent functions. It is also used to follow changes occurring in a rapidly changing species such as a virus. Take for instance influenza. By studying the rapidly changing genes through phylogenetic analysis, the next year’s strain can be predicted, and a flu vaccination can be developed. The prediction is not always correct, but it gives a level of protection.

Tree of Life

On one level, it is interesting to understand and study how the evolution of species has occurred. There are many different resources discussing the evolution of species, including the NCBI taxonomy web sites and the University of Arizona’s Tree of Life project. We’ll take a look at both of these web sites in order to get a better appreciation for the evolution of species relative to one another.

NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
Tree of Life http://tolweb.org/tree/

Evolutionary Trees

An evolutionary tree is a two-dimensional graph showing the evolutionary relationship among a set of items being compared. This set can be organisms, genes, or DNA sequences. Each of the units in the set is referred to as a taxon, and each taxon is represented by a distinct unit on the tree. An evolutionary tree is composed of outer branches, or leaves, that represent the taxa, and of nodes and branches representing the relationships among the taxa. Two taxa that are derived from the same common ancestor will share a node in the graph. In general, approaches to designing evolutionary trees attempt to define the length of each branch to the next node according to the number of sequence-level changes that occurred. One thing to be careful of in phylogenetic analysis is that this distance may not be in direct relation to evolutionary time. The assumption of a uniform rate of mutation is known as the molecular clock hypothesis.

Rooted Trees

In a rooted tree topology, one sequence (the root) is defined to be the common ancestor of all of the other sequences. A unique path leads from the root node to any other node, and the direction of the path indicates evolutionary time. The root is chosen by including a sequence from an organism that is thought to have branched off earlier than the other sequences. If the molecular clock hypothesis holds, it is also possible to predict a root. As the number of sequences increases, the number of possible rooted trees increases very rapidly. In some cases, a bifurcating binary tree is the best model to simulate evolutionary events in which one species branches off into two separate species.

Example of a rooted tree:

Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
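To give a sense of how quickly the number of possible trees grows, the sketch below counts bifurcating trees for n sequences. It relies on a standard combinatorial result that is not stated in these notes: there are (2n-3)!! rooted and (2n-5)!! unrooted bifurcating trees, where !! is the product of every second integer down to 1. The function names are illustrative.

```python
def odd_double_factorial(k):
    """Return k * (k - 2) * ... * 3 * 1 for odd k, and 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled sequences."""
    return odd_double_factorial(2 * n - 3)


def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled sequences."""
    return odd_double_factorial(2 * n - 5)


for n in (3, 4, 5, 10, 20):
    print(n, count_rooted_trees(n), count_unrooted_trees(n))
```

For four sequences this gives 15 rooted trees but only 3 unrooted trees, and by ten sequences there are already tens of millions of rooted trees.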

Star Topology (Unrooted Trees) An unrooted tree (sometimes referred to as a star topology) shows the evolutionary relationship among sequences, without revealing the location of the oldest ancestry. There are fewer choices for an unrooted tree than a rooted tree. Example of an unrooted tree:

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg

Methods for Determining Evolutionary Trees

There are three methods used to calculate the tree(s) that best account for the observed variation in a set of sequences. These methods are maximum parsimony, distance, and maximum likelihood.

Maximum Parsimony

Maximum parsimony methods predict the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences. In order to construct a tree using maximum parsimony, a multiple sequence alignment must first be obtained. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This continues for each position in the alignment. Those trees that produce the smallest number of changes overall for all sequence positions are identified. This is a rather time-consuming algorithm that only works well if the sequences have strong sequence similarity.
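The procedure can be sketched for the four-sequence alignment from Mount that is discussed next. This is an illustrative sketch, not the method used by any particular program: a site is treated as informative when at least two different characters each occur in at least two sequences, and the changes on each four-sequence tree are counted with Fitch's set-intersection rule (an assumption, since the notes do not name a counting method). All function names are hypothetical.

```python
from collections import Counter

SEQS = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

# The three unrooted trees for four sequences, written as (a, b | c, d).
TOPOLOGIES = {
    "(1,2 | 3,4)": (1, 2, 3, 4),
    "(1,3 | 2,4)": (1, 3, 2, 4),
    "(1,4 | 2,3)": (1, 4, 2, 3),
}


def informative_sites(seqs):
    """1-based columns where at least two characters each occur at least twice."""
    length = len(next(iter(seqs.values())))
    sites = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in seqs.values())
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos + 1)
    return sites


def quartet_changes(a, b, c, d):
    """Minimum number of changes at one site on the tree (a, b | c, d)."""
    left = {a} & {b} or {a, b}    # character set at the node joining a and b
    right = {c} & {d} or {c, d}   # character set at the node joining c and d
    changes = (a != b) + (c != d)
    if not left & right:          # the internal branch needs a change
        changes += 1
    return changes


sites = informative_sites(SEQS)
scores = {
    name: sum(quartet_changes(*(SEQS[k][p - 1] for k in order)) for p in sites)
    for name, order in TOPOLOGIES.items()
}
print(sites)
print(scores)
```

The tree with the smallest total in scores is the most parsimonious one for this alignment.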

1  A A G A G T G C A
2  A G C C G T G C G
3  A G A T A T C C A
4  A G A G A T C C G

Consider the example above (Mount, 250). There are a total of four sequences, which gives three possible unrooted trees. In this case some sites are informative, and other sites are not. An informative site has at least two different characters, each of which occurs in at least two sequences. Only the informative sites need to be considered.

Possible trees:

[Figure: the three possible unrooted trees for sequences 1-4: (1,2 | 3,4), (1,3 | 2,4), and (1,4 | 2,3)]

In this case, the optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the fewest total changes. For a large number of sequences, the number of trees becomes so large that it might not be possible to examine all of them. Some programs, such as PAUP, add features that allow the user to invoke a heuristic that keeps representative trees that best fit the data. The informative sites in the example alignment are 5, 7, and 9. Let’s go through the possible trees and figure out the number of changes required by each at the informative sites. (SEE THE POWERPOINT PRESENTATION)

One problem with determining evolutionary distance between sequences is that columns representing greater variation dominate the analysis. One way to overcome this problem when determining long branch lengths is to look only at transversion events, which are the most significant base changes (i.e. changes from a purine to a pyrimidine or vice versa). This is referred to as Lake’s method of invariants. LOOK AT THE MITOCHONDRIAL SEQUENCE ANALYSIS ON P 252

Distance Methods

The distance method for construction of phylogenetic trees looks at the number of changes between each pair in a group of sequences to produce a phylogenetic tree of the group. The goal of distance methods is to identify a tree that positions neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible. CLUSTALW uses the neighbor-joining method as a guide to multiple sequence alignments. The PHYLIP suite of programs employs neighbor-joining methods.

Phylip http://evolution.genetics.washington.edu/phylip.html

Distance analysis programs in PHYLIP:
FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining method (no molecular clock assumed) or the unweighted pair group method with arithmetic mean (UPGMA) (molecular clock assumed).

For phylogenetic analysis, the distance score is counted as either the number of mismatched positions in the alignment or the number of sequence positions that must be changed to generate the other sequence. The success of distance methods depends on the degree to which the distances among a set of sequences can be made additive on a predicted evolutionary tree. Consider the alignment:

A  ACGCGTTGGGCGATGGCAAC
B  ACGCGTTGGGCGACGGTAAT
C  ACGCATTGAATGATGATAAT
D  ACACATTGAGTGATAATAAT
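The mismatch-count distance can be computed directly for these four sequences. The following is a minimal sketch; the function name mismatch_distance is illustrative, and the sequences are assumed to be already aligned and of equal length.

```python
from itertools import combinations

ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_distance(s, t):
    """Number of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(s, t) if x != y)


# Distance for every unordered pair of sequences.
distances = {
    (p, q): mismatch_distance(ALIGNMENT[p], ALIGNMENT[q])
    for p, q in combinations(ALIGNMENT, 2)
}
for pair, value in distances.items():
    print(pair, value)
```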

The distances between these sequences can be shown as a table:

     A    B    C    D
A    -    3    7    8
B    -    -    6    7
C    -    -    -    3
D    -    -    -    -

Using this information, an unrooted tree showing the relationship between these sequences can be drawn:

[Figure: unrooted tree in which A (branch length 2) and B (branch length 1) join at one node, C (branch length 1) and D (branch length 2) join at another, and the two nodes are connected by an internal branch of length 4]

Fitch and Margoliash Method

The Fitch and Margoliash method uses a distance table. The sequences are combined in threes to define the branches of the predicted tree and to calculate the branch lengths of the tree.

Example using three sequences:

1) Draw an unrooted tree with three branches originating from a common node and label the ends:

[Figure: unrooted tree with branches a, b, and c leading from a common node to A, B, and C]

2) Calculate the lengths of tree branches algebraically:

     A    B    C
A   --   22   39
B   --   --   41
C   --   --   --

distance from A to B = a + b = 22 (1)
distance from A to C = a + c = 39 (2)
distance from B to C = b + c = 41 (3)

Subtracting (2) from (3) yields:

  b + c =  41
 -a – c = -39
 ____________
  b – a =   2 (4)

Adding (1) and (4) yields 2b = 24; b = 12
so a + 12 = 22; a = 10
10 + c = 39; c = 29

[Figure: the resulting tree, with branch lengths a = 10 to A, b = 12 to B, and c = 29 to C]
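The algebra in the three-sequence example has a closed form: each branch length is half of (the two distances that include that sequence, minus the distance that does not). A minimal sketch follows; the function name three_point_branches is illustrative.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Distances from the three-sequence table: A-B = 22, A-C = 39, B-C = 41.
a, b, c = three_point_branches(22, 39, 41)
print(a, b, c)
```

With these distances the sketch gives a = 10, b = 12, and c = 29, matching the hand calculation.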

Example of Fitch-Margoliash Using Five Sequences

The Fitch-Margoliash algorithm can be extended to more than three sequences. Consider the following table of distances between five separate sequences:

     A    B    C    D    E
A   --   22   39   39   41
B   --   --   41   41   43
C   --   --   --   18   20
D   --   --   --   --   10
E   --   --   --   --   --

Suppose that the initial tree is as follows:

[Figure: initial unrooted tree for sequences A-E. Branches a, b, c, d, and e lead to A, B, C, D, and E; internal branch f joins the A/B node to the C node, and internal branch g joins the C node to the D/E node.]

1) The first step is to locate the most closely related sequences in the distance table. In this case, that would be sequences D and E.

2) Now create a new table by treating the remaining sequences as a composite. For the distance from D to the composite ABC, take the average of the distances from D to each of A, B, and C ((39 + 41 + 18) / 3 = 32.7). For the distance from E to ABC, take the average of the distances from E to each of A, B, and C ((41 + 43 + 20) / 3 = 34.7). The resulting table is as follows:

           D     E     AVG ABC
D          --    10    32.7
E          --    --    34.7
AVG ABC    --    --    --

3) The average distances from D to ABC and E to ABC could also be found by averaging the sum of the appropriate branch lengths:

D to E: d + e = 10 (1)
D to ABC: d + m = 32.7, where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)

Subtracting equation (3) from equation (2) gives d – e = -2. Adding this result to (1) gives 2d = 8, so d = 4. Substituting back gives e = 6.

4) Now treat D and E as a single sequence, and create a new distance table. The distance from A to DE is taken as the average of the distances from A to D and from A to E. The other distances are calculated in a similar fashion. The resulting distance table is:

        A    B    C    (DE)
A      --   22   39   40
B      --   --   41   42
C      --   --   --   19
(DE)   --   --   --   --

5) Identify the most closely related sequences in the table. In this case, it is C and DE. Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.

6) Repeat the process until all lengths have been identified, at which point only a single composite node is left.

Summary of Fitch-Margoliash Algorithm

1) Find the most closely related pair of sequences (A, B).
2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others, and from B to all others.
3) Use these values to calculate the length of the edges a and b.
4) Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.
5) Identify the next pair of related sequences and begin as with step 1.
6) Subtract extended branch lengths to calculate lengths of intermediate branches.
7) Repeat the entire process with all possible pairs of sequences.
8) Calculate predicted distances between each pair of sequences for each tree to find the best tree.
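Steps 1 through 4 of the summary can be sketched as a single joining step, applied here to the five-sequence distance table from the example. This is an illustrative sketch rather than a full Fitch-Margoliash implementation: it performs one join only and uses the simple averages described in the example. The names dist and join_closest_pair are hypothetical.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20,
    ("D", "E"): 10,
}


def dist(table, x, y):
    """Look up a distance regardless of the order of the pair."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def join_closest_pair(table, taxa):
    """One joining step: returns pair, branch lengths, reduced table, reduced taxa."""
    x, y = min(table, key=table.get)  # step 1: most closely related pair
    others = [t for t in taxa if t not in (x, y)]
    # Step 2: average distance from each member of the pair to the composite.
    avg_x = sum(dist(table, x, o) for o in others) / len(others)
    avg_y = sum(dist(table, y, o) for o in others) / len(others)
    d_xy = dist(table, x, y)
    # Step 3: solve bx + by = d_xy, bx + m = avg_x, by + m = avg_y.
    branch_x = (d_xy + avg_x - avg_y) / 2
    branch_y = (d_xy + avg_y - avg_x) / 2
    # Step 4: collapse the pair into a composite and rebuild the table.
    composite = f"({x}{y})"
    reduced = {(p, q): dist(table, p, q) for p, q in combinations(others, 2)}
    for o in others:
        reduced[(o, composite)] = (dist(table, x, o) + dist(table, y, o)) / 2
    return (x, y), (branch_x, branch_y), reduced, others + [composite]


pair, branches, reduced, taxa = join_closest_pair(DIST, TAXA)
print(pair, branches)
print(reduced)
```

For the example table this joins D and E with branch lengths of approximately 4 and 6, and the reduced table contains the distances 40, 42, and 19 from A, B, and C to (DE).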

Neighbor-joining algorithm

The neighbor-joining method is very similar to the Fitch-Margoliash method. The sequences that should be joined are chosen to give the best least-squares estimates of the branch lengths that most closely reflect the actual distances between the sequences. The neighbor-joining method begins by creating a star topology in which no neighbors are joined. The tree is then modified by joining pairs of sequences. The pair to be joined is chosen by calculating the sum of the branch lengths for the corresponding tree. The sum of the branch lengths is calculated as follows:

$$S_{mn} = \frac{1}{2(N-2)} \sum_{i} (d_{im} + d_{in}) + \frac{1}{2} d_{mn} + \frac{1}{N-2} \sum_{i<j} d_{ij}$$

where N is the number of sequences, d is the distance between a pair of sequences, and i, j represent all sequences except m and n, with i < j.
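As a check on the formula, the sum of branch lengths can be evaluated for every candidate pair. The sketch below assumes the formula reads S_mn = sum(d_im + d_in) / (2(N-2)) + d_mn / 2 + sum(d_ij) / (N-2), and reuses the five-sequence distance table from the Fitch-Margoliash example; the function names are illustrative.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
DIST = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20,
    ("D", "E"): 10,
}


def dist(x, y):
    """Look up a distance regardless of the order of the pair."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def branch_length_sum(m, n, taxa):
    """Sum of branch lengths S_mn for the tree in which m and n are joined."""
    size = len(taxa)
    others = [t for t in taxa if t not in (m, n)]
    # Distances from the joined pair to every other sequence.
    to_pair = sum(dist(i, m) + dist(i, n) for i in others) / (2 * (size - 2))
    # Distance between the two joined sequences.
    within_pair = dist(m, n) / 2
    # Distances among the remaining sequences (i < j).
    among_others = sum(dist(i, j) for i, j in combinations(others, 2)) / (size - 2)
    return to_pair + within_pair + among_others


sums = {(m, n): branch_length_sum(m, n, TAXA) for m, n in combinations(TAXA, 2)}
best = min(sums, key=sums.get)
for pair, value in sorted(sums.items(), key=lambda item: item[1]):
    print(pair, round(value, 2))
print("join first:", best)
```

For this table the smallest sum belongs to the pair (A, B), even though D and E have the smallest pairwise distance, which illustrates how the neighbor-joining criterion differs from simply picking the closest pair.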

For example, consider the tree when A and B are joined:

[Figure: the star topology for sequences A-E, and the tree after A and B are joined]

The pair that results in the smallest sum of branch lengths is then chosen to be the pair that is joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the actual branch lengths. After the pair has been joined, a new distance table is created with the recently joined sequences now entered as a composite. The neighbor-joining algorithm chooses the next pair of sequences to join, and the F-M algorithm computes the branch lengths. The process continues until the correctly branched tree and distances have been identified.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

SEE STARTING ON P 262

Maximum Likelihood

SEE P 274–277 MOUNT

Which Method Do I Choose?

The choice among these methods depends upon the sequences that are being compared. If there is strong sequence similarity, then maximum parsimony methods are best. If the sequence similarity is not strong but is still clearly recognizable, then distance methods work best. For all others, the best approach is a maximum likelihood model.

Difficulties with phylogenetic analysis

Phylogenetic analysis would be easier if evolution occurred only in a vertical fashion. However, horizontal or lateral transfer of genetic material (for instance through viruses) occurs, which makes it difficult to determine the phylogenetic origin of some evolutionary events. If a gene is under selective pressure in different organisms, it can evolve rapidly. Such evolution can mask earlier changes that had occurred phylogenetically. In addition, different regions of a genome are under different pressures, and therefore different sites within two sequences being compared may be evolving at different rates. Rearrangements of genetic material can also lead to false conclusions in phylogenetic analysis, especially if two sequences of different evolutionary origins are placed next to each other.

Gene duplication events also cause problems with phylogenetic analysis, since the duplicated genes can evolve along separate pathways, leading to different functions.

PAUP (Maximum Parsimony)
MacClade (Maximum Parsimony)
CONSENSE
PHYLIP – (Distance -- neighbor joining)
CLUSTALW – distance-based tree
TreeTop http://www.genebee.msu.su/services/phtree_reduced.html
Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
ATV (A Tree Viewer) http://www.genetics.wustl.edu/eddy/atv/
How to Make a phylogenetic tree http://hiv-web.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html
A Brief Review of Common Tree Making Methods http://bioinfo.mbb.yale.edu/mbb452a/projects/Patricia-M-Strickler.html
NCBI Primer on Phylogenetics http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
List of Phylogeny Programs http://evolution.genetics.washington.edu/phylip/software.html
TreeViewer http://www.avl.iu.edu/projects/DNAml/
Phylip http://evolution.genetics.washington.edu/phylip.html

CECS694-02 Introduction to Bioinformatics

Lecture 10 Phylogenetic Prediction

Tree of Life

On one level, it is interesting to understand and study how the evolution of species has occurred. Many resources discuss the evolution of species, including the NCBI taxonomy web site and the University of Arizona’s Tree of Life project. We’ll take a look at both of these web sites in order to get a better appreciation for the evolution of species relative to one another.

NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
Tree of Life http://tolweb.org/tree/

Evolutionary Trees

An evolutionary tree is a two-dimensional graph showing the evolutionary relationships among a set of items being compared. This set can be organisms, genes, or DNA sequences. Each unit in the set is referred to as a taxon, and each taxon is represented by a distinct unit on the tree. An evolutionary tree is composed of outer branches, or leaves, that represent the taxa, and of nodes and branches that represent the relationships among the taxa. Two taxa that are derived from the same common ancestor share a node in the graph. In general, approaches to constructing evolutionary trees attempt to define the length of each branch to the next node according to the number of sequence-level changes that occurred. One thing to be careful of in phylogenetic analysis is that this distance may not be in direct relation to evolutionary time. The assumption that mutations accumulate at a uniform rate is known as the molecular clock hypothesis.

Rooted Trees

In a rooted tree topology, one node (the root) is defined to be the common ancestor of all of the other sequences. A unique path leads from the root node to any other node, and the direction of the path indicates evolutionary time. The root is chosen by including a sequence from an organism that is thought to have branched off earlier than the other sequences. If the molecular clock hypothesis holds, it is also possible to predict a root. As the number of sequences increases, the number of possible rooted trees increases very rapidly. In some cases, a bifurcating binary tree is the best model of evolutionary events, in which one species branches off into two separate species. Example of a rooted tree:

Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Star Topology (Unrooted Trees)

An unrooted tree (sometimes referred to as a star topology) shows the evolutionary relationship among sequences, without revealing the location of the oldest ancestry. There are fewer choices for an unrooted tree than a rooted tree. Example of an unrooted tree:

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg

Methods for Determining Evolutionary Trees

There are three methods used to calculate the tree(s) that best account for the observed variation in a set of sequences: maximum parsimony, distance, and maximum likelihood.

Maximum Parsimony

Maximum parsimony methods predict the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences. In order to construct a tree using maximum parsimony, a multiple sequence alignment must first be obtained. For each aligned position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. This continues for each position in the alignment. Finally, the trees that produce the smallest number of changes overall, summed over all sequence positions, are identified. This is a rather time-consuming algorithm that only works well if the sequences have strong sequence similarity.

Site  1 2 3 4 5 6 7 8 9
1     A A G A G T G C A
2     A G C C G T G C G
3     A G A T A T C C A
4     A G A G A T C C G

Consider the example above (Mount, 250). There are a total of four sequences, which gives three possible unrooted trees. In this case some sites are informative, and other sites are not. An informative site has at least two different sequence characters, each of which appears in at least two of the sequences. Only the informative sites need to be considered.

Possible trees: with four sequences, the three unrooted trees group the sequences as (1,2 | 3,4), (1,3 | 2,4), and (1,4 | 2,3).

In this case, the optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the least total number of changes. For a large number of sequences, the number of trees becomes so large that it might not be possible to examine all of them. Some programs, such as PAUP, add features that allow the user to invoke a heuristic that keeps representative trees that best fit the data.

The informative sites in the example alignment are 5, 7, and 9. Let’s go through the possible trees, and figure out the number of changes each requires at the informative sites. (See the PowerPoint presentation.)

One problem with determining evolutionary distance between sequences is that columns representing greater variation dominate the analysis. One way to overcome this problem, which leads to poorly determined long branch lengths, is to look only at transversion events, which are the most significant base changes (i.e., changes from a purine to a pyrimidine or vice versa). This is referred to as Lake’s method of invariants.

[Figure: the three possible unrooted trees for sequences 1-4.]
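The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `informative_sites`, `merge`, and `tree_changes` are made up for this sketch, and the per-site count uses Fitch's set-intersection rule, which is one standard way to get the minimum number of changes and is not necessarily how PAUP does it.

```python
# Sketch: parsimony scoring of the three unrooted trees for four sequences.
ALIGNMENT = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

# The only three unrooted topologies for four taxa, written as two pairs.
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]


def informative_sites(alignment):
    """1-based columns where at least two characters each occur at least twice."""
    seqs = list(alignment.values())
    sites = []
    for col in range(len(seqs[0])):
        counts = {}
        for seq in seqs:
            counts[seq[col]] = counts.get(seq[col], 0) + 1
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(col + 1)
    return sites


def merge(x, y):
    """One Fitch step: intersect if possible, otherwise take the union at cost 1."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def tree_changes(tree, alignment, sites):
    """Minimum number of changes the tree needs over the given columns."""
    (a, b), (c, d) = tree
    total = 0
    for site in sites:
        col = site - 1
        left, cost_left = merge({alignment[a][col]}, {alignment[b][col]})
        right, cost_right = merge({alignment[c][col]}, {alignment[d][col]})
        _, cost_mid = merge(left, right)
        total += cost_left + cost_right + cost_mid
    return total


sites = informative_sites(ALIGNMENT)
scores = {tree: tree_changes(tree, ALIGNMENT, sites) for tree in TREES}
best_tree = min(scores, key=scores.get)
print(sites, scores, best_tree)
```

With only four sequences all three trees can be scored exhaustively; for larger data sets this brute-force loop is exactly what becomes infeasible and motivates the heuristics mentioned above.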

Look at the mitochondrial sequence analysis on p. 252.

Distance Methods

The distance method for construction of phylogenetic trees looks at the number of changes between each pair in a group of sequences to produce a phylogenetic tree of the group. The goal of distance methods is to identify a tree that positions neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible. CLUSTALW uses the neighbor-joining method as a guide to multiple sequence alignments. The PHYLIP suite of programs also employs neighbor-joining methods.

Phylip http://evolution.genetics.washington.edu/phylip.html

Distance analysis programs in PHYLIP

FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining method (no molecular clock assumed) or the unweighted pair group method with arithmetic mean (UPGMA) (molecular clock assumed).

For phylogenetic analysis, the distance score is counted as either the number of mismatched positions in the alignment or the number of sequence positions that must be changed to generate the other sequence. The success of distance methods depends on the degree to which the distances among a set of sequences can be made additive on a predicted evolutionary tree. Consider the alignment:

A  ACGCGTTGGGCGATGGCAAC
B  ACGCGTTGGGCGACGGTAAT
C  ACGCATTGAATGATGATAAT
D  ACACATTGAGTGATAATAAT

The distances between these sequences can be shown as a table:

     A   B   C   D
A    -   3   7   8
B    -   -   6   7
C    -   -   -   3
D    -   -   -   -

Using this information, an unrooted tree showing the relationship between these sequences can be drawn:

[Figure: unrooted tree in which A and B join at one node and C and D join at another. Branch lengths: A = 2, B = 1, C = 1, D = 2, internal branch = 4.]
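As a quick check of the table, the mismatch counts can be recomputed directly from the alignment. The sketch below is a minimal Python illustration; the function names `mismatches` and `distance_table` are made up here and do not come from any phylogenetics package.

```python
from itertools import combinations

SEQS = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatches(s, t):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))


def distance_table(seqs):
    """Mismatch count for every unordered pair of sequence names."""
    return {
        (a, b): mismatches(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }


table = distance_table(SEQS)
for pair, d in table.items():
    print(pair, d)
```

This uses the simplest distance score mentioned above (mismatched positions); it applies no correction for multiple substitutions at the same site.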

Fitch and Margoliash Method

The Fitch and Margoliash method uses a distance table. The sequences are combined in threes to define the branches of the predicted tree and to calculate the branch lengths of the tree.

Example using three sequences:

1) Draw an unrooted tree with three branches originating from a common node and label the ends:

[Figure: unrooted tree with leaves A, B, and C joined at a common node by branches a, b, and c.]

2) Calculate the lengths of the tree branches algebraically:

     A    B    C
A    --   22   39
B    --   --   41
C    --   --   --

distance from A to B = a + b = 22   (1)
distance from A to C = a + c = 39   (2)
distance from B to C = b + c = 41   (3)

Subtracting (2) from (3) yields:

   b + c =  41
  -a - c = -39
  ------------
   b - a =   2   (4)

Adding (1) and (4) yields 2b = 24, so b = 12.
Then a + 12 = 22, so a = 10, and 10 + c = 39, so c = 29.

[Figure: the same tree with branch lengths a = 10, b = 12, c = 29.]
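The three equations above always have the same closed-form solution, which can be written once and reused. The helper below is a small sketch (the name `three_point` is made up for illustration): each branch is half of the two distances that include it minus the one that does not.

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of a three-leaf tree from its pairwise distances."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_point(22, 39, 41)
print(a, b, c)
```

For the distances in the example this gives the same a, b, and c as the algebra above.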

Example of Fitch-Margoliash Using Five Sequences

The Fitch-Margoliash algorithm can be extended to more than three sequences. Consider the following table of distances between five separate sequences:

     A    B    C    D    E
A    --   22   39   39   41
B    --   --   41   41   43
C    --   --   --   18   20
D    --   --   --   --   10
E    --   --   --   --   --

Suppose that the initial tree is as follows:

[Figure: initial tree. A and B attach to one node by branches a and b; branch f connects that node to a central node, to which C attaches by branch c; branch g connects the central node to a third node, to which D and E attach by branches d and e.]

1) The first step is to locate the most closely related sequences in the distance table. In this case, that would be sequences D and E.

2) Now create a new table by combining the remaining sequences. For the distance from D to A, B, C, take the average distance of each of these to D ((39 + 41 + 18) / 3 = 32.7). For the distance from E to A, B, C, take the average distance of each of these to E ((41 + 43 + 20) / 3 = 34.7). The resulting table is as follows:

           D     E     AVG ABC
D          --    10    32.7
E          --    --    34.7
AVG ABC    --    --    --

3) The average distances from D to ABC and E to ABC could also be found by averaging the sum of the appropriate branch lengths:

D to E:   d + e = 10     (1)
D to ABC: d + m = 32.7   (2)   where m = g + (c + 2f + a + b) / 3
E to ABC: e + m = 34.7   (3)

Subtracting the third equation from the second gives d - e = -2. Adding this result to (1) gives 2d = 8, so d = 4. Substituting back gives e = 6.

4) Now treat D and E as a single sequence, and create a new distance table. The distance from A to DE is taken as the average of the distances from A to D and from A to E. The other distances are calculated in a similar fashion. The resulting distance table is:

        A    B    C    (DE)
A       --   22   39   40
B       --   --   41   42
C       --   --   --   19
(DE)    --   --   --   --

5) Identify the most closely related sequences in the table. In this case, it is C and DE. Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.

6) Repeat the process until all lengths have been identified, at which point there is only a single composite node left.

Summary of Fitch-Margoliash Algorithm

1) Find the most closely related pair of sequences (A, B).
2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others, and from B to all others.
3) Use these values to calculate the lengths of the edges a and b.
4) Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.
5) Identify the next pair of related sequences and begin as with step 1.
6) Subtract extended branch lengths to calculate the lengths of intermediate branches.
7) Repeat the entire process with all possible pairs of sequences.
8) Calculate the predicted distances between each pair of sequences for each tree to find the best tree.
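Steps 1 through 4 of the summary can be sketched as a single reduction step that is applied repeatedly. The code below is an illustration only, using the five-sequence table from the example; the function name `fm_step` and the naming of composites such as `(DE)` are assumptions of this sketch, and a composite is treated as a single sequence when averaging, as in the worked example.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
TABLE = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20,
    ("D", "E"): 10,
}
# frozenset keys make the lookup independent of the order of the pair
DIST = {frozenset(pair): value for pair, value in TABLE.items()}


def fm_step(names, dist):
    """Join the closest pair; return (pair, branch lengths, new names, new table)."""
    x, y = min(combinations(names, 2), key=lambda p: dist[frozenset(p)])
    others = [n for n in names if n not in (x, y)]
    d_xy = dist[frozenset((x, y))]
    # average distance from each member of the pair to everything else
    avg_x = sum(dist[frozenset((x, o))] for o in others) / len(others)
    avg_y = sum(dist[frozenset((y, o))] for o in others) / len(others)
    len_x = (d_xy + avg_x - avg_y) / 2
    len_y = d_xy - len_x
    # build the reduced table with x and y replaced by one composite
    merged = "(" + x + y + ")"
    new_dist = {frozenset(p): dist[frozenset(p)] for p in combinations(others, 2)}
    for o in others:
        new_dist[frozenset((o, merged))] = (
            dist[frozenset((x, o))] + dist[frozenset((y, o))]
        ) / 2
    return (x, y), (len_x, len_y), others + [merged], new_dist


pair1, lens1, names1, dist1 = fm_step(NAMES, DIST)
pair2, lens2, names2, dist2 = fm_step(names1, dist1)
print(pair1, lens1)
print(pair2, lens2)
```

In the second step the length reported for the composite (DE) is the distance from the new node to the composite as a whole; subtracting the average of d and e from it gives the intermediate branch g, which is step 6 of the summary.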

Neighbor-joining algorithm

The neighbor-joining method is very similar to the Fitch-Margoliash method. The sequences that should be joined are chosen to give the best least-squares estimates of the branch lengths that most closely reflect the actual distances between the sequences. The neighbor-joining method begins by creating a star topology in which no neighbors are joined. The tree is then modified by joining pairs of sequences. The pair to be joined is chosen by calculating the sum of the branch lengths for the corresponding tree. The sum of the branch lengths is calculated as follows:

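$$S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}$$

where N is the number of sequences, i and j represent all sequences except m and n, and i < j.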
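To see how the branch-length sum behaves, it can be evaluated for every candidate pair. The sketch below reuses the five-sequence distance table from the Fitch-Margoliash example as illustrative input; that reuse, and the function name `branch_sum`, are assumptions of this sketch rather than part of the method.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
TABLE = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20,
    ("D", "E"): 10,
}
DIST = {frozenset(pair): value for pair, value in TABLE.items()}


def branch_sum(names, dist, m, n):
    """Sum of branch lengths S_mn for the tree in which m and n are joined."""
    others = [k for k in names if k not in (m, n)]
    total = len(names)  # N in the formula
    to_pair = sum(dist[frozenset((i, m))] + dist[frozenset((i, n))] for i in others)
    among_others = sum(dist[frozenset(p)] for p in combinations(others, 2))
    return (
        to_pair / (2 * (total - 2))
        + dist[frozenset((m, n))] / 2
        + among_others / (total - 2)
    )


sums = {pair: branch_sum(NAMES, DIST, *pair) for pair in combinations(NAMES, 2)}
best_pair = min(sums, key=sums.get)
print(best_pair, round(sums[best_pair], 2))
```

Note that the pair minimizing the sum need not be the pair with the smallest raw distance; in this table D and E are closest, yet joining A and B gives the smaller total.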

For example, consider the tree when A and B are joined:

[Figure: the star topology for sequences A, B, C, D, and E, and the tree that results when A and B are joined.]

The pair that results in the smallest sum of branch lengths is then chosen to be the pair that is joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the actual branch lengths. After the pair has been joined, a new distance table is created with the recently joined sequences now entered as a composite. The neighbor-joining algorithm chooses the next pair of sequences to join, and the F-M algorithm computes the branch lengths. The process continues until the branching order of the tree and its branch lengths have been identified.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

UPGMA works by clustering the sequences, starting with the more similar sequences and working towards the more distant sequences. The process assembles a tree upwards, with each node being added above the others, and the edge lengths being determined by the difference in the heights of the nodes. The distance dij between two clusters Ci and Cj is defined to be the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

where |Ci| and |Cj| are the number of sequences in clusters i and j, respectively.

The algorithm for UPGMA clustering (Durbin p 166) is as follows:

1. Assign each sequence i to its own cluster Ci.
2. Define one leaf of the tree T for each sequence, and place it at height 0.
3. Determine the two clusters i and j for which dij is minimal.
4. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l.
5. Define a node k with daughter nodes i and j, and place it at height dij/2.
6. Add k to the current clusters and remove i and j.
7. Continue steps 3-6 until only two clusters i and j remain, and place the root of the tree at height dij/2.

Example of UPGMA

Consider the case where there are five sequences represented by dots on a graph. The spacing between each of these is representative of the distance between them. The first step is to assign each of the sequences to its own cluster, which gives a number to each of them. In addition, the tree can be started at its base, where each sequence is a leaf of the tree.

Now select the two clusters that are closest to each other. These are the sequences 1 and 2. Create a single cluster for these two sequences, and create a parent node in the tree at height d12/2. Continue on, selecting the two clusters that are closest: in this case, it is 4 and 5. Combine them into a single cluster, and update the tree:

[Figure: the five sequences numbered 1-5, and the partial tree in which node 6 joins sequences 1 and 2.]

The next two clusters to be joined are the one containing 4 and 5, and the one containing 3:

[Figure: partial tree in which node 6 joins sequences 1 and 2, node 7 joins sequences 4 and 5, and node 8 joins sequence 3 with node 7.]

There are now only two clusters left, so join them to complete the tree:

[Figure: completed tree, with root node 9 joining node 6 and node 8.]

See starting on p. 262.

Maximum Likelihood

See Mount, pp. 274-277.

Which Method Do I Choose?

The choice among these methods depends upon the sequences that are being compared. If there is strong sequence similarity, then maximum parsimony methods are best. If there is not strong sequence similarity, but clearly recognizable sequence similarity, then distance methods work best. For all others, the best approach is a maximum likelihood model.
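Before moving on, here is a compact recap of the UPGMA steps listed earlier in this lecture, written as a Python sketch. It is an illustration under assumptions: the function name `upgma` and the bracketed cluster labels are made up, the input is the five-sequence table from the Fitch-Margoliash example rather than the dot diagram, and the loop simply runs until one cluster is left, which places the root at dij/2 in the same way as step 7.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
TABLE = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20,
    ("D", "E"): 10,
}
DIST = {frozenset(pair): value for pair, value in TABLE.items()}


def upgma(names, dist):
    """Return the merges in order as (cluster label, node height) pairs."""
    clusters = {name: [name] for name in names}  # label -> member sequences
    merges = []

    def average_distance(ci, cj):
        # mean distance over all pairs with one member from each cluster
        pairs = [(p, q) for p in clusters[ci] for q in clusters[cj]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        d_ij = average_distance(i, j)
        label = "(" + i + "," + j + ")"
        clusters[label] = clusters.pop(i) + clusters.pop(j)
        merges.append((label, d_ij / 2))  # new node sits at height d_ij / 2
    return merges


merges = upgma(NAMES, DIST)
for label, height in merges:
    print(label, round(height, 2))
```

Because every node is placed at half the cluster distance, all leaves end up equally far from the root, which is the molecular clock assumption noted for UPGMA in the PHYLIP program list.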

Difficulties with phylogenetic analysis

Phylogenetic analysis would be easier if evolution occurred only in a vertical fashion. However, horizontal or lateral transfer of genetic material (for instance through viruses) occurs, which makes it difficult to determine the phylogenetic origin of some evolutionary events. If a gene is under selective pressure in different organisms, it can evolve rapidly, and such rapid evolution can mask earlier changes. In addition, different regions of a genome are under different pressures, and therefore different sites within two compared sequences may be evolving at different rates. Rearrangements of genetic material can also lead to false conclusions in phylogenetic analysis, especially if two sequences of different evolutionary origins are placed next to each other. Gene duplication events also cause problems with phylogenetic analysis, since the duplicated genes can evolve along separate pathways, leading to different functions.

PAUP (Maximum Parsimony)
MacClade (Maximum Parsimony)
CONSENSE PHYLIP - (Distance -- neighbor joining)
CLUSTALW - distance-based tree

Consider the following list of globin sequences:

>gamma_A

MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH

GKKVLT

SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV

TAVAS

ALSSRYH

>alfa

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

VHASLDKFLASVSTVLTSKYR

>beta

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

EFTPPVQAAYQKVVAGVANALAHKYH

>delta

VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGK

EFTPQMQAAYQKVVAGVANALAHKYH

>epsilon

VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKV

KAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGK

EFTPEVQAAWQKLVSAVAIALAHKYH

>gamma_G

MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH

GKKVLT

SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV

TGVAS

ALSSRYH

>myoglobin

MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKK

HGATVL

TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKAL

ELFR

KDMASNYKELGFQG

>teta1

ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQ

KVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLLGHCLLVTLARHYPGDFSPA

LQASLDKFLSHVISALVSEYR

>zeta

SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGS

KVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAE

AHAAWDKFLSVVSSVLTEKYR

Create a phylogeny from these.

1) Demonstrate using the GCG software (http://kingtut.spd.louisville.edu:999/)

Examples using a phylogenetic program:

http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
TreeTop http://www.genebee.msu.su/services/phtree_reduced.html
Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
ATV (A Tree Viewer) http://www.genetics.wustl.edu/eddy/atv/
How to Make a phylogenetic tree http://hiv-web.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html
A Brief Review of Common Tree Making Methods http://bioinfo.mbb.yale.edu/mbb452a/projects/Patricia-M-Strickler.html
NCBI Primer on Phylogenetics http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
List of Phylogeny Programs http://evolution.genetics.washington.edu/phylip/software.html
TreeViewer http://www.avl.iu.edu/projects/DNAml/
Phylip http://evolution.genetics.washington.edu/phylip.html


Lecture 11 RNA Secondary Structure Prediction

Introduction to RNA sequence analysis

RNA molecules are important to study since they are involved in important biochemical functions, including translation, RNA splicing, processing and editing, cellular localization, and catalysis. RNA sequence analysis needs to be treated differently than DNA sequence analysis, since RNA molecules fold and base pair with themselves to form secondary structures. Therefore, it is not necessarily the sequence but the conservation of structure that is most important in RNA sequence analysis.

Variations in RNA sequence maintain the base-pairing patterns that give rise to these secondary structures. Therefore, when one nucleotide of a base pair changes, the nucleotide with which it pairs must also change to maintain the same structure. For instance, if you have the base pair G-C, and the G mutates to an A, then the C should mutate to a U to maintain a base pairing at this location, which preserves the same secondary structure. Such a variation is referred to as covariation.

In order to determine the secondary structure of the RNA molecule, all possible choices of complementary sequences are considered, and the sets that provide the most energetically stable molecules are chosen. Another method to predict secondary structure in RNA takes into account conserved patterns of base-pairing. Positions of covariance are studied, and are taken to be conserved matches, since they maintain the secondary structure. Locating regions of covariance in sequence data is a computationally challenging task.

Features of RNA Secondary structure

RNA is a polymer composed of a combination of four nucleotides: adenine (A), cytosine (C), guanine (G), and uracil (U). G-C and A-U form complementary hydrogen-bonded base pairs, with the GC base pairs being more stable since they form three hydrogen bonds as opposed to the two hydrogen bonds formed by AU base pairs. In addition to the canonical Watson-Crick GC and AU base pairs, non-canonical pairs can occur in RNA secondary structure as well. The most common of these non-canonical pairs is GU. RNA is typically produced as a single-stranded molecule (unlike DNA) which folds upon itself to form base pairs. This structure is referred to as the secondary structure of the RNA.

RNA secondary structure can be viewed as an intermediary between a linear molecule and a three-dimensional structure. RNA secondary structure is mainly composed of double-stranded RNA regions formed by folding the single-stranded RNA molecule back on itself. There are a number of different secondary structures that can be formed from this base-pairing, including:

Stem Loops (Hairpin loops)
Loops are generally at least 4 bases long.

Bulge Loops
Bulge loops occur when bases on one side of the structure cannot form base pairs.

Interior Loops
Interior loops occur when bases on both sides of the structure cannot form base pairs.

Junctions or Multiloops
Junctions occur when two or more double-stranded regions converge to form a closed structure.

In addition, tertiary interactions can be present as well. Such tertiary interactions are located using covariance analysis. The types of tertiary interactions present in RNA molecules include:

Kissing Hairpins
In kissing hairpins, the unpaired bases of two separate hairpin loops base pair with one another.

Pseudoknots

Hairpin-Bulge Interactions

Limitations of Secondary Structure Prediction

Three assumptions are made in secondary structure prediction:

1) The most likely structure is similar to the energetically most stable structure.
2) The energy associated with any position in the structure is only influenced by local sequence and structure.
3) The structure formed does not produce pseudoknots.

One method of representing the base pairs of a secondary structure is to draw the structure in a circle. An arc is drawn to represent each base pairing found in the structure. If any of the arcs cross, then a pseudoknot is present. An example of the circular method is shown below:

Image source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html

RNA sequence evolution

With RNA sequences, homology is not defined in terms of sequence similarity, but rather in terms of common secondary structure. Two sequences that do not appear to have significant sequence similarity can still have conserved secondary structure.

Inferring structure by comparative sequence analysis

Comparative sequence analysis is the most reliable computational method for determining the secondary structure of an RNA sequence. For example, consider the following example from Durbin, et al., p 266:

In order to use comparative sequence analysis, the first step is to calculate a multiple sequence alignment. This requires that the sequences be similar enough so that they can be initially aligned. At the same time, the sequences should be dissimilar enough so that covarying substitutions can be detected. The mutual information gained by aligning two columns that covary is determined by the function:

$$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$

where fxi is the frequency of base xi in column i, and fxixj is the joint (pairwise) frequency of the bases xi and xj occurring together in columns i and j. For RNA, the mutual information ranges between 0 and 2 bits. If columns i and j are uncorrelated, the mutual information is 0. An example of a plot of the mutual information for the yeast tRNA-Phe is given in Durbin, et al., p 268.

[Figure: mutual information plot for yeast tRNA-Phe, and the secondary structure inferred from it.]
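The mutual information between two alignment columns is easy to compute directly from the definition. The sketch below is a minimal Python illustration; the function name `mutual_information` and the toy four-sequence columns are made up for this example and are not taken from Durbin, et al.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_i)
    f_i = Counter(col_i)               # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts of (x_i, x_j)
    total = 0.0
    for (a, b), count in f_ij.items():
        joint = count / n
        total += joint * log2(joint / ((f_i[a] / n) * (f_j[b] / n)))
    return total


# Two columns that covary perfectly: every G pairs with C, A with U, and so on.
covarying = mutual_information("GCAU", "CGUA")
# A fully conserved column carries no information about its partner.
conserved = mutual_information("AAAA", "CGUA")
print(covarying, conserved)
```

The first pair of columns reaches the 2-bit maximum mentioned above, while the conserved column scores 0 even though it may well be base paired; this is why the sequences in the alignment need to be dissimilar enough for covariation to be detected.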

Predicting Structure from a single sequence

Suppose we don’t have a set of similar RNAs from which the structure can be inferred using covariance methods. Many different secondary structures are possible for a single sequence. For example, an RNA molecule only 200 bases long has 10^50 possible secondary structures, many of which are not plausible. A method to detect the correct structure is needed. One of the simplest methods to find self-complementary regions in an RNA sequence is to perform a dot-plot of the sequence against its complement. The repeat regions that are found can potentially base pair with each other to form secondary structures. More advanced dot-plot techniques incorporate free energy measures as well.

Image Source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html

Base Pair Maximization - Nussinov Folding Algorithm

One approach to predicting secondary structure looks at finding the structure with the most base pairs. An efficient dynamic programming approach to this problem was introduced in the late 1970s by Nussinov. According to the Nussinov algorithm, there are four ways to get the best structure from i to j from the best structures of the smaller subsequences:

1) Add the i,j pair onto the best structure found for subsequence i+1, j-1
2) Add unpaired position i onto the best structure for subsequence i+1, j
3) Add unpaired position j onto the best structure for subsequence i, j-1
4) Combine two optimal structures i,k and k+1, j

The possible structures are shown below (Durbin et al., p 269):

The Nussinov RNA folding prediction program works by comparing a sequence against itself in a dynamic programming matrix with the above rules for scoring the structure at a particular point. Since the structure is folding upon itself, it is only necessary to calculate half of the matrix.

Initialization step:

In the initialization step, the scores along the main diagonal and the diagonal just below it are set to zero. Formally, the scoring matrix, M, is initialized as follows:

M[i][i] = 0 for i = 1 to L (where L is the length of the sequence)
M[i][i-1] = 0 for i = 2 to L

Using the example in Durbin, et al. with the RNA sequence GGGAAAUCC, the matrix now looks like the following, such that subsequences of length 1 score 0:

     G  G  G  A  A  A  U  C  C
G    0
G    0  0
G       0  0
A          0  0
A             0  0
A                0  0
U                   0  0
C                      0  0
C                         0  0

Now the matrix is filled in, starting with subsequences of length 2, and ending at subsequences of length L. The four rules for filling in the matrix are used:

M[i][j] = max of the following four:
  M[i+1][j]                                 (ith residue is hanging off by itself)
  M[i][j-1]                                 (jth residue is hanging off by itself)
  M[i+1][j-1] + S(xi, xj)                   (ith and jth residues are paired; S(xi, xj) = 1 if xi is the complement of xj, otherwise it is 0)
  max over i<k<j of (M[i][k] + M[k+1][j])   (merging two substructures)

When looking at subsequences of length 2, the matrix is filled as follows, since A-U is the only base pair found:

     G  G  G  A  A  A  U  C  C
G    0  0
G    0  0  0
G       0  0  0
A          0  0  0
A             0  0  0
A                0  0  1
U                   0  0  0
C                      0  0  0
C                         0  0

Filling in for subsequences of length 3, the matrix becomes:

     G  G  G  A  A  A  U  C  C
G    0  0  0
G    0  0  0  0
G       0  0  0  0
A          0  0  0  0
A             0  0  0  1
A                0  0  1  1
U                   0  0  0  0
C                      0  0  0
C                         0  0

The final filled matrix is as follows:

     G  G  G  A  A  A  U  C  C
G    0  0  0  0  0  0  1  2  3
G    0  0  0  0  0  0  1  2  3
G       0  0  0  0  0  1  2  2
A          0  0  0  0  1  1  1
A             0  0  0  1  1  1
A                0  0  1  1  1
U                   0  0  0  0
C                      0  0  0
C                         0  0
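The fill rules and the matrix above can be reproduced with a short Python sketch. This is an illustration under assumptions: indices are 0-based rather than 1-based, only Watson-Crick pairs score 1 as in the rule for S(xi, xj), no minimum hairpin loop length is enforced, and the function names `nussinov_fill` and `nussinov_traceback` are made up for this example.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def score(x, y):
    """S(x, y): 1 if the two bases are complementary, otherwise 0."""
    return 1 if (x, y) in PAIRS else 0


def nussinov_fill(seq):
    """Fill the upper triangle of M; M[i][j] is the most pairs in seq[i..j]."""
    length = len(seq)
    # all zeros covers the initialization M[i][i] = 0 and M[i][i-1] = 0
    m = [[0] * length for _ in range(length)]
    for span in range(1, length):  # span 1 means subsequences of length 2
        for i in range(length - span):
            j = i + span
            best = max(
                m[i + 1][j],                               # i unpaired
                m[i][j - 1],                               # j unpaired
                m[i + 1][j - 1] + score(seq[i], seq[j]),   # i pairs with j
            )
            for k in range(i + 1, j):                      # bifurcation
                best = max(best, m[i][k] + m[k + 1][j])
            m[i][j] = best
    return m


def nussinov_traceback(seq, m):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if m[i + 1][j] == m[i][j]:
            stack.append((i + 1, j))
        elif m[i][j - 1] == m[i][j]:
            stack.append((i, j - 1))
        elif m[i + 1][j - 1] + score(seq[i], seq[j]) == m[i][j]:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if m[i][k] + m[k + 1][j] == m[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return "".join(structure)


matrix = nussinov_fill("GGGAAAUCC")
structure = nussinov_traceback("GGGAAAUCC", matrix)
print(matrix[0])
print(structure)
```

The top row of `matrix` corresponds to the first row of the final table, and the top-right entry is the maximum number of base pairs. Several structures can share that maximum, so the traceback returns just one of them, depending on the order in which the four cases are tried.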

Traceback through this matrix (covered on P 271, Durbin et al) leads to the following structure:

Given the four possibilities for the maximum structure in the Nussinov algorithm, it can be expressed as a stochastic context-free grammar as follows:

S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → aSu | cSg | uSa | gSc
S → SS

Such a simplistic approach will not give accurate structure predictions, since it does not take into account important structural features, such as nearest neighbor interactions, stacking interactions, and loop length preferences.

Energy Minimization Methods

Since RNA folding is determined by biophysical properties, methods that take these properties into account are more likely to yield accurate predictions. One widely used method is energy minimization, which predicts that the correct secondary structure is the one that minimizes the free energy (∆G). The free energy of an RNA secondary structure is calculated as the sum of the individual contributions of loops, base pairs, and other secondary structure elements. Energies of stems are calculated as the stacking contributions between neighboring base pairs. The predicted free-energy values (kcal/mole at 37 °C) are calculated as follows:

Stacking Energies for base pairs

       A/U    C/G    G/C    U/A    G/U    U/G
A/U   -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G   -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C   -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A   -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U   -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G   -1.0   -1.9   -2.1   -1.1   -1.5   -0.4

Destabilizing Energies for Loops

Number of Bases    1      5      10     20     30
Internal           --     5.3    6.6    7.0    7.4
Bulge              3.9    4.8    5.5    6.3    6.7
Hairpin            --     4.4    5.3    6.1    6.5

In order to find the structure with the minimum free energy, the sequence is compared against itself using a dynamic programming approach similar to the maximum base-paired structure approach previously described. However, instead of using a scoring scheme for the base pairs present, the score is based upon the free energies described above. Gaps between matches represent some form of a loop, so the gap score is calculated using the above tables as well. The most widely used software that incorporates this minimum free energy algorithm is MFOLD.

Suboptimal folds

The correct structure is not necessarily the structure with the minimum free energy, but may be a structure within a certain threshold of the calculated minimum energy. Therefore, the MFOLD algorithm has been updated to report suboptimal foldings as well.

Covariance Models

In order to locate covarying sites in RNA sequences, 7 different approaches are offered in Mount, p 225. The key to covariance is the measure of the mutual information content previously discussed. The mutual information content can be plotted on a motif logo, which can give insight into the folding of a particular sequence.

Image source: http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html

A formal covariance model, COVE, was devised by Eddy and Durbin. The model provides very accurate results, but it is extremely slow and unsuitable for large genomes. Stochastic Context Free Grammars (SCFGs) have also been used to model RNA secondary structure. Examples of these are tRNAscan-SE and a program created to find snoRNAs. Typically, with SCFG approaches, the grammars are created from a training set of data and are then applied to candidate sequences to see whether they fit into the language. SCFGs allow the detection of sequences belonging to a family, such as tRNAs, group I introns, snoRNAs, snRNAs, etc. With a SCFG approach, base-paired columns are modeled by pairwise-emitting nonterminals (for example, aWu), while single-stranded columns are modeled by leftwise-emitting nonterminals (such as gW) when possible. Any RNA structure can then be reduced to a SCFG (see Durbin, et al., p 278-279).

Transformational Grammars
Transformational grammars were first described by the linguist Noam Chomsky in the 1950s. (Yes, this is the same Noam Chomsky who has expressed various dissident political views throughout the years!) Transformational grammars are very important in computer science, most notably in compiler design. Grammars are covered in more detail in compiler and automata classes, so we will only briefly touch on them here.

Web site: http://web.mit.edu/linguistics/www/chomsky.home.html

The idea behind transformational grammars is to take an output (such as a sentence or, in our case, an RNA structure) and determine whether or not it can be produced using the set of rules for the language. A transformational grammar consists of a set of symbols and a set of production rules that specify how the symbols can be put together. The symbols are either terminal (emitting) symbols or nonterminal symbols, which can be expanded to create longer strings of symbols.

Grammar for Palindromic sequences
First, consider the case of palindromic DNA sequences. There are a total of five possible terminal symbols: {a, c, g, t, ε}, where ε represents the blank terminal symbol. The production rules for creating a palindromic sequence are as follows, where S and W are nonterminal symbols:

S → W
W → aWa | cWc | gWg | tWt
W → a | c | g | t | ε

Using these production rules, we can create a derivation of the palindromic sequence acttgttca as follows:

S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca

In order to align a context-free grammar to a sequence, a parse tree can be created, where the root of the tree is the nonterminal start symbol, S. Leaves of the parse tree are the terminal symbols in the sequence, and internal nodes are the nonterminals. The leaves can be read from left to right to recover the sequence that was produced.

[Figure: parse tree for the derivation above, with root S and internal W nodes]
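The palindrome grammar above can be turned into a small recognizer. The following is a minimal sketch, not a general grammar parser: the function name derives_palindrome is made up for illustration, and each branch simply mirrors one of the production rules.

```python
def derives_palindrome(seq: str) -> bool:
    """Return True if seq can be derived from S with the palindrome grammar."""

    def from_w(s: str) -> bool:
        # W -> a | c | g | t | ε   (rules that end the derivation)
        if s == "" or s in ("a", "c", "g", "t"):
            return True
        # W -> aWa | cWc | gWg | tWt   (same terminal emitted on both sides)
        if len(s) >= 2 and s[0] == s[-1] and s[0] in "acgt":
            return from_w(s[1:-1])
        return False

    # S -> W
    return from_w(seq.lower())


print(derives_palindrome("acttgttca"))  # the sequence derived in the text
```

Each recursive call strips one matching outer pair, which is the derivation S ⇒ W ⇒ aWa ⇒ acWca ⇒ ... read from the outside in.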

More information on parse trees can be found in Durbin, et al., Chapter 9.

A SCFG for RNA secondary structure can be constructed as follows:

S → W
W → WW (bifurcation)
W → aWu | cWg | gWc | uWa (Watson-Crick base pairs)
W → gWu | uWg (G-U wobble pairs)
W → aW | cW | gW | uW (bulges on one side)
W → Wa | Wc | Wg | Wu (bulges on opposite side)
W → a | c | g | u | ε

Using this grammar, the structure produced by MFOLD for the sequence

GCUUACGACCAUAUCACGUUGAAUGCACGCCAUCCCGUCCGAUCUGGCAAGUUAAGCAACGUUGAGUCCAGUUAGUACUUGGAUCGGAGACGGCCUGGGAAUCCUGGAUGUUGUAAGCU

can be constructed using the following productions (5' to 3'):

S ⇒ W ⇒ Wu ⇒ gWcu ⇒ gcWgcu ⇒ gcuWagcu ⇒ gcuuWaagcu ⇒ gcuuaWuaagcu ⇒ gcuuacWguaagcu ⇒ gcuuacgWuguaagcu ⇒ gcuuacgaWuuguaagcu ⇒ gcuuacgacWguuguaagcu ⇒ gcuuacgaccWguuguaagcu ⇒ gcuuacgaccaWguuguaagcu ⇒ ....

Read: Mount, Chapter 5; Durbin, et al., Chapters 9 and 10
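As an illustration of how a known secondary structure maps onto the RNA grammar of this lecture, the sketch below lists the production rules used for a sequence whose structure is given in dot-bracket notation ("(" and ")" for paired bases, "." for unpaired ones). The dot-bracket input, the function name rna_productions, and the choice to emit unpaired bases one at a time before ending with ε are assumptions made for this example; they are not taken from MFOLD or from Durbin, et al.

```python
def rna_productions(seq, structure):
    """List the grammar rules that derive seq with the given dot-bracket structure."""
    seq = seq.lower()

    # Match every '(' with its partner ')'.
    partner, stack = {}, []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner[stack.pop()] = pos

    rules = ["S -> W"]

    def expand(i, j):
        if i > j:                                  # nothing left to emit
            rules.append("W -> ε")
        elif structure[i] == ".":                  # unpaired base on the left
            rules.append(f"W -> {seq[i]}W")
            expand(i + 1, j)
        elif structure[j] == ".":                  # unpaired base on the right
            rules.append(f"W -> W{seq[j]}")
            expand(i, j - 1)
        elif partner[i] == j:                      # the two ends pair with each other
            rules.append(f"W -> {seq[i]}W{seq[j]}")
            expand(i + 1, j - 1)
        else:                                      # two separate stems: bifurcation
            rules.append("W -> WW")
            k = partner[i]
            expand(i, k)
            expand(k + 1, j)

    expand(0, len(seq) - 1)
    return rules


for rule in rna_productions("gcaaagcu", "((...))."):
    print(rule)
```

For the toy hairpin in the usage line, the first rules are W -> Wu, W -> gWc and W -> cWg, which follows the same pattern as the opening steps of the derivation shown for the MFOLD structure.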


Lecture 12 Microarray Image Analysis

Introduction to Microarray Image Analysis
Genes are regions of a genome that code for either a structural or functional protein. Genes are of interest to biologists due to their association with diseases. In the past, studying whether a gene was turned on or turned off under a specific condition was an expensive and time-consuming task. Within the past 10 years, the emergence of a new technology, called microarrays, has made it possible to study the expression patterns of thousands of genes simultaneously. Microarrays allow the study of genes (actually any sequence of interest) under differing conditions.

Approach to microarray construction
The idea behind microarray construction is to spot up to tens of thousands of DNA/RNA molecules on a slide, each of which uniquely identifies a certain region. These molecules can be small, on the order of 25 bp long, or can be somewhat larger. For typical gene expression experiments, the molecule is around 500 bp long. For image detection, the molecules must be a consistent size for each item being studied.

The ultimate goal of microarray analysis is to understand how the expression levels of different genes differ under two separate conditions. By asking and answering such questions, we can get an idea of which genes are involved in a certain disease and, potentially, the pathways involved in these diseases. In order to figure out which genes are expressed in a given condition, cells in that condition are taken, and the mRNA from these cells is extracted. The mRNA represents the genes that are turned on in these cells. These mRNA sequences are then labeled. The manner in which they are labeled depends on the technique being used.

Single Channel Microarrays
With single channel microarrays, the mRNA from the genes expressed under a given condition is labeled with biotin. The labeled sample is washed over the microarray slide, and the expressed sequences hybridize at the appropriate locations. What results is a dark spot wherever an expressed gene has hybridized. If a clear microscope slide has been used to spot the microarray, then light can be passed underneath. Black spots represent genes that are expressed in the given condition. In order to study two different conditions with single channel microarrays, two separate slides must be used. An example of a single channel microarray is given below:

Two Channel Microarrays
With two channel microarrays, the samples under different conditions are labeled separately. The labels normally incorporated are green and red. For argument's sake, assume that the control is labeled green and the sample is labeled red. Both samples are washed over the microarray slide, and hybridization occurs. Each spot on the slide is now one of four colors, as shown below:

The colors correspond to the expression of the gene under the different conditions. For example, spots that are only green are highly expressed in the control, while spots that are red are highly expressed in the sample. Spots that are yellow are equally expressed in both sample and control, while black spots are genes that are not expressed in either the sample or the control.

Determining image intensity
Once the spots are located, the difficulty is in quantifying the image signals. Generally, the images are converted to some sort of matrix of numbers. This step requires processing. Besides the spot intensity, other measures that might be taken include measurements of error and measurements of background noise. For instance, you might ask how green a spot is. Answering this question gives an indication of the difference in expression levels between the control and the sample. Such an approach is often referred to as a fold approach. In other words, how does the expression level change under a given condition? (Two-fold difference? Four-fold difference?) Besides determining this value, it is important to figure out when a change is significant. One thing to be aware of is that a four-fold observed difference does not necessarily mean that a gene is expressed four times as much in a given condition!

Clustering
One might be inclined to ask questions concerning the relationship among sequences in an experiment. Several approaches have been suggested, including the following.

k-Means Clustering
k-Means clustering attempts to partition the results into groups that have similar expression patterns, where k is the number of clusters into which the user believes the data should fall. There are three steps in the k-Means clustering algorithm:

1) Randomly assign each of the data points to one of the k clusters
2) Calculate the mean inter- and intraclass distances
3) Minimize the mean intraclass distances and maximize the mean interclass distances using an iterative approach.

EXAMPLE: rana.lbl.gov/FuzzyK/images/figure2.html
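The k-Means steps above can be sketched in a few lines. This is a minimal illustration that assumes the usual iteration (recompute each cluster mean, then move every point to the cluster with the nearest mean), which reduces the within-cluster distances at each pass. The function names, the fixed random seed, and the handling of an empty cluster are choices made for this example only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each data point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Compute the mean vector of every cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                       # empty cluster: restart it on a random point
                members = [rng.choice(points)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Move each point to the cluster whose mean is closest.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:                  # nothing moved: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical one-dimensional expression values forming two obvious groups.
values = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
labels, centroids = kmeans(values, k=2)
print(labels, centroids)
```

Which group receives label 0 and which receives label 1 depends on the random start, so only the grouping itself should be interpreted.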

Hierarchical Clustering
Hierarchical clustering creates a “phylogeny” or hierarchy of the data points by employing the following algorithm:

1) Generate a gene similarity score for all pairs of genes
2) Place the gene similarity scores in a matrix
3) Join the pair of genes that has the highest score
4) Continue to join the next most similar pairs of genes or clusters until a single hierarchy remains

Hierarchical clustering methods include: complete-linkage clustering, average-linkage clustering, weighted pair-group averaging, and within pair-group averaging. Clustering approaches have several disadvantages, and should be used with extreme caution (if they are used at all).
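The joining procedure for hierarchical clustering can be sketched as follows, using average linkage. The similarity score here is the negative Euclidean distance between expression profiles, which is an assumption for the example (a correlation coefficient is another common choice), and the gene names and profiles are made up.

```python
import math
from itertools import combinations


def similarity(a, b):
    """Similarity score: negative Euclidean distance (higher means more similar)."""
    return -math.dist(a, b)


def hierarchical_cluster(profiles):
    """Average-linkage clustering; returns the merges in the order they were made."""
    clusters = {name: [name] for name in profiles}    # cluster label -> member genes
    merges = []
    while len(clusters) > 1:

        def average_similarity(pair):
            x, y = pair
            scores = [similarity(profiles[g], profiles[h])
                      for g in clusters[x] for h in clusters[y]]
            return sum(scores) / len(scores)

        # Join the two clusters with the highest average similarity.
        x, y = max(combinations(clusters, 2), key=average_similarity)
        merges.append((x, y))
        clusters[f"({x},{y})"] = clusters.pop(x) + clusters.pop(y)
    return merges


# Hypothetical two-condition expression profiles.
profiles = {"g1": (0.0, 0.0), "g2": (0.0, 1.0), "g3": (5.0, 5.0), "g4": (5.0, 6.5)}
print(hierarchical_cluster(profiles))
```

The nested labels such as "(g1,g2)" record the hierarchy: reading the merges in order gives the tree from the leaves up to the root.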

Image source: http://cfpub.epa.gov/ncer_abstracts/index.cfm/fuseaction/display.abstractDetail/abstract/975/report/2001

Self-Organizing Maps (SOMs)
SOMs are a type of neural network approach. A SOM has a set of nodes with a simple topology and a distance function on the nodes. The nodes are iteratively mapped into a k-dimensional gene expression space. The steps in assembling a SOM are as follows:

1) Random vectors are constructed and assigned to each partition
2) A gene is picked at random and the reference vector closest to that gene is identified
3) The reference vector is adjusted to be more similar to the vector of the assigned gene.
4) Steps 2 and 3 are iterated through, until the reference vectors converge.

Web page for SOMs: http://staff.aist.go.jp/utsugi-a/Lab/BSOM1/index.html
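The four SOM steps can be sketched as below. Several details are assumptions made for this example and are not stated in the steps: the nodes sit on a one-dimensional chain, neighbors of the winning node are moved half as far as the winner, the learning rate decays linearly, and a fixed number of updates stands in for a convergence test.

```python
import random


def train_som(genes, n_nodes=2, updates=200, rate=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(genes[0])
    # Step 1: one random reference vector per node.
    refs = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(updates):
        alpha = rate * (1 - t / updates)          # shrinking learning rate
        # Step 2: pick a gene at random and find the closest reference vector.
        gene = rng.choice(genes)
        winner = min(range(n_nodes),
                     key=lambda n: sum((r - g) ** 2 for r, g in zip(refs[n], gene)))
        # Step 3: pull the winner (and, more weakly, its chain neighbors) toward the gene.
        for n in range(n_nodes):
            gap = abs(n - winner)
            if gap > 1:
                continue
            step = alpha if gap == 0 else alpha / 2
            refs[n] = [r + step * (g - r) for r, g in zip(refs[n], gene)]
        # Step 4: repeat; here a fixed number of updates is used.
    return refs


# Hypothetical expression vectors scaled to the range 0..1.
genes = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
print(train_som(genes))
```

After training, each gene can be assigned to the node whose reference vector is closest, which gives the partition of the data.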

Support Vector Machines
Support Vector Machines (SVMs) are supervised machine learning techniques. These techniques organize the data by mapping the gene expression vectors into a higher dimensional space based on a kernel function. The SVM is trained to discriminate between positive and negative data points. SVMs find the hyperplane that maximizes the margin between the positive and negative data points.

Image source: iipl.jaist.ac.jp/research/svm/

Other Clustering Approaches
• Hidden Markov Models
• Genetic Algorithms
• Artificial Neural Networks

Read Mount, p519-526

Important Microarray Papers

Homework #4: Due 4/17/2003
Project #3: Due 4/14/2003
Final Project: Due 5/1/2003


Lecture 13 Protein Structure Prediction

Proteins are polypeptides that have a three dimensional structure. They can be described through four different hierarchical levels:

• Primary structure – the sequence of amino acids constituting the polypeptide chain.

• Secondary structure – the local organization of the parts of the polypeptide chain into secondary structures such as α helices and β sheets.

• Tertiary structure – the three dimensional arrangements of the amino acids as they react to one another due to the polarity and resulting interactions between their side chains.

• Quaternary structure – if a protein consists of several protein subunits held together, then the protein can be described as well by the number and relative positions of the subunits.

Once the polypeptide sequence (primary structure) of a protein has been determined, the next step is to determine the secondary and tertiary structure of the protein. The secondary structures of a protein are packed into a core region with a hydrophobic environment. Interactions between the amino acid side chains occur within the core structure. Outside of the core are loops and structural elements that come in contact with water, other proteins, and other structures.

Review of Protein Structure
Proteins are chains of amino acids joined by peptide bonds. Each amino acid in the chain is polar, meaning that it has separate positively and negatively charged regions. Each amino acid has a free C=O group (carbonyl), which can act as a hydrogen bond acceptor, and an NH group (amide), which can act as a hydrogen bond donor. Many conformations of the chain are possible due to rotation around the bonds to the alpha carbon (Cα) atom. These conformational changes lead to differences in the three-dimensional structure of the protein. Within a polypeptide chain, the pattern N-Cα-C is repeated. The angle of rotation about the bond between the backbone nitrogen and the Cα is the phi (φ) angle; the angle of rotation about the bond between the Cα and the carbonyl carbon is the psi (ψ) angle.
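The φ and ψ angles are dihedral (torsion) angles, each defined by four consecutive backbone atoms: φ by C(i-1), N, Cα, C and ψ by N, Cα, C, N(i+1). The sketch below computes a dihedral angle from four points using a standard vector formula. The function name and the test coordinates are made up for illustration, and the sign of the result depends on the convention chosen, so it should be checked against a trusted tool before being used for real φ/ψ values.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees for rotation about the p1-p2 bond."""
    def sub(a, b):
        return [x - y for x, y in zip(a, b)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)          # normals of the two planes
    b1_len = math.sqrt(dot(b1, b1))
    m1 = cross(n1, [x / b1_len for x in b1])       # completes a frame around the bond
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


# Four points lying in one plane on opposite sides of the central bond (trans).
print(abs(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))))
```

For ψ of residue i, the four points would be the coordinates of N, Cα and C of residue i followed by N of residue i+1.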

Image Source: Bioinformatics, Mount

The difference between each of the 20 amino acids is in the R side chains. Amino acids can be separated into distinct groups based on the chemical properties of the side chains:

• Hydrophobic: Alanine (A), Valine (V), Phenylalanine (F), Proline (P), Methionine (M), Isoleucine (I), and Leucine (L)
• Charged: Aspartic acid (D), Glutamic acid (E), Lysine (K), Arginine (R)
• Polar: Serine (S), Threonine (T), Tyrosine (Y), Histidine (H), Cysteine (C), Asparagine (N), Glutamine (Q), Tryptophan (W)

Secondary Structures

Image source: http://www.ebi.ac.uk/microarray/biology_intro.html

The core of each protein is made up of regular secondary structures that fold into a three-dimensional configuration. In these secondary structures, regular patterns of hydrogen bonds are formed between neighboring amino acids, and the amino acids have similar φ and ψ angles. These structures act to neutralize the polar groups on each amino acid. The secondary structures are tightly packed in the hydrophobic environment of the protein core, and thus each amino acid side group has a limited space to occupy and therefore a limited number of possible interactions.

Alpha Helix
The alpha helix (Picture, p 388) is the most abundant type of secondary structure in proteins. The helix has 3.6 amino acids per turn, with a hydrogen bond formed between every fourth residue. The average length of an alpha helix is 10 amino acids, or 3 turns, but it varies from 5 to 40 amino acids.

http://www.hhmi.princeton.edu/sw/2002/psidelsk/scavengerhunt.htm

http://www4.ocn.ne.jp/~bio/biology/protein.htm

Alpha helix structures are normally found on the surface of protein cores, where they interact with the aqueous environment. The inner-facing side of the helix tends to have hydrophobic amino acids, while the outer-facing side has hydrophilic amino acids. With 3.6 amino acids per turn, this means that roughly every third or fourth amino acid will tend to be hydrophobic. This is a pattern that can be detected computationally. Sequences rich in alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine (S) tend to form alpha helices.

Program to detect alpha helices:

Beta Sheet
Beta sheets are formed by hydrogen bonds between an average of 5-10 consecutive amino acids in one portion of the chain and another 5-10 farther down the chain. The interacting regions may be adjacent, with a short loop in between, or far apart, with other structures in between. If the chains run in the same direction, they form a parallel sheet. If they run in opposite directions, they form an antiparallel sheet. A mixed sheet may also be formed. The pattern of hydrogen bond formation differs between parallel and antiparallel sheets. Beta sheets have a slight counterclockwise rotation, and the alpha carbons (as well as the R side groups) alternate above and below the sheet in a pleated structure. Prediction of beta sheets is more difficult than prediction of alpha helices, due to the wide range of the φ and ψ angles.

http://broccoli.mfn.ki.se/pps_course_96/ss_960723_12.html

http://www4.ocn.ne.jp/~bio/biology/protein.htm

Image Source: Bioinformatics, Mount

Loops
Loops are regions of a protein chain between alpha helices and beta sheets. They have various lengths and three-dimensional configurations, and they are located on the surface of the structure. Hairpin loops represent a complete turn in the polypeptide chain, as is found in antiparallel beta sheets. Loops are allowed to be more variable as far as the sequence is concerned. They tend to have charged and polar amino acids and are frequently a component of active sites.

Coils
A region of secondary structure that is not a helix, sheet, or loop is commonly referred to as a coil.

Classes of Protein Structure:

1) Class α: bundles of α helices connected by loops on the surface of the proteins
2) Class β: antiparallel β sheets, usually two sheets in close contact forming a sandwich (enzymes, transport proteins, antibodies, virus coat proteins)
3) Class α/β: comprised mainly of parallel β sheets with intervening α helices; may also have mixed β sheets (metabolic enzymes)
4) Class α+β: composed mainly of segregated α helices and antiparallel β sheets
5) Multidomain (α and β): proteins comprising domains representing more than one of the above four classes
6) Membrane and cell-surface proteins and peptides, excluding proteins of the immune system

α class protein (hemoglobin)
β class protein (T-cell receptor CD8)
α/β class protein (tryptophan synthase)
α+β class protein (1RNB)
membrane protein (1OPF)

Sources:
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=3hhb;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics&pdbId=1cd8&page=&pid=
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=2wsy;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1rnb;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1opf;page=;pid=&opt=show&size=250

Protein Structure Databases
There are a number of databases that contain information on three-dimensional structures of proteins, where the structure has been solved using either X-ray crystallography or nuclear magnetic resonance (NMR) techniques. Examples of the available databases include:

• PDB
• SCOP
• PIR
• Swiss-Prot

The most extensive of these for 3-D structure is the Protein Data Bank (PDB). The current release of PDB (April 8, 2003) has 20,622 structures.

A full description of the PDB file format can be obtained at: http://www.rcsb.org/pdb/info.html

A partial example PDB file for the entry 3hhb is given below (the full file can be obtained at http://www.rcsb.org/pdb/cgi/explore.cgi?job=download;pdbId=3HHB;page=0&opt=show&format=PDB&pre=1):

ATOM   1  N    VAL  A  1   6.452  16.459  4.843  7.00  47.38  3HHB  162
ATOM   2  CA   VAL  A  1   7.060  17.792  4.760  6.00  48.47  3HHB  163
ATOM   3  C    VAL  A  1   8.561  17.703  5.038  6.00  37.13  3HHB  164
ATOM   4  O    VAL  A  1   8.992  17.182  6.072  8.00  36.25  3HHB  165
ATOM   5  CB   VAL  A  1   6.342  18.738  5.727  6.00  55.13  3HHB  166
ATOM   6  CG1  VAL  A  1   7.114  20.033  5.993  6.00  54.30  3HHB  167
ATOM   7  CG2  VAL  A  1   4.924  19.032  5.232  6.00  64.75  3HHB  168
ATOM   8  N    LEU  A  2   9.333  18.209  4.095  7.00  30.18  3HHB  169
ATOM   9  CA   LEU  A  2  10.785  18.159  4.237  6.00  35.60  3HHB  170
ATOM  10  C    LEU  A  2  11.247  19.305  5.133  6.00  35.47  3HHB  171
ATOM  11  O    LEU  A  2  11.017  20.477  4.819  8.00  37.64  3HHB  172
ATOM  12  CB   LEU  A  2  11.451  18.286  2.866  6.00  35.22  3HHB  173
ATOM  13  CG   LEU  A  2  11.081  17.137  1.927  6.00  31.04  3HHB  174
ATOM  14  CD1  LEU  A  2  11.766  17.306   .570  6.00  39.08  3HHB  175
ATOM  15  CD2  LEU  A  2  11.427  15.778  2.539  6.00  38.96  3HHB  176
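ATOM records like those in the excerpt can be read with a few lines of code. The sketch below splits each record on whitespace, which works for the excerpt as printed here; real PDB files use fixed-width columns, so a production parser should slice by column position instead. The Atom type and the parse_atom function are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int          # field 2: atom serial number
    name: str            # field 3: atom name (N, CA, C, O, ...)
    residue: str         # field 4: amino acid (three-letter code)
    chain: str           # field 5: chain identifier
    res_seq: int         # field 6: position of the amino acid in the chain
    x: float             # fields 7-9: coordinates in angstroms
    y: float
    z: float
    temp_factor: float   # field 11: temperature factor


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (assumes no fused columns)."""
    f = line.split()
    if f[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]), float(f[10]))


atom = parse_atom("ATOM 2 CA VAL A 1 7.060 17.792 4.760 6.00 48.47 3HHB 163")
print(atom.residue, atom.res_seq, (atom.x, atom.y, atom.z))
```

Collecting the coordinates of every record whose name is CA gives the Cα trace of the chain, which is the usual starting point for structure comparison.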

In each ATOM record:

• The second column is the atom serial number.
• The fourth column indicates the amino acid to which the atom belongs.
• The sixth column indicates the position of that amino acid in the polypeptide chain.
• Columns 7, 8, and 9 represent the x, y, and z coordinates (in angstroms).
• The 11th column represents the temperature factor, which can be used as a measurement of uncertainty.

Protein Structure Classification Databases

Structural Classification of Proteins (SCOP)
SCOP is based on expert definition of structural similarities. SCOP classifies by class, fold, superfamily, and family. SCOP is found at http://scop.mrc-lmb.cam.ac.uk/scop/

Classification by class, architecture, topology, and homology (CATH)
CATH classifies proteins into hierarchical levels by class, as SCOP does, except that α/β and α+β are considered to be a single class. CATH is located at http://www.biochem.ucl.ac.uk/bsm/cath/

Fold classification based on structure-structure alignment of proteins (FSSP)
FSSP is based on structure alignment of all pairwise combinations of the proteins in PDB using the structural alignment program DALI. Each protein is separated into individual domains, and the domains are aligned using DALI to find common folds. FSSP is located at http://www2.embl-ebi.ac.uk/dali/fssp/fssp.html

Molecular Modelling Database (MMDB)
MMDB categorizes structures from PDB into structurally related groups using the VAST structure alignment program, which looks for similar arrangements of secondary structural elements. MMDB has been incorporated into ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez

Spatial Arrangement of Backbone Fragments (SARF)
SARF provides a protein database categorized on structural similarities, similar to the MMDB. SARF is found at: http://www-lmmb.ncifcrf.gov/~nicka/sarf2.html

Viewing Protein Structures
There are a number of programs available that convert the atomic coordinates of the 3-D structures into views of the molecule. Viewers also allow the user to manipulate the molecule by rotation, zooming, etc. Such a viewer can be critical in drug design, since it yields insight into how the protein might interact with ligands at active sites. The most popular program for viewing 3-dimensional structures is Rasmol. The following is a list of the most popular viewers:

• Rasmol: http://www.umass.edu/microbio/rasmol/
• Chime: http://www.umass.edu/microbio/chime/
• Cn3D: http://www.ncbi.nlm.nih.gov/Structure/
• Mage: http://kinemage.biochem.duke.edu/website/kinhome.html
• Swiss 3D viewer: http://www.expasy.ch/spdbv/mainpage.html

In addition to viewing 3-dimensional structures, there are repositories for still images. One such site is the swissprot website: http://www.expasy.ch/databases/swiss-3dimage/IMAGES/

Alignment of Protein Structures
To perform a structural alignment, the three-dimensional structure of one protein is compared against the three-dimensional structure of a second protein, fitting together the atoms as closely as possible to minimize the average deviation. Structural similarity between proteins does not necessarily translate into an evolutionary relationship between the two. When structures are compared, the positions of atoms in the two three-dimensional structures are compared. Typically, methods to align structures look for the positions of secondary structural elements (helices and strands) within a protein domain to determine whether or not the structures are similar. Distances between the carbon atoms are examined to determine the degree to which the structures may be superimposed. Additional information about the side chains (such as whether they are buried or visible) can be used as well.

Secondary Structure Alignment Program (SSAP)
SSAP uses a method called double dynamic programming to produce a structural alignment between two proteins. A local structural environment is created for each residue in each sequence. This environment is defined by the degree of burial in the hydrophobic core of the protein and the type of secondary structure to which the residue belongs. One of the environment variables is a representation of the geometry of the protein, obtained by drawing a series of vectors from the Cβ atom of an amino acid to the Cβ atoms of all of the other amino acids in the protein. If the geometric views in two protein structures are similar, the structures must also be similar. These structural environments are compared to produce matching residues. Steps involved in SSAP:

1) Calculate vectors from the Cβ atom of one amino acid to the Cβ atoms of a set of other nearby amino acids. The resulting vectors from two separate proteins are compared, and a difference (expressed as an angle) is calculated. A score for this difference is then computed.

2) A matrix for the scores of vector differences from one protein to the next is computed.

3) An optimal alignment is found using global dynamic programming, with a constant gap penalty.

4) The next amino acid residue in one of the sequences is considered, and an optimal path to align this amino acid to the second sequence is computed using the steps above.

5) Resulting alignments are then transferred to a summary matrix. If the paths cross the same matrix position, the scores are summed. If part of the alignment path is found in both matrices, then there is evidence of similarity between the vectors.

6) When all of the alignments have been placed in the summary matrix, a dynamic programming alignment is performed for the summary matrix. The final alignment represents the optimal alignment between the protein structures. The resulting score is converted such that it can be compared to see how closely related the two structures are to each other.

Image Source: Bioinformatics, Mount (p420)

Distance Matrix
The distance matrix method uses a graphical procedure similar to dot plots to identify the atoms that lie most closely together in the three-dimensional structure. If two proteins have a similar structure, then their resulting dot plots can be superimposed. For the dot plot, the sequence of the protein is listed along both axes. The values in the distance matrix represent the distances between the corresponding Cα atoms in the three-dimensional structure. The positions of the closest packing atoms are marked with a dot to highlight regions of interest. Similar groups of secondary structural elements are superimposed as closely as possible by minimizing the sum of the atomic distances.

Distance Alignment Tool (DALI)
DALI is one example of a program that uses the distance matrix method to align protein structures. Existing structures that have been compared to one another are organized into the FSSP database. The assembly step of DALI uses a Monte Carlo simulation strategy to find submatrices that can be aligned with one another.

Fast Structural Similarity Search
One way to quickly compare two structures is to compare the types and arrangements of the secondary structures within two proteins. If the elements are similarly arranged, the three-dimensional structures are similar. VAST and SARF are example programs that use these methods to compare two structures.

Structural Motifs based on Sequence Analysis
A few structural elements can be determined by looking at the sequence composition. Examples of such structures include zinc finger motifs, leucine zippers, and coiled-coil structures. Zinc finger motifs can be found by looking at the order and spacing of cysteine and histidine residues in a sequence. Typical zinc finger motifs are composed of two cysteines followed by two histidines.
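The distance matrix described in this section is simple to compute once the Cα coordinates are available. The sketch below builds the matrix and then lists the pairs that would be marked with a dot. The 8 Å cutoff, the minimum sequence separation of 3, and the toy coordinates are assumptions chosen for illustration; they are not the parameters used by DALI.

```python
import math


def distance_matrix(ca_coords):
    """Matrix of distances (in angstroms) between all pairs of Cα atoms."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def contacts(matrix, cutoff=8.0, min_separation=3):
    """Pairs (i, j) that would be marked with a dot: close in space, apart in sequence."""
    n = len(matrix)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if matrix[i][j] < cutoff]


# Toy hairpin: six Cα atoms, 3.8 angstroms apart, with the chain folding back on itself.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(contacts(distance_matrix(hairpin)))
```

Two structures can then be compared by looking for matching patterns of dots in their matrices, which is the idea behind superimposing the dot plots.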

Image source: www.bmb.psu.edu/faculty/tan/lab/tanlab_gallery_protdna.html

Leucine zippers can be found by looking for two antiparallel alpha helices held together by interactions between hydrophobic leucine residues found at every seventh position in the helix.
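The "leucine at every seventh position" rule can be searched for with a regular expression. The sketch below looks for four leucines spaced seven residues apart (L, six residues of any kind, L, and so on). Requiring four repeats is an assumption made for this example, and a match only flags a candidate region; it does not show that the region actually forms a helix.

```python
import re

# Lookahead so that overlapping candidates are all reported.
LEUCINE_REPEAT = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_repeats(seq):
    """Start positions (0-based) of L-x(6)-L-x(6)-L-x(6)-L patterns in a protein sequence."""
    return [m.start() for m in LEUCINE_REPEAT.finditer(seq.upper())]


# Made-up test sequence with leucines at positions 2, 9, 16 and 23.
print(find_leucine_repeats("MK" + "LAAAAAA" * 3 + "L" + "GG"))
```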

Image source: ww2.mcgill.ca/biology/undergra/c200a/sec3-5.htm

Coiled-coil structures have two to three alpha helices coiled around each other in a left-handed supercoil. They may be predicted by searching for a 7-residue periodicity.

COILS2 (http://www.ch.embnet.org/software/COILS_form.html) is a package to detect coiled-coil regions.

Transmembrane-spanning Proteins
Membrane proteins traverse the membrane back and forth through a series of alpha helices composed of amino acids with hydrophobic side chains. The typical length of these regions is 20-30 residues. Therefore, these protein regions can be detected by scanning for hydrophobic regions around 19 residues in length. Membrane-spanning alpha helices tend to have hydrophobic residues on the inside-facing portions and hydrophilic residues on the outside or exposed portions.
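Scanning for hydrophobic stretches of about 19 residues can be sketched with a sliding window. The example below uses the Kyte-Doolittle hydropathy scale and reports windows whose average score reaches 1.6, a cutoff commonly quoted for that scale with a 19-residue window. The choice of scale, the cutoff, and the function name are assumptions for this example and are not taken from PHDhtm or TMpred.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Return (start, mean hydropathy) for every window at or above the threshold."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        score = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


# Made-up sequence: a 19-residue leucine stretch flanked by charged residues.
candidate = "DEKR" * 3 + "L" * 19 + "DEKR" * 3
print(hydrophobic_windows(candidate))
```

Overlapping hits belong to the same candidate region; the window with the highest score marks its center.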

Image source: http://www.northwestern.edu/neurobiology/faculty/pinto2/pinto_12big.jpg

PHDhtm is a program that is used to predict membrane-spanning helices. PHDhtm employs a neural network approach, where the neural network is trained to recognize sequence patterns and variations of helices in transmembrane proteins of known structure. The details for training PHDhtm are given in Mount, p437-439. TMpred is another program that predicts alpha helices of transmembrane proteins. It functions by searching a protein against a sequence scoring matrix that has been obtained by aligning the sequences of all the transmembrane alpha helix regions that are known.

Secondary Structure Prediction Approaches

Chou-Fasman and GOR methods
The Chou-Fasman method is based on analyzing the frequency of amino acids in the different secondary structures. For instance, it was determined that A, E, L, and M are strong predictors of alpha helices, while P and G are predictors of a break in a helix. A table of predictive values was created for alpha helices, beta sheets, and turns. The structure with the greatest overall prediction value greater than 1 is used to determine the structure for that region. The GOR method improves upon the Chou-Fasman method by assuming that the amino acids surrounding the central amino acid influence the secondary structure that the central amino acid is likely to adopt, as opposed to the central amino acid alone determining the secondary structure. Scoring matrices are used in the GOR method, which incorporates both information theory and Bayesian statistics. Details of the GOR method are provided in Mount, p450-451.

Neural Network Models
In the neural network approach, programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in structures. PHD and NNPREDICT are two programs that incorporate neural network models.

Nearest-Neighbor Methods
Nearest-neighbor methods are also a type of machine learning method. The secondary structure conformation of an amino acid in the query is calculated by identifying sequences of known structures that are similar to the query by looking at the surrounding amino acids. The programs using the nearest-neighbor methods include PSSP, Simpa96, SOPM, and SOPMA.
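The Chou-Fasman idea of averaging per-residue predictive values over a region and comparing the result with 1 can be sketched as follows. Only the six residues named in this section are given values, using commonly quoted helix propensities; every other residue is treated as neutral (1.0), which is a simplification made for this example. The window length of 6 is likewise an assumption, and the full method also uses beta-sheet and turn tables and extension rules that are omitted here.

```python
# Commonly quoted helix propensities for the residues named in the text;
# treat the numbers as illustrative and check them against the published table.
HELIX_PROPENSITY = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "P": 0.57, "G": 0.57}


def helix_windows(seq, window=6, cutoff=1.0):
    """Start positions of windows whose mean helix propensity exceeds the cutoff."""
    seq = seq.upper()
    starts = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(HELIX_PROPENSITY.get(aa, 1.0) for aa in segment) / window
        if mean > cutoff:
            starts.append(start)
    return starts


# Made-up sequence: helix formers followed by helix breakers.
print(helix_windows("AELMAEPGPGPG"))
```

The GOR method would replace the single-residue lookup with a score that depends on the residues surrounding each position.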

Prediction of Three-dimensional Protein Structure

Retrieve examples of each of these:
α: hemoglobin
β: T-cell receptor CD8

Protein classification

Programs to predict secondary structure:
nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html)
nnpredict uses a two-layer, feed-forward neural network to determine the secondary structure classification.

Results for nnpredict:

Sequence:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR

Secondary structure prediction (H = helix, E = strand, - = no prediction):
-------H--HHHHHHH---H-HHHHHHHHHHH--------------------HEH----
HHHHHHHHHHH------HHHHHHHHHHH---------HHHHHHHHHHHHHH-------HH
HHHHHHHHHHHHHEEE-----

P389
Programs for viewing protein structure: RasMol …
Predicting Secondary Structure
Predicting Tertiary Structure
“Protein-folding Problem”

Threading
Threading is the most robust of the structure prediction techniques. It searches for structures that have a similar fold without apparent sequence similarity. Threading takes a query sequence whose structure is not known and threads it through the coordinates of a target protein whose structure has been solved using either X-ray crystallography or NMR. The sequence is moved position by position through the structure, subject to predetermined constraints. Thermodynamic calculations are made to determine the most energetically favorable and conformationally stable alignment of the query sequence against the target structure. Threading is a computationally intensive task.

Programs:
• Protein Structure Prediction Center: http://predictioncenter.llnl.gov/
• PIR
• Quaternary structure prediction: http://msd.ebi.ac.uk/Services/Quaternary/quaternary.html
• WHAT-IF
• LOOK
• SWISS-MODEL
• VAST
• DALI
• 3Dee
• FSSP
• PHD
• TOPITS
• SignalP: http://www.cbs.dtu.dk/services/SignalP/
• TMpred: http://www.isrec.isb.sib.ch/ftp-server/tmpred/www/TMPRED_form.html

Bryant, Altschul (1995); Eisenhaber (1995); Lemer (1995); Bryant, Lawrence (1993); Fetrow, Bryant (1993); Jones, Thornton (1996)


Lecture 14 Protein Structure Prediction

Distance Matrix
The distance matrix method uses a graphical procedure similar to dot plots to identify the atoms that lie most closely together in the three-dimensional structure. If two proteins have a similar structure, then their resulting dot plots can be superimposed. For the dot plot, the sequence of the protein is listed along both axes. The values in the distance matrix represent the distances between the corresponding Cα atoms in the three-dimensional structure. The positions of the closest packing atoms are marked with a dot to highlight regions of interest. Similar groups of secondary structural elements are superimposed as closely as possible by minimizing the sum of the atomic distances.

Distance Alignment Tool (DALI)
DALI is one example of a program that uses the distance matrix method to align protein structures. Existing structures that have been compared to one another are organized into the FSSP database. The assembly step of DALI uses a Monte Carlo simulation strategy to find submatrices that can be aligned with one another.

Fast Structural Similarity Search
One way to quickly compare two structures is to compare the types and arrangements of the secondary structures within two proteins. If the elements are similarly arranged, the three-dimensional structures are similar. VAST and SARF are example programs that use these methods to compare two structures.

Structural Motifs based on Sequence Analysis
A few structural elements can be determined by looking at the sequence composition. Examples of such structures include zinc finger motifs, leucine zippers, and coiled-coil structures. Zinc finger motifs can be found by looking at the order and spacing of cysteine and histidine residues in a sequence. Typical zinc finger motifs are composed of two cysteines followed by two histidines.

Image source: www.bmb.psu.edu/faculty/tan/lab/tanlab_gallery_protdna.html

Leucine zippers can be found by looking for two antiparallel alpha helices held together by interactions between hydrophobic leucine residues found at every seventh position in the helix.

Image source: ww2.mcgill.ca/biology/undergra/c200a/sec3-5.htm

Coiled-coil structures have two to three alpha helices coiled around each other in a left-handed supercoil. They may be predicted by searching for a 7-residue periodicity.

COILS2 (http://www.ch.embnet.org/software/COILS_form.html) is a package to detect coiled-coil regions.

Transmembrane-spanning Proteins
Membrane proteins traverse the membrane back and forth through a series of alpha helices composed of amino acids with hydrophobic side chains. The typical length of these regions is 20-30 residues. Therefore, these protein regions can be detected by scanning for hydrophobic regions around 19 residues in length. Membrane-spanning alpha helices tend to have hydrophobic residues on the inside-facing portions and hydrophilic residues on the outside or exposed portions.

Image source: http://www.northwestern.edu/neurobiology/faculty/pinto2/pinto_12big.jpg

PHDhtm is a program used to predict membrane-spanning helices. It employs a neural network trained to recognize sequence patterns and variations of helices in transmembrane proteins of known structure. The details of training PHDhtm are given in Mount, p437-439.

TMpred is another program that predicts alpha helices of transmembrane proteins. It works by searching a protein against a sequence scoring matrix obtained by aligning the sequences of all known transmembrane alpha helix regions.

Secondary Structure Prediction Approaches

Chou-Fasman and GOR methods

The Chou-Fasman method is based on analyzing the frequency of amino acids in the different secondary structures. For instance, it was determined that A, E, L, and M are strong predictors of alpha helices, while P and G predict a break in a helix. A table of predictive values was created for alpha helices, beta sheets, and turns. The structure with the greatest overall prediction value greater than 1 is assigned to that region.

The GOR method improves upon the Chou-Fasman method by assuming that the amino acids surrounding the central amino acid influence the secondary structure that the central amino acid is likely to adopt, rather than the central amino acid determining the structure on its own. Scoring matrices are used in the GOR method, which incorporates both information theory and Bayesian statistics. Details of the GOR method are provided in Mount, p450-451.

Neural Network Models

In the neural network approach, programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in structures. PHD and NNPREDICT are two programs that incorporate neural network models.

Nearest-Neighbor Methods

Nearest-neighbor methods are also a type of machine learning method.
The secondary structure conformation of an amino acid in the query is predicted by identifying sequences of known structure that are similar to the query, judged by the surrounding amino acids. The programs using nearest-neighbor methods include PSSP, Simpa96, SOPM, and SOPMA.
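To make the nearest-neighbor idea concrete, here is a minimal sketch. It is not how PSSP, Simpa96, SOPM, or SOPMA are implemented. It assumes a tiny invented database of sequences with known states (H = helix, E = strand, C = coil), compares 7-residue windows by simple identity, and lets the three most similar windows vote on the state of the central residue. All names, data, and parameters are hypothetical.

```python
from collections import Counter

# Invented toy database of (sequence, known secondary structure) pairs.
KNOWN = [
    ("AEELLKKAEELLKK", "HHHHHHHHHHHHHH"),
    ("VTVTVYVTVKVTVY", "EEEEEEEEEEEEEE"),
    ("GSPGNGSPGDGSPG", "CCCCCCCCCCCCCC"),
]

def windows(sequence, half):
    """Yield (center_index, window) for every full window of width 2*half+1."""
    width = 2 * half + 1
    for i in range(len(sequence) - width + 1):
        yield i + half, sequence[i:i + width]

def identity(a, b):
    """Number of positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b))

def predict(query, known=KNOWN, half=3, k=3):
    """Predict a state per residue; ends without a full window stay '-'."""
    library = [(w, states[center])
               for seq, states in known
               for center, w in windows(seq, half)]
    prediction = ["-"] * len(query)
    for center, w in windows(query, half):
        nearest = sorted(library, key=lambda item: identity(w, item[0]),
                         reverse=True)[:k]
        votes = Counter(state for _, state in nearest)
        prediction[center] = votes.most_common(1)[0][0]
    return "".join(prediction)
```

Real programs replace the identity count with a substitution scoring matrix and use a much larger database of solved structures.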

Prediction of Three-dimensional Protein Structure

Threading

Threading is the most robust of structure prediction techniques. It searches for structures that have a similar fold without apparent sequence similarity. Threading takes a query sequence whose structure is not known and threads it through the coordinates of a target protein whose structure has been solved, using either X-ray crystallography or NMR spectroscopy. The sequence is moved position by position through the structure, subject to predetermined constraints. Thermodynamic calculations are made to determine the most energetically favorable and conformationally stable alignment of the query sequence against the target structure. Threading is a computationally intensive task and requires a great deal of knowledge about protein structure.

Environmental template method

In the environmental template method, the environment of each amino acid in each known structural core is determined, including the secondary structure, the area of the side chain that is buried by closeness to other atoms, and the types of nearby side chains. Each position is classified into one of 18 types: six representing increasing levels of residue burial, combined with three classes of secondary structure (alpha helices, beta sheets, and loops). Each amino acid is then assessed for its ability to fit into that type of structure.

Residue contact potential

The number of contacts and the closeness between amino acids in the core are analyzed. The query sequence is evaluated for amino acid interactions that will correspond to those in the core and that will contribute to the stability of the protein. The most energetically stable conformations are the most likely three-dimensional structures.

Structure profile method

Predictions as to which amino acids are able to fit into a structural position are given as a sequence profile. Substitutions in different structures have different effects; substitutions in loops, for example, are subject to fewer constraints. A structure profile is created for each core in the PDB. These profiles are then used to score the query sequence for compatibility with that core. The structural profile is a table of scores with one row for each amino acid position in the core, a column for each amino acid substitution at that position, and two columns for deletion penalties. A dynamic programming algorithm is used to identify an optimal, best-scoring alignment.

Threading Services

• 123D http://www-lmmb.ncifcrf.gov/~nicka/123D.html
• 3D-PSSM
• Honig lab
• Libra I
• NCBI structure site
• Profit
• Threader 2
• TOPITS
• UCLA-DOE structure prediction server

DNA Sequencing

Sequencing DNA is a routine molecular biology technique. The most common form of DNA sequencing used today is the Sanger dideoxynucleotide chain termination method. In this method, new strands of DNA complementary to a single-stranded DNA template are synthesized. The template DNA is supplied with a mixture of all four deoxynucleotides (A, C, G, T) along with four dideoxynucleotides (A, C, G, T) that terminate the elongation of the DNA strand. Each dideoxynucleotide is labeled with a different color fluorescent tag. The result is a set of DNA fragments of different lengths, each ending in a labeled dideoxynucleotide. The fragments are separated by size using a technique known as gel electrophoresis. As each labeled DNA fragment passes a laser detector, the color is recorded. The DNA sequence is then reconstructed from the pattern of colors.
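The reconstruction step can be sketched in a few lines: order the fragments by length and read off the base implied by each color. The color-to-base mapping below is an illustrative assumption (actual dye assignments depend on the chemistry used), and the function names are hypothetical.

```python
# Assumed color assignment for the four labeled dideoxynucleotides.
COLOR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}

def read_synthesized_strand(fragments):
    """fragments: (length, color) pairs, one per terminated fragment.

    Shorter fragments pass the detector first, so sorting by length gives
    the order in which the colors are recorded.
    """
    ordered = sorted(fragments)
    return "".join(COLOR_TO_BASE[color] for _, color in ordered)

def template_from_read(read):
    """The template is the reverse complement of the synthesized strand."""
    complement = str.maketrans("ACGT", "TGCA")
    return read.translate(complement)[::-1]

# Made-up fragments, listed out of order on purpose.
fragments = [(3, "yellow"), (1, "green"), (4, "red"), (2, "blue")]
read = read_synthesized_strand(fragments)
```

In practice the detector records a continuous trace of intensities for each color, and converting that trace into bases is the base-calling problem discussed next.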

www.ncbi.nlm.nih.gov/About/primer/genetics_molecular.html

http://jcsmr.anu.edu.au/group_pages/brf/services/DNA%20sequencing/Templiphi.html

Automated Sequencing Machine

www.csic.es/mostrar/tecnicas/area2/iib1/abi377.htm

The procedure of determining the actual base that is represented is referred to as base-calling. Often, automated sequencers have software installed that automatically takes the trace data and calculates the bases. This information can also be used by programs such as PHRED. For each base, there is an associated quality value, which represents the probability that the base has been called correctly. Typically, the beginning and end of the sequence will have lower values. These low quality regions are usually trimmed from the final data. The PHRED quality value is calculated by the following formula:

$$\mathrm{QualityValue} = -10 \log_{10}(P_e)$$

where P_e is the probability that the base is an error.

PHRED: http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm
TraceTuner: http://www.paracel.com/tracetuner/
http://www.phrap.com/
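The quality value formula is easy to check numerically. The sketch below converts between error probability and quality value, and trims low-quality ends from a read. The cutoff of 20 and the simple end-trimming rule are my assumptions for illustration; they are not a description of how PHRED or any particular pipeline trims.

```python
import math

def phred_quality(p_error):
    """Quality value for a base with error probability p_error."""
    return -10 * math.log10(p_error)

def error_probability(quality):
    """Inverse of phred_quality."""
    return 10 ** (-quality / 10)

def trim_low_quality_ends(qualities, min_quality=20):
    """Return (start, end) of the read after dropping low-quality ends.

    Bases are removed from each end until one with quality >= min_quality
    is reached; the end index is exclusive.
    """
    start, end = 0, len(qualities)
    while start < end and qualities[start] < min_quality:
        start += 1
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return start, end
```

For example, an error probability of 1 in 1000 corresponds to a quality value of 30, and a quality value of 20 corresponds to an error probability of 1 in 100.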

Berno A. (1996) A graph theoretic approach to the analysis of DNA sequencing data. Genome Res, 6:80-91.

DNA sequence assembly: http://icb.ime.usp.br/tdr/material/arthur/assembly.pdf

Ewing B, Green P. (1998) Base-calling of automated sequencer traces using phred II. Genome Res, 8(3):186-194.

Genomic Sequencing

In order to sequence large molecules, such as chromosomes, the region to be sequenced must be purified and broken into random fragments of 100 kb or slightly larger, which are cloned into vectors such as yeast artificial chromosomes (YACs) or bacterial artificial chromosomes (BACs). The library of clones is screened to build contigs, which are sets of overlapping clones. Building such a map of overlapping clones is a very laborious procedure. Once a map is obtained, unique overlapping clones are chosen for sequencing. However, these molecules are too large for direct sequencing. In order to sequence each clone, the clone is broken down into subclones with some level of redundancy (typically 4x to 10x coverage). The subclones, on the order of 500 bases long, are then sequenced. It is then necessary to assemble the subclone sequences based on their overlaps.

Shotgun sequencing

Shotgun sequencing is the process of sequencing a whole genome without relying on map data. The idea is to sequence both ends of short (2 kb), medium (10 kb) and long (100 kb) DNA fragments and to use these end sequences as anchors. The genome is then randomly broken up into small (500 base) pieces, which are then sequenced. The problem of sequence assembly is much tougher with shotgun sequencing.

Comparison of sequencing strategies

Taken from Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the human genome. PNAS 99(6):3712-3716.

Sequence Assembly Programs

PHRAP (fragment assembly program) is the most widely used program for assembling the smaller pieces of each clone. Other programs, used to assemble whole genomes, include ARACHNE (MIT’s Whitehead center), GigAssembler (UCSC), and … The most valuable of these whole genome assembly techniques take into account various pieces of information concerning BAC ends, polymorphisms, and mapping markers in order to correctly orient and assemble the pieces of the genome.

Huang X, Madan A. (1999) CAP3: A DNA sequence assembly program. Genome Res, 9(9):868-877.
Bonfield JK, Smith K, Staden R. (1995) A new DNA sequence assembly program. Nucleic Acids Res, 23(24):4992-4999.
Mullikin JC, Ning Z. (2003) The phusion assembler. Genome Res, 13(1):81-90.

Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the human genome. PNAS 99(6):3712-3716.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Res, 12(1):177-189.
Venter C, et al. (2001) The sequence of the human genome. Science 291(5507):1304-1351.
Adams, et al. (2000) The genome sequence of Drosophila melanogaster. Science 287(5461):2185-2195.
Myers EW, et al. (2000) A whole-genome assembly of Drosophila. Science 287(5461):2196-2204.

Genome sequence assembly process: http://www.ncbi.nlm.nih.gov/genome/guide/build.html

Predicting Structural Features

Modeller: http://guitar.rockefeller.edu/modeller/modeller.html
Swiss-model: http://www.expasy.ch/swissmod/SWISS-MODEL.html
Whatif: http://www.cmbi.kun.nl/whatif/

DNA Sequencing and Assembly

Whole genome assemblers: Arachne, GigAssembler
