Analysis of biological sequences. (Lesk chapter 4)

Sequence alignment
• Sequence assembly
• Classification
• Prediction of function
• Comparative genomics
• Phylogeny / Evolutionary history

Pattern matching
• Prediction of protein function
• Identification of transcription regulatory sites

Recognition of signals / statistical properties / character relationships
• Gene prediction
• RNA and protein secondary structure prediction

[Figure: structure of a eukaryotic gene. Labels: regulatory elements, promoter, transcription start, translation start, exons, introns, translation stop, polyA signal, transcription stop.]

Expression from a eukaryotic gene

[Figure: DNA is transcribed into RNA (primary transcript), which is processed into RNA (spliced), which is translated into protein.]

Two-dimensional weight matrices are used in identification of splicing signals

          Exon    |      Intron
%G     11    74   |  100     0    29
%A     64     9   |    0     0    61
%U     13    12   |    0   100     7
%C     11     6   |    0     0     2
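A weight matrix like the one above can be turned into a score for any five-base window. The sketch below is a minimal illustration, not the method of a particular gene finder: the uniform background of 0.25 per base, the pseudocount, the function names and the reading of the columns as two exon positions followed by three intron positions are all assumptions.

```python
import math

# Percentages copied from the slide; one list entry per matrix column.
# Assumption: columns 1-2 are the last exon bases, columns 3-5 the first
# intron bases (based on the "Exon Intron" label).
FREQ_PERCENT = {
    "G": [11, 74, 100, 0, 29],
    "A": [64, 9, 0, 0, 61],
    "U": [13, 12, 0, 100, 7],
    "C": [11, 6, 0, 0, 2],
}
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 0.01  # assumed; avoids log(0) for bases never observed


def log_odds_score(site):
    """Sum of log2(observed / background) over the five matrix positions."""
    if len(site) != 5:
        raise ValueError("the matrix covers exactly 5 positions")
    score = 0.0
    for pos, base in enumerate(site):
        observed = FREQ_PERCENT[base][pos] / 100.0
        score += math.log2((observed + PSEUDOCOUNT) / BACKGROUND)
    return score


def best_site(rna):
    """Return (score, start index) of the best-scoring 5-base window."""
    return max((log_odds_score(rna[i:i + 5]), i) for i in range(len(rna) - 4))


if __name__ == "__main__":
    print(best_site("CCAGGUAAGU"))
```

Windows that lack G and U at the two invariant positions are penalised heavily, so the scan picks out the window AGGUA in the example sequence.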

Prediction of RNA secondary structure

5’ GCCUCUUGGC 3’

[Figure: the same sequence drawn folded back on itself as a hairpin, with the 5’ and 3’ ends marked.]

Problem of sequence alignment - interaction between molecular biology / computer science / statistics

* What biological problems are addressed?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Common implementations in molecular biology software packages
* Statistics and probability theory of alignments

Why do we want to align 2 sequences?

As one example, consider this common application:

We have a ‘new’ sequence. Is it similar to a previously known sequence?

Align it to all previously known sequences. (Many of these have annotation, such as a description of function.)

[Figure: the search ends in one of two outcomes, similarity or no similarity.]

Similarity allows:
• Prediction of function
• Phylogeny / evolutionary history

Basic concepts of protein sequence alignments

Proteins are homologous if they are related by divergence from a common ancestor.

Two kinds of homology:

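Orthologs: homologous proteins in different species that arose by speciation. They usually carry out the same function.

Paralogs: homologous proteins that arose by gene duplication. Within one organism they usually perform different but related functions.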

[Figure: speciation. Gene X in an ancestral organism is inherited by organism A (as X1) and organism B (as X2). X1 and X2 are orthologs.]

[Figure: gene duplication. Gene X is duplicated into Xa and Xb. Xa and Xb are paralogs.]

Mouse trypsin       -- orthologs --   Human trypsin
      |                                     |
   paralogs                              paralogs
      |                                     |
Mouse chymotrypsin  -- orthologs --   Human chymotrypsin

How do we know from an alignment whether two sequences are evolutionarily related?

This seems convincing:

GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK

But what about this:

VAKTSRNAPEEKASVG
IASGNRNFGEAYGRAG ?

We need some input from statistics / probability theory.

For instance, alignment methods like BLAST will ask: what is the probability that this match occurs by chance alone?
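The two example pairs can be compared numerically. The sketch below computes percent identity and then estimates, by shuffling one sequence, how often an identity at least as high arises by chance. The shuffle test is a swapped-in illustration of the question BLAST asks, not the statistic BLAST actually computes; the function names, the number of trials and the fixed seed are assumptions.

```python
import random


def percent_identity(a, b):
    """Percentage of positions with identical residues (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def shuffle_p_value(a, b, trials=10000, seed=0):
    """Fraction of shuffles of b that match a at least as well as b itself."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = percent_identity(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if percent_identity(a, "".join(letters)) >= observed:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    pairs = [
        ("GWFTREKLREEDHIKK", "GWFTKEKIREEDHIKK"),
        ("VAKTSRNAPEEKASVG", "IASGNRNFGEAYGRAG"),
    ]
    for a, b in pairs:
        print(percent_identity(a, b), shuffle_p_value(a, b))
```

The first pair is 87.5% identical and the second 31.25%; the shuffle estimate shows how much more easily the second level of identity can arise by chance.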

Comparing 2 sequences - Dotplot analysis

  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
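A dot plot is simple to compute: place a mark wherever the residue of one sequence equals the residue of the other. The sketch below is a minimal version with exact matches only; real dot plot programs usually add a window and a threshold, which is left out here. The function name and the text layout are assumptions.

```python
def dotplot(horizontal, vertical):
    """Return a text dot plot: one row per residue of `vertical`,
    one column per residue of `horizontal`, '*' where they are identical."""
    lines = ["  " + " ".join(horizontal)]
    for v in vertical:
        cells = ["*" if v == h else " " for h in horizontal]
        lines.append((v + " " + " ".join(cells)).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(dotplot("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

The main diagonal corresponds to the alignment; marks off the diagonal come from residues that occur more than once.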

Comparing 2 sequences - Gaps

  M A K L Q L G K R Y
M *
A   *
K     *         *
L       *   *
Q         *
G             *
A   *
L       *   *
G             *
K     *         *
R                 *
Y                   *

Sequence alignment

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y

Gap: the two positions marked - in the upper sequence.
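The gapped alignment of MAKLQLGKRY with MAKLQGALGKRY can be found automatically by dynamic programming. The sketch below is a minimal global alignment in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1 per position) and the function name are assumptions chosen for illustration, not values from the course.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, x, y = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
    print(s)
    print(x)
    print(y)
```

With these scores the ten matches and two gap positions give a total of 8, and the gap falls after the Q.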

Gaps are results of mutations (changes in DNA) that occur during evolution

For instance consider this deletion mutation:

DNA      AACTTGACG GACTGGGCGTAT CGGGCACCG
         TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein  N  L  T   D  W  A  Y   R  A  P

After deletion of the middle 12 base pairs:

DNA      AACTTGACG CGGGCACCG
         TTGAACTGC GCCCGTGGC
protein  N  L  T   R  A  P
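The effect of the deletion on the protein can be checked by translating both DNA sequences. The sketch below uses only the ten codons that occur in this example (a full genetic code table would be needed for other sequences); the names and the slice positions are assumptions for illustration.

```python
# Only the codons that occur in the example; standard genetic code.
CODONS = {
    "AAC": "N", "TTG": "L", "ACG": "T", "GAC": "D", "TGG": "W",
    "GCG": "A", "TAT": "Y", "CGG": "R", "GCA": "A", "CCG": "P",
}


def translate(dna):
    """Translate a coding strand, reading frame 1, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


ORIGINAL = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
# Deletion of bases 10-21 (the middle 12 bases, i.e. four codons).
MUTANT = ORIGINAL[:9] + ORIGINAL[21:]

if __name__ == "__main__":
    print(translate(ORIGINAL))  # NLTDWAYRAP
    print(translate(MUTANT))    # NLTRAP
```

Because the deletion removes a whole number of codons, the reading frame is kept and the two proteins align with a single gap of four residues.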

Comparing / aligning two sequences. Gaps

In pairwise comparison gaps cannot be inserted in an unrestricted manner. For this reason a penalty is assigned to gaps. Two parameters are frequently used in sequence comparison:

- Gap creation penalty
- Gap extension penalty

There are two parameters because it is more ’difficult’ to create a gap than to extend an existing gap.
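The two parameters combine into what is usually called an affine gap penalty. The sketch below assumes the convention penalty = creation + extension × length; programs differ on whether the first gap position is also charged the extension penalty, so this convention is an assumption. The default values 12 and 2 are illustrative only.

```python
def affine_gap_penalty(length, creation=12, extension=2):
    """Penalty for one gap of `length` positions.
    Assumed convention: creation + extension * length."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return creation + extension * length


if __name__ == "__main__":
    one_long_gap = affine_gap_penalty(4)
    four_short_gaps = 4 * affine_gap_penalty(1)
    print(one_long_gap, four_short_gaps)  # 20 56
```

One gap of four positions costs far less than four separate gaps of one position, which is how the scoring favours a few longer gaps over many scattered ones.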



Substitution matrices

Each amino acid change has a characteristic probability.
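A substitution matrix turns these probabilities into scores that are summed along an alignment. The sketch below scores the ungapped alignment of MAKLQGALGKRY with MAKIQGALAKRY. The values are a small hand-copied subset of the BLOSUM62 matrix, included only for the residue pairs in this example; check them against a published copy before relying on them. The function names are assumptions.

```python
# Hand-copied subset of BLOSUM62 (assumption: verify against a published matrix).
SCORES = {
    ("M", "M"): 5, ("A", "A"): 4, ("K", "K"): 5, ("Q", "Q"): 5,
    ("G", "G"): 6, ("L", "L"): 4, ("I", "I"): 4, ("R", "R"): 5,
    ("Y", "Y"): 7,
    ("I", "L"): 2,  # conservative change, positive score
    ("A", "G"): 0,  # neutral change
}


def pair_score(x, y):
    """Look up a residue pair; the matrix is symmetric."""
    key = (x, y) if (x, y) in SCORES else (y, x)
    return SCORES[key]


def alignment_score(a, b):
    """Sum of pair scores over an ungapped alignment of equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(alignment_score("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

The L/I mismatch still contributes a positive score, while the G/A mismatch contributes nothing, so not all mismatches are treated alike.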

Dot plot analysis reveals repeats



Searching databases with FASTA / BLAST

Improvement of speed as compared to the local alignment algorithm:

Initial search is for short words. Word hits are then extended in either direction.
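The word-based strategy can be sketched in a few lines. The code below indexes all words of length k in the subject, looks up each query word, and extends every hit in both directions. It is a deliberately simplified illustration: it extends only while residues are identical, whereas FASTA and BLAST extend using substitution scores and stop criteria of their own. The function names and the default k = 2 are assumptions.

```python
from collections import defaultdict


def word_index(seq, k):
    """Map every word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_and_extend(query, subject, k=2):
    """Return (length, query_start, subject_start) of the longest exact
    match found by extending word hits; (0, 0, 0) if there is no hit."""
    index = word_index(subject, k)
    best = (0, 0, 0)
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            left = 0  # extend to the left of the word hit
            while (q - left > 0 and s - left > 0
                   and query[q - left - 1] == subject[s - left - 1]):
                left += 1
            right = k  # extend to the right of the word hit
            while (q + right < len(query) and s + right < len(subject)
                   and query[q + right] == subject[s + right]):
                right += 1
            best = max(best, (left + right, q - left, s - left))
    return best


if __name__ == "__main__":
    print(seed_and_extend("MAKLQGALGKRY", "XXQGALGKXX"))
```

Only positions that share a word are ever examined, which is where the speed gain over filling a complete dynamic programming matrix comes from.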

Output from Fasta

Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                      opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4 (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2 ( 136)  105   35    0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S ( 217)  105   35    0.33
gi|6319639|ref|NP_009721.1| involved in the secre (  65)   92   31     1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1 ( 136)   93   32     1.8

>>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associa (75 aa)
 initn: 483 init1: 483 opt: 483 Z-score: 682.4 bits: 130.3 E(): 1.9e-30
Smith-Waterman score: 483; 100.000% identity in 75 aa overlap (1-75:1-75)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|458 MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
               10        20        30        40        50        60

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|458 GSAIFQIIQSIRMGM
               70

>>gi|7504801|pir||T23009 hypothetical protein F59F4.2 - (65 aa)
 initn: 227 init1: 143 opt: 251 Z-score: 365.3 bits: 71.4 E(): 8.5e-13
Smith-Waterman score: 251; 53.846% identity in 65 aa overlap (10-74:1-64)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :. :::. .::.. :::...::::::. . : :.:  ..:::..::.::::
gi|750          MAPKQRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVC
                        10        20        30         40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::.:.::. ..::
gi|750 GSAVFEIIRYVKMGW
               60

>>gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|212121 (136 aa)
 initn: 66 init1: 41 opt: 105 Z-score: 160.3 bits: 34.6 E(): 0.22
Smith-Waterman score: 105; 30.488% identity in 82 aa overlap (3-75:50-125)

                                           10        20        30
ramp4.                             MVGAGGAAKMVAKQRIRMANEKHSKNITQRGN
                                     :.::.:  : .: ::. :..:.. .  ::
gi|249 RLGLPAVRIPLNERERIQVDEPYILIVPSYGGGGTAGAVPRQVIRFLNDEHNRALL-RGV
      20        30        40        50        60        70

             40                 50        60        70
ramp4. VAKTSRNAPEEKASVG---------PWLLALFIFVVCGSAIFQIIQSIRMGM
       .:. .::  :  . .:         :::   . : . :.   . :...: :.
gi|249 IASGNRNFGEAYGRAGDVIARKCGVPWL---YRFELMGTQ--SDIENVRKGVTEFWQRQP
       80        90       100          110         120       130

Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr 457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST
 EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene;
 cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from
 this gene; cDNA EST EMBL:C0...
 Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

The different variants of BLAST

Program    Query      Database
blastp     Protein    Protein
blastn     DNA        DNA
tblastn    Protein    DNA
blastx     DNA        Protein
tblastx    DNA        DNA

BLAST and filtering of low-complexity sequence

In a BLAST search, low-complexity regions in the query sequence are filtered out by default.

Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.

Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.

Query:295 DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
          DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
Sbjct:87  DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD

Query:355 ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDXXXXXXXXXXXXXXXXXXXXSQAVYEVFKN
          ASSYAGKTCTLRIKLAPDGMLLDIKPEGGD                    SQAVYEVFKN
Sbjct:147 ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDPALCQAALAAAKLAKIPKPPSQAVYEVFKN

Query:415 APLDFKP 421
          APLDFKP
Sbjct:207 APLDFKP 213
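In the filtered alignment, masked query residues appear as X. The idea of masking can be imitated with a sliding window and Shannon entropy. The sketch below is a simplified stand-in, not the filter that BLAST actually uses; the window length of 12, the threshold of 2.2 bits and the function names are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = "X" * window
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("PPCDPPPPPKDKKKKDDGPP"))
    print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))
```

A window dominated by one or two residue types has low entropy and is masked, while a window of many different residues is left untouched. The threshold decides how aggressive the filter is and would need tuning on real data.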

Multiple sequence alignment

Multiple sequence alignment formatting: Mview

[Figure: multiple sequence alignment formatted with Mview; legible label: Identity.]

Pattern matching

Pattern matching is used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases.

Examples of patterns (regular expressions):

GAATTC : Recognition site for the restriction enzyme EcoRI

GDSGGP : Typical of serine proteases

[AG]-x(4)-G-K-[ST] : Motif A of the ATP/GTP-binding site

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H : Zinc finger proteins
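Patterns written in this dash-separated notation map directly onto regular expressions. The sketch below converts the subset of the notation used in the examples above (single residues, x for any residue, bracketed alternatives, and repeat counts in parentheses). Other features of the notation are not handled, and the function name is an assumption.

```python
import re

# One pattern element: x, a single residue, or [alternatives],
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert e.g. 'C-x(2,4)-C' to 'C.{2,4}C' (subset of the notation only)."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        symbol, low, high = m.groups()
        atom = "." if symbol == "x" else symbol
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(atom + repeat)
    return "".join(parts)


if __name__ == "__main__":
    zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
    print(zinc_finger)
    print(bool(re.search(zinc_finger, "TRRCKTTCREQLYSGATGGHHASSHGAQR")))
```

The converted zinc finger pattern matches the same test sequence that the Perl example in this section uses.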

Pattern matching with the unix utility ‘grep’

% grep "C.\{2,4\}C...[LIVMFYWC].\{8\}H.\{3,5\}H" sequence_file

Pattern matching with perl

# Test sequence containing a zinc finger motif
$seq = 'TRRCKTTCREQLYSGATGGHHASSHGAQR';
if ($seq =~ /C.{2,4}C...[LIVMFYWC].{8}H.{3,5}H/) {
    print "Found zinc finger motif\n";
}

WP: The program Findpatterns uses patterns to search one or more sequences. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
