b ioinform atics alignment of biological sequences - databases and software

BBioinformioinformaticsatics

Alignment of biological sequencesAlignment of biological sequences - - databases and softwaredatabases and software

LU, 204,LU, 204, Juris ViJuris Viksnaksna

Praksē lietotās virkņu salīdzināšanas metodes

Smith-Waterman algoritms (1981)Lokālā salīdzināšanaLineāras gap penaltiesLieto substitūciju matricas

Vai laiks (n2) ir praktiski pieņemams?

Proteīna garums - ap 100 aminoskābju...Divu proteīnu salīdzināšana: 10000 op. (0.01 sek, pie 1 MHz)Proteīnu datubāze - 1000000 ieraksti...Meklēšana datubāzē: 108 op, 100 sekDivu datubāzu salīdzināšana - 1016 op, 25 gadi :(

Heiristiskas metodes - FASTA

• Hash table of short words in the query sequence• Go through DB and look for matches in the query hash

(linear in size of DB)• K-tuple determines word size (k-tup 1 is single aa)• Lipman & Pearson 1985

VLICTAVLMVLICTAAAVLICTMSDFFD

VLICT = _

[Adapted from M.Gerstein]

The FASTA Algorithm

4 steps:

• use lookup table to find all identities at least ktup long, find regions of identities (Fig.1A)

• rescan 10 regions (diagonals) with highest density of identities using PAM250 (Fig.1B)

• join regions if possible without decreasing score below threshold (Fig.1C)

• rescore ala Smith-Waterman 32 residues around initial region (Note: doesn’t save alignment) (Fig.1D)

FASTA Parameters

• ktup = 2 for proteins, 6 for DNA

• init1 Score after rescanning with PAM250 (or other)

• initn Score after joining regions

• opt Score after Smith-Waterman

[Fig.1 of Pearson and Lipman 1988]

FASTA algoritms

[Adapted from D Brutlag]

FASTA algoritms

FASTA histogrammas

• Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410

• Indexes query and DB• Starts with all overlapping words from query• Calculates “neighborhood” of each word using PAM matrix

and probability threshold matrix and probability threshold• Looks up all words and neighbors from query in database

index• Extends High Scoring Pairs (HSPs) left and right to maximal

length• Finds Maximal Segment Pairs (MSPs) between query and

database• Blast 1 does not permit gaps in alignments


Heiristiskas metodes - Blast

BLAST algoritms• Keyword search of all words of length w from

the in the query of length n in database of length m with score above threshold– w = 11 for nucleotide queries, 3 for proteins

• Do local alignment extension for each found keyword– Extend result until longest match above threshold is achieved

• Running time O(nm)

[Adapted from S.Daudenarde]

BLAST algoritms

Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

keyword

GVK 18GAK 16GIK 16GGK 14GLK 13GNK 12GRK 11GEK 11GDK 11

neighborhoodscore threshold

(T = 13)

Neighborhoodwords

High-scoring Pair (HSP)

extension


Original BLAST• Dictionary

– All words of length w

• Alignment– Ungapped extensions until score falls below some

threshold

• Output– All local alignments with score > statistical threshold


Original BLAST: Example

A C G A A G T A A G G T C C A G T

C

T

G

A

T

C C

T

G

G

A

T

T

G C

G

A

w = 4

Exact keyword match of GGTC

Extend diagonals with mismatches until score is under 50%

Output result:

GTAAGGTCCGTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)

Gapped BLAST: Example

A C G A A G T A A G G T C C A G T

C

T

G

A

T

C C

T

G

G

A

T

T

G C

G

A• Original BLAST exact keyword

search, THEN:

• Extend with gaps in a zone around ends of exact match until score < threshold then merge nearby alignments

• Output result:

GTAAGGTCC-AGTGTTAGGTCCTAGT

From lectures by Serafim Batzoglou (Stanford)

BLAST - Programmasblastpcompares an amino acid query sequence against a protein sequence database

blastncompares a nucleotide query sequence against a nucleotide sequence database

blastxcompares a nucleotide query sequence translated in all reading frames against a protein sequence database

tblastncompares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

tblastxcompares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive

PSI-BLAST and transitive sequence comparison

Rezultātu vizualizācija (dotplots)

Proteīnu virkņu formāti - FASTA>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human). MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLMTLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFTTILILMPTISLIENKMLKWA

Proteīnu virkņu formāti - SwissProtID CYB_HUMAN STANDARD; PRT; 380 AA.

AC P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7; AC Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8; AC Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4; DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot. DT 21-JUL-1986, sequence version 1. DT 19-SEP-2006, entry version 75. DE Cytochrome b. GN Name=MT-CYB; Synonyms=COB, CYTB, MTCYB; OS Homo sapiens (Human). OG Mitochondrion. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; OC Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP NUCLEOTIDE SEQUENCE. RC TISSUE=Placenta;RX MEDLINE=81173052; PubMed=7219534 [NCBI, ExPASy, EBI, Israel]; RA Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L., RA Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., RA Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.; RT "Sequence and organization of the human mitochondrial genome."; RL Nature 290:457-465(1981).

Proteīnu virkņu formāti - SwissProtFT VARIANT 360 360 T -> A.

FT /FTId=VAR_013667. FT VARIANT 368 368 T -> I. FT /FTId=VAR_013668. SQ SEQUENCE 380 AA; 42730 MW; 90097B07FF2C1FD8 CRC64; MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT TILILMPTIS LIENKMLKWA

Proteīnu virkņu formāti - NiceProt

Proteīnu virkņu datubāzes - UniProt

http://www.expasy.uniprot.org/

Proteīnu virkņu datubāzes - PIR

http://pir.georgetown.edu/

Proteīnu virkņu datubāzes - SwissProt

http://www.expasy.org/sprot/

http://www.expasy.org/sprot/

Proteīnu virkņu datubāzes - NRL-3D :)

Proteīnu struktūru datubāzes - PDB

http://www.pdb.org

Salīdzināšanas programmas - BLAST http://www.ncbi.nlm.nih.gov/BLAST/

http://www.ebi.ac.uk/blastall/

Salīdzināšanas programmas - FASTA http://fasta.bioch.virginia.edu/

http://www.ebi.ac.uk/fasta33/

http://fasta.bioch.virginia.edu/

http://www.ebi.ac.uk/fasta33/

Salīdzināšanas programmas - Smith-Waterman

http://www.ebi.ac.uk/MPsrch/

http://www.ebi.ac.uk/MPsrch/

Salīdzinājumu novērtēšana

S = Total Score

S(i,j) = similarity matrix score for aligning i and j

Sum is carried out over all aligned i and j

n = number of gaps (assuming no gap ext. penalty)

G = gap penalty

nGjiSSji

,

),(

Simplest score (for identity matrix) is S = # matches

What does a Score of 10 mean? What is the Right Cutoff?


Assessing sequence similarity

• Need to know how strong an alignment can be expected from chance alone

• “Chance” is the comparison of– Real but non-homologous sequences– Real sequences that are shuffled to preserve compositional

properties– Sequences that are generated randomly based upon a DNA or

protein sequence model (favored)


Model Random Sequence

• Necessary to evaluate the score of a match

• Take into account background– Adjust for G+C content

– Poly-A tails

– “Junk” sequences

– Codon bias


E-ValuesE-Value

Expectation value cutoff. An E-Value of 1 means that one would expect to find at most 1 alignment with a score as high as reported in a search of the given database. Setting the E-value higher will report more sequence matches

Resp., jo mazāka, jo labāk...

Rēķina apmēram šādi:

E = mn2-S’, kur m,n - virkņu garumi,S' - rēķina, piem., šādi: S’ = S–ln(K)

ln(2)

P-Values

P-Value

Varbūtība, ka šāds (vai labāks) rezultāts tiks iegūts "nejaušam"proteīnam

Arī, jo mazāka, jo labāk...

P-Values

• According to Poison’s distribution, the probability of finding b HSPs with a score S is given by:– (e-EEb)/b!

• For b = 0, that chance is:– e-E

• Thus the probability of finding at least one such HSP is:– P = 1 – e-E


P-Values• P(s > S) = .01

– P-value of .01 occurs at score threshold S (392 below) where score s from random comparison is greater than this threshold 1% of the time

• Likewise for P=.001 and so on.


Z-score

General Protein Search Principles

• Choose between local or global search algorithms

• Use most sensitive search algorithm available

• Original BLAST for no gaps• Smith-Waterman for most

sensitivity• FASTA with k-tuple 1 is a good

compromise• Gapped BLAST for well delimited

regions• PSI-BLAST for families• Initially BLOSUM62 and default

gap penalties

• If no significant results, use BLOSUM30 and lower gap penalties

• FASTA cutoff of .01

• Blast cutoff of .0001

• Examine results between exp. 0.05 and 10 for biological significance

• Ensure expected score is negative

• Beware of hits on long sequences or hits with unusual aa composition

• Reevaluate results of borderline significance using limited query region

• Segment long queries 300 amino acids

• Segment around known motifs

(some text adapted from D Brutlag)

Uzskatīsim, ka proteīni ir homologi, ja “līdzība” parsniedz kaut kadu slieksni t

true positives (tp) - s(a,b) t un a, b ir homologifalse positives (fp) - s(a,b) t un a, b nav homologi

true negatives (tn) - s(a,b) < t un a, b nav homologifalse negatives (fn) - s(a,b) < t un a, b ir homologi

Sensitivity = tp/n = tp/(tp+fn)Specificity = tn/n = tn(tn+fp)

ROC diagrammas

ROC diagrammas

http://en.wikipedia.org/wiki/Image:Roc-general.png

ROC diagrammas

Error rate (fraction of the “statements” that are false positives)

Coverage (roughly, fraction of sequences that one confidently “says something” about)

100%

100%Different score thresholds

Two “methods” (red is more effective)

Thresh=30

Thresh=20

Thresh=10

[sensitivity=tp/n=tp/(tp+fn)]

[Specificity = tn/n =tn/(tn+fp)]error rate = 1-specificity = fp/n


b ioinform atics alignment of biological sequences - databases and software

Documents

g g t c c

t c c t g g

g tc t g

t t g c g aw

examplea c g

dn g ir

otherinitn score

dnainit1 score