b ioinform atics alignment of biological sequences - databases and software

41
B B ioinform ioinform atics atics Alignment of biological sequences Alignment of biological sequences - - databases and software databases and software LU, 204, LU, 204, Juris Vi Juris Vi ksna ksna

Upload: india

Post on 12-Jan-2016

31 views

Category:

Documents


0 download

DESCRIPTION

B ioinform atics Alignment of biological sequences - databases and software. LU, 204, Juris V i ksna. Praksē lietotās virkņu salīdzināšanas metodes. Smith-Waterman algoritms (1981) Lokālā salīdzināšana Lineāras gap penalties Lieto substitūciju matricas. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: B ioinform atics Alignment  of biological sequences  - databases and software

BBioinformioinformaticsatics

Alignment of biological sequencesAlignment of biological sequences - - databases and softwaredatabases and software

LU, 204,LU, 204, Juris ViJuris Viksnaksna

Page 2: B ioinform atics Alignment  of biological sequences  - databases and software

Praksē lietotās virkņu salīdzināšanas metodes

Smith-Waterman algoritms (1981)Lokālā salīdzināšanaLineāras gap penaltiesLieto substitūciju matricas

Vai laiks (n2) ir praktiski pieņemams?

Proteīna garums - ap 100 aminoskābju...Divu proteīnu salīdzināšana: 10000 op. (0.01 sek, pie 1 MHz)Proteīnu datubāze - 1000000 ieraksti...Meklēšana datubāzē: 108 op, 100 sekDivu datubāzu salīdzināšana - 1016 op, 25 gadi :(

Page 3: B ioinform atics Alignment  of biological sequences  - databases and software

Heiristiskas metodes - FASTA

• Hash table of short words in the query sequence• Go through DB and look for matches in the query hash

(linear in size of DB)• K-tuple determines word size (k-tup 1 is single aa)• Lipman & Pearson 1985

VLICTAVLMVLICTAAAVLICTMSDFFD

VLICT = _

[Adapted from M.Gerstein]

Page 4: B ioinform atics Alignment  of biological sequences  - databases and software

The FASTA Algorithm

4 steps:

• use lookup table to find all identities at least ktup long, find regions of identities (Fig.1A)

• rescan 10 regions (diagonals) with highest density of identities using PAM250 (Fig.1B)

• join regions if possible without decreasing score below threshold (Fig.1C)

• rescore ala Smith-Waterman 32 residues around initial region (Note: doesn’t save alignment) (Fig.1D)

Page 5: B ioinform atics Alignment  of biological sequences  - databases and software

FASTA Parameters

• ktup = 2 for proteins, 6 for DNA

• init1 Score after rescanning with PAM250 (or other)

• initn Score after joining regions

• opt Score after Smith-Waterman

Page 6: B ioinform atics Alignment  of biological sequences  - databases and software

[Fig.1 of Pearson and Lipman 1988]

FASTA algoritms

Page 7: B ioinform atics Alignment  of biological sequences  - databases and software

[Adapted from D Brutlag]

FASTA algoritms

Page 8: B ioinform atics Alignment  of biological sequences  - databases and software

FASTA histogrammas

Page 9: B ioinform atics Alignment  of biological sequences  - databases and software

• Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410

• Indexes query and DB• Starts with all overlapping words from query• Calculates “neighborhood” of each word using PAM matrix

and probability threshold matrix and probability threshold• Looks up all words and neighbors from query in database

index• Extends High Scoring Pairs (HSPs) left and right to maximal

length• Finds Maximal Segment Pairs (MSPs) between query and

database• Blast 1 does not permit gaps in alignments

[Adapted from M.Gerstein]

Heiristiskas metodes - Blast

Page 10: B ioinform atics Alignment  of biological sequences  - databases and software

BLAST algoritms• Keyword search of all words of length w from

the in the query of length n in database of length m with score above threshold– w = 11 for nucleotide queries, 3 for proteins

• Do local alignment extension for each found keyword– Extend result until longest match above threshold is achieved

• Running time O(nm)

[Adapted from S.Daudenarde]

Page 11: B ioinform atics Alignment  of biological sequences  - databases and software

BLAST algoritms

Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

keyword

GVK 18GAK 16GIK 16GGK 14GLK 13GNK 12GRK 11GEK 11GDK 11

neighborhoodscore threshold

(T = 13)

Neighborhoodwords

High-scoring Pair (HSP)

extension

[Adapted from S.Daudenarde]

Page 12: B ioinform atics Alignment  of biological sequences  - databases and software

Original BLAST• Dictionary

– All words of length w

• Alignment– Ungapped extensions until score falls below some

threshold

• Output– All local alignments with score > statistical threshold

[Adapted from S.Daudenarde]

Page 13: B ioinform atics Alignment  of biological sequences  - databases and software

Original BLAST: Example

A C G A A G T A A G G T C C A G T

C

T

G

A

T

C C

T

G

G

A

T

T

G C

G

A

w = 4

Exact keyword match of GGTC

Extend diagonals with mismatches until score is under 50%

Output result:

GTAAGGTCCGTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)

Page 14: B ioinform atics Alignment  of biological sequences  - databases and software

Gapped BLAST: Example

A C G A A G T A A G G T C C A G T

C

T

G

A

T

C C

T

G

G

A

T

T

G C

G

A• Original BLAST exact keyword

search, THEN:

• Extend with gaps in a zone around ends of exact match until score < threshold then merge nearby alignments

• Output result:

GTAAGGTCC-AGTGTTAGGTCCTAGT

From lectures by Serafim Batzoglou (Stanford)

Page 15: B ioinform atics Alignment  of biological sequences  - databases and software

BLAST - Programmasblastpcompares an amino acid query sequence against a protein sequence database

blastncompares a nucleotide query sequence against a nucleotide sequence database

blastxcompares a nucleotide query sequence translated in all reading frames against a protein sequence database

tblastncompares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

tblastxcompares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive

Page 16: B ioinform atics Alignment  of biological sequences  - databases and software

PSI-BLAST and transitive sequence comparison

Page 17: B ioinform atics Alignment  of biological sequences  - databases and software

Rezultātu vizualizācija (dotplots)

Page 18: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu formāti - FASTA>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human). MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLMTLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFTTILILMPTISLIENKMLKWA

Page 19: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu formāti - SwissProtID CYB_HUMAN STANDARD; PRT; 380 AA.

AC P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7; AC Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8; AC Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4; DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot. DT 21-JUL-1986, sequence version 1. DT 19-SEP-2006, entry version 75. DE Cytochrome b. GN Name=MT-CYB; Synonyms=COB, CYTB, MTCYB; OS Homo sapiens (Human). OG Mitochondrion. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; OC Catarrhini; Hominidae; Homo. OX NCBI_TaxID=9606; RN [1] RP NUCLEOTIDE SEQUENCE. RC TISSUE=Placenta;RX MEDLINE=81173052; PubMed=7219534 [NCBI, ExPASy, EBI, Israel]; RA Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L., RA Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., RA Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.; RT "Sequence and organization of the human mitochondrial genome."; RL Nature 290:457-465(1981).

Page 20: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu formāti - SwissProtFT VARIANT 360 360 T -> A.

FT /FTId=VAR_013667. FT VARIANT 368 368 T -> I. FT /FTId=VAR_013668. SQ SEQUENCE 380 AA; 42730 MW; 90097B07FF2C1FD8 CRC64; MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT TILILMPTIS LIENKMLKWA

Page 21: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu formāti - NiceProt

Page 22: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu datubāzes - UniProt

http://www.expasy.uniprot.org/

Page 23: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu datubāzes - PIR

http://pir.georgetown.edu/

Page 24: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu datubāzes - SwissProt

http://www.expasy.org/sprot/

Page 25: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu virkņu datubāzes - NRL-3D :)

Page 26: B ioinform atics Alignment  of biological sequences  - databases and software

Proteīnu struktūru datubāzes - PDB

http://www.pdb.org

Page 27: B ioinform atics Alignment  of biological sequences  - databases and software

Salīdzināšanas programmas - BLAST http://www.ncbi.nlm.nih.gov/BLAST/

http://www.ebi.ac.uk/blastall/

Page 28: B ioinform atics Alignment  of biological sequences  - databases and software

Salīdzināšanas programmas - FASTA http://fasta.bioch.virginia.edu/

http://www.ebi.ac.uk/fasta33/

Page 29: B ioinform atics Alignment  of biological sequences  - databases and software

Salīdzināšanas programmas - Smith-Waterman

http://www.ebi.ac.uk/MPsrch/

Page 30: B ioinform atics Alignment  of biological sequences  - databases and software

Salīdzinājumu novērtēšana

S = Total Score

S(i,j) = similarity matrix score for aligning i and j

Sum is carried out over all aligned i and j

n = number of gaps (assuming no gap ext. penalty)

G = gap penalty

nGjiSSji

,

),(

Simplest score (for identity matrix) is S = # matches

What does a Score of 10 mean? What is the Right Cutoff?

[Adapted from M.Gerstein]

Page 31: B ioinform atics Alignment  of biological sequences  - databases and software

Assessing sequence similarity

• Need to know how strong an alignment can be expected from chance alone

• “Chance” is the comparison of– Real but non-homologous sequences– Real sequences that are shuffled to preserve compositional

properties– Sequences that are generated randomly based upon a DNA or

protein sequence model (favored)

[Adapted from S.Daudenarde]

Page 32: B ioinform atics Alignment  of biological sequences  - databases and software

Model Random Sequence

• Necessary to evaluate the score of a match

• Take into account background– Adjust for G+C content

– Poly-A tails

– “Junk” sequences

– Codon bias

[Adapted from S.Daudenarde]

Page 33: B ioinform atics Alignment  of biological sequences  - databases and software

E-ValuesE-Value

Expectation value cutoff. An E-Value of 1 means that one would expect to find at most 1 alignment with a score as high as reported in a search of the given database. Setting the E-value higher will report more sequence matches

Resp., jo mazāka, jo labāk...

Rēķina apmēram šādi:

E = mn2-S’, kur m,n - virkņu garumi,S' - rēķina, piem., šādi: S’ = S–ln(K)

ln(2)

Page 34: B ioinform atics Alignment  of biological sequences  - databases and software

P-Values

P-Value

Varbūtība, ka šāds (vai labāks) rezultāts tiks iegūts "nejaušam"proteīnam

Arī, jo mazāka, jo labāk...

Page 35: B ioinform atics Alignment  of biological sequences  - databases and software

P-Values

• According to Poison’s distribution, the probability of finding b HSPs with a score S is given by:– (e-EEb)/b!

• For b = 0, that chance is:– e-E

• Thus the probability of finding at least one such HSP is:– P = 1 – e-E

[Adapted from S.Daudenarde]

Page 36: B ioinform atics Alignment  of biological sequences  - databases and software

P-Values• P(s > S) = .01

– P-value of .01 occurs at score threshold S (392 below) where score s from random comparison is greater than this threshold 1% of the time

• Likewise for P=.001 and so on.

[Adapted from M.Gerstein]

Page 37: B ioinform atics Alignment  of biological sequences  - databases and software

Z-score

Page 38: B ioinform atics Alignment  of biological sequences  - databases and software

General Protein Search Principles

• Choose between local or global search algorithms

• Use most sensitive search algorithm available

• Original BLAST for no gaps• Smith-Waterman for most

sensitivity• FASTA with k-tuple 1 is a good

compromise• Gapped BLAST for well delimited

regions• PSI-BLAST for families• Initially BLOSUM62 and default

gap penalties

• If no significant results, use BLOSUM30 and lower gap penalties

• FASTA cutoff of .01

• Blast cutoff of .0001

• Examine results between exp. 0.05 and 10 for biological significance

• Ensure expected score is negative

• Beware of hits on long sequences or hits with unusual aa composition

• Reevaluate results of borderline significance using limited query region

• Segment long queries 300 amino acids

• Segment around known motifs

(some text adapted from D Brutlag)

Page 39: B ioinform atics Alignment  of biological sequences  - databases and software

Uzskatīsim, ka proteīni ir homologi, ja “līdzība” parsniedz kaut kadu slieksni t

true positives (tp) - s(a,b) t un a, b ir homologifalse positives (fp) - s(a,b) t un a, b nav homologi

true negatives (tn) - s(a,b) < t un a, b nav homologifalse negatives (fn) - s(a,b) < t un a, b ir homologi

Sensitivity = tp/n = tp/(tp+fn)Specificity = tn/n = tn(tn+fp)

ROC diagrammas

Page 41: B ioinform atics Alignment  of biological sequences  - databases and software

ROC diagrammas

Error rate (fraction of the “statements” that are false positives)

Coverage (roughly, fraction of sequences that one confidently “says something” about)

100%

100%Different score thresholds

Two “methods” (red is more effective)

Thresh=30

Thresh=20

Thresh=10

[sensitivity=tp/n=tp/(tp+fn)]

[Specificity = tn/n =tn/(tn+fp)]error rate = 1-specificity = fp/n

[Adapted from M.Gerstein]