Ch 11. Assessing Pairwise Sequence Similarity: BLAST and FASTA

IDB Lab., Seoul National University. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition

Upload: kayla

Post on 18-Feb-2016

Page 1: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Ch11. Assessing Pairwise Sequence Similarity: BLAST and FASTA

IDB Lab., Seoul National University

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition

Page 2: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Contents
  Introduction
  Global Versus Local Sequence Alignments
  Dotplots
  Scoring Matrices
  BLAST
  BLAST2Sequences
  MegaBLAST
  PSI-BLAST
  BLAT
  FASTA
  Comparing FASTA and BLAST
  Summary

Page 3: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

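Introduction
Sequence comparison: used to predict the function, location, and structure of proteins
Similarity and homology
  Similarity: how alike two sequences are
  Homology: a conclusion that may be drawn from sequence similarity and other evidence (the sequences either are or are not evolutionarily related)
  Ortholog: genes in different species that diverged from a common ancestral gene (e.g. – geneA , – geneA’)
  Paralog: a gene and the copy produced by its duplication coexisting within one organism (e.g. geneA’ –– geneA)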

Page 5: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(1/13)

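Global vs. local sequence alignments
  Global: compares the sequences over their entire length; suited to similar sequences of nearly equal length
  Local: compares parts of the sequences and finds the regions that are similar (works even when the lengths differ)
  Most biologists use local alignment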

Page 6: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(2/13)

Global alignment: Needleman-Wunsch algorithm

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Page 7: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(3/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8
Step: delete H
  H
  -

Page 8: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(4/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16
Step: delete E
  HE
  --

Page 9: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(5/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17
Step: replace A with P
  HEA
  --P

Page 10: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(6/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25
Step: delete G
  HEAG
  --P-

Page 11: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(7/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20
Step: replace A with A (match)
  HEAGA
  --P-A

Page 12: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(8/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20, -5
Step: replace W with W (match)
  HEAGAW
  --P-AW

Page 13: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(9/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20, -5, -13
Step: delete G
  HEAGAWG
  --P-AW-

Page 14: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(10/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20, -5, -13, -3
Step: replace H with H (match)
  HEAGAWGH
  --P-AW-H

Page 15: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(11/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3
Step: replace E with E (match)
  HEAGAWGHE
  --P-AW-HE

Page 16: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(12/13)

Global alignment: Needleman-Wunsch algorithm
Score matrix as on slide (2/13); cells on the alignment path so far: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5
Step: insert A
  HEAGAWGHE-
  --P-AW-HEA

Page 17: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Global Versus Local Sequence Alignments(13/13)

Global alignment: Needleman-Wunsch algorithm

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Cells on the alignment path: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1
Step: replace E with E (match)
  HEAGAWGHE-E
  --P-AW-HEAE
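
The walkthrough on the preceding slides can be reproduced with a short sketch. Two things are assumptions here, because the slides do not state them: the substitution scores are taken to be BLOSUM50 entries, and the gap penalty is taken to be a linear 8 per gap position (read off the first row of the matrix). Under these assumptions the recurrence reproduces the slide's matrix, except that the P/W cell comes out as -41 where the slide prints -42. The function and variable names are illustrative.

```python
# Substitution scores for the residue pairs that occur in this example.
# ASSUMPTION: these are BLOSUM50 entries; the slides do not name the matrix.
PAIR_SCORES = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}
GAP = 8  # ASSUMPTION: linear gap penalty, inferred from the first matrix row


def score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    return PAIR_SCORES["".join(sorted(a + b))]


def needleman_wunsch(x, y):
    """Globally align x (columns) with y (rows); return (F, top, bottom)."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = -GAP * j
    for i in range(1, rows):
        F[i][0] = -GAP * i
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(y[i - 1], x[j - 1]),  # substitution
                F[i - 1][j] - GAP,                            # gap in x
                F[i][j - 1] - GAP,                            # gap in y
            )
    # Trace back from the bottom-right cell; ties prefer the diagonal move.
    top, bottom = [], []
    i, j = len(y), len(x)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(y[i - 1], x[j - 1]):
            top.append(x[j - 1])
            bottom.append(y[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] - GAP:
            top.append(x[j - 1])
            bottom.append("-")
            j -= 1
        else:
            top.append("-")
            bottom.append(y[i - 1])
            i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])
print(top)
print(bottom)
```

The traceback here runs from the bottom-right cell back to the origin, whereas the slides reveal the same path from the top-left; both describe one alignment.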

Page 19: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

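Dotplots(1/4)
A graphical representation of the relationship between two sequences
Shows partial matches (forward / reverse), insertions, and deletions intuitively
Another method is needed to obtain exact values for which regions are similar and by how much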
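
A minimal text dotplot, as a sketch of the idea above: one sequence runs along the columns, the other along the rows, and a mark is placed wherever the residues (or short windows of residues) are identical. The `window` parameter and the output characters are illustrative choices, not something the slides specify.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return text rows; '*' marks positions where a window of residues matches."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = ""
        for i in range(len(seq_x) - window + 1):
            same = seq_x[i:i + window] == seq_y[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# A sequence compared with itself always shows the main diagonal;
# internal repeats show up as additional parallel diagonals.
for line in dotplot("GATTACAGATTACA", "GATTACAGATTACA", window=3):
    print(line)
```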

Page 20: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Dotplots(2/4) Comparison of HMGB1 with SOX-10
A global alignment cannot capture this kind of relationship

Page 21: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Dotplots(3/4) Comparison of mucin with itself

Page 22: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Dotplots(4/4) Comparison of achaete-scute protein with itself

Page 24: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

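Scoring Matrices(1/10)
Quantitative analysis of the similarity between sequences
Factors considered when constructing a scoring matrix
  Conservation: accounts for conservative substitutions
  Frequency: gives greater weight to rare residues
  Evolution: accounts for evolutionary distance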

Page 25: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

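Scoring Matrices(2/10) PAM Matrices

In 1978, Dayhoff surveyed substitution patterns in proteins that were at least 85% similar

“Sequences A and B are at an evolutionary distance of n PAM” ≡ “on average, n changes per 100 residues separate A and B” (1 PAM = one change per 100 residues)

The substitution matrix for sequences at an evolutionary distance of n PAM is called PAMn (PAMn = (PAM1)^n)

※ PAM scoring matrix: the matrix obtained by applying the log odds ratio (lod score; ☞ Box 11.1) to a PAM matrix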
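
The relation PAMn = (PAM1)^n is ordinary matrix exponentiation of a mutation probability matrix. The sketch below uses a made-up three-state matrix (`TOY_PAM1` is hypothetical and much smaller than the real 20 x 20 matrix) only to show the mechanics: each row stays a probability distribution, and the chance of a residue remaining unchanged shrinks as n grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# HYPOTHETICAL 3-state "PAM1": each row sums to 1 and the diagonal is 0.99,
# i.e. about one change per 100 residues.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.003, 0.007, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```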

Page 26: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Scoring Matrices(3/10) Example – PAM1

  A R N D C

A 0.9867 0.0002 0.0009 0.0010 0.0003

R 0.0001 0.9913 0.0001 0.0000 0.0001

N 0.0004 0.0001 0.9822 0.0036 0.0000

D 0.0006 0.0000 0.0042 0.9859 0.0000

C 0.0001 0.0001 0.0000 0.0000 0.9973

Page 27: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

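Scoring Matrices(4/10) Disadvantages of PAM matrices

The matrices are computed under the following assumptions
  Mutations are independent of one another (in practice PAMn ≠ (PAM1)^n)
  The mutation probability is the same at every position (in practice it depends on protein structure)
  Evolutionary tendencies do not change

The matrices were computed mostly from proteins discovered before 1978
  Biased toward small globular proteins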

Page 28: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

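Scoring Matrices(5/10) BLOSUM Matrices

In 1992, Henikoff and Henikoff surveyed motifs shared by many proteins and built the BLOCKS database

Substitution patterns were examined only in the regions of proteins that vary little

Considerably more accurate than the PAM matrices
BLOSUMn: the matrix built from sequences that are at most n% similar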

Page 29: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Scoring Matrices(6/10) Example – BLOSUM62

The more frequently a substitution occurs, the higher its score
The rarer an amino acid, the higher its score

     A    C    Q    W    Y    *
A    4    0   -1   -3   -2   -4
C    0    9   -3   -2   -2   -4
Q   -1   -3    5   -2   -1   -4
W   -3   -2   -2   11    2   -4
Y   -2   -2   -1    2    7   -4
*   -4   -4   -4   -4   -4    1
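
To show how such a matrix is used, here is a small sketch that scores an ungapped alignment by summing the matrix entry for each aligned pair. Only the five residues from the excerpt above are included, stored once per unordered pair because the matrix is symmetric; where the excerpt's rows disagree (A/Y, C/Y), the value from the Y row is used. The function name is illustrative.

```python
# Excerpt of BLOSUM62 for residues A, C, Q, W, Y (one entry per unordered pair).
BLOSUM62_EXCERPT = {
    "AA": 4, "AC": 0, "AQ": -1, "AW": -3, "AY": -2,
    "CC": 9, "CQ": -3, "CW": -2, "CY": -2,
    "QQ": 5, "QW": -2, "QY": -1,
    "WW": 11, "WY": 2,
    "YY": 7,
}


def ungapped_score(seq_a, seq_b):
    """Sum substitution scores over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(BLOSUM62_EXCERPT["".join(sorted(a + b))]
               for a, b in zip(seq_a, seq_b))


# Identities of rare residues (W, C) contribute more than identities of
# common ones (A), and the W/Y exchange still scores positively.
print(ungapped_score("ACQWY", "ACQYW"))
```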

Page 30: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Scoring Matrices(7/10) Selecting an Appropriate Scoring Matrix

PAM250 is equivalent to BLOSUM45
PAM160 is equivalent to BLOSUM62
PAM120 is equivalent to BLOSUM80

Matrix     Best use                                              Similarity (%)
PAM40      Short alignments that are highly similar              70-90
PAM160     Detecting members of a protein family                 50-60
PAM250     Longer alignments of more divergent sequences         ~30
BLOSUM90   Short alignments that are highly similar              70-90
BLOSUM80   Detecting members of a protein family                 50-60
BLOSUM62   Most effective in finding all potential similarities  30-40
BLOSUM30   Longer alignments of more divergent sequences         <30

The Similarity column gives the range of similarities that the matrix is able to best detect (c.f. Wheeler, 2003)

Page 31: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Scoring Matrices(8/10) Nucleotide Scoring Matrices(1/2)

Assumes that A, T, G, and C occur at equal frequencies
Nucleotide-based comparison is less accurate than protein-based comparison

Sequence1  GGTGCACCCGGTATGTGACTGCGATTAGCAGCGGGATCATTTCAGCATGCAGGG
           * * ***** **** **** ** *** **** ***** *** ** **** ** *   (76% identical)
Sequence2  GATACACCCCGTATTTGACAGCAATTTGCAGGGGGATGATTGCACCATGGAGCG

Sequence1  G A P G M W L R L A A G S F E H A G
               *     *       *   *       *       (28% identical)
Sequence2  D T P R I W E E F A G G W L H H G A
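
The two percentages on this slide can be checked with a few lines. This is a minimal sketch: the helper name is illustrative, and the sequences are copied from the slide with the spaces removed from the protein pair.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that carry the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


DNA1 = "GGTGCACCCGGTATGTGACTGCGATTAGCAGCGGGATCATTTCAGCATGCAGGG"
DNA2 = "GATACACCCCGTATTTGACAGCAATTTGCAGGGGGATGATTGCACCATGGAGCG"
PROT1 = "GAPGMWLRLAAGSFEHAG"
PROT2 = "DTPRIWEEFAGGWLHHGA"

print(round(percent_identity(DNA1, DNA2)))    # nucleotide level
print(round(percent_identity(PROT1, PROT2)))  # protein level
```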

Page 32: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Scoring Matrices(9/10) Nucleotide Scoring Matrices(2/2)

    A   T   G   C   S   W   R   Y   K   M   B   V   H   D   N
A   5  -4  -4  -4  -4   1   1  -4  -4   1  -4  -1  -1  -1  -2
T  -4   5  -4  -4  -4   1  -4   1   1  -4  -1  -4  -1  -1  -2
G  -4  -4   5  -4   1  -4   1  -4   1  -4  -1  -1  -4  -1  -2
C  -4  -4  -4   5   1  -4  -4   1  -4   1  -1  -1  -1  -4  -2
S  -4  -4   1   1  -1  -4  -2  -2  -2  -2  -1  -1  -3  -3  -1
W   1   1  -4  -4  -4  -1  -2  -2  -2  -2  -3  -3  -1  -1  -1
R   1  -4   1  -4  -2  -2  -1  -4  -2  -2  -3  -1  -3  -1  -1
Y  -4   1  -4   1  -2  -2  -4  -1  -2  -2  -1  -3  -1  -3  -1
K  -4   1   1  -4  -2  -2  -2  -2  -1  -4  -1  -3  -3  -1  -1
M   1  -4  -4   1  -2  -2  -2  -2  -4  -1  -3  -1  -1  -3  -1
B  -4  -1  -1  -1  -1  -3  -3  -1  -1  -3  -1  -2  -2  -2  -1
V  -1  -4  -1  -1  -1  -3  -1  -3  -3  -1  -2  -1  -2  -2  -1
H  -1  -1  -4  -1  -3  -1  -3  -1  -3  -1  -2  -2  -1  -2  -1
D  -1  -1  -1  -4  -3  -1  -1  -3  -1  -3  -2  -2  -2  -1  -1
N  -2  -2  -2  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1

Page 33: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

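Scoring Matrices(10/10) Gaps and Gap Penalties

Sequences are compared with amino acid insertions and deletions taken into account
In general, at most one gap occurs per 20 residues
Affine gap penalty
  A penalty is charged against the similarity score according to gap length
  Penalty = G + Ln (G: gap opening cost, L: gap extension cost, n: gap length)
Allowing gaps makes it possible to find more distant homologs as well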
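
The affine penalty above is a one-line function. The numbers in the example (G = 11, L = 1) are illustrative values chosen for this sketch, not values given on the slide.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty = G + L*n for a single gap of length n."""
    return gap_open + gap_extend * n


# One long gap is cheaper than several short ones covering the same length,
# which is the point of separating the opening and extension costs.
one_gap_of_four = affine_gap_penalty(4, gap_open=11, gap_extend=1)
two_gaps_of_two = 2 * affine_gap_penalty(2, gap_open=11, gap_extend=1)
print(one_gap_of_four, two_gaps_of_two)
```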

Page 35: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(1/21)
Performs sequence comparison quickly and accurately
Uses a scoring matrix

Program   Query                               Database
BLASTN    Nucleotide                          Nucleotide
BLASTP    Protein                             Protein
BLASTX    Nucleotide, six-frame translation   Protein
TBLASTN   Protein                             Nucleotide, six-frame translation
TBLASTX   Nucleotide, six-frame translation   Nucleotide, six-frame translation

(Many other variants exist)

Page 36: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(2/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step1 – Seeding(1/4)

Page 37: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(3/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step1 – Seeding(2/4)

Page 38: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(4/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step1 – Seeding(3/4)

Page 39: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(5/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step1 – Seeding(4/4)
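
The seeding step on these slides can be sketched as a word lookup: break the query into short words, then scan the subject for the same words. This sketch is deliberately simplified and only finds exact word matches; BLAST's protein seeding also accepts similar "neighborhood" words above a score threshold, which is omitted here. The word size of 3 and the function name are assumptions for illustration.

```python
from collections import defaultdict

SUBJECT = "TLSREQHKKDHPDYKYQPRRRK"
QUERY = "ERLRDQHKKDYPESHADAESSS"


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos, word) for every exactly shared word."""
    positions = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        positions[query[i:i + word_size]].append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in positions.get(word, []):
            seeds.append((i, j, word))
    return sorted(seeds)


for seed in find_seeds(QUERY, SUBJECT):
    print(seed)
```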

Page 40: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(6/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(1/11)

Page 41: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(7/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(2/11)

Page 42: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(8/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(3/11)

Page 43: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(9/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(4/11)

Page 44: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(10/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(5/11)

Page 45: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(11/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(6/11)

Page 46: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(12/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(7/11)

Page 47: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(13/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(8/11)

Page 48: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(14/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(9/11)

Page 49: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(15/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(10/11)

Page 50: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(16/21)

Subject sequence: TLSREQHKKDHPDYKYQPRRRK Query sequence: ERLRDQHKKDYPESHADAESSS

Step2 – Extension(11/11)
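
The extension step can be sketched as follows: starting from a seed, walk outward in both directions without gaps, keep the best running score, and stop once the score has fallen too far below that best. The scoring used here (+2 for identity, -1 otherwise) and the drop-off limit of 2 are assumptions chosen to keep the sketch small; BLAST itself scores with a substitution matrix.

```python
SUBJECT = "TLSREQHKKDHPDYKYQPRRRK"
QUERY = "ERLRDQHKKDYPESHADAESSS"


def extend_seed(query, subject, q_start, s_start, word_size=3,
                match=2, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed; returns (score, query part, subject part)."""

    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    def walk(i, j, step):
        """Extend in one direction; return (best gain, residues kept)."""
        running = best = kept = n = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            running += pair(i, j)
            n += 1
            if running > best:
                best, kept = running, n
            elif best - running > x_drop:
                break  # score fell too far below the best seen so far
            i += step
            j += step
        return best, kept

    seed_score = sum(pair(q_start + k, s_start + k) for k in range(word_size))
    right_gain, right_len = walk(q_start + word_size, s_start + word_size, 1)
    left_gain, left_len = walk(q_start - 1, s_start - 1, -1)
    q_lo = q_start - left_len
    q_hi = q_start + word_size + right_len
    s_lo = s_start - left_len
    length = q_hi - q_lo
    total = seed_score + left_gain + right_gain
    return total, query[q_lo:q_hi], subject[s_lo:s_lo + length]


# Extend the seed "QHK" found at position 5 in both sequences.
print(extend_seed(QUERY, SUBJECT, 5, 5))
```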

Page 51: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(17/21) Running a BLAST Search

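Query sequence
Subsequence of the query
Database to search

Whether to display conserved domains

Restrict by species
Take composition into account when computing E

E represents the number of HSPs that would be expected purely by chance (the smaller E is, the less likely the alignment is a false positive and the greater its biological significance)

Do not search low-complexity regions
Report only hits with E < 10
Query word length
Scoring matrix
Gap penalty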
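
How E depends on the score and the search space can be illustrated with the standard Karlin-Altschul relation for a normalized bit score S', E = m · n · 2^(-S'), where m is the query length and n the database length. The slides do not give this formula, and the numbers below are made up for illustration; BLAST additionally applies length corrections that this sketch ignores.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for a bit score S' and search space m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves E; a larger database raises E proportionally.
for bits in (20, 30, 40):
    print(bits, expect_value(bits, query_len=100, db_len=1_000_000))
```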

Page 52: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(18/21) Understanding the BLAST Output(1/3)

Page 53: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(19/21) Understanding the BLAST Output(2/3)

Accession: gi – GenInfo ID, ref – RefSeq, sp – SwissProt, …

Definition
Score

E value
External links: L – LocusLink, U – UniGene, …

Page 54: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(20/21) Understanding the BLAST Output(3/3)

Score
E value
Exact matches
Exact matches & conservative substitutions
Gaps
Gap
Low-complexity region

Page 55: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST(21/21) BLAST Search Artifacts

Low-complexity regions
Tandem repeat regions
Hypothetical proteins
ESTs (Expressed Sequence Tags)

Page 57: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAST2Sequences

Page 59: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

MegaBLAST

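Optimized for long sequences and for very similar (>95%) sequences
Makes it easy to determine whether one sequence is part of another
About 10 times faster than BLASTN
To also align sequences of lower similarity, use discontinuous MegaBLAST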

Page 61: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

PSI-BLAST(1/2)
Can find even weakly related proteins
Uses a position-specific scoring matrix (PSSM)

Page 62: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

PSI-BLAST(2/2)

The PSSM is used to find more proteins
E ≤ 0.005: used to build the PSSM
E > 0.005
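
A toy version of a position-specific scoring matrix, to make the idea concrete: for each column of a set of aligned hits, count the residues, add a pseudocount, and convert the frequencies to log-odds against a background. Everything specific here is an assumption for illustration (uniform background, a pseudocount of 1, equal sequence weights, the three example sequences); PSI-BLAST's actual weighting scheme is more elaborate.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # ASSUMPTION: uniform background
    denom = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in ALPHABET
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Score a sequence of the same length as the PSSM, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# HYPOTHETICAL gap-free alignment of three hits below the E threshold.
pssm = build_pssm(["HKK", "HKR", "HRK"])
print(round(score_with_pssm(pssm, "HKK"), 2))
print(round(score_with_pssm(pssm, "AAA"), 2))
```

Unlike a fixed matrix such as BLOSUM62, the score for a residue now depends on the column it falls in, which is what lets later iterations pick up weaker relatives.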

Page 64: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAT(1/3)
Builds an index of length-11 words (11-mers) before searching
Used to locate a particular sequence within a genome and for cross-species analysis
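
The indexing idea can be sketched with a dictionary that maps each k-mer of the genome to its positions, built once and reused for every query. Indexing only non-overlapping genome k-mers while scanning the query with overlapping ones is how BLAT is usually described; treat that detail, the function names, and the tiny k used in the example as assumptions of this sketch.

```python
from collections import defaultdict


def build_index(genome, k=11):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(index, query, k=11):
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos)."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


# Small k so that the toy example produces hits.
index = build_index("ACGTACGTTT", k=3)
print(find_hits(index, "GTACGTT", k=3))
```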

Page 65: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAT(2/3)

Species; version; query type

Page 66: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

BLAT(3/3)

Page 68: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(1/7)
FASTP was devised by Lipman and Pearson in 1985
Sequence similarity search

Program         Query                  Database               Corresponding BLAST program
FASTA           Nucleotide / Protein   Nucleotide / Protein   BLASTN / BLASTP
FASTX/FASTY     DNA                    Protein                BLASTX
TFASTX/TFASTY   Protein                Translated DNA         TBLASTN

(Many other variants exist)

Page 69: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(2/7) Step 1 – Word match search

Page 70: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(3/7) Step 2 – Top 10 selection

Page 71: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(4/7) Step 3 – Optimal pairwise alignment computation

Page 72: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(5/7) Step 4 – Assessment of the significance

Assume a random sequence with the same length and composition as the input sequence
Compute the probability that the alignment would arise with the random sequence
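
One way to picture this step is a shuffling experiment: permuting a sequence keeps its length and composition but destroys its order, so the scores obtained against shuffled copies approximate what chance alone produces. The sketch below estimates an empirical p-value that way. The scoring function (largest number of identities on any single diagonal), the number of trials, and the seed are all assumptions; this illustrates the principle rather than the statistics the FASTA programs actually compute.

```python
import random
from collections import Counter


def best_diagonal_score(a, b):
    """Largest number of identical residues found on any one diagonal."""
    diagonals = Counter(i - j
                        for i, x in enumerate(a)
                        for j, y in enumerate(b) if x == y)
    return max(diagonals.values(), default=0)


def shuffle_p_value(query, subject, trials=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed = best_diagonal_score(query, subject)
    as_good = 0
    for _ in range(trials):
        letters = list(subject)
        rng.shuffle(letters)  # same length and composition, random order
        if best_diagonal_score(query, "".join(letters)) >= observed:
            as_good += 1
    return (as_good + 1) / (trials + 1)


print(shuffle_p_value("ERLRDQHKKDYPESHADAESSS", "TLSREQHKKDHPDYKYQPRRRK"))
```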

Page 73: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(6/7) Running a FASTA Search(1/2)

Query library @ vs %n library
searching /slib2/blast/nr.lseg library

1>>>sp|P29617|PROS_DROME Protein prospero – 1403 aa
 vs NCBI/Blast non-redundant (nr) proteins library

 opt      E()
< 20   3193      0:=
  22      0      0:          one = represents 3301 library sequences
  24      2      2:*
  26      2     34:*
  28     12    364:*
  30     52   2210:*
  32    381   8545:= *
  34   2470  23173:= *
  36  13785  47592:===== *
  38  46778  78653:=============== *
  40 108094 109713:=================================*
  42 172066 134111:========================================*============

Columns, left to right: normalized similarity score; number of library sequences with that score; expected number of sequences with that score

Page 74: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

FASTA(7/7) Running a FASTA Search(2/2)

Normalized similarity score

Normalized bit score

Expectation value

Page 76: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Comparing FASTA and BLAST
FASTA first looks for exact matches, whereas BLAST allows conservative substitutions in the seeding step
BLAST can exclude specified regions from the search; FASTA cannot
FASTA finds only one alignment per sequence, whereas BLAST can find multiple HSPs
Because FASTA uses the Smith-Waterman method, it finds weakly related proteins better than BLAST does
When comparing a nucleotide sequence with an amino acid sequence, FASTA allows frameshifts
BLAST is faster than FASTA
FASTA is more accurate when sequence similarity is 30% or higher

Page 78: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

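Summary
There are many methods for comparing sequences
It is important to understand the strengths and weaknesses of each method and to use them appropriately
Results from BLAST or FASTA must be statistically validated and supported by the literature or by experiment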

Page 79: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Terminology(1/2) Six-frame translation
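
Six-frame translation means translating a nucleotide sequence in three reading frames on the given strand and three on its reverse complement. A minimal sketch with the standard genetic code follows; the frame labels (+1 to +3, -1 to -3) and function names are illustrative, and the example sequence is the coding-strand counterpart of the mRNA shown on the frameshift slide.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of the sequence."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate all three frames of both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGACCCACGATGGGTGA").items():
    print(name, protein)
```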

Page 80: Ch 11 . Assessing Pairwise Sequence Similarity: BLAST and FASTA

Terminology(2/2) Frameshift mutation

Template DNA: 3’-TACTGGGTGCTACCCACT-5’
mRNA:         5’-AUGACCCACGAUGGGUGA-3’
Protein:      Met Thr His Asp Gly

Template DNA: 3’-TACGGGTGCTACCCACT-5’
mRNA:         5’-AUGCCCACGAUGGGUGA-3’
Protein:      Met Pro Thr Met Gly
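
The effect of the single-base deletion can be traced in code by transcribing each template strand and translating the result. This is a sketch using the standard genetic code; the function names are illustrative, and the two template strands are the ones shown on the slide. Deleting one base shifts every downstream codon, so the protein changes from the second residue onward and the original stop codon is no longer read in frame.

```python
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code on the RNA alphabet; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(template_3_to_5):
    """Template DNA written 3'->5' gives the mRNA 5'->3', base by base."""
    return template_3_to_5.translate(str.maketrans("ACGT", "UGCA"))


def translate(mrna):
    """Translate from the first base until a stop codon or the last full codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


for template in ("TACTGGGTGCTACCCACT", "TACGGGTGCTACCCACT"):
    mrna = transcribe(template)
    print(mrna, translate(mrna))
```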