sequence alignment

Sequence alignment Gabor T. Marth Department of Biology, Boston College [email protected] BI420 – Introduction to Bioinformatics


Post on 31-Dec-2015



TRANSCRIPT

Page 1: Sequence alignment

Sequence alignment

Gabor T. Marth

Department of Biology, Boston College

[email protected]

BI420 – Introduction to Bioinformatics

Page 2: Sequence alignment

Biologically significant alignment

http://artedi.ebc.uu.se/programs/pairwise.html

hba_human

hbb_human

Page 3: Sequence alignment

Biologically plausible alignment

Page 4: Sequence alignment

Spurious alignment

(BRCA1 variant)

Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison

Page 5: Sequence alignment

Alignment types

Examples from: BLAST. Korf, Yandell, Bedell

How do we align the words: CRANE and FRAME?

CRANE
 || |
FRAME

3 matches, 2 mismatches

How do we align words that are different in length?

COELACANTH
  || |||
P-ELICAN--

COELACANTH
  || |||
-PELICAN--

5 matches, 2 mismatches, 3 gaps

In this case, if we assign +1 point for each match and -1 for each mismatch or gap, we get 5 x 1 + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
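The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and its parameters are made up here, and the +1/-1/-1 scheme is the one used on this slide, not a general recommendation.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["gaps"] += 1          # a column with a gap on either side
        elif x == y:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    score = (counts["matches"] * match
             + counts["mismatches"] * mismatch
             + counts["gaps"] * gap)
    return score, counts


if __name__ == "__main__":
    print(score_alignment("CRANE", "FRAME"))
    print(score_alignment("COELACANTH", "P-ELICAN--"))
```

Both COELACANTH/PELICAN alignments shown above contain 5 matches, 2 mismatches and 3 gaps, so they receive the same score under this scheme.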

Page 6: Sequence alignment

Finding the “best” alignment

COELACANTH
  || |||
P-ELICAN--
S = 0

COELACANTH
   | |||
PE-LICAN--
S = -2

COELACANTH
  ||
P-EL-ICAN-
S = -6

COELACANTH

PELICAN---
S = -10

Page 7: Sequence alignment

Global alignment – Needleman-Wunsch

Example from: Higgs and Attwood

Aligning words: SHAKE and SPEARE
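A minimal sketch of Needleman-Wunsch global alignment follows. It assumes the simple +1/-1/-1 scheme from the earlier word examples, which is not necessarily the scoring used in the Higgs and Attwood example, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    for pair in (("SHAKE", "SPEARE"), ("COELACANTH", "PELICAN")):
        s, top, bottom = needleman_wunsch(*pair)
        print(s)
        print(top)
        print(bottom)
```

Several alignments can share the optimal score; the traceback here returns only one of them, preferring the diagonal move when there is a tie.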

Page 8: Sequence alignment

Local alignment – Smith-Waterman

Example from: Higgs and Attwood
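For comparison, here is a hedged sketch of the Smith-Waterman recurrence. It differs from the global case in two ways: cell scores are floored at zero, and the result is the best cell anywhere in the matrix rather than the corner. The +1/-1/-1 scheme and the function name are assumptions for illustration, and only the score and end position are returned (no traceback).

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, (end_i, end_j)) of the best local alignment."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)            # row i-1 of the score matrix
    for i in range(1, len(a) + 1):
        cur = [0]                        # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere instead of going negative.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            if cell > best:
                best, best_end = cell, (i, j)
        prev = cur
    return best, best_end


if __name__ == "__main__":
    print(smith_waterman_score("COELACANTH", "PELICAN"))
```

With the word pair used earlier, the best local alignment is ELACAN against ELICAN (5 matches, 1 mismatch), which ignores the poorly matching ends that a global alignment is forced to include.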

Page 9: Sequence alignment

Visualizing pair-wise alignments

Page 10: Sequence alignment

Sequence similarity and scoring

Match-mismatch-gap penalties, e.g.: Match = 1, Mismatch = -5, Gap = -10

Scoring matrices

Page 11: Sequence alignment

Multiple alignments

clustalW

Page 12: Sequence alignment

Anchored multiple alignment

Page 13: Sequence alignment

Similarity searching vs. alignment

Alignment

Similarity search

query

database

Page 14: Sequence alignment

The BLAST algorithms

Program | Database | Query | Typical uses
BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP | Protein | Protein | Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX | Protein | Nucleotide | Finding protein-coding genes in genomic DNA.
TBLASTN | Nucleotide | Protein | Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX | Nucleotide | Nucleotide | Cross-species gene prediction. Searching for genes missed by traditional methods.

Page 15: Sequence alignment

BLAST report

Page 16: Sequence alignment

BLAST report

http://www.ncbi.nih.gov/BLAST/ gi|7428631

Page 17: Sequence alignment

The BLAST algorithm

Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as two sequences and sometimes as query x database.

[Figure: search space with Sequence 1 along one axis; labels: alignments, gapped alignment]

• Global alignment vs. local alignment

– BLAST is local

• Maximum scoring pair (MSP) vs. High-scoring pair (HSP)

– BLAST finds HSPs (usually the MSP too)

• Gapped vs. ungapped

– BLAST can do both

Page 18: Sequence alignment

The BLAST algorithm

[Figure: word hits plotted against Sequence 1]

BLOSUM62 neighborhood of RGD, T = 12

Score 17: RGD
Score 14: KGD
Score 13: QGD, RGE
Score 12: EGD, HGD, NGD, RGN
Score 11 (below T): AGD, MGD, RAD, RGQ, RGS, RND, RSD, SGD, TGD

• Speed gained by minimizing search space

• Alignments require word hits

• Neighborhood words

• W and T modulate speed and sensitivity
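The neighborhood idea can be sketched as follows: enumerate every word of length W and keep those whose substitution score against the query word reaches the threshold T. To stay self-contained, the sketch hard-codes only the three BLOSUM62 rows needed for the word RGD; these rows were transcribed by hand for this illustration and should be checked against an authoritative copy of the matrix. The function name is hypothetical.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"  # column order of the rows below

# BLOSUM62 rows for R, G and D only (hand-transcribed; an assumption of this sketch).
ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}


def neighborhood(word, threshold):
    """Return [(neighbor, score)] for all words scoring >= threshold against word."""
    per_position = [dict(zip(AMINO, ROWS[c])) for c in word]
    found = []
    for candidate in product(AMINO, repeat=len(word)):   # all 20**W words
        score = sum(col[a] for col, a in zip(per_position, candidate))
        if score >= threshold:
            found.append(("".join(candidate), score))
    # Highest score first, ties in alphabetical order.
    return sorted(found, key=lambda ws: (-ws[1], ws[0]))


if __name__ == "__main__":
    for neighbor, score in neighborhood("RGD", 12):
        print(neighbor, score)
```

Lowering T from 12 to 11 grows the neighborhood of RGD from 8 words to 17, which illustrates how T trades speed for sensitivity.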

Page 19: Sequence alignment

Word length

Page 20: Sequence alignment

2-hit seeding

[Figure labels: word clusters, isolated words]

• Alignments tend to have multiple word hits.

• Isolated word hits are frequently false leads.

• Most alignments have large ungapped regions.

• Requiring 2 word hits on the same diagonal (within 40 aa of each other, for example) greatly increases speed at a slight cost in sensitivity.
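The two-hit requirement can be illustrated with a small sketch. It is deliberately simplified: it uses exact word matches rather than neighborhood words, the function names and the example sequences are invented for this illustration, and it is not how any BLAST release implements seeding. A hit at query position i and subject position j lies on diagonal i - j.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (i, j) start positions of every exact w-letter word shared by both."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def two_hit_seeds(hits, w=3, window=40):
    """Keep (diagonal, first_i, second_i) for non-overlapping hit pairs on one
    diagonal whose start positions are at most `window` apart."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[i - j].append(i)
    seeds = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for idx, second in enumerate(starts):
            for first in starts[:idx]:
                if w <= second - first <= window:
                    seeds.append((diagonal, first, second))
                    break                # one qualifying partner is enough
    return seeds


if __name__ == "__main__":
    hits = word_hits("MKVLAAGIVR", "MKVTTAGIPP")
    print(hits)
    print(two_hit_seeds(hits))
```

An isolated hit never produces a seed, so no extension is attempted for it; that is where the speed gain comes from.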

Page 21: Sequence alignment

Extension of the seed alignments

extension

alignment

• Alignments are extended from seeds in each direction.

• Extension is terminated when the score drops more than X below the maximum score seen so far.

The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.

X = 5

length of extension

trim to max

Text example: match +1, mismatch -1, no gaps
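The text example can be traced with a short sketch of ungapped extension with an X-drop cutoff. Two assumptions are made here: extension starts at the first character and runs only to the right (real seeds are extended in both directions), and the loop stops once the running score is more than X below the best score seen. The function name is hypothetical.

```python
def extend_right(a, b, x, match=1, mismatch=-1):
    """Ungapped extension from position 0; return (best_score, best_length)."""
    score = best = best_length = 0
    for k, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_length = score, k     # remember where the maximum was reached
        elif best - score > x:
            break                            # dropped more than X below the maximum
    # "Trim to max": report only the part of the extension up to the best score.
    return best, best_length


if __name__ == "__main__":
    s1 = "The quick brown fox jumps over the lazy dog."
    s2 = "The quiet brown cat purrs when she sees him."
    best, length = extend_right(s1, s2, x=5)
    print(best, repr(s1[:length]), repr(s2[:length]))
```

With X = 5 the extension runs well past the point of maximum score before giving up, and the reported alignment is then trimmed back to that maximum.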

Page 22: Sequence alignment

BLAST statistics

>gi|23098447|ref|NP_691913.1| (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis] Length = 253

Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

How significant is this similarity?

Page 23: Sequence alignment

Scoring the alignment

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

[Figure: per-column scores (e.g. 4, 4, -1) are summed to give S (score)]
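As a hedged check of how S is obtained, the sketch below sums per-column substitution scores for the alignment on this slide. It assumes the report was scored with BLOSUM62 (the slide does not say so) and hard-codes only the residue pairs that occur in this alignment, transcribed by hand; verify them against a full copy of the matrix before reusing them.

```python
QUERY = "VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI"
SBJCT = "VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI"

# BLOSUM62 entries for the pairs seen above, keyed by the two letters sorted.
# Hand-transcribed subset; an assumption of this sketch.
PAIR_SCORES = {
    "VV": 4, "TT": 5, "GG": 6, "AA": 4, "II": 4, "DD": 6, "NN": 6,
    "AG": 0, "HS": -1, "LM": 2, "KR": 2, "AS": 1, "LT": -1, "EL": -3,
    "LY": -1, "KS": 0, "EK": 1, "AC": 0, "HK": -1, "IV": 3, "AI": -1,
    "AV": 0, "IL": 2, "EV": -2, "ES": 0, "EQ": 2, "DS": 0, "TV": 0,
}


def column_scores(query, subject):
    """Look up the substitution score for each aligned (ungapped) column."""
    return [PAIR_SCORES["".join(sorted(pair))] for pair in zip(query, subject)]


scores = column_scores(QUERY, SBJCT)
raw_score = sum(scores)                                   # S
identities = sum(q == s for q, s in zip(QUERY, SBJCT))    # identical residues
positives = sum(v > 0 for v in scores)                    # columns scoring > 0

if __name__ == "__main__":
    print(raw_score, identities, positives)
```

Under these assumptions the sum reproduces the raw score of 89 shown in parentheses in the report, together with its 17 identities and 26 positives.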

Page 24: Sequence alignment

The Karlin-Altschul equation

\[ E = k\,m\,n\,e^{-\lambda S} \]

E: expected number of alignments (the “Expect” or “E-value”)
k: a minor constant
m: length of query
n: length of database
mn: search space
S: raw score
λ: scaling factor
λS: normalized score

The “P-value”:

\[ P = 1 - e^{-E} \]
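The Karlin-Altschul relationship, E = k m n exp(-λS), and the conversion P = 1 - exp(-E) can be sketched numerically as below. The bit-score conversion S' = (λS - ln k) / ln 2 is the standard one but does not appear on the slides. The values λ = 0.267 and k = 0.041 are commonly cited for gapped BLOSUM62 searches with gap costs 11/1 and are an assumption here; the slides do not state which parameters produced the example report. Function names are hypothetical.

```python
import math


def expect(k, m, n, lam, raw_score):
    """E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(lam, k, raw_score):
    """S' = (lambda * S - ln k) / ln 2, which makes E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 to stay accurate for tiny E."""
    return -math.expm1(-e)


if __name__ == "__main__":
    lam, k = 0.267, 0.041          # assumed parameters, see the note above
    print(round(bit_score(lam, k, 89), 1))
    print(p_value(3e-05))
```

With these assumed parameters, a raw score of 89 converts to about 38.9 bits, in line with the example report. For small E the P-value is almost identical to E, which is why BLAST reports tend to show only the E-value.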

Page 25: Sequence alignment

The sum-statistics

Sum statistics increase the significance (decrease the E-value) of groups of consistent alignments.

Page 26: Sequence alignment

The sum-statistics

The sum score is not reported by BLAST!