BIOL335: Homology search

Homology Search Paul Gardner March 24, 2015 Paul Gardner Homology Search

Upload: paul-gardner

Post on 24-Jun-2015

DESCRIPTION

Course material for: http://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=BIOL335

TRANSCRIPT

Page 1: BIOL335: Homology search

Homology Search

Paul Gardner

March 24, 2015

Paul Gardner Homology Search

Page 2: BIOL335: Homology search

News & Views reminder (20% of your course grade, due March 26, reviewed April 2 (5/20), revisions April 28 (15/20))

- Meredith et al. (2014) Evidence for a single loss of mineralized teeth in the common avian ancestor. Science.

- Nunez et al. (2015) Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature.

Page 3: BIOL335: Homology search

Homology search

- In a huge collection of biological sequences, how can you locate similar sequences?

- By using heuristic, super fast, sequence alignment methods

Page 4: BIOL335: Homology search

BLAST

Page 5: BIOL335: Homology search

BLAST

- Identify all 'hits' (matching words) of length at least W

- Find any hits on the same diagonal of an alignment matrix

- Trigger a full alignment in that region

Basic idea: identify near-identical sub-sequences first → align any hits in full
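
The first two steps above can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's actual implementation: it only finds exact word matches, it counts overlapping words as separate hits, and it stops before the full alignment step. The function names and the `min_hits` parameter are made up for this example.

```python
from collections import defaultdict

def word_hits(query, subject, w):
    """Return (i, j) pairs where query[i:i+w] exactly equals subject[j:j+w]."""
    # Index every word of length w in the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    # Look up every word of the query in that index.
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]

def candidate_diagonals(query, subject, w, min_hits=2):
    """Group hits by diagonal (j - i) and keep diagonals with several hits.

    Hits on the same diagonal line up without gaps, so these are the
    regions where a full alignment would be triggered.
    """
    diagonals = defaultdict(list)
    for i, j in word_hits(query, subject, w):
        diagonals[j - i].append((i, j))
    return {d: hits for d, hits in diagonals.items() if len(hits) >= min_hits}

# Toy sequences, invented for illustration.
query = "ACGTACGTTT"
subject = "GGACGTACGAAA"
print(candidate_diagonals(query, subject, w=4))
```

Indexing the words of one sequence first is what makes the search fast: each query word is a dictionary lookup rather than a scan of the whole subject.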

Page 6: BIOL335: Homology search

What does that E-value (Expect) mean?

>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome

Length=4537948

Features in this part of subject sequence:

cold-shock DNA-binding domain protein

Score = 57.2 bits (62), Expect = 2e-05

Identities = 78/106 (74%), Gaps = 6/106 (6%)

Strand=Plus/Plus

Query 1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
              || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct 828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query 57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
               | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct 828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
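
As a check on the "Identities" and "Gaps" lines of this report, the counts can be recomputed from the two aligned rows. The sketch below is only an illustration (the function name is made up); the two strings are the Query and Sbjct rows above, joined end to end.

```python
def alignment_stats(query_row, subject_row):
    """Count identical columns, gapped columns and total columns."""
    assert len(query_row) == len(subject_row), "rows must be the same length"
    pairs = list(zip(query_row, subject_row))
    identities = sum(a == b and a != "-" for a, b in pairs)
    gaps = sum(a == "-" or b == "-" for a, b in pairs)
    return identities, gaps, len(pairs)

query_row = ("CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC"
             "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG")
subject_row = ("CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC"
               "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG")

identities, gaps, columns = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{columns} ({identities / columns:.0%}), "
      f"Gaps = {gaps}/{columns} ({gaps / columns:.0%})")
# prints: Identities = 78/106 (74%), Gaps = 6/106 (6%)
```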

Page 7: BIOL335: Homology search

How can we evaluate the significance of a score?

- Note that a bit-score of 57.2 by itself is not that useful.
  - Its significance depends on the query sequence and on the database size & composition.
  - To counter this we can compute an Expect-value (E-value).
    - This is the number of hits scoring at least as well as the observed score that we expect by chance, for the given query and database sizes.

- P-values can also be used

[Figure: "Separating true from false hits". Number of matches (y-axis, 0 to 10000) against score in bits (x-axis, 0 to 700), for random sequences/negative controls and true homologs/positive controls. A score threshold divides the matches into false negatives, true positives, false positives and true negatives.]
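
A minimal sketch of how the four outcomes named in the figure are counted for a chosen threshold. The scores below are invented for illustration; in practice they would come from searching positive controls (true homologs) and negative controls (random sequences). The function name is hypothetical.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Classify control hits as TP/FN/FP/TN at a given score threshold."""
    tp = sum(s >= threshold for s in positive_scores)  # homologs we accept
    fn = len(positive_scores) - tp                     # homologs we miss
    fp = sum(s >= threshold for s in negative_scores)  # random hits we accept
    tn = len(negative_scores) - fp                     # random hits we reject
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Invented bit scores for the two control sets.
positives = [120, 80, 45, 30]
negatives = [10, 20, 35, 50]
print(confusion_counts(positives, negatives, threshold=40))
# prints: {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```

Raising the threshold trades false positives for false negatives, which is why a raw score needs a significance measure attached to it.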

Page 8: BIOL335: Homology search

How can we evaluate the significance of a score?

[Figure: "Separating true from false hits". Number of matches against score (bits) for negative and positive controls, with a score threshold separating false negatives, true positives, false positives and true negatives.]

$E = \kappa M N \, 2^{-\lambda x}$

E: E-value; M & N: query & database size; κ & λ: fitting parameters; x: score (bits)
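
The formula can be tried out directly. The sketch below is an illustration under stated assumptions: κ = λ = 1 are placeholder defaults, not fitted values, and the database size of 10^9 is made up. The P-value conversion assumes the number of chance hits is Poisson-distributed, so that P = 1 − exp(−E).

```python
import math

def expect_value(score_bits, query_len, db_len, kappa=1.0, lam=1.0):
    """E = kappa * M * N * 2**(-lam * x), with x the score in bits.

    kappa and lam are fitting parameters; the defaults of 1.0 are
    placeholders for illustration only.
    """
    return kappa * query_len * db_len * 2.0 ** (-lam * score_bits)

def p_value(e_value):
    """Probability of at least one chance hit, assuming Poisson counts."""
    return -math.expm1(-e_value)  # numerically stable 1 - exp(-E)

# Hypothetical search: 101-nt query against a database of 1e9 nucleotides.
e = expect_value(57.2, query_len=101, db_len=1e9)
print(f"E = {e:.2g}, P = {p_value(e):.2g}")

# Doubling the database doubles E; one extra bit of score halves it.
print(expect_value(57.2, 101, 2e9) / e, expect_value(58.2, 101, 1e9) / e)
```

This is why the same bit-score can be significant in a small database and unremarkable in a large one.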

Page 9: BIOL335: Homology search

BLAST is not the only, or best tool for the job!

Page 10: BIOL335: Homology search

Profile-based homology search

Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol.

Image provided by Eric Nawrocki.

Page 11: BIOL335: Homology search

Profile-based homology search – scoring sequences

Image provided by Eric Nawrocki.

Page 12: BIOL335: Homology search

Profile HMMs are slightly more complicated

- A tree-weighting scheme takes care of unbalanced alignments

- Dirichlet-mixture priors are used to incorporate information about amino-acid biochemistry

- Effective sequence number is used to down-weight priors when many sequences are available

- Transition probabilities to Insert & Delete states are estimated from the alignment
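
To make the refinements above concrete, here is a deliberately stripped-down profile that leaves all of them out. It is a sketch only: it uses a flat +1 pseudocount in place of Dirichlet-mixture priors, no sequence weighting, no Insert or Delete states, a uniform background, and a toy RNA alphabet and alignment invented for this example. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def column_log_odds(alignment, pseudocount=1.0, background=0.25):
    """Build per-column log-odds scores (bits) from an ungapped alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in ALPHABET
        })
    return profile

def score(profile, sequence):
    """Sum the column scores; the sequence must match the profile length."""
    return sum(column[residue] for column, residue in zip(profile, sequence))

# Toy seed alignment, invented for illustration.
seed = ["ACGU", "ACGU", "ACGA"]
profile = column_log_odds(seed)
print(round(score(profile, "ACGU"), 2), round(score(profile, "UGCA"), 2))
```

Conserved columns contribute large positive scores for the consensus residue and negative scores for everything else, while the variable last column is more forgiving. A single-sequence search has no access to this position-specific information.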

Page 13: BIOL335: Homology search

Why not just use BLAST?

- ACCURACY!
  - Every benchmark of homology search tools has shown that profile methods are more accurate than single-sequence methods.

Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.

Page 14: BIOL335: Homology search

Why not just use BLAST?

- SPEED! To search a single query vs a database of all proteins:
  - BLAST: searches 42 million UniProt sequences
  - HMMER: searches 15,000 Pfam profiles

- The search space is ∼3,000x smaller for profiles (42 million / 15,000 ≈ 2,800)
- Save Planet Earth, use HMMER3

Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.

Page 15: BIOL335: Homology search

Pfam

What is a Pfam-A Entry?

[Figure: the components of a Pfam-A entry. Labels: SEED, HMM, OUTOUT, ALIGN, DESC. Programs: hmmbuild, hmmsearch, hmmalign.]

Slide borrowed from Rob Finn.

Page 16: BIOL335: Homology search

But, what about RNA?

[Figure: three RNA secondary-structure diagrams, drawn 5' to 3', each annotated with a sequence-conservation scale from 0 to 1.]

Page 17: BIOL335: Homology search

Covariance models

Nawrocki & Eddy (2007) Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Computational Biology.

Page 18: BIOL335: Homology search

Benchmark

Freyhult, Bollback & Gardner (2007) Exploring genomic dark matter: A critical assessment of the performance of homology search methods on noncoding RNA. Genome Research.

Page 19: BIOL335: Homology search

Rfam

Page 20: BIOL335: Homology search

Relevant reading

- Reviews:
  - Eddy SR (2004) What is a hidden Markov model? Nature Biotechnology.

- Methods:
  - Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
  - Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.

Page 21: BIOL335: Homology search

The End
