Advanced BLAST Searching Courtesy of Jonathan Pevsner Johns Hopkins U.

Page 1: Advanced BLAST Searching

Advanced BLAST Searching

Courtesy of Jonathan Pevsner Johns Hopkins U.

Page 2: Advanced BLAST Searching

Outline of today’s lecture

Organism-specific BLAST sites

Specialized BLAST-related algorithms

BLAST-like tools for genomic DNA searches

PSI-BLAST

PHI-BLAST

Find-a-gene

Page 3: Advanced BLAST Searching

Specialized BLAST servers

Organism-specific BLAST sites

Page 4: Advanced BLAST Searching
Page 5: Advanced BLAST Searching

Ensembl BLAST output includes an ideogram

Page 6: Advanced BLAST Searching

TIGR BLAST

Page 7: Advanced BLAST Searching
Page 8: Advanced BLAST Searching
Page 9: Advanced BLAST Searching

BLAST output of ProDom server: graphical view of domains

Page 10: Advanced BLAST Searching

Specialized BLAST servers

Molecule-specific BLAST sites

Species-specific BLAST sites

Specialized algorithms (WU-BLAST 2.0)

Page 11: Advanced BLAST Searching
Page 12: Advanced BLAST Searching
Page 13: Advanced BLAST Searching

FASTA server at the University of Virginia

Page 14: Advanced BLAST Searching

Conserved Domain Database (CDD)

Page 15: Advanced BLAST Searching
Page 16: Advanced BLAST Searching
Page 17: Advanced BLAST Searching
Page 18: Advanced BLAST Searching

BLAST-related tools for genomic DNA

The analysis of genomic DNA presents special challenges:

• There are exons (protein-coding sequence) and introns (intervening sequences).

• There may be sequencing errors or polymorphisms.

• The comparison may be between related species (e.g. human and mouse).

Page 19: Advanced BLAST Searching

BLAST-related tools for genomic DNA

Recently developed tools include:

• MegaBLAST at NCBI.

• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy. See http://genome.ucsc.edu

• SSAHA at Ensembl uses a strategy similar to that of BLAT. See http://www.ensembl.org
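The "mirror image" idea behind BLAT can be sketched in a few lines of Python: the database, not the query, is cut into words and indexed, and the query is then scanned against that index. This is only an illustration under stated assumptions. The function names, the toy sequence `chrToy`, the choice to index non-overlapping words, and the shortened word size (k = 4 instead of 11) are all made up for readability and are not taken from the slides.

```python
from collections import defaultdict


def build_word_index(database, k=11, step=None):
    """Map each k-mer to the (sequence name, offset) pairs where it occurs.

    step=None indexes non-overlapping words (offsets 0, k, 2k, ...),
    an assumption of this sketch that keeps the index small.
    """
    step = k if step is None else step
    index = defaultdict(list)
    for name, seq in database.items():
        for offset in range(0, len(seq) - k + 1, step):
            index[seq[offset:offset + k]].append((name, offset))
    return index


def find_seeds(index, query, k=11):
    """Look up every overlapping k-mer of the query in the database index."""
    seeds = []
    for q_offset in range(len(query) - k + 1):
        word = query[q_offset:q_offset + k]
        for name, db_offset in index.get(word, []):
            seeds.append((q_offset, name, db_offset))
    return seeds


# Toy data (hypothetical sequences, k shortened to 4 for readability).
genome = {"chrToy": "ACGTTGCAAGGCTTAACCGG"}
index = build_word_index(genome, k=4)
seeds = find_seeds(index, "TTGCAAGGC", k=4)
print(seeds)
```

Each seed is a (query offset, sequence name, database offset) triple. Seeds that share the same difference between database and query offsets lie on one diagonal and would be the natural candidates for extension into an alignment.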

Page 20: Advanced BLAST Searching

To access BLAT, visit http://genome.ucsc.edu

Page 21: Advanced BLAST Searching

Paste DNA or protein sequence here in the FASTA format

Page 22: Advanced BLAST Searching

BLAT output includes browser and other formats

Page 23: Advanced BLAST Searching

Position specific iterated BLAST: PSI-BLAST

The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query.

Page 24: Advanced BLAST Searching

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

Page 25: Advanced BLAST Searching

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)

Page 26: Advanced BLAST Searching

R,I,K C D,E,T K,R,T N,L,Y,G

Page 27: Advanced BLAST Searching

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M  -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
 2 K  -1  1  0  1 -4  2  4 -2  0 -3 -3  3 -2 -4 -1  0 -1 -3 -2 -3
 3 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 4 V   0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  4
 5 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 6 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
 7 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
 8 L  -1 -3 -3 -4 -1 -3 -3 -4 -3  2  2 -3  1  3 -3 -2 -1 -2  0  3
 9 L  -1 -3 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  2
10 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
11 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
12 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
13 W  -2 -3 -4 -4 -2 -2 -3 -4 -3  1  4 -3  2  1 -3 -3 -2  7  0  0
14 A   3 -2 -1 -2 -1 -1 -2  4 -2 -2 -2 -1 -2 -3 -1  1 -1 -3 -3 -1
15 A   2 -1  0 -1 -2  2  0  2 -1 -3 -3  0 -2 -3 -1  3  0 -3 -2 -2
16 A   4 -2 -1 -2 -1 -1 -1  3 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2 -1
...
37 S   2 -1  0 -1 -1  0  0  0 -1 -2 -3  0 -2 -3 -1  4  1 -3 -2 -2
38 G   0 -3 -1 -2 -3 -2 -2  6 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
39 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -3 -2  0
40 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
41 Y  -2 -2 -2 -3 -3 -2 -2 -3  2 -2 -1 -2 -1  3 -3 -2 -2  2  7 -1
42 A   4 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
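To make the matrix concrete, here is a minimal Python sketch of how such a PSSM could be used to score an ungapped peptide: each position looks up its own row, so the same residue can score differently at different positions. Only the first five rows of the matrix above are copied in. The function name `score_peptide` and the second test peptide `LEFIW` are hypothetical choices for illustration, and gaps are ignored entirely.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 1-5 of the PSSM shown above (query residues M, K, W, V, W).
PSSM_ROWS = [
    [-1, -2, -2, -3, -2, -1, -2, -3, -2, 1, 2, -2, 6, 0, -3, -2, -1, -2, -1, 1],
    [-1, 1, 0, 1, -4, 2, 4, -2, 0, -3, -3, 3, -2, -4, -1, 0, -1, -3, -2, -3],
    [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
    [0, -3, -3, -4, -1, -3, -3, -4, -4, 3, 1, -3, 1, -1, -3, -2, 0, -3, -1, 4],
    [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
]


def score_peptide(peptide, pssm_rows=PSSM_ROWS):
    """Sum the position-specific scores of an ungapped peptide."""
    if len(peptide) > len(pssm_rows):
        raise ValueError("peptide is longer than the PSSM excerpt")
    return sum(
        row[COLUMNS.index(residue)]
        for row, residue in zip(pssm_rows, peptide)
    )


query_score = score_peptide("MKWVW")    # the query's own first five residues
variant_score = score_peptide("LEFIW")  # hypothetical related peptide
print(query_score, variant_score)
```

Note how W scores 12 at positions 3 and 5, while a substitution such as F at position 3 still earns a small positive score. A single fixed matrix like PAM or BLOSUM cannot express that kind of position dependence.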

Page 28: Advanced BLAST Searching


Page 29: Advanced BLAST Searching

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

Page 30: Advanced BLAST Searching
Page 31: Advanced BLAST Searching

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
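Step [2] can be illustrated with a deliberately simplified Python sketch that turns a small ungapped alignment into log-odds scores. It makes several assumptions that are not how PSI-BLAST itself works: a uniform background frequency of 1/20, a flat pseudocount of 1 per residue, no sequence weighting, and scores in rounded half-bit units. The function name `build_pssm` and the three-sequence toy alignment are likewise hypothetical.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one {residue: score} dict per alignment column.

    Scores are rounded log-odds in half-bit units:
    2 * log2(observed frequency / background frequency).
    """
    if background is None:
        # Assumption: uniform background instead of real residue frequencies.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] in background]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            row[aa] = round(2 * math.log2(freq / background[aa]))
        pssm.append(row)
    return pssm


# Toy ungapped alignment (three hypothetical six-residue segments).
toy_alignment = ["GTWYEI", "GKWYEV", "GTWYAM"]
pssm = build_pssm(toy_alignment)
print(pssm[0]["G"], pssm[1]["T"], pssm[1]["K"])
```

A fully conserved column (G in column 1) ends up with a higher score than a column where the residue varies (T or K in column 2). In the iterative procedure the profile would be rebuilt from the hits above threshold after every search, which this sketch does not attempt.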

Page 32: Advanced BLAST Searching

Results of a PSI-BLAST search

Iteration   # hits   # hits > threshold
1           104      49
2           173      96
3           236      178
4           301      240
5           344      283
6           342      298
7           378      310
8           382      320

Page 33: Advanced BLAST Searching

Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27  VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
           V+ENFD  ++ G WY + +K P      + I A +S+ E G +    K         ++
Sbjct: 33  VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87  ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
            D  GT          ++  +PAK +++++ +          +WI+ TDY+ YA+ YSC
Sbjct: 83  PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
               L ++D      + ++  R+P  LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1

Page 34: Advanced BLAST Searching

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2

Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query: 4   VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
           V  L+ LA  A      +  +F         V+ENFD  ++ G WY + +K P      +
Sbjct: 2   VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56  NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
            I A +S+ E G +    K       + D   + V      ++  +PAK +++++ +
Sbjct: 61  CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
                  +WI+ TDY+ YA+ YSC     L ++D      + ++  R+P  LPPE
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

Page 35: Advanced BLAST Searching

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3

Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3   WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
            V  L+ LA  A              +  S  V+ENFD  ++ G WY + K
Sbjct: 1   MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55  DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
           + I A +S+ E G +    K          V  +     ++  +PAK +++++ +
Sbjct: 60  NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
                +WI+ TDY+ YA+ YSC            + ++  R+P  LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Page 36: Advanced BLAST Searching

Comparison of the RBP and β-lactoglobulin alignments from iteration 3 and iteration 1:

Iteration 3: Score = 159 bits (404), Expect = 1e-38, Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Iteration 1: Score = 46.2 bits (108), Expect = 2e-04, Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Page 37: Advanced BLAST Searching

The universe of lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

Page 38: Advanced BLAST Searching

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

your RBP query

Page 39: Advanced BLAST Searching

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

retinol-binding protein

PAM250

PAM30

BLOSUM45

BLOSUM80

Page 40: Advanced BLAST Searching

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM

retinol-binding protein

retinol-binding protein

Page 41: Advanced BLAST Searching

PSI-BLAST: performance assessment

Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity.

Page 42: Advanced BLAST Searching

PSI-BLAST: the problem of corruption

PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins.

The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous.

Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.

Page 43: Advanced BLAST Searching

PSI-BLAST: the problem of corruption

Corruption is defined as the presence of at least one false positive alignment with an E value < 10^-4 after five iterations.

Three approaches to stopping corruption:

[1] Apply filtering of biased composition regions

[2] Adjust E value from 0.001 (default) to a lower value such as E = 0.0001.

[3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.

Page 44: Advanced BLAST Searching
Page 45: Advanced BLAST Searching

See problem 5-1 (page 153): create an artificial protein consisting of RBP4 and protein kinase C. How does PSI-BLAST perform?

Page 46: Advanced BLAST Searching
Page 47: Advanced BLAST Searching

PHI-BLAST: Pattern hit initiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Page 48: Advanced BLAST Searching

PHI-BLAST: Pattern hit initiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.

Page 49: Advanced BLAST Searching

       1                                                   50
ecblc  MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc     MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp  ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Align three lipocalins (RBP and two bacterial lipocalins)

Page 50: Advanced BLAST Searching

       1                                                   50
ecblc  MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc     MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp  ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEI
 K  AV
     M

Pick a small, conserved region and see which amino acid residues are used

Page 51: Advanced BLAST Searching

       1                                                   50
ecblc  MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc     MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp  ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEI
 K  AV
     M

GXW[YF][EA][IVLM]

Create a pattern using the appropriate syntax

Page 52: Advanced BLAST Searching
Page 53: Advanced BLAST Searching
Page 54: Advanced BLAST Searching

Syntax rules for PHI-BLAST

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8).

When using the stand-alone program, it is permissible to have multiple patterns. When using the Web page, only one pattern is allowed per query.

[ ] means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T

- means nothing (spacer character)

x(5) means 5 positions in which any residue is allowed

x(2,4) means 2 to 4 positions where any residue is allowed
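The syntax rules above map closely onto ordinary regular expressions, which the following Python sketch uses to test a pattern against sequences. It covers only the four rules listed on this slide and treats a bare X or x as "any residue", as in the pattern GXW[YF][EA][IVLM] built earlier. The function name `phi_pattern_to_regex` is hypothetical, and this is not how PHI-BLAST is implemented; the sketch finds pattern occurrences only and does none of the local alignment around them.

```python
import re


def phi_pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a compiled Python regex."""
    regex = pattern.replace("-", "")                             # '-' is a spacer
    regex = re.sub(r"[xX]\((\d+),(\d+)\)", r".{\1,\2}", regex)  # x(2,4)
    regex = re.sub(r"[xX]\((\d+)\)", r".{\1}", regex)           # x(5)
    regex = re.sub(r"[xX]", ".", regex)                          # single x
    return re.compile(regex)


# First 50 alignment columns of the three lipocalins, gap symbols removed.
lipocalins = {
    "ecblc": "MRLLPLVAAATAAFLVVACSSPTPPRGVTVVNNFDAKRYLGTWYEIARFD",
    "vc": "MRAIFLILCSVLLNGCLGMPESVKPVSDFELNNYLGKWYEVARLD",
    "hsrbp": "MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKD",
}

lipocalin_regex = phi_pattern_to_regex("GXW[YF][EA][IVLM]")
matches = {}
for name, seq in lipocalins.items():
    hit = lipocalin_regex.search(seq)
    matches[name] = hit.group() if hit else None
print(matches)
```

All three sequences contain exactly the conserved segment the pattern was designed from. A pattern this short would also match unrelated proteins by chance, which is the case PHI-BLAST filters out by requiring homology in the vicinity of the match.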

Page 55: Advanced BLAST Searching

BLAST for gene discovery

You can use BLAST to find a “novel” gene

Page 56: Advanced BLAST Searching

Start with the sequence of a known protein

tblastn: Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism)

inspect: Find matches…
[1] to DNA encoding known proteins
[2] to DNA encoding related (novel!) proteins
[3] to false positives

blastx or blastp (nr): Search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene
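The tblastn step works because the protein query is compared against the DNA database translated in all six reading frames. The Python sketch below shows only that translation idea, using the standard genetic code; the function names and the 19-base toy sequence are hypothetical, and no searching or scoring is performed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands in frames 1-3, keyed as '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy DNA (hypothetical): one extra base, then codons for G T W Y E I.
toy_dna = "AGGTACTTGGTATGAAATT"
frames = six_frame_translations(toy_dna)
print(frames["+2"])
```

Only one of the six frames reproduces the peptide. That is why a protein-versus-DNA search has to consider every frame on both strands, and why the resulting matches still need the inspection step that follows.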

Page 57: Advanced BLAST Searching
Page 58: Advanced BLAST Searching
Page 59: Advanced BLAST Searching
Page 60: Advanced BLAST Searching

this is a good candidate for a novel gene/protein

Page 61: Advanced BLAST Searching

A blastp nr search confirms that the Salmonella query is closely related to other lipocalins