NCBI FieldGuide A Field Guide part 2 August 30, 2005 University of Colorado Health Sciences Center


Page 1: A Field Guide part 2

August 30, 2005 University of Colorado Health Sciences Center

Page 2: Part 2

Entrez: text searching
• a GenBank record
• preview/index

BLAST: sequence searching
• pre-computed searches
• algorithms
• what’s new?

VAST: structure searching

Example: mapping oligos to a genome

Page 3: GenBank Records

Header

Feature Table

Sequence

The Flatfile Format

Page 4: A Typical GenBank Record

LOCUS       NM_019570    4279 bp    mRNA    linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .

DEFINITION = Title

Page 5: GenBank Record: Feature Table

Page 6: GenBank Record: Feature Table, cont.

GenPept identifier

Page 7: GenBank Record: sequence

skip

Page 8: Indexing for Nucleotide UID 59958365

Field                  Indexed Terms
[primary accession]    NM_001012399
[title]                Bos taurus hemochromatosis (hfe), mRNA.
[organism]             Bos taurus
[sequence length]      1168
[modification date]    2005/02/19
[properties]           biomol mrna
                       gbdiv mam
                       srcdb refseq

Short field tags: [accn], [orgn], [mdat], [prop]

Page 9: Global Entrez Search: HFE

HFE

Page 10: Entrez Nucleotide: HFE

137 records

Not HFE [Title]

Page 11: Smarter Query

hfe[title] AND human[orgn]

42 records

Curated HFE splice variants (11 total)

Page 12: hfe[title] AND human[orgn] (cont.)

Primary data

Page 13: Preview/Index

Page 14: Preview/Index

Page 15: Preview/Index: Properties, srcdb

Properties

srcdb

Page 16: Preview/Index: Properties, srcdb

…AND srcdb refseq[Properties]

Page 17: Preview/Index: Properties, srcdb

…AND srcdb ddbj/embl/genbank[Properties]

Page 18: Database Queries

#1 hfe 137
#2 hfe[title] AND human[orgn] 42
#3 #2 AND srcdb refseq[prop] 11
#4 #2 AND srcdb ddbj/embl/genbank[prop] 31
#5 #4 AND gbdiv pri[prop] 29
#6 #4 AND gbdiv est[prop] 2

Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]

Page 19: Molecule Queries

#1 hfe 116
#2 hfe[title] AND human[orgn] 42
#3 #2 AND biomol mrna[prop] 29
#4 #2 AND biomol genomic[prop] 13

Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]
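The numbered history queries on these slides can also be written out as single field-tagged terms. The following is a minimal Python sketch, assuming NCBI's E-utilities `esearch` endpoint and the database name `nucleotide` (neither appears on the slides); `combine` and `esearch_url` are hypothetical helper names. It only builds URLs and sends no request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the public E-utilities esearch endpoint; the slides only show the web forms.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def combine(*terms):
    """Join field-tagged Entrez terms with AND (hypothetical helper)."""
    return " AND ".join(terms)


def esearch_url(db, term):
    """Build an esearch URL for a database and a query term; nothing is sent."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


base = combine("hfe[title]", "human[orgn]")        # query #2 on the slide
mrna = combine(base, "biomol mrna[prop]")          # query #3
genomic = combine(base, "biomol genomic[prop]")    # query #4

for term in (base, mrna, genomic):
    print(esearch_url("nucleotide", term))
```

Keeping each restriction as a separate term mirrors the way the slides refine #2 step by step, so one filter can be swapped without retyping the rest.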

Page 20: More Queries…

Fields are database-specific

Entrez Nucleotide

Reviewed RefSeqs with transcript variants:

srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

Topoisomerase genes from Archaea:

topoisomerase[gene name] AND archaea[organism]

Entrez Gene

Genes on human chromosome 2 with OMIM links:

2[chromosome] AND human[organism] AND "gene omim"[filter]

Membrane proteins linked to cancer:

"integral to plasma membrane"[gene ontology] AND cancer[dis]

Page 21: Other Entrez Databases

UniSTS: markers on the Genethon map of human chromosome 12

Genethon[Map Name] AND human[organism] AND 12[chromosome]

UniGene: rat clusters that have at least one mRNA

rat[organism] NOT 0[mrna count]

Structure: structures of bacterial kinases with resolutions below 2 Å

bacteria[organism] AND kinase AND 000.00:002.00[resolution]

SNP: uniquely mapped microsatellites on human chr2

microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]

Page 22: Basic Local Alignment Search Tool

Page 23: BLAST Web Searches, 2005

200,000

Page 24: Precomputed BLAST Services

Nucleotide or protein: Related Sequences

BLAST link: BLink

Transcript clusters: UniGene

Protein homologs: HomoloGene

Page 25: Link to Related Sequences

Page 26: Related Sequences

Most similar

Least similar

Page 27: BLink (BLAST Link)

Page 28: BLink Output

Best hits

3D structures

CDD-Search

Page 29: Global vs Local Alignment

[Figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]

Page 30: Global vs Local Alignment

Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)

Global

Seq1:  1   W--HEREISWALTERNOW   16
           W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE  21

Local

Seq1:  1 W--HERE  5        Seq1:  1 W--HERE  5
         W  HERE                    W  HERE
Seq2:  3 WASHERE  9        Seq2: 15 WISHERE 21
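The difference between the two alignment modes can be sketched with textbook dynamic programming. This is a minimal Python sketch, not how BLAST itself computes alignments: `align_score` is a hypothetical function, and the scores of +1 for a match and -1 for a mismatch or gap are assumptions, since the slide gives no scoring scheme.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global, both sequences end to end).
    local=True:  Smith-Waterman (best-scoring pair of subregions).
    """
    cols = len(b) + 1
    # First row: accumulated gap penalties for global, zeros for local.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)      # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global:", align_score(seq1, seq2, local=False))
print("local: ", align_score(seq1, seq2, local=True))
```

The local score can never be lower than the global one, because the end-to-end alignment is itself one of the candidates a local search considers.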

Page 31: The Flavors of BLAST

• Standard BLAST
  – nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
  – traditional “contiguous” word hit

• Megablast
  – optimized for large batch searches
  – can use discontiguous words

• PSI-BLAST
  – constructs PSSMs automatically; uses as query
  – very sensitive protein search

• RPS BLAST
  – searches a database of PSSMs
  – tool for conserved domain searches

[Figure labels: “contiguous”, discontiguous]

Page 32: Why Is BLAST So Popular?

Fast
- heuristic approach based on Smith-Waterman

Local alignments

Statistical significance
- Expect value

Versatile
- blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
- www, standalone, and network clients

Page 33: How BLAST Works

• Make lookup table of “words” for query

• Scan database for hits

• Ungapped extensions of hits (initial HSPs)

• Gapped extensions (no traceback)

• Gapped extensions (traceback; alignment details)

Page 34: Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

GTACTGGACAT

TACTGGACATG

ACTGGACATGG

CTGGACATGGA

TGGACATGGAC

GGACATGGACC

GACATGGACCC

ACATGGACCCT

Make a lookup table of words (11-mers)

. . .
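The word list above can be generated mechanically. The following is a minimal Python sketch of such a lookup table, assuming a plain dictionary keyed by word; `word_table` is a hypothetical name, and the real BLAST lookup structure is more compact than this.

```python
from collections import defaultdict


def word_table(query, w=11):
    """Map every length-w word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


query = "GTACTGGACATGGACCCTACAGGAA"   # query from the slide
table = word_table(query)
for word, positions in list(table.items())[:3]:
    print(word, positions)
```

A query of length L yields L - w + 1 overlapping words, so this 25-base query contributes 15 entries at the slide's word size of 11.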

Page 35: Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Neighborhood Words

LTV, MTV, ISV, LSV, etc.

GTQ

TQI

QIT

ITV

TVE

VED

EDL

DLF

...

Make a lookup table of words

Word size = 3 (default)

Word size can only be 2 or 3

[ -f 11 = blastp default ]

Page 36: Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

GTQITVEDLFYNI

SEI YYN

ATCGCCATGCTTAATTGGGCTT

CATGCTTAATT

neighborhood words

one exact match

two matches

[ -A 40 = blastp default ]
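The "one exact match" rule for nucleotide searches can be sketched as a scan of one sequence against a word table built from the other. This is a minimal Python sketch: `seed_hits` is a hypothetical name, and treating the 11-mer as the query and the longer sequence as the subject is an assumption, since the slide does not label them.

```python
def seed_hits(query, subject, w=11):
    """Return (query_pos, subject_pos) for every exact w-mer shared by both."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        # Look up each subject word; a found word seeds an extension in BLAST.
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


subject = "ATCGCCATGCTTAATTGGGCTT"   # longer sequence from the slide
query = "CATGCTTAATT"                # the 11-mer shown matching it
print(seed_hits(query, subject))
```

Each pair returned would be the starting point for the ungapped extension step listed under "How BLAST Works"; with no shared word of length w, the function returns an empty list and no alignment is attempted.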

Page 37: BLASTP Summary

Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…

Example query words: YLS, HFL

Neighborhood words (neighborhood score threshold T (-f) = 11)

YLS 15, YLT 12, YVS 12, YIT 10, etc. …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …

High-scoring pair (HSP)

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Gapped extension with traceback: final HSP

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
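Neighborhood words can be enumerated by scoring every possible 3-letter word against a query word and keeping those at or above the threshold. This is a minimal Python sketch, assuming the BLOSUM62 values transcribed from the matrix slide in this deck (the X row is left out) and the threshold T = 11 shown above; `load_matrix`, `word_score` and `neighborhood` are hypothetical names, and the exhaustive loop is for illustration only.

```python
from itertools import product

# BLOSUM62 lower triangle for the 20 standard amino acids (row order = column order).
BLOSUM62_ROWS = """
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""


def load_matrix(text):
    """Parse a lower-triangular matrix into a symmetric {(a, b): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    order = [r[0] for r in rows]
    scores = {}
    for r in rows:
        for col, value in zip(order, r[1:]):
            scores[(r[0], col)] = scores[(col, r[0])] = int(value)
    return order, scores


ALPHABET, SCORES = load_matrix(BLOSUM62_ROWS)


def word_score(a, b):
    """Sum of substitution scores for two words of equal length."""
    return sum(SCORES[(x, y)] for x, y in zip(a, b))


def neighborhood(word, threshold=11):
    """All same-length words scoring at least `threshold` against `word`, best first."""
    found = []
    for letters in product(ALPHABET, repeat=len(word)):
        candidate = "".join(letters)
        s = word_score(word, candidate)
        if s >= threshold:
            found.append((s, candidate))
    return sorted(found, key=lambda t: (-t[0], t[1]))


for score, w in neighborhood("YLS")[:5]:
    print(w, score)
```

Raising the threshold shrinks the neighborhood and speeds up the scan at the cost of sensitivity, which is the trade-off the `-f` setting on the slide controls.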

Page 38: Scoring Systems - Nucleotides

Identity matrix [ -r 1 -q -3 ]

     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19 - 9 = 10
CACGTAGCAAGCTTG-GTGTCA
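The raw score on this slide can be recomputed column by column. This is a minimal Python sketch using the slide's match reward of +1 and mismatch penalty of -3; charging the single gap column -3 as well is an assumption chosen to reproduce the slide's "19 - 9" arithmetic, and `raw_score` is a hypothetical name.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores of a pairwise nucleotide alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap          # assumed gap-column cost, see lead-in
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))     # 19 matches, 2 mismatches, 1 gap column
```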

Page 39: Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)

PSI- and RPS-BLAST

Page 40: BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Callouts (residues highlighted: D, F, Y):

Negative for less likely substitutions

Positive for more likely substitutions

Page 41: Position-Specific Score Matrix

[Figure: DAF-1, serine/threonine protein kinases catalytic loop, annotated with PSSM scores]

Page 42: Position-Specific Score Matrix

Catalytic loop

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
435 K -1  0  0 -1 -2  3  0  3  0 -2 -2  1 -1 -1 -1 -1 -1 -1 -1 -2
436 E  0  1  0  2 -1  0  2 -1  0 -1 -1  0  0  0 -1  0  0 -1 -1 -1
437 S  0  0 -1  0  1  1  0  1  1  0 -1  0  0  0  2  0 -1 -1  0 -1
438 N -1  0 -1 -1  1  0 -1  3  3 -1 -1  1 -1  0  0 -1 -1  1  1 -1
439 K -2  1  1 -1 -2  0 -1 -2 -2 -1 -2  5  1 -2 -2 -1 -1 -2 -2 -1
440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1  0 -3  7 -1 -2 -3 -1 -1
441 A  3 -2  1 -2  0 -1  0  1 -2 -2 -2  0 -1 -2  3  1  0 -3 -3  0
442 M -3 -4 -4 -4 -3 -4 -4 -5 -4  7  0 -4  1  0 -4 -4 -2 -4 -1  2
443 A  4 -4 -4 -4  0 -4 -4 -3 -4  4 -1 -4 -2 -3 -4 -1 -2 -4 -3  4
444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5  0 -5
445 R -4  8 -3 -4  0 -1 -2 -3 -2 -5 -4  0 -3 -2 -4 -3 -3  0 -4 -5
446 D -4 -4 -1  8 -6 -2  0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
447 I -4 -5 -6 -6 -3 -4 -5 -6 -5  3  5 -5  1  1 -5 -5 -3 -4 -3  1
448 K  0  0  1 -3 -5 -1 -1 -3 -3 -5 -5  7 -4 -5 -3 -1 -2 -5 -4 -4
449 S  0 -3 -2 -3  0 -2 -2 -3 -3 -4 -4 -2 -4 -5  2  6  2 -5 -4 -4
450 K  0  3  0  1 -5  0  0 -4 -1 -4 -3  4 -3 -2  2  1 -1 -5 -4 -4
451 N -4 -3  8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
452 I -3 -5 -5 -6  0 -5 -5 -6 -5  6  2 -5  2 -2 -5 -4 -3 -5 -3  3
453 M -4 -4 -6 -6 -3 -4 -5 -6 -5  0  6 -5  1  0 -5 -4 -3 -4 -3  0
454 V -3 -3 -5 -6 -3 -4 -5 -6 -5  3  3 -4  2 -2 -5 -4 -3 -5 -3  5
455 K -2  1  1  4 -5  0 -1 -2  1 -4 -2  4 -3 -2 -3  0 -1 -5 -2 -3
456 N  1  1  3  0 -4 -1  1  0 -3 -4 -4  3 -2 -5 -2  2 -2 -5 -4 -4
457 D -3 -2  5  5 -1 -1  1 -1  0 -5 -4  0 -2 -5 -1  0 -2 -6 -4 -5
458 L -3 -1  0 -3  0 -3 -2  3 -4 -2  3  0  1  1 -2 -2 -3  5 -1 -3

Page 43: Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution

[Figure: distribution of alignments by score; axes: Score (S), Alignments; callouts: your score, expected number of random hits]

Expect Value

E = number of database hits you expect to find by chance, ≥ S

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$ (applies to ungapped alignments)

K = scale for search space

λ = scale for scoring system

S' = bitscore = (λS - ln K)/ln 2

More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
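The two forms of the Expect value are the same quantity written in raw-score and bit-score units. This is a minimal Python sketch of the conversion: the λ and K values are assumptions (figures commonly reported for gapped BLOSUM62 searches with gap costs 11 and 1), the lengths m and n are made up, and the raw score 741 is taken from the blastx output shown later in this deck. The slide states these formulas for ungapped alignments; applying them with gapped parameters is the usual practice but is treated here as an assumption.

```python
import math


def bit_score(raw, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')"""
    return m * n * 2.0 ** (-bits)


# Assumed statistical parameters (see lead-in); m and n are illustrative lengths.
LAMBDA, K = 0.267, 0.041
m, n = 300, 1_000_000_000
raw = 741

bits = bit_score(raw, LAMBDA, K)
print(round(bits), expect_from_raw(raw, m, n, LAMBDA, K), expect_from_bits(bits, m, n))
```

Because the bit score already folds in λ and K, bit scores from searches run with different scoring systems can be compared directly, which raw scores cannot.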

Page 44: Gapped Alignments

Gapping provides more biologically realistic alignments

Gapped BLAST parameters are simulated for each scoring matrix

Affine gap costs = -(a+bk)

a = gap open penalty

b = gap extend penalty

k = gap length

A gap of length 1 receives the score -(a+b)

Page 45: An Alignment BLAST Cannot Make

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.

Page 46: An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTX

BLAST 2 Sequences (blastx) output:

Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Page 47: Other BLAST Algorithms

• Megablast

• Discontiguous Megablast

• PSI-BLAST

• PHI-BLAST

Page 48: Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences

• Greedy algorithm

• Concatenation of query sequences

• Faster than blastn; less sensitive

Page 49: MegaBLAST & Word Size

Trade-off: sensitivity vs speed

Too fast for you?

Page 50: MegaBLAST & Word Size

Trade-off: sensitivity vs speed

WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
blastp       3         2

Page 51: Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

Page 52: Templates for Discontiguous Words

W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5

W = word size; # matches in template

t = template length
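A template is applied by sliding it along the sequence and keeping only the bases that fall under a 1. This is a minimal Python sketch using the first template from the slide; `discontiguous_words` is a hypothetical name and the two 16-base sequences are made up so that they differ only at the template's 0 positions.

```python
def discontiguous_words(seq, template):
    """Slide the template along seq; keep only bases under a '1' in each window."""
    t = len(template)
    keep = [i for i, flag in enumerate(template) if flag == "1"]
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - t + 1)]


CODING_11_16 = "1101101101101101"   # W = 11, t = 16, coding (from the slide)

a = "ATGGCCAAGCTGACCA"              # made-up 16-mer
b = "ATCGCGAAACTTACGA"              # differs from `a` only at the template's 0 positions
print(discontiguous_words(a, CODING_11_16))
print(discontiguous_words(b, CODING_11_16))
```

Both sequences yield the same 11-letter word even though they share no contiguous run of 11 bases, which is presumably why the slides recommend discontiguous words for cross-species comparisons: the coding template skips every third position, where codons tend to vary most.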

Page 53: Discontiguous (Cross-species) MegaBLAST

Page 54: Discontiguous Word Options

Page 55: MegaBLAST vs Discontiguous MegaBLAST

NM_017460 Homo sapiens cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4), transcript variant 1, mRNA (2768 letters)

vs Drosophila

Page 56: MegaBLAST vs Discontiguous MegaBLAST

MegaBLAST = “No significant similarity found.”

Discontiguous megaBLAST =

Page 57: Another Example . . .

Query: NM_078651

Drosophila melanogaster CG18582-PA (mbt) mRNA (3244 bp)

/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2

Database: nr (nt), Mammalia[orgn]

MegaBLAST = “No significant similarity found.”

Discontiguous megaBLAST = numerous hits . . .

Page 58: Ex: Discontiguous MegaBLAST

Page 59: Ex: BLASTN

Page 60: PSI-BLAST

Position-specific Iterated BLAST

Example: Confirming relationships of purine nucleotide metabolism proteins

Page 61: PSI-BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

0.005 E value cutoff for PSSM

Page 62: Results: Initial BLASTP

Same results as protein-protein BLAST; different format

Page 63: Results of First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Page 64: Tenth PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Check to add to PSSM

Page 65: Reverse PSI-BLAST (RPS-BLAST)

Page 66: Adenosine/AMP Deaminase Domain

AMP Deaminases

. . .

Page 67: PHI-BLAST

>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK

[GA]xxxxGK[ST]
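The pattern can be checked against the query with an ordinary regular expression. This is a minimal Python sketch that handles only the syntax visible on the slide (bracketed residue sets, literal residues, and x for any residue); `pattern_to_regex` is a hypothetical helper, not part of PHI-BLAST, and the fragment is a stretch of the CED4_CAEEL sequence above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as [GA]xxxxGK[ST] into a compiled regex.

    Lower-case 'x' means any residue; everything else is passed through.
    """
    return re.compile(pattern.replace("x", "."))


# Fragment of the CED4_CAEEL sequence shown on the slide.
fragment = "DEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDS"
match = pattern_to_regex("[GA]xxxxGK[ST]").search(fragment)
print(match.group(), match.start())
```

PHI-BLAST uses such a pattern to restrict hits to database sequences that contain the motif and are also similar to the query around it; the sketch covers only the pattern-matching half of that idea.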

Page 68: Genome BLAST

Page 69: Genome BLAST via Map Viewer

Page 70: Example Search Pathways: Hemochromatosis Gene

[Diagram: search pathways starting from “hemochromatosis” and HFE, linking OMIM, Gene, nucleotide sequence, Genome BLAST, Map Viewer, SNP, Protein, and Domains; legend: text search, sequence search]

Page 71: Example: Human Genome BLAST

TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC

Human EST

Page 72: Human Genome BLAST: Results

Page 73: Human Genome BLAST: MapViewer

Entrez Gene

Page 74: What’s New?

Page 75: BLAST Databases

Nucleotide

• refseq_rna = NM_*, XM_*
• refseq_genomic = NC_*, NG_*
• env_nt
  – environmental sample[filter], e.g., 16S rRNA

Protein

• refseq = NP_*, XP_*
• env_nr

Page 76: New Formatter

Select lower case

Select red

Page 77: New Formatter

• gray line = same database hit

• HSPs color-coded independently

Page 78: BLAST Output: Alignments & Filter

low complexity sequence filtered

Page 79: Advanced Options

Limit to Organism

all[filter] NOT ma

Example Entrez Queries
all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]

Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]

Other Advanced
-e 10000    expect value
-v 2000     descriptions
-b 2000     alignments

-e 10000 -v 2000

Page 80: Searching by Structure

Why search for similar structures?

• Find homologs with low sequence similarity

• Explore protein evolution: similar protein folds can support different functions

• Identify conserved core elements to model related proteins of unknown structure

Page 81: Indexing into MMDB

Structure

id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,

Add secondary structure

inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,

Add chemical bonds

• Import only experimentally determined structures
• Convert to ASN.1
• Verify sequences
• Create “backbone” model (Cα, P only)
• Create single-conformer model

MMDB: Molecular Modeling Data Base

Page 82: Structure Summary

Conserved Domains

3D Domain Neighbors

Structure Neighbors

Page 83: 3D Domains

[Figure: structure with 3D domains labeled 1, 2, 3, 4]

Page 84: Conserved Domains

TyrKc

SH3

SH2

Page 85: VAST: Alignment

For each protein chain, locate SSEs (secondary structure elements), represent SSEs as individual vectors, align the vectors.

[Figure: human IL-4 with SSE vectors numbered 1 to 6; IL-4 & Leptin]

Page 86: VAST

Structure neighbors

Taq DNA polymerase

Page 87: VAST Results for the Chain

Table view

Page 88: VAST

Vector Alignment Search Tool

3D Domain structure neighbors

Page 89: VAST Results for Domain 1

Not found with Chain query!

Page 90: Best way to convert PDB files to MMDB format for viewing with Cn3D!

submit file to PDB

Page 91: Example: Mapping Oligos Onto a Genome

>forward
CCATGGCGACCCTGGAAAAGC

>reverse
CAGCAGCGGCTGTGCCTGCGG

Page 92: Map Oligos Onto Genome

>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG

-W 7 -e 1000

forward primer reverse primer
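The combined query on this slide is simply the two primers joined by a run of N. This is a minimal Python sketch: `join_primers` and the FASTA title `primer_pair` are hypothetical, and the spacer length of 10 matches the run of N shown above.

```python
def join_primers(forward, reverse, spacer=10):
    """Concatenate two primers with a run of N between them as a single query."""
    return forward + "N" * spacer + reverse


forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
query = join_primers(forward, reverse)

print(">primer_pair")   # hypothetical FASTA title line
print(query)
```

Submitting both primers as one query lets a single search report each as a separate short hit, which is why the slide pairs it with a small word size (`-W 7`) and a permissive expect threshold (`-e 1000`).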

Page 93: Genome BLAST Results

Page 94: Primer Alignments

forward primer

reverse primer

Page 95: MapViewer

Page 96: MapViewer

Page 97: Sequence View (sv)

forward

reverse

Page 98: Service Addresses

• BLAST [email protected]

• General Help [email protected]

• Wayne Matten [email protected]