
A BioInformatics Survey . . . just a taste, with an emphasis on the GCG suite.

Steven M. Thompson

Florida State University School of Computational Science and Information Technology (CSIT)

A GCG SeqLab Introduction for:

Bioinformatics and Computational Biology

2014 Molecular Biology Building

Iowa State University, Ames, IA 50011

515-294-5122; 1-888-569-8509

Fax 515-294-6790

www.bcb.iastate.edu

Nov. 19 – 21, 2002

Introductory Overview:

What is bioinformatics, genomics, sequence analysis, computational molecular biology . . .

The Reverse Biochemistry Analogy.

Using sequence analysis tools, one can infer all sorts of functional, evolutionary, and, perhaps, structural insight into a gene, without the need to isolate and purify massive amounts of protein!

The computer is an essential part of this entire process.

Definitions:

Biocomputing and computational biology are fairly synonymous and both describe the use of computers and computational techniques to analyze biological systems.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

The exponential growth of molecular sequence databases & CPU power.

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    14396883064    13602262

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
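To put a number on "exponential," one can fit a straight line to log2(base pairs) against year; the inverse of the slope is the doubling time. The sketch below does that with the base-pair figures from the table above. The function name and the choice of a plain least-squares fit are illustrative assumptions, not something taken from the GenBank statistics page.

```python
import math

# (year, base pairs) copied from the growth table above
GENBANK_BASE_PAIRS = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984),
    (1997, 1160300687), (1998, 2008761784), (1999, 3841163011),
    (2000, 11101066288), (2001, 14396883064),
]


def doubling_time_years(points):
    """Least-squares fit of log2(size) against year; returns years per doubling."""
    xs = [year for year, _ in points]
    ys = [math.log2(size) for _, size in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return 1.0 / slope  # slope is doublings per year


if __name__ == "__main__":
    print(f"Estimated doubling time: {doubling_time_years(GENBANK_BASE_PAIRS):.2f} years")
```

A fit like this smooths over year-to-year jumps (such as the one between 1999 and 2000), so it is best read as a rough average over the whole period.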

Database Growth (cont.)

The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of August 2002, 91 complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Some neat stuff from the papers:

We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of the genome is only about 1% or so; much of the remaining “junk” is “jumping,” “selfish DNA,” of which much may be involved in regulation and control.

100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

What are these databases like?

What are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the “symbol” information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence; however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.
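As an illustration of what "ambiguity codes" means for nucleotides, here is a small sketch using the standard IUPAC one-letter codes. The table is the widely used IUPAC set rather than anything quoted from these slides, and the helper names are made up for the example.

```python
from itertools import product

# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbol_matches(symbol, base):
    """True if the (possibly ambiguous) symbol can stand for the given base."""
    return base.upper() in IUPAC_NUCLEOTIDES[symbol.upper()]


def expand(sequence):
    """List every unambiguous sequence that an ambiguous sequence could represent."""
    choices = [IUPAC_NUCLEOTIDES[symbol.upper()] for symbol in sequence]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(expand("AY"))              # A followed by a pyrimidine
    print(symbol_matches("R", "G"))  # R is a purine code
```

The number of expansions grows multiplicatively with each ambiguous position, so `expand` is only sensible for short sequences such as primers or motifs.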

What are sequence databases?

These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another.

North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D.

Europe: European Molecular Biology Laboratory (also EBI & ExPASy): EMBL & Swiss-Prot.

Asia: The DNA Data Bank of Japan (DDBJ).

Content & Organization:

Most sequence databases are examples of complex ASCII/Binary databases, but usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions.

Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure. Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often organized into sections by level of annotation.

What are other biological databases?

Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.

Still more; these can be considered ‘non-molecular’:

Reference Databases: e.g.

OMIM: Online Mendelian Inheritance in Man

PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that most biocomputing people don’t even usually consider:

e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how does one do Bioinformatics?

NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics . . .

Often on the Internet over the World Wide Web:

Site                          URL (Uniform Resource Locator)            Content
Nat’l Center Biotech' Info'   http://www.ncbi.nlm.nih.gov/              databases/analysis/software
PIR/NBRF                      http://www-nbrf.georgetown.edu/           protein sequence database
IUBIO Biology Archive         http://iubio.bio.indiana.edu/             database/software archive
Univ. of Montreal             http://megasun.bch.umontreal.ca/          database/software archive
Japan's GenomeNet             http://www.genome.ad.jp/                  databases/analysis/software
European Mol' Bio' Lab'       http://www.embl-heidelberg.de/            databases/analysis/software
European Bioinformatics       http://www.ebi.ac.uk/                     databases/analysis/software
The Sanger Institute          http://www.sanger.ac.uk/                  databases/analysis/software
Univ. of Geneva BioWeb        http://www.expasy.ch/                     databases/analysis/software
ProteinDataBank               http://www.rcsb.org/pdb/                  3D mol' structure database
Molecules R Us                http://molbio.info.nih.gov/cgi-bin/pdb/   3D protein/nuc' visualization
The Genome DataBase           http://www.gdb.org/                       The Human Genome Project
Stanford Genomics             http://genome-www.stanford.edu/           various genome projects
Inst. for Genomic Res’rch     http://www.tigr.org/                      esp. microbial genome projects
HIV Sequence Database         http://hiv-web.lanl.gov/                  HIV epidemiology seq' DB
The Tree of Life              http://tolweb.org/tree/phylogeny.html     overview of all phylogeny
Ribosomal Database Proj’      http://rdp.cme.msu.edu/html/              databases/analysis/software
WIT Metabolism                http://wit.mcs.anl.gov/WIT2/              metabolic reconstruction
Harvard Bio' Laboratories     http://golgi.harvard.edu/                 nice bioinformatics links list

So what are the alternatives . . . ?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,

commercial software packages are available, e.g. Omiga, MacVector, DNAsis, DNAStar, etc.,

but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, UNIX server-based solutions (e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.]):

One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (telnet, ssh, xdmcp, etc.) and terminal emulation; X graphics; file transfer (ftp, Mac Fetch, and scp/sftp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]).

Basic UNIX for Neophytes.

The Genetics Computer Group: The Accelrys Wisconsin Package for Sequence Analysis.

Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc.

The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results.

Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

To answer the always perplexing GCG question, “What sequence(s)? . . . .”

Specifying sequences, GCG style; in order of increasing power and complexity:

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression. Logical names are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, one can supply attribute information within list files to specify something special about the sequence.
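The four styles can be told apart by punctuation alone: a leading "@", a trailing "{...}", or a ":" between logical name and entry. The sketch below classifies a specification string on that basis. It is a simplified illustration written for this overview, not GCG's own parser; the function name and the returned tuples are assumptions, and it ignores edge cases such as file names that themselves contain a colon.

```python
import re


def classify_spec(spec):
    """Return (kind, target, selector) for a GCG-style sequence specification.

    kind is one of "list", "multiple", "database", or "file".
    """
    spec = spec.strip()
    if spec.startswith("@"):
        # @listfile: a file that holds further specifications
        return ("list", spec[1:], None)
    braces = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if braces:
        # file{selector}: entries inside an MSF or RSF file
        return ("multiple", braces.group(1), braces.group(2))
    if ":" in spec:
        # logicalname:entry, with the logical name treated as case insensitive
        logical, entry = spec.split(":", 1)
        return ("database", logical.upper(), entry)
    # anything else is taken to be a single sequence file
    return ("file", spec, None)


if __name__ == "__main__":
    for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                    "Ef1a-Tu.msf{*}", "@my.list"):
        print(example, "->", classify_spec(example))
```

Checking for "@" and braces before the colon matters, because a list file or multiple sequence file could be given with a full path.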

This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA

      51  GATTTAATAG CATGCGATCC CATGGGA
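To make the layout concrete, here is a rough sketch that splits such a file at the ".." line and pulls out the header fields and the sequence. The checksum function follows the position-weighted scheme commonly attributed to GCG (the same scheme Biopython exposes as `Bio.SeqUtils.CheckSum.gcg`); treat it as an assumption about how the "Check:" value is derived, and note that the sketch does not claim to reproduce the value shown in the example header.

```python
import re


def gcg_checksum(sequence):
    """Position-weighted checksum: weights cycle 1..57, result is taken mod 10000."""
    total = 0
    for index, symbol in enumerate(sequence):
        total += (index % 57 + 1) * ord(symbol.upper())
    return total % 10000


def parse_gcg_single(text):
    """Split a GCG single sequence file into documentation, header fields, and sequence."""
    lines = text.splitlines()
    for divider, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no line ending in '..' found")
    header = lines[divider]
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = "".join(ch for line in lines[divider + 1:] for ch in line if ch.isalpha())
    return {
        "documentation": "\n".join(lines[:divider]).strip(),
        "length": int(re.search(r"Length:\s*(\d+)", header).group(1)),
        "check": int(re.search(r"Check:\s*(\d+)", header).group(1)),
        "sequence": sequence,
    }


EXAMPLE = """This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""

if __name__ == "__main__":
    record = parse_gcg_single(EXAMPLE)
    print(record["length"], len(record["sequence"]), gcg_checksum(record["sequence"]))
```

Comparing the declared length with the number of symbols actually read is a cheap sanity check after any file transfer or hand edit.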

‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs).

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:

GENMBLPLUS     all of GenEMBL plus EST and GSS subdivisions
GEP            all of GenEMBL plus EST and GSS subdivisions
GENEMBL        all of GenEMBL except EST and GSS subdivisions
GE             all of GenEMBL except EST and GSS subdivisions
BA             GenEMBL bacterial subdivision
BACTERIAL      GenEMBL bacterial subdivision
EST            GenEMBL EST (Expressed Sequence Tags) subdivision
GSS            GenEMBL GSS (Genome Survey Sequences) subdivision
HTC            GenEMBL High Throughput cDNA
HTG            GenEMBL High Throughput Genomic
IN             GenEMBL invertebrate subdivision
INVERTEBRATE   GenEMBL invertebrate subdivision
OM             GenEMBL other mammalian subdivision
OTHERMAMM      GenEMBL other mammalian subdivision
OV             GenEMBL other vertebrate subdivision
OTHERVERT      GenEMBL other vertebrate subdivision
PAT            GenEMBL patent subdivision
PATENT         GenEMBL patent subdivision
PH             GenEMBL phage subdivision
PHAGE          GenEMBL phage subdivision
PL             GenEMBL plant subdivision
PLANT          GenEMBL plant subdivision
PR             GenEMBL primate subdivision
PRIMATE        GenEMBL primate subdivision
RO             GenEMBL rodent subdivision
RODENT         GenEMBL rodent subdivision
STS            GenEMBL (sequence tagged sites) subdivision
SY             GenEMBL synthetic subdivision
SYNTHETIC      GenEMBL synthetic subdivision
TAGS           GenEMBL EST and GSS subdivisions
UN             GenEMBL unannotated subdivision
UNANNOTATED    GenEMBL unannotated subdivision
VI             GenEMBL viral subdivision
VIRAL          GenEMBL viral subdivision

Sequence databases, amino acids:

GENPEPT        GenBank CDS translations
GP             GenBank CDS translations
SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
SWP            all of Swiss-Prot and all of SPTrEMBL
SWISSPROT      all of Swiss-Prot (fully annotated)
SW             all of Swiss-Prot (fully annotated)
SPTREMBL       Swiss-Prot preliminary EMBL translations
SPT            Swiss-Prot preliminary EMBL translations
P              all of PIR Protein
PIR            all of PIR Protein
PROTEIN        PIR fully annotated subdivision
PIR1           PIR fully annotated subdivision
PIR2           PIR preliminary subdivision
PIR3           PIR unverified subdivision
PIR4           PIR unencoded subdivision
NRL_3D         PDB 3D protein sequences
NRL            PDB 3D protein sequences

General data files:

GCGCORE        path to main GCG files
GENMOREDATA    path to GCG optional data files
GENRUNDATA     path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

GCG MSF & RSF format

The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}, when specifying!

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00

// //////////////////////////////////////////////////

RSF is SeqLab’s native format.
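The "Name:" lines of an MSF header are regular enough to read with one pattern. The following sketch extracts them into dictionaries; the regular expression and function name are illustrative assumptions based only on the header layout shown above, not on a formal MSF specification.

```python
import re

# One "Name:" entry as it appears in the MSF header shown above.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def parse_msf_names(text):
    """Return one dict per 'Name:' entry found in an MSF header."""
    return [
        {
            "name": match.group(1),
            "len": int(match.group(2)),
            "check": int(match.group(3)),
            "weight": float(match.group(4)),
        }
        for match in NAME_LINE.finditer(text)
    ]


HEADER = """small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
"""

if __name__ == "__main__":
    for entry in parse_msf_names(HEADER):
        print(entry)
```

The names recovered this way are exactly what goes inside the braces when specifying a subset, e.g. filename{a49171}.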

The List File Format

An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@[email protected]

The ‘way’ SeqLab works! Remember the @ sign!
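Reading a list file comes down to two steps: skip the documentation up to the ".." divider, then treat the first field on each remaining line as a sequence specification and any "key:value" fields after it as attributes. The sketch below follows that reading of the example above. It is not GCG's parser; the function name is an assumption, and `@another.list` in the sample text is a hypothetical file name used only for illustration.

```python
def parse_list_file(text):
    """Return [(specification, attributes)] from GCG list file text."""
    lines = text.splitlines()
    start = 0
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            start = index + 1  # data begins after the divider line
            break
    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        spec, attributes = fields[0], {}
        for field in fields[1:]:
            key, sep, value = field.partition(":")
            if sep:
                attributes[key] = int(value) if value.isdigit() else value
        entries.append((spec, attributes))
    return entries


SAMPLE = """An example list file; two periods separate
documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""

if __name__ == "__main__":
    for spec, attributes in parse_list_file(SAMPLE):
        print(spec, attributes)
```

Only fields after the first are scanned for attributes, so the colon inside a specification such as SwissProt:EfTu_Ecoli is left alone.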

SeqLab: GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X-Windowing environment, either native on UNIX computers (including Linux; not included by Apple in Mac OS X [v.10+], but see XDarwin), or emulated with X-Server Software on personal computers.

The SeqLab Tutorial —

Elongation Factor 1 from a sorted collection of ‘lower’ Eukaryotes.

How to search databases, analyze and interpret pair-wise comparisons for significance, and prepare and analyze multiple sequence alignments.

Supplement —

How the algorithms work; accompanying illustrations for the Tutorial Introduction:

Dynamic Programming,
Score Matrices,
Significance,
Database Similarity Searching,
Dot Matrix Techniques,
Multiple Sequence Analysis.

What about Homology?

Inference through homology is a fundamental principle of biology!

What is homology? In this context it is similarity great enough that common ancestry is implied. Walter Fitch, a famous molecular evolutionist, likes to relate the analogy: homology is like pregnancy, you either are or you’re not; there’s no such thing as 65% pregnant!

Pairwise Comparisons: Dynamic Programming.

A ‘brute force’ approach just won’t work. The computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences, without considering gaps at all. If the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, the calculation needs to be repeated 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem.

Therefore, an optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

1) you maximize the number of matching symbols between 1 and 2;

2) you minimize the number of indels within 1 and 2; and

3) you minimize the number of mismatched symbols between 1 and 2.

The actual solution can then be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

Where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2,

$s_{ij}$ is the score for aligning i with j,

$w_x$ is the score for making an x long gap in sequence 1,

$w_y$ is the score for making a y long gap in sequence 2,

allowing gaps to be any length in either sequence.

An oversimplified example —

total penalty = gap opening penalty {zero here} + ([length of gap] × [gap extension penalty {one here}])
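The matrix-filling idea can be sketched in a few lines of Python. This is a minimal sketch, not the GCG implementation: it assumes the common three-case form of the recurrence, which is equivalent to allowing gaps of any length when the gap opening penalty is zero, as in the oversimplified example. The function name and the match/mismatch values are illustrative assumptions.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=0, gap_extend=1):
    """Score of the best global alignment with a linear gap penalty.

    Assumes gap opening penalty = 0, so a gap of length L costs L * gap_extend.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]
    # Leading gaps: aligning a prefix against nothing costs one penalty per symbol.
    for i in range(1, rows):
        S[i][0] = -i * gap_extend
    for j in range(1, cols):
        S[0][j] = -j * gap_extend
    for i in range(1, rows):
        for j in range(1, cols):
            s_ij = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s_ij,     # align symbol i with symbol j
                S[i - 1][j] - gap_extend,   # symbol i of sequence 1 opposite a gap
                S[i][j - 1] - gap_extend,   # symbol j of sequence 2 opposite a gap
            )
    return S[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Filling the matrix takes time proportional to the product of the two lengths, which is what makes dynamic programming practical where brute force is not.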

Optimum Alignments —

There will probably be more than one best path through the matrix and none of them may be the biologically CORRECT alignment. Starting at the top and working down as we did, then tracing back, I found two optimum alignments:

cTATAtAagg      cTATAtAagg
|  |||||        |   ||||
cg.TAtAaT.      cgT.AtAaT.

Each of these solutions yields a traceback total score of 22. This is the number optimized by the algorithm, not any type of a similarity or identity score! Even though one of these alignments has 6 exact matches and the other has 5, they are both optimal according to the relatively strange criteria by which we solved the algorithm. Software will report only one of these solutions. Do you have any ideas about how others could be discovered? Answer: often, if you reverse the solution of the entire dynamic programming process, other solutions can be found!

Global versus local solution: for a local solution, use negative numbers in the match matrix and pick the best diagonal within the overall graph.
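A local solution can be sketched as follows. This is a minimal Smith-Waterman style sketch under stated assumptions: mismatches score negative, every cell is floored at zero so a poor region cannot drag down a good one, and the best cell anywhere in the matrix is reported. The function name and the scoring values are illustrative, not taken from any GCG program.

```python
def local_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=1):
    """Best local alignment score between seq1 and seq2 (linear gap penalty)."""
    best = 0
    prev = [0] * (len(seq2) + 1)
    for i in range(1, len(seq1) + 1):
        curr = [0] * (len(seq2) + 1)
        for j in range(1, len(seq2) + 1):
            s_ij = match if seq1[i - 1] == seq2[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + s_ij,   # extend along the diagonal
                prev[j] - gap,        # gap in seq2
                curr[j - 1] - gap,    # gap in seq1
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Only the shared TATA core contributes to the local score.
print(local_alignment_score("GGGTATAGGG", "CCCTATACCC"))
```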

What about proteins — conservative replacements and similarity as opposed to identity, and similarity versus homology! Similarity is not automatically homology. Homology always means related by descent from a common ancestor.

In the original slide, values whose magnitude is at least 4 are drawn in outline characters to make them easier to recognize. Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

BLOSUM62 amino acid substitution matrix. Henikoff, S. and Henikoff, J. G. (1992). GAP_CREATE 12 GAP_EXTEND 4

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  x
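To show how such a matrix is used, here is a minimal sketch that scores an ungapped pair of aligned peptides. The dictionary holds only a handful of entries copied from the BLOSUM62 table; the names `BLOSUM62_SUBSET`, `pair_score`, and `ungapped_score` are illustrative assumptions, and a real program would load the full matrix.

```python
# A few BLOSUM62 entries only; the matrix is symmetric, so each pair is stored once.
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("C", "C"): 9, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "I"): 4, ("V", "V"): 4, ("L", "L"): 4,
    ("I", "V"): 3, ("I", "L"): 2, ("D", "E"): 2, ("D", "W"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score, trying both orderings of the pair."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError if the pair is not in the subset


def ungapped_score(pep1, pep2, matrix=BLOSUM62_SUBSET):
    """Sum of substitution scores over two equal-length, ungapped peptides."""
    if len(pep1) != len(pep2):
        raise ValueError("peptides must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(pep1, pep2))


# The conservative I/V replacement still scores positively: 11 + 3 + 9.
print(ungapped_score("WIC", "WVC"))
```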

Significance — When is an Alignment Worth Anything Biologically?

Monte Carlo simulations:

Z score = [ (actual score) - (mean of randomized scores) ] / (standard deviation of randomized score distribution)

Many Z scores measure the distance from a mean using a simplistic Monte Carlo model assuming a normal distribution, in spite of the fact that ‘sequence-space’ actually follows what is known as an ‘extreme value distribution’; however, the Monte Carlo method does approximate significance estimates pretty well.
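The Monte Carlo procedure can be sketched as below. This is only an illustration under stated assumptions: the second sequence is shuffled to preserve its composition, and the score is a simple longest-common-subsequence length standing in for a real alignment score. The names `lcs_score` and `monte_carlo_z`, the number of shuffles, and the fixed seed are all hypothetical choices.

```python
import random
import statistics


def lcs_score(a, b):
    """Longest common subsequence length, used here as a stand-in alignment score."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def monte_carlo_z(seq1, seq2, score=lcs_score, shuffles=100, seed=0):
    """Z score of the actual score against scores of shuffled versions of seq2."""
    rng = random.Random(seed)
    actual = score(seq1, seq2)
    letters = list(seq2)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # same composition, scrambled order
        randomized.append(score(seq1, "".join(letters)))
    mean = statistics.mean(randomized)
    sd = statistics.pstdev(randomized)
    if sd == 0:
        return 0.0  # degenerate case: every shuffle scored the same
    return (actual - mean) / sd


print(monte_carlo_z("MKTAYIAKQRQISFVKSHFS", "MKTAYIAKQRQISFVKSHFS"))
```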

Histogram Key:
 Each histogram symbol represents 604 search set sequences
 Each inset symbol represents 21 search set sequences
 z-scores computed from opt scores

z-score    obs    exp
           (=)    (*)
 < 20      650      0
   22        0      0
   24        3      0
   26       22      8
   28       98     87
   30      289    528
   32     1714   2042
   34     5585   5539
   36    12495  11375
   38    21957  18799
   40    28875  26223
   42    34153  32054
   44    35427  35359
   46    36219  36014
   48    33699  34479
   50    30727  31462
   52    27288  27661
   54    22538  23627
   56    18055  19736
   58    14617  16203
   60    12595  13125
   62    10563  10522
   64     8626   8368
   66     6426   6614
   68     4770   5203
   70     4017   4077
   72     2920   3186
   74     2448   2484
   76     1696   1933
   78     1178   1503
   80      935   1167
   82      722    893
   84      454    707
   86      438    547
   88      322    423
   90      257    328
   92      175    253
   94      210    196
   96      102    152
   98       63    117
  100       58     91
  102       40     70
  104       30     54
  106       17     42
  108       14     33
  110       14     25
  112       12     20
  114        9     15
  116        6     12
  118        8      9
 >120     1030      7

These are the best hits, those most similar sequences with a Pearson z-score greater than 120 in this search.

‘Sequence-space’ actually follows the ‘extreme value distribution.’ Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated. The particulars of how BLAST and FastA do this differ, but the ‘take-home’ message is the same:

The higher the E value is, the more probable it is that the observed match is due to chance in a search of the same size database, and the lower its Z score will be; i.e., the match is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant the match is and the higher its Z score will be! The E value is the number that really matters.
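As a rough illustration of how an E value depends on score and database size, the sketch below uses the Karlin-Altschul form E = K·m·n·e^(−λS) that underlies BLAST statistics. It is not how FastA derives its estimate, and the default `lam` and `k` values are illustrative placeholders; real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance hits scoring at least `score`.

    lam and k are placeholder parameters (assumptions), not values from the source.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# A higher score, or a smaller database, gives a smaller (more significant) E value.
for s in (30, 50, 100):
    print(s, expect_value(s, query_len=300, db_len=100_000_000))
```

The same score therefore means less in a larger database, which is why the E value, and not the raw score, is the number to compare.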

Pairwise Comparisons — Database Searching

BLAST — Basic Local Alignment Search Tool, developed at NCBI.

1) Normally NOT a good idea to use for DNA against DNA searches (not optimized);
2) Prefilters repeat and “low complexity” sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases.

FastA — and its family of relatives, developed by Bill Pearson at the University of Virginia.

1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should usually be run in the background;
4) Does not require specially prepared, preformatted databases.

Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces the complexity of the search problem from $N^2$ for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database searching the heuristic restricts search space by calculating a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued.
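The look-up table idea can be sketched like this. It is a minimal illustration, not the BLAST or FastA code: the query is broken into overlapping words of length k, each word is keyed to its positions, and a single pass over the subject then counts word hits per diagonal. The function names are hypothetical.

```python
from collections import defaultdict


def build_word_table(seq, k=3):
    """Map every overlapping k-letter word in seq to its start positions."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, subject, k=3):
    """Count exact word hits per diagonal (subject position minus query position)."""
    table = build_word_table(query, k)
    hits = defaultdict(int)
    for s_pos in range(len(subject) - k + 1):
        word = subject[s_pos:s_pos + k]
        for q_pos in table.get(word, ()):
            hits[s_pos - q_pos] += 1
    return dict(hits)


# All six words of the query land on the same diagonal, offset by two.
print(diagonal_hits("GHVDHGKS", "AAGHVDHGKS"))
```

Many hits on one diagonal mark a region worth further scrutiny; diagonals with few or no hits are never examined, which is where the speed comes from.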

Versions of each are available for DNA-DNA, DNA-protein, protein-DNA, and protein-protein searches. Translations are done ‘on the fly’ for mixed searches.

The algorithms —

BLAST:

Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.

Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

FastA:

Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.

Combine nonoverlapping init regions on different diagonals: initn.

Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
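The ungapped extension step can be sketched as follows. This is a simplified illustration only: it extends a single word hit in both directions and stops once the running score falls more than `x_drop` below the best score seen, a stand-in for the thresholds described above. The two-hit trigger and the later gapped extension are not shown, and all names and default values are assumptions.

```python
def _extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk along one diagonal in one direction; return (best gain, best length)."""
    score = best = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score has dropped too far below the best seen
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, k, match=1, mismatch=-1, x_drop=2):
    """Extend a k-letter word hit without gaps; return (score, query span)."""
    seed = sum(
        match if query[q_pos + i] == subject[s_pos + i] else mismatch
        for i in range(k)
    )
    right, r_len = _extend(query, subject, q_pos + k, s_pos + k, 1,
                           match, mismatch, x_drop)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1, -1,
                          match, mismatch, x_drop)
    return seed + left + right, (q_pos - l_len, q_pos + k + r_len)


score, (start, end) = extend_hit("AAGHVDHGKSTT", "CCGHVDHGKSCC", 2, 2, 3)
print(score, "AAGHVDHGKSTT"[start:end])
```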

Pairwise Comparisons:

The Dot Matrix Method.

Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin: a very simple 0, 1 (no match, match) identity scoring function.

Put a dot wherever symbols match.

A way to see similarities —

Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

Noise due to random composition effects contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).

The phenylalanine transfer RNA molecule from yeast plotted against itself using a window size of 7 and a stringency value of 5. As a general guide, pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is just filtered away enough to enable you to see that desired feature.
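The windowed dot matrix can be sketched in a few lines. This is a minimal illustration assuming an odd window size and a zero:one identity score; the function names and the text rendering are hypothetical and unrelated to the GCG programs.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1):
    """Return the set of (i, j) window centers where at least `stringency`
    of the `window` aligned symbols match. Assumes an odd window size."""
    half = window // 2
    dots = set()
    for i in range(half, len(seq1) - half):
        for j in range(half, len(seq2) - half):
            matches = sum(
                seq1[i + d] == seq2[j + d] for d in range(-half, half + 1)
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the matrix as text: one row per symbol of seq1, '*' for a dot."""
    return [
        "".join("*" if (i, j) in dots else "." for j in range(len(seq2)))
        for i in range(len(seq1))
    ]


# No window: every identity gets a dot. Window 3, stringency 2: noise is filtered.
for row in render("ACG", "ACG", dot_matrix("ACG", "ACG")):
    print(row)
print(sorted(dot_matrix("ACGT", "ACGT", window=3, stringency=2)))
```

Raising the stringency relative to the window removes more background, at the risk of erasing weak but real features.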

RNA comparisons of the reverse complement of a sequence to itself can often be very informative. The yeast tRNA sequence is compared to its reverse complement using the same 5 out of 7 stringency setting as previously. The stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal running in the opposite direction from before.
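Producing the reverse complement is a one-line operation. The sketch below assumes plain A, C, G, U symbols only (no modified bases or ambiguity codes), and the function name is illustrative.

```python
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq):
    """Complement each base (A<->U, C<->G), then reverse the strand."""
    return seq.upper().translate(_RNA_COMPLEMENT)[::-1]


# A stretch that can pair with itself reads the same as its reverse complement.
print(reverse_complement_rna("GAGC"))
print(reverse_complement_rna("GGAUCC"))
```

Comparing a sequence with this reverse complement turns potential stems into ordinary diagonals of identity, which is why inverted repeats stand out in the plot.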

22 GAGCGCCAGACT G     12, 22
   || | ||||| |   A
48 CTGGAGGTCTAG A      3

Base position 22 through position 33 base pairs with (think: is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.

Multiple Sequence Alignment & Analysis —

Dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix . . . .

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.

Conserved regions can be visualized with a sliding window approach and appear as peaks. Concentrate on the first peak seen here.
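A sliding-window conservation profile can be sketched as below. This is a simplified stand-in for a similarity plot: it measures conservation as the fraction of sequences sharing the most common symbol in each column (gaps count as a symbol) and then averages over a window. Real programs typically average substitution-matrix scores instead, and the function names here are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences carrying the most common symbol in each column."""
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def windowed_conservation(alignment, window=3):
    """Average the per-column conservation over a sliding window."""
    per_column = column_conservation(alignment)
    return [
        sum(per_column[i:i + window]) / window
        for i in range(len(per_column) - window + 1)
    ]


toy_alignment = ["GHVD", "GHAD", "GKVD"]  # three aligned toy sequences
print(column_conservation(toy_alignment))
print(windowed_conservation(toy_alignment, window=2))
```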

Motifs —

GHVDHGKS

A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one-dimensional ‘pattern descriptors.’

PROSITE Database of protein families and domains: over 1,000 motifs.

This motif, the P-loop, is defined as (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine.
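A one-dimensional pattern descriptor like this maps directly onto a regular expression. The sketch below translates (A,G)x4GK(S,T) by hand into Python's `re` syntax; the function name and the test peptide around the GHVDHGKS motif are illustrative assumptions.

```python
import re

# (A,G) x4 G K (S,T): A or G, any four residues, Gly-Lys, then Ser or Thr.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein):
    """Return (start position, matched text) for each non-overlapping P-loop match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(protein)]


print(find_p_loops("MSKEKGHVDHGKSTL"))
```

Every match counts equally, whichever of the allowed residues occurs, so the pattern says nothing about how strongly each position is preferred.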

But motifs cannot convey any degree of the ‘importance’ of the residues.

Enter the Profile

Given a multiple sequence alignment, how can we use all of the information contained in it to find ever more remotely similar sequences, that is, those “Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those BLAST/FastA E values above ~$10^{-5}$ or so?

Use a position specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter!

The threonine at position 27 is absolutely conserved, so it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment; the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
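The idea of weighting substitution scores by column conservation can be sketched as follows. This is a toy illustration, not the GCG profile algorithm: each profile entry is the frequency-weighted average substitution score of a candidate residue against the residues observed in that column, and the toy match/mismatch function stands in for a real matrix such as BLOSUM62. All names are hypothetical, and no gap penalties or sequence weights are included.

```python
from collections import Counter


def toy_substitution(a, b):
    """Stand-in for a real substitution matrix: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def build_profile(alignment, alphabet, sub=toy_substitution):
    """One dict per column: residue -> frequency-weighted substitution score."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        freqs = Counter(column)
        profile.append({
            aa: sum(count / n * sub(aa, seen) for seen, count in freqs.items())
            for aa in alphabet
        })
    return profile


def score_against_profile(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))


profile = build_profile(["TA", "TG", "TA"], alphabet="AGT")
print(profile[0])  # fully conserved column: sharp preference for T
print(profile[1])  # variable column: scores are flatter
print(score_against_profile(profile, "TA"))
```

A fully conserved column yields the extreme scores, while a variable column flattens them, which is the behavior described for positions 27 and 16 above.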

Advanced methodologies —

Many wondrous things can be accomplished based on combinations of all the previous techniques.

PSI-BLAST uses profile methods to iterate database searches.

Motif profiles can be discovered in unaligned sequences using Expectation Maximization or optimized in aligned sequences with Hidden Markov Modeling statistics.

Secondary structure can be predicted in many cases. See http://www.embl-heidelberg.de/predictprotein/predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at http://www.expasy.ch/swissmod/SWISS-MODEL.html.

Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide.

See the listed references and WWW sites. Many fine texts are also starting to become available in the field.

FOR EVEN MORE INFO...

http://bio.fsu.edu/~stevet/workshop.html

Contact me (stevet@[email protected]) for specific bioinformatics assistance and/or collaboration.

To learn more -

Gunnar von Heijne in his quite readable treatise, Gunnar von Heijne in his quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:(1987), provides a very appropriate conclusion:

““Think about what you’re doing; use your knowledge of the molecular system involved to guide Think about what you’re doing; use your knowledge of the molecular system involved to guide

both your interpretation of results and your direction of inquiry; use as much information as both your interpretation of results and your direction of inquiry; use as much information as

possible; and do not blindly accept everything the computer offers you.”possible; and do not blindly accept everything the computer offers you.”

He continues:He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Conclusions -

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2001) Program Manual for the Wisconsin Package, Version 10.2, Madison, Wisconsin, USA 53711.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile Analysis: Detection of Distantly Related Proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.

Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS 10, 671-675.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Research 22, 4673-4680.

von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.

Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.

References -
