bioinformatics2

7/31/2019 Bioinformatics2


Tools to analyze protein characteristics

Protein sequence

- Family member
- Multiple alignments
- Identification of conserved regions
- Evolutionary relationship (phylogeny)
- 3-D fold model
- Protein sorting and sub-cellular localization
- Anchoring into the membrane
- Signal sequence (tags)

Some nascent proteins contain a specific signal, or targeting sequence, that directs them to the correct compartment (ER, mitochondria, chloroplast, lysosome, vacuole, Golgi, or cytosol).



Questions

Can we train computers:

- To detect signal sequences and predict protein destination?
- To identify conserved domains (or patterns) in proteins?
- To predict the membrane-anchoring type of a protein (transmembrane domain, GPI anchor)?
- To predict the 3-D structure of a protein?

Learning algorithms are good at solving pattern-recognition problems because they can be trained on a sample data set.

Classes of learning algorithms:

- Artificial neural networks (ANNs)
- Hidden Markov models (HMMs)



Artificial neural networks (ANN)

Machine learning algorithms that mimic the brain. Real brains, however, are orders of magnitude more complex than any ANN considered so far.

ANNs, like people, learn by example. They are not explicitly programmed to perform a specific task; they are trained on sample data.

An ANN is composed of a large number of highly interconnected processing elements (neurons) working simultaneously to solve specific problems.

The first artificial neuron was developed in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
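
To make "learning by example" concrete, here is a minimal sketch of a single threshold neuron trained with the perceptron rule. The toy task (logical AND), the function names, and the integer learning rate are all assumptions chosen for illustration; real predictors use many neurons and much larger training sets.

```python
def step(total):
    """Threshold activation: the neuron fires (1) only for a positive input."""
    return 1 if total > 0 else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by the threshold."""
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))


def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights from (inputs, target) examples using the perceptron rule."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight toward the target only when the output was wrong.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy training set (an assumption): the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])  # [0, 0, 0, 1]
```

Nothing in the code states the AND rule explicitly; the weights that reproduce it are found from the examples alone.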



Hidden Markov Models (HMM)

An HMM is a probabilistic process over a set of states in which the states themselves are hidden. Only the outcome is visible to the observer, hence the name hidden Markov model.

HMMs have many uses in genomics:

- Gene prediction (GENSCAN)
- Signal peptide prediction (SignalP)
- Finding periodic patterns

They are used to answer questions like:

- What is the probability of obtaining a particular outcome?
- What is the best model among many combinations?
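
The first question, the probability of a particular outcome, is usually answered with the forward algorithm. The sketch below assumes a hypothetical two-state model that emits H (hydrophobic) or P (polar) residues; the state names and every probability are invented for illustration and are not taken from GENSCAN or SignalP.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Return P(observations), summed over every possible hidden state path."""
    # alpha[s]: probability of the symbols seen so far with the path ending in s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: a hydrophobic-rich region versus any other region.
STATES = ("hydrophobic_region", "other_region")
START_P = {"hydrophobic_region": 0.5, "other_region": 0.5}
TRANS_P = {
    "hydrophobic_region": {"hydrophobic_region": 0.9, "other_region": 0.1},
    "other_region": {"hydrophobic_region": 0.2, "other_region": 0.8},
}
EMIT_P = {
    "hydrophobic_region": {"H": 0.8, "P": 0.2},
    "other_region": {"H": 0.3, "P": 0.7},
}

print(forward_probability("HHPH", STATES, START_P, TRANS_P, EMIT_P))
```

The observer only supplies the string of H and P symbols; the states that produced them never appear in the input, which is what "hidden" refers to.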



ExPASy (Expert Protein Analysis System)

The ExPASy server (http://au.expasy.org) is dedicated to the analysis of protein sequences and structures.

Sequence analysis tools include:

- DNA -> Protein [Translate]
- Pattern and profile searches
- Post-translational modification and topology prediction
- Primary structure analysis
- Structure prediction (2D and 3D)
- Alignment
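
As an illustration of the DNA -> Protein step, here is a minimal sketch of translation with the standard genetic code. It is not the ExPASy Translate implementation; the function name, the `frame` parameter, and the use of "X" for unrecognized codons are assumptions.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA (or RNA) string in the given forward frame (0, 1 or 2).

    Stop codons are written as '*'; codons with unknown bases become 'X'.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA*
```

A complete six-frame translation would also process the reverse complement; that step is left out here to keep the sketch short.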



- PredictProtein: A service for sequence analysis and structure prediction: http://www.predictprotein.org/newwebsite/submit.html
- TMpred: http://www.ch.embnet.org/software/TMPRED_form.html
- TMHMM: Predicts transmembrane helices in proteins (CBS; Denmark): http://www.cbs.dtu.dk/services/TMHMM-2.0/
- big-PI: Predicts GPI-anchor sites: http://mendel.imp.univie.ac.at/sat/gpi/gpi_server.html
- DGPI: Predicts GPI-anchor sites: http://129.194.185.165/dgpi/index_en.html
- SignalP: Predicts signal peptides: http://www.cbs.dtu.dk/services/SignalP/
- PSORT: Predicts sub-cellular localization: http://www.psort.org/
- TargetP: Predicts sub-cellular localization: http://www.cbs.dtu.dk/services/TargetP/
- NetNGlyc: Predicts N-glycosylation sites: http://www.cbs.dtu.dk/services/NetNGlyc/
- PTS1: Predicts peroxisomal targeting sequences: http://mendel.imp.univie.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp
- MITOPROT: Predicts mitochondrial targeting sequences: http://ihg.gsf.de/ihg/mitoprot.html
- Hydrophobicity: http://www.vivo.colostate.edu/molkit/hydropathy/index.html
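
A hydrophobicity plot of the kind produced by the last tool can be sketched as a sliding-window average over the Kyte-Doolittle hydropathy scale. The function names are assumptions, and the window of 19 residues with a cutoff of 1.6 is the commonly quoted rule of thumb for membrane-spanning segments, not a setting taken from any server listed above.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence.

    Raises KeyError if the sequence contains a non-standard residue code.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """0-based start positions of windows whose average exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value > threshold]


# Hypothetical sequence: a leucine stretch flanked by charged residues.
example = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(example))
```

Dedicated predictors such as TMHMM model much more than average hydropathy, so windows flagged this way are only candidates.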

Multiple alignment

Used for phylogenetic analysis:

- Same protein from different species
- Evolutionary relationship: history

Used to find conserved regions:

- Local multiple alignment reveals conserved regions
- Conserved regions are usually key functional regions
- These regions are prime targets for drug development
- Protein domains are often conserved across many species

Algorithm for finding conserved regions:

- Block Maker: http://blocks.fhcrc.org/blocks/make_blocks.html
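
The idea of reading conserved regions off a multiple alignment can be sketched as follows. This is not the Block Maker algorithm; the scoring (fraction of sequences sharing the most common residue in a column), the function names, and the toy alignment are assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences that share the most common residue in each column.

    Gap characters ('-') never count as the shared residue.
    """
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n_sequences)
    return scores


def conserved_blocks(alignment, min_fraction=1.0, min_length=3):
    """Half-open (start, end) column ranges of consecutive conserved columns."""
    scores = column_conservation(alignment)
    blocks, start = [], None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if score >= min_fraction:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Hypothetical alignment of three equal-length sequences.
ALIGNMENT = ["MKTAYIAK", "MRTAYLAK", "M-TAYIGK"]
print(conserved_blocks(ALIGNMENT))  # [(2, 5)]
```

Columns 2 to 4 (TAY) are identical in all three sequences, so they form the only block of at least three fully conserved columns.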

Multiple alignment tools

Free programs:

- Phylip and PAUP: http://evolution.genetics.washington.edu/phylip.html
- PhyML: http://atgc.lirmm.fr/phyml/

Most used websites:

- http://align.genome.jp/
- http://prodes.toulouse.inra.fr/multalin/multalin.html
- http://www.ch.embnet.org/index.html (T-COFFEE and ClustalW)

ClustalW:

- Standard, popular software
- It aligns two sequences first, then keeps adding one new sequence at a time to the growing alignment
- Problem: it is only a heuristic, so the final alignment is not guaranteed to be optimal
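
The first step of such a progressive method, aligning two sequences, can be done exactly with dynamic programming. The sketch below is a plain Needleman-Wunsch global alignment with a linear gap penalty; the scoring values are assumptions, and ClustalW itself uses substitution matrices, affine gap penalties, and a guide tree that are not shown here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        """Score for placing a[i-1] opposite b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The pairwise step is optimal for its scoring scheme. The heuristic part of a progressive method is that gaps introduced early are kept when later sequences are added.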

Motif discovery: use your own motif to search databases:

- PatternFind: http://myhits.isb-sib.ch/cgi-bin/pattern_search
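
Searching a sequence with a motif of your own can be sketched with a regular expression. The example motif is the N-glycosylation sequon (N, then any residue except P, then S or T); the function name and the test sequence are assumptions. This is plain pattern matching, not the PatternFind service or a trained predictor such as NetNGlyc.

```python
import re


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for every match of the pattern.

    A lookahead is used so that overlapping matches are all reported.
    """
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# N-glycosylation sequon: N, any residue except P, then S or T.
SEQUON = "N[^P][ST]"

print(find_motif("MNGTANPSNNSS", SEQUON))  # [(2, 'NGT'), (9, 'NNS'), (10, 'NSS')]
```

A pattern match only says the motif is present; whether the site is actually modified in the cell is a separate question.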

Phylogenetic analysis

Phylogenetic trees

- Describe evolutionary relationships between sequences

Major modes that drive evolution:

- Point mutations (modify existing sequences)
- Duplications (re-use existing sequences)
- Rearrangements

Two most common methods:

- Maximum parsimony
- Maximum likelihood



Parsimony vs. maximum likelihood

Parsimony is the most popular method; the simplest answer is always the preferred one.

It involves evaluating the number of mutations needed to explain the observed data.

The best tree is the one that requires the fewest evolutionary changes.
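
Counting the changes a given tree requires can be sketched with Fitch's small-parsimony algorithm. The tree encoding (nested pairs of taxon names), the function names, and the four toy sequences are assumptions made for illustration.

```python
def fitch(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one alignment site.

    A tree is either a taxon name (leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no extra change is needed here.
        return shared, left_cost + right_cost
    # The children disagree: one change must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical two-site alignment for four taxa.
ALIGNMENT = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(TREE_1, ALIGNMENT), parsimony_score(TREE_2, ALIGNMENT))  # 2 3
```

Under parsimony the first tree is preferred because it explains the same data with two changes instead of three. A full analysis would also have to search over tree shapes, which this sketch does not do.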

Likelihood generally performs better than parsimony.

In contrast, maximum likelihood does not minimize the number of changes. It attempts to answer the question: which parameters of the evolutionary events were most likely to produce the current data set?

This is computationally difficult, and it is the slowest of the methods listed here.
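
The question "which parameters were most likely to produce the data" can be illustrated on the smallest possible case: estimating the distance between two aligned sequences. The sketch assumes the Jukes-Cantor (JC69) substitution model, which is not named in the slides, and uses a simple grid search; the function names and toy sequences are also assumptions.

```python
import math


def jc69_log_likelihood(distance, n_sites, n_diff):
    """Log-likelihood of a distance (substitutions per site) under JC69."""
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay  # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay  # a site shows one specific different base
    n_same = n_sites - n_diff
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def ml_distance(seq_a, seq_b, step=0.001, max_distance=3.0):
    """Grid search for the distance that maximizes the JC69 likelihood."""
    n_sites = len(seq_a)
    n_diff = count_differences(seq_a, seq_b)
    grid = [step * k for k in range(1, int(max_distance / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


def jc69_closed_form(seq_a, seq_b):
    """Analytical JC69 distance, valid while fewer than 75% of sites differ."""
    p = count_differences(seq_a, seq_b) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of aligned sequences differing at 2 of 10 sites.
SEQ_A = "ACGTACGTAC"
SEQ_B = "ACGTTCGTAA"
print(ml_distance(SEQ_A, SEQ_B), jc69_closed_form(SEQ_A, SEQ_B))
```

Even this one-parameter case evaluates the likelihood thousands of times. A full tree has many branch lengths and many possible shapes, which is why the method is slow.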



Definitions

Homologous: Have a common ancestor. Homology cannot be measured: sequences are either homologous or not.

Orthologous: The same gene in different species. It is the result of speciation (common ancestor).

Paralogous: Related genes (already diverged) in the same species. It is the result of genomic rearrangement or duplication.



Determining protein structure

Direct measurement of structure

X-ray crystallography

NMR spectroscopy

Site-directed mutagenesis

Computer modeling

Prediction of structure

Comparative protein-structure modeling



Comparative protein-structure modeling

Goal: Construct a 3-D model of a protein of unknown structure (target), based on sequence similarity to proteins of known structure (templates).

Figure: Blue, model predicted by PROSPECT; red, NMR structure.

Procedure:

- Template selection
- Template-target alignment
- Model building
- Model evaluation
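
When an experimental structure is available, as in the comparison of a predicted model with an NMR structure, one common way to evaluate the model is the root-mean-square deviation (RMSD) between matched atoms. The sketch below assumes that the two coordinate lists are already superposed and paired atom by atom; the function name and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumes both structures are already superposed in the same frame.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha coordinates (in angstroms) for a model and a reference.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 1.0)]
print(rmsd(model, reference))
```

In practice the two structures are first superposed by an optimal rotation and translation, a step omitted here.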



The Protein 3-D Database

The Protein Data Bank (PDB) contains 3-D structural data for proteins

Founded in 1971 with seven structures

As of June 2004, there were 25,760 structures in the database.

All structures are reviewed for accuracy and data uniformity.

Structural data from the PDB can be freely accessed at http://www.rcsb.org/pdb/

80% come from X-ray crystallography

16% come from NMR

2% come from theoretical modeling

High-throughput methods



Most used websites for 3-D structure prediction

- Protein Homology/analogY Recognition Engine (Phyre): http://www.sbg.bio.ic.ac.uk/phyre/html/index.html
- PredictProtein: http://www.predictprotein.org/newwebsite/submit.html
- UCLA Fold Recognition: http://www.doe-mbi.ucla.edu/Services/FOLD/