bioinformatics2
TRANSCRIPT
-
7/31/2019 Bioinformatics2
1/16
Tools to analyze protein characteristics
Protein sequence
- Family member
- Multiple alignments
Identification of
conserved regions
Evolutionary
relationship (Phylogeny)
3-D fold model
Protein sorting and
sub-cellular localization
Anchoring into
the membrane
Signal sequence
(tags)
Some nascent proteins contain a specific signal, or targeting sequence
that directs them to the correct organelle. (ER, mitochondrial, chloroplast,
lysosome, vacuoles, Golgi, or cytosol)
-
Can we train computers:
To detect signal sequences and predict protein destination?
To identify conserved domains (or a pattern) in proteins?
To predict the membrane-anchoring type of a protein? (Transmembrane domain, GPI anchor)
To predict the 3D structure of a protein?
Learning algorithms are good for solving problems in pattern
recognition because they can be trained on a sample data set.
Classes of learning algorithms:
-Artificial neural networks (ANNs)
-Hidden Markov Models (HMM)
Questions
-
Artificial neural networks (ANN)
Machine learning algorithms that mimic the brain. Real brains,
however, are orders of magnitude more complex than any
ANN so far considered.
ANNs, like people, learn by example. An ANN is not explicitly programmed to perform a specific task; it is trained on examples.
An ANN is composed of a large number of highly interconnected
processing elements (neurons) working simultaneously to solve
specific problems.
The first artificial neuron was developed in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
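As a rough illustration of "learning by example", the sketch below trains a single threshold neuron with the classic perceptron update rule. The logical AND task, the function names, and the integer learning rate are assumptions chosen for this example; they are not taken from the slides, and real ANNs use many neurons and smoother activation functions.

```python
# Hypothetical single-neuron example: names and the AND task are illustrative.

def step(x):
    """Threshold activation: output 1 when the weighted input is >= 0."""
    return 1 if x >= 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights and bias from (inputs, target) examples."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Perceptron rule: nudge each weight in the direction that
            # reduces the error on this example.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Training examples for logical AND: output is 1 only when both inputs are 1.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
```

After training, predict(weights, bias, (1, 1)) returns 1 and the other three input pairs return 0. The neuron was never given the rule for AND; it only saw the four examples.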
-
Hidden Markov Models (HMM)
An HMM is a probabilistic process over a set of states, in which the
states are hidden. Only the outcome is visible to the
observer. Hence the name Hidden Markov Model.
HMMs have many uses in genomics:
Gene prediction (GENSCAN)
SignalP
Finding periodic patterns
Used to answer questions like:
What is the probability of obtaining a particular outcome?
What is the best model from many combinations?
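The first question above (the probability of a particular outcome) is usually answered with the forward algorithm. The following is a minimal sketch under invented assumptions: a two-state model with hidden states "signal" and "mature", emitting hydrophobic (H) or polar (P) residues. All probabilities are made up for illustration and do not come from SignalP or any real tool.

```python
# Hypothetical two-state HMM; every number below is an illustrative assumption.
STATES = ("signal", "mature")
START = {"signal": 0.6, "mature": 0.4}
TRANS = {
    "signal": {"signal": 0.7, "mature": 0.3},
    "mature": {"signal": 0.1, "mature": 0.9},
}
EMIT = {
    "signal": {"H": 0.8, "P": 0.2},
    "mature": {"H": 0.3, "P": 0.7},
}

def forward_probability(observations):
    """Probability of a non-empty observed string, summed over all hidden paths."""
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())
```

For example, forward_probability("HHP") gives the likelihood of observing hydrophobic, hydrophobic, polar under this toy model. Comparing that likelihood across several candidate models is one simple way to approach the second question, which model fits best.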
-
The ExPASy (Expert Protein Analysis System) server (http://au.expasy.org)
is dedicated to the analysis of
protein sequences and structures.
Sequence analysis tools include:
DNA -> Protein [Translate]
Pattern and profile searches
Post-translational modification and
topology prediction
Primary structure analysis
Structure prediction (2D and 3D)
Alignment
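To make the "DNA -> Protein" step concrete, here is a minimal sketch of translation with the standard genetic code. It is not the ExPASy Translate implementation; the function name, the frame parameter, and the use of "X" for unrecognised codons are assumptions made for this example.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate a DNA (or RNA) string in the given forward frame (0, 1 or 2).

    Stop codons appear as '*'; codons with unknown letters appear as 'X'.
    A trailing incomplete codon is ignored.
    """
    seq = dna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)
```

For example, translate("ATGGCCTAA") returns "MA*". A full six-frame translation would also translate the reverse complement in three frames, which is omitted here for brevity.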
-
PredictProtein: A service for sequence analysis and structure prediction: http://www.predictprotein.org/newwebsite/submit.html
TMpred: http://www.ch.embnet.org/software/TMPRED_form.html
TMHMM: Predicts transmembrane helices in proteins (CBS; Denmark): http://www.cbs.dtu.dk/services/TMHMM-2.0/
big-PI: Predicts GPI-anchor site: http://mendel.imp.univie.ac.at/sat/gpi/gpi_server.html
DGPI: Predicts GPI-anchor site: http://129.194.185.165/dgpi/index_en.html
SignalP: Predicts signal peptide: http://www.cbs.dtu.dk/services/SignalP/
PSORT: Predicts sub-cellular localization: http://www.psort.org/
TargetP: Predicts sub-cellular localization: http://www.cbs.dtu.dk/services/TargetP/
NetNGlyc: Predicts N-glycosylation sites: http://www.cbs.dtu.dk/services/NetNGlyc/
PTS1: Predicts peroxisomal targeting sequences: http://mendel.imp.univie.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp
MITOPROT: Predicts mitochondrial targeting sequences: http://ihg.gsf.de/ihg/mitoprot.html
Hydrophobicity: http://www.vivo.colostate.edu/molkit/hydropathy/index.html
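Hydrophobicity plots of the kind produced by the last tool are typically sliding-window averages over a per-residue scale. The sketch below uses the Kyte-Doolittle scale. The function names are hypothetical, and the 19-residue window with a 1.6 cutoff is a commonly cited rule of thumb for spotting candidate membrane-spanning segments, not a setting taken from any of the servers listed above.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def candidate_membrane_windows(sequence, window=19, threshold=1.6):
    """0-based start positions of windows whose average exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > threshold]
```

A sequence shorter than the window yields an empty profile. Dedicated predictors such as TMHMM use trained models rather than a fixed cutoff, so this sketch is only a first-pass screen.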
-
Multiple alignment
Used to do phylogenetic analysis:
Same protein from different species
Evolutionary relationship: history
Used to find conserved regions
Local multiple alignment reveals conserved regions
Conserved regions usually are key functional regions
These regions are prime targets for drug development
Protein domains are often conserved across many species
Algorithm for search of conserved regions:
Block maker: http://blocks.fhcrc.org/blocks/make_blocks.html
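A simple way to see how conserved regions fall out of a multiple alignment is to score each column by how often its most common residue occurs, then report runs of highly conserved columns. The sketch below is an illustration only and is not the Block Maker algorithm; the function names, the gap handling, and the default thresholds are assumptions.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap
    and never counts as a residue.
    """
    scores = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

def conserved_blocks(alignment, threshold=1.0, min_length=2):
    """Half-open (start, end) column ranges where every column meets the threshold."""
    scores = column_conservation(alignment) + [-1.0]  # sentinel closes a final run
    blocks, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks
```

With the toy alignment ["MKTAYIAK", "MKSAYLAK", "MKTAYIGK"], conserved_blocks returns [(0, 2), (3, 5)], that is, the fully conserved runs "MK" and "AY".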
-
Multiple alignment tools
Free programs:
Phylip and PAUP: http://evolution.genetics.washington.edu/phylip.html
Phyml: http://atgc.lirmm.fr/phyml/
The most used websites:
http://align.genome.jp/
http://prodes.toulouse.inra.fr/multalin/multalin.html
http://www.ch.embnet.org/index.html (T-COFFEE and ClustalW)
ClustalW:
Standard popular software
It aligns two sequences and then keeps adding one new sequence at a time to the alignment
Problem: it is only a heuristic, so the result is not guaranteed to be the optimal alignment.
Motif discovery: use your own motif to search databases:
PatternFind: http://myhits.isb-sib.ch/cgi-bin/pattern_search
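Motif searches of this kind commonly take patterns in PROSITE syntax. As a hedged sketch, the code below converts a basic PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names are hypothetical, only the core syntax is handled (x, [..], {..}, repeat counts, and the < and > anchors), and this is not how PatternFind itself is implemented.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a regular expression string."""
    regex = pattern.rstrip(".").replace("-", "")   # drop element separators
    regex = regex.replace("x", ".")                # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

def find_motif(pattern, sequence):
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]
```

For example, the well-known N-glycosylation pattern N-{P}-[ST]-{P} becomes the regular expression N[^P][ST][^P], and find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ") returns [(3, "NGSA")]. The second asparagine is skipped because it is followed by a proline.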
-
Phylogenetic analysis
Phylogenetic trees
Describe evolutionary relationships between sequences
Major modes that drive evolution:
Point mutations modify existing sequences
Duplications (re-use existing sequence)
Rearrangement
Two most common methods
Maximum parsimony
Maximum likelihood
-
Parsimony vs Maximum likelihood
Parsimony is the most popular method, in which the simplest
answer is always the preferred one.
It involves statistical evaluation of the number of mutations needed to explain the observed data.
The best tree is the one that requires the fewest
evolutionary changes.
Likelihood generally performs better than parsimony.
In contrast to parsimony, maximum likelihood does not look for the fewest
changes. It attempts to answer the question:
What parameters of evolutionary events were most likely to produce the current data set?
This is computationally difficult to do. It is generally the slowest of these methods.
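The parsimony criterion described above (fewest evolutionary changes) can be scored for a fixed tree with Fitch's algorithm. The sketch below is a minimal version under stated assumptions: trees are written as nested 2-tuples of leaf names, all sequences are already aligned, and every change costs 1. The function names and the toy data are invented for this example.

```python
def fitch(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Toy data: four hypothetical taxa and two candidate tree shapes.
ALIGNMENT = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))
```

Here parsimony_score(TREE_1, ALIGNMENT) is 2 and parsimony_score(TREE_2, ALIGNMENT) is 4, so parsimony prefers the tree that groups A with B and C with D. Programs such as Phylip search over many tree shapes; this sketch only scores trees that are supplied by hand.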
-
Definitions
Homologous: Have a common ancestor. Homology cannot be measured.
Orthologous: The same gene in different species. It is the result of speciation (common ancestor)
Paralogous: Related genes (already diverged) in the same species. It is the result of genomic rearrangements or duplication
-
Determining protein structure
Direct measurement of structure
X-ray crystallography
NMR spectroscopy
Site-directed mutagenesis
Computer modeling
Prediction of structure
Comparative protein-structure modeling
-
Comparative protein-structure modeling
Goal: Construct a 3-D model of a protein of unknown
structure (target), based on similarity of sequence to proteins of known structure (templates)
Blue: predicted model by PROSPECT
Red: NMR structure
Procedure:
Template selection
Template-target alignment
Model building
Model evaluation
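The first step of the procedure above, template selection, can be illustrated by ranking candidate templates by percent sequence identity to the target. This is a deliberately simplified sketch: it assumes each target-template pair has already been aligned (with '-' for gaps), the template names are hypothetical, and real pipelines typically rely on profile or fold-recognition searches rather than raw identity.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the positions where neither has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def select_template(alignments):
    """Pick the template name with the highest identity to the target.

    `alignments` maps a template name to (aligned target, aligned template).
    """
    return max(alignments, key=lambda name: percent_identity(*alignments[name]))

# Hypothetical candidates; names and sequences are made up for illustration.
CANDIDATES = {
    "template_1": ("ACDE", "ACDF"),
    "template_2": ("ACDE", "AGHF"),
}
```

With these toy inputs, select_template(CANDIDATES) returns "template_1", whose identity to the target is 75% against 25% for the other candidate.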
-
The Protein 3-D Database
The Protein Data Bank (PDB) contains 3-D structural data
for proteins
Founded in 1971 with a dozen structures
As of June 2004, there were 25,760 structures in the database.
All structures are reviewed for accuracy and data uniformity.
Structural data from the PDB can be freely accessed at
http://www.rcsb.org/pdb/
80% come from X-ray crystallography
16% come from NMR
2% come from theoretical modeling
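Structural data downloaded from the PDB in the classic text format stores one atom per fixed-width ATOM record. The sketch below reads the alpha-carbon (CA) coordinates from such text and measures the distance between two of them. The function names are hypothetical and the column positions follow the legacy PDB file format; newer mmCIF files need a different parser, and established libraries handle many cases this sketch ignores.

```python
import math

def parse_ca_atoms(pdb_text):
    """Extract alpha-carbon records from legacy PDB-format text.

    Uses the fixed columns of the ATOM record: atom name in columns 13-16,
    residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append({
                "residue": line[17:20].strip(),
                "chain": line[21],
                "number": int(line[22:26]),
                "xyz": (
                    float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54]),
                ),
            })
    return atoms

def ca_distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in the file's units (angstroms)."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])
```

Given the text of a structure file, parse_ca_atoms returns one dictionary per residue backbone position, which is enough for simple tasks such as computing residue-to-residue distances.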
-
High-throughput methods
-
Most used websites for 3-D structure prediction
Protein Homology/analogY Recognition Engine (Phyre) at
http://www.sbg.bio.ic.ac.uk/phyre/html/index.html
PredictProtein at
http://www.predictprotein.org/newwebsite/submit.html
UCLA Fold Recognition at
http://www.doe-mbi.ucla.edu/Services/FOLD/