bioinformatics a biased overview
DESCRIPTION
Lecture given to graduate microbiology students on examples of work in bioinformatics.
TRANSCRIPT
A {Biased} Overview of Bioinformatics
with Examples Drawn from Our Own Work
Philip E. Bourne Professor of Pharmacology UCSD
Bioinformatics - Overview
There Are Multiple Types of Informatics in the Life Sciences
Pharmacy Informatics: drug dosing, pharmacokinetics, pharmacy information systems
Biomedical Informatics: EHR, decision support systems, hospital information systems
Bioinformatics: algorithms, genomics, proteomics, biological networks, systems biology
Note: These are only representative examples
There Are Multiple Types of Informatics in the Life Sciences
Pharmacy Informatics, Biomedical Informatics, Bioinformatics
Controlled vocabularies, ontologies, literature searching, data management, pharmacogenomics, personalized medicine
Note: These are only representative examples
Bioinformatics In One Slide
[Figure: timeline from 1990 to 2005. A pipeline runs from biological experiment to data, information, knowledge and discovery (collect, characterize, compare, model, infer) across the scales sequence, structure, assembly, sub-cellular, cellular, organ and higher life. Axes: computing power, sequencing data (1 to 10^5), complexity, technology, and number of people per web site (10^6 down to 1). Milestones shown: ESTs, E. coli genome, yeast genome, C. elegans genome, Human Genome Project, human genome, 1 small genome per month, gene chips, virus structure, ribosome, model metabolic pathway of E. coli, genetic circuits, neuronal modeling, cardiac modeling, brain mapping, 1000's of GWAS, virtual communities, blogs, Facebook]
The Omics Revolution
Bioinformatics – One Definition
• The integration of biological data in digital form from different sources and possibly different scales (complexity), usually collected by others, and subsequently analyzed to offer new biological insights
Biological Scales (Complexity)
Genomics
Proteomics
Protein-protein interactions
Biological Networks
Systems Biology
We will look at an example of how bioinformatics is used at each scale
Some Thoughts on Genomic Data
• It's scary
• It's time to consider cost vs benefit
• Reductionism is not a dirty word
• We need to do more with the long tail
On the Future of Genomic Data. Science 11 February 2011: vol. 331 no. 6018 728-729
Bioinformatics & Metagenomics
• New type of genomics
• New data (and lots of it) and new types of data
– 17M new (predicted) proteins! 4-5x growth in just a few months and much more coming
– New challenges and exacerbation of old challenges
Bioinformatics at Different Scales - Genomics
Metagenomics: Early Results
• More than 99.5% of DNA in every environment studied represents unknown organisms
• Most genes represent distant homologs of known genes, but there are thousands of new families
• Environments being studied:
– Water (ocean, lakes)
– Air
– Soil
– Human body (gut, oral cavity, human microbiome)
Metagenomics New Discoveries
Environmental (red) vs. Currently Known PTPases (blue)
[Figure: environmental versus currently known PTPases; labels include higher eukaryotes and regions 1 to 4]
Proteomics
[Figure: number of released entries by year]
It's Not Just About Numbers, It's About Complexity
Courtesy of the RCSB Protein Data Bank
Bioinformatics at Different Scales - Proteomics
Determining 3D Structures – The Impact of Bioinformatics
Basic Steps
• Target Selection
• Crystallomics: isolation, expression, purification, crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish
Structural biology moves from being functionally driven to genomically driven
Annotations on the steps: fill in protein fold space; robotics; -ve data; software engineering; functional prediction; not necessarily
Nature’s Reductionism
There are ~20^300 possible proteins >>>> all the atoms in the Universe
~20M protein sequences from UniProt/TrEMBL
~75,000 protein structures yield ~1500 folds, ~2000 superfamilies, ~4000 families (SCOP 1.75)
Using Protein Structure to Study Evolution
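The size comparison on the "Nature's Reductionism" slide can be checked with a few lines of arithmetic. This is a minimal sketch that reads the slide's figure as 20^300 (20 amino acids at each of 300 positions) and assumes the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable universe; that estimate and all names below are illustrative assumptions, not values from the lecture.

```python
import math

AMINO_ACIDS = 20        # standard amino acid alphabet
CHAIN_LENGTH = 300      # assumed chain length behind the slide's 20^300
ATOMS_EXPONENT = 80     # assumption: ~10^80 atoms in the observable universe

# Python integers have arbitrary precision, so the count is exact.
sequences = AMINO_ACIDS ** CHAIN_LENGTH

# log10 of the count, to express it as a power of ten.
exponent = CHAIN_LENGTH * math.log10(AMINO_ACIDS)

print(f"20^300 has {len(str(sequences))} digits (about 10^{exponent:.1f})")
print(f"That exceeds 10^{ATOMS_EXPONENT} by a factor of about 10^{exponent - ATOMS_EXPONENT:.0f}")
```

Against that sequence space, the ~1500 folds reported for SCOP 1.75 show how strongly nature has reduced the possibilities.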
Structure Provides an Evolutionary Fingerprint
Distribution among the three kingdoms as taken from SUPERFAMILY
• Superfamily distributions would seem to be related to the complexity of life
[Figure: Venn diagram of SCOP folds among Eukaryota (650), Archaea (416) and Bacteria (564); region counts 2, 42, 10, 135, 118, 387, 17; SCOP fold (765 total)]
[Figure: region counts given as any genome / all genomes: 153/14, 9/1, 21/2, 310/0, 645/49, 29/0, 68/0]
Method – Distance Determination
Presence/Absence Data Matrix (SCOP superfamilies (FSF) from SUPERFAMILY, by organism)
FSF      C. intestinalis  C. briggsae  F. rubripes
a.1.1    1                1            1
a.1.2    1                1            1
a.10.1   0                0            1
a.100.1  1                1            1
a.101.1  0                0            0
a.102.1  0                1            1
a.102.2  1                1            1
Distance Matrix
                 C. intestinalis  C. briggsae  F. rubripes
C. intestinalis  0                101          109
C. briggsae                       0            144
F. rubripes                                    0
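One way to get from a presence/absence matrix to pairwise distances is sketched below. The slide does not state which metric was used, so this assumes a plain count of superfamilies present in one proteome but absent from the other (a Hamming distance). Only the seven rows shown on the slide are used, so the results are far smaller than the 101, 109 and 144 in the full distance matrix.

```python
from itertools import combinations

# Rows copied from the slide: superfamily -> presence (1) or absence (0)
# in C. intestinalis, C. briggsae and F. rubripes, in that order.
ORGANISMS = ["C. intestinalis", "C. briggsae", "F. rubripes"]
PRESENCE = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}

def hamming_distances(presence, organisms):
    """For each pair of organisms, count superfamilies found in exactly one of the two."""
    distances = {}
    for i, j in combinations(range(len(organisms)), 2):
        differing = sum(1 for row in presence.values() if row[i] != row[j])
        distances[(organisms[i], organisms[j])] = differing
    return distances

# Assumed metric: Hamming distance, not necessarily the one used in the lecture.
DISTANCES = hamming_distances(PRESENCE, ORGANISMS)
for pair, distance in DISTANCES.items():
    print(pair, distance)
```

A symmetric matrix of such counts over all superfamilies and all genomes is the kind of input a standard tree-building method would take.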
If Structure is so Conserved, is it a Useful Tool in the Study of Evolution?
The Answer Would Appear to be Yes
• It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome
Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
The Influence of Environment on Life
Chris Dupont, Scripps Institution of Oceanography, UCSD
Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827
Consider the Distribution of Disulfide Bonds among Folds
• Disulfides are only stable under oxidizing conditions
• Oxygen content gradually accumulated during the earth’s evolution
• The divergence of the three kingdoms occurred 1.8-2.2 billion years ago
• Oxygen began to accumulate ~2.0 billion years ago
• Logical deduction – disulfides more prevalent in folds (organisms) that evolved later
• This would seem to hold true
• Can we take this further?
[Figure: Venn diagram of folds containing disulfides among Eukaryota, Archaea and Bacteria; region values 0% (0/2), 16.7% (7/42), 0% (0/10), 31.9% (43/135), 14.4% (17/118), 4.7% (18/387), 5.9% (1/17); SCOP fold (708 total)]
Evolution of the Earth
• 4.5 billion years of change
• 300 ± 50 K
• 1-5 atmospheres
• Constant photoenergy
• Chemical and geological changes
• Life has evolved in this time
• The ocean was the “cradle” for 90% of evolution
• Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean-solid lines, euxinic ocean-dashed lines).
• The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom.
[Figure: concentration (O2 in arbitrary units, Zn and Fe in moles L-1) against billions of years before present (0 to 4.5), with curves for oxygen, zinc, iron, cobalt and manganese and markers for Bacteria, Archaea and Eukarya]
Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History
Replotted from Saito et al. 2003 Inorganica Chimica Acta 356: 308-318
The Gaia Hypothesis
Gaia - a complex entity involving the Earth's biosphere, atmosphere, oceans, and soil; the totality constituting a feedback system which seeks an optimal physical and chemical environment for life on this planet.
James Lovelock
Gaia (pronounced /'geɪ.ə/ or /'gaɪ.ə/) "land" or "earth", from the Greek Γαῖα; is a Greek goddess personifying the Earth
The Question
• Have the emergent properties of an organism as judged by its protein content been influenced by the environment?
• Will do this by consideration of the metallomes of a broad range of species
• The metallomes can only be deduced by consideration of the protein structures to which the metal is covalently bound
• Will hypothesize that these emergent properties in turn influenced the environment
Bacteria Fe superfamilies
a.1.1, a.1.2, a.104.1, a.110.1, a.119.1, a.138.1, a.2.11, a.24.3, a.24.4, a.25.1, a.3.1, a.39.3, a.56.1, a.93.1, b.1.13, b.2.6, b.3.6, b.33.1, b.70.2, b.82.2, c.56.6, c.83.1, c.96.1, d.134.1, d.15.4, d.174.1, d.178.1, d.35.1, d.44.1, d.58.1, e.18.1, e.19.1, e.26.1, e.5.1, f.21.1, f.21.2, f.24.1, f.26.1, g.35.1, g.36.1, g.41.5
Eukaryotic Fe superfamilies
a.1.1, a.1.2, a.104.1, a.110.1, a.119.1, a.138.1, a.2.11, a.24.3, a.24.4, a.25.1, a.3.1, a.39.3, a.56.1, a.93.1, b.1.13, b.2.6, b.3.6, b.33.1, b.70.2, b.82.2, c.56.6, c.83.1, c.96.1, d.134.1, d.15.4, d.174.1, d.178.1, d.35.1, d.44.1, d.58.1, e.18.1, e.19.1, e.26.1, e.5.1, f.21.1, f.21.2, f.24.1, f.26.1, g.35.1, g.36.1, g.41.5
Superfamily Distribution As Well As Overall Content Has Changed
Metal Binding Proteins are Not Consistent Across Superkingdoms
[Figure, panels A and B: slope of fitted power law for Zn, Fe, Mn and Co binding domains in Archaea, Bacteria and Eukarya; total Zn-binding domains in a proteome against total domains in a proteome]
Since these data are derived from current species, they are independent of evolutionary events such as duplication, gene loss, horizontal transfer and endosymbiosis
Power Laws: Fundamental Constants in the Evolution of Proteomes
A slope of 1 indicates that a group of structural domains is in equilibrium with genome growth, while a slope > 1 indicates that the group of domains is being preferentially duplicated (or retained in the case of genome reductions).
van Nimwegen E (2006) in: Koonin EV, Wolf YI, Karev GP, (Ed.). Power laws, scale-free networks, and genome biology
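The slope in question is the exponent of a power law, which becomes a straight-line slope once both axes are log-transformed. The sketch below fits that slope by ordinary least squares. The proteome sizes are synthetic numbers made up for illustration, not data from the lecture, and the function name is hypothetical.

```python
import math

def power_law_slope(total_domains, group_domains):
    """Least-squares slope and intercept of log10(group) against log10(total)."""
    xs = [math.log10(t) for t in total_domains]
    ys = [math.log10(g) for g in group_domains]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic proteomes (illustrative only): total domain counts per proteome.
totals = [1000, 2000, 4000, 8000, 16000]
in_step = [0.01 * t for t in totals]               # grows in proportion to the proteome
duplicated = [0.0001 * t ** 1.5 for t in totals]   # grows faster than the proteome

SLOPE_EQUILIBRIUM, _ = power_law_slope(totals, in_step)
SLOPE_DUPLICATED, _ = power_law_slope(totals, duplicated)
print(round(SLOPE_EQUILIBRIUM, 3), round(SLOPE_DUPLICATED, 3))
```

The first group recovers a slope close to 1 (equilibrium with genome growth) and the second a slope close to 1.5 (preferential duplication), matching the two cases described above.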
Why are the Power Laws Different for Each Superkingdom?
• Power laws are likely influenced by selective pressure. Qualitatively, the differences in the power law slopes describing Eukarya and Prokarya are correlated to the shifts in trace metal geochemistry that occur with the rise in oceanic oxygen
• We hypothesize that proteomes contain an imprint of the environment at the time of the last common ancestor in each Superkingdom
• This suggests that Eukarya evolved in an oxic environment, whereas the Prokarya evolved in anoxic environments
Do the Metallomes Contain Further Support for this Hypothesis?
Superkingdom  Fold Family                  %            Fe-binding  O2
Eukarya       Cytochrome P450              0.44 ± 0.48  heme        yes
Eukarya       Cytochrome c3-like           0.13 ± 0.3   heme        no
Eukarya       Cytochrome b5                0.12 ± 0.09  heme        no
Eukarya       Purple acid phosphatase      0.11 ± 0.08  amino       no
Eukarya       Penicillin synthase-like     0.07 ± 0.1   amino       yes
Eukarya       Hypoxia-inducible factor     0.07 ± 0.04  amino       yes
Eukarya       Di-heme elbow motif          0.06 ± 0.01  heme        no
Archaea       4Fe-4S ferredoxins           1.80 ± 0.7   Fe-S        no
Archaea       MoCo biosynthesis proteins   1.60 ± 0.3   Fe-S        no
Archaea       Heme-binding PAS domain      1.10 ± 1.0   heme        no
Archaea       HemN                         0.80 ± 0.20  Fe-S        1
Archaea       α-helical ferredoxin         0.60 ± 0.16  Fe-S        no
Archaea       Biotin synthase              0.55 ± 0.1   Fe-S        no
Archaea       ROO N-terminal domain-like   0.5 ± 0.1    amino       2
Bacteria      High potential iron protein  0.38 ± 0.25  Fe-S        no
Bacteria      Heme-binding PAS domain      0.3 ± 0.4    heme        1
Bacteria      MoCo biosynthesis proteins   0.21 ± 0.15  Fe-S        no
Bacteria      HemN                         0.2 ± 0.15   Fe-S        no
Bacteria      4Fe-4S ferredoxins           0.2 ± 0.2    Fe-S        no
Bacteria      Cytochrome c                 0.14 ± 0.2   heme        no
Bacteria      α-helical ferredoxin         0.12 ± 0.09  Fe-S        no
Overall percent of Fe bound by
Superkingdom  Fe-S     heme     amino
Eukarya       21 ± 9   47 ± 19  32 ± 12
Archaea       68 ± 12  13 ± 14  19 ± 6
Bacteria      47 ± 11  22 ± 12  31 ± 16
1. Some, but not all, PAS domains actually sense oxygen
2. The Rubredoxin oxygen:oxidoreductase (ROO) protein does not contact oxygen, but catalyzes an oxygen reduction pathway
e- Transfer Proteins
Same Broad Function, Same Metal, Different Chemistry Induced by the Environment?
Fe-S clusters
• Fe bound by S
• Cluster held in place by Cys
• Generally negative reduction potentials
• Very susceptible to oxidation
Cytochromes
• Fe bound by heme (and amino acids)
• Generally positive reduction potentials
• Less susceptible to oxidation
Hypothesis
• Emergence of cyanobacteria changed oxygen concentrations
• Impacted relative metal ion concentrations in the ocean
• Organisms evolved to use these metals in new ways to evolve new biological processes, e.g., complex signaling
• This in turn further impacted the environment
• Only protein structures could reveal such dependencies
Bioinformatics in the Context of Drug Discovery
Our Motivation
• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive
Collins and Workman 2006 Nature Chemical Biology 2 689-700
Motivators
A Reverse Engineering Approach to Drug Discovery Across Gene Families
• Characterize ligand binding site of primary target (Geometric Potential)
• Identify off-targets by ligand binding site similarity (Sequence order independent profile-profile alignment)
• Extract known drugs or inhibitors of the primary and/or off-targets
• Search for similar small molecules
• Dock molecules to both primary and off-targets
• Statistical analysis of docking score correlations
• …
Computational Methodology
Xie and Bourne 2009 Bioinformatics 25(12) 305-312
The Problem with Tuberculosis
• One third of global population infected
• 1.7 million deaths per year
• 95% of deaths in developing countries
• Anti-TB drugs hardly changed in 40 years
• MDR-TB and XDR-TB pose a threat to human health worldwide
• Development of novel, effective and inexpensive drugs is an urgent priority
Repositioning - The TB Story
The TB-Drugome
1. Determine the TB structural proteome
2. Determine all known drug binding sites from the PDB
3. Determine which of the sites found in 2 exist in 1
4. Call the result the TB-drugome
A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
1. Determine the TB Structural Proteome
[Figure: TB proteome of 3,996 proteins, of which 284 have solved structures and 1,446 have homology models, leaving 2,266]
• High quality homology models from ModBase (http://modbase.compbio.ucsf.edu) increase structural coverage from 7.1% to 43.3%
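The coverage percentages follow directly from the counts in the figure. A minimal check, assuming the solved structures and homology models are counted as non-overlapping sets of proteins (an assumption; the slide does not say so explicitly):

```python
# Counts taken from the slide's figure.
PROTEOME_SIZE = 3996
SOLVED_STRUCTURES = 284
HOMOLOGY_MODELS = 1446   # assumed not to overlap with the solved structures

def coverage_percent(covered, total):
    """Share of the proteome with structural information, in percent."""
    return 100.0 * covered / total

solved_only = coverage_percent(SOLVED_STRUCTURES, PROTEOME_SIZE)
with_models = coverage_percent(SOLVED_STRUCTURES + HOMOLOGY_MODELS, PROTEOME_SIZE)
uncovered = PROTEOME_SIZE - SOLVED_STRUCTURES - HOMOLOGY_MODELS

print(f"{solved_only:.1f}% -> {with_models:.1f}%, {uncovered} proteins without a structure or model")
```

Under that assumption the numbers reproduce the quoted 7.1% and 43.3%, and the remainder matches the fourth count in the figure.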
2. Determine all Known Drug Binding Sites in the PDB
• Searched the PDB for protein crystal structures bound with FDA-approved drugs
• 268 drugs bound in a total of 931 binding sites
[Figure: number of drug binding sites per drug; labeled drugs include methotrexate, chenodiol, alitretinoin, conjugated estrogens, darunavir and acarbose]
Map 2 onto 1 – The TB-Drugome
http://funsite.sdsc.edu/drugome/TB/
Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).
From a Drug Repositioning Perspective
• Similarities between drug binding sites and TB proteins are found for 61/268 drugs
• 41 of these drugs could potentially inhibit more than one TB protein
[Figure: number of potential TB targets per drug; labeled drugs include raloxifene, alitretinoin, conjugated estrogens, methotrexate, ritonavir, testosterone, levothyroxine and chenodiol]
Top 5 Most Highly Connected Drugs
levothyroxine
Intended targets: transthyretin, thyroid hormone receptor α & β-1, thyroxine-binding globulin, mu-crystallin homolog, serum albumin
Indications: hypothyroidism, goiter, chronic lymphocytic thyroiditis, myxedema coma, stupor
No. of connections: 14
TB proteins: adenylyl cyclase, argR, bioD, CRP/FNR trans. reg., ethR, glbN, glbO, kasB, lrpA, nusA, prrA, secA1, thyX, trans. reg. protein
alitretinoin
Intended targets: retinoic acid receptor RXR-α, β & γ, retinoic acid receptor α, β & γ-1&2, cellular retinoic acid-binding protein 1&2
Indications: cutaneous lesions in patients with Kaposi's sarcoma
No. of connections: 13
TB proteins: adenylyl cyclase, aroG, bioD, bpoC, CRP/FNR trans. reg., cyp125, embR, glbN, inhA, lppX, nusA, pknE, purN
conjugated estrogens
Intended targets: estrogen receptor
Indications: menopausal vasomotor symptoms, osteoporosis, hypoestrogenism, primary ovarian failure
No. of connections: 10
TB proteins: acetylglutamate kinase, adenylyl cyclase, bphD, CRP/FNR trans. reg., cyp121, cysM, inhA, mscL, pknB, sigC
methotrexate
Intended targets: dihydrofolate reductase, serum albumin
Indications: gestational choriocarcinoma, chorioadenoma destruens, hydatidiform mole, severe psoriasis, rheumatoid arthritis
No. of connections: 10
TB proteins: acetylglutamate kinase, aroF, cmaA2, CRP/FNR trans. reg., cyp121, cyp51, lpd, mmaA4, panC, usp
raloxifene
Intended targets: estrogen receptor, estrogen receptor β
Indications: osteoporosis in post-menopausal women
No. of connections: 9
TB proteins: adenylyl cyclase, CRP/FNR trans. reg., deoD, inhA, pknB, pknE, Rv1347c, secA1, sigC
Chang et al. 2010 Plos Comp. Biol. 6(9): e1000938
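The "connections" in this table are edges in a drug-to-protein network, so the same data can be read from the other side: which TB proteins are reached by several drugs. The sketch below encodes only the five drugs listed on the slide; the variable names are illustrative, and the counts say nothing about the rest of the TB-drugome.

```python
from collections import Counter

# Drug -> predicted TB protein connections, copied from the slide's table.
DRUG_TO_TB_PROTEINS = {
    "levothyroxine": [
        "adenylyl cyclase", "argR", "bioD", "CRP/FNR trans. reg.", "ethR",
        "glbN", "glbO", "kasB", "lrpA", "nusA", "prrA", "secA1", "thyX",
        "trans. reg. protein",
    ],
    "alitretinoin": [
        "adenylyl cyclase", "aroG", "bioD", "bpoC", "CRP/FNR trans. reg.",
        "cyp125", "embR", "glbN", "inhA", "lppX", "nusA", "pknE", "purN",
    ],
    "conjugated estrogens": [
        "acetylglutamate kinase", "adenylyl cyclase", "bphD",
        "CRP/FNR trans. reg.", "cyp121", "cysM", "inhA", "mscL", "pknB", "sigC",
    ],
    "methotrexate": [
        "acetylglutamate kinase", "aroF", "cmaA2", "CRP/FNR trans. reg.",
        "cyp121", "cyp51", "lpd", "mmaA4", "panC", "usp",
    ],
    "raloxifene": [
        "adenylyl cyclase", "CRP/FNR trans. reg.", "deoD", "inhA", "pknB",
        "pknE", "Rv1347c", "secA1", "sigC",
    ],
}

# Degree of each drug node: number of distinct TB proteins it connects to.
CONNECTIONS = {drug: len(set(proteins)) for drug, proteins in DRUG_TO_TB_PROTEINS.items()}

# Degree of each protein node: number of these five drugs that connect to it.
PROTEIN_HITS = Counter(
    protein for proteins in DRUG_TO_TB_PROTEINS.values() for protein in set(proteins)
)

for drug, degree in sorted(CONNECTIONS.items(), key=lambda item: -item[1]):
    print(f"{drug}: {degree}")
print(PROTEIN_HITS.most_common(3))
```

The drug degrees reproduce the "No. of connections" column, and the protein side shows that a few TB proteins recur across all or most of these five drugs.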
Systems Biology & Drug Discovery
Bioinformatics & Patient Care
7. Social Change
Josh Sommer and Chordoma Disease
http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation#fullprogram
5. Personalized Medicine
http://pharmacogenomics.ucsd.edu/
Additional Reading
• http://en.wikipedia.org/wiki/Bioinformatics
9 Translational Medicine