Introduction to Molecular Biology for Computer Scientists
Dr. Suzanne Gollery – Sierra Nevada College
Martin Gollery – Active Motif
Who are we?
Suzanne – Assistant Professor, Sierra Nevada College
Formerly at Baylor College of Medicine, UC Berkeley
Marty – Senior Scientist, ActiveMotif, Inc.
Formerly at University of Nevada, Reno; TimeLogic; etc.
Why Do Bioinformaticists need it?
Avoiding mistakes
Understanding the purpose
Appreciating the difficulties
We will look at some of the programs that apply to these concepts
Introduction to Molecular Biology for Computer Scientists
I. Protein structure and function
A. Protein structure
B. Protein function
II. Nucleic acid structure and function
A. The central dogma
B. Nucleic acid structure
C. Genetic code
D. Control of gene expression
E. Mutation
III. The centrality of evolution by natural selection in biology
A. Universal genetic code
B. Types of mutations and their effects
C. Genomes
D. Natural selection acts on random variation to produce evolution
Introduction to Molecular Biology for Computer Scientists
I. Protein structure and function
A. Protein structure
B. Protein function
Protein structure – Amino acids
An amino acid has four functional groups attached to a central carbon atom
An amino group
A carboxyl group
A hydrogen atom
A variable side chain (R group)
Proteins use L isomers of amino acids
Protein structure – 20 amino acids are used in proteins
Polar amino acid side chains have partial negative or positive electrostatic charges
O and N atoms “hog” electrons and have partial negative charges
Atoms attached to N and O have partial positive charges
Protein structure – 20 amino acids are used in proteins
Non-polar amino acid side chains have no electrostatic charge
Protein structure – 20 amino acids are used in proteins
Charged amino acids are acidic or basic, and donate or accept a H+ in cells
Some charged amino acids have a positive electrostatic charge
Other charged amino acids have a negative electrostatic charge
Protein structure – 20 amino acids are used in proteins
Aromatic amino acids have a bulky carbon ring structure in their side chains
Protein structure – Analysis
X-ray crystallography: interference patterns
Nuclear Magnetic Resonance (NMR)
~31,000 structures in the Protein Data Bank (PDB)
PDB is organized by other databases
Tends to emphasize certain types of proteins
Protein structure - Polypeptides
Amino acids are joined to produce polypeptides
A water molecule is removed as the amino group of an amino acid reacts with the carboxyl group at the end of a polypeptide
The peptide bond (yellow) is a planar structure
Protein structure – Protein folding
Chemical interactions among amino acids determine the final 3D shape (conformation) of a protein
Protein structure
Primary structure: the linear order of amino acids in a polypeptide
Secondary structure: regions of α-helix or β-sheet
Tertiary structure: globular folded polypeptide
Quaternary structure: multiple folded polypeptides form a complete protein – hemoglobin in this figure
Protein Structure- Primary
Sequence yields structure/function
Homology searching paradigm: similar primary structures yield similar functions
Needleman-Wunsch, Smith-Waterman, FASTA, BLAST
Similarity scoring: amino acids with similar properties can replace each other without breaking structure
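The similarity-scoring idea can be sketched with a small Needleman-Wunsch global alignment. This is only an illustration: the property groups in GROUPS, the score values (+2, +1, -1) and the gap penalty are assumptions chosen for the example, not values from any real program. Real tools use empirically derived substitution matrices.

```python
# Toy property groups (an assumption for illustration only).
GROUPS = [set("AVLIMC"), set("FWY"), set("STNQ"), set("KRH"), set("DE"), set("GP")]


def score(a, b):
    """Hypothetical scoring: +2 identical, +1 same property group, -1 otherwise."""
    if a == b:
        return 2
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -1


def needleman_wunsch(s, t, gap=-2):
    """Return the best global alignment score (linear gap penalty)."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # match or substitute
                dp[i - 1][j] + gap,                            # gap in t
                dp[i][j - 1] + gap,                            # gap in s
            )
    return dp[-1][-1]


print(needleman_wunsch("KDGII", "KDGVI"))  # conservative I -> V substitution
print(needleman_wunsch("KDGII", "KDGDI"))  # non-conservative I -> D substitution
```

With these toy scores the conservative I to V change is penalized less than the I to D change, which is the point of similarity scoring.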
Secondary structure
Prediction of secondary structure from primary sequence is tractable
Programs: Coils, PHD, Predator, JPRED
Secondary structure may be used to improve alignments
Protein Folding- Prediction
Very computationally intensive
A short protein (100 amino acids) would take ~20 days straight on a petaflop computer
Force field approximation programs: CHARMM, AMBER
Accelerated versions based on Field Programmable Gate Array (FPGA) technology
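To put the ~20 days figure in perspective, the arithmetic below converts it to a total operation count. It assumes a petaflop machine sustains exactly 10**15 floating-point operations per second for the whole run; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETAFLOP = 10**15  # floating-point operations per second (assumed sustained rate)


def total_operations(days, ops_per_second=PETAFLOP):
    """Total floating-point operations performed in the given number of days."""
    return days * SECONDS_PER_DAY * ops_per_second


print(f"{total_operations(20):.3e}")  # about 1.7e21 operations
```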
Protein structural Motifs
Structural Motifs
Repeated or combined motifs form functional domains
Domains predict protein function
NAD(P)-binding domain of proteins that bind to NAD, an electron carrier
β-sandwich domain of cell surface recognition proteins (Ig, MHC, CD4)
Functional motifs
eMATRIX/eMOTIF
MEME/MetaMEME
FingerPrintScan
PHI-BLAST
Hidden Markov Models
Represent domains, motifs or proteins
Major programs include:
o HMMer (hmmer.wustl.edu)
o SAM (www.cse.ucsc.edu/research/compbio/sam)
o Wise tools (www.ebi.ac.uk/Wise2/)
o Meta-MEME (metameme.sdsc.edu/)
o PSI-BLAST (www.ncbi.nlm.nih.gov/blast)
o DeCypherHMM (www.timelogic.com)
What are HMMs, anyway?
Statistical description of a protein family's consensus sequence
Conserved regions receive highest scores
Can be seen as a Finite State Machine
Hidden Markov Models
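Aligned sequences:
yciH     KDGII
ZyciH    KDGVI
VCA0570  KDGDI
HI1225   KNGII
sll0546  KEDCV
Residue frequency at each position (- means not observed):
Position  C    D    E    G    I    K    N    V
1         -    -    -    -    -    1.0  -    -
2         -    0.6  0.2  -    -    -    0.2  -
3         -    0.2  -    0.8  -    -    -    -
4         0.2  0.2  -    -    0.4  -    -    0.2
5         -    -    -    -    0.8  -    -    0.2
Contrast with a regular-expression (RE) type motif, K[DEN][DG][CDIV][IV]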
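The frequency table and the RE motif can be reproduced from the five aligned sequences. The sketch below is a plain position-specific frequency matrix, not a full profile HMM: it has no insert or delete states, no pseudocounts and no log-odds scoring. The function names are hypothetical.

```python
import re
from collections import Counter

ALIGNMENT = ["KDGII", "KDGVI", "KDGDI", "KNGII", "KEDCV"]


def column_frequencies(seqs):
    """Return one dict per alignment column, mapping residue -> frequency."""
    n = len(seqs)
    return [{aa: count / n for aa, count in Counter(col).items()}
            for col in zip(*seqs)]


def profile_score(profile, candidate):
    """Multiply per-position frequencies; 0.0 if a residue was never observed."""
    p = 1.0
    for column, aa in zip(profile, candidate):
        p *= column.get(aa, 0.0)
    return p


PROFILE = column_frequencies(ALIGNMENT)
MOTIF = re.compile(r"K[DEN][DG][CDIV][IV]")

for seq in ("KDGII", "KEDCV"):
    print(seq, bool(MOTIF.fullmatch(seq)), profile_score(PROFILE, seq))
```

Both KDGII and KEDCV match the regular expression equally, but the frequency profile ranks KDGII (0.6 x 0.8 x 0.4 x 0.8 = 0.1536) far above KEDCV (0.2 x 0.2 x 0.2 x 0.2 = 0.0016). The RE is all-or-nothing, while the profile keeps the information about which residues are common.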
HMM databases
Pfam, TIGRfam, Superfamily, SMART, COG, KinFam, PirSF, Panther, KOG, … etc.
Protein structure – Modifications after protein synthesis
Prosthetic groups associate (Heme of cytochrome c)
Polypeptides are trimmed or cut
Sugars are attached
Other chemical groups are attached
Chaperone proteins assist in polypeptide folding
Protein function – Binding to ligand
Ligand (antigen peptide) fits into a cleft on a protein (MHC) like two puzzle pieces fit together
Protein function – Binding to ligand
SitesBase – information on known ligand binding sites from the PDB
LigBase adds related sequences and structures
Protein function – Chymotrypsin binding to ligand
Complementarity of electrostatic charge and 3D shape
Protein function – proteins bind to other proteins
Myosin head binds to actin during muscle contraction
Protein shapes are complementary like puzzle pieces
Binding is reversible
Goodness of fit (shape, charge, hydrophobicity) determines affinity of protein/ligand binding
Protein function – changes in shape are essential to protein function
Proteins are dynamic machines
Two or a few protein conformations may be of similar stability
Ligand binding can act as a switch to change protein conformation
Induced fit: binding of glucose (red) to hexokinase changes enzyme conformation
Protein function – changes in shape are essential to protein function
Hemoglobin switches between a T (taut) deoxygenated and an R (relaxed) oxygenated state
Protein function – changes in shape are essential to protein function
Binding of lactose to the lactose transport protein changes the shape of the protein
Lactose (red) binds to the protein on one side of a cell membrane and is released on the other side
Movement across the membrane is reversible
Protein function – changes in shape are essential to protein function
The Na+ -K+ pump switches between two conformations when a phosphate group is added or removed
Attaching phosphate to the Na+ -K+ pump switches the protein’s shape, moving Na+ outside the cell
Removing phosphate switches the protein’s shape again, moving K+ into the cell
Protein function – changes in shape are essential to protein function
Myosin heads cycle between multiple conformations during muscle contraction
Binding and release of ATP, ADP, and phosphate (Pi) trigger changes in myosin head conformation
Myosin heads pull on actin to make muscle fibers shorter
Protein function – changes in shape are essential to protein function
Intrinsically Disordered Proteins have roles in signalling, etc.
Some take shape only when interacting
Tend to form hubs in interaction networks
Predict with PONDR, Spritz, Wiggle, FoldUnfold
DisProt: database of intrinsically disordered (ID) proteins
Protein function - Phosphorylation
Phosphorylation changes a protein’s shape
Phosphorylation may turn a protein on or off
Kinases: enzymes that attach phosphates to proteins
Phosphatases: enzymes that remove phosphates
Other charged chemical groups (cAMP, Ca++…) are also attached to proteins to switch them on or off
Protein structure – Modifications after protein synthesis
Post-Translational Modifications (PTM)
Phosphorylation takes place on S, T or Y, but only in certain situations
NetPhosK uses Artificial Neural Networks
KinasePhos uses HMMs to predict phosphorylation sites
Protein Interaction Networks
Introduction to Molecular Biology for Computer Scientists
II. Nucleic acid structure and function
A. The central dogma
B. Nucleic acid structure
C. Genetic code
D. Control of gene expression
E. Mutation
The central dogma of molecular biology
DNA contains instructions for making RNA and protein
DNA is transcribed (copied) to make messenger RNA (mRNA)
mRNA is translated (instructions read) to make proteins
Central dogma - Transcription
Transcription occurs in the nucleus
mRNAs are modified before transport to the cytoplasm
Other RNAs are also transcribed:
o tRNAs: “read” genetic code in mRNA
o rRNAs: backbone of ribosomes
o small RNAs: enzymes and regulators of gene function
Central dogma - Translation
Translation occurs in the cytoplasm on ribosomes
tRNAs match the correct amino acids to codons on the mRNA
Ribosomal enzymes join amino acids to the growing polypeptide
The emerging polypeptide folds
Nucleic acid structure - Nucleotides
Nucleotides are built from a sugar, base, and phosphate
Carbon atoms in the sugar are numbered
DNA has 2’ deoxyribose; RNA has ribose
DNA has G, A, T, and C
RNA has G, A, U, and C
Nucleic acid structure - Polynucleotides
The 5’ phosphate of one nucleotide is joined to the 3’ hydroxyl group of the preceding nucleotide
The beginning of a nucleic acid has a free 5’ phosphate group
The end of a nucleic acid has a free 3’ hydroxyl group
Nucleic acid structure - DNA
Double stranded DNA forms a helix
DNA strands are joined by hydrogen bonds between complementary bases
A always pairs with T; two hydrogen bonds
G always pairs with C; three hydrogen bonds
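The pairing rules above are enough to compute one strand from the other. A minimal sketch, with hypothetical function names; it relies on the fact that the two strands of a double helix run in opposite (antiparallel) directions, so the partner strand is reversed to read it 5’ to 3’.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the partner strand of a DNA sequence, written 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def hydrogen_bonds(strand):
    """Count hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    s = strand.upper()
    return 2 * (s.count("A") + s.count("T")) + 3 * (s.count("G") + s.count("C"))


print(reverse_complement("ATGC"))  # GCAT
print(hydrogen_bonds("ATGC"))      # 10
```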
Nucleic acid structure - DNA
Each DNA strand serves as a template for replication or repair of the other strand, and for RNA synthesis
Nucleic acid structure - RNA
RNA is single stranded
RNAs fold to form regions of internal double helix with complementary base pairs
Many functional RNAs have globular shapes (green: sugar-phosphate backbone; gray: paired bases)
Genetic code – Linear code
One DNA strand serves as a template for mRNA synthesis
The linear order of nucleotides in the mRNA corresponds to the amino acid sequence of the polypeptide
Triplet codons specify insertion of amino acids
The DNA coding strand is complementary to the template strand, so its sequence is the same as the mRNA (with T instead of U)
Genetic code – triplet codons
tRNA “reads” genetic code: anticodon is complementary to the mRNA codon
Redundant genetic code: some amino acids are specified by multiple codons
First codon is AUG (Met)
Three stop codons specify termination of polypeptide synthesis
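Reading triplet codons can be sketched as a table lookup. The table below is the standard genetic code, built from the conventional U, C, A, G ordering; "*" marks the three stop codons. The function names are hypothetical, and the sketch assumes an uppercase sequence containing only the four bases.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand):
    """The mRNA matches the DNA coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGCCCAAATAA")))  # MPK
```

Several codons map to the same letter in AMINO_ACIDS, which is the redundancy mentioned above.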
Control of Gene Expression – Gene Structure
RNA polymerase binds to DNA at the promoter to initiate transcription
Other proteins bind to sequences near the exons to regulate transcription
The information in genes (exons) is interrupted by non-coding sequences (introns) that are removed from RNA by splicing
5’ cap and poly(A) tail are added for mRNA stability
Control of gene expression – DNA binding proteins can activate or inhibit transcription
Multiple proteins must bind to DNA to initiate transcription
Some proteins bind near the promoter, while others bind farther away at enhancers
DNA bends so that enhancer-binding proteins help RNA polymerase assemble on the promoter
Some proteins bind to DNA or to activator proteins to block transcription initiation
Control of gene expression – how proteins bind to specific DNA sequences
DNA binding proteins insert an α-helix into the major groove of DNA
Helix-turn-helix domain
Control of gene expression – how proteins bind to specific DNA sequences
The protein stalls on the DNA where amino acids form the maximum number of hydrogen bonds with nucleotide bases in the major groove
Control of gene expression – DNA binding protein domains
Homeodomain
Zinc finger
Leucine zipper
Control of gene expression – other points of control
Whether or not a gene is “expressed” can be controlled at any step that affects protein concentration and function
Alternative splicing (post-transcriptional processing) produces multiple proteins from one gene
Control of translation (siRNAs block translation)
Control of protein activity (phosphorylation switches proteins on or off)
mRNA or protein longevity (how quickly it is degraded)
Control of gene expression – Alternative splicing
Analysis of gene expression
Microarrays, GeneChips
Three classes of software:
o Reading the images
o Clustering the data, building associations: GeneSpring, GeneSifter, Bioconductor
o Warehousing the data: GEO, SMD, YMD, others
Meta-analysis is difficult due to variability
Mutations
Although DNA is replicated and repaired accurately, rare mistakes are made, which alter the nucleotide sequence
Exposure to some chemicals and radiation damages DNA, increasing the likelihood of mutation
Although rare, a few nucleotide sequence changes occur with each generation
Mutations introduce genetic variability in a population of individuals
Introduction to Molecular Biology for Computer Scientists
III. The centrality of evolution by natural selection in biology
A. Universal genetic code
B. Types of mutations and their effects
C. Genomes
D. Natural selection acts on random variation to produce evolution
Universal genetic code
All living organisms use the same genetic code: CCC encodes proline in all cells
All organisms are descended from a common ancestor – all life on earth evolved from a common point of origin
The impressive variation in living organisms arose through random changes in nucleotide sequence (mutation) acted upon by natural selection over billions of years
Mutation – nucleotide substitutions
Mutations are random changes in nucleotide sequence
Mutations in non-coding sequences and codon third position are often silent
Some nucleotide substitutions change the amino acid
Nonsense mutations introduce stop codons and truncate a polypeptide
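The three outcomes of a substitution inside a codon can be sketched with the standard genetic code table. The function name is hypothetical, "missense" is the usual term for a substitution that changes the amino acid, and the sketch assumes the original codon is not itself a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change in one codon (position is 0, 1 or 2)."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"      # same amino acid
    if after == "*":
        return "nonsense"    # new stop codon truncates the polypeptide
    return "missense"        # different amino acid


print(classify_substitution("CCC", 2, "U"))  # silent: CCU is still proline
print(classify_substitution("CCC", 0, "U"))  # missense: UCC is serine
print(classify_substitution("UAC", 2, "A"))  # nonsense: UAA is a stop codon
```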
Mutations - Frameshifts
Insertion or deletion of a nucleotide shifts the reading frame and drastically alters amino acid sequence
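A reading frame is just a choice of where to start cutting the sequence into triplets, so the effect of a single insertion can be shown without a codon table. The function names and the example sequence are hypothetical.

```python
def codons(seq, frame=0):
    """Split a sequence into complete triplets, starting at offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def insert_base(seq, position, base):
    """Return seq with `base` inserted before index `position`."""
    return seq[:position] + base + seq[position:]


original = "AUGCCCAAAGGG"
mutant = insert_base(original, 4, "U")

print(codons(original))  # ['AUG', 'CCC', 'AAA', 'GGG']
print(codons(mutant))    # ['AUG', 'CUC', 'CAA', 'AGG']
```

Every codon downstream of the insertion changes, not only the one that received the extra base. Calling codons(original, 1) or codons(original, 2) gives the other two reading frames of the same strand.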
Mutations - Frameshifts
Frameshift tolerant matching programs add additional states corresponding to transitions to other reading frames
FrameSearch, Wise2, BLAST with OOF option
Mutations – Chromosomal mutations
Large duplications generate multiple copies of genes
Large deletions remove genes from the genome
Duplications, deletions, inversions, and translocations are preserved in future generations, so can be used to trace evolutionary history
Mutations – effects on organisms
Random mutation produces genetic variability that is acted upon by natural selection
Most mutations are deleterious: mutations occur randomly, and are more likely to disrupt protein function than alter it in a positive way
Deleterious mutations are eliminated by natural selection
Rare mutations that introduce altered, even beneficial functions are positively selected
Genomes – Eukaryotic cells
Animals, Plants, Fungi, and some other organisms (Protists) have eukaryotic cells
Eukaryotic organisms have existed on earth for millions of years
Eukaryotic genomes are in linear pieces of DNA called chromosomes
Eukaryotic genes usually have introns
Eukaryotic genes are separated by lots of spacer DNA that does not encode proteins
Many repetitive sequences are present:
o Tandem repeats of short sequences, like GAC
o Transposons (for example, LINEs and SINEs)
Genomes – Prokaryotic cells
Bacteria have prokaryotic cells
Prokaryotic organisms have existed on earth for billions of years
Most prokaryotic genomes consist of one circular DNA molecule
Most prokaryotic genes lack introns
Prokaryotic genomes have little spacer DNA; most DNA encodes known RNAs or proteins
There is much less repetitive DNA than in eukaryotic genomes
Plasmids, tiny circular DNA molecules separate from the main genome, are used in recombinant DNA technology to introduce genes into bacteria
Genomes – evolution through chromosomal mutation
Gene duplications result in multigene families
After duplication, one gene can provide the original function while the other may evolve (through mutation and selection) a different function
The globin gene family evolved through duplication, mutation, and selection
Genomes – evolution through chromosomal mutation
α- and β-globin genes vary in amino acid sequence, yet share the same conformation
Sequence differences are fairly conservative
Multiple Sequence Alignment
Genomes – evolution through chromosomal mutation
As species diverge, chromosomal mutations shuffle genome content
The more recently two species diverged, the more similar their genome organization
Both chromosomal and single nucleotide mutations can be used to trace the evolutionary history of species
Genetic information in human chromosomes (blue) compared to dog chromosomes
Credits
Many figures were borrowed from three sources:
o Nelson and Cox, Lehninger Principles of Biochemistry, 4e, WH Freeman and Co., 2005 ISBN: 0-7167-4339-6
o Raven, Johnson, Losos, Mason, and Singer, Biology, 8e, McGraw-Hill, 2008 ISBN: 978-0-07-296581-0
o Klug, Cummings, and Spencer, Essentials of Genetics, 6e, Pearson/Prentice Hall, 2007 ISBN: 0-13-224127-7
These texts and the accompanying on-line materials are excellent resources for learning more molecular biology
Thank You!
Marty is at [email protected]
Suzanne is at [email protected]