
Introduction to Molecular Biology for Computer Scientists

Dr. Suzanne Gollery – Sierra Nevada College

Martin Gollery – Active Motif

Who are we?

Suzanne – Assistant Professor, Sierra Nevada College

Formerly at Baylor College of Medicine, UC Berkeley

Marty – Senior Scientist, Active Motif, Inc.

Formerly at University of Nevada, Reno; TimeLogic; etc.

Why Do Bioinformaticists need it?

Avoiding mistakes

Understanding the purpose

Appreciating the difficulties

We will look at some of the programs applicable to these concepts


I. Protein structure and function

A. Protein structure

B. Protein function

II. Nucleic acid structure and function

A. The central dogma

B. Nucleic acid structure

C. Genetic code

D. Control of gene expression

E. Mutation


III. The centrality of evolution by natural selection in biology

A. Universal genetic code

B. Types of mutations and their effects

C. Genomes

D. Natural selection acts on random variation to produce evolution


I. Protein structure and function

A. Protein structure

B. Protein function

Protein structure – Amino acids

An amino acid has four groups attached to a central carbon atom:

An amino group

A carboxyl group

A hydrogen atom

A variable side chain (R group)

Proteins use L isomers of amino acids

Protein structure – 20 amino acids are used in proteins

Polar amino acid side chains have partial negative or positive electrostatic charges

O and N atoms “hog” electrons and have partial negative charges

Atoms attached to N and O have partial positive charges

Protein structure – 20 amino acids are used in proteins

Non-polar amino acid side chains have no electrostatic charge

Protein structure – 20 amino acids are used in proteins

Charged amino acids are acidic or basic, and donate or accept a H+ in cells

Some charged amino acids have a positive electrostatic charge

Other charged amino acids have a negative electrostatic charge

Protein structure – 20 amino acids are used in proteins

Aromatic amino acids have a bulky carbon ring structure in their side chains

Protein structure – Analysis

X-ray crystallography: interference patterns

Nuclear Magnetic Resonance (NMR)

~31,000 structures in the Protein Data Bank (PDB)

PDB is organized by other databases

Tends to emphasize certain types of proteins

Protein structure - Polypeptides

Amino acids are joined to produce polypeptides

A water molecule is removed as the amino group of an amino acid reacts with the carboxyl group at the end of a polypeptide

The peptide bond (yellow) is a planar structure

Protein structure – Protein folding

Chemical interactions among amino acids determine the final 3D shape (conformation) of a protein

Protein structure – Protein folding

Protein structure

Primary structure: the linear order of amino acids in a polypeptide

Secondary structure: regions of α-helix or β-sheet

Tertiary structure: globular folded polypeptide

Quaternary structure: multiple folded polypeptides form a complete protein – hemoglobin in this figure

Protein Structure- Primary

Sequence yields structure/function

Homology searching paradigm: similar primary structures yield similar functions

Needleman-Wunsch, Smith-Waterman, FASTA, BLAST

Similarity scoring: amino acids with similar properties can replace each other without breaking structure
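
The similarity-scoring idea can be sketched as a small global alignment (Needleman-Wunsch) scorer. This is only a sketch: the property groups, the scores (+2 identical, +1 same group, -1 otherwise) and the gap penalty are illustrative assumptions, not a published substitution matrix such as BLOSUM62.

```python
# Hypothetical property groups, loosely following the side-chain slides.
GROUPS = [set("AVLIMC"), set("FWY"), set("STNQ"), set("KRH"), set("DE")]


def score_pair(a, b):
    """Score one aligned pair of residues (assumed toy scheme)."""
    if a == b:
        return 2
    if any(a in g and b in g for g in GROUPS):
        return 1  # similar properties: a conservative replacement
    return -1


def needleman_wunsch_score(s, t, gap=-2):
    """Best global alignment score of s and t with a linear gap penalty."""
    prev = [j * gap for j in range(len(t) + 1)]  # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + score_pair(s[i - 1], t[j - 1]),  # align two residues
                prev[j] + gap,                                 # gap in t
                curr[j - 1] + gap,                             # gap in s
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for other in ("KDGII", "KDGVI", "KDGDI"):
        print(other, needleman_wunsch_score("KDGII", other))
```

Replacing I with V (both non-polar) costs less than replacing I with D (charged), which is the behaviour the slide describes.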

Secondary structure

Prediction of secondary structure from primary sequence is tractable

Programs: Coils, PHD, PREDATOR, JPRED

Secondary structure may be used to improve alignments

Protein Folding- Prediction

Very computationally intensive

A short protein (100 amino acids) would take ~20 days straight on a petaflop computer

Force field approximation programs: CHARMM, AMBER

Accelerated versions based on Field Programmable Gate Array (FPGA) technology

Protein structural Motifs

Structural Motifs

Repeated or combined motifs form functional domains

Domains predict protein function

NAD(P)-binding domain of proteins that bind to NAD, an electron carrier

β-sandwich domain of cell surface recognition proteins (Ig, MHC, CD4)

Functional motifs

eMATRIX/eMOTIF

MEME/MetaMEME

FingerPrintScan

PHI-BLAST

Hidden Markov Models

Represent domains, motifs or proteins

Major programs include:

HMMer (hmmer.wustl.edu)

SAM (www.cse.ucsc.edu/research/compbio/sam)

Wise tools (www.ebi.ac.uk/Wise2/)

Meta-MEME (metameme.sdsc.edu/)

PSI-BLAST (www.ncbi.nlm.nih.gov/blast)

DeCypherHMM (www.timelogic.com)

What are HMMs, anyway?

Statistical description of a protein family's consensus sequence

Conserved regions receive highest scores

Can be seen as a Finite State Machine

Hidden Markov Models

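Aligned sequences:

yciH      KDGII
ZyciH     KDGVI
VCA0570   KDGDI
HI1225    KNGII
sll0546   KEDCV

Residue frequency at each position:

Position   C     D     E     G     I     K     N     V
1          -     -     -     -     -     1.0   -     -
2          -     0.6   0.2   -     -     -     0.2   -
3          -     0.2   -     0.8   -     -     -     -
4          0.2   0.2   -     -     0.4   -     -     0.2
5          -     -     -     -     0.8   -     -     0.2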

Contrast with RE type motif, K[DEN][DG][CDIV][IV]
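
The contrast between the frequency table and the regular-expression motif can be sketched in a few lines. This is a simplified position-specific frequency profile built from the five sequences on the slide; it is not a full HMM (no insert or delete states, no pseudocounts, no log-odds scores), and the function names are illustrative.

```python
import re
from collections import Counter

ALIGNED = ["KDGII", "KDGVI", "KDGDI", "KNGII", "KEDCV"]


def column_frequencies(seqs):
    """Per-position residue frequencies for equal-length aligned sequences."""
    n = len(seqs)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*seqs)]


def profile_score(profile, seq):
    """Product of per-position frequencies; 0.0 if any residue was never seen."""
    score = 1.0
    for freqs, aa in zip(profile, seq):
        score *= freqs.get(aa, 0.0)
    return score


PROFILE = column_frequencies(ALIGNED)
MOTIF = re.compile(r"K[DEN][DG][CDIV][IV]")  # the RE-type motif from the slide

if __name__ == "__main__":
    for candidate in ("KDGII", "KEDCV", "KADII"):
        matches = MOTIF.fullmatch(candidate) is not None
        print(candidate, matches, profile_score(PROFILE, candidate))
```

The regular expression accepts KDGII and KEDCV equally, while the profile gives the consensus-like KDGII a much higher score than KEDCV.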

HMM databases

Pfam, TIGRfam, Superfamily, SMART, COG, KinFam, PirSF, Panther, KOG, … etc.

Protein structure – Modifications after protein synthesis

Prosthetic groups associate (Heme of cytochrome c)

Polypeptides are trimmed or cut

Sugars are attached

Other chemical groups are attached

Chaperone proteins assist in polypeptide folding

Protein function – Binding to ligand

Ligand (antigen peptide) fits into a cleft on a protein (MHC) like two puzzle pieces fit together

Protein function – Binding to ligand

SitesBase –information on known ligand binding sites from the PDB

LigBase adds related sequences and structures

Protein function – Chymotrypsin binding to ligand: complementarity of electrostatic charge and 3D shape

Protein function – proteins bind to other proteins

Myosin head binds to actin during muscle contraction

Protein shapes are complementary like puzzle pieces

Binding is reversible

Goodness of fit (shape, charge, hydrophobicity) determines affinity of protein/ligand binding

Protein function – changes in shape are essential to protein function

Proteins are dynamic machines

Two or a few protein conformations may be of similar stability

Ligand binding can act as a switch to change protein conformation

Induced fit: binding of glucose (red) to hexokinase changes enzyme conformation

Protein function – changes in shape are essential to protein function

Hemoglobin switches between a T (taut) deoxygenated and an R (relaxed) oxygenated state

Protein function – changes in shape are essential to protein function

Binding of lactose to the lactose transport protein changes the shape of the protein

Lactose (red) binds to the protein on one side of a cell membrane and is released on the other side

Movement across the membrane is reversible

Protein function – changes in shape are essential to protein function

The Na+ -K+ pump switches between two conformations when a phosphate group is added or removed

Attaching phosphate to the Na+ -K+ pump switches the protein’s shape, moving Na+ outside the cell

Removing phosphate switches the protein’s shape again, moving K+ into the cell

Protein function – changes in shape are essential to protein function

Myosin heads cycle between multiple conformations during muscle contraction

Binding and release of ATP, ADP, and phosphate (Pi) trigger changes in myosin head conformation

Myosin heads pull on actin to make muscle fibers shorter

Protein function – changes in shape are essential to protein function

Intrinsically Disordered Proteins have roles in signalling, etc.

Some take shape only when interacting

Tend to form hubs in interaction networks

Predict with PONDR, Spritz, Wiggle, FoldUnfold

DisProt database of ID proteins

Protein function - Phosphorylation

Phosphorylation changes a protein’s shape

Phosphorylation may turn a protein on or off

Kinases: enzymes that attach phosphates to proteins

Phosphatases: enzymes that remove phosphates

Other charged molecules and ions (cAMP, Ca++…) also bind to proteins to switch them on or off

Protein structure – Modifications after protein synthesis

Post-Translational Modifications (PTM)

Phosphorylation takes place on S, T or Y (serine, threonine or tyrosine), but only in certain situations

NetPhosK uses Artificial Neural Networks

KinasePhos uses HMMs to predict phosphorylation sites
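
As a naive starting point, the residues that could in principle be phosphorylated can simply be listed. This sketch only enumerates S, T and Y positions; it makes no prediction about which sites are actually used, which is the hard part that tools such as NetPhosK and KinasePhos address with sequence context. The function name and the example peptide are made up for illustration.

```python
def candidate_phosphosites(protein):
    """Return (1-based position, residue) for every S, T or Y in the sequence."""
    return [(i + 1, aa) for i, aa in enumerate(protein.upper()) if aa in "STY"]


if __name__ == "__main__":
    print(candidate_phosphosites("MSTKYA"))
```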

Protein Interaction Networks


II. Nucleic acid structure and function

A. The central dogma

B. Nucleic acid structure

C. Genetic code

D. Control of gene expression

E. Mutation

The central dogma of molecular biology

DNA contains instructions for making RNA and protein

DNA is transcribed (copied) to make messenger RNA (mRNA)

mRNA is translated (instructions read) to make proteins

Central dogma - Transcription

In eukaryotic cells, transcription occurs in the nucleus

mRNAs are modified before transport to the cytoplasm

Other RNAs are also transcribed:

o tRNAs: “read” genetic code in mRNA

o rRNAs: backbone of ribosomes

o small RNAs: enzymes and regulators of gene function

Central dogma - Translation

Translation occurs in the cytoplasm on ribosomes

tRNAs match the correct amino acids to codons on the mRNA

Ribosomal enzymes join amino acids to the growing polypeptide

The emerging polypeptide folds

Nucleic acid structure - Nucleotides

Nucleotides are built from a sugar, base, and phosphate

Carbon atoms in the sugar are numbered

DNA has 2’ deoxyribose; RNA has ribose

DNA has G, A, T, and C

RNA has G, A, U, and C

Nucleic acid structure - Polynucleotides

The 5’ phosphate of one nucleotide is joined to the 3’ hydroxyl group of the preceding nucleotide

The beginning of a nucleic acid has a free 5’ phosphate group

The end of a nucleic acid has a free 3’ hydroxyl group

Nucleic acid structure - DNA

Double stranded DNA forms a helix

DNA strands are joined by hydrogen bonds between complementary bases

A always pairs with T; two hydrogen bonds

G always pairs with C; three hydrogen bonds
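
The pairing rules make the second strand computable from the first. A minimal sketch, assuming an uppercase or lowercase string over A, C, G, T with no ambiguity codes; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def hydrogen_bonds(dna):
    """Hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    dna = dna.upper()
    at_pairs = dna.count("A") + dna.count("T")
    gc_pairs = dna.count("G") + dna.count("C")
    return 2 * at_pairs + 3 * gc_pairs


if __name__ == "__main__":
    print(reverse_complement("ATGC"), hydrogen_bonds("ATGC"))
```

Because the strands are antiparallel, the complement is reversed so that both strands read 5' to 3'.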

Nucleic acid structure - DNA

Each DNA strand serves as a template for replication or repair of the other strand, and for RNA synthesis

Nucleic acid structure - RNA

RNA is single stranded

RNAs fold to form regions of internal double helix with complementary base pairs

Many functional RNAs have globular shapes (green: sugar-phosphate backbone; gray: paired bases)

Genetic code – Linear code

One DNA strand serves as a template for mRNA synthesis

The linear order of nucleotides in the mRNA corresponds to the amino acid sequence of the polypeptide

Triplet codons specify insertion of amino acids

The DNA coding strand is complementary to the template strand, so its sequence is the same as that of the mRNA (with T instead of U)
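
The relationship between the coding strand, the template strand and the mRNA can be checked with a short sketch. Both strands are written 5' to 3' here; the function names and the example sequence are assumptions for illustration.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")  # A on the template pairs with U


def template_strand(coding):
    """Template strand (5' to 3') for a given coding strand (5' to 3')."""
    return coding.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(template):
    """mRNA (5' to 3') made by base-pairing along the template strand."""
    return template.upper().translate(RNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    coding = "ATGGCC"
    mrna = transcribe(template_strand(coding))
    print(mrna, mrna == coding.replace("T", "U"))
```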

Genetic code – triplet codons

tRNA “reads” genetic code: anticodon is complementary to the mRNA codon

Redundant genetic code: some amino acids are specified by multiple codons

First codon is AUG (Met)

Three stop codons specify termination of polypeptide synthesis
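
Reading triplet codons can be sketched as a lookup table plus a loop. The table below is the standard genetic code written in the compact U, C, A, G ordering; the `translate` helper is an illustrative assumption that starts at the first AUG and ignores everything a real cell also checks (5' cap, ribosome binding context, and so on).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]  # raises KeyError on non-ACGU input
        if aa == "*":                    # stop codon: release the polypeptide
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGAAAUGGUAGCC"))
    leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
    print(leucine_codons)  # redundancy: several codons for one amino acid
```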

Control of Gene Expression – Gene Structure

RNA polymerase binds to DNA at the promoter to initiate transcription

Other proteins bind to sequences near the exons to regulate transcription

The information in genes (exons) is interrupted by non-coding sequences (introns) that are removed from RNA by splicing

5’ cap and poly(A) tail are added for mRNA stability

Control of gene expression – DNA binding proteins can activate or inhibit transcription

Multiple proteins must bind to DNA to initiate transcription

Some proteins bind near the promoter, while others bind farther away at enhancers

DNA bends so that enhancer-binding proteins help RNA polymerase assemble on the promoter

Some proteins bind to DNA or to activator proteins to block transcription initiation

Control of gene expression – how proteins bind to specific DNA sequences

DNA binding proteins insert an α-helix into the major groove of DNA

Helix-turn-helix domain

Control of gene expression – how proteins bind to specific DNA sequences

The protein stalls on the DNA where amino acids form the maximum number of hydrogen bonds with nucleotide bases in the major groove

Control of gene expression – DNA binding protein domains

Homeodomain

Zinc finger

Leucine zipper

Control of gene expression – other points of control

Whether or not a gene is “expressed” can be controlled at any step that affects protein concentration and function

Alternative splicing (post-transcriptional processing) produces multiple proteins from one gene

Control of translation (small RNAs such as siRNAs and miRNAs block translation or trigger mRNA degradation)

Control of protein activity (phosphorylation switches proteins on or off)

mRNA or protein longevity (how quickly it is degraded)

Control of gene expression – Alternative splicing

Analysis of gene expression

Microarrays, GeneChips

Three classes of software:

Reading the images

Clustering the data, building associations: GeneSpring, GeneSifter, Bioconductor

Warehousing the data: GEO, SMD, YMD, others

Meta-analysis is difficult due to variability

Mutations

Although DNA is replicated and repaired accurately, rare mistakes are made, which alter the nucleotide sequence

Exposure to some chemicals and radiation damages DNA, increasing the likelihood of mutation

Although rare, a few nucleotide sequence changes occur with each generation

Mutations introduce genetic variability in a population of individuals


III. The centrality of evolution by natural selection in biology

A. Universal genetic code

B. Types of mutations and their effects

C. Genomes

D. Natural selection acts on random variation to produce evolution

Universal genetic code

Nearly all living organisms use the same genetic code, with only minor variants (for example, in mitochondria): CCC encodes proline in all cells

All organisms are descended from a common ancestor – all life on earth evolved from a common point of origin

The impressive variation in living organisms arose through random changes in nucleotide sequence (mutation) acted upon by natural selection over billions of years

Mutation – nucleotide substitutions

Mutations are random changes in nucleotide sequence

Mutations in non-coding sequences and codon third position are often silent

Some nucleotide substitutions change the amino acid

Nonsense mutations introduce stop codons and truncate a polypeptide
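
The effect of a single nucleotide substitution on one codon can be classified with the standard genetic code. A minimal sketch: the label "missense" for an amino acid change is standard terminology rather than a term from the slide, and the function name is illustrative.

```python
from itertools import product

# Standard genetic code for DNA codons, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def classify_substitution(codon, position, new_base):
    """Classify replacing codon[position] (0, 1 or 2) with new_base."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"    # same amino acid (or the base did not change)
    if after == "*":
        return "nonsense"  # new stop codon truncates the polypeptide
    return "missense"      # different amino acid; loss of a stop also lands here


if __name__ == "__main__":
    print(classify_substitution("CCC", 2, "A"))  # third position of a Pro codon
    print(classify_substitution("CCC", 0, "A"))
    print(classify_substitution("TGG", 2, "A"))
```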

Mutations - Frameshifts

Insertion or deletion of a nucleotide shifts the reading frame and drastically alters amino acid sequence
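
A frameshift is easy to see by translating a sequence before and after inserting one base. A minimal sketch with a made-up 15-base sequence; `translate_frame` reads codons from the first base and does not stop at stop codons, so it is a simplification of real translation.

```python
from itertools import product

# Standard genetic code for DNA codons, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def translate_frame(dna):
    """Translate codon by codon from the first base ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def insert_base(dna, position, base):
    """Return dna with one base inserted before the given 0-based position."""
    return dna[:position] + base + dna[position:]


if __name__ == "__main__":
    original = "ATGGCCAAAGGTTGC"  # hypothetical example sequence
    shifted = insert_base(original, 3, "C")
    print(translate_frame(original))
    print(translate_frame(shifted))
```

Every codon downstream of the insertion is read in a new frame, so the amino acid sequence after the first residue changes completely.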

Mutations - Frameshifts

Frameshift tolerant matching programs add additional states corresponding to transitions to other reading frames

FrameSearch, Wise2, BLAST with OOF option

Mutations – Chromosomal mutations

Large duplications generate multiple copies of genes

Large deletions remove genes from the genome

Duplications, deletions, inversions, and translocations are preserved in future generations, so can be used to trace evolutionary history

Mutations – effects on organisms

Random mutation produces genetic variability that is acted upon by natural selection

Most mutations are deleterious: mutations occur randomly, and are more likely to disrupt protein function than alter it in a positive way

Deleterious mutations are eliminated by natural selection

Rare mutations that introduce altered, even beneficial functions are positively selected

Genomes – Eukaryotic cells

Animals, Plants, Fungi, and some other organisms (Protists) have eukaryotic cells

Eukaryotic organisms have existed on earth for more than a billion years

Eukaryotic genomes are in linear pieces of DNA called chromosomes

Eukaryotic genes usually have introns

Eukaryotic genes are separated by lots of spacer DNA that does not encode proteins

Many repetitive sequences are present:

o Tandem repeats of short sequences, like GAC

o Transposons (for example, LINEs and SINEs)

Genomes – Prokaryotic cells

Bacteria have prokaryotic cells

Prokaryotic organisms have existed on earth for billions of years

Most prokaryotic genomes consist of one circular DNA molecule

Most prokaryotic genes lack introns

Prokaryotic genomes have little spacer DNA; most DNA encodes known RNAs or proteins

There is much less repetitive DNA than in eukaryotic genomes

Plasmids, tiny circular DNA molecules separate from the main genome, are used in recombinant DNA technology to introduce genes into bacteria

Genomes – evolution through chromosomal mutation

Gene duplications result in multigene families

After duplication, one gene can provide the original function while the other may evolve (through mutation and selection) a different function

The globin gene family evolved through duplication, mutation, and selection

Genomes – evolution through chromosomal mutation

α- and β-globin genes vary in amino acid sequence, yet share the same conformation

Sequence differences are fairly conservative

Multiple Sequence Alignment

Genomes – evolution through chromosomal mutation

As species diverge, chromosomal mutations shuffle genome content

The more recently two species diverged, the more similar their genome organization

Both chromosomal and single nucleotide mutations can be used to trace the evolutionary history of species

Genetic information in human chromosomes (blue) compared to dog chromosomes

Credits

Many figures were borrowed from three sources:

o Nelson and Cox, Lehninger Principles of Biochemistry, 4e, WH Freeman and Co., 2005 ISBN: 0-7167-4339-6

o Raven, Johnson, Losos, Mason, and Singer, Biology, 8e, McGraw-Hill, 2008 ISBN: 978-0-07-296581-0

o Klug, Cummings, and Spencer, Essentials of Genetics, 6e, Pearson/Prentice Hall, 2007 ISBN: 0-13-224127-7

These texts and the accompanying on-line materials are excellent resources for learning more molecular biology

Thank You!

Marty is at [email protected]

Suzanne is at [email protected]
