c e n t r e introduction to bioinformatics · 2008. 2. 12. · • bioinformatics and molecular...

Introduction to bioinformatics

Lecture 2

Genes and Genomes

Centre for Integrative Bioinformatics VU

Organisational

• Course website: http://ibi.vu.nl/teaching/mnw_2year/mnw2_2008.php or click on http://ibi.vu.nl (> teaching > Introduction to Bioinformatics)

• Course book:

– Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood (Blackwell Publishing), 2005, ISBN (Pbk) 1-4051-0683-2

– Essential Bioinformatics by Jin Xiong, Cambridge University Press, 2006, ISBN 0521840988

• Lots of information about Bioinformatics can be found on the web.

.....acctc ctgtgcaaga acatgaaaca nctgtggttctcccagatgg gtcctgtccc aggtgcacct gcaggagtcgggcccaggac tggggaagcc tccagagctc aaaaccccacttggtgacac aactcacaca tgcccacggt gcccagagcccaaatcttgt gacacacctc ccccgtgccc acggtgcccagagcccaaat cttgtgacac acctccccca tgcccacggtgcccagagcc caaatcttgt gacacacctc ccccgtgcccccggtgccca gcacctgaac tcttgggagg accgtcagtcttcctcttcc ccccaaaacc caaggatacc cttatgatttcccggacccc tgaggtcacg tgcgtggtgg tggacgtgagccacgaagac ccnnnngtcc agttcaagtg gtacgtggacggcgtggagg tgcataatgc caagacaaag ctgcgggaggagcagtacaa cagcacgttc cgtgtggtca gcgtcctcaccgtcctgcac caggactggc tgaacggcaa ggagtacaagtgcaaggtct ccaacaaagc aaccaagtca gcctgacctgcctggtcaaa ggcttctacc ccagcgacat cgccgtggagtgggagagca atgggcagcc ggagaacaac tacaacaccacgcctcccat gctggactcc gacggctcct tcttcctctacagcaagctc accgtggaca agagcaggtg gcagcaggggaacatcttct catgctccgt gatgcatgag gctctgcacaaccgctacac gcagaagagc ctctc.....

DNA sequence

Genome size

Organism                   Number of base pairs
φX-174 virus               5,386
Epstein-Barr virus         172,282
Mycoplasma genitalium      580,000
Haemophilus influenzae     1.8 × 10^6
Yeast (S. cerevisiae)      12.1 × 10^6
Human                      3.2 × 10^9
Wheat                      16 × 10^9
Lilium longiflorum         90 × 10^9
Salamander                 100 × 10^9
Amoeba dubia               670 × 10^9

Four DNA nucleotide building blocks

G-C is more strongly hydrogen-bonded than A-T
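A small sketch can make the G-C versus A-T difference concrete. It counts Watson-Crick hydrogen bonds in a duplex, assuming the textbook values of three bonds per G-C pair and two per A-T pair (these numbers are not stated on the slide). The function names `gc_content` and `hydrogen_bonds` are illustrative only.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def hydrogen_bonds(seq: str) -> int:
    """Hydrogen bonds formed when seq pairs with its complement.

    Assumption: 3 bonds per G-C pair and 2 per A-T pair.
    Any other symbol (for example 'N') is ignored.
    """
    seq = seq.upper()
    gc_pairs = sum(base in "GC" for base in seq)
    at_pairs = sum(base in "AT" for base in seq)
    return 3 * gc_pairs + 2 * at_pairs


if __name__ == "__main__":
    for duplex in ("ATATATAT", "GCGCGCGC"):
        print(duplex, gc_content(duplex), hydrogen_bonds(duplex))
```

Two strands of equal length can therefore be held together by quite different numbers of hydrogen bonds, depending only on their G-C fraction.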

A gene codes for a protein


DNA: CCTGAGCCAACTATTGATGAA

mRNA: CCUGAGCCAACUAUUGAUGAA

Protein: PEPTIDE
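The example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard genetic code and that the DNA shown is the coding strand, so the mRNA is the same sequence with U in place of T. The names `transcribe` and `translate` are illustrative, not part of any library.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, one letter per codon.

    '*' marks a stop codon; codons with unknown symbols become 'X'.
    A trailing partial codon is ignored.
    """
    dna = mrna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    mrna = transcribe(dna)
    print(mrna)
    print(translate(mrna))
```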


DNA makes RNA makes Protein

… yet another picture to appreciate the above statement

Transcription Factors

[Figure labels: mRNA transcription; TATA; TF binding site (TFBS), open and closed; TF; Pol II]

Transcription factor (TF): a protein that binds to DNA and to a polymerase (Pol II)

Polymerase: a complex protein that transcribes DNA into mRNA

The transcription factor-polymerase interaction sets off gene transcription…

… many TFBSs are possible upstream of a gene

Nucleosomes (chromatin structures composed of histones) are structures around which DNA coils. This coiling blocks the access of TFs to the DNA.


Amino Acid       Single Letter Code   DNA codons
Isoleucine       I                    ATT, ATC, ATA
Leucine          L                    CTT, CTC, CTA, CTG, TTA, TTG
Valine           V                    GTT, GTC, GTA, GTG
Phenylalanine    F                    TTT, TTC
Methionine       M, Start             ATG
Cysteine         C                    TGT, TGC
Alanine          A                    GCT, GCC, GCA, GCG
Glycine          G                    GGT, GGC, GGA, GGG
Proline          P                    CCT, CCC, CCA, CCG
Threonine        T                    ACT, ACC, ACA, ACG
Serine           S                    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine         Y                    TAT, TAC
Tryptophan       W                    TGG
Glutamine        Q                    CAA, CAG
Asparagine       N                    AAT, AAC
Histidine        H                    CAT, CAC
Glutamic acid    E                    GAA, GAG
Aspartic acid    D                    GAT, GAC
Lysine           K                    AAA, AAG
Arginine         R                    CGT, CGC, CGA, CGG, AGA, AGG
Stop codons      Stop                 TAA, TAG, TGA
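The codon table above is many-to-one: 64 codons map to 20 amino acids plus a stop signal. The sketch below rebuilds the standard code from a compact string and groups the codons by amino acid, to show how uneven that redundancy is. The compact encoding (codons enumerated in TCAG order) and the helper name `codons_by_amino_acid` are choices made for this example only.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; '*' is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_by_amino_acid(table: dict) -> dict:
    """Invert a codon table: one-letter amino acid code -> list of codons."""
    groups = defaultdict(list)
    for codon, aa in table.items():
        groups[aa].append(codon)
    return dict(groups)


if __name__ == "__main__":
    groups = codons_by_amino_acid(CODON_TABLE)
    # Most redundant amino acids first.
    for aa in sorted(groups, key=lambda a: (-len(groups[a]), a)):
        print(aa, len(groups[aa]), ", ".join(groups[aa]))
```

Leucine, serine and arginine each have six codons, whereas methionine and tryptophan have only one.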


Ban et al., Science 289 (905-920), 2000


Three main principles

• DNA makes RNA makes Protein

• Structure more conserved than sequence

• Sequence → Structure → Function

How to go from DNA to protein sequence

A piece of double stranded DNA:

5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’

DNA direction is from 5’ to 3’

How to go from DNA to protein sequence

6-frame conceptual translation using the codon table:

5’ attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’

So an unknown piece of DNA can be translated in six reading frames (three on each strand), of which at most one is likely to correspond to a natural protein.
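The six conceptual translations of the example fragment can be enumerated in a short script. This is a sketch under stated assumptions: it uses the standard genetic code, labels the frames "+1" to "+3" for the top strand and "-1" to "-3" for the reverse complement (a common convention, not one taken from the slide), and writes stop codons as "*". All function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; '*' is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna: str, offset: int) -> str:
    """Translate dna starting at offset (0, 1 or 2); unknown codons give 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(rc, offset)
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
        print(label, protein)
```

A frame that contains an internal "*" is interrupted by a stop codon, which is one simple clue when deciding which frame, if any, is plausible.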

Remark

• Identifying (annotating) human genes, i.e. finding what they are and what they do, is a difficult problem

– First, the gene should be delineated on the genome

• Gene finding methods should be able to tell a gene region from a non-gene region

• Start and stop codons, further compositional differences

– Then, a putative function should be found for the located gene
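The start and stop codon signal mentioned above can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch of the simplest possible idea: it looks for ATG followed by an in-frame stop on one strand and ignores introns, compositional statistics and everything else a real gene finder for human DNA would need. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end, frame) for each ATG...stop stretch on this strand.

    start and end are 0-based, end-exclusive positions in dna, and the
    stop codon is included in the stretch. min_codons counts the start
    and stop codons too. Only the given strand is scanned; pass the
    reverse complement to scan the other strand.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("GGATGCCCTGACC"))
```

A candidate without an in-frame stop codon is not reported, which is a deliberate simplification for this example.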

Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000

Evolution and three-dimensional protein structure information

Isocitrate dehydrogenase:

The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)

Genomic Data Sources

• DNA/protein sequence

• Expression (microarray)

• Proteome (X-ray, NMR, mass spectrometry)

• Metabolome

• Physiome (spatial, temporal)

Integrative bioinformatics

Dinner discussion: Integrative Bioinformatics & Genomics VU

[Figure labels: genome, transcriptome, proteome, metabolome, physiome]

Genomic Data Sources

Vertical Genomics

DNA makes RNA makes Protein (reminder)

DNA makes RNA makes Protein: Expression data

• More copies of mRNA for a gene lead to more protein

• mRNA can now be measured for all the genes in a cell at once through microarray technology

• Can have 60,000 spots (genes) on a single gene chip

• Colour change gives intensity of gene expression (over- or under-expression)
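One common way to turn spot intensities into an over- or under-expression call is a log2 ratio between a sample and a reference channel. The sketch below assumes such a two-channel design, which the slide does not specify, and uses an arbitrary two-fold threshold. The intensities, the threshold and the function names are hypothetical.

```python
import math


def log2_ratio(sample: float, reference: float) -> float:
    """log2 of sample/reference intensity; both must be positive."""
    if sample <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample / reference)


def classify(ratio_log2: float, threshold: float = 1.0) -> str:
    """Label a spot using a symmetric log2 threshold (1.0 means two-fold)."""
    if ratio_log2 >= threshold:
        return "over-expressed"
    if ratio_log2 <= -threshold:
        return "under-expressed"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical (sample, reference) intensities for three spots.
    spots = {"geneA": (800.0, 200.0), "geneB": (100.0, 400.0), "geneC": (300.0, 300.0)}
    for name, (sample, reference) in spots.items():
        ratio = log2_ratio(sample, reference)
        print(name, ratio, classify(ratio))
```

The logarithm makes the scale symmetric: a four-fold increase and a four-fold decrease give +2 and -2 respectively.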

Proteomics

• Elucidating all 3D structures of proteins in the cell

• This is also called Structural Genomics

• Finding out what these proteins do

• This is also called Functional Genomics

Protein-protein interaction networks

Metabolic networks

Glycolysis and Gluconeogenesis

KEGG database (Japan)

High-throughput Biological Data

• Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming

– genomic sequences

– arrayCGH (Comparative Genomic Hybridization) data, gene expression data

– mass spectrometry data

– protein-protein interaction data

– protein structures

– ......

Protein structural data explosion

Protein Data Bank (PDB): 14500 Structures (6 March 2001)

10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...

Bioinformatics databases grow exponentially…

Dickerson’s formula: equivalent to Moore’s law

n = e^(0.19(y - 1960))

with y the year.

Dickerson predicted that the Protein Data Bank (PDB) of protein three-dimensional structures would grow, starting with the first protein in 1960, as indicated by the above exponential growth function. On 27 March 2001 there were 12,123 3D protein structures in the PDB; Dickerson’s formula predicts 12,066 (within 0.5%, not a bad prediction)!

Sequence versus structural data

• Structural genomics initiatives are now in full swing and growth is still exponential.

• However, sequence data are growing even more rapidly. There are now more than 600 completely sequenced genomes publicly available.

Increasing gap between structural and sequence data (“Mind the gap”)

Bioinformatics

• Offers an ever more essential input to

– Molecular Biology

– Pharmacology (drug design)

– Agriculture

– Biotechnology

– Clinical medicine

– Anthropology

– Forensic science

– Chemical industries (detergent industries, etc.)

This list is from a molecular biology textbook, so it is not a self-absorbed bioinformatician saying this…
