91.350/580 Topics in Bioinformatics
What is bioinformatics?
• DNA sequences
• Protein sequences/structures
• Modeling/inference
• Intersection of biology, statistics, and computer science (algorithms, machine learning)
Topics today
Textbook, pgs. 16-19
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/PH709_DNA-Genetics/
• DNA & RNA
• Genes to Proteins: transcription, translation
• Genome
DNA & RNA
• DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed of linear chains of monomeric units of nucleotides
• A nucleotide has three parts: a sugar, a phosphate, and a base
• Four bases
Base Types
• Nucleic acid bases are of two types
• Pyrimidine [pairímədì:n]: C, T, U (two nitrogens in a 6-member ring, at positions 1 and 3)
• Purine: A, G (a pyrimidine ring fused to an imidazole ring (C3H4N2))
[Figure: nucleotide code letters. Bases A, T, G, C; ambiguity codes R, Y, W, S, M, K, B, V, D, H, N]
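The single-letter codes in the figure can be made concrete with a small lookup table. The sketch below assumes the standard IUPAC meanings of the ambiguity letters (the slide only lists the letters); the helper names `iupac_to_regex` and `matches` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes (assumed; the slide shows only the letters).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "B": "CGT", "V": "ACG", "D": "AGT", "H": "ACT", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Turn an IUPAC pattern such as 'GAYR' into a compiled regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        # A single base matches itself; an ambiguity code becomes a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def matches(pattern, sequence):
    """True if the whole sequence is compatible with the IUPAC pattern."""
    return iupac_to_regex(pattern).fullmatch(sequence.upper()) is not None
```

For example, `matches("RY", "AC")` holds because R stands for a purine (A or G) and Y for a pyrimidine (C or T).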
• Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone
• Sugar is deoxyribose in DNA (left) and ribose in RNA (right)
• Nitrogen-containing nucleobases are bonded to the sugar
Primary Structure of DNA and RNA
Online course on Biology
• Educational Portal
• DNA chemical structure
• http://education-portal.com/academy/lesson/dna-and-the-chemical-structure-of-nucleic-acids.html
• Double helix: proposed in 1953 by Watson and Crick, based on X-ray diffraction data
• Sugar-phosphate backbone is the outer part of the helix
• Two strands run in antiparallel directions
• Dimensions (1 Å = 10^-10 m = 0.1 nm)
  – Inside diameter of backbone: 11 Å (1.1 nm)
  – Outer diameter: 20 Å (2.0 nm)
  – Length of one complete turn: 34 Å (3.4 nm), 10 base pairs
• Major and minor grooves: where drugs or polypeptides bind to DNA
Secondary Structure
• Two strands are complementary
• Base pairing: A-T; G-C
• A pyrimidine and a purine pair through complementary hydrogen (H) bonding
Secondary Structure of DNA
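Because the two strands are complementary and antiparallel, one strand determines the other. A minimal sketch of that relationship, assuming an uppercase or lowercase DNA string containing only A, T, G, and C (the function name is illustrative):

```python
# Watson-Crick pairing as stated on the slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the partner strand, read 5' to 3'.

    The partner runs antiparallel, so we complement each base and
    reverse the order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation.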
• In double strands
  – # of A = # of T; # of G = # of C
  – Erwin Chargaff’s 1st Parity Rule, 1951
• In a single strand?
  – # of A ≈ # of T; # of G ≈ # of C (approximately, for sufficiently long sequences)
  – Erwin Chargaff’s 2nd Parity Rule
Monomer counts in DNA
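The first parity rule follows directly from base pairing, and can be checked by counting. The sketch below is illustrative only: the function names are made up here, and the short test sequence is not real data. The second rule is an empirical, approximate observation on long single strands, so the code only reports single-strand counts rather than asserting equality.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def base_counts(strand):
    """Count A, C, G, T in one strand (missing bases count as 0)."""
    counts = Counter(strand.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}

def duplex_counts(strand):
    """Count bases over both strands of the double helix built from `strand`."""
    partner = "".join(COMPLEMENT[b] for b in reversed(strand.upper()))
    return base_counts(strand.upper() + partner)

single = base_counts("AAGC")    # single strand: counts need not be equal
duplex = duplex_counts("AAGC")  # duplex: A == T and G == C always hold
```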
• Many consider the hydrogen bond essential to the evolution of life
• An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force
• The orderly, repetitive arrangement of H bonds in polymers determines their shape
Importance of Hydrogen Bonding
Online course on Biology
• Educational Portal
• Four bases
• http://education-portal.com/academy/lesson/dna-adenine-guanine-cytosine-thymine-complementary-base-pairing.html
• 3.4 Å per base pair
• 3 billion bases
• 1.8 meters of DNA
• 0.09 mm of chromatin after being wound on histones
• Five families of histones
  – H1/H5, H2A, H2B, H3, and H4
Chromosome Length
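The length figures can be sanity-checked with one multiplication. This is a rough sketch using the slide's values of 3.4 Å per base pair and 3 billion bases; treating the quoted 1.8 m as a per-cell (two-copy) figure is an assumption, not something the slide states.

```python
RISE_PER_BP_M = 3.4e-10      # 3.4 angstroms per base pair, in meters
HAPLOID_BP = 3_000_000_000   # 3 billion base pairs, one copy of the genome

haploid_length_m = RISE_PER_BP_M * HAPLOID_BP   # about 1 m
diploid_length_m = 2 * haploid_length_m         # about 2 m (assumes two copies per cell)

print(f"one copy: {haploid_length_m:.2f} m, two copies: {diploid_length_m:.2f} m")
```

Both results are of the same order as the 1.8 m quoted on the slide.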
RNA
• Sugar in an RNA nucleotide is ribose rather than 2’-deoxyribose
• Thymine is replaced by uracil (U)
• RNA polymers are usually a few thousand nucleotides or shorter
• RNA in cells is usually single-stranded
• RNA is considered to be the original gene-coding material, and it still encodes genes in a few viruses
RNA Types
• Four RNAs are involved in protein synthesis

RNA Type            Size      Function
Transfer RNA        Small     Transports AA to protein synthesis sites
Ribosomal RNA       Variable  Combines with proteins to form the ribosome, where the polypeptide chain grows
Messenger RNA       Variable  Carries the AA sequence information transcribed from genes
Small nuclear RNA   Small     Processing of initial mRNA to its mature form in eukaryotes
Online course on Biology
• Educational Portal
• RNA
• http://education-portal.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna-trna-rrna.html
Gene to Protein: Transcription & Translation
Genome
• An organism's genome is one or more chromosomes containing the code (genes) that directs the synthesis of the proteins essential for its structure and function
• Human: 22 pairs of homologous autosomes plus the sex chromosomes (XX or XY)
• http://www.ncbi.nlm.nih.gov/genome/?term=txid9606[orgn]
Genes
• Allele
  – Alternative forms of the same gene
  – Dominant, recessive
Transcription
Gene to Protein
[Figure: gene structure with intergenic regions, 5’ UTR, exons, introns, and 3’ UTR, marking protein-coding and non-protein-coding regions; one gene yields Protein 1 and Protein 2]
Alternative Splicing (example figure)
Translation
• Genetic Code
  – A triplet of bases (called a codon) specifies one amino acid
  – Ribosome moves along mRNA 3 bases at a time
• Degenerate coding
  – 4x4x4 = 64 possible triplets map onto 20 amino acids (AA)
  – For 8 AA the 3rd base is irrelevant, so those codons are immune to mutation at that position
• Anti-codon: reverse complement of a codon
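The counts above can be reproduced from the standard genetic code. The sketch below builds the 64-entry table from the widely used compact encoding (bases ordered T, C, A, G; `*` marks a stop codon). It uses the DNA alphabet (T instead of U) for convenience, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA string codon by codon (3 bases at a time)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def third_base_irrelevant():
    """Amino acids that have a codon family in which any 3rd base gives the same AA."""
    result = set()
    for first, second in product(BASES, repeat=2):
        encoded = {CODON_TABLE[first + second + third] for third in BASES}
        if len(encoded) == 1:
            result.add(encoded.pop())
    return result

n_codons = len(CODON_TABLE)                              # 64 triplets
n_amino_acids = len(set(CODON_TABLE.values()) - {"*"})   # 20 amino acids
```

`third_base_irrelevant()` returns eight amino acids (S, L, P, R, T, V, A, G), matching the count on the slide.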
Genetic Code
Amino Acids
• General structure of amino acids
  – an amino group
  – a carboxyl group
  – α-carbon bonded to a hydrogen and a side-chain group, R
• R determines the identity of a particular amino acid
• Color key for the figure: R: large white and gray; C: black; Nitrogen: blue; Oxygen: red; Hydrogen: white
Genome
• Genome
  – The entire DNA of a cell is its genome
  – Individual units coding for proteins or RNA are genes
  – A protein-coding gene starts with ATG and ends with one or two stop codons
  – This stretch is called an ORF (Open Reading Frame)
• Biological Info
  – Contained in the genome
  – Encoded in nucleotide sequences of DNA or RNA
  – Partitioned into discrete units, genes
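The ATG-to-stop definition of an ORF translates into a simple scan. The following is a minimal sketch, not a gene finder: it looks only at the three forward reading frames of one strand, assumes the standard stop codons TAA, TAG, and TGA, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for each ATG...stop stretch.

    `start` and `end` are 0-based, end-exclusive positions in `dna`;
    the returned sequence includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # codons before the stop, ATG included
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close it and keep scanning
    return orfs
```

Real annotation pipelines also scan the reverse strand and apply length thresholds and statistical models; `min_codons` is only a crude stand-in for the former.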
Cell
• Two broad kinds of cells
  – Prokaryote (/proekaeriəts/): pro for “before”, karyon for “kernel” in Greek
  – Eukaryote: eu for “true”
  – Main difference is the presence of organelles, especially the nucleus, in eukaryotes
Organelle              Prokaryotes           Eukaryotes
Nucleus                No definite nucleus   Present
Cell membrane          Present               Present
Mitochondria           None                  Present
Endoplasmic reticulum  None                  Present
Ribosomes              Present               Present
Chloroplasts           None                  Present in green plants
[Figures: prokaryotic cell, animal cell, plant cell]
• Classification purely based on biochemistry (RNA): C. Woese, 1981
• Eubacteria (true bacteria)
• Archaea (archaebacteria, early bacteria)
• Eukarya (eukaryotes)
Three Domains
Genome Sequencing Projects
• Major genome sequencing centers
  – U.S. Dept. of Energy Joint Genome Institute (435 projects)
  – J. Craig Venter Institute (302)
  – The Institute for Genomic Research (TIGR) (206)
  – Washington Univ. (184)
  – Institut Pasteur, Univ. of Tokyo
  – www.ncbi.nlm.nih.gov/genomes/static/lcenters.html
• Completely sequenced genomes include
  – Several hundred bacteria, over 20 archaea, and over 30 eukarya
  – Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae)
• http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance
• http://www.genomesonline.org has the current status of genome projects
Genome Databases
• Completed genomes
  – ftp site: ftp://ftp.ncbi.nlm.nih.gov/genomes/
  – http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
  – http://www.ebi.ac.uk/genomes/mot/index.html
  – http:/pir.goergetown.edu/pirwww/search/genome.html
• Organism-specific databases
  – http://www.unledu/stc-95/ResTools/biotools/biotools10.html
  – http://www.fp.mcs.anl.gov/~gaasterland/genomes.html
  – http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
  – http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts
Genomes of Prokaryotes
• Circular double-stranded DNA
• Protein-coding regions do not contain introns
• Protein-coding regions are partially organized into operons: tandem genes transcribed into a single mRNA molecule
• The density of coding regions is high: ~89% in E. coli
• The trp operon in E. coli begins with a control region, followed by genes performing successive steps in the synthesis of the amino acid tryptophan
[Figure: trp operon, with genes trpE and trpD labeled]
Genome of E. coli
• Many E. coli proteins were known before the sequencing (1853 proteins)
• Genome of Escherichia coli, strain MG1655, published in 1997 by F. Blattner at Univ. Wisconsin
• 4.64 Mbp
  – 4284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc.
• Average size of ORF is 317 AA
• Average inter-genic gap is 118 bp
• About 3/4 of transcription units contain a single gene, and the rest are operons (gene clusters)
• 60% of protein functions are known
• http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
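The high coding density quoted for E. coli can be roughly reproduced from the numbers on this slide. This is a back-of-the-envelope sketch: it counts only the amino-acid codons of protein-coding genes (no stop codons, no RNA genes), so it should land slightly below the ~89% figure.

```python
GENOME_BP = 4_640_000     # 4.64 Mbp
PROTEIN_GENES = 4284
AVG_ORF_AA = 317          # average ORF length in amino acids

coding_bp = PROTEIN_GENES * AVG_ORF_AA * 3   # 3 bases per amino acid
coding_fraction = coding_bp / GENOME_BP

print(f"coding bases: {coding_bp:,} ({coding_fraction:.1%} of the genome)")
```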
Genome of Archaea
• Microorganism Methanococcus jannaschii thrives in hydrothermal vents at temperatures from 48 to 94 °C
• genes from 45 strains
• Capable of self-reproduction from inorganic components
• Metabolism is to synthesize methane from H2 and CO2
• Sequenced in 1996 by TIGR
• 1.665 Mbp chromosome consisting of a circular DNA molecule, plus two extra-chromosomal elements
• 1,784 protein-coding regions
• Proteins in archaea for transcription and translation are closer to those in eukaryotes
• Proteins involved in metabolism are closer to those of bacteria
Genomes of Eukarya
• Majority of DNA is in the nucleus
  – Organized into chromosomes containing a single DNA molecule each
• Smaller amount of DNA in organelles such as mitochondria and chloroplasts
  – Organelles originated as intra-cellular parasites
  – Organelle genomes usually have circular forms, but are sometimes linear or multi-circular
  – Genetic code is different from the one for nuclear genes
• Diverse among species
  – Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs
  – Human chromosome #2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13
• List of genome sequences
  – http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
Genome of Saccharomyces cerevisiae (Yeast)
• One of the simplest eukaryotic organisms
• Sequencing by about 100 labs, completed in 1996
• 12.06 Mbp
• 16 chromosomes
• 6,172 protein-coding genes
• Dense: only 231 genes contain introns
Genome of Caenorhabditis elegans (C. elegans)
• Completed in 1998
• First full DNA sequence of a multi-cellular organism
• 97 Mbp
• Paired chromosomes
  – XX for a self-fertilizing hermaphrodite (simultaneously male and female)
  – XO for male
• Avg. 5 introns per gene
• Proteins
  – 42% have homologues in other species
  – 34% specific to nematodes (round worms)
  – 24% no known homologues

Chromosome  Size (Mbp)  Protein genes  Kbp/gene
I           7.9         2803           5.06
II          8.5         3559           3.05
III         7.6         2508           5.40
IV          9.2         3094           5.17
V           9.8         4082           4.15
X           10.1        2631           6.54
Genome of Drosophila melanogaster (Fruit fly)
• Completed in 1999 by Celera Genomics and Berkeley
• 180 Mbp
• Five chromosomes: 3 large autosomes, Y, and a tiny fifth
• 13,601 genes, 1 gene/8 Kbp
• Has homologues of 289 human genes
  – Such as genes involved in cancer, cardiovascular, and neurological diseases
  – There are fly models for Parkinson's disease and malaria
Genome of Arabidopsis thaliana
• Relatively small genome, 146 Mbp, completed in 2000
• Five chromosomes
• 25,498 predicted genes; 1 gene/4.6 kbp
• Proteins
  – Most A. thaliana proteins have homologues in animals
  – 60% of genes have human homologues, e.g., BRCA2
• Gene distribution
  – Nucleus: genome size (125 Mbp), genes (25,500)
  – Chloroplast: genome (154 Kbp), genes (79)
  – Mitochondrion: genome (367 Kbp), genes (58)
• 20 of 54 genes in a 340-Kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands
Human Genome
• Human Genome Project
  – Conceived in 1984, begun in 1990, completed in 2001 ahead of the 2003 schedule
• What did the sequence reveal?
  – 3 Bbp (billion base pairs)
  – 24 chromosomes
    – 22 autosomes plus two sex chromosomes (X, Y)
    – Longest 250 Mbp, shortest 55 Mbp
  – Mitochondrial genome
    – Circular DNA molecule of 16.569 kbp (16,569 bp)
  – ~10^13 cells
  – How many is 3 Bbp?
    – A typical 11-pt font prints 60 nucleotides in about 10 cm (~4 in)
    – In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
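The printed-length estimate is straightforward arithmetic. The sketch below assumes the metric figure, 60 nucleotides per 10 cm of printed line, and a genome of 3 billion base pairs; the constant names are illustrative.

```python
GENOME_BP = 3_000_000_000   # 3 Bbp
BASES_PER_LINE = 60         # nucleotides printed per line segment
LINE_LENGTH_CM = 10.0       # assumed length of that segment in 11-pt type

segments = GENOME_BP / BASES_PER_LINE             # 5e7 segments of 60 bases
total_km = segments * LINE_LENGTH_CM / 100 / 1000 # cm -> m -> km
total_mi = total_km / 1.609344                    # km -> statute miles

print(f"{total_km:,.0f} km (about {total_mi:,.0f} mi)")
```

Under these assumptions the sequence printed on one line would span about 5,000 km.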
Genome of Homo sapiens
• 22 chromosomes plus X (163 Mbp) and Y (51 Mbp)
• Web resources
  – Interactive access to DNA and protein sequences: http://www.ensembl.org
  – Images of chromosomes, maps, loci: http://www.ncbi.nlm.nih.gov/projects/genome/guide/
  – Gene map 99: http://www.ncbi.nlm.nih.gov/genemap99
  – Overview of human genome structure: http://www.ims.u-tokyo.ac.jp/imsut/en
  – SNP (single nucleotide polymorphisms): http://snp.cshl.org
  – Human genetic diseases: http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man), http://www.geneclinics.org/profiles/all-html
Human Genome Insights (ENCODE)
• Majority of genome is transcribed
• ~50% transposons
• ~25% protein-coding genes / 1.3% exons
• ~23,700 protein-coding genes
• ~160,000 transcripts
• Average gene ~36,000 bp
  – 7 exons @ ~300 bp
  – 6 introns @ ~5,700 bp
  – 7 alternatively spliced products (95% of genes)
• RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
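The average-gene figures are mutually consistent, which a few lines of arithmetic can show. This is a rough check using the slide's rounded averages; the genome size of 3.2 Gbp is an assumption made here for the denominator.

```python
GENOME_BP = 3.2e9            # assumed total genome size
N_GENES = 23_700             # protein-coding genes
N_EXONS, EXON_BP = 7, 300
N_INTRONS, INTRON_BP = 6, 5_700

exon_bp_per_gene = N_EXONS * EXON_BP                     # 2,100 bp of exon per gene
gene_bp = exon_bp_per_gene + N_INTRONS * INTRON_BP       # average gene length
genic_fraction = N_GENES * gene_bp / GENOME_BP           # share of genome inside genes
exonic_fraction = N_GENES * exon_bp_per_gene / GENOME_BP # share of genome in exons

print(f"average gene: {gene_bp:,} bp")
print(f"genes: {genic_fraction:.1%} of genome, exons: {exonic_fraction:.1%}")
```

The results (about 36,300 bp per gene, roughly a quarter of the genome in genes, and between 1% and 2% in exons) are close to the rounded values listed above.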
Genome of Homo sapiens (cont’d)
• Repeat sequences: >50% of the genome
  – Short interspersed nuclear elements (SINEs): 13%; long interspersed nuclear elements (LINEs): 21%
  – Simple stutters (repeats of short oligomers, including mini- and micro-satellites)
  – Triplet repeats such as CAG are implicated in numerous diseases (CAG encodes glutamine, so expansions produce glutamine repeats in the protein)
• SNP (pronounced "snip")
  – An A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia
  – Progeria
  – Avg 1 SNP/Kbp (100 SNPs per 100 Kbp)
  – Many 100-Kbp regions tend to remain intact, with fewer than five SNPs
  – Discrete combinations of SNPs define an individual’s haplotype (haploid genotype)
  – Individual genomes are characterized by a distribution of genetic markers, including SNPs
  – Int’l HapMap Consortium
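The sickle-cell SNP is a one-base change within a single codon. The sketch below assumes the commonly cited codon change GAG to GTG (the slide gives only A->T and Glu -> Val); the lookup holds just the two codons needed, and `point_mutate` is a hypothetical helper.

```python
# Only the two codons relevant to this example (standard genetic code).
CODONS = {"GAG": "Glu", "GTG": "Val"}

def point_mutate(codon, position, new_base):
    """Return the codon with one base replaced at a 0-based position."""
    return codon[:position] + new_base + codon[position + 1:]

normal = "GAG"                          # assumed normal beta-globin codon
mutant = point_mutate(normal, 1, "T")   # the A->T substitution in the middle base

print(f"{normal} ({CODONS[normal]}) -> {mutant} ({CODONS[mutant]})")
```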
Genome of Homo sapiens (cont’d)
• SNP consortium
  – Collects human SNPs, nearly 5 million SNPs
  – The data show
    – Most variations appear in all populations
    – However, a few SNPs are unique to particular populations
    – Genomes of individuals from Japan and China are very similar
    – Chromosome X varies more than other chromosomes (X is more subject to selective pressure)
• Mitochondrial DNA
  – Double-stranded closed circular molecule of 16,569 bp
  – Inherited almost exclusively through maternal lines
  – Not subject to recombination, and changes only by mutation
  – About 1 mutation every 25,000 years
mtDNA and Y
• mtDNA
  – Inherited through maternal lines
  – Both sons and daughters get it from their mother
  – All existing sequence variants are traced back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago
  – Supports the “out of Africa” hypothesis
  – Avg difference in mtDNA between pairs of individuals is 61.1, between Africans is 76.7, between non-Africans is 38.5
  – The greater divergence within Africa suggests populations have lived there much longer than in the rest of the world
• Y chromosome
  – Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago
  – Most divergent sequences are found in Africans
Other Species
Organism                     Genome size (Mbp)  # of genes
Epstein-Barr virus           0.17               80
E. coli                      4.6                4,406
Yeast (S. cerevisiae)        12.5               6,172
Nematode worm (C. elegans)   100.3              19,099
Thale cress (A. thaliana)    115.4              25,498
Fruit fly (D. melanogaster)  128.3              13,601
Human (H. sapiens)           3223.0             20,500
Fugu (Takifugu rubripes)     390.0              30,000
Wheat                        16000.0            30,000
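One way to read the table is as gene density: genes per Mbp of genome. The sketch below uses the values exactly as listed above; the helper name `genes_per_mbp` is illustrative.

```python
# (organism, genome size in Mbp, number of genes), copied from the table above.
SPECIES = [
    ("Epstein-Barr virus", 0.17, 80),
    ("E. coli", 4.6, 4406),
    ("Yeast (S. cerevisiae)", 12.5, 6172),
    ("Nematode worm (C. elegans)", 100.3, 19099),
    ("Thale cress (A. thaliana)", 115.4, 25498),
    ("Fruit fly (D. melanogaster)", 128.3, 13601),
    ("Human (H. sapiens)", 3223.0, 20500),
    ("Fugu (Takifugu rubripes)", 390.0, 30000),
    ("Wheat", 16000.0, 30000),
]

def genes_per_mbp(size_mbp, n_genes):
    """Gene density: number of genes per million base pairs."""
    return n_genes / size_mbp

density = {name: genes_per_mbp(size, genes) for name, size, genes in SPECIES}
ranked = sorted(density, key=density.get, reverse=True)  # densest genome first

for name in ranked:
    print(f"{name:30s} {density[name]:8.1f} genes/Mbp")
```

The ranking makes the point of the table explicit: genome size grows by orders of magnitude from bacteria to wheat while gene counts stay within a much narrower range.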