91.350/580 Topics in Bioinformatics What is bioinformatics ? DNA sequences Protein sequences/structures Modeling/inference Intersection biology statistics computer science algorithms machine learning

Upload: domenic-jennings

Post on 12-Jan-2016



91.350/580 Topics in Bioinformatics

What is bioinformatics?
• DNA sequences
• Protein sequences/structures
• Modeling/inference

Intersection
• biology
• statistics
• computer science
• algorithms
• machine learning


Topics today

• Textbook, pgs. 16-19
• http://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH/PH709_DNA-Genetics/
• DNA & RNA
• Genes to Proteins
  – Transcription
  – Translation
• Genome


DNA & RNA


DNA and RNA

• DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed of linear chains of monomeric units of nucleotides

• A nucleotide has three parts: a sugar, a phosphate, and a base
• Four bases


Base Types

• Nucleic acid bases are of two types
• Pyrimidine [pairímədì:n] – C, T, U (two nitrogens in a 6-member ring at positions 1 and 3)
• Purine – A, G (pyrimidine ring fused to an imidazole ring (C3H4N2))


[Figure: the four bases A, T, G, C and the IUPAC ambiguity codes that group them: R, Y, W, S, M, K, B, V, D, H, N]
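The single-letter symbols in this figure are the standard IUPAC nucleotide ambiguity codes, each standing for a set of bases. A minimal Python sketch of that mapping follows; the names `IUPAC` and `matches` are illustrative and do not come from the slides.

```python
# IUPAC nucleotide ambiguity codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "W": "AT",    # Weak pairing
    "S": "CG",    # Strong pairing
    "M": "AC",    # aMino
    "K": "GT",    # Keto
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}


def matches(pattern: str, sequence: str) -> bool:
    """Return True if `sequence` fits the ambiguity `pattern` position by position."""
    return len(pattern) == len(sequence) and all(
        base in IUPAC[code] for code, base in zip(pattern, sequence)
    )


# R allows A, Y allows T, N allows any base, so "ATG" fits "RYN".
print(matches("RYN", "ATG"))
```

A lookup table like this is one simple way to test whether a concrete sequence fits a degenerate motif.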


Primary Structure of DNA and RNA

• Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone
• Sugar is deoxyribose in DNA (left) and ribose in RNA (right)
• Nitrogen-containing nucleobases are bonded to the sugar


Online course on Biology

• Educational Portal
• DNA chemical structure
• http://education-portal.com/academy/lesson/dna-and-the-chemical-structure-of-nucleic-acids.html


Secondary Structure

• Double helix – 1953, Watson and Crick, using X-ray diffraction
• Sugar-phosphate backbone is the outer part of the helix
• Two strands run in antiparallel directions
• Dimensions
  – Inside diameter of backbone: 11 Å (1.1 nm)
  – Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm)
  – Length of one complete turn: 34 Å, 10 base pairs
• Major and minor grooves – drugs or polypeptides bind to DNA


Secondary Structure of DNA

• Two strands are complementary
• Base pairing: A-T; G-C
• A pyrimidine and a purine form complementary hydrogen bonds
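Because the strands are complementary and antiparallel, the partner of any strand can be derived by complementing each base and reversing the result. A minimal sketch, assuming an uppercase A/C/G/T string (the function name is illustrative):

```python
# Map each base to its Watson-Crick partner: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5' to 3'."""
    # Complement every base, then reverse because the strands are antiparallel.
    return strand.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # complement is TACG, reversed: GCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing rules.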


Monomer counts in DNA

• In double strands
  – # of A = # of T; # of G = # of C
  – Erwin Chargaff’s 1st Parity Rule, 1951
• In a single strand?
  – # of A ≈ # of T; # of G ≈ # of C (approximately, for long sequences)
  – Erwin Chargaff’s 2nd Parity Rule
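The first parity rule follows directly from base pairing, and can be checked by counting bases over a strand together with its partner. The sketch below uses a short made-up sequence; a toy strand this short is not expected to follow the second (single-strand) rule, which is only an approximate observation on long genomic sequences.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def duplex_counts(strand: str) -> Counter:
    """Count bases over both strands of the duplex formed by `strand`."""
    partner = strand.translate(COMPLEMENT)[::-1]
    return Counter(strand + partner)


toy = "AAGCTTGACCA"           # hypothetical example sequence
single = Counter(toy)         # counts on one strand only
double = duplex_counts(toy)   # counts on the double strand

print("single strand:", dict(single))
print("double strand:", dict(double))
```

On the double strand the A and T counts are forced to be equal, as are G and C; on the single toy strand they are not.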


Importance of Hydrogen Bonding

• Many consider the hydrogen bond essential to the evolution of life
• An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force
• The orderly, repetitive arrangement of H bonds in polymers determines their shape


Online course on Biology

• Educational Portal
• Four bases
• http://education-portal.com/academy/lesson/dna-adenine-guanine-cytosine-thymine-complementary-base-pairing.html


Chromosome Length

• 3.4 Å per base
• 3 billion bases
• 1.8 meters of DNA
• 0.09 mm of chromatin after being wound on histones
• Five families of histones
  – H1/H5, H2A, H2B, H3, and H4
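The stretched-out length follows from multiplying the rise per base by the number of bases. A quick check of the arithmetic, treating the slide's round numbers as exact (the doubling for a diploid cell is an added assumption, not stated on the slide):

```python
RISE_PER_BASE_M = 3.4e-10   # 3.4 angstroms per base pair, in meters
HAPLOID_BASES = 3e9         # about 3 billion bases in one genome copy

haploid_length_m = RISE_PER_BASE_M * HAPLOID_BASES
diploid_length_m = 2 * haploid_length_m   # two copies per diploid cell

print(f"one genome copy: {haploid_length_m:.2f} m")
print(f"diploid cell:    {diploid_length_m:.2f} m")
```

This gives roughly 1 m per genome copy and 2 m per diploid cell, the same order of magnitude as the 1.8 m quoted here.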


RNA

• The sugar in an RNA nucleotide is ribose rather than 2’-deoxyribose
• Thymine is replaced by uracil (U)
• RNA polymers are usually a few thousand nucleotides or shorter
• RNA in cells is usually single-stranded
• RNA is considered to be the original gene-coding material, and it still encodes genes in a few viruses
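At the sequence level, the DNA-to-RNA difference amounts to writing U wherever the DNA coding strand has T. A minimal sketch (the function name is illustrative; it models only the alphabet change, not the transcription machinery):

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCT"))  # AUGGCU
```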


RNA Types

• Four RNAs are involved in protein synthesis

RNA Type            Size       Function
Transfer RNA        Small      Transports amino acids to protein synthesis sites
Ribosomal RNA       Variable   Combines with proteins to form the ribosome, where the polypeptide chain grows
Messenger RNA       Variable   Carries the amino-acid sequence transcribed from genes
Small nuclear RNA   Small      Processes initial mRNA to its mature form in eukaryotes


Online course on Biology

• Educational Portal
• RNA
• http://education-portal.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna-trna-rrna.html


Gene to Protein: Transcription & Translation


Genome

• One or more chromosomes containing the code (genes) that directs the synthesis of the proteins essential for an organism’s structure and function

• Human: 22 pairs of homologous chromosomes plus the sex chromosomes (X, Y)

• http://www.ncbi.nlm.nih.gov/genome/?term=txid9606[orgn]


Genes

• Allele
  – Alternative forms of the same gene
  – Dominant, recessive


Gene to Protein: Transcription & Translation



Transcription



Gene to Protein

[Figure: gene structure. Labels: exon, intron, protein coding region (Protein 1, Protein 2), 5’ UTR, 3’ UTR, intergenic region, non-protein coding region]


Alternative Splicing


Example


Translation

• Genetic code
  – A triplet of bases (called a codon) specifies one amino acid
  – The ribosome moves along the mRNA 3 bases at a time
• Degenerate coding
  – 4x4x4 = 64 possible triplets map onto 20 amino acids
  – For 8 amino acids the 3rd base is irrelevant, so a mutation at that position does not change the amino acid
• Anti-codon – reverse complement of a codon
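The degeneracy described here can be checked against the standard genetic code. The sketch below builds the 64-codon table in DNA letters from the usual compact encoding (bases ordered T, C, A, G; `*` marks a stop codon), translates a short made-up sequence, and lists the amino acids for which the third base does not matter. The names `CODON_TABLE`, `translate`, and `fourfold` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, reading 3 bases at a time from position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Amino acids with at least one codon family where all four third bases agree.
fourfold = sorted(
    {
        CODON_TABLE[first + second + "A"]
        for first in BASES
        for second in BASES
        if len({CODON_TABLE[first + second + third] for third in BASES}) == 1
    }
)

print(len(CODON_TABLE), "codons")
print(translate("ATGGCCTAA"))      # Met, Ala, stop
print(len(fourfold), "amino acids with an irrelevant 3rd base:", fourfold)
```

The count comes out to eight amino acids, in agreement with the slide.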


Genetic Code



Translation


Genetic Code


Amino Acids

• General structure of amino acids
  – an amino group
  – a carboxyl group
  – α-carbon bonded to a hydrogen and a side-chain group, R
• R determines the identity of a particular amino acid
• Color key for the figure
  – R: large white and gray
  – C: black
  – Nitrogen: blue
  – Oxygen: red
  – Hydrogen: white


Genome


Genome

• Genome
  – The entire DNA of a cell is its genome
  – Individual units coding for proteins or RNA are genes
  – A protein-coding gene starts with ATG and ends with one or two stop codons
  – Such a stretch is called an ORF (Open Reading Frame)
• Biological info
  – Contained in the genome
  – Encoded in nucleotide sequences of DNA or RNA
  – Partitioned into discrete units, genes
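The ORF definition above (start at ATG, read in triplets, stop at a stop codon) translates into a simple scan. The following is a minimal sketch, not a production gene finder: it looks only at the three forward reading frames, ignores the reverse strand, and the `min_codons` threshold is a hypothetical parameter introduced for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end) index pairs of ATG...stop stretches, forward frames only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGGCTTAAGG"   # made-up example containing ATG GCT TAA
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```

Real genomes need both strands, a realistic minimum length, and handling of alternative start codons, all of which are left out here.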


Cell

– Different levels of cells
– Prokaryote (/proʊˈkærioʊts/; karyon, “kernel” in Greek; pro for “before”)
– Eukaryote (eu for “true”)
– Main difference is the presence of organelles, especially the nucleus, in eukaryotes

Organelle               Prokaryotes           Eukaryotes
Nucleus                 No definite nucleus   Present
Cell membrane           Present               Present
Mitochondria            None                  Present
Endoplasmic reticulum   None                  Present
Ribosomes               Present               Present
Chloroplasts            None                  Present in green plants


[Figure: prokaryotic cell, animal cell, plant cell]


Three Domains

• Classification purely based on biochemistry (RNA)
  – C. Woese, 1981
• Eubacteria (true bacteria)
• Archaea (archaebacteria, early bacteria)
• Eukarya (eukaryotes)


Genome Sequencing Projects

• Major genome sequencing centers
  – U.S. Dept. of Energy Joint Genome Institute (435 projects)
  – J. Craig Venter Institute (302)
  – The Institute for Genomic Research (TIGR) (206)
  – Washington Univ. (184)
  – Institut Pasteur, Univ. of Tokyo
  – www.ncbi.nlm.nih.gov/genomes/static/lcenters.html
• Completely sequenced genomes include
  – Several hundred bacteria, over 20 archaea, and over 30 eukarya
  – Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae)
• http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance
• http://www.genomesonline.org has the current status of genome projects


Genome Databases

• Completed genomes
  – ftp site: ftp://ftp.ncbi.nlm.nih.gov/genomes/
  – http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
  – http://www.ebi.ac.uk/genomes/mot/index.html
  – http://pir.georgetown.edu/pirwww/search/genome.html
• Organism-specific databases
  – http://www.unl.edu/stc-95/ResTools/biotools/biotools10.html
  – http://www.fp.mcs.anl.gov/~gaasterland/genomes.html
  – http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
  – http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts


Genomes of Prokaryotes

• Circular double-stranded DNA
• Protein-coding regions do not contain introns
• Protein-coding regions are partially organized into operons – tandem genes transcribed into a single mRNA molecule
• The density of coding regions is high: ~89% in E. coli

[Figure: the trp operon, with genes labeled trpE, trpD] The trp operon in E. coli begins with a control region, followed by genes performing successive steps in the synthesis of the amino acid tryptophan


Genome of E. coli

• Many E. coli proteins were known before the sequencing (1,853 proteins)
• Genome of Escherichia coli, strain MG1655
  – Published in 1997 by F. Blattner at Univ. of Wisconsin
  – 4.64 Mbp
  – 4,284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc.
  – Average size of an ORF is 317 amino acids
  – Average intergenic gap is 118 bp
  – About ¾ of the transcribed units are single genes, and the rest are operons (gene clusters)
  – Functions are known for 60% of the proteins
• http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
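The numbers on this slide are enough for a rough check of the high coding density quoted for E. coli on the earlier prokaryote slide (~89%). The estimate below multiplies the gene count by the average ORF length; it ignores stop codons, RNA genes, and overlaps, so it is only a back-of-the-envelope figure.

```python
PROTEIN_GENES = 4284       # protein-coding genes in strain MG1655
AVG_ORF_AA = 317           # average ORF length in amino acids
GENOME_BP = 4.64e6         # genome size in base pairs

coding_bp = PROTEIN_GENES * AVG_ORF_AA * 3   # 3 bases per amino acid
coding_density = coding_bp / GENOME_BP

print(f"protein-coding bases: {coding_bp:,}")
print(f"coding density: {coding_density:.1%}")
```

The result is about 88%, close to the quoted value.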


Genome of Archaea

• Microorganism Methanococcus jannaschii
  – Thrives in hydrothermal vents at temperatures from 48 to 94 °C
  – genes from 45 strains
  – Capable of self-reproduction from inorganic components
  – Its metabolism synthesizes methane from H2 and CO2
• Sequenced in 1996 by TIGR
  – 1.665 Mbp chromosome consisting of a circular DNA molecule, plus two extra-chromosomal elements
  – 1,784 protein-coding regions
• Proteins in archaea for transcription and translation are closer to those in eukaryotes
• Proteins involved in metabolism are closer to those of bacteria


Genomes of Eukarya

• Majority of DNA is in the nucleus
  – Organized into chromosomes, each containing a single DNA molecule
• Smaller amount of DNA in organelles such as mitochondria and chloroplasts
  – Organelles originated as intracellular parasites
  – Organelle genomes are usually circular, but sometimes linear or multi-circular
  – Their genetic code is different from the one for nuclear genes
• Diverse among species
  – Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs
  – Human chromosome #2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13
• List of genome sequences
  – http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes


Genome of Saccharomyces cerevisiae (Yeast)

• One of the simplest eukaryotic organisms
• Sequencing by about 100 labs, completed in 1996
• 12.06 Mbp
• 16 chromosomes
• 6,172 protein-coding genes
• Dense: only 231 genes contain introns


Genome of Caenorhabditis elegans (C. elegans)

• Completed in 1998
• First full DNA sequence of a multicellular organism
• 97 Mbp
• Paired chromosomes
  – XX for a self-fertilizing hermaphrodite (simultaneously male and female)
  – XO for male
• Avg. 5 introns per gene
• Proteins
  – 42% have homologues in other species
  – 34% are specific to nematodes (roundworms)
  – 24% have no known homologues

Chromosome   Size (Mbp)   Protein genes   Kbp/gene
I            7.9          2803            5.06
II           8.5          3559            3.05
III          7.6          2508            5.40
IV           9.2          3094            5.17
V            9.8          4082            4.15
X            10.1         2631            6.54


Genome of Drosophila melanogaster (Fruit fly)

• Completed in 1999 by Celera Genomics and Berkeley
• 180 Mbp
• Five chromosomes: 3 large autosomes, Y, and a tiny fifth
• 13,601 genes, 1 gene/8 Kbp
• Has homologues of 289 human genes
  – Including genes implicated in cancer, cardiovascular, neurological, and other diseases
  – There are fly models for Parkinson’s disease and malaria


Genome of Arabidopsis thaliana

• Relatively small genome, 146 Mbp, completed in 2000
• Five chromosomes
• 25,498 predicted genes; 1 gene/4.6 Kbp
• Proteins
  – Most A. thaliana proteins have homologues in animals
  – 60% of genes have human homologues, e.g., BRCA2
• Gene distribution
  – Nucleus: genome size (125 Mbp), genes (25,500)
  – Chloroplast: genome (154 Kbp), genes (79)
  – Mitochondrion: genome (367 Kbp), genes (58)


20 of 54 genes in a 340-Kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands


Human Genome

• Human Genome Project
  – Conceived in 1984, begun in 1990, draft sequence completed in 2001, ahead of the 2003 schedule
• What did the sequence reveal?
  – 3 Bbp (billion base pairs)
  – 24 chromosomes
    – 22 autosomes plus two sex chromosomes (X, Y)
    – Longest 250 Mbp, shortest 55 Mbp
  – Mitochondrial genome
    – Circular DNA molecule of 16.569 Kbp (16,569 bp)
  – ~10^13 cells
• How many is 3 Bbp?
  – A typical 11-pt font prints 60 nucleotides in about 10 cm (~4 in)
  – In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
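The printed-length estimate is plain arithmetic and easy to verify. The sketch below assumes 60 nucleotides per 10 cm of text (the 10 cm figure from the slide) and a genome of exactly 3 billion base pairs; the variable names are illustrative.

```python
GENOME_BP = 3e9          # 3 Bbp
NT_PER_LINE = 60         # nucleotides printed per line of 11-pt text
LINE_LENGTH_CM = 10      # assumed length of that line
KM_PER_MILE = 1.609344

lines = GENOME_BP / NT_PER_LINE
length_km = lines * LINE_LENGTH_CM / 100 / 1000   # cm -> m -> km
length_mi = length_km / KM_PER_MILE

print(f"{lines:.0f} lines of text")
print(f"{length_km:.0f} km, or about {length_mi:.0f} miles")
```

Under these assumptions the text stretches about 5,000 km, a little over 3,100 miles.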


Genome of Homo sapiens

• 22 autosomes plus X (163 Mbp) and Y (51 Mbp)
• Web resources
  – Interactive access to DNA and protein sequences: http://www.ensembl.org
  – Images of chromosomes, maps, loci: http://www.ncbi.nlm.nih.gov/projects/genome/guide/
  – Gene map 99: http://www.ncbi.nlm.nih.gov/genemap99
  – Overview of human genome structure: http://www.ims.u-tokyo.ac.jp/imsut/en
  – SNPs (single nucleotide polymorphisms): http://snp.cshl.org
  – Human genetic diseases: http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man) and http://www.geneclinics.org/profiles/all-html


Human Genome Insights (ENCODE)

• Majority of the genome is transcribed
• ~50% transposons
• ~25% protein coding genes / 1.3% exons
• ~23,700 protein coding genes
• ~160,000 transcripts
• Average gene ~36,000 bp
  – 7 exons @ ~300 bp
  – 6 introns @ ~5,700 bp
  – 7 alternatively spliced products (95% of genes)
• RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
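The average-gene figures on this slide can be cross-checked against each other. The sketch below recomputes the average gene length from the exon and intron numbers and then estimates what fraction of a 3 Bbp genome the genes and exons would cover. The 3 Bbp genome size is taken from the earlier Human Genome slide, and all inputs are rounded averages, so only rough agreement should be expected.

```python
EXONS_PER_GENE, EXON_BP = 7, 300
INTRONS_PER_GENE, INTRON_BP = 6, 5700
PROTEIN_GENES = 23700
GENOME_BP = 3e9   # assumed round genome size

gene_bp = EXONS_PER_GENE * EXON_BP + INTRONS_PER_GENE * INTRON_BP
genic_fraction = PROTEIN_GENES * gene_bp / GENOME_BP
exonic_fraction = PROTEIN_GENES * EXONS_PER_GENE * EXON_BP / GENOME_BP

print(f"average gene length: {gene_bp:,} bp")
print(f"genes cover about {genic_fraction:.0%} of the genome")
print(f"exons cover about {exonic_fraction:.1%} of the genome")
```

The gene length comes out at 36,300 bp, matching the ~36,000 bp average, and the coverage estimates land in the same range as the ~25% and ~1.3% quoted above.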


Genome of Homo sapiens (cont’d)

• Repeat sequences: >50% of the genome
  – Short interspersed nuclear elements (SINEs): 13%; long interspersed nuclear elements (LINEs): 21%
  – Simple stutters (repeats of short oligomers, including mini- and micro-satellites)
  – Triplet repeats such as CAG are implicated in numerous diseases (e.g., glutamine repeats in the huntingtin protein)
• SNP (pronounced “snip”)
  – An A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia
  – Progeria
  – Avg. 1 SNP/Kbp (100 SNPs per 100 Kbp)
  – Many 100-Kbp regions tend to remain intact, with fewer than five SNPs
  – Discrete combinations of SNPs define an individual’s haplotype (haploid genotype)
• Individual genomes are characterized by a distribution of genetic markers, including SNPs
  – Int’l HapMap Consortium
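The beta-globin example shows how a single-base change can alter a protein. As an illustration, the sketch below applies an A-to-T change at the middle position of the codon GAG and looks both codons up in the standard genetic code; the table is built from the usual compact encoding, and the variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in DNA letters; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

normal_codon = "GAG"
mutant_codon = normal_codon[0] + "T" + normal_codon[2]   # A -> T at position 2

print(normal_codon, "->", CODON_TABLE[normal_codon])   # E = glutamate (Glu)
print(mutant_codon, "->", CODON_TABLE[mutant_codon])   # V = valine (Val)
```

One substituted base is enough to swap a charged amino acid for a hydrophobic one.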


Genome of Homo sapiens (cont’d)

• SNP consortium
  – Collects human SNPs, nearly 5 million
• SNPs show
  – Most variations appear in all populations
  – However, a few SNPs are unique to particular populations
  – Genomes of individuals from Japan and China are very similar
  – Chromosome X varies less than other chromosomes (X is more subject to selective pressure)
• Mitochondrial DNA
  – Double-stranded closed circular molecule of 16,569 bp
  – Inherited almost exclusively through maternal lines
  – Not subject to recombination; changes only by mutation
  – About 1 mutation every 25,000 years


mtDNA and Y

• mtDNA
  – Inherited through maternal lines
  – Both sons and daughters get it from their mother
  – All existing sequence variants are traced back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago
  – Supports the “out of Africa” hypothesis
  – Avg. difference in mtDNA between pairs of individuals is 61.1; between Africans it is 76.7; between non-Africans, 38.5
  – Populations have been diverging in Africa for much longer than in the rest of the world
• Y chromosome
  – Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago
  – Most divergent sequences are found among Africans
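An average pairwise difference like the ones quoted for mtDNA can be computed by counting mismatched positions between every pair of aligned sequences and taking the mean. The sketch below shows the idea on three short made-up sequences; it assumes the sequences are already aligned and of equal length, and it is not the method used to obtain the numbers on the slide.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_difference(sequences: list) -> float:
    """Average Hamming distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


toy_population = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]   # hypothetical data
print(mean_pairwise_difference(toy_population))
```

A larger average within one group than another indicates more accumulated divergence in that group.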


Other Species

Organism                       Genome size     # of genes
Epstein-Barr virus             0.17 Mbp        80
E. coli                        4.6 Mbp         4,406
Yeast (S. cerevisiae)          12.5 Mbp        6,172
Nematode worm (C. elegans)     100.3 Mbp       19,099
Thale cress (A. thaliana)      115.4 Mbp       25,498
Fruit fly (D. melanogaster)    128.3 Mbp       13,601
Human (H. sapiens)             3223.0 Mbp      20,500
Fugu (Takifugu rubripes)       390.0 Mbp       30,000
Wheat                          16000.0 Mbp     30,000
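One way to compare these genomes is by how much sequence there is per gene. The sketch below copies the table values into a dictionary and ranks the organisms by Kbp per gene; the names `GENOMES` and `kbp_per_gene` are illustrative.

```python
# (genome size in Mbp, number of genes), copied from the table above
GENOMES = {
    "Epstein-Barr virus": (0.17, 80),
    "E. coli": (4.6, 4406),
    "Yeast (S. cerevisiae)": (12.5, 6172),
    "Nematode worm (C. elegans)": (100.3, 19099),
    "Thale cress (A. thaliana)": (115.4, 25498),
    "Fruit fly (D. melanogaster)": (128.3, 13601),
    "Human (H. sapiens)": (3223.0, 20500),
    "Fugu (Takifugu rubripes)": (390.0, 30000),
    "Wheat": (16000.0, 30000),
}


def kbp_per_gene(size_mbp: float, genes: int) -> float:
    """Average amount of genome, in Kbp, per gene."""
    return size_mbp * 1000 / genes


ranked = sorted(GENOMES, key=lambda name: kbp_per_gene(*GENOMES[name]))
for name in ranked:
    print(f"{name:30s} {kbp_per_gene(*GENOMES[name]):8.1f} Kbp/gene")
```

The ranking makes the point of the table explicit: gene count grows far more slowly than genome size, so larger genomes carry much more sequence per gene.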