genomics, bioinformatics and the revolution in biology
Post on 19-Jan-2016
Genomics, Bioinformatics and the Revolution in Biology
Jonathan Pevsner, Ph.D.
Kennedy Krieger Institute/Johns Hopkins School of Medicine
Outline
Three views of bioinformatics and genomics
• Informatics
• From small to large
• From genotype to phenotype
The chromosomes
SNPs, HapMap, and the 1000 Genomes project
Definitions of bioinformatics and genomics
• Bioinformatics is the interface of biology and computers. It is the analysis of proteins, genes and genomes using computer algorithms and databases.
• Genomics is the analysis of genomes, including the nature of genetic elements on chromosomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
• Genetics is the study of the origin and expression of individual uniqueness.
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
[Figure: the field of informatics. Labels: tool-users, tool-makers; bioinformatics, public health informatics, medical informatics; infrastructure, databases, algorithms; genomics.]
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
DNA, RNA, protein, phenotype
Rapid growth of DNA sequences
[Figure: total number of DNA base pairs in GenBank/WGS, 1982 to 2008. Horizontal axis: year. Vertical axes: sequences (millions) and base pairs (billions).]
Time of development
Body region, physiology, pharmacology, pathology
The Origin of Species (1859)
It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.
Source: Origin of Species, Chapter 15
Eukaryotes (Baldauf et al. 2000)
animals
fungi
plants
slime mold
Giardia
Trichomonas
Paramecium
Trypanosoma
Plasmodium
Wolfe et al. (1999)
8 chromosomes (5,000 genes)
16 chromosomes (10,000 genes)
16 chromosomes (6,000 genes)
Paramecium tetraurelia: a ciliate with two nuclei, 40,000 genes, and three whole-genome duplications
Phylogenetic footprinting
Population shadowing
Phylogenetic shadowing
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
DNA
RNA
protein
cell
pathway
organism
population
Phenotype
We see 500 inpatients and 13,000 outpatients per year at the Kennedy Krieger Institute. Why do children engage in self-injurious behavior? In many cases, there are chromosomal insults.
From genotype…
…to phenotype
DNA, RNA, protein
cellular phenotype
clinical phenotype
DNA, RNA, protein
Central dogma of molecular biology: DNA is transcribed into RNA, which is translated into protein.
Central dogma of bioinformatics/genomics: the genome is transcribed into the transcriptome, which is translated into the proteome.
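The two steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, uses the standard genetic code, and the example DNA is a toy sequence chosen only because it encodes the oxytocin peptide shown elsewhere in these slides (it is not taken from the real gene). The function names are hypothetical.

```python
# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Transcription (simplified): copy the coding strand, replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation (simplified): read codons from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")  # table is keyed on DNA letters
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon ends the peptide
            break
        protein.append(residue)
    return "".join(protein)


# Toy coding sequence (an assumption for illustration, not the real gene).
toy_dna = "TGCTACATCCAGAACTGCCCCCTGGGCTGA"
peptide = translate(transcribe(toy_dna))
print(peptide)
```

Applied genome-wide, the same two steps give the transcriptome and the proteome, although real transcripts also involve splicing and reading-frame selection, which this sketch ignores.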
Over 200 billion base pairs of DNA have now been sequenced, from >165,000 organisms.
[Figure: chart with an axis from 0 to 200, spanning 1982 to 2008.]
Scope of bioinformatics
Sequence analysis
• Pairwise alignment
• Multiple sequence alignment
• Phylogeny
• Database searching (e.g. BLAST)
Functional genomics
• RNA studies; gene expression profiling
• Proteomics; protein structure
• Gene function
Pairwise alignments in the 1950s
β-corticotropin (sheep)    ala gly glu asp asp glu
Corticotropin A (pig)      asp gly ala glu asp glu
Oxytocin                   CYIQNCPLG
Vasopressin                CYFQNCPRG
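The oxytocin/vasopressin comparison is small enough to check by hand, and the same comparison can be sketched in Python. This is only an ungapped, position-by-position comparison of two equal-length peptides; real pairwise alignment also handles gaps and uses scoring matrices. The function names are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions with identical residues (ungapped, equal length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped comparison needs sequences of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def mismatches(seq_a: str, seq_b: str):
    """List (1-based position, residue in a, residue in b) where the two differ."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"

print(mismatches(oxytocin, vasopressin))
print(round(percent_identity(oxytocin, vasopressin), 1))
```

The two peptides differ at positions 3 and 8 only, so seven of nine residues are identical.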
Early example of sequence alignment: globins (1961)
H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.
myoglobin globins:
2e Fig. 5.21
LAGAN
Multiple sequence alignment of five globins:
ClustalW
Praline
MUSCLE
Probcons
TCoffee
Four bases: A, G, C, T arranged in base pairs along a double helix (1953).
Human genome project: sequencing all ~3 billion base pairs (2003).
1995: first genome sequence (a bacterium)
2000: fruit fly genome, plant
2003: human genome
2008:
--two individual human genomes finished
--1,000 human genomes (launched)
--SNPs used to study chromosomes
Time of development
Body region, physiology, pharmacology, pathology
Phenotype
Genotype
Outline
Three views of bioinformatics and genomics
• Informatics
• From small to large
• From genotype to phenotype
The chromosomes
SNPs, HapMap, and the 1000 Genomes project
Eukaryotic genomes are organized into chromosomes
Genomic DNA is organized in chromosomes. The diploid number of chromosomes is constant in each species (e.g. 46 in human). Chromosomes are distinguished by a centromere and telomeres.
The chromosomes are routinely visualized by karyotyping (imaging the chromosomes during metaphase, when each chromosome is a pair of sister chromatids).
Fig. 16.19, page 565
human chromosome 21 at NCBI (labels: nucleolar organizing center, centromere)
human chromosome 21 at www.ensembl.org
human chromosome 21 at UCSC Genome Browser (label: centromere)
First P.G. mitosis in polar view. Tradescantia virginiana, Commelinaceae, n = 9 (from aberrant plant with 22 chromosomes). 2 BE - CV smears. x 1200. Printed on multigrade paper. Darlington.
First P.G. mitosis in Paris quadrifolia, Liliaceae, showing all stages from prophase to telophase. n = 10 (cf. Darlington 1937, 1941) 2 BE – CV smear, 8mm. objective. x 800. Darlington.
Root tip squashes showing anaphase separation. Fritillaria pudica, 3x = 39, spiral structure of chromatids revealed by pressure after cold treatment. 2 BD – Feulgen; x 3000. Darlington.
Cleavage mitosis in the morula of the teleostean fish, Coregonus clupeoides, in the middle of anaphase. Spindle structure revealed by slow fixation. Section cut at 10 µ. x 4000. Strong Flemming, haematoxylin. Prep. and photo by P.C. Koller. Darlington.
The eukaryotic chromosome: Robertsonian fusion creates one metacentric by fusion of two acrocentrics
Ohno (1970) Plate II
ordinary male house mouse (Mus musculus, 2n = 40)
male tobacco mouse (Mus poschiavinus, 2n = 26)
The spectrum of variation
Category of variation          Size            Type
Single base pair changes       1 bp            SNPs, point mutations
Small insertions/deletions     1 – 50 bp
Short tandem repeats           1 – 500 bp      microsatellites
Fine-scale structural var.     50 bp – 5 kb    del, dup, inv, tandem repeats
Retroelement insertions        0.3 – 10 kb     SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.     5 kb – 50 kb    del, dup, inv, tandem repeats
Large-scale structural var.    50 kb – 5 Mb    del, dup, inv, large tandem repeats
Chromosomal variation          >>5 Mb          aneuploidy
Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42
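The size ranges in this table overlap, so a variant's length alone does not pick out a single category. A minimal Python sketch of a size lookup makes that concrete. The boundaries are copied from the table; treating ">>5 Mb" as "more than 5 Mb" is an assumption made here, and the names are hypothetical.

```python
from typing import List, Optional, Tuple

# (category, minimum size in bp, maximum size in bp or None for open-ended)
SIZE_CLASSES: List[Tuple[str, int, Optional[int]]] = [
    ("Single base pair changes", 1, 1),
    ("Small insertions/deletions", 1, 50),
    ("Short tandem repeats", 1, 500),
    ("Fine-scale structural variation", 50, 5_000),
    ("Retroelement insertions", 300, 10_000),
    ("Intermediate-scale structural variation", 5_000, 50_000),
    ("Large-scale structural variation", 50_000, 5_000_000),
    ("Chromosomal variation", 5_000_001, None),  # assumption: ">>5 Mb" read as > 5 Mb
]


def size_classes(length_bp: int) -> List[str]:
    """Return every category whose size range contains the given length."""
    if length_bp < 1:
        raise ValueError("length must be at least 1 bp")
    return [
        name
        for name, low, high in SIZE_CLASSES
        if length_bp >= low and (high is None or length_bp <= high)
    ]


print(size_classes(1))
print(size_classes(2_000))
print(size_classes(10_000_000))
```

A 2 kb event, for example, falls in both the fine-scale structural and retroelement ranges, so the type of event (deletion, repeat, insertion) is needed to classify it.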
Across the genome, there are four possible SNP calls:
[1] homozygous (AA)
[2] homozygous (BB)
[3] heterozygous (AB)
[4] no call
In a deleted region, there are three possible SNP calls:
[1] A (interpreted as AA)
[2] B (interpreted as BB)
[3] no call
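Because a deleted region cannot yield heterozygous (AB) calls, one simple way to flag candidate deletions is to look for long runs of consecutive SNPs with no AB call. The Python sketch below assumes calls are given in chromosomal order as the strings "AA", "BB", "AB" and "NC" (no call); the `min_snps` threshold and the function name are hypothetical. Stretches without heterozygosity can have other causes, so a run like this is only a candidate that would need confirmation, for example from signal intensity.

```python
from typing import List, Tuple


def runs_without_heterozygotes(calls: List[str], min_snps: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs with no AB call.

    A run is reported only if it contains at least `min_snps` homozygous
    (AA or BB) calls; no-calls ("NC") do not break a run and are not counted.
    """
    runs = []
    start = None
    homozygous = 0
    # A trailing "AB" sentinel closes any run that reaches the end of the list.
    for i, call in enumerate(calls + ["AB"]):
        if call == "AB":
            if start is not None and homozygous >= min_snps:
                runs.append((start, i))
            start, homozygous = None, 0
        else:
            if start is None:
                start = i
            if call in ("AA", "BB"):
                homozygous += 1
    return runs


# Hypothetical calls along one chromosome arm.
calls = ["AB", "AA", "BB", "AB", "AA", "AA", "NC", "BB", "BB", "AA", "AB", "BB"]
print(runs_without_heterozygotes(calls, min_snps=5))
```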
Single nucleotide polymorphisms (SNPs) to investigate chromosomes: A case of 7p deletion
[Figure: A case of 7p deletion. SNP genotype plots with call clusters AA, AB, BB across the genome and A, B in the deleted region.]
• Deletions (and duplications) such as these are called copy number variants (CNVs).
• CNVs commonly occur in normal individuals.
• When found in individuals with disease, we can tell if they are inherited (likely to be benign) or occur de novo (more likely to be disease-associated) by comparison to the parents' genotypes.
• Recent papers report many CNVs in disease.
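The comparison to the parents can be sketched as an interval-overlap check. This Python example is a simplification under stated assumptions: CNVs are represented as (chromosome, start, end) intervals, a child CNV counts as inherited if a parental CNV covers at least a chosen fraction of it, and the 0.5 threshold, the coordinates and the function names are all hypothetical. A real analysis would also match CNV type (deletion versus duplication) and account for calling errors.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # chromosome, start, end (end exclusive)


def overlap_fraction(query: Interval, other: Interval) -> float:
    """Fraction of `query` that is covered by `other`."""
    if query[0] != other[0]:
        return 0.0
    shared = min(query[2], other[2]) - max(query[1], other[1])
    return max(shared, 0) / (query[2] - query[1])


def classify_cnv(child_cnv: Interval, parental_cnvs: List[Interval],
                 min_overlap: float = 0.5) -> str:
    """Label a child's CNV 'inherited' if a parent carries an overlapping CNV."""
    for parental in parental_cnvs:
        if overlap_fraction(child_cnv, parental) >= min_overlap:
            return "inherited"
    return "de novo"


# Hypothetical coordinates for illustration only.
child = ("chr7", 1_000_000, 3_000_000)
print(classify_cnv(child, [("chr7", 900_000, 2_900_000)]))
print(classify_cnv(child, [("chr10", 900_000, 2_900_000)]))
```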
A case of 7p deletion
A case of trisomy 21 (Down syndrome)
AAA AAB ABB BBB
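The genotype classes seen at each copy number follow from counting unordered combinations of the two alleles: one copy gives A or B, two copies give AA, AB or BB, and three copies give the four classes listed above. A minimal Python sketch, with a hypothetical function name:

```python
from itertools import combinations_with_replacement
from typing import List


def genotype_classes(copy_number: int) -> List[str]:
    """Enumerate the unordered A/B genotype classes for a given copy number."""
    if copy_number < 1:
        raise ValueError("copy number must be at least 1")
    return ["".join(g) for g in combinations_with_replacement("AB", copy_number)]


for copies in (1, 2, 3):
    print(copies, genotype_classes(copies))
```

In general a biallelic SNP at copy number n has n + 1 genotype classes, which is why a trisomic chromosome shows four clusters instead of three.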
Three cases of 10q deletion
Deafness gene?
The International HapMap Project
► A catalog of common genetic variants that occur in humans
► The project's goal is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared
► An initial focus has been on four groups (n=270):
CEU        European ancestry (30 trios)       Utah residents
YRI        African ancestry (30 trios)        Yoruba in Ibadan, Nigeria
JPT/CHB    Asian ancestry (90 individuals)    Japanese in Tokyo, Japan; Han Chinese in Beijing, China
► Phase I (2005): > 1 million SNPs
► Phase II (2007): added 2.1 million SNPs
The International HapMap Project
► In addition to CEU, YRI, and JPT/CHB, additional populations have been genotyped, including:
Maasai in Kinyawa, Kenya
Luhya in Webuye, Kenya
Gujarati Indians in Houston, TX
Toscani in Italy
Mexican ancestry in Los Angeles
African ancestry in southwestern US
The ENCODE project
► The ENCyclopedia Of DNA Elements (ENCODE) project was launched in 2003
► Pilot phase: devise and test high-throughput approaches to identify functional elements. Efforts center on 44 DNA targets. These cover about 1 percent of the human genome, or about 30 million base pairs.
► Second phase: technology development.
► Third phase: production. Expand the ENCODE project to analyze the remaining 99 percent of the human genome.
The ENCODE project
Goal of ENCODE: build a list of all sequence-based functional elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene transcription
► DNA sequences that mediate chromosomal structure and dynamics.
ENCODE data at the UCSC Genome Browser: beta globin
HBB, HBD, HBG1, HBG2, HBE1
ENCODE data at the UCSC Genome Browser: beta globin (50,000 base pairs including HBB, HBD, HBG1, HBG2, HBE1)
ENCODE tracks available at the UCSC Genome Browser
EGASP: the human ENCODE Genome Annotation Assessment Project
EGASP goals:
[1] Assess the accuracy of computational methods to predict protein coding genes. 18 groups competed to make gene predictions, blind; these were evaluated relative to reference annotations generated by the GENCODE project.
[2] Assess the completeness of the current human genome annotations as represented in the ENCODE regions.
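One common way to score gene predictions against a reference annotation is at the nucleotide level: what fraction of annotated coding bases were predicted (sensitivity), and what fraction of predicted bases are annotated (precision, which the gene-prediction literature often calls specificity). The Python sketch below illustrates that idea only; it is not a reproduction of the EGASP evaluation protocol, and the exon coordinates and function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # start, end (end exclusive) on one sequence


def covered_positions(exons: Iterable[Exon]) -> Set[int]:
    """Collect every base position covered by at least one exon."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(predicted: Iterable[Exon],
                        reference: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of predicted coding bases vs. the reference."""
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return sensitivity, precision


# Hypothetical exon coordinates for illustration only.
reference_exons = [(100, 200), (300, 400)]
predicted_exons = [(150, 200), (300, 400)]
print(nucleotide_accuracy(predicted_exons, reference_exons))
```

Here the prediction misses half of the first exon, so sensitivity is 0.75 while precision is 1.0. Storing positions in sets is fine for regions of this size; interval arithmetic would scale better for whole genomes.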
UCSC: tracks for Gencode and for various gene prediction algorithms (focus on 50 kb encompassing five globin genes)
JIGSAW
Gencode
On bioinformatics
“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.”
On bioinformatics
“However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project
The 1000 Genomes Project
Goal: To create a deep catalog of human genetic variation in multiple populations.
[1] Discover variants (SNPs, copy number variants, insertions/deletions). Include ~all variants with allele frequencies >1% across the genome (and >0.1-0.5% in gene regions)
[2] Estimate the frequencies of variant alleles
The 1000 Genomes Project
Secondary goals:
• Characterize SNPs
• Improve the human reference sequence
• Study regions under selection
• Study variation across populations
• Study mutation and recombination
The 1000 Genomes Project
Current approaches include sequencing two HapMap trios (one from YRI, one CEU; father/mother/child) at 20X depth using next generation sequencing technology.
For one individual, 20X depth = 60 gigabases
For one trio, 20X depth = 180 gigabases
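The gigabase figures follow from depth multiplied by genome size multiplied by the number of individuals. A short Python check, assuming the roughly 3 billion base pair genome size quoted earlier in the talk (the function name is hypothetical):

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size used in this talk


def bases_required(depth: float, individuals: int = 1,
                   genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Total bases to sequence for a given average depth of coverage."""
    return depth * genome_size_bp * individuals


one_individual_gb = bases_required(20) / 1e9
one_trio_gb = bases_required(20, individuals=3) / 1e9
print(one_individual_gb, one_trio_gb)
```

The same function shows why the lighter-coverage design is attractive: the total grows linearly with both depth and the number of individuals.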
In another approach, sequence many individuals (n=1000) from the extended HapMap collection at lighter coverage.
Conclusions
We briefly surveyed the fields of bioinformatics and genomics. Bioinformatics serves biology, and genomics depends on the tools of bioinformatics.
There are rapid advances in available technologies, such as next generation sequencing, that allow us to address fundamental biological questions at unprecedented resolution. These questions include the nature of variation within and between genomes of individuals, groups (gender, ethnicity, disease status), and across species. Other questions, posed decades ago, concern biological processes such as development, metabolism, adaptation, and function.