Genomics, Bioinformatics and the Revolution in Biology

Jonathan Pevsner, Ph.D.

Kennedy Krieger Institute / Johns Hopkins School of Medicine

Outline

Three views of bioinformatics and genomics

Informatics

From small to large

From genotype to phenotype

The chromosomes

SNPs, HapMap, and the 1000 Genomes project

Definitions of bioinformatics and genomics

• Bioinformatics is the interface of biology and computers. It is the analysis of proteins, genes and genomes using computer algorithms and databases.

• Genomics is the analysis of genomes, including the nature of genetic elements on chromosomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

• Genetics is the study of the origin and expression of individual uniqueness.

Three views of bioinformatics and genomics

1. The field of informatics

2. From small to large

3. From genotype to phenotype

[Figure: the field of informatics. Labels: tool-users, tool-makers; bioinformatics, public health informatics, medical informatics; infrastructure (databases, algorithms)]

[Figure: genomics. Labels: DNA, RNA, protein, phenotype]

Rapid growth of DNA sequences

[Figure: total number of DNA base pairs in GenBank/WGS, 1982 to 2008. Axes: sequences (millions), base pairs (billions)]

Time of development

Body region, physiology, pharmacology, pathology

The Origin of Species (1859)

It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.

Source: Origin of Species, Chapter 15

Eukaryotes (Baldauf et al. 2000)

[Figure: eukaryote phylogeny. Labels: animals, plants, slime mold, Giardia, Trichomonas, Paramecium, Trypanosoma, Plasmodium]

Wolfe et al. (1999)

8 chromosomes (5,000 genes)

Paramecium tetraurelia: a ciliate with two nuclei, 40,000 genes, and three whole-genome duplications

Phylogenetic footprinting

Population shadowing

Phylogenetic shadowing

protein

pathway

organism

population

Phenotype

We see 500 inpatients and 13,000 outpatients per year at the Kennedy Krieger Institute. Why do children engage in self-injurious behavior? In many cases, there are chromosomal insults.

From genotype…

…to phenotype

[Figure labels: cellular phenotype; RNA, protein; clinical phenotype]

Central dogma of molecular biology: DNA is transcribed into RNA, which is translated into protein.

Central dogma of bioinformatics/genomics: the genome is transcribed into the transcriptome, which is translated into the proteome.
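
The two steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, uses the standard genetic code, and the DNA string below is a hypothetical sequence chosen only so that it spells the nine residues of oxytocin (CYIQNCPLG).

```python
# Standard genetic code, listed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    (a + b + c).replace("T", "U"): AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching the coding (sense) strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical coding sequence (an assumption for illustration only).
dna = "TGTTATATTCAAAATTGTCCTCTTGGT"
rna = transcribe(dna)
peptide = translate(rna)
print(rna)      # UGUUAUAUUCAAAAUUGUCCUCUUGGU
print(peptide)  # CYIQNCPLG
```

Scaling the same idea from one gene to every gene is what takes us from DNA, RNA and protein to genome, transcriptome and proteome.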

Over 200 billion base pairs of DNA have now been sequenced, from >165,000 organisms.

Scope of bioinformatics

Sequence analysis
• Pairwise alignment
• Multiple sequence alignment
• Phylogeny
• Database searching (e.g. BLAST)

Functional genomics
• RNA studies; gene expression profiling
• Proteomics; protein structure
• Gene function

Pairwise alignments in the 1950s

β-corticotropin (sheep)    ala gly glu asp asp glu
Corticotropin A (pig)      asp gly ala glu asp glu

Oxytocin       CYIQNCPLG
Vasopressin    CYFQNCPRG
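
The oxytocin/vasopressin comparison can be reproduced with a short sketch. It reports the positions that differ, the percent identity, and a global alignment score computed by dynamic programming (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; real alignments normally use a substitution matrix such as PAM or BLOSUM.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score, keeping only one row of the matrix at a time."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                               previous[j] + gap,        # gap in b
                               current[j - 1] + gap))    # gap in a
        previous = current
    return previous[-1]


def percent_identity(a: str, b: str) -> float:
    """Percent identical positions for two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"

# 1-based positions where the two peptides differ.
differences = [(i + 1, x, y)
               for i, (x, y) in enumerate(zip(oxytocin, vasopressin)) if x != y]
print(differences)                                         # [(3, 'I', 'F'), (8, 'L', 'R')]
print(round(percent_identity(oxytocin, vasopressin), 1))   # 77.8
print(global_alignment_score(oxytocin, vasopressin))       # 5
```

Seven of nine residues are identical, so the best global alignment here is the ungapped one.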

Early example of sequence alignment: globins (1961)

H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961.

myoglobin globins:

2e Fig. 5.21

Multiple sequence alignment of five globins:

ClustalW

Praline

MUSCLE

Probcons

TCoffee

Four bases: A, G, C, T arranged in base pairs along a double helix (1953).

Human genome project: sequencing all ~3 billion base pairs (2003).
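
Base pairing means that one strand determines the other: A pairs with T, and G pairs with C. A minimal sketch of that rule follows; the function name and example sequences are illustrative choices, not taken from the slides.

```python
# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))    # GCAT
print(reverse_complement("AAGCTT"))  # AAGCTT (a palindromic site reads the same on both strands)
```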

1995: first genome sequence (a bacterium)
2000: fruit fly genome, plant
2003: human genome
2008:
-- two individual human genomes finished
-- 1,000 human genomes (launched)
-- SNPs used to study chromosomes

Phenotype

Genotype

Outline

Three views of bioinformatics and genomics

Informatics

From small to large

From genotype to phenotype

The chromosomes

SNPs, HapMap, and the 1000 Genomes project

Eukaryotic genomes are organized into chromosomes

Genomic DNA is organized in chromosomes. The diploid number of chromosomes is constant in each species (e.g. 46 in human). Chromosomes are distinguished by a centromere and telomeres.

The chromosomes are routinely visualized by karyotyping (imaging the chromosomes during metaphase, when each chromosome is a pair of sister chromatids).

Fig. 16.19, page 565

human chromosome 21 at NCBI (figure labels: nucleolar organizing center, centromere)

human chromosome 21 at www.ensembl.org

human chromosome 21 at UCSC Genome Browser (figure label: centromere)

First P.G. mitosis in polar view. Tradescantia virginiana, Commelinaceae, n = 9 (from aberrant plant with 22 chromosomes). 2 BE - CV smears. x 1200. Printed on multigrade paper. Darlington.

First P.G. mitosis in Paris quadrifolia, Liliaceae, showing all stages from prophase to telophase. n = 10 (cf. Darlington 1937, 1941). 2 BE – CV smear, 8 mm objective. x 800. Darlington.

Root tip squashes showing anaphase separation. Fritillaria pudica, 3x = 39, spiral structure of chromatids revealed by pressure after cold treatment. 2 BD – Feulgen; x 3000. Darlington.

Cleavage mitosis in the morula of the teleostean fish, Coregonus clupeoides, in the middle of anaphase. Spindle structure revealed by slow fixation. Section cut at 10 µ. x 4000. Strong Flemming, haematoxylin. Prep. and photo by P.C. Koller. Darlington.

The eukaryotic chromosome: Robertsonian fusion creates one metacentric by fusion of two acrocentrics

Ohno (1970) Plate II

ordinary male house mouse (Mus musculus, 2n = 40)

male tobacco mouse (Mus poschiavinus, 2n = 26)

The spectrum of variation

Category of variation                      Size            Type
Single base pair changes                   1 bp            SNPs, point mutations
Small insertions/deletions                 1 – 50 bp
Short tandem repeats                       1 – 500 bp      microsatellites
Fine-scale structural variation            50 bp – 5 kb    del, dup, inv, tandem repeats
Retroelement insertions                    0.3 – 10 kb     SINEs, LINEs, LTRs, ERVs
Intermediate-scale structural variation    5 kb – 50 kb    del, dup, inv, tandem repeats
Large-scale structural variation           50 kb – 5 Mb    del, dup, inv, large tandem repeats
Chromosomal variation                      >>5 Mb          aneuploidy

Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42
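
As a rough illustration of the size scale in this table, a deletion or duplication can be assigned to a category from its length alone. The function name and the handling of the exact boundaries (for example, whether 5 kb counts as fine-scale or intermediate-scale) are assumptions; the table gives only the ranges.

```python
def size_category(length_bp: int) -> str:
    """Assign a deletion/duplication to a size category by its length in bp."""
    if length_bp < 1:
        raise ValueError("length must be at least 1 bp")
    if length_bp < 50:
        return "small insertion/deletion"
    if length_bp < 5_000:
        return "fine-scale structural variation"
    if length_bp < 50_000:
        return "intermediate-scale structural variation"
    if length_bp <= 5_000_000:
        return "large-scale structural variation"
    return "chromosomal variation"


for length in (12, 800, 20_000, 1_500_000, 40_000_000):
    print(length, size_category(length))
```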

Across the genome, there are four possible SNP calls:
[1] homozygous (AA)
[2] homozygous (BB)
[3] heterozygous (AB)
[4] no call

In a deleted region, there are three possible SNP calls:
[1] A (interpreted as AA)
[2] B (interpreted as BB)
[3] no call
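
Because a deleted region yields no heterozygous (AB) calls, one simple way to flag candidate deletions is to scan the calls along a chromosome for long runs without AB. The sketch below assumes the calls are already ordered by position and uses a hypothetical threshold of five consecutive SNPs. A run of homozygous calls can have other causes, so this only nominates regions for follow-up.

```python
from typing import List, Tuple


def runs_without_heterozygotes(calls: List[str],
                               min_snps: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs with no AB call.

    Calls are "AA", "BB", "AB" or "NC" (no call); "NC" does not break a run.
    """
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call == "AB":
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
        elif start is None:
            start = i
    if start is not None and len(calls) - start >= min_snps:
        runs.append((start, len(calls)))
    return runs


# Made-up calls for illustration: SNPs 4 through 9 contain no heterozygote.
calls = ["AA", "AB", "BB", "AB", "AA", "BB", "NC", "AA", "BB", "AA", "AB", "BB"]
print(runs_without_heterozygotes(calls))  # [(4, 10)]
```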

Single nucleotide polymorphisms (SNPs) to investigate chromosomes: A case of 7p deletion

[Figure labels: AA, AB, BB; A, B]

A case of 7p deletion

• Deletions (and duplications) such as these are called copy number variants (CNVs).

• CNVs commonly occur in normal individuals.

• When found in individuals with disease, we can tell if they are inherited (likely to be benign) or occur de novo (more likely to be disease-associated) by comparison to the parents’ genotypes.

• Recent papers report many CNVs in disease.
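
The comparison with the parents can be sketched as follows. A child's CNV is labeled inherited if either parent carries a CNV of the same type that overlaps it substantially, and de novo otherwise. The `CNV` record, the 50% reciprocal-overlap threshold, and the coordinates are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive
    kind: str   # "del" or "dup"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Fraction of overlap relative to the longer-covered of the two CNVs."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def classify(child: CNV, parental: Iterable[CNV], threshold: float = 0.5) -> str:
    """Label a child's CNV 'inherited' if any parental CNV matches it."""
    if any(reciprocal_overlap(child, p) >= threshold for p in parental):
        return "inherited"
    return "de novo"


# Hypothetical calls: the mother carries a similar 7p deletion; nobody carries the 10q one.
parents = [CNV("7", 1_100_000, 3_000_000, "del")]
print(classify(CNV("7", 1_000_000, 3_000_000, "del"), parents))   # inherited
print(classify(CNV("10", 5_000_000, 6_000_000, "del"), parents))  # de novo
```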

A case of trisomy 21 (Down syndrome)

AAA AAB ABB BBB
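
The genotype classes follow directly from the copy number: with n copies of a biallelic SNP there are n + 1 possible genotypes. The sketch below enumerates them together with the fraction of B alleles in each; the function name and the use of a B-allele fraction are illustrative additions.

```python
from fractions import Fraction
from typing import List, Tuple


def genotype_classes(copy_number: int) -> List[Tuple[str, Fraction]]:
    """List each genotype and its B-allele fraction for a given copy number."""
    if copy_number < 1:
        raise ValueError("copy number must be at least 1")
    return [("A" * (copy_number - b) + "B" * b, Fraction(b, copy_number))
            for b in range(copy_number + 1)]


for n, label in ((1, "deletion"), (2, "normal"), (3, "trisomy")):
    print(label, [genotype for genotype, _ in genotype_classes(n)])
# deletion ['A', 'B']
# normal ['AA', 'AB', 'BB']
# trisomy ['AAA', 'AAB', 'ABB', 'BBB']
```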

Three cases of 10q deletion

Deafness gene?

The International HapMap Project

► A catalog of common genetic variants that occur in humans

► The project’s goal is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared

► An initial focus has been on four groups (n=270):

CEU: European ancestry (30 trios), Utah residents

YRI: African ancestry (30 trios), Yoruba in Ibadan, Nigeria

JPT/CHB: Asian ancestry (90 individuals), Japanese in Tokyo, Japan and Han Chinese in Beijing, China

► Phase I (2005): > 1 million SNPs

► Phase II (2007): added 2.1 million SNPs

The International HapMap Project

► In addition to CEU, YRI, and JPT/CHB, other populations have been genotyped, including:

Maasai in Kinyawa, Kenya
Luhya in Webuye, Kenya
Gujarati Indians in Houston, TX
Toscani in Italy
Mexican ancestry in Los Angeles
African ancestry in southwestern US

The ENCODE project

► The ENCyclopedia Of DNA Elements (ENCODE) project was launched in 2003

► Pilot phase: devise and test high-throughput approaches to identify functional elements. Efforts center on 44 DNA targets. These cover about 1 percent of the human genome, or about 30 million base pairs.

► Second phase: technology development.

► Third phase: production. Expand the ENCODE project to analyze the remaining 99 percent of the human genome.

The ENCODE project

Goal of ENCODE: build a list of all sequence-based functional elements in human DNA. This includes:

► protein-coding genes

► non-protein-coding genes

► regulatory elements involved in the control of gene transcription

► DNA sequences that mediate chromosomal structure and dynamics.

ENCODE data at the UCSC Genome Browser: beta globin (50,000 base pairs including HBB, HBD, HBG1, HBG2, HBE1)

ENCODE tracks available at the UCSC Genome Browser

EGASP: the human ENCODE Genome Annotation Assessment Project

EGASP goals:

[1] Assess the accuracy of computational methods to predict protein-coding genes. Eighteen groups competed to make blind gene predictions; these were evaluated relative to reference annotations generated by the GENCODE project.

[2] Assess the completeness of the current human genome annotations as represented in the ENCODE regions.
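
The slides do not say which accuracy measures EGASP used. One convention often applied to gene predictions is nucleotide-level sensitivity (the fraction of annotated coding bases that were predicted) and specificity (the fraction of predicted coding bases that are annotated). The sketch below assumes that convention; the interval coordinates are made up.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Expand intervals into the set of base positions they cover."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(predicted: Iterable[Interval],
                        reference: Iterable[Interval]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level."""
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Hypothetical exons: the prediction misses half of exon 1 and overextends exon 2.
reference_exons = [(100, 200), (300, 400)]
predicted_exons = [(150, 200), (300, 450)]
print(nucleotide_accuracy(predicted_exons, reference_exons))  # (0.75, 0.75)
```

Expanding intervals into sets is fine for a 50 kb region such as the beta globin locus; a genome-wide comparison would want interval arithmetic instead.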

UCSC: tracks for Gencode and for various gene prediction algorithms (focus on 50 kb encompassing five globin genes)

JIGSAW

Gencode

On bioinformatics

“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.”

On bioinformatics

“However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”

Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

The 1000 Genomes Project

Goal: To create a deep catalog of human genetic variation in multiple populations.

[1] Discover variants (SNPs, copy number variants, insertions/deletions). Include ~all variants with allele frequencies >1% across the genome (and >0.1-0.5% in gene regions)

[2] Estimate the frequencies of variant alleles

Secondary goals:
• Characterize SNPs
• Improve the human reference sequence
• Study regions under selection
• Study variation across populations
• Study mutation and recombination

Current approaches include sequencing two HapMap trios (one from YRI, one from CEU; father/mother/child) at 20X depth using next generation sequencing technology.
For one individual, 20X depth = 60 gigabases
For one trio, 20X depth = 180 gigabases
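
The gigabase figures follow from multiplying depth by genome size, taking the human genome as roughly 3 billion base pairs. A minimal sketch of that arithmetic (the function name is illustrative):

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate size of the human genome


def bases_required(depth: int, individuals: int = 1,
                   genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Total sequenced bases needed for a given average depth of coverage."""
    return depth * genome_size_bp * individuals


per_individual = bases_required(20)
per_trio = bases_required(20, individuals=3)
print(per_individual / 1e9)  # 60.0 gigabases
print(per_trio / 1e9)        # 180.0 gigabases
```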

In another approach, sequence many individuals (n=1000) from the extended HapMap collection at lighter coverage.

Conclusions

We briefly surveyed the fields of bioinformatics and genomics. Bioinformatics serves biology, and genomics depends on the tools of bioinformatics.

There are rapid advances in available technologies, such as next generation sequencing, that allow us to address fundamental biological questions at unprecedented resolution. These questions include the nature of variation within and between genomes of individuals, groups (gender, ethnicity, disease status), and across species. Other questions, posed decades ago, concern biological processes such as development, metabolism, adaptation, and function.
