Integrating Genomes D. R. Zerbino, B. Paten, D. Haussler

Science 336, 179 (2012)

Teacher: Professor Chao, Kun-Mao

Speaker: Ho, Bin-Shenq

June 4, 2012

Outline

Overview

Obtaining Genomic Sequences

Modeling Evolution of Genotype

From Genotype to Phenotype

Looking Ahead to Applications

Conclusion

Overview

Specialization in computational genomics

Integration of genetic, molecular, and phenotypic information

Impact on diverse fields of science

New window into the story of life

population genetics, phylogenetics, human disease genetics + graph theory, signal processing, statistics, computer science

Milestones

First genome sequences_1970s

Bacteriophage MS2 RNA: 3,569 nucleotides long_1976

Computational genomics_1980

Smith and Waterman

Stormo et al.

16-fold improvement in computational power under Moore’s law

A 10,000-fold sequencing performance improvement in the past 8 years
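The two growth figures above can be compared with a little arithmetic. The sketch below assumes a two-year doubling period for Moore's law (not stated on the slide, but consistent with a 16-fold gain over 8 years) and asks what doubling period a 10,000-fold gain over the same 8 years would imply. The function names are illustrative only.

```python
import math


def fold_improvement(years, doubling_years):
    """Fold improvement after `years` if performance doubles every `doubling_years`."""
    return 2 ** (years / doubling_years)


def implied_doubling_years(fold, years):
    """Doubling period (in years) implied by a `fold` improvement over `years`."""
    return years / math.log2(fold)


# Assumption: Moore's law taken as one doubling every 2 years.
moore_fold = fold_improvement(years=8, doubling_years=2.0)

# Doubling period implied by the 10,000-fold sequencing improvement.
sequencing_doubling_months = 12 * implied_doubling_years(fold=10_000, years=8)

print(f"Computing power over 8 years: {moore_fold:.0f}-fold")
print(f"Sequencing doubling period: about {sequencing_doubling_months:.1f} months")
```

Under that assumption, sequencing performance doubled roughly every seven months, against every two years for computing power.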

Computational Genomics

Genomic data

Evolution

Molecular phenotype

Organismal phenotype

DNA sequence evolving in time (history)

chromatin piece interacting with other molecules (mechanism)

gene product acting in cellular pathways affecting organisms (function)

Obtaining Genomic Sequences

Genome assembly given sufficient read redundancy

Large redundant regions (repeats)
→ complex networks of read-to-read overlaps, not all reflecting actual overlaps
→ need to determine which overlaps are legitimate and which are spurious
→ NP-hard problem
→ undetermined, error-prone, costly-to-finish regions

Newer sequencing technologies with longer reads
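A minimal sketch of how a repeat produces spurious read-to-read overlaps. The toy genome TTACGGATACGGCC (with the repeat ACGG), the four reads cut from it, and the minimum overlap length of 3 are all made up for illustration; real assemblers use far more elaborate overlap detection and error handling.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map every ordered pair of distinct reads with an overlap to the overlap length."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(a, b)] = k
    return graph


# Hypothetical reads taken left to right from the toy genome TTACGGATACGGCC.
READS = ["TTACGG", "CGGATA", "ATACGG", "CGGCC"]

# Overlaps between reads that are truly adjacent in the toy genome.
TRUE_EDGES = {("TTACGG", "CGGATA"), ("CGGATA", "ATACGG"), ("ATACGG", "CGGCC")}

graph = overlap_graph(READS)
spurious = sorted(edge for edge in graph if edge not in TRUE_EDGES)
print("all overlaps:", graph)
print("spurious overlaps caused by the repeat:", spurious)
```

Both copies of the repeat end in CGG, so the graph also contains edges such as TTACGG → CGGCC that do not correspond to adjacent positions in the genome. Deciding which edges to trust is the hard part of assembly.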

Obtaining Genomic Sequences

Reference-based assembly

Tendency of bias toward reference genome

Newer sequencing technologies with longer reads

Modeling Evolution of Genotype

Diversity of Genomes

Alignment

Phylogenetic analysis

Diversity of Genomes

Every genome being the result of a 3.8-billion-year evolutionary journey from the origin of life

Mostly shared and partly unique

Single-base change_substitution, SNP

Indel_insertion, deletion

Tandem duplication

Recombination

Transposition

Rearrangement_inversion, segmental deletion, segmental duplication, fusion, fission, translocation

Whole genome duplication
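Several of the event types listed above can be pictured as simple operations on a sequence string. The sketch below is only an illustration under simplifying assumptions: a single DNA strand, 0-based half-open coordinates, and hypothetical function names. It says nothing about how often or why such events occur.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitute(seq, pos, base):
    """Single-base change: replace the base at `pos`."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, segment):
    """Insertion of `segment` before position `pos`."""
    return seq[:pos] + segment + seq[pos:]


def delete(seq, start, end):
    """Deletion of the half-open interval [start, end)."""
    return seq[:start] + seq[end:]


def tandem_duplicate(seq, start, end):
    """Tandem duplication: copy [start, end) immediately after itself."""
    return seq[:end] + seq[start:end] + seq[end:]


def invert(seq, start, end):
    """Inversion: replace [start, end) by its reverse complement."""
    segment = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[end:]


print(substitute("ACGT", 1, "T"))
print(tandem_duplicate("ACGT", 1, 3))
print(invert("AACGT", 1, 4))
```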

Diversity of Genomes

Germline selections

Evolution

Somatic selections

Cancer / Immunity

Assembly and Alignment

Fig. 1. Assembly and alignment.

Alignment

Alignment with assumption of derivation from a suitably recent common ancestor

What is conserved or changed during the evolution from the common ancestor

Substitution, indel, segment order, copy number

Local alignment for conserved functional regions of more distantly related genomes

Global / Genome alignment for genomes from closely related species
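Local alignment, as mentioned above, can be sketched with the Smith-Waterman dynamic program named on the Milestones slide. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration, and the linear gap penalty is a simplification; this is a teaching sketch, not a production aligner.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score of an alignment ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ACACACTA", "AGCACACA")
print(score)
print(top)
print(bottom)
```

A global alignment (Needleman-Wunsch style) differs mainly in dropping the zero floor and scoring the sequences end to end, which suits closely related genomes.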

Phylogenetic Analysis

Single tree providing an explicit order of gene descent through shared ancestry

Finding the optimal phylogeny under probabilistic or parsimony models of substitutions and indels is NP-hard

Complicated by homologous recombination

Aim: a tractable unified theory of genome evolution, with stochastic processes jointly describing the diversification events of genomes
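To make the parsimony criterion concrete: scoring one given tree is cheap, and the hardness comes from searching over all possible trees. The sketch below uses Fitch's algorithm to count the minimum number of substitutions for a single alignment column on a fixed rooted binary tree. The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def fitch(tree):
    """Return (candidate ancestral states, minimum substitutions) for one alignment column.

    A leaf is a one-letter string; an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra substitution needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both possibilities and count one substitution.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree):
    """Minimum number of substitutions explaining the leaf states on this tree."""
    return fitch(tree)[1]


# Two hypothetical topologies for the same four leaf states A, A, C, C.
grouped = (("A", "A"), ("C", "C"))
mixed = (("A", "C"), ("A", "C"))
print(parsimony_score(grouped), parsimony_score(mixed))
```

The grouped topology needs one substitution and the mixed one needs two, so parsimony prefers the former. Repeating this comparison over every possible topology is what becomes intractable as the number of species grows.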

From Genotype to Phenotype

Fig. 2. The dynamic processes that affect and are affected by the genome.

Genomes_Mechanisms_Functions

Active molecules of the cell, including proteins, messenger RNAs, other functional RNAs

Epigenetic mechanisms regulating RNA and protein production and function

Gene regulatory networks

Protein signaling cascades

Metabolic pathways

Regulatory network motifs

From Genotype to Phenotype

Exploring the unfolding history and diversity of life

Deriving experimental data from an expansion of cell culture resources for diverse species / tissues and newer single-cell assay methodologies

Correlating specific segregating variants with phenotypic traits or diseases

Identifying causal variants by complete genome analysis in related as well as unrelated cases and controls and in combination with better prediction of possible effects of genome variants
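Correlating a variant with a trait, as described above, is often illustrated with a case/control contingency test. The sketch below computes a Pearson chi-square statistic (1 degree of freedom, no continuity correction) for a 2x2 table of allele counts. The counts are invented for illustration, and a significant association by itself does not identify a causal variant.

```python
import math


def chi_square_2x2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic and p-value (1 d.f.) for a 2x2 table of counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: alternate vs reference alleles in cases and in controls.
chi2, p_value = chi_square_2x2(case_alt=30, case_ref=70, ctrl_alt=10, ctrl_ref=90)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

In a genome-wide setting the same test is repeated for very many variants, so the significance threshold has to be corrected for multiple testing.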

From Genotype to Phenotype

Constructing models of molecular phenotypes involving epigenetic state, RNA expression, and (inferred) protein levels through hidden Markov models, factor graphs, Bayesian networks, and Markov random fields

Incorporating biological knowledge into classification and regression methods (e.g., general linear models, neural networks, and support vector machines)
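As one concrete instance of the hidden Markov models mentioned above, the sketch below decodes the most likely sequence of hidden states with the Viterbi algorithm. The two states ("open" and "closed" chromatin), the observation symbols ("high" and "low" assay signal), and every probability are invented for illustration; real models are trained on data and use many more states and tracks.

```python
import math

# Hypothetical two-state model: all names and numbers are assumptions.
STATES = ["open", "closed"]
START_P = {"open": 0.5, "closed": 0.5}
TRANS_P = {
    "open": {"open": 0.9, "closed": 0.1},
    "closed": {"open": 0.1, "closed": 0.9},
}
EMIT_P = {
    "open": {"high": 0.8, "low": 0.2},
    "closed": {"high": 0.2, "low": 0.8},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a non-empty observation list (log-space)."""
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_score[r] + math.log(trans_p[r][s]))
            score[s] = (
                prev_score[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][obs])
            )
            pointers[s] = best_prev
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


signal = ["high", "high", "low", "low", "low"]
print(viterbi(signal, STATES, START_P, TRANS_P, EMIT_P))
```

The high self-transition probabilities make the decoded path change state only when the signal shifts persistently, which is the smoothing behaviour that makes such models useful for segmenting genomic tracks.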

Looking Ahead to Applications

Genome data growth collectively from petabytes (10^15 bytes) today to exabytes (10^18 bytes) tomorrow

Cancer diagnosis and treatment

Immunology

Stem cell therapy

Agriculture

Human prehistory study

Conclusion

Facing the challenge of obtaining maximum information from every sequencing experiment

To borrow and tie together advances from a spectrum of different research fields into foundational mathematical models

Trade-off between model comprehensiveness and computational efficiency

To be shaped by increasing knowledge of biology

Thank You For Your Attention