Assembly, Comparison, and Annotation of Mammalian Genomes

David Haussler, Howard Hughes Medical Institute, University of California, Santa Cruz



Page 1: Assembly, Comparison, and Annotation of Mammalian Genomes

David Haussler

Howard Hughes Medical Institute

University of California, Santa Cruz

Assembly, Comparison, and Annotation of Mammalian Genomes

Page 2: Assembly, Comparison, and Annotation of Mammalian Genomes

Bioinformatics of mammalian genomes

•Sequence Assembly

•Genome Browsers: new computational microscopes

•Computing Evolution’s Path: key to understanding function

Page 3: Assembly, Comparison, and Annotation of Mammalian Genomes

Assembling the human genome

• GigAssembler (Kent)

– Built first draft of the human genome from lower-level contigs produced by Phrap (P. Green)

• Celera Assembler (Myers/Sutton)

– First mammalian whole genome shotgun assembler

Outgoing UCSC internet traffic (green) for year 2000. Main peak is activity on July 7, 2000, when human sequence was first posted on the web

Page 4: Assembly, Comparison, and Annotation of Mammalian Genomes

Assembling other mammalian genomes

• Arachne (Jaffe/Batzoglou, Lander group at MIT)

– Built first draft of mouse genome, February 2002

– Mouse also assembled by Phusion assembler (Mullikin, Sanger Centre)

• Atlas (Havlak/Chen/Durbin, Gibbs group at Baylor)

– Built first draft of rat genome, November 2002

Page 5: Assembly, Comparison, and Annotation of Mammalian Genomes

Browsers as web-based genome microscopes

• Ensembl Browser (Birney et al.)

• MapViewer (NCBI Mapviewer team)

• UCSC Genome Browser (Kent et al.) http://genome.ucsc.edu, currently getting more than 140,000 page requests per day

Page 6: Assembly, Comparison, and Annotation of Mammalian Genomes

Browsers take you from early maps of the genome . . .

Page 7: Assembly, Comparison, and Annotation of Mammalian Genomes

. . . to a multi-resolution view . . .

Page 8: Assembly, Comparison, and Annotation of Mammalian Genomes

. . . at the gene cluster level . . .

Page 9: Assembly, Comparison, and Annotation of Mammalian Genomes

. . . the single gene level . . .

Page 10: Assembly, Comparison, and Annotation of Mammalian Genomes

. . . the single exon level . . .

Page 11: Assembly, Comparison, and Annotation of Mammalian Genomes

. . . and at the single base level

caggcggactcagtggatctggccagctgtgacttgacaag
caggcggactcagtggatctagccagctgtgacttgacaag

Page 12: Assembly, Comparison, and Annotation of Mammalian Genomes

linking to functional information

In situ image from I. Dragatsis et al. 1998

Page 13: Assembly, Comparison, and Annotation of Mammalian Genomes

Goal: the browser as a continuously-tuned engine for discovery

• Multiple streams of high-throughput genomics data generated asynchronously

• Data fed into nightly updates of browser database, analysis and display

• Browser becomes a new kind of microscope scanning the genome at ever greater detail, dimension, and depth

Page 14: Assembly, Comparison, and Annotation of Mammalian Genomes

Using evolution to find genes and other functional elements

Mouse conservation pattern in the IGFALS gene on human chr. 16 and a known transcription factor binding site

R. Weber, L. Elnitski et al.

Page 15: Assembly, Comparison, and Annotation of Mammalian Genomes

At least half of the human genome consists of relics of retrotransposons

[Figure: retrotransposon copy cycle. Labels: DNA of genome, retrotransposon, RNA, reverse transcriptase, new copy of retrotransposon]

1. Transcription

2. Translation

3. Reverse transcription of RNA to DNA

4. Synthesis of second DNA strand

5. Insertion of retrotransposon DNA

Page 16: Assembly, Comparison, and Annotation of Mammalian Genomes

Ancestral retrotransposons

• Retrotransposon relics from our common ancestor with mouse and other placental mammals

• They cover 22% of the human genome

• “AR” sites can be used to study neutral evolution: mutation without selection

• “AR” sites are similar to “4D” sites in genes (four-fold degenerate sites in codons)
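As a rough illustration of what a "4D" site is, the sketch below uses the standard genetic code to find codon third positions where any of the four bases encodes the same amino acid. The function names and the example sequence are illustrative assumptions, not part of the talk, and the input is assumed to be an in-frame coding sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_prefixes():
    """Two-base codon prefixes whose third position is four-fold degenerate."""
    return sorted(
        a + b
        for a, b in product(BASES, repeat=2)
        if len({CODON_TABLE[a + b + c] for c in BASES}) == 1
    )


def fourfold_sites(cds):
    """0-based indices of four-fold degenerate third positions in an in-frame CDS."""
    prefixes = set(fourfold_prefixes())
    cds = cds.upper()
    return [i + 2 for i in range(0, len(cds) - 2, 3) if cds[i:i + 2] in prefixes]


print(fourfold_prefixes())             # eight prefixes, e.g. GC (alanine), CT (leucine)
print(fourfold_sites("ATGGCTCTGAAA"))  # [5, 8]
```

In the made-up example, the third bases of GCT (alanine) and CTG (leucine) are reported, while ATG and AAA contribute no 4D site.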

Page 17: Assembly, Comparison, and Annotation of Mammalian Genomes

Estimated rate of neutral substitution from AR and 4D sites co-varies along the chromosomes

R. Hardison, K. Roskin, S. Yang, A. Smit, et al.

Page 18: Assembly, Comparison, and Annotation of Mammalian Genomes

By comparison to local neutral substitution rates, it appears that about 5% of the human genome may be under purifying selection.

K. Roskin, R. Weber, F. Chiaromonte

Page 19: Assembly, Comparison, and Annotation of Mammalian Genomes

More species increases power to detect conserved elements

[Browser snapshot: Human, Chimp, Baboon, Cat, Dog, Pig, Cow, Rat, Mouse, Chicken, Zebrafish, Fugu, Tetraodon]

Data from Eric Green at NHGRI, alignments by Webb Miller

About 4% of CFTR region is under purifying selection

Page 20: Assembly, Comparison, and Annotation of Mammalian Genomes

Models of molecular evolution

Branch length equals average number of substitutions per site

[Figure, built up across pages 20 to 25: phylogenetic tree with bases A, A, G, A, T, G, G, T, T added step by step]

Page 26: Assembly, Comparison, and Annotation of Mammalian Genomes

Continuous-time Markov models of molecular evolution can be used to calculate p-values for conservation

Conditional probability distribution on each branch has the form P = e^{Qt}, where t is the time and Q is a 4 by 4 rate matrix.

Parameterizations of Q: JC, …, HKY, REV, UNR
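To make P = e^{Qt} concrete, here is a minimal sketch for the simplest parameterization named above, JC (Jukes-Cantor). It builds Q, exponentiates it numerically, and compares the result with the well-known JC closed form. The scaling of Q to one expected substitution per unit time (so that t is a branch length in substitutions per site) and all function names are assumptions made for this example.

```python
import math


def jc_rate_matrix():
    """Jukes-Cantor Q, scaled to one expected substitution per unit time."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20, squarings=10):
    """Approximate e^m with a truncated Taylor series plus scaling and squaring."""
    n = len(m)
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(q, t):
    """P(t) = e^{Qt}; entry [i][j] is the probability that base i becomes base j."""
    return mat_exp([[x * t for x in row] for row in q])


def jc_closed_form(t):
    """Closed-form JC transition probabilities for branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


P = transition_matrix(jc_rate_matrix(), 0.5)
print([round(x, 4) for x in P[0]])  # first row of P for a branch of length 0.5
```

Richer parameterizations such as HKY or REV change only how Q is filled in; the exponentiation step stays the same.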

Page 27: Assembly, Comparison, and Annotation of Mammalian Genomes

Calculation of p-values

• p-value is the probability of getting a given parsimony score or better, using a continuous-time Markov model of evolution

• p-values are calculated recursively for the two subtrees, for all possible values of parsimony score and ancestral bases for each subtree

• data for subtrees is combined to produce the p-value at the root

Method developed by Mathieu Blanchette and Martin Tompa
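The parsimony score that the p-value refers to can itself be computed by a recursion over the two subtrees. The sketch below uses the standard Fitch algorithm for one alignment column; it is not the p-value method of Blanchette and Tompa, which is not reproduced here. The tree topology and the example column are hypothetical.

```python
def fitch(tree, column):
    """Return (candidate ancestral bases, parsimony score) for one alignment column.

    `tree` is either a leaf name or a pair (left_subtree, right_subtree);
    `column` maps each leaf name to the base observed in that species.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_score = fitch(tree[0], column)
    right_set, right_score = fitch(tree[1], column)
    shared = left_set & right_set
    if shared:
        # The subtrees agree on at least one base: no extra substitution needed.
        return shared, left_score + right_score
    # Otherwise one substitution is charged at this node.
    return left_set | right_set, left_score + right_score + 1


# Hypothetical topology and column, for illustration only.
tree = ((("human", "baboon"), "mouse"), ("dog", "cat"))
column = {"human": "A", "baboon": "A", "mouse": "G", "dog": "A", "cat": "T"}
print(fitch(tree, column))  # ({'A'}, 2)
```

A fully conserved column scores 0. Low scores at a site are the kind of evidence the p-value above is meant to quantify.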

Page 30: Assembly, Comparison, and Annotation of Mammalian Genomes

Examples of conserved regions

Analysis of CFTR region by Mathieu Blanchette

Page 31: Assembly, Comparison, and Annotation of Mammalian Genomes

Regulatory modules

Mathieu Blanchette

Page 32: Assembly, Comparison, and Annotation of Mammalian Genomes

Conserved RNA structure in a 3’ UTR

Mathieu Blanchette

Page 33: Assembly, Comparison, and Annotation of Mammalian Genomes

Intronic RNA structural element

73kb to ST7 1st exon; 73kb to ST7 2nd exon

~90 bp conserved stem

Mathieu Blanchette

Page 34: Assembly, Comparison, and Annotation of Mammalian Genomes

Modeling different modes of substitution

We want to pay attention to how elements are conserved, not just that they are conserved

Page 35: Assembly, Comparison, and Annotation of Mammalian Genomes

Context matters

Substitution rate matrix for non-coding dinucleotides

Adam Siepel
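A dinucleotide model works with 16 states instead of 4, so its raw data are counts of one dinucleotide changing into another. The sketch below only tallies such counts from a gap-aware pairwise alignment. It does not estimate a rate matrix, which needs a model fit that the slide does not describe. The function names are assumptions, and overlapping windows are used as a simplification, so interior sites are counted twice.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_substitution_counts(ancestral, derived):
    """Count aligned dinucleotide pairs (ancestral -> derived) at each overlapping position."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")
    valid = set(DINUCLEOTIDES)
    counts = Counter()
    for i in range(len(ancestral) - 1):
        src = ancestral[i:i + 2].upper()
        dst = derived[i:i + 2].upper()
        if src in valid and dst in valid:  # skip gaps and ambiguous bases
            counts[(src, dst)] += 1
    return counts


def as_matrix(counts):
    """Arrange counts as a 16 x 16 table: rows ancestral, columns derived."""
    return [[counts[(a, b)] for b in DINUCLEOTIDES] for a in DINUCLEOTIDES]


counts = dinucleotide_substitution_counts("ACGT", "ATGT")
print(counts[("CG", "TG")])  # 1
```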

Page 36: Assembly, Comparison, and Annotation of Mammalian Genomes

Dinucleotide and trinucleotide models fit substitution data from neutral regions much better

Improvement in log likelihood on AR sites for higher order models of base substitution

Adam Siepel

Page 37: Assembly, Comparison, and Annotation of Mammalian Genomes

Method also produces improved models of codon evolution

Adam Siepel

Page 38: Assembly, Comparison, and Annotation of Mammalian Genomes

Phylogenetic HMMs

[Figure: multiple alignment of eight species]

TAATGGTA…CCAGTTA…GCAGAGT…
CCATGGTT…CCCGTAG…CCAGAGT…
TAATGGTA…CCGGTTA…ACAGAGT…
TTATGGTA…CCTGTTA…ACAGAGT…
CGATGGTG…CCGGTCG…ACAGAGC…
CTATGGTC…CCTGTTA…TCAGAGC…
GTATGGTC…CCTGTCG…TCAGAGC…
CCATGGTT…CCCGTAG…CCAGAGT…

Species: human, baboon, mouse, dog, cat, cow, pig, chicken

Adam Siepel
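The general idea of a phylogenetic HMM is that each hidden state scores an alignment column with its own evolutionary model, and the HMM ties neighbouring columns together. The sketch below is a heavily simplified version under stated assumptions: a star tree with equal branch lengths, Jukes-Cantor substitution, gap-free columns, one "conserved" and one "neutral" state, and made-up parameter values. It is not the model used in the talk.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b over branch length t > 0."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def column_likelihood(column, t):
    """Likelihood of one column under a star tree with all branch lengths equal to t."""
    total = 0.0
    for root in BASES:
        p = 0.25  # uniform prior on the ancestral base
        for base in column:
            p *= jc_prob(root, base, t)
        total += p
    return total


def viterbi(columns, branch_lengths, stay=0.9):
    """Most probable state path; state k scores columns with branch_lengths[k]."""
    n = len(branch_lengths)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))

    def trans(j, k):
        return log_stay if j == k else log_move

    score = [math.log(1.0 / n) + math.log(column_likelihood(columns[0], t))
             for t in branch_lengths]
    back = []
    for col in columns[1:]:
        new_score, pointers = [], []
        for k, t in enumerate(branch_lengths):
            best = max(range(n), key=lambda j: score[j] + trans(j, k))
            pointers.append(best)
            new_score.append(score[best] + trans(best, k)
                             + math.log(column_likelihood(col, t)))
        score = new_score
        back.append(pointers)
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# State 0 = conserved (short branches), state 1 = neutral (long branches).
columns = ["AAAA"] * 5 + ["ACGT"] * 5
print(viterbi(columns, branch_lengths=[0.05, 1.0]))
```

With these toy inputs the identical columns are assigned to the short-branch state and the divergent columns to the long-branch state.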

Page 39: Assembly, Comparison, and Annotation of Mammalian Genomes

Comparative cDNA analysis finds alternatively spliced genes

Human splice variants of ZNF278 conserved in mouse

Chuck Sugnet

Page 40: Assembly, Comparison, and Annotation of Mammalian Genomes

Molecular evolution is more than base substitutions

• Insertions

• Deletions

• Duplications

• Inversions

• Rearrangements

Page 41: Assembly, Comparison, and Annotation of Mammalian Genomes

Genome-wide human-mouse alignments reveal a host of multibase evolutionary events

A 15,000 base inversion on human chromosome 7 containing two genes

J. Kent, W. Miller, R. Baertsch
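At the sequence level, an inversion replaces a segment with its reverse complement, which is why the inverted region aligns to the opposite strand of the other genome. A minimal sketch with a made-up nine-base sequence follows; the function names are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(seq, start, end):
    """Return seq with the half-open interval [start, end) inverted in place."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


original = "AAACCGTTT"
inverted = invert_segment(original, 3, 6)
print(inverted)                        # AAACGGTTT
print(invert_segment(inverted, 3, 6))  # AAACCGTTT, inverting twice restores the original
```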

Page 42: Assembly, Comparison, and Annotation of Mammalian Genomes

Hot spots for rearrangements?

At finer resolution, many thousands of syntenic blocks between human and mouse are found, and short blocks are clustered in clumps

J. Kent, W. Miller, R. Baertsch

Page 43: Assembly, Comparison, and Annotation of Mammalian Genomes

Grand challenge of human molecular evolution

Reconstruct the evolutionary history of each base in the human genome

Page 44: Assembly, Comparison, and Annotation of Mammalian Genomes

Credits

Thanks to Jim Kent, Terry Furey, Mathieu Blanchette, Adam Siepel, Chuck Sugnet, Ryan Weber, Krishna Roskin, Mark Diekhans, Robert Baertsch, Matt Schwartz, Angie Hinrichs, Donna Karolchik, Heather Trumbower, Yontao Lu, Fan Hsu, Daryl Thomas, Jorge Garcia, Patrick Gavin and Paul Tatarsky at UCSC

Francis Collins, Bob Waterston, Eric Lander, Richard Gibbs, Eric Green, Elliot Margulies, David Kulp, Alan Williams, Ray Wheeler, Webb Miller, Ross Hardison, Scott Schwartz, Francesca Chiaromonte, Thomas Pringle, Greg Schuler, Deanna Church, Steve Sherry, Ewan Birney, Michelle Clamp, David Jaffe, Asif Chinwalla, Jim Mullikin, Tim Hubbard, Arian Smit, Nick Goldman, Barbara Trask, Ian Dunham, Sean Eddy, Evan Eichler, David Cox, Carol Bult, and many other outside collaborators