Multiple mouse reference genomes and strain specific gene annotations Thomas Keane, Wellcome Trust Sanger Institute @drtkeane @mousegenomes [email protected]

Upload: thomas-keane

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Multiple mouse reference genomes and strain specific gene annotations

Multiple mouse reference genomes and strain specific gene annotations

Thomas Keane, Wellcome Trust Sanger Institute @drtkeane @mousegenomes [email protected]

Page 2: Multiple mouse reference genomes and strain specific gene annotations

Sequence variation

➢ 36 inbred strains with whole-genome Illumina sequencing

➢ SNPs, indels, and structural variants

➢ Are there more inbred strains with deep whole-genome Illumina sequencing?

➢ LG/J, SM/J, and JF1/MsJ pending

Anthony Doran, WTSI

Page 3: Multiple mouse reference genomes and strain specific gene annotations

Genome assemblies

➢ REL-1412: Illumina mate-pair-based de novo scaffolds

➢ REL-1504: Pseudo-chromosomes

○ Alignment synteny with GRCm38

○ Evaluation with PacBio WGS/cDNA showed excessive reference bias

➢ REL-1509: Pseudo-chromosomes based on breakpoint graphs

○ Dovetail Genomics scaffolds for CAST/EiJ, PWK/PhJ, and SPRET/EiJ

Assembly stages (figure):

1. Contigs - paired-end Illumina

2. Scaffolds - large fragment ends (3, 6, 10 kb, Dovetail, BAC ends)

3. Pseudo-chromosomes (e.g. Chr1) - whole-genome alignments

Page 4: Multiple mouse reference genomes and strain specific gene annotations

PacBio alignments

➢ Use the alignment contiguity of PacBio long reads to validate the chromosome sequences

➢ Compare the number of inconsistently mapped reads

Page 5: Multiple mouse reference genomes and strain specific gene annotations

PacBio WGS and cDNA alignments

Page 6: Multiple mouse reference genomes and strain specific gene annotations

PWK/PhJ

Dovetail genomics: CAST/EiJ, PWK/PhJ, SPRET/EiJ

A) High molecular weight (50+ kbp) input DNA

B) Reconstitute chromatin from the input DNA

C) Addition of a fixative agent (e.g., formaldehyde) produces crosslinks

D) Crosslinked chromatin digested with a restriction endonuclease to generate sticky-ended fragments

E+F) DNA ligase added to perform blunt-end ligation of the many ends within a given chromatin aggregate

G) Chromatin is removed and DNA is purified and processed to remove biotin

Enrich for biotin-containing fragments and prepare the sequencing library

http://dovetailgenomics.com/

Page 7: Multiple mouse reference genomes and strain specific gene annotations

Dovetail Scaffolds

REL-1412

Strain      Length (Gbp)  Scaffolds  N50 (Mbp)  Largest (Mbp)  % Ns
CAST/EiJ    2.69          382,843    0.644      4.75           11.4
PWK/PhJ     2.53          271,282    0.390      4.0            6.3
SPRET/EiJ   2.66          297,604    0.361      2.82           9.4

REL-1412+Dovetail

Strain      Length (Gbp)  Scaffolds  N50 (Mbp)  Largest (Mbp)  % Ns
CAST/EiJ    2.69          367,627    22.216     90.4           11.5
PWK/PhJ     2.58          251,844    24.066     100.6          7.44
SPRET/EiJ   2.66          272,127    23.475     88.6           9.5
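
For readers unfamiliar with the table columns, the sketch below shows one common way to compute them (total length, scaffold count, N50, largest scaffold, % Ns) from a set of scaffold sequences. It is not the pipeline used for the slides: the function name `assembly_stats` and the toy sequences are made up for illustration. The second part compares the N50 values reported above, assuming the first group of rows is REL-1412 and the second is REL-1412+Dovetail, in the order the labels appear.

```python
from typing import Dict, List


def assembly_stats(scaffolds: List[str]) -> Dict[str, float]:
    """Summarise scaffolds: total length, count, N50, largest, % Ns.

    Hypothetical helper for illustration only.
    """
    lengths = sorted((len(s) for s in scaffolds), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("no sequence supplied")

    # N50: walking from the longest scaffold down, the length of the
    # scaffold at which the running sum first reaches half the total.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    # % Ns: share of bases that are gap/unknown characters.
    n_bases = sum(s.upper().count("N") for s in scaffolds)
    return {
        "total_bp": total,
        "scaffolds": len(lengths),
        "n50_bp": n50,
        "largest_bp": lengths[0],
        "percent_n": 100.0 * n_bases / total,
    }


# Toy input (invented): four scaffolds of 8, 6, 4 and 2 bp.
toy = ["ACGTNNAC", "ACGTAC", "ACGT", "AC"]
stats = assembly_stats(toy)
print(stats)

# N50 values (Mbp) as reported in the table above:
# (REL-1412, REL-1412+Dovetail), assuming that label order.
reported_n50 = {
    "CAST/EiJ": (0.644, 22.216),
    "PWK/PhJ": (0.390, 24.066),
    "SPRET/EiJ": (0.361, 23.475),
}
fold = {strain: after / before for strain, (before, after) in reported_n50.items()}
for strain, ratio in fold.items():
    print(f"{strain}: N50 increased about {ratio:.1f}-fold")
```

Note that N50 is sensitive to how short sequences are filtered, so values are only comparable when the same cut-off is applied to both assemblies.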

Page 9: Multiple mouse reference genomes and strain specific gene annotations

PacBio WGS alignments

➢ Proportion of WGS reads whose hits are in mixed orientations rather than all in one orientation (lower is better)
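
A minimal sketch of how such an orientation metric could be computed, assuming each alignment hit has already been reduced to a `(read_id, strand)` pair. The function name, the input format and the choice to count single-hit reads as "one orientation" are all assumptions made here, not details given in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def mixed_orientation_fraction(hits: Iterable[Tuple[str, str]]) -> float:
    """Fraction of reads whose hits fall on both strands.

    `hits` yields (read_id, strand) pairs with strand "+" or "-".
    Reads with a single hit count as consistent (an assumption).
    """
    strands: Dict[str, Set[str]] = defaultdict(set)
    for read_id, strand in hits:
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} for {read_id}")
        strands[read_id].add(strand)
    if not strands:
        raise ValueError("no hits supplied")

    # A read is "mixed" when its hits cover both "+" and "-".
    mixed = sum(1 for seen in strands.values() if len(seen) > 1)
    return mixed / len(strands)


# Toy hits (invented): r2 aligns in both orientations, the rest do not.
toy_hits = [
    ("r1", "+"), ("r1", "+"),
    ("r2", "+"), ("r2", "-"),
    ("r3", "-"),
    ("r4", "-"), ("r4", "-"),
]
fraction = mixed_orientation_fraction(toy_hits)
print(f"mixed-orientation reads: {fraction:.2%}")
```

With real data the strand would typically come from the aligner output, for example the reverse-strand bit of the SAM FLAG field.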

Page 10: Multiple mouse reference genomes and strain specific gene annotations

Complex regions - Nlrp1 paralogs

Figure panels: post-Dovetail vs. pseudo-chromosomes (pre-Dovetail)

➢ A dozen highly polymorphic complex loci

○ Major urinary proteins (MUPs), H2/MHC, IRG, Nlrp etc.

Jingtao Lilue, WTSI

Page 11: Multiple mouse reference genomes and strain specific gene annotations

Pseudo-chromosomes (REL-1509)

Strain        Length (Gbp)  Sequences (>2kb)  N50 (Mbp)  Largest (Mbp)  %N
129S1_SvImJ   2.73          7,153             134.54     202.56         0.15
A_J           2.63          4,687             129.07     194.20         0.11
AKR_J         2.71          5,954             132.98     199.99         0.13
BALB_cJ       2.63          3,824             129.64     194.91         0.11
C3H_HeJ       2.70          4,069             133.07     200.88         0.14
C57BL_6NJ     2.81          3,893             139.12     208.92         0.18
CAST_EiJ      2.65          2,976             133.75     200.42         0.14
CBA_J         2.92          5,465             144.78     216.63         0.21
DBA_2J        2.61          4,104             128.21     192.93         0.11
FVB_NJ        2.59          5,013             127.06     191.00         0.11
LP_J          2.73          3,498             135.16     203.66         0.16
NOD_ShiLtJ    2.98          5,551             147.35     223.33         0.23
NZO_HlLtJ     2.70          7,022             132.96     199.80         0.14
PWK_PhJ       2.60          5,085             127.27     191.61         0.11
SPRET_EiJ     2.63          5,405             131.95     198.85         0.11
WSB_EiJ       2.69          2,238             133.18     200.11         0.16

Page 12: Multiple mouse reference genomes and strain specific gene annotations

➢ Propose to make REL-1509 the first annotated reference genomes for the laboratory strains

Page 13: Multiple mouse reference genomes and strain specific gene annotations

Gene prediction approach

Evidence: Gencode M7 (C57BL/6J); RNA-Seq (strain specific)

Ian Fiddes, UCSC

Stefanie König, U. Greifswald

Mario Stanke, U. Greifswald

Page 14: Multiple mouse reference genomes and strain specific gene annotations

Gene prediction approach

➢ TransMap - utilise as much of the Gencode C57BL/6J genome annotation as possible

○ Local Augustus - refine the liftover to allow small adjustments based on strain-specific RNA-Seq

Figure labels: TransMap; TransMap+local Augustus

Evidence: Gencode M7 (C57BL/6J); RNA-Seq (strain specific)

Ian Fiddes, UCSC

Stefanie König, U. Greifswald

Mario Stanke, U. Greifswald

Page 15: Multiple mouse reference genomes and strain specific gene annotations

How many genes have at least one fully correct transcript?

Ian Fiddes, UCSC
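
The question on this slide amounts to a per-gene "any" over transcripts. A minimal sketch of that tally follows, assuming each lifted-over transcript has already been classified as fully correct or not. How that classification is made is not described in the slides, and the record layout `(gene_id, transcript_id, fully_correct)` and the function name are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record: (gene_id, transcript_id, fully_correct)
Record = Tuple[str, str, bool]


def genes_with_correct_transcript(records: Iterable[Record]) -> Tuple[int, int]:
    """Return (genes with >= 1 fully correct transcript, total genes)."""
    all_genes: Set[str] = set()
    ok_genes: Set[str] = set()
    for gene_id, _transcript_id, fully_correct in records:
        all_genes.add(gene_id)
        if fully_correct:
            # One correct transcript is enough for the gene to count.
            ok_genes.add(gene_id)
    return len(ok_genes), len(all_genes)


# Toy records (invented): geneB has no fully correct transcript.
toy_records = [
    ("geneA", "tx1", False),
    ("geneA", "tx2", True),
    ("geneB", "tx3", False),
    ("geneC", "tx4", True),
    ("geneC", "tx5", True),
]
n_ok, n_total = genes_with_correct_transcript(toy_records)
print(f"{n_ok} of {n_total} genes have at least one fully correct transcript")
```

Running the same tally per strain would give the kind of per-strain comparison the slide title suggests.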

Page 16: Multiple mouse reference genomes and strain specific gene annotations

Gene prediction approach

➢ TransMap - lift over as much of the Gencode C57BL/6J genome annotation as possible

○ Local Augustus - refine the liftover to allow small adjustments based on strain-specific RNA-Seq

➢ Comparative gene prediction: Augustus CGP

○ Generate gene predictions based primarily on RNA-Seq evidence

○ Allows for predictions of new transcripts and exons absent in C57BL/6J

Figure labels: TransMap; TransMap+local Augustus; Augustus CGP

Evidence: Gencode M7 (C57BL/6J); RNA-Seq (strain specific)

Ian Fiddes, UCSC

Stefanie König, U. Greifswald

Mario Stanke, U. Greifswald

Page 17: Multiple mouse reference genomes and strain specific gene annotations

Gene prediction approach

➢ TransMap - utilise as much of the Gencode C57BL/6J genome annotation as possible

○ Local Augustus - refine the liftover to allow small adjustments based on strain-specific RNA-Seq

➢ Comparative gene prediction: Augustus CGP

○ Generate gene predictions based primarily on RNA-Seq evidence

○ Allows for predictions of new transcripts and exons absent in C57BL/6J

Figure labels: TransMap; TransMap+local Augustus; Augustus CGP; Consensus gene set

Evidence: Gencode M7 (C57BL/6J); RNA-Seq (strain specific)

Ian Fiddes, UCSC

Stefanie König, U. Greifswald

Mario Stanke, U. Greifswald

Page 18: Multiple mouse reference genomes and strain specific gene annotations

Efcab13-Efcab3 hybrid

Stefanie König, U. Greifswald

Charlie Steward, WTSI

Page 19: Multiple mouse reference genomes and strain specific gene annotations

What about human?

Page 20: Multiple mouse reference genomes and strain specific gene annotations

Efcab13-Efcab3 hybrid

NOT VALIDATED (YET)!

Stefanie König, U. Greifswald

Charlie Steward, WTSI

Page 21: Multiple mouse reference genomes and strain specific gene annotations

Dnah14: dynein, axonemal, heavy chain 14

Stefanie König, U. Greifswald

Charlie Steward, WTSI

Page 22: Multiple mouse reference genomes and strain specific gene annotations

Gene extensions - Dnah14

NOT VALIDATED (YET)!

Stefanie König, U. Greifswald

Charlie Steward, WTSI

Page 23: Multiple mouse reference genomes and strain specific gene annotations

Complex regions - Nlrp1 paralogs

Jingtao Lilue, WTSI

Figure labels: C57BL/6J; PWK/PhJ; PWK/PhJ assembly

Page 24: Multiple mouse reference genomes and strain specific gene annotations

How can I look at the genomes?

http://hgwdev-mus-strain.sdsc.edu

Mark Diekhans, UCSC

Ian Fiddes, UCSC

Page 26: Multiple mouse reference genomes and strain specific gene annotations

Change co-ordinate system to strain of interest

http://hgwdev-mus-strain.sdsc.edu

Mark Diekhans, UCSC

Ian Fiddes, UCSC

Page 27: Multiple mouse reference genomes and strain specific gene annotations

How can I look at the genomes?

Developed and maintained by the Genome Reference Informatics Team

http://mice-geval.sanger.ac.uk

Kerstin Howe, WTSI

Page 28: Multiple mouse reference genomes and strain specific gene annotations

Acknowledgements

➢ Wellcome Trust Sanger Institute

○ Anthony Doran, Kim Wong, Dirk-Dominik Dolle, Jingtao Lilue, Monica Abrudan

○ David Adams, Richard Durbin, Kerstin Howe, Jennifer Harrow, Charles Steward, Mark Thomas, Ruth Bennet, Jo Wood, James Torrance, Will Chow, Mike Quail, Matt Dunn, Marcela Sjoberg, James Gilbert, Ed Griffiths, Anne Ferguson-Smith

➢ UCSC

○ Benedict Paten, Joel Armstrong, Mark Diekhans, Dent Earl, Ian Fiddes

➢ EBI

○ David Thybert, Duncan Odom, Paul Flicek

➢ University of Greifswald

○ Mario Stanke, Stefanie König

➢ Salk Institute

○ Son Pham, Mikhail Kolmogorov

➢ Yale

○ Fabio Navarro, Cristina Sisu, Mark Gerstein

➢ Wellcome Trust Centre for Human Genetics

○ Jonathan Flint, Richard Mott, Leo Goodstadt

➢ Jackson Laboratory

○ Laura Reinholdt, Anne Czechanski

➢ URLs

○ http://www.sanger.ac.uk/science/data/mouse-genomes-project

○ http://hgwdev-mus-strain.sdsc.edu

○ http://mice-geval.sanger.ac.uk/index.html

2014-2017 2015-2018

Sequence Variation Infrastructure Group, WTSI

Page 29: Multiple mouse reference genomes and strain specific gene annotations

BioNano genomics optical mapping

Page 30: Multiple mouse reference genomes and strain specific gene annotations

10kb mate-pair consistency