Ensembl: A Genomic Toolset for pigs, poultry, plants, pests, pathogens and pollinators
Paul Kersey
[email protected] 18.03.2015
A brief history of genome sequencing
• 1995 Haemophilus influenzae 1.8 Mb
• 1996 Saccharomyces cerevisiae 12 Mb
• 1999 Drosophila melanogaster 140 Mb
• 2001 Homo sapiens 3.1 Gb
• Sequencing technology is continuously improving, but (massively parallel) “next generation” techniques really were game-changers
Cost of Sequencing a Human Genome 2001-2013
A brief history of genome sequencing
• 2008-2015 1000 Genomes Project (2,500 human genomes)
• 2008-2015 1001 Genomes Project (1,001 Arabidopsis genomes)
• 2015-2019 Genomics England (100,000 human genomes)
What can we do with thousands of genome sequences?
• Statistical association of traits with markers
• Increased marker resolution to find causative variants
• Understand population structure and evolutionary processes
• Track epidemics
• Assay for known variation
• Environmental distribution
• Tool for managing crosses
• More genomes…
• More statistical power, find rarer causative alleles
Thousands of genomes – a tool for breeding
• Characterize germplasm of land races and wild relatives
• Understand what’s actually present in an existing line
• Find alleles associated with traits
• Combine genotyping with various (laboratory, greenhouse, field) phenotyping mechanisms, themselves increasingly automated and high-throughput
• Manage crosses
Everyone can do their own experiments, but…
• EMBL-EBI would like to maintain a catalogue of reference genomes and variants for all widely studied species
• Selected lines can be re-phenotyped and analysed against the same reference data
• One major challenge: organising the pan-genome
• No single genome is enough to serve as a reference for many species
• Variants, functional elements present in some strains but not in the reference
• Reference is still a useful concept, but it needs to be extended: “choose your own reference” according to need
Phenotyping data
• Immensely varied
• Dependent on an environment (GxPxE)
• Anything you can measure – from molecular assays to in-field imaging
• Increasing use of structured controlled vocabularies for human readable, inter-operable data summaries
• Metadata is critical
• What has been assayed?
• Where was it assayed?
• How has it been assayed?
The EBI mission
• EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.
• We also coordinated the ELIXIR pilot phase and are hosting the ELIXIR hub
EBI provides…
• Structured archives (and associated submission services) for most major types of molecular biological data
• e.g. European Nucleotide Archive (part of the ENA-GenBank-DDBJ International Nucleotide Sequence Database Collaboration)
• European Variation Archive – now accepting submissions in VCF format
• ArrayExpress, PRIDE, MetaboLights
• Integrative, interpreted services providing access to that data in a biologically meaningful context
• e.g. Ensembl
[email protected] ELIXIR Innovation and SME Forum Wageningen 18th-19th March 2015
Ensembl
• A modular suite of software for genome analysis and visualisation developed jointly by the Wellcome Trust Sanger Institute and the European Bioinformatics Institute
• Now used for genomes from across the taxonomic space
• Offers a standard set of interfaces to a wide range of genome-scale data, including:
• Web-based GUI
• Public MySQL server
• Perl and RESTful APIs
• FTP
• Data mining tool (constructed using the BioMart framework) with its own set of interfaces: web GUI, web services, command line and local client
vertebrates, metazoa, plants, protists, fungi, bacteria
Agriculturally relevant species in Ensembl
• Farm animals
• Crop plants
• Pests
• Vectors
• Pollinators
• Pathogens
• Symbionts and commensals
Gene tree pipeline
Orthologs & Paralogs
Take canonical protein for each gene belonging to one Ensembl Genomes clade
Cluster: WU-BLASTP + Smith-Waterman all-versus-all, hcluster_sg
Align: multiple aligners consensified by M-Coffee
Build trees: PhyML-WAG + PhyML-HKY + NJ-p + NJ-dN + NJ-dS + species tree → TreeBeST-merge
Infer orthologues and paralogues
Orthologues and paralogues
Orthologues:
Any gene pairwise relationship where the ancestor node is a speciation event
Paralogues:
Any gene pairwise relationship where the ancestor node is a duplication event
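The two definitions can be sketched as a lookup of the most recent common ancestor of two genes in a gene tree whose internal nodes are labelled as speciation or duplication events. The tree, the gene names and the `Node` class below are hypothetical; this illustrates the rule only and is not the Ensembl gene tree pipeline itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" on internal nodes
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes from the root down to the named leaf, or None."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub_path = path_to(child, leaf_name)
        if sub_path is not None:
            return [root] + sub_path
    return None


def classify(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify a gene pair by the event at their most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise ValueError("both genes must be leaves of the tree")
    ancestor = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        ancestor = node_a  # last node shared by both root-to-leaf paths
    return "paralogue" if ancestor.event == "duplication" else "orthologue"


# Hypothetical tree: a speciation at the root, then a duplication in one lineage.
tree = Node("root", "speciation", [
    Node("pig_dup", "duplication", [Node("pig_A1"), Node("pig_A2")]),
    Node("chicken_A"),
])

print(classify(tree, "pig_A1", "chicken_A"))
print(classify(tree, "pig_A1", "pig_A2"))
```

The finer-grained types (one2one, one2many and so on) would additionally need the number of genes per species under the ancestor node, which this sketch does not attempt.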
Orthology / paralogy types
• ortholog_one2one
• ortholog_one2many
• ortholog_many2many
• apparent_ortholog_one2one
• possible_ortholog (weakly supported duplication node)
• within_species_paralog
• other_paralog (too distant to be in the same tree)
• contiguous_gene_split (artefact)
• putative_gene_split (artefact)
Pairwise whole genome alignments & synteny
• Only for certain combinations of species
• Generated using (B)LASTz-net
Synteny
• Organisms of relatively recent divergence show similar blocks of genes in the same relative positions in the genome
• Shows how the genome is “cut and pasted” in the course of evolution
• Calculated using pairwise whole genome alignments
• Only for certain combinations of species
Ensembl
• Ensembl supports many livestock species
• Ensembl provides automatic gene annotation for these species
• Ensembl works with Havana to support manual annotation in pigs
• Ensembl provides Variation databases and functional annotation where the data exists
• Ensembl is playing an active role in FAANG and will integrate the functional data generated as it becomes available
FAANG
• Functional Annotation of Animal Genomes
• High quality transcriptomic and regulatory annotation of Animal Genomes
• Open Data released pre-publication
• Common data and analysis standards
• EBI is leading the establishment of infrastructure for data sharing and standards
• http://www.faang.org
The bread wheat genome
• Large – haploid genome size is > 5 Gb
• But in fact, the genome is an allohexaploid (triploid genome size ~ 16 Gb)
• Each diploid genome is quite homozygous
Evolution of hexaploid bread wheat
The bread wheat genome
• Genome has been sequenced by Illumina after chromosome sorting
• Assembly is fragmented, but gene models are broadly comparable to other grasses
• Chromosome 3B has been sequenced BAC-by-BAC
Wheat in Ensembl Plants
• We represent the IWGSC chromosome survey sequence with the addition of the “finished” 3B sequence.
• We also use PopSeq data (from IPK, Gatersleben) to group scaffolds into bins based on genetic locations
1:1 orthology calls over 19 cereals including the three sub-genomes of bread wheat
Polymorphism data for bread wheat
• ~900,000 SNPs provided by CerealsDB, as follows:
• The Axiom 820K SNP Array contains 820,000 SNPs of which ~684,000 have been mapped.
• The iSelect 80K Array contains over 80,000 SNP loci of which ~58,000 have been mapped.
• The KASP probeset contains ~3,900 SNP loci of which ~3,100 have been loaded in Ensembl Plants
Inter-homoeologous variants
Genome combination    Mismatch length (in reference genome), bp    Alignment length, bp    % mismatch
B on A                2,881,969                                    41,739,915              6.90
D on A                2,665,562                                    43,228,044              6.17
A on B                2,892,005                                    41,749,951              6.93
D on B                2,739,967                                    44,238,039              6.34
A on D                2,689,840                                    43,252,322              6.22
B on D                2,745,993                                    43,244,065              6.35
Mismatch is defined as the length on the reference not matched in the non-reference
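As a quick check on how the last column relates to the other two, here is a minimal sketch, assuming the percentage is simply the mismatch length divided by the alignment length. The function name is illustrative and the input is the B on A row.

```python
def percent_mismatch(mismatch_bp: int, alignment_bp: int) -> float:
    """Mismatch length as a percentage of alignment length (assumed definition)."""
    if alignment_bp <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * mismatch_bp / alignment_bp


# B on A row: 2,881,969 bp mismatched over 41,739,915 bp aligned.
b_on_a = percent_mismatch(2_881_969, 41_739_915)
print(f"B on A: {b_on_a:.2f}%")
```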
Bread wheat whole genome alignment
• DNA-DNA pairwise alignments with LASTZ
• Brachypodium distachyon: 617,996,145 bp (14% of bread wheat) in 1,310,922 blocks
• Hordeum vulgare: 423,284,874 bp (9% of bread wheat) in 2,902,234 blocks
• Oryza sativa Japonica: 312,857,683 bp out of 4,460,951,632 bp (7% of bread wheat) in 718,036 blocks
Additional alignment data for bread wheat
• Repbase repeats
• Triticeae repeats from TREP
• Wheat RNA-Seq, ESTs, and UniGene datasets have been aligned to the Triticum aestivum genome:
• 454 RNA-seq data for the following INSDC studies: SRP02455 (Akhunova et al.), ERP001415 (Brenchley et al.), SRP004502
• Sequences from TriFLDB
• Transcriptome assembly from diploid einkorn wheat Triticum monococcum (Fox et al.)
Diploid progenitors of bread wheat
• Aegilops tauschii (DD) and Triticum urartu (AA) are also included in Ensembl Plants
• In addition, we have RNA-seq data from Triticum monococcum (AA)
• These genomes have been aligned to rice and barley
• Relevant RNA-seq reads have also been aligned
Accessing Ensembl Data Programmatically
5 easy methods
Access method 1: FTP downloads
• ftp://ftp.ensemblgenomes.org/pub/
• http://plants.ensembl.org/info/data/ftp/index.html
• Genomic, cDNA and protein sequence (FASTA)
• Annotated sequence (EMBL / GenBank)
• Gene sets (GTF)
• Resequencing alignments of individuals / strains (EMF)
• Whole-genome multiple alignments (EMF)
• Gene-based multiple alignments (EMF)
• Constrained elements (BED)
• Database dumps (MySQL)
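A minimal sketch of browsing the FTP site from a script, using Python's standard `ftplib`. It assumes the server at the URL above still accepts anonymous FTP logins; the helper names are illustrative, and the directory layout below `/pub/` is not assumed here.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url: str):
    """Split an ftp:// URL into (host, path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path or "/"


def list_directory(url: str, timeout: float = 30.0):
    """Return the names found in the directory the URL points at."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # no arguments means an anonymous login
        ftp.cwd(path)
        return ftp.nlst()


host, path = split_ftp_url("ftp://ftp.ensemblgenomes.org/pub/")
print(host, path)
# To list the directory (needs network access):
# print(list_directory("ftp://ftp.ensemblgenomes.org/pub/"))
```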
[email protected] Gramene Workshop, Plant and Animal Genomes XIII 18.03.2015
Access method 2: mySQL
• MySQL: an open-source relational database management system (RDBMS)
• Used as the back end to support most Ensembl pipelines and applications
• You get the database from http://mysql.com and install locally
• On the Ensembl Genomes FTP site, you can download the Ensembl schema as a .sql file.
• You can also download the data files
/data/mysql/bin/mysql -u mysqldba
create database zea_mays_core_24_77_6;
exit;
/data/mysql/bin/mysql -u mysqldba zea_mays_core_24_77_6 < zea_mays_core_24_77_6.sql
/data/mysql/bin/mysqlimport -u mysqldba --fields_escaped_by=\\ zea_mays_core_24_77_6 -L *.txt
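When several databases need loading, the steps above could be scripted. The sketch below only assembles the shell command strings and does not run anything. The function name and its defaults are illustrative, the install path and user name are copied from the example, the schema file is assumed to be named `<database>.sql`, and `mysql -e` stands in for the interactive `create database` session.

```python
import shlex


def load_commands(db: str, user: str = "mysqldba", bin_dir: str = "/data/mysql/bin"):
    """Return the shell commands that create a database, load the schema, then load the data."""
    mysql = f"{bin_dir}/mysql"
    create = shlex.quote(f"CREATE DATABASE {db}")
    return [
        # 1. create an empty database
        f"{mysql} -u {user} -e {create}",
        # 2. load the schema (.sql file from the FTP site)
        f"{mysql} -u {user} {db} < {db}.sql",
        # 3. bulk-load the tab-delimited table dumps from the current directory
        f"{bin_dir}/mysqlimport -u {user} --fields_escaped_by=\\\\ {db} -L *.txt",
    ]


cmds = load_commands("zea_mays_core_24_77_6")
for cmd in cmds:
    print(cmd)
```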
Access method 3: Ensembl Perl API
• Mature, fully featured Perl API (Applications Programming Interface) for Ensembl resources
• Perl: a commonly used programming language in bioinformatics, designed to make “easy things easy and hard things possible”
• Provides access to:
• Genomic sequence
• Genome features e.g. genes, translations
• Annotation e.g. cross-references
• http://www.ensembl.org/info/docs/api/index.html
Access method 4: REST API
• REpresentational State Transfer
• is an abstraction of the architecture of the World Wide Web; more precisely, REST is an architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within a distributed hypermedia system. REST ignores the details of component implementation and protocol syntax in order to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements (Wikipedia)
• A style for structuring URLs (i.e. web addresses) according to the content they contain
• RESTful web service or RESTful web API
• Allows users to access data simply by invoking the URL
• Often returns a data structure defined in a simple grammar (e.g. JSON) which can be imported into an object in any programming language
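A minimal sketch of "invoking the URL" and importing the JSON into a native object, using only the Python standard library. The base URL `https://rest.ensembl.org` and the `/lookup/id/` endpoint are not given in the slides and are assumptions here, as is the stable ID, which is purely illustrative; check the REST documentation for the server that hosts your species.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://rest.ensembl.org"  # assumption: not stated in the slides


def lookup_url(stable_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL that identifies one annotated object by its stable ID."""
    query = urlencode({"content-type": "application/json"})
    return f"{base_url}/lookup/id/{stable_id}?{query}"


def fetch_json(url: str, timeout: float = 30.0):
    """GET the URL and decode the JSON body into Python dicts and lists."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = lookup_url("ENSG00000157764")  # illustrative stable ID
print(url)
# record = fetch_json(url)  # needs network access; returns a dict
```

Keeping URL construction separate from the network call makes the URL scheme easy to inspect and test without contacting the server.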
Access method 5: BioMart
• A generic tool to facilitate the design and query of data warehouses
• Data warehouses are databases designed to optimise the performance of certain commonly performed queries
• May be less flexible than a normalised schema
• Less suitable for maintaining primary data (harder to automatically define constraints due to the form of the data model)
• Nonetheless, can still be implemented within an RDBMS
• BioMart uses MySQL
• We have gene-centric and variant-centric BioMarts for all Ensembl divisions
• BioMart comes with its own web interface
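A toy illustration of the warehouse trade-off described above, using an in-memory SQLite database rather than MySQL. The table and column names are hypothetical and are not the Ensembl or BioMart schema. The normalised tables answer the question with a join, while the warehouse-style table stores the pre-computed answer to one commonly asked question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised schema: one fact per row, linked by keys.
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    gene_id INTEGER REFERENCES gene(gene_id),
    biotype TEXT
);
INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
INSERT INTO transcript VALUES
    (10, 1, 'protein_coding'), (11, 1, 'protein_coding'), (12, 2, 'ncRNA');

-- Warehouse-style table: the join is done once, ahead of query time.
CREATE TABLE gene_mart AS
SELECT g.stable_id, COUNT(t.transcript_id) AS transcript_count
FROM gene g LEFT JOIN transcript t ON t.gene_id = g.gene_id
GROUP BY g.gene_id, g.stable_id;
""")

# Same question, asked of each model.
normalised = conn.execute("""
    SELECT g.stable_id, COUNT(t.transcript_id)
    FROM gene g LEFT JOIN transcript t ON t.gene_id = g.gene_id
    GROUP BY g.gene_id, g.stable_id
    ORDER BY g.stable_id
""").fetchall()
warehouse = conn.execute(
    "SELECT stable_id, transcript_count FROM gene_mart ORDER BY stable_id"
).fetchall()

print(normalised)
print(warehouse)
```

The warehouse table is quicker to query but goes stale if the transcript table changes, which is one reason such a model is less suitable for maintaining primary data.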
BioMart Web UI
Access Method 6: Virtual Machines
• Download an environment containing all of Ensembl to run on your machine
• In effect, you are downloading/running a model of a computer
• As long as your computer can support running the VM, there should be no problem with library incompatibilities etc. - all the resources Ensembl needs are within the VM
• Increasingly, a model of choice for running web-based services (e.g. in cloud environments) – you don’t deploy into a platform, you deploy a whole platform
• We use VirtualBox, an open source virtualisation platform
• http://ensemblgenomes.org/info/access/virtual_machine
Funding
• Ensembl Genomes is funded by
• EMBL
• EU (INFRAVEC, Microme, transPLANT, AllBio)
• BBSRC (PhytoPath, wheat/barley/midge sequencing, UK-US collaboration, RNAcentral)
• Wellcome Trust (PomBase)
• NIH/NIAID (VectorBase)
• NSF (Gramene collaboration)
• Bill and Melinda Gates Foundation (wheat rust)
People
• James Allen, Irina Armean, Dan Bolser, Bruce Bolt, Mikkel Christensen, Paul Davis, Thomas Down, Christoph Grabmueller, Kevin Howe, Arnaud Kerhornou, Julia Khobdova, Eugene Kulesha, Nick Langridge, Dan Lawson, Mark McDowall, Uma Maheswari, Gareth Maslen, Michael Nuhn, Chuang Kee Ong, Michael Paulini, Helder Pedro, Anton Petrov, Dan Staines, Brandon Walts, Gary Williams
• The vertebrate genomics team @ EBI (Paul Flicek)