Plant and Animal Genome Sequencing and Bioinformatics



Post on 14-Aug-2020



[email protected] 888.440.0954 www.medgenome.com

Overview

Over the last decade, rapid improvements in DNA sequencing technology have transformed the field of genomics, and sequencing is now an essential tool in genetic and clinical research. The ability to generate chromosome-level genomes or to sequence specific genomic regions of interest has major implications for applications including development, disease, evolution, and crop breeding. At MedGenome, we have leveraged our experience in genomics technologies to develop a comprehensive, customizable service for high-quality genome assembly and annotation.

Figure 3. A) Starting from a variety of tissues or cells, high-molecular-weight DNA is extracted for long- and short-read library preparation and sequenced at the recommended depth. Tissues or cells are also processed for Hi-C library preparation and sequenced to about 100x depth of the genome. Long and short reads are used to generate an initial draft genome assembly, which is assessed for accuracy and contiguity. This draft genome is then scaffolded using Hi-C data. B) Gene models are predicted in the final genome using several types of evidence, such as transcriptome and protein sequence data, along with ab initio gene prediction tools. The predicted gene models are assessed for quality and refined over multiple iterations. C) The final gene model set is then annotated using publicly available databases for homology-based function assignment.
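The depths quoted in the caption translate into sequencing yield by simple arithmetic: total bases = genome size × depth. The short Python sketch below illustrates that calculation. The function names, the 2 Gb genome size and the 150 bp read length are illustrative assumptions, not values taken from MedGenome's workflow.

```python
import math


def bases_required(genome_size_bp: int, depth: float) -> int:
    """Total sequenced bases needed for a given mean depth over the genome."""
    return round(genome_size_bp * depth)


def reads_required(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Number of reads of a fixed length needed to reach the target depth."""
    return math.ceil(genome_size_bp * depth / read_length_bp)


if __name__ == "__main__":
    genome_size = 2_000_000_000  # hypothetical 2 Gb genome (assumption)
    hic_depth = 100              # "about 100x" Hi-C depth from the caption
    read_length = 150            # hypothetical short-read length (assumption)

    print("Bases needed:", bases_required(genome_size, hic_depth))
    print("Reads needed:", reads_required(genome_size, hic_depth, read_length))
```

Real projects would also allow for duplicate, low-quality and unmapped reads, so the figures from such a calculation are a lower bound.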

Plant and Animal Genome Sequencing and Bioinformatics Solutions at MedGenome

[Figure 3 diagram; panel labels follow]

Platform       Coverage
Long-reads     50-100x
Short-reads    50x
Hi-C           100x

A. Assembly
Workflow labels: Long read data (PacBio/ONT), ~60-100X; Illumina short reads (~100X); Dovetail Hi-C; Contigs; Corrected; Super-scaffolded contigs

B. Structural Annotation
• Sequence labels: Final contigs; Final scaffolds
• Evidence: RNA-Seq; protein sequences from closely related species; protein sequences from distantly related species
• Repeat library: RepeatModeler, RepeatMasker
• Assembled transcripts: Trinity
• SNAP HMM model: MAKER (SNAP training, 3 iterations)
• Augustus HMM model: BUSCO -long
• Gene structure annotation: MAKER (SNAP, Augustus, Exonerate)
• Gene function assignment: MAKER (BlastP, InterProScan)
• Annotation quality: BUSCO; STAR, featureCounts, DESeq2

C. Functional Annotation
• Input: GFF annotation, Iso-Seq transcripts
• TransDecoder ORF prediction: amino acid sequences
• Diamond BlastP, input sequences vs. NCBI NR: NCBI NR DB hits
• Diamond BlastP, NCBI NR hits vs. input sequences: reverse hits
• Get reciprocal best hits: >= 95% of best bitscore both ways
• Map NCBI hit accessions to KEGG, Gene Ontology, UniProtKB, NCBI Gene, Ensembl: mapping DB hits
• InterProScan (CATH-Gene3D, CDD, MobiDB, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SFLD, SMART, SUPERFAMILY, TIGRFAM): InterPro DB hits
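The reciprocal-best-hit step in panel C ("&gt;= 95% of best bitscore both ways") can be sketched as follows. This is one plausible reading of that label, not MedGenome's implementation: a pair is kept when the forward hit scores at least 95% of the query's best bitscore and the reverse hit scores at least 95% of the subject's best bitscore. The function names and the (query, subject, bitscore) tuples are hypothetical; in practice they would be parsed from Diamond's tabular output.

```python
from collections import defaultdict


def near_best_hits(hits, fraction=0.95):
    """Map each query to the subjects scoring >= fraction of its best bitscore."""
    best = defaultdict(float)
    for query, _subject, bitscore in hits:
        best[query] = max(best[query], bitscore)

    kept = defaultdict(set)
    for query, subject, bitscore in hits:
        if bitscore >= fraction * best[query]:
            kept[query].add(subject)
    return kept


def reciprocal_best_hits(forward, reverse, fraction=0.95):
    """Return sorted (query, subject) pairs that are near-best in both directions."""
    fwd = near_best_hits(forward, fraction)
    rev = near_best_hits(reverse, fraction)
    return sorted(
        (query, subject)
        for query, subjects in fwd.items()
        for subject in subjects
        if query in rev.get(subject, ())
    )


if __name__ == "__main__":
    # Toy data (assumption): input sequences vs. NCBI NR, then NR hits vs. inputs.
    forward = [
        ("geneA", "protX", 200.0),
        ("geneA", "protY", 195.0),
        ("geneA", "protZ", 100.0),
        ("geneB", "protZ", 80.0),
    ]
    reverse = [
        ("protX", "geneA", 198.0),
        ("protY", "geneC", 300.0),
        ("protY", "geneA", 150.0),
        ("protZ", "geneB", 80.0),
    ]
    print(reciprocal_best_hits(forward, reverse))
```

Using a tolerance below 100% keeps near-ties that a strict best-hit rule would discard, at the cost of sometimes returning more than one partner per gene.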


Suryamohan and colleagues report a de novo near-chromosomal genome assembly of Naja naja, the Indian cobra, a highly venomous, medically important snake. With a scaffold N50 of 223.35 Mb, this is the most contiguous reptilian genome published to date. A comprehensive annotation pipeline identified 23,248 predicted protein-coding genes, of which 12,346 were expressed in the venom gland and constitute the ‘venom-ome’. Gene models specific to venom genes were generated to identify 139 genes from 33 toxin families. These results will aid synthetic venom production through recombinant toxin expression and the rapid development of safe and effective synthetic antivenom. Additionally, this genome will serve as a reference for snake genomes, support evolutionary studies and enable venom-driven drug discovery.

Link to Nature Article: https://www.nature.com/articles/s41588-019-0559-8

Highlights of our Services and Bioinformatics Capabilities

• Support with experimental design and selection of appropriate genome assembly workflow

• High-molecular-weight DNA extraction (blood, flash-frozen tissue, plants)

• End-to-end solution in library preparation, sequencing, genome assembly and annotation

Analysis Deliverables

• Raw data (FASTQ for genome sequencing, BAM files from Hi-C)

• Chromosomal-level scaffolded genome assembly and assembly statistics

• Repetitive element annotation and statistics

• Genome annotation and quality assessment statistics
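Assembly statistics of the kind listed above usually include N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that definition follows; the function name and the toy scaffold lengths are hypothetical and not part of the deliverables.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    scaffold_lengths = [80, 70, 50, 40, 30, 20]  # toy values (assumption)
    print("N50:", n50(scaffold_lengths))
```

Comparing `running * 2` with `total` avoids floating-point division when the total length is odd.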

Advanced Deliverables

• Custom gene family annotation

• Differential gene expression analysis

• Whole genome comparative analysis


Case Study: The Indian cobra reference genome and transcriptome enables comprehensive identification of venom toxins

Figure 2. A) Circos plot of the chromosomal Indian cobra genome assembly depicting the chromosomes, repetitive content, GC content and other characteristics. B) Transcriptome data overlaid on the high-quality genome allowed identification of genes primarily expressed in the Indian cobra venom gland.

[Project timeline diagram]

Stages shown: Genome and transcriptome sequencing; Genome and transcriptome assembly; Annotation; Review data; Deliver data to customer

Durations shown: 8-10 weeks; 8-10 weeks; 8-10 weeks; 2 weeks