The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
DESCRIPTION
GRC Workshop at Churchill College on Sep 21, 2014. This is Paul Kitts's talk describing the NCBI approach to annotating the full human reference assembly.
TRANSCRIPT
GRC Assembly Analysis Workshop At Genome Informatics
September 21, 2014
The NCBI Eukaryotic Genome Annotation Pipeline And Alternate Genomic Sequences
Paul Kitts, NCBI
National Center for Biotechnology Information
Genomes Annotated By NCBI
Human      GRCh38     2014-02-03
Zebrafish  GRCz10     in progress
Mouse      GRCm38.p2  2013-12-27
Outline
• Overview of the NCBI Eukaryotic Genome Annotation Pipeline
• What to do with alternate loci & patch scaffolds?
• How we use the alt/patch/PAR alignments to inform our annotation
• Examples:
  – Annotation only on alternate loci
  – Different alleles annotated on primary assembly and alternate loci
  – Annotation improved by patches
  – Pseudoautosomal Regions annotated consistently on X & Y
• Recent enhancements:
  – Using RNA-Seq evidence for gene prediction
  – Gap-filling gene models using transcript sequences
  – Annotation reports
Eukaryotic Genome Annotation Pipeline Overview
Ranking Alignments
• Rank alignments for each query sequence
  – using a quality score that combines identity & coverage
  – Rank-1 > Rank-2 > Rank-3…
• Conflicting alignments cannot have same rank
  – alignments of the same query sequence to an assembly conflict if they have significant overlap (>= 30%)
• A subset of rank-1 alignments is used for annotation
[Figure: spans in alignment A and alignment B, illustrating insignificant vs. significant overlap]
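The ranking rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the slide only says the quality score "combines identity & coverage", so the product used here is an assumption, as are the `Alignment` fields, the choice to measure overlap on the query span relative to the shorter alignment, and the example scores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:
    query: str       # e.g. "mRNA-F1"
    target: str      # e.g. "Chr1"
    q_start: int     # aligned span on the query, 0-based, half-open
    q_end: int
    identity: float  # fraction of identical aligned bases, 0..1
    coverage: float  # fraction of the query that is aligned, 0..1

def quality(a):
    # Assumption: a simple product; the real scoring formula is not given on the slide.
    return a.identity * a.coverage

def overlap_fraction(a, b):
    # Assumption: overlap is measured on the query, relative to the shorter span.
    shared = min(a.q_end, b.q_end) - max(a.q_start, b.q_start)
    shorter = min(a.q_end - a.q_start, b.q_end - b.q_start)
    return max(shared, 0) / shorter if shorter > 0 else 0.0

def conflicts(a, b, threshold=0.30):
    # Same query sequence and significant overlap (>= 30%).
    return a.query == b.query and overlap_fraction(a, b) >= threshold

def rank_alignments(alignments):
    """Best alignments first; each takes the lowest rank that holds no conflicting alignment."""
    ranks = {}
    for a in sorted(alignments, key=quality, reverse=True):
        rank = 1
        while any(r == rank and conflicts(a, other) for other, r in ranks.items()):
            rank += 1
        ranks[a] = rank
    return ranks

# Hypothetical alignments for two mRNAs against Chr1 and an unplaced scaffold.
f1_chr1 = Alignment("mRNA-F1", "Chr1", 0, 1000, 1.00, 1.00)
f1_scaf = Alignment("mRNA-F1", "Unplaced scaffold1", 0, 1000, 0.97, 0.90)
f2_scaf = Alignment("mRNA-F2", "Unplaced scaffold1", 0, 800, 0.99, 1.00)
f2_chr1 = Alignment("mRNA-F2", "Chr1", 0, 800, 0.95, 1.00)

ranks = rank_alignments([f1_chr1, f1_scaf, f2_scaf, f2_chr1])
rank1 = [a for a, r in ranks.items() if r == 1]  # candidates for annotation
```

Two alignments of the same query that overlap only slightly (for example the two halves of a transcript placed on different scaffolds) do not conflict, so both can be rank-1.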
Annotation Of A Simple Assembly Using Ranked Alignments
[Figure: step-by-step diagram. Panels: "Input mRNAs" (mRNA-F1, mRNA-F2); "Genes in the assembly" (labels GeneF1, GeneF2, Chr1, Unplaced scaffold1); "Align mRNAs"; "Rank alignments" (Rank-1, Rank-2, Rank-3); "Filter out alignments that are not rank-1"; "Resulting annotation"]
What to do with alternate loci & patch scaffolds?
1. Omit the alternate loci & patch scaffolds
2. Include the alternate loci & patch scaffolds; no special treatment
3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly
Annotation Omitting Alt-scaffolds
[Figure: panels "Input mRNAs" (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); "Genes/Alleles represented in the assembly" (Gene1/A, Gene1/B, Gene2, Gene3 across Primary Chr1, Alt-scaffold1, Alt-scaffold2); "Rank-1 mRNA alignments" (Primary Chr1 only); "Resulting annotation" (Primary Chr1 only)]
Scenario 1: no annotation for Gene3; no annotation for Gene1/Allele-B
Scenario 2: Gene3 annotated at the wrong location; no annotation for Gene1/Allele-B
Annotation Using Alt-scaffolds Without Alt-to-primary Alignments
[Figure: panels "Input mRNAs" (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); "Genes/Alleles represented in the assembly" (Gene1/A, Gene1/B, Gene2, Gene3 across Primary Chr1, Alt-scaffold1, Alt-scaffold2); "Rank-1 mRNA alignments"; "Resulting annotation" (labels include Gene1/A, Gene2, Gene3, Gene4 on Primary Chr1, Alt-scaffold1, Alt-scaffold2)]
Annotation Using Alt-scaffolds & Alt-to-primary Alignments
[Figure: panels "Input mRNAs" (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); "Genes/Alleles represented in the assembly" (Gene1/A, Gene1/B, Gene2, Gene3 across Primary Chr1, Alt-scaffold1, Alt-scaffold2, with an "alt-to-primary alignment"); "Rank-1 mRNA alignments"; "Resulting annotation" (labels include Gene1/A, Gene1/B, Gene2, Gene3 on Primary Chr1, Alt-scaffold1, Alt-scaffold2)]
Pros & cons of different choices for dealing with alternate loci & patch scaffolds
1. Omit the alternate loci & patch scaffolds
   Pros: Easy to implement.
   Cons: No representation for genes or alleles only on alts. Incorrect models for genes that have been patched.
2. Include the alternate loci & patch scaffolds; no special treatment
   Pros: Easy to implement.
   Cons: Incorrectly annotate genes that have alternate alleles or patches as if they were paralogs. Wrongly penalize sequences for having multiple or ambiguous placements.
3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly
   Pros: Genes only on alts are annotated. Correctly annotate genes with alternate alleles. Correctly annotate patched genes.
   Cons: Requires software and pipeline changes.
Eukaryotic Genome Annotation Pipeline: Steps using alt-to-primary alignments
[Figure: pipeline diagram; labels: "Alt-to-primary alignments", "Curated gene localization"]
Ranking Alignments Across Assembly Units
• Create graph of related alignments
  – Alignments that are collocated or mappable
  – Transcript/protein to genomic
  – Alt or patch scaffold to primary assembly
• Partition graph into clusters
  – Each alignment in the cluster is related to at least one other alignment in the same cluster
  – No alignment is related to any alignment in another cluster
  – Split conflicting alignments within a cluster into separate groups
  – Merge non-conflicting clusters into groups
• Evaluate groups, sort and assign ranks
  – All alignments in a group get the same rank
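The clustering idea above can be illustrated with a small union-find sketch. It is a simplification under stated assumptions: the `TxAlignment` fields, the `linked_loci` set standing in for alt-to-primary alignments, and the scores are all hypothetical, and the "split conflicting" and "merge non-conflicting" steps from the slide are left out. The point is only that alignments linked through an alt-to-primary relationship end up in one group and share a rank.

```python
from collections import defaultdict
from typing import NamedTuple

class TxAlignment(NamedTuple):
    query: str    # transcript, e.g. "mRNA-1A"
    unit: str     # assembly unit, e.g. "Primary Chr1"
    locus: str    # hypothetical label for the aligned region
    score: float  # hypothetical alignment quality

def rank_groups(alignments, linked_loci):
    """Cluster related alignments per query, then give every member of a group the same rank.

    linked_loci holds frozenset pairs of (unit, locus) joined by an alt-to-primary alignment.
    """
    parent = list(range(len(alignments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(alignments):
        for j in range(i + 1, len(alignments)):
            b = alignments[j]
            if a.query != b.query:
                continue
            here, there = (a.unit, a.locus), (b.unit, b.locus)
            # Related if collocated, or mappable through an alt-to-primary alignment.
            if here == there or frozenset((here, there)) in linked_loci:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, a in enumerate(alignments):
        groups[(a.query, find(i))].append(a)

    by_query = defaultdict(list)
    for (query, _root), members in groups.items():
        by_query[query].append(members)

    ranks = {}
    for query_groups in by_query.values():
        # Evaluate each group by its best member, sort, and assign ranks.
        query_groups.sort(key=lambda g: max(m.score for m in g), reverse=True)
        for rank, members in enumerate(query_groups, start=1):
            for m in members:
                ranks[m] = rank
    return ranks

a1 = TxAlignment("mRNA-1A", "Primary Chr1", "Gene1/A", 1.00)
a2 = TxAlignment("mRNA-1A", "Alt-scaffold1", "Gene1/B", 0.98)
a3 = TxAlignment("mRNA-3A", "Alt-scaffold2", "Gene3", 1.00)
links = {frozenset({("Primary Chr1", "Gene1/A"), ("Alt-scaffold1", "Gene1/B")})}

with_links = rank_groups([a1, a2, a3], links)
without_links = rank_groups([a1, a2, a3], set())
```

With the link in place, the primary and alt placements of mRNA-1A fall into one rank-1 group; without it, the alt placement competes with the primary one and drops to rank-2, which is the paralog-like treatment described for option 2.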
Ranked Alignment Groups Across Assembly Units
[Figure legend: Assembly unit; Assembly alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group. Assembly units shown: Assembly1-Primary, Assembly1-Alt1, Assembly1-Alt2. Rank groups: Rank-1, Rank-2]
Ranked Alignment Groups Across Assemblies
[Figure legend: Assembly unit; Assembly alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group. Assembly units shown: Assembly1-Primary, Assembly1-Alt1, Assembly1-Alt2, Assembly2-Primary, Assembly3-Primary. Rank groups: Rank-1, Rank-2]
Ranked Alignment Groups Across Pseudoautosomal Regions (PARs)
[Figure legend: Chromosome; PAR alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group. Chromosomes shown: Chromosome X and Chromosome Y, with PAR#1 and PAR#2. Rank group: Rank-1]
Genes Only Annotated On GRCh38 Alternate Loci
NCBI > Gene > "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter]
NCBI > Gene > "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]
Num.  Gene Type
20    Protein Coding
40    Protein Coding (model)
21    Pseudogene
32    Pseudogene (model)
32    ncRNA (model)
5     Other
3     Other (model)
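The Gene queries on this slide can also be sent to NCBI's E-utilities `esearch` endpoint rather than typed into the web interface. The sketch below only builds the request URL; the helper name `gene_search_url` and the `retmax` default are choices made for this example, and counts returned today will not necessarily match the 2014 numbers in the table.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query text copied from the slide, with straight quotes.
ALT_ONLY = (
    '"Homo sapiens"[orgn] AND '
    '"only annotated on alternate loci in reference assembly"[Text Word] AND '
    'gene_nucleotide_pos[filter]'
)
PROTEIN_CODING_KNOWN = (
    ALT_ONLY + ' AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]'
)

def gene_search_url(term, retmax=20):
    """Build an esearch URL for the Gene database (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term, "retmax": retmax})

url_all = gene_search_url(ALT_ONLY)
url_known_coding = gene_search_url(PROTEIN_CODING_KNOWN)
```

Either URL can be fetched with any HTTP client; the response lists the matching Gene IDs and a total count.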
Different Alleles Annotated On GRCh38 Primary Assembly And Alternate Loci
ALT_REF_LOCI_2
ALT_REF_LOCI_7
NM_001243042.1 comment: This variant represents the C*07:01:01:01 allele of the HLA-C gene.
NM_002117.5 comment: This variant represents the C*07:02:01 allele of the HLA-C gene.
Annotation Of GRCh37 Improved By Patch Scaffold
EPPK1 gene on primary assembly chromosome 8 has an internal deletion.
EPPK1 gene on patch scaffold is complete.
Primary Assembly chromosome 8
Patch scaffold HG104_HG975_PATCH
Pseudoautosomal Regions Annotated Consistently on GRCh38 chromosomes X & Y
Recent Enhancements To The Genome Annotation Pipeline: #1 Using RNA-Seq Evidence For Gene Prediction
[Figure: two bar charts, "Number of coding transcripts predicted +/- RNA-Seq" and "Number of genes predicted +/- RNA-Seq", comparing "Without RNA-Seq" and "With RNA-Seq" for Chicken, Cow, Horse, Human, Mouse, Pig, Rat, Soybean and Zebrafish]
75 organisms annotated with RNA-Seq data
Example Of Tracks Made Using RNA-Seq Data
NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]
Recent Enhancements To The Genome Annotation Pipeline: #2 Gap-filling Gene Models Using Transcript Sequences
How gap-filling works
[Figure: labels "Genomic sequence", "Gap", "Transcript alignment" (segments numbered 1 to 4), "RefSeq model"]
Reporting of gap-filled regions
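The slide shows gap-filling only as a diagram, so the following is a rough sketch of the general idea rather than the pipeline's method. It assumes a plus-strand transcript aligned to the genome in ungapped blocks, takes genomic sequence wherever the transcript aligns, and falls back to transcript sequence for the part that has no alignment because the genome has a gap there. The function name, block format and sequences are all hypothetical.

```python
def gap_fill(genome, transcript, blocks):
    """Build a model sequence and list the regions supplied by the transcript.

    blocks: (tx_start, tx_end, g_start, g_end) tuples, 0-based and half-open,
    one per aligned exon (assumed format for this sketch).
    Returns (model_sequence, filled), where filled holds (start, end) in model coordinates.
    """
    pieces, filled = [], []
    model_len = 0
    tx_pos = 0

    def add(seq, from_transcript):
        nonlocal model_len
        if not seq:
            return
        if from_transcript:
            filled.append((model_len, model_len + len(seq)))
        pieces.append(seq)
        model_len += len(seq)

    for tx_start, tx_end, g_start, g_end in sorted(blocks):
        add(transcript[tx_pos:tx_start], True)   # unaligned transcript: fill from transcript
        add(genome[g_start:g_end], False)        # aligned exon: use genomic sequence
        tx_pos = tx_end
    add(transcript[tx_pos:], True)               # unaligned 3' end, if any
    return "".join(pieces), filled

# Toy data: the middle exon of the transcript falls in a run of Ns in the genome.
genome = "ATGGCC" + "GTNNNNNNNNAG" + "GGATAA"
transcript = "ATGGCT" + "TTTCAC" + "GGATAA"    # one mismatch to the genome in exon 1
blocks = [(0, 6, 0, 6), (12, 18, 18, 24)]

model, filled = gap_fill(genome, transcript, blocks)
```

In this toy case the model keeps the genomic base in the first exon and takes only the middle segment from the transcript; the `filled` list is the kind of information a report of gap-filled regions would need.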
Recent Enhancements To The Genome Annotation Pipeline: #3 Annotation Reports
RNA-Seq
Summary
Including the alternate loci & patch scaffolds and using their known relationships to the primary assembly significantly improves the annotation of GRC assemblies.
It is worth the extra effort!
CREDITS
Genome pipeline infrastructure: Alex Astashyn, Nathan Bouk, Rob Cohen, Mike Dicuccio, Eric Engelson, Olga Ermoloeva, Wratko Hlavina, Lucian Ion, Avi Kimchi, Boris Kiryutin, David Managadze, Eyal Mozes, Terence Murphy, Daniel Rausch, Robert Smith, Sasha Souvorov, Craig Wallin, Alex Zasypkin
Eukaryotic annotation setup & execution: Françoise Thibaud-Nissen, Jinna Choi, Patrick Masterson, Kim Pruitt and the “genome champions” from the RefSeq group
Genomic Collections DB: Avi Kimchi, Victor Sapojnikov, Charlie Xiang, Andrey Zherikov
Genome assemblies with alt/patch to primary alignments: Genome Reference Consortium
  The Wellcome Trust Sanger Institute
  The Genome Institute at Washington University
  The European Bioinformatics Institute
  The National Center for Biotechnology Information
Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/