Araport Data Integration - 2015 UMD Minisymposium





Vivek Krishnakumar (1), Chia-Yi Cheng (1), Maria Kim (1), Erik Ferlanti (1), Irina Belyaeva (1), Seth Schobel (1), Sergio Contrino (3), Matthew R. Hanlon (2), Walter Moreira (2), Steve Mock (2), Joe Stubbs (2), Agnes P. Chan (1), Jason R. Miller (1), Matthew W. Vaughn (2), Gos Micklem (3), Christopher D. Town (1)

(1) J. Craig Venter Institute, Rockville, MD, USA; (2) Texas Advanced Computing Center, Austin, TX, USA; (3) Cambridge University, Cambridge, UK

Araport, the Arabidopsis Information Portal (https://www.araport.org), is an open-access online resource for the Arabidopsis research community, funded by the NSF and BBSRC. Since its inception in late 2013, the goal of Araport has been to provide users with a “one-stop-shop” through data federation. Araport exposes a searchable index of TAIR10 genomic data as well as additional datasets from UniProt (protein), BAR (expression), EPIC-CoGe (epigenomics), IntAct (interaction networks), ATTED-II (co-expression), PubMed (literature), and other diverse and geographically dispersed resources, using a combination of warehousing and state-of-the-art web technologies. Araport incorporates and integrates software from GMOD, including InterMine, JBrowse, GBrowse, WebApollo, Tripal, and Chado. Araport has inherited from TAIR the responsibility of providing continued access to up-to-date structural and functional annotation for the Col-0 genome. Later this year, the Araport11 annotation update will be released. It includes over 1,000 novel protein-coding gene loci and ~50k splice variants derived from ~28k gene loci, built from 11 tissue-specific bins of RNA-seq datasets spanning over 100 SRA accessions, as well as various classes of non-coding RNA.

Araport: Data Integration for the Arabidopsis Research Community

Araport (https://www.araport.org)

“One-stop-shop” for Arabidopsis data

ThaleMine report pages present a comprehensive set of data integrated from a variety of sources. The example report shows up-to-date information about EMBRYO DEFECTIVE 2770, such as GO annotations, publications, array-based expression, protein–protein interactions, metabolic pathways, and homologs in other plant species.

Araport11 Annotation Pipeline (flowchart; box labels listed in extraction order)

- 113 SRA accessions
- Binned by 11 Tissue/Organ
- TopHat Alignment to TAIR10
- Genome-Guided Trinity Assembly
- Binned by 11 Tissue/Organ
- De novo Trinity Assembly
- Concatenating De Novo Assembly and Genome-Guided Assembly for each Tissue/Organ
- 11 Transcriptomes Assembled by PASA
- Annotation Update by PASA
- Consolidating 11 Transcriptomes
- Re-indexing updated gene models
- Araport11 Protein-Coding Gene
- NCBI and MAKER-P Assembly
- Uniprot Protein
- Novel Transcribed Regions
- Filtering
- Novel Loci
- Appending Novel Transcripts to TAIR10
- Augmented TAIR10
- Unique Models
- Filtering
- Protein Alignment
- Literature
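The "Consolidating 11 Transcriptomes" step is described on this poster as a custom Python script that collapses isoforms differing in terminal UTR length. That script is not reproduced here. The following is only a minimal sketch of one way such a collapse could work, under the assumption that two isoforms are equivalent when they share chromosome, strand, CDS bounds, and intron chain, so that they can differ only in where the first and last exons end. The `Transcript` record, `intron_chain`, and `collapse_terminal_variants` are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical minimal transcript record (not the Araport11 data model)."""
    name: str
    chrom: str
    strand: str
    exons: tuple  # sorted (start, end) pairs, 1-based inclusive
    cds: tuple    # (cds_start, cds_end) in genomic coordinates


def intron_chain(tx):
    """Return the introns implied by each pair of consecutive exons."""
    return tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(tx.exons, tx.exons[1:])
    )


def collapse_terminal_variants(transcripts):
    """Keep one transcript per (chrom, strand, CDS, intron chain) group.

    Assumption: members of a group differ only in their outer exon
    boundaries, i.e. in terminal UTR length. The longest one is kept.
    """
    groups = defaultdict(list)
    for tx in transcripts:
        chain = intron_chain(tx)
        # Single-exon transcripts have no introns, so they are grouped
        # by CDS bounds alone.
        groups[(tx.chrom, tx.strand, tx.cds, chain)].append(tx)
    return [
        max(group, key=lambda t: t.exons[-1][1] - t.exons[0][0])
        for group in groups.values()
    ]


if __name__ == "__main__":
    a = Transcript("a", "Chr1", "+", ((100, 200), (300, 400)), (150, 350))
    b = Transcript("b", "Chr1", "+", ((80, 200), (300, 450)), (150, 350))
    c = Transcript("c", "Chr1", "+", ((100, 200), (320, 400)), (150, 350))
    for tx in collapse_terminal_variants([a, b, c]):
        print(tx.name)
```

Here `a` and `b` share an intron chain and CDS, so only the longer `b` is kept, while `c` uses a different splice acceptor and survives as a distinct isoform. Keeping the longest member is a design choice of this sketch; the poster does not state which representative the actual script retains.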

The JBrowse genome viewer presents users with data organized into hierarchical and faceted track lists. The genomic region shown represents the features in the vicinity of EMBRYO DEFECTIVE 2770, highlighting the Col-0 methylation data retrieved on-the-fly from EPIC-CoGe, paired-end analysis of TSS (PEAT) peaks, TDNA-seq-based insertion sites, and 1001 Genomes variants alongside the updated Araport11 annotation set.

| Category | TAIR10 | Araport11 | Description |
|---|---|---|---|
| Long intergenic noncoding RNA (lincRNA) | - | 2,708 | The 2,708 intergenic transcripts were detected by tiling array and confirmed by RNA-seq (Liu et al., 2012). |
| Natural antisense transcript (NAT) | - | 2,980 | Li et al. (2013) identified 1,490 NAT pairs in whole root samples using strand-specific RNA-seq followed by computational analysis (NASTIseq). |
| microRNA (miRNA) | 177 | 427 | miRBase 21 |
| Small nucleolar RNA (snoRNA) | 71 | 287 | Sherstnev et al. (2012) incorporated data from TAIR, PlantDB, Chen and Wu (2009), and Kim et al. (2010), and annotated 287 snoRNAs. |
| tRNA | 689 | 689 | |
| Small nuclear RNA (snRNA) | 13 | 13 | |
| Small RNA | - | 24,575 | We used ShortStack (Axtell, 2013), a tool designed for annotation of small RNA genes, to analyze public data sets (Law et al., 2013). ShortStack recapitulated >99% of the siRNA clusters reported by Law et al. (2013), which were based on the TAIR8 genome. We ran ShortStack in 'de novo discovery mode', supplemented with TAIR10 and miRBase 21 as the reference, and identified 24,575 non-miRNA, non-hairpin small RNA loci. |
| rRNA | 15 | 15 | |
| Other RNA | 394 | - | |
| Total | 1,359 | 31,681 | |

Araport11 protein-coding gene annotation: The TAIR10 annotation was supplemented with novel transcripts from NCBI and MAKER-P assemblies and used as the reference annotation set. RNA-seq reads from SRA were grouped into 11 tissue/organ types and assembled with Trinity; tissue-specific transcriptomes were reconstructed from a hybrid of de novo and genome-guided assemblies. The PASA-based annotation update was performed independently for each tissue group to avoid constructing chimeric transcripts, and the 11 transcriptomes were consolidated using a custom Python script that collapses isoforms differing in terminal UTR length. Around 300 Uniprot protein records inconsistent with TAIR10 were evaluated, filtered, and appended to the PASA-updated set. Additional novel transcripts extracted from PASA and the literature were used to further quantify novel loci. Updated gene models and novel loci that are part of Araport11 will be re-indexed with appropriate locus and isoform identifiers and released for community review.

Statistics: Araport11 updated 80.3% (28,429/35,385) of TAIR10 protein-coding gene models, of which 3.3% (933/28,429) have an altered CDS and 88.2% (25,079/28,429) have an altered UTR. A total of 1,162 new loci and 14,880 new gene models were added. 38.3% of protein-coding genes now have additional splice variants, compared with 18% in TAIR10. Overall, the Araport11 pre-release contains 28,565 protein-coding gene loci encompassing 50,265 gene models.
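The percentages in the statistics above follow directly from the quoted counts. As a quick arithmetic check, the short sketch below recomputes them. The variable names and the `pct` helper are illustrative only; every number is copied from the text.

```python
# Counts quoted in the Araport11 statistics (pre-release figures).
tair10_models = 35_385   # TAIR10 protein-coding gene models
updated_models = 28_429  # TAIR10 models updated in Araport11
cds_altered = 933        # updated models with an altered CDS
utr_altered = 25_079     # updated models with an altered UTR
new_models = 14_880      # gene models newly added in Araport11
total_models = 50_265    # gene models in the Araport11 pre-release


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print("updated:", pct(updated_models, tair10_models), "%")
print("CDS altered:", pct(cds_altered, updated_models), "%")
print("UTR altered:", pct(utr_altered, updated_models), "%")
print("TAIR10 + new models:", tair10_models + new_models)
```

The last line also shows that the TAIR10 model count plus the newly added models equals the reported pre-release total of 50,265 gene models.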

Araport11 non-coding RNA annotation

Publications

1. Araport: the Arabidopsis Information Portal. Nucleic Acids Research (2014) doi: 10.1093/nar/gku1200
2. The Arabidopsis Information Portal: An Application Platform for Data Discovery. Proceedings of the 9th Gateway Computing Environments Workshop (2014) doi: 10.1109/GCE.2014.10

Acknowledgements

We thank the NCBI RefSeq team and the Mark Yandell lab for sharing the TAIR10 re-annotation data, the authors of the RNA-seq data sets used in our coding and non-coding RNA annotation, and Michael Axtell (PSU) and Ho-Ming Chen (Academia Sinica) for helpful discussions.