BD2K All Hands Meeting
Interoperability, Sustainability & Impact: A UniProt Case Study
BD2K All Hands Meeting, November 30, 2016
Bethesda, MD
Alex Bateman – EMBL-EBI (European Bioinformatics Institute, UK)
Cathy Wu – PIR (Protein Information Resource, USA)
Ioannis Xenarios – SIB (Swiss Institute of Bioinformatics, Switzerland)
UniProt Consortium
http://www.uniprot.org
UniProt BD2K Supplement: A Sustainable Infrastructure Mapping Protein Function onto DNA Variation
The Need for Interoperability and Engaging the Clinical Genomics Community
Many medical geneticists in the community do not know how to make use of protein functional information in UniProt for variant curation.
=> Add UniProt Sequence Annotation to Human Reference Genome
• Aim 1. To support the worldwide genomics community in the full interpretation and exploitation of the human genome & proteome by providing an up-to-date, high-quality mapping between genomic coordinates and the protein sequence and its features
• Aim 2. To develop programmatic interoperability between UniProtKB and ClinGen/ClinVar and CPTAC, and to collaborate with the wider genomic and proteomic communities to develop use cases and knowledge exchange that make the best use of the integrated data for curating the wealth of variant information being generated
Mapping of reference proteome and annotations to the reference genome
• Human protein sequences mapped to the human reference genome (GRCh38)
• BED/BigBED track hubs are available for 26 Feature Types plus proteins & isoforms
• View protein sequence annotations via public track hubs on UCSC & Ensembl browsers
UniProt Genome Annotation Tracks
UniProt Feature Viewer
Knowing the genomic position of functional protein features is valuable for gene, variant and protein curation, because these features can often be associated with a biological cause
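As a rough illustration of what mapping a protein feature onto the genome involves, the sketch below converts a 1-based residue position into genomic coordinates and writes a minimal BED-style line. It is not the UniProt pipeline: the function names, the exon coordinates and the plus-strand assumption are all hypothetical, chosen only to show the coordinate arithmetic.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 0-based, half-open genomic interval of coding sequence


def residue_to_genome(exons: List[Exon], residue: int) -> List[Exon]:
    """Return the genomic block(s) covering the codon of a 1-based residue.

    Assumption: plus-strand transcript, coding exons sorted by position.
    """
    cds_start = (residue - 1) * 3   # 0-based offset of the codon in the CDS
    cds_end = cds_start + 3
    blocks = []
    offset = 0                      # CDS offset at the start of the current exon
    for start, end in exons:
        length = end - start
        lo = max(cds_start, offset)
        hi = min(cds_end, offset + length)
        if lo < hi:                 # the codon overlaps this exon
            blocks.append((start + lo - offset, start + hi - offset))
        offset += length
    if sum(e - s for s, e in blocks) != 3:
        raise ValueError("residue lies outside the coding sequence")
    return blocks


def bed_line(chrom: str, blocks: List[Exon], name: str) -> str:
    """Format a 4-column BED line spanning all blocks (introns included)."""
    return "\t".join([chrom, str(blocks[0][0]), str(blocks[-1][1]), name])


# Hypothetical two-exon gene: residue 2 has a codon split across an intron.
exons = [(100, 104), (200, 210)]
blocks = residue_to_genome(exons, 2)
print(blocks)                                # [(103, 104), (200, 202)]
print(bed_line("chr1", blocks, "Active site"))
```

A real track would also need minus-strand handling and BED12 block fields so that codons split across exons are drawn correctly.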
UniProt Genome Tracks on UCSC Browser
Active Sites Disulfide Bond Glycosylation Site
• GLA gene (associated with Fabry disease) on browser plus ClinVar, dbSNP & OMIM tracks
• Active site - variant disrupts enzyme active site; SNPs not observed in other resources
• Disulfide bond - variant associated with FD that removes a Cys in a structural fold
• Glycosylation - variant disrupts site for lysosome targeting; pathogenic variants in ClinVar
Aligning ClinVar SNPs to UniProt Features
• Mapping of ClinVar SNPs to UniProt features by genomic position shows the raw number and % of ClinVar pathogenic SNPs that aligned with different features (Table 1).
• Comparison between ClinVar ‘pathogenic’ SNPs and UniProt ‘disease-associated’ variants shows that 36% of ClinVar pathogenic SNPs are in UniProt and 48% of UniProt disease-associated variants are currently in ClinVar (Table 2).
• Collaboration with NCBI ClinVar group
  ‒ reciprocal links at variant level
  ‒ supporting data provider (like OMIM)
  ‒ UniProt variant submission
Table 1
UniProt Feature Type (total)    CV SNPs that map to feature    CV Pathogenic SNPs that map to feature    Pathogenic in ClinVar (%)
Intramembrane (265)             233                            181                                       77.68%
DNA Binding dom. (712)          727                            474                                       65.20%
Region (8,969)                  15,962                         5,225                                     32.73%
Domain (65,655)                 106,572                        33,616                                    31.54%
Metal Binding (2,735)           2,082                          1,158                                     55.62%
Topological dom. (18,494)       18,865                         6,407                                     33.96%
Natural variant (72,808)        32,373                         20,092                                    62.06%
Coiled Coil (10,858)            11,126                         2,665                                     23.95%
Peptide (384)                   199                            94                                        47.24%
Transit peptide (458)           344                            97                                        28.20%
Transmembrane (39,729)          9,822                          5,192                                     52.86%
Repeat (14,913)                 6,253                          1,900                                     30.39%
Nucleotide binding (3,379)      561                            351                                       62.57%
Ca Binding Site (458)           83                             42                                        50.60%
Signal Peptide (9,274)          1,791                          603                                       33.67%
Motif (2,991)                   297                            142                                       47.81%
Zn Finger (8,972)               760                            333                                       43.82%
Site (1,945)                    124                            66                                        53.23%
Binding Sites (5,589)           183                            134                                       73.22%
Active Site (3,608)             60                             41                                        68.33%
Modified Residue (51,094)       561                            184                                       32.80%
Cross Link (2,798)              24                             7                                         29.17%
Carbohydrate Site (16,166)      154                            39                                        25.32%
Lipid (1,018)                   10                             2                                         20.00%
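The counts in Table 1 come from overlapping SNP positions with feature intervals. The sketch below shows that kind of calculation on made-up data, assuming 0-based half-open intervals and a simple linear scan; `pathogenic_share` and the toy coordinates are illustrative only, not the method actually used. The last lines recompute two percentages from the table's own counts.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (feature type, start, end), 0-based half-open
Snp = Tuple[int, bool]          # (genomic position, is_pathogenic)


def pathogenic_share(features: List[Feature], snps: List[Snp],
                     feature_type: str) -> Tuple[int, int, float]:
    """Count SNPs inside features of one type and the % that are pathogenic."""
    spans = [(s, e) for t, s, e in features if t == feature_type]
    mapped = [(pos, path) for pos, path in snps
              if any(s <= pos < e for s, e in spans)]
    pathogenic = sum(1 for _, path in mapped if path)
    pct = 100.0 * pathogenic / len(mapped) if mapped else 0.0
    return len(mapped), pathogenic, pct


# Toy data (hypothetical coordinates on a single chromosome).
features = [("Active Site", 10, 13), ("Domain", 0, 50)]
snps = [(11, True), (12, False), (40, True), (60, True)]
print(pathogenic_share(features, snps, "Active Site"))  # (2, 1, 50.0)

# The last column of Table 1 is pathogenic / mapped, e.g. for two rows:
intramembrane_pct = round(100 * 181 / 233, 2)
active_site_pct = round(100 * 41 / 60, 2)
print(intramembrane_pct, active_site_pct)
```

For genome-scale data an interval tree or a sorted sweep would replace the linear scan.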
Table 2
                                         All ClinVar SNPs (111,022)    Pathogenic ClinVar SNPs (33,521)    Non-Pathogenic ClinVar SNPs (77,501)
All UniProt Variants (73,968)            18,641                        12,337                              6,304
UniProt Disease Variants (28,025)        13,505                        11,253                              2,252
UniProt non-Disease Variants (45,943)    5,136                         1,084                               4,052
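The two overlap percentages quoted for Table 2 can be recomputed from its cells. The sketch below assumes the 36% figure is the pathogenic ClinVar SNPs found among all UniProt variants, and the 48% figure is the UniProt disease variants found among all ClinVar SNPs. That reading is inferred from the numbers and is not stated on the slide.

```python
# Cell values copied from Table 2.
clinvar_pathogenic_total = 33_521
uniprot_disease_total = 28_025
pathogenic_clinvar_in_uniprot = 12_337   # row "All UniProt Variants", pathogenic column
uniprot_disease_in_clinvar = 13_505      # row "UniProt Disease Variants", all-ClinVar column


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


clinvar_side = pct(pathogenic_clinvar_in_uniprot, clinvar_pathogenic_total)
uniprot_side = pct(uniprot_disease_in_clinvar, uniprot_disease_total)
print(clinvar_side, uniprot_side)  # 36.8 48.2
```

Under this reading the values come out at about 36.8% and 48.2%, consistent with the quoted 36% and 48% if those were truncated rather than rounded.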
Integrating UniProt Features into ClinGenKB & Pathogenicity Calculator
• 20,000+ evidence documents for UniProt disease-associated amino acid (AA) variants provided for use in the ClinGen Pathogenicity Calculator
• UniProt AA variations and proteins to be loaded in the public ClinGen Allele Registry as protein alleles (early 2017).
ClinGen protein feature document model
Linked data
SCIENTIFIC DATA | 3:160018 | DOI: 10.1038/sdata.2016.18
UniProt is FAIR
Findable, Accessible, Interoperable, Reusable
Examples of FAIRness and the resulting value-added
Growth of UniProt Databases
[Figure: number of entries (y-axis, 0 to 140,000,000) for UniParc, UniProtKB, UniRef100, UniRef90, UniRef50 and UniProtKB/Swiss-Prot]
Growth of Biomedical Literature
[Figure labels: Curated, Evaluated?, Curatable?]
• Is expert curation sustainable?
• How many articles do we evaluate in total every year?
• What proportion of PubMed is curatable for UniProtKB/Swiss-Prot?
Literature Triage: Curation workflow
• 4 curators from different annotation programs
• Run tests over 8 months
• Use PubTator to select publications (in collaboration with Z Lu at NCBI)
• Tag curatable papers
• Tag non-curatable papers and describe why
Sustainability of Literature curation
• Curators evaluate ~50-60,000 papers per year
• ~10K are curated and added to UniProt
• ~10K are redundant to existing information
• ~10K low priority
• ~10K are not well supported
• ~20K out of scope
• Sampling shows 90% of PubMed is out of scope
• We estimate that we curate 35-45% of the curatable part of PubMed
• The major challenge is the literature triage step
• The number of publications curated is important, but it is just as important to select papers that provide the maximum of high-quality information, to make the best use of our resources
Expert curation is sustainable
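The triage outcomes above can be tallied to see how evaluated papers are distributed. The sketch below takes each "~10K" or "~20K" figure at face value, which is an approximation; the dictionary keys are labels chosen here for illustration.

```python
# Approximate papers per year in each triage outcome, as listed above.
triage_counts = {
    "curated": 10_000,
    "redundant": 10_000,
    "low_priority": 10_000,
    "not_well_supported": 10_000,
    "out_of_scope": 20_000,
}

total = sum(triage_counts.values())
shares = {name: round(100 * count / total, 1)
          for name, count in triage_counts.items()}

print(total)                    # 60000, the upper end of the ~50-60,000 range
print(shares["curated"])        # 16.7
print(shares["out_of_scope"])   # 33.3
```

On these rounded figures, roughly one evaluated paper in six ends up curated into UniProt, which is why the triage step dominates the workload.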
UniProt @ Innovations in Curation Workshop
Text-mining assisted manual curation
• PubTator for literature triage
Additional Bibliography: computationally mapped references
• Bibliography from other curated sources
• In progress: Text mining-assisted UniProt tagging (including ePMC); Computational assignment of concept categories
Integration of text mining into publishing
• UniProt ID assignment at point of paper submission in collaboration with journals (BioCreative VI Bio-ID Track)
Impact: Resource Utilization
UniProt Google Analytics Statistics
Period / monthly average     Visits     Unique visitors    Pageviews
June 2010 - May 2011         612,905    320,892            3,177,758
June 2011 - May 2012         724,286    369,485            3,703,560
June 2012 - May 2013         820,623    408,244            4,022,786
June 2013 - May 2014         808,135    409,848            4,255,675
March 2014 - Feb. 2015*      821,368    433,136            4,097,871
March 2015 - Feb. 2016       952,837    509,278            4,758,278
* Different period due to new NIH grant period; these numbers were reported to the NIH
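To put the usage table in relative terms, the sketch below computes the change in each monthly average between the first and last reporting periods. The variable names are illustrative. Because the later periods run March to February rather than June to May, the comparison is only approximate.

```python
# Monthly averages from the first and last rows of the usage table.
first_period = {"visits": 612_905, "unique_visitors": 320_892, "pageviews": 3_177_758}
last_period = {"visits": 952_837, "unique_visitors": 509_278, "pageviews": 4_758_278}


def growth_pct(start: int, end: int) -> float:
    """Relative change from start to end, in percent, to one decimal place."""
    return round(100 * (end - start) / start, 1)


growth = {metric: growth_pct(first_period[metric], last_period[metric])
          for metric in first_period}
print(growth)  # {'visits': 55.5, 'unique_visitors': 58.7, 'pageviews': 49.7}
```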
Impact: Citation, Linking, Reuse
WoS Citation network of UniProt paper
Increased use of resource URLs for citation
Impact of Linking from UniProt
MobiDB
[Figure: Links to external resources in UniProt, 2012 to 2014 (UniProt release 04 of each year). Axes: number of new resources linked (0 to 16) and number of UniProt entries with links (0 to 50,000,000).]
Increased # UniProt links to external resources
UniProt data is reused in hundreds of tools and resources (e.g., NCBI)
- How to assess the impact of resource reuse?
The Team
PIs: Alex Bateman, Cathy Wu, Ioannis Xenarios
Content/Curation: Lucila Aimo, Ghislaine Argoud-Puy, Andrea Auchincloss, Kristian Axelsen, Sara Benmohhamed, Brigitte Boeckmann, Emmanuel Boutet, Lionel Breuza, Ramona Britto, Hema Bye-A-Jee, Cristina Casals Casas, Elisabeth Coudert, Melanie Courtot, Anne Estreicher, Livia Famiglietti, Marc Feuermann, John S. Garavelli, Penelope Garmiri, Daniel Gonzalez, Arnaud Gos, Nadine Gruaz, Emma Hatton-Ellis, Ursula Hinz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Guillaume Keller, Kati Laiho, Philippe Lemercier, Damien Lieberherr, Alistair MacDougall, Patrick Masson, Anne Morgat, Barbara Palka, Ivo Pedruzzi, Klemens Pichler, Sandrine Pilbout, Catherine Rivoire, Bernd Roechert, Karen Ross, Michel Schneider, Aleksandra Shypitsyna, Christian Sigrist, Elena Speretta, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Nidhi Tyagi, C. R. Vinayaka, Qinghua Wang, Kate Warner, Lai-Su Yeh, Rosanna Zaru
Development: Emanuele Alpi, Ricardo Antunes, Leslie Arminski, Parit Bansal, Delphine Baratin, Teresa Batista Neto, Benoit Bely, Mark Bingley, Jerven Bolleman, Borisas Bursteinas, Chuming Chen, Yongxing Chen, Beatrice Cuche, Alan Da Silva, Edouard De Castro, Maurizio De Giorgi, Tunca Dogan, Leyla Garcia Castro, Elisabeth Gasteiger, Sebastien Gehant, Leonardo Gonzales, Arnaud Kerhornou, Vicente Lara, Wudong Liu, Thierry Lombardot, Jie Luo, Xavier Martin, Andrew Nightingale, Joseph Onwubiko, Monica Pozzato, Sangya Pundir, Guoying Qi, Alexandre Renaux, Steven Rosanoff, Rabie Saidi, Tony Sawford, Edward Turner, Vladimir Volynkin, Yuqi Wang, Tony Wardell, Xavier Watkins, Hermann Zellner, Jian Zhang
European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
Protein Information Resource (PIR), Washington DC and Delaware, USA
SIB Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
Key staff: Cecilia Arighi (Curation), Lydie Bougueleret (Co-Direction), Alan Bridge (Content), Hongzhan Huang (Development), Michele Magrane (Curation), Maria Martin (Development), Peter McGarvey (Content), Darren Natale (Content), Claire O’Donovan (Content), Sylvain Poux (Curation), Manuela Pruess (Coordination), Nicole Redaschi (Development)