BD2K All Hands Meeting

Interoperability, Sustainability & Impact A UniProt Case Study

November 30, 2016

Bethesda, MD

Alex Bateman – EMBL-EBI (European Bioinformatics Institute, UK)

Cathy Wu – PIR (Protein Information Resource, USA)

Ioannis Xenarios – SIB (Swiss Institute of Bioinformatics, Switzerland)

UniProt Consortium

http://www.uniprot.org

UniProt: Hub for Protein Sequence & Function Information

UniProt BD2K Supplement: A Sustainable Infrastructure Mapping Protein Function onto DNA Variation

The Need for Interoperability and Engaging the Clinical Genomics Community

Many medical geneticists in the community do not know how to make use of protein functional information in UniProt for variant curation.

=> Add UniProt Sequence Annotation to Human Reference Genome

• Aim 1. To support the worldwide genomics community in the full interpretation and exploitation of the human genome & proteome by providing an up-to-date and high quality mapping between the genomic coordinates and the protein sequence and its features

• Aim 2. To develop programmatic interoperability between the UniProtKB and ClinGen/ClinVar and CPTAC and collaborate with the wider genomic and proteomic communities to develop use cases and knowledge exchange to best use the integrated data for curation of the wealth of variant information being generated

Mapping of reference proteome and annotations to the reference genome

• Human protein sequences mapped to the human reference genome (GRCh38)
• BED/BigBED track hubs are available for 26 Feature Types plus proteins & isoforms
• View protein sequence annotations via public track hubs on UCSC & Ensembl browsers
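To make the idea of mapping a protein feature onto the genome concrete, here is a minimal Python sketch. It is not the UniProt pipeline, which the slides do not describe. It assumes that the coding parts of the exons are already known as 0-based, half-open genomic intervals, and all coordinates and names (`protein_feature_to_genome`, `to_bed12`, `ACT_SITE`) are hypothetical. It converts a residue range to genomic blocks and writes one 12-column BED line.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open genomic interval [start, end)


def protein_feature_to_genome(
    cds_exons: List[Interval], strand: str, aa_start: int, aa_end: int
) -> List[Interval]:
    """Map a protein feature (1-based, inclusive residues) to genomic blocks.

    cds_exons holds only the coding part of each exon, in any order.
    """
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    exons = sorted(cds_exons, reverse=(strand == "-"))
    # Feature span as CDS nucleotide offsets (0-based, half-open).
    nt_start, nt_end = (aa_start - 1) * 3, aa_end * 3
    blocks, offset = [], 0
    for g_start, g_end in exons:
        length = g_end - g_start
        lo, hi = max(nt_start, offset), min(nt_end, offset + length)
        if lo < hi:  # the feature overlaps this exon
            if strand == "+":
                blocks.append((g_start + lo - offset, g_start + hi - offset))
            else:
                blocks.append((g_end - (hi - offset), g_end - (lo - offset)))
        offset += length
    return sorted(blocks)


def to_bed12(chrom: str, name: str, strand: str, blocks: List[Interval]) -> str:
    """Format sorted genomic blocks as a single 12-column BED line."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, name, 0, strand, start, end, "0,0,0",
              len(blocks), sizes, starts]
    return "\t".join(map(str, fields))


# Hypothetical two-exon CDS; the feature covers residues 2-3 and spans the intron.
blocks = protein_feature_to_genome([(100, 106), (200, 209)], "+", 2, 3)
print(to_bed12("chrX", "ACT_SITE", "+", blocks))
```

A feature that crosses an exon boundary becomes several blocks, which is why the 12-column BED form is used here rather than the 3-column form.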

UniProt Genome Annotation Tracks

UniProt Feature Viewer

Knowing the genomic position of functional protein features is valuable for gene, variant and protein curation, as these features can often be associated with a biological cause.

UniProt Genome Tracks on UCSC Browser

Active Sites Disulfide Bond Glycosylation Site

• GLA gene (associated with Fabry disease) on browser plus ClinVar, dbSNP & OMIM tracks
• Active site - variant disrupts enzyme active site; SNPs not observed in other resources
• Disulfide bond - variant associated with FD that removes a Cys in a structural fold
• Glycosylation - variant disrupts site for lysosome targeting; pathogenic variants in ClinVar

Aligning ClinVar SNPs to UniProt Features

• Mapping of ClinVar SNPs to UniProt features by genomic position shows the raw number and % of ClinVar pathogenic SNPs that aligned with different features (Table 1).
• Comparison between ClinVar ‘pathogenic’ SNPs and UniProt ‘disease-associated’ variants shows that 36% of ClinVar pathogenic SNPs are in UniProt and 48% of UniProt disease-associated variants are currently in ClinVar (Table 2).
• Collaboration with NCBI ClinVar group
  ‒ reciprocal links at variant level
  ‒ supporting data provider (like OMIM)
  ‒ UniProt variant submission
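Mapping SNPs to features "by genomic position" amounts to an interval-overlap count. The sketch below is only an illustration of that idea under assumed inputs. Features are `(chrom, start, end, feature_type)` tuples in 0-based, half-open coordinates, SNPs are `(chrom, position, is_pathogenic)` tuples, and the function name `tally_snps_by_feature` and the sample values are hypothetical. It uses a simple linear scan, which a real pipeline would likely replace with an interval index.

```python
from collections import defaultdict


def tally_snps_by_feature(features, snps):
    """Count SNPs, and pathogenic SNPs, that fall inside each feature type."""
    by_chrom = defaultdict(list)
    for chrom, start, end, ftype in features:
        by_chrom[chrom].append((start, end, ftype))

    tally = defaultdict(lambda: [0, 0])  # feature type -> [mapped, pathogenic]
    for chrom, pos, pathogenic in snps:
        # Count a SNP once per feature type, even if several features overlap it.
        hit_types = {ftype for start, end, ftype in by_chrom[chrom]
                     if start <= pos < end}
        for ftype in hit_types:
            tally[ftype][0] += 1
            tally[ftype][1] += int(pathogenic)
    return {ftype: tuple(counts) for ftype, counts in tally.items()}


# Hypothetical features and SNPs, for illustration only.
features = [("chrX", 100, 103, "Active Site"), ("chrX", 90, 200, "Domain")]
snps = [("chrX", 101, True), ("chrX", 150, False), ("chr1", 101, True)]
print(tally_snps_by_feature(features, snps))
```

Each result is a `(mapped, pathogenic)` pair per feature type, the same two counts that the per-feature table reports.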

Table 1

UniProt Feature Type (total) | ClinVar SNPs that map to feature | ClinVar pathogenic SNPs that map to feature | Pathogenic in ClinVar (%)
Intramembrane (265)          | 233     | 181    | 77.68%
DNA Binding dom. (712)       | 727     | 474    | 65.20%
Region (8,969)               | 15,962  | 5,225  | 32.73%
Domain (65,655)              | 106,572 | 33,616 | 31.54%
Metal Binding (2,735)        | 2,082   | 1,158  | 55.62%
Topological dom. (18,494)    | 18,865  | 6,407  | 33.96%
Natural variant (72,808)     | 32,373  | 20,092 | 62.06%
Coiled Coil (10,858)         | 11,126  | 2,665  | 23.95%
Peptide (384)                | 199     | 94     | 47.24%
Transit peptide (458)        | 344     | 97     | 28.20%
Transmembrane (39,729)       | 9,822   | 5,192  | 52.86%
Repeat (14,913)              | 6,253   | 1,900  | 30.39%
Nucleotide binding (3,379)   | 561     | 351    | 62.57%
Ca Binding Site (458)        | 83      | 42     | 50.60%
Signal Peptide (9,274)       | 1,791   | 603    | 33.67%
Motif (2,991)                | 297     | 142    | 47.81%
Zn Finger (8,972)            | 760     | 333    | 43.82%
Site (1,945)                 | 124     | 66     | 53.23%
Binding Sites (5,589)        | 183     | 134    | 73.22%
Active Site (3,608)          | 60      | 41     | 68.33%
Modified Residue (51,094)    | 561     | 184    | 32.80%
Cross Link (2,798)           | 24      | 7      | 29.17%
Carbohydrate Site (16,166)   | 154     | 39     | 25.32%
Lipid (1,018)                | 10      | 2      | 20.00%
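The "Pathogenic in ClinVar (%)" column of Table 1 is consistent with dividing the pathogenic SNP count by the count of all ClinVar SNPs mapped to the feature. The short Python sketch below checks that reading on three rows taken from the table. The function name `pathogenic_percent` is hypothetical, and the formula is inferred from the numbers, not stated on the slide.

```python
def pathogenic_percent(mapped: int, pathogenic: int) -> float:
    """Share of mapped ClinVar SNPs that are pathogenic, as a percentage."""
    return round(100.0 * pathogenic / mapped, 2)


# (mapped SNPs, pathogenic SNPs) for three feature types from Table 1.
rows = {
    "Intramembrane": (233, 181),
    "Active Site": (60, 41),
    "Lipid": (10, 2),
}

for feature, (mapped, pathogenic) in rows.items():
    print(f"{feature}: {pathogenic_percent(mapped, pathogenic):.2f}%")
```

Feature types with few mapped SNPs, such as Lipid with 10, give percentages that rest on very small counts.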

Table 2

UniProt variants (total)              | All ClinVar SNPs (111,022) | Pathogenic ClinVar SNPs (33,521) | Non-Pathogenic ClinVar SNPs (77,501)
All UniProt Variants (73,968)         | 18,641 | 12,337 | 6,304
UniProt Disease Variants (28,025)     | 13,505 | 11,253 | 2,252
UniProt non-Disease Variants (45,943) | 5,136  | 1,084  | 4,052
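The overlap percentages quoted for Table 2 can be checked against its cells. The Python sketch below assumes that the 36% figure uses the "All UniProt Variants" × "Pathogenic ClinVar SNPs" cell and that the 48% figure uses the "UniProt Disease Variants" × "All ClinVar SNPs" cell. That choice of cells is an inference from the numbers, not something the slide states, and the variable names are hypothetical.

```python
def share(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


pathogenic_clinvar_total = 33_521        # column total, Table 2
uniprot_disease_total = 28_025           # row total, Table 2
pathogenic_clinvar_in_uniprot = 12_337   # All UniProt Variants x Pathogenic ClinVar
uniprot_disease_in_clinvar = 13_505      # UniProt Disease Variants x All ClinVar

clinvar_side = share(pathogenic_clinvar_in_uniprot, pathogenic_clinvar_total)
uniprot_side = share(uniprot_disease_in_clinvar, uniprot_disease_total)

print(f"ClinVar pathogenic SNPs found in UniProt: {clinvar_side:.1f}%")
print(f"UniProt disease variants found in ClinVar: {uniprot_side:.1f}%")
```

Under these assumptions the two shares land between 36% and 37% and between 48% and 49%, in line with the quoted figures.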

Integrating UniProt Features into ClinGenKB & Pathogenicity Calculator

• 20,000+ evidence documents for UniProt amino acid (AA) disease-associated variants provided for use in the ClinGen Pathogenicity Calculator

• UniProt AA variations and proteins to be loaded in the public ClinGen Allele Registry as protein alleles (early 2017).

ClinGen protein feature document model

Linked data

SCIENTIFIC DATA | 3:160018 | DOI: 10.1038/sdata.2016.18

UniProt is FAIR

Findable, Accessible, Interoperable, Reusable

Examples of FAIRness and the resulting value-added

Growth of UniProt Databases

[Figure: Number of Entries (y-axis, 0 to 140,000,000) for UniParc, UniProtKB, UniRef100, UniRef90, UniRef50 and UniProtKB/Swiss-Prot]

Dealing with Scale

Growth of Biomedical Literature

Curated

Evaluated?

Curatable?

• Is expert curation sustainable?
• How many articles do we evaluate in total every year?
• What proportion of PubMed is curatable for UniProtKB/Swiss-Prot?

Literature Triage: Curation workflow

• 4 curators from different annotation programs
• Run tests over 8 months
• Use PubTator to select publications (in collaboration with Z Lu at NCBI)

• Tag curatable papers
• Tag non-curatable papers and describe why

Sustainability of Literature curation

• Curators evaluate ~50,000-60,000 papers per year
• ~10K are curated and added to UniProt
• ~10K are redundant to existing information
• ~10K low priority
• ~10K are not well supported
• ~20K out of scope
• Sampling shows that 90% of PubMed is out of scope
• We estimate that we curate 35-45% of the curatable part of PubMed
• The major challenge is the literature triage step
• The number of publications curated is important, but it is equally important to select papers that provide the most high-quality information, to make the best use of our resources

Expert curation is sustainable
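As a quick arithmetic check on the triage breakdown above, the Python sketch below sums the rounded per-category figures and expresses each as a share of the evaluated papers. The dictionary keys are shorthand labels chosen here. The sketch does not attempt to reproduce the 35-45% estimate, which depends on PubMed sampling data not given in the slides.

```python
# Approximate papers per year in each triage outcome, in thousands (rounded slide figures).
triage_thousands = {
    "curated": 10,
    "redundant": 10,
    "low priority": 10,
    "not well supported": 10,
    "out of scope": 20,
}

total = sum(triage_thousands.values())
shares = {name: round(100.0 * count / total, 1)
          for name, count in triage_thousands.items()}

print(f"Evaluated per year: ~{total}K")
for name, pct in shares.items():
    print(f"  {name}: {pct}%")
```

The categories add up to about 60K, the upper end of the evaluated range, so roughly one in six evaluated papers ends up curated into UniProt.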

UniProt @Innovations in Curation Workshop

Text-mining assisted manual curation

• PubTator for literature triage

Additional Bibliography: computationally mapped references

• Bibliography from other curated sources

• In progress: Text mining-assisted UniProt tagging (including ePMC); Computational assignment of concept categories

Integration of text mining into publishing

• UniProt ID assignment at point of paper submission in collaboration with journals (BioCreative VI Bio-ID Track)

Impact: Resource Utilization

UniProt Google Analytics Statistics

Period / monthly average | Visits  | Unique visitors | Pageviews
June 2010 - May 2011     | 612,905 | 320,892         | 3,177,758
June 2011 - May 2012     | 724,286 | 369,485         | 3,703,560
June 2012 - May 2013     | 820,623 | 408,244         | 4,022,786
June 2013 - May 2014     | 808,135 | 409,848         | 4,255,675
March 2014 - Feb. 2015*  | 821,368 | 433,136         | 4,097,871
March 2015 - Feb. 2016   | 952,837 | 509,278         | 4,758,278

* Different period due to new NIH grant period; these numbers were reported to the NIH

Impact: Communities Served

Impact: Citation, Linking, Reuse

WoS Citation network of UniProt paper

Increased use of resource URLs for citation

Impact of Linking from UniProt

MobiDB

[Figure: Links to external resources in UniProt, 2012-2014, by year (UniProt release 04). Axes: # new resources linked (0 to 16); # UniProt entries with links (0 to 50,000,000). Caption: Increased # UniProt links to external resources]

UniProt data is reused in hundreds of tools and resources (e.g., NCBI)
- How to assess the impact of resource reuse?

The Team

PIs: Alex Bateman, Cathy Wu, Ioannis Xenarios

Content/Curation: Lucila Aimo, Ghislaine Argoud-Puy, Andrea Auchincloss, Kristian Axelsen, Sara Benmohhamed, Brigitte Boeckmann, Emmanuel Boutet, Lionel Breuza, Ramona Britto, Hema Bye-A-Jee, Cristina Casals Casas, Elisabeth Coudert, Melanie Courtot, Anne Estreicher, Livia Famiglietti, Marc Feuermann, John S. Garavelli, Penelope Garmiri, Daniel Gonzalez, Arnaud Gos, Nadine Gruaz, Emma Hatton-Ellis, Ursula Hinz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Guillaume Keller, Kati Laiho, Philippe Lemercier, Damien Lieberherr, Alistair MacDougall, Patrick Masson, Anne Morgat, Barbara Palka, Ivo Pedruzzi, Klemens Pichler, Sandrine Pilbout, Catherine Rivoire, Bernd Roechert, Karen Ross, Michel Schneider, Aleksandra Shypitsyna, Christian Sigrist, Elena Speretta, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Nidhi Tyagi, C. R. Vinayaka, Qinghua Wang, Kate Warner, Lai-Su Yeh, Rosanna Zaru

Development: Emanuele Alpi, Ricardo Antunes, Leslie Arminski, Parit Bansal, Delphine Baratin, Teresa Batista Neto, Benoit Bely, Mark Bingley, Jerven Bolleman, Borisas Bursteinas, Chuming Chen, Yongxing Chen, Beatrice Cuche, Alan Da Silva, Edouard De Castro, Maurizio De Giorgi, Tunca Dogan, Leyla Garcia Castro, Elisabeth Gasteiger, Sebastien Gehant, Leonardo Gonzales, Arnaud Kerhornou, Vicente Lara, Wudong Liu, Thierry Lombardot, Jie Luo, Xavier Martin, Andrew Nightingale, Joseph Onwubiko, Monica Pozzato, Sangya Pundir, Guoying Qi, Alexandre Renaux, Steven Rosanoff, Rabie Saidi, Tony Sawford, Edward Turner, Vladimir Volynkin, Yuqi Wang, Tony Wardell, Xavier Watkins, Hermann Zellner, Jian Zhang

European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
Protein Information Resource (PIR), Washington DC and Delaware, USA
SIB Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

Key staff: Cecilia Arighi (Curation), Lydie Bougueleret (Co-Direction), Alan Bridge (Content), Hongzhan Huang (Development), Michele Magrane (Curation), Maria Martin (Development), Peter McGarvey (Content), Darren Natale (Content), Claire O’Donovan (Content), Sylvain Poux (Curation), Manuela Pruess (Coordination), Nicole Redaschi (Development)
