

Post on 04-Jul-2022


           # N         # Ag            Score used        AUC              # Ag in top 5% scores   Top 5% enrichment
Proteins   12,225      886             Protein score     0.690            116                     2.77 times
Peptides   4,871,188   1112 > 14,430   Peptide score     0.743 ± 6.6e-3   3,069                   4.25 times
                                       Combined scores   0.719 ± 6.7e-3   2,432                   3.37 times

APRANK (Antigenic Peptide/Protein Ranker): A bioinformatic tool for genome-wide prioritization of candidate antigens of human pathogens

Alejandro Ricci (1), Mauricio Brunner (2), Diego Ramoa (2), Santiago J. Carmona (1), and Fernán Agüero (1)

(1) Instituto de Investigaciones Biotecnológicas - Instituto Tecnológico de Chascomús, Universidad Nacional de San Martín - CONICET, San Martín, Buenos Aires, Argentina.
(2) Licenciatura en Bioinformática, Facultad de Ingeniería, Universidad Nacional de Entre Ríos, Oro Verde, Entre Ríos, Argentina.

Contacts: [email protected]; [email protected].

One of the preferred methods to test for the presence of a pathogen is immunodiagnostics, which relies on detecting antibodies that bind to a B-cell epitope found in the pathogen. With the advent of peptide microarray platforms it became viable to perform high-throughput screening of peptides, which allows these B-cell epitopes to be found much more quickly. However, while this is straightforward for pathogens such as viruses and small bacteria, the genome size of larger bacteria or eukaryotic parasites creates the need to prioritize and select relevant probes when designing these arrays.

In this work we describe a computational method that uses several molecular properties of the proteins and peptides in a pathogen proteome to produce prioritized lists of potential antigenic targets for further validation.

1. Introduction

3. Results

4. Conclusions

5. References and acknowledgments

We created a computational method that successfully predicted the antigenicity of proteins and peptides for many different species, including Gram-negative bacteria, Gram-positive bacteria and eukaryotic parasites. By balancing the training data with ROSE we improved the training for species with few known antigens and ensured that each species had the same weight when creating the generic model.

By combining the data of all species, we created a single model capable of predicting antigenicity while avoiding over-fitting when testing against the species used for training. We also tested each of our fifteen species using a generic model created from the remaining fourteen species, achieving similar results (data not shown).

While the method performed well, there are ways to improve it further. Better and larger sources of antigens and true negatives would increase the prediction capabilities of the models. The choice of predictors can also be tuned by removing, replacing or adding predictors based on how well each of them performed.

Balancing the training data improved the prediction capabilities of our computational model

Figure 3. ROC curves for the prediction of antigenicity using a model trained with either the original unbalanced data or the same data balanced with ROSE. For this comparison we used only data from L. interrogans, both to create the model and to test it. The data were divided into a training set and a test set several times; the prediction score was evaluated each time and the mean AUC was calculated. This was repeated for all fifteen species, and the largest AUC increases tended to occur when balancing the protein data of the species with fewer recorded antigens (L. interrogans, P. gingivalis, M. leprae, S. aureus and S. pyogenes).
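The evaluation protocol described in the Figure 3 caption (repeated random splits, one AUC per split, then the mean) can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's implementation: `fit` and `predict` are hypothetical callables standing in for whatever model is being evaluated, and the number of repeats and the test fraction are made up, since the poster only says the split was done "several times".

```python
import random
import statistics


def auc(scores, labels):
    """AUC as the probability that a random antigen outscores a random non-antigen."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_split_auc(X, y, fit, predict, n_repeats=10, test_fraction=0.3, seed=0):
    """Mean AUC over repeated random train/test splits (hypothetical settings)."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        cut = int(len(indices) * test_fraction)
        test, train = indices[:cut], indices[cut:]
        if len({y[i] for i in test}) < 2:
            continue  # AUC is undefined when the test set holds a single class
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(model, X[i]) for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
    return statistics.mean(aucs), aucs


# Toy data: one predictor that separates the two classes perfectly.
toy_X = [[i / 20] for i in range(20)]
toy_y = [0] * 10 + [1] * 10
# Stand-in "model": no fitting, the score is the predictor value itself.
mean_auc, per_split = repeated_split_auc(
    toy_X, toy_y, fit=lambda X, y: None, predict=lambda model, row: row[0]
)
```

Comparing unbalanced with balanced training then amounts to calling `repeated_split_auc` twice with two different `fit` callables and comparing the per-split AUC values.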

Our generic prediction model was successful in prioritizing proteins and peptides for Trypanosoma cruzi

Figure 4. Density plot of the scores produced by the generic protein and peptide models when analyzing the T. cruzi proteome. For the peptides two scores were used: the one returned by our method, and one combining the protein and peptide scores. The figure also shows the increase in the proportion of antigens achieved by keeping only the proteins and peptides with a score greater than 0.5, as well as the number of antigenic and non-antigenic proteins and peptides on each side of that score. The value of this cutoff is arbitrary, and a greater enrichment could be achieved by selecting a larger cutoff at the expense of losing more antigens.

Our generic prediction model was successful in prioritizing proteins and peptides for a novel species (Onchocerca)

Table 1. Analysis of the prediction obtained by using our computational method on Onchocerca volvulus. The rule to determine the antigens was taken from Lagatie et al. 2017, who performed a proteome-wide linear epitope scan of O. volvulus using high-density peptide microarrays [2]. For the peptides, we also considered as antigenic the neighboring peptides that shared an 8-mer with the recorded antigenic peptides.
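The neighbor rule in the Table 1 caption (a peptide counts as antigenic if it shares an 8-mer with a recorded antigenic peptide) can be illustrated with a short sketch. The peptide length of 15, the offset of 1 and the toy sequence are assumptions made for the example; the poster does not state how the O. volvulus proteome was tiled.

```python
def kmers(peptide, k=8):
    """Return the set of all length-k substrings of a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def expand_antigenic(all_peptides, antigenic_peptides, k=8):
    """Return the peptides sharing at least one k-mer with a recorded antigenic peptide."""
    antigenic_kmers = set()
    for peptide in antigenic_peptides:
        antigenic_kmers |= kmers(peptide, k)
    return [p for p in all_peptides if kmers(p, k) & antigenic_kmers]


# Hypothetical toy protein tiled into overlapping 15-mers with an offset of 1.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"
peptides = [protein[i:i + 15] for i in range(len(protein) - 15 + 1)]

# Pretend the peptide starting at position 8 is the only recorded antigen.
recorded = [peptides[8]]
expanded = expand_antigenic(peptides, recorded)
```

With 15-mers at offset 1, every peptide that overlaps the recorded one by at least 8 residues is pulled in, which is how a small set of recorded antigens can grow into a much larger labeled set.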

• [1] Lunardon, N., Menardi, G., & Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R Journal, 6(1), 82-92.
• [2] Lagatie, O., Van Dorst, B., & Stuyver, L. J. (2017). Identification of three immunodominant motifs with atypical isotype profile scattered over the Onchocerca volvulus proteome. PLoS Neglected Tropical Diseases, 11(1), 1–21.
• Supported by grant PICT-2013-1193 from the Agencia Nacional de Promoción Científica y Tecnológica, Argentina (ANPCyT), and by the National Institute of Allergy and Infectious Diseases, National Institutes of Health (NIAID/NIH, USA) under Award Number R01AI123070.

2. Methods

Figure 1. General pipeline followed by the computational method to create the generic prediction model. The balancing of the data is explained further in Figure 2. The proteome of the novel species has to be analyzed by the predictors before the generic model can be used to obtain the prediction scores. Although only one is shown, there are two generic models, one for proteins and one for peptides.
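The last two stages of the pipeline (fitting a binomial logistic regression model, then scoring and ranking a novel proteome) can be sketched in a few lines. This is an illustration only: the poster does not say which software fitted the model, so plain batch gradient descent is used here as a stand-in, and the predictor values, protein names, learning rate and epoch count are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit a binomial logistic regression by batch gradient descent (stand-in fitter)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - label
            for j in range(d):
                grad_w[j] += error * row[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(model, row):
    """Return the predicted probability of being antigenic, between 0 and 1."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Hypothetical balanced training data: two normalized predictor values per protein.
train_X = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
train_y = [1, 1, 1, 0, 0, 0]
generic_model = fit_logistic(train_X, train_y)

# Hypothetical proteome of a novel species, already run through the same predictors.
novel = {"protA": [0.85, 0.9], "protB": [0.5, 0.5], "protC": [0.1, 0.1]}
ranked = sorted(novel, key=lambda name: predict(generic_model, novel[name]), reverse=True)
```

The prioritized list is simply the proteome sorted by predicted score, so the same trained model can be reused on any species whose proteins have been run through the same predictors and normalization.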

Pipeline followed to create and use the prediction models

Balancing the data using the R package ROSE

[Figure 2 diagram]
Input: L. interrogans full protein data
  - Data points of 3,673 non-antigenic proteins (group 1)
  - Data points of 10 antigenic proteins (group 2)
Process: ROSE (a smoothed bootstrap-based technique)
Output: L. interrogans balanced protein data
  - Artificial data points of 1,500 non-antigenic proteins
  - Artificial data points of 1,500 antigenic proteins

Figure 2. Example of using the R package ROSE to balance the protein data of Leptospira interrogans. ROSE analyzes the distribution of the values obtained by the predictors for both the antigenic and non-antigenic proteins, and then creates new artificial data based on those distributions [1]. The value of 1,500 is an arbitrary number that we chose based on the total number of proteins in all the species used to train the model.

[Figure 1 color references: Data; Processes; Future use of the model]

Pathogenic species

• B. burgdorferi
• B. melitensis
• C. burnetii
• E. coli
• F. tularensis
• L. interrogans
• P. gingivalis
• M. leprae
• M. tuberculosis
• S. aureus
• S. pyogenes
• L. braziliensis
• P. falciparum
• T. cruzi
• T. gondii

Proteome analysis by several predictors

• Antigenicity and stimulation of an immune response (BepiPred, NetMHCIIpan)
• Peculiarities in the protein sequence (NetOGlyc, PredGPI, SignalP, XSTREAM)
• Three-dimensional structure (IUPred, Paircoil2, NetSurfP, TMHMM)
• Molecular properties such as pI and MW (EMBOSS pepstats)
• Self-similarity and similarity against the host (custom Perl scripts)

[Figure 1 flowchart boxes]
• Parsing and normalization
• Label antigenic proteins and peptides
• Balancing antigens and species with ROSE
• Full protein and peptide data
• Fitting a binomial logistic regression model
• Prioritized lists of proteins and peptides
• Proteome of a novel species
• Known antigens from bibliography
• Generic model

[Figure 3, ROC panel axes: false positive rate (x), true positive rate (y)]

Simplified steps:
1) Calculate the dispersion for each predictor for each group
2) Select a random group
3) Select a random protein inside that group
4) Create an artificial data point based on 3, using the dispersions from 1
5) Repeat from 2 until done
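The simplified steps above can be sketched as a smoothed bootstrap in plain Python. This is not the ROSE package itself, and it differs from the steps in one respect: rather than picking a random group each time, it draws a fixed number of artificial points per group, so the output is exactly balanced as in Figure 2. The Gaussian noise and the Silverman-style bandwidth factor are assumptions about how the "dispersion" is used, and the data are made up.

```python
import random
import statistics


def rose_like_balance(rows, labels, n_per_class, seed=0):
    """Create n_per_class artificial data points per group by smoothed bootstrap."""
    rng = random.Random(seed)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    d = len(rows[0])

    # Step 1: dispersion of each predictor within each group, scaled by an
    # assumed Silverman-style bandwidth factor.
    bandwidths = {}
    for label, members in groups.items():
        n = len(members)
        factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidths[label] = [
            factor * (statistics.stdev(column) if n > 1 else 0.0)
            for column in zip(*members)
        ]

    new_rows, new_labels = [], []
    for label, members in groups.items():
        for _ in range(n_per_class):
            centre = rng.choice(members)  # step 3: a random protein of the group
            # Step 4: jitter each predictor value around the chosen protein.
            new_rows.append(
                [rng.gauss(x, h) for x, h in zip(centre, bandwidths[label])]
            )
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical unbalanced data: 50 non-antigenic and 3 antigenic proteins,
# each described by two normalized predictor values.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0.0, 0.5), data_rng.uniform(0.0, 0.5)] for _ in range(50)]
rows += [[0.8, 0.9], [0.7, 0.8], [0.9, 0.7]]
labels = [0] * 50 + [1] * 3
balanced_rows, balanced_labels = rose_like_balance(rows, labels, n_per_class=20)
```

Because each artificial point is a jittered copy of a real one, a group with very few antigens (10 in the L. interrogans example) still yields varied training points rather than exact duplicates.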

[Figure 3 graphic: ROC curves with both axes running from 0 to 1, and an AUC panel (axis from 0.5 to 1) comparing unbalanced (U) and balanced (B) training data, with the difference marked ***.]

[Figure 4 graphic: three density plots (protein score, peptide score, combined score) of density (0 to 4) against normalized score (0 to 1) for antigenic and non-antigenic proteins/peptides. Counts annotated on the panels, in extraction order: 60; 16,056; 4,872; 5,565; 1,764; 7,483,752; 2,925,888; 6,187; 1,142; 8,182,522; 2,227,118; 182. Enrichment labels: 3.15, 2.7 and 3.94 times more antigens.]
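The enrichment labels of Figure 4 can be checked against the counts annotated on its panels. The sketch below assumes that enrichment is the proportion of antigens among items scoring above 0.5 divided by the proportion of antigens overall, and it assumes a particular assignment of the extracted counts to the four cells of each panel. Neither is stated explicitly in the poster, but under this reading the ratios come out at 3.15, 2.70 and 3.94, matching the three labels.

```python
def cutoff_enrichment(ag_above, non_above, ag_below, non_below):
    """Antigen proportion above the cutoff divided by the overall antigen proportion."""
    total_antigens = ag_above + ag_below
    total = total_antigens + non_above + non_below
    proportion_above = ag_above / (ag_above + non_above)
    proportion_overall = total_antigens / total
    return proportion_above / proportion_overall


# Assumed reading of the Figure 4 annotations, per panel:
# (antigens above 0.5, non-antigens above 0.5, antigens below, non-antigens below)
panels = {
    "protein score": (182, 4_872, 60, 16_056),
    "peptide score": (5_565, 2_925_888, 1_764, 7_483_752),
    "combined score": (6_187, 2_227_118, 1_142, 8_182_522),
}

enrichment = {name: cutoff_enrichment(*counts) for name, counts in panels.items()}
for name, value in enrichment.items():
    print(f"{name}: {value:.2f} times more antigens")
```

Under this assignment the peptide and combined panels share the same totals (7,329 antigenic and 10,409,640 non-antigenic peptides), which is what would be expected if both scores were computed on the same peptide set.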