kt.ijs.sikt.ijs.si/dragikocev/stuff/wp1-joint_dk.docx · web viewto this end, we prune the set of...

13
WP1 Progress in the 2nd reporting period The overall goal of WP1 was to develop models of phagocytosis, based on existing expert knowledge (ICTSM) and on data (JSI). Both existing data and data collected within the project were to be used for this purpose. In addition, methods for automated modeling that can make use of both data and expert knowledge were to be further developed to meet modeling needs typical of systems biology modeling tasks (JSI). Major progress was made towards achieving these goals, but some changes were inevitable due to unforeseeable circumstances. At ICTSM, WP1 experienced problems that affected the progress of the project including long-term illness and subsequent tragic death of the mathematician Professor Jaroslav Stark, and the late start and early departure of the research fellow at ICSTM, Dr Barbara Szomolay (April 2009-October 2011). A biomathematician, Dr. Vahid Shaherezei, was recruited to the project and managed the ICSTM side of WP1. Due to this, the scope of the models constructed is more limited than originally foreseen. At JSI, the scope of WP1 was both shifted and extended towards considering and analyzing the majority of data generated within the project. Originally, only the use of time-course data on protein concentrations was planned within the context of learning models of the dynamics of phagocytosis: This still took place, but at a more limited extent than originally foreseen, due to the limited availability of this kind of data. As the majority of the data generated included different types of screens (genome or compound screens, flow-citometry or image-based), a large body of effort was devoted to analyzing these: This is also reflected in the outcomes of the project. ICTSM: At ICSTM, we first compiled a signaling network for phagosome maturation based on an extensive literature survey. Based on the complexity of the initial network and the limited available data, we concluded that construction of a comprehensive mathematical model of this system is not yet feasible. We then decided to focus on particular aspects of this system and model them in detail.

Upload: others

Post on 27-Jan-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

WP1 Progress in the 2nd reporting period

The overall goal of WP1 was to develop models of phagocytosis, based on existing expert knowledge (ICTSM) and on data (JSI). Both existing data and data collected within the project were to be used for this purpose. In addition, methods for automated modeling that can make use of both data and expert knowledge were to be further developed to meet modeling needs typical of systems biology modeling tasks (JSI).

Major progress was made towards achieving these goals, but some changes were inevitable due to unforeseeable circumstances. At ICTSM, WP1 experienced problems that affected the progress of the project including long-term illness and subsequent tragic death of the mathematician Professor Jaroslav Stark, and the late start and early departure of the research fellow at ICSTM, Dr Barbara Szomolay (April 2009-October 2011). A biomathematician, Dr. Vahid Shaherezei, was recruited to the project and managed the ICSTM side of WP1. Due to this, the scope of the models constructed is more limited than originally foreseen.

At JSI, the scope of WP1 was both shifted and extended towards considering and analyzing the majority of data generated within the project. Originally, only the use of time-course data on protein concentrations was planned within the context of learning models of the dynamics of phagocytosis: This still took place, but at a more limited extent than originally foreseen, due to the limited availability of this kind of data. As the majority of the data generated included different types of screens (genome or compound screens, flow-citometry or image-based), a large body of effort was devoted to analyzing these: This is also reflected in the outcomes of the project.

ICTSM: At ICSTM, we first compiled a signaling network for phagosome maturation based on an extensive literature survey. Based on the complexity of the initial network and the limited available data, we concluded that construction of a comprehensive mathematical model of this system is not yet feasible. We then decided to focus on particular aspects of this system and model them in detail.

Figure WP1-1: Identified or hypothesized kinase-phosphatase complexes (PIK3C – class I PI 3-kinase, PTMT1 – PTEN-like phosphatase). Full lines represent in vivo, dashed line in vitro data.

Phagosome maturation is regulated by the key tagging molecules Rabs and phosphoinositides (PIPs). Inspecting the signaling network that regulates PIPs revealed a surprising interaction between antagonistic enzymes, including interactions between Vps34 kinase-myotubularin phosphatases (see Figure WP1-1). Using mathematical modeling investigated the role of these interactions. Interestingly, we find that the complex formation can produce novel forms of switch-like and bell-shaped response (Figure WP1-2). We have postulated that these could have a functional role in the temporal regulation of PIPs during phagosome maturation (Szomolay and Shahrezai 2012).

Figure WP1-2: Bell-shaped dose response in the active phosphatase case in the zero-order regime. In this figure Kt, Pt, St are the total kinase, phosphatase and substrate concentrations, respectively. R denotes the fraction of total phosphorylated substrate. The dashed line represents the response curve for the extended model in the active phosphatase case and the full line represents the response curve for the corresponding basic model of phosphorylation-dephosphorylation cycle.

We have been also involved in statistical analysis of multiparametric imaging data (Collinet et. al 2010) from our partners at Dresden, describing the intensity of EEA1 and APPL1 endosomes (Rab5 effector proteins necessary for endosomal trafficking) at the single endosome level for wildtype and a range of RNAi knockdowns. In addition to MATLAB, an image analysis software (Motion Tracking, developed by Y. Kaladzidis in the Zerial lab, Dresden) is being used to analyze the data. A recent study has shown that APPL1-positive and EEA1-positive endosomes have a transient overlap (APPL1 endosomes represent and earlier stage than EEA1 endosomes) and that the switch from the APPL1 to the EEA1 stage is controlled by an important phospholipid, PI(3)P (Zoncu et al. 2009). One of our goals is to show that the single endosome data is consistent with this finding: So far, we have not found strong evidence for this. In addition we are looking at the heterogeneity of the markers and how this is regulated. We hope to add insights on the mechanisms of robustness in endosome trafficking. This is still an ongoing work in collaboration with researchers at the Max Plank Institute in Dresden and we hope to publish these results as soon as all the analyses are finished.

JSI: At JSI, we pursued two major lines of research. First, we continued the development of methods for automated modeling of dynamic systems (in the form of ordinary differential equations – ODEs), that can make use of both data and expert knowledge, in the direction of meeting modeling needs typical of systems biology modeling tasks. We also investigated the use of these automated modeling methods for modeling endosome maturation and modeling LDL trafficking, where time-course data on protein concentrations were available. Second, we applied machine learning methods for structured output prediction, and in particular predictive clustering, on a large number of screening datasets, related to phagocytosis, Mycobacteria and immune response, provided by the project partners.

Already at the end of the first reporting period, we started using different machine learning methods to data provided by the PHAGOSYS project partners, with the aim of gaining a better understanding of the process of phagocytosis. In particular, we used methods for structured output prediction, such as predictive clustering trees, to analyze a variety of data related to phagocytosis of mycobacteria. These include microarray data with time course profiles of gene expression level in response to infection: Type 1 and Type 2 human macrophages were infected with M. tuberculosis and Schwann cells were infected with M. leprae (data from LUMC). Interesting groups of genes with common functional annotations and clearly defined response patterns were identified for Schwann cells: The gene network showing some of these groups and the corresponding functions is shown in Figure WP1-3. This enabled the formulation of a hypothesis about a new candidate gene, not previously related to the regulation of intercellular growth of Mycobacteria. The hypothesis was consequently investigated in wet-lab experiments and silencing of the gene was found to reduce bacterial load significantly.

Figure WP1-3: The gene interaction network (from STRING) showing the groups of genes with common functional annotations and clearly defined response patterns for Schwann cells infected with M. leprae.

We have then used predictive clustering trees, as well as feature ranking methods, to analyze data from several screens coming from different PHAGOSYS partners. These included a siRNA image-based screen of endocytosis in HeLa cells (data from MPI-CBG), a flow cytometry screen and a smaller image-based siRNA screen aimed at studying MHC ClassII antigen presentation (data from NKI), which we tried to inter-relate by our analyses. Finally, we analyzed data from a large compound screen (data from LUMC) of MelJuSo cellular assays infected with M. tuberculosis (Mtb) and S. typhimurium (Stm).

We analyzed data from a genome-wide siRNA image screen performed at MPI-CBG to study the process of endocytosis in the HeLa cell line. The phenotype images obtained after silencing each gene were analyzed using MotionTracking (MPI-CBG in-house software) to produce feature-based descriptions of the images. We described each gene with its functional annotation (using GO terms). A total of 20346 genes were analyzed, 3678 without known functions. The goal of the study was first to provide knowledge about endocytosis and then to annotate the hypothetical genes with possible functions using the methodology developed in (Kocev 2011).

Next, NKI provided data from two screens concerning MHC Class II AGP. The first screen includes flow cytometry data for 16675 genes, silenced in assays of the MelJuSo cell line, where the two anti-bodies L243 and CerCLIP were observed. Then, 276 candidate genes from this screen (selected by a z-score cut-off) were used for the second screen, where images of the phenotypes were taken and were then analyzed using the CellProfiler software to extract features. Using (semi)manual clustering on these features, 19 bins were identified (Paul et al. 2012), with quite a few genes/images left unassigned. We described each gene with its GO annotations and 13 CellProfiler features.

We performed several analyses in order to merge the above three studies (the MPI-CBG image screen and the NKI's flow cytometry and image screens) and identify some false negatives, possibly resulting from the z-score cutoff in the flow cytometry study. More precisely, we aimed to find genes (from the MPI-CBG genome wide study) that were not selected for the imaging study of NKI, due to the hard definition of a hit (a z-score larger than 2 or smaller than -2). Figure WP1-4 depicts the intersection of the genes analyzed in the three studies. We conducted two types of analyses. We first performed gene set enrichment analysis using SEGS-BioMine (Podpečan et al. 2011) and then predicted the possible bin assignment (the bins from NKI's study) of the genes from MPI-CBG's study.

Figure WP1-4: The overlap of the genes from the three studies.

In the first analysis, we discovered gene functions that were mostly present in the highly ranked genes from the flow cytometry study, separately for the two antibodies. For the L234 anti-body, most prominent are the genes involved in a mRNA metabolic process (GO:0016071), the genes that interact with genes active in processes in the endocytic vesicle membrane (GO:0030666) and the genes that interact with genes involved in the PPAR signaling pathway (KEGG:03320). For the CerCLIP anti-body, most prominent are the genes involved in cellular macromolecule metabolic process (GO:0044260), nucleic acid metabolic process (GO:0090304), intracellular membrane-bounded organelle (GO:0043231), the spliceosome (KEGG:03040), the genes that interact with genes involved in nuclear mRNA splicing, via spliceosome (GO:0000398), and the genes that interact with genes active in processes in nuclear body (GO:0016604).

The second analysis investigated the possibility of merging the three studies, hoping that the analysis performed in the genome-wide study at MPI-CBG could help elucidate some genes that were found to be negatives in the NKI flow cytometry study. We first predicted the phenotype bins from the CellProfiler features (NKI screen), the flow cytometry scores and the GO annotations. The results show that we can predict the the values for the bins with limited success. We then predicted the phenotype bins from the MotionTracker features (MPI-CBG screen), the flow cytometry scores and the GO annotations. The results, however, showed that the predictive performance is quite low. Information about the gene phenotypes from the MPI-CBG study cannot be (directly and in a straight-forward manner) connected to the phenotype bins defined by the NKI phenotype study. There are several factors that contribute to this outcome: the use of different cell-lines (HeLa vs. MelJuSo), the different cellular mechanisms studies (endocytosis vs. AGP), and different imaging and image description techniques (MotionTracking vs. CellProfiler). The integration of the screen data could be more successful if a smaller subset of genes from the NKI study is selected, and the closest matches from the MPI-CBG study calculated. This is the topic of ongoing collaboration between JSI and NKI.

In collaboration with LUMC, we analyzed data from a compound screen that they conducted. The screen used the LOPAC library and measured the (reduction in) bacterial load with flow cytometry (FACS) in two related studies. In the first study, MelJuSo cells were infected with M. tuberculosis (Mtb) at MOI 10, while in the second study, HeLa cells were infected with S. typhimurium (Stm) at MOI 10. The main goal of the analysis is to elucidate protein targets that contribute to the reduction/increase of the bacterial load, i.e., proteins that are involved in processes that kill or promote bacteria in the host cells.

We first performed data preprocessing, searching for descriptors of the 1260 compounds from the LOPAC library. We linked each of the compounds to the PubChem database at NCBI (http://www.ncbi.nlm.nih.gov/). We then extracted the proteins that were targeted by each compound in a human study, considering only studies were the compound was found to be active. In total, 711 protein targets were identified. We constructed three different descriptions of the dataset. First, we described 964 compounds through their respective protein targets (there are no protein targets for 296 compounds). Second, we described the compounds with the GO terms that the respective protein targets were annotated with. Third, we described the compounds with the GO terms of the target proteins and included gene-to-gene interaction information via the functions (terms) annotated with the interacting proteins. We then performed analyses on the three representations of the data.

We performed three different types of analyses of the data at hand. First, we conducted a discriminative analysis of the targets based on the hit compounds. Second, we performed analysis using predictive clustering trees trying to predict the z-score for reduction of the bacterial load using the three data representations. Third, we performed feature ranking using the three data representations.

For the discriminative analysis of the targets based on the hit compounds, we first reduced the number of targets considered. Namely, the number of 711 target proteins is large and includes proteins that are targeted by a large variety of compounds. These proteins are not necessarily specific or directly relevant for the response to Mtb and Stm infection. Furthermore, more interesting are the protein targets that were found active in studies involving hit compounds (compounds with z-score larger than 2 or smaller than -2) and not active in the remainder of the compounds. To this end, we prune the set of the proteins using the following rule: A protein target is removed from the set of protein targets if at least 90% of the compounds that it is targeted by are not identified as hits.

Using this reduced set of proteins, we formulated three separate lists of compounds and the respective protein targets. The first list gives 60 protein targets that are obtained when the hit compounds were defined by using both the Mtb and the Stm z-score values. The second list gives 58 protein targets that are relevant for the hit compounds for the Mtb study. The third list gives 15 protein targets relevant for the hit compounds for the Stm study. For these gene lists, we obtained three gene networks by using the STRING database of gene/protein interactions.

Figure WP1-5 depicts the gene network for the first gene list. In this network, we can note three parts. The largest part is positioned in the center of the network and includes different kinases. The second part is at the right-hand side of the network and includes a group of somatostatin receptors. The third part is at the left-hand side of the network and includes cell division cycle homologs, which through topoisomerases is connected to the largest part of the network.

Figure WP1-5: The gene network (STRING-based) for the gene list with hits for both Mtb and Stm.

The second analysis involved the construction of predictive clustering trees on the three data representations for each of the bacteria. This analysis discovered which protein targets, with which functions (through the respective GO terms), and which gene interactions are most responsible for the decrease or increase of bacterial load. In Figure WP1-6 CC, we show the predictive clustering tree for predicting the increase/decrease of bacterial load for Mtb, in which each tree leaf lists the compounds that belong to it. This tree selects the following GO terms/functions as most important: negative regulation of type 2 immune response (GO:0002829), defense response to virus (GO:0051607), early endosome (GO:0005769), myeloid cell activation involved in immune response (GO:0002275), phagocytic cup (GO:0001891), ER-associated protein catabolic process (GO:0030433), circulatory system process (GO:0003013) and antigen processing and presentation of exogenous peptide antigen via MHC class I, TAP-dependent (GO:0002479). All these functions are closely related to the process of Mtb infection and the defense response to it.

Figure WP1-6: The predictive clustering tree for predicting the increase/decrease of bacterial load for Mtb. Each tree leaf lists the compounds that belong to it. This tree is obtained from data where the compounds are described with the GO terms that the respective protein targets were annotated with.

For the third analysis, we performed feature ranking on the three data representations. We used RReliefF as a feature ranking algorithm (Robnik-Šikonja and Kononenko 2003). Namely, we would like to obtain an ordered list of genes or gene functions that are most relevant for the decrease/increase of bacterial load. After that, using the top 40 ranked genes we constructed a gene network using the STRING database. We show the two networks in FiguresWP1-7 and 8.

Figure WP1-7. Gene network for the gene list obtained by feature ranking on the Mtb data.

Figure WP1-8. Gene network for the gene list obtained by feature ranking on the Stm data.

References

(Cerepnalkoski et al. 2012) Darko Cerepnalkoski, Katerina Tashkova, Ljupčo Todorovski, Nataša Atanasova, and Sašo Džeroski. The influence of parameter fitting methods on model structure selection in automated modeling of aquatic ecosystems. Ecological Modeling, In press.

(Collinet et al. 2010) Claudio Collinet et al. Systems survey of endocytosis by multiparametric image analysis, Nature 464: 243-249

(del Conte-Zerial et al. 2008) Perla del Conte-Zerial et al. Membrane identity and GTPase cascades regulated by toggle and cut-out switches, Molecular Systems Biology 4:206

(Kocev 2011) Dragi Kocev. Ensembles for predicting structured outputs. PhD Thesis. International Postgraduate School Jožef Stefan, Ljubljana, Slovenia

(Paul et al. 2011) Petra Paul et al. A Genome-wide Multidimensional RNAi Screen Reveals Pathways Controlling MHC Class II Antigen Presentation, Cell 145(2):268-283

(Podpečan et al. 2011) Vid Podpečan et al. SegMine workflows for semantic microarray data analysis in Orange4WS. BMC Bioinformatics 12: 416

(Robnik-Šikonja and Kononenko 2003) Marko Robnik-Šikonja and Igor Kononenko. Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53(1-2):23-69

(Szomolay and Shahrezaei 2012) Barbara Szomolay and Vahid Shahrezaei. Bell-shaped and ultrasensitive dose-response in phosphorylation-dephosphorylation cycles: the role of kinase-phosphatase complex formation, BMC Systems Biology 6:26

(Tashkova et al. 2011) Katerina Tashkova, Peter Korošec, Jurij Šilc, Ljupčo Todorovski, and Sašo Džeroski. Parameter estimation with bio-inspired meta-heuristic optimization: modeling the dynamics of endocytosis. BMC Systems Biology, 5: 159

(Vens et al 2008) Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, and Hendrik Blockeel. Decision trees for hierarchical multi-label classification. Machine Learning 73(2):185-214

(Zoncu et al. 2009) Roberto Zoncu et al. Phosphoinositide Switch Controls the Maturation and Signaling Properties of APPL Endosomes, Cell 136:1110-1121