
OpenKnowledge FP6-027253

Summative Report on Bioinformatics Case Studies

Adrian Perreau de Pinninck1, Carles Sierra1, Chris Walton2, David de la Cruz1, David Robertson2, Dietlind Gerloff3, Enric Jaen1, Qian Li4, Joanna Sharman5, Joaquin Abian6, Marco Schorlemmer1, Paolo Besana2, Siu-wai Leung2, and Xueping Quan2

1. Artificial Intelligence Research Institute, IIIA-CSIC, Spain
2. School of Informatics, University of Edinburgh, UK
3. Dept of Biomolecular Engineering, University of California Santa Cruz, USA
4. Gulfstream-software Ltd
5. Institute of Structural and Molecular Biology, University of Edinburgh, UK
6. Institute for Biomedical Research of Barcelona, IIBB-CSIC, Spain

Report Version: first
Report Preparation Date: 20.11.2008
Classification: deliverable D6.4
Contract Start Date: 1.1.2006
Duration: 36 months
Project Co-ordinator: University of Edinburgh (David Robertson)
Partners: IIIA (CSIC) Barcelona, Vrije University Amsterdam, University of Edinburgh, KMI, Open University, University of Southampton, University of Trento


1. Introduction

Modern biological experimentation requires computational techniques of different kinds to enable large-scale and high-throughput studies. For example, structural genomics efforts aim to understand the function of proteins from their 3-D structures, which are either determined by experimental methods (e.g., X-ray crystallography and NMR spectroscopy) or predicted by computational methods (e.g., comparative modelling, fold recognition, and ab initio prediction). Proteomics efforts (in the most inclusive sense of the term), as another example, aim to understand the functional consequences of the collection of proteins present in a cell, or tissue, at a given time, particularly where differences are observed between healthy and disease states.

In both examples the data, and the analytical methodology applied to them, are obviously central to accomplishing the aims of these scientific domains. In addition, however, a framework is required that allows researchers to access the data, interpret the data, and exchange knowledge with one another. Most of the infrastructures currently in use enable straightforward access to centrally stored experimental data via large databases, and to tools and services either via web servers or repositories. Their existence and availability have undoubtedly been seminal in establishing the important position held by applied bioinformatics research in many biological experimental laboratories today, and they will still play a role in the future. However, their versatility (e.g., with respect to heterogeneous data types) and expandability (e.g., with respect to the ease with which data are shared amongst different research groups) are limited. In our view there is still ample room for improvement, even re-invention, of the framework underpinning biological and bioinformatics research interactions.

The value of integrating resources and researchers more effectively has been recognised by many others in the field, and recently a number of new infrastructures have emerged that address some of the shortcomings and foster interactions. Many of them are focused exclusively on facilitating bioinformatics research within specific domains, and range from workbench-style environments (e.g. Schlueter 2006) to peer-to-peer based systems for small specialised groups (e.g. Overbeek 2005). In our own research we were interested in realising a more general framework.

In this context, any experimental protocol that is followed when one, or several, researchers are undertaking a bioinformatics experiment can be viewed as a series of interactions between the researcher(s), the databases from which the data are obtained, and the tools that are applied to derive secondary information from these data. Many bioinformatics protocols can be represented as consecutive interactions, or steps in a workflow. Moreover, an improved or novel framework should build on existing network connections (such as the internet, or the Grid). Accordingly, the developments by the myGrid project group, such as the Computer-Aided Software Engineering (CASE) tool Taverna (Oinn 2004), currently play the most prominent role in the sector of automated experimentation enactment in bioinformatics. However, one weakness of this design is that, while it facilitates reproducible research, it cannot be extended to support the kind of effective sharing of resources (tools, data, knowledge, etc.) that is conceivable across a peer-to-peer network.

Below we describe how the OpenKnowledge P2P infrastructure was used to enact bioinformatics analyses involving: consistency checking amongst comparable data from different bioinformatics programs (ranked lists of short amino acid sequences that could have yielded a given tandem mass spectrum; Section 4.1) and from different databases (atomic coordinates of modelled 3-D structures of yeast proteins; Section 5); peer ranking for measurement of the relative popularity of protein identifications by tandem mass spectrometry (MS/MS) (Section 4.2); and data sharing for protein identification in proteomics (the OK-omics scenario, Section 4.3).


2. Bioinformatics Problems

2.1. Protein Identification by MS/MS in Proteomics

Tandem mass spectrometry (MS/MS) involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyze complex mixtures of proteins. In recent decades, many specific techniques for identifying protein sequences from MS/MS spectra have been implemented and made available either as downloadable local applications or in the form of web-enabled data processing services. The two most frequently used computational approaches to recognizing sequences from mass spectra are:

(a) the peptide fragment fingerprinting (PFF) approach, in which spectrum analysis is performed specifically for candidate proteins extracted from a database, by building theoretical model spectra (from theoretical proteins) and comparing the experimental spectra with them. This takes advantage of public genomic-translated databases (GTDBs) that can be accessed through data-mining software (search engines) which directly relates mass spectra to database sequences. Most of the search engines (MASCOT (Perkins et al, 1999), OMSSA (Geer et al, 2004), SEQUEST (Eng et al., 1994)) are available both as standalone programs querying a local copy of a GTDB and as web services connected to online GTDBs. The main drawback of this approach is that it can only be used where the genome has been sequenced and all predicted proteins for the genome are known; it is not suitable for proteins carrying post-translational modifications (PTMs) missing from the database, or for proteins from unsequenced genomes.

(b) the de novo sequencing approach, in which the inference of peptide sequences or partial sequences is independent of the information extracted from pre-existing protein or DNA databases. Sequence similarity search algorithms have been specially developed to compare the inferred complete or partial sequences with theoretical sequences. Once a protein has been sequenced by de novo methods, one can look for related proteins in a GTDB using a matching algorithm such as MS-BLAST (Shevchenko et al 2001).

Factors that usually complicate the MS/MS spectrum interpretation task include:

• the number of admitted PTMs can multiply the volume of results to be analysed;

• poor quality and noise in mass spectra increase the uncertainty of interpretation; and

• database errors in sequence annotations can lead to mis-interpretation.

These obstacles indirectly produce a huge amount of un-interpreted data (for instance, non-matching mass spectra or low-scoring de novo interpreted sequences) which are likely to be discarded. The unmatched data could be due to peptides derived from novel proteins, from allelic or species-derived variants of known proteins, or from PTMs. Nowadays these un-interpreted data are seldom accessible to other groups involved in the identification of the same or homologous proteins.

2.2. Protein Structure Prediction in Structural Genomics

Protein structure prediction is one of the best-known goals pursued by bioinformaticians. A protein’s three-dimensional (3-D) structure contributes crucially to understanding its function, to targeting it in drug discovery and enzyme design, etc. However, there is a continually widening gap between the number of protein amino acid sequences that are deduced rapidly through ongoing genomics efforts and the number of proteins for which atomic coordinates of their 3-D structures are deposited in the Protein Data Bank (PDB) (Berman et al., 2000), i.e. those that are determined by structural biological techniques. To bridge this gap, many computational biology research groups have been focussing on developing ever-improving methodology to produce structural models for proteins based on their amino acid sequences. Still, the resulting methods are far from perfect, and no single method always produces an accurate model. However, particularly in comparative modelling cases (where a protein with known structure can be used as a template for a protein of interest, based on similarity between their sequences), high-quality modelled structures can be useful resources for biological research (see Krieger 2003 for a review). Consistency checking and consensus building are commonly used strategies in the field to select high-quality models from the pool of available models produced by different methods.

3. OpenKnowledge for Bioinformatics Experiments

3.1. OpenKnowledge Infrastructure

The OpenKnowledge system (Siebes et al. 2007) is a fully distributed system that uses peer-to-peer technologies to share interaction models of protocols and service components (so-called OpenKnowledge Components, or OKCs) across a network (normally the Internet). The OpenKnowledge kernel (downloadable from http://www.openk.org) is a small Java program that can easily be downloaded and run (in a manner similar to downloading and running systems like Skype). Its job is to allow the computer on which it has been installed to discover interaction models available in the peer community; to put together a confederation of peers to take part in any chosen interaction; and to run that interaction to completion if possible. Primarily this involves one peer acting as a coordinator of the interaction, and fellow peers managing OKCs to play the changeable roles specified in the interaction model (Figure 1). The OpenKnowledge system consists of three main services, which can be executed by any computer once the kernel is installed:

• the discovery service is a distributed hash table (DHT) in which the shared interaction models and OKCs are stored so that the users can search for them and download them;

• the coordination service manages the interactions between OKCs; and

• the execution service consists of the kernel executing the service on the local machine.

The workflow for implementing a new application is as follows. First the interaction model (or, to a biologist or bioinformatician, an experimental protocol) must be expressed in the specification language LCC (Lightweight Coordination Calculus) (Robertson 2004). LCC is described briefly below, in Section 3.2. This interaction model is published to the discovery service so that other users can find it and subscribe to play a role in it. A developer, not necessarily the one who implemented the interaction model, will develop OKCs that play the roles defined in the interaction model. Some of these OKCs may be shared across the network by publishing them to the discovery service, i.e. other users may download them onto their local computers. At this point the distributed application is implemented and can be executed using the OpenKnowledge system. Peers can proactively search for interaction models, or be notified about newly published interaction models they are interested in. In both cases, peers receive a list of interaction models and will subscribe to perform one role (or more) in those that best match their capabilities and interests. Any peer in the network can opt to assume the coordinator role and manage the interaction. The coordinating peer, randomly chosen by the discovery service among those who have opted in, is handed the interaction model together with the list of matching peers for the roles. Once a confederation of mutually compatible peers has been assembled, they are asked by the coordinator to commit to the services they have subscribed to provide. If the peers commit, the interaction model is executed.
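As an illustration, the Python sketch below mimics the publish/subscribe/assemble/commit lifecycle just described. Every class and method name here is a hypothetical stand-in, not the kernel's actual (Java) API:

import random

class DiscoveryService:
    """Stands in for the DHT that stores interaction models and OKCs."""

    def __init__(self):
        self.models = {}          # model_id -> (model, {role: [peers]})
        self.coordinators = []    # peers that have opted in to coordinate

    def publish(self, model_id, model):
        self.models[model_id] = (model, {})

    def subscribe(self, model_id, role, peer):
        self.models[model_id][1].setdefault(role, []).append(peer)

    def assemble(self, model_id, required_roles):
        """Once every role has a subscriber, hand the model and the
        subscriber lists to a randomly chosen coordinator."""
        model, subs = self.models[model_id]
        if self.coordinators and all(subs.get(r) for r in required_roles):
            return random.choice(self.coordinators), model, subs
        return None

class Peer:
    def __init__(self, name):
        self.name = name

    def commit(self, model):
        return True               # a real peer may still decline here

def run_interaction(coordinator, model, peers_by_role):
    """The coordinator asks every subscribed peer to commit, then steps
    through the interaction model (interpretation elided here)."""
    if all(p.commit(model) for ps in peers_by_role.values() for p in ps):
        pass                      # interpret the LCC clauses, route messages

dht = DiscoveryService()
dht.coordinators.append(Peer("coordinator-peer"))
dht.publish("denovo-crosscheck", "interaction model (LCC text)")
dht.subscribe("denovo-crosscheck", "data_source", Peer("pepnovo-peer"))
print(dht.assemble("denovo-crosscheck", ["data_source"]))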


Figure 1: Inside the OpenKnowledge kernel. A peer can store and search interaction models and OKCs in the discovery service. Interaction models specified in LCC define the roles to be played by OKCs. A peer can act as a regular peer managing one, or several, OKCs or it can act as a coordinator for the interactions. An interpreter interprets an interaction model as it is being executed.

3.2. LCC as Experiment Description Language

An interaction model is specified in LCC as a set of clauses, each of which defines how one role in the interaction must be performed by a peer assuming this role. A role is described by the type of the role and an identifier for the peer taking it on. Formally we describe this as a(Role, Identifier), e.g. a(experimenter, E). The definition of a role is constructed using combinations of the sequence operator (‘then’) and choice operator (‘or’) to connect messages and changes of role. Messages are either outgoing to another peer in a given role (‘=>’) or incoming from another peer in a given role (‘<=’). Message input/output or role changes can be governed by a constraint defined using the normal logical operators for conjunction, disjunction and negation. The example below shows an LCC clause defining the role of a data source as it is used in the protocols enacted in the experiments described in Sections 4.1 and 5, below. According to this definition, a data source, P, either receives a data request from a data retriever, X, then returns to the retriever a data report (here one based on de novo analysis), and then continues as a data source; or it receives a message from the same peer, X, in its role as data collator to signal the end of the experiment.

a(data_source, P)::
  ( data_request(F,D) <= a(data_retriever,X) then
    data_report(F,Ds) => a(data_retriever,X) <- denovoanalysis(F,D,Ds) then
    a(data_source,P) )
  or
  end_of_experimentA <= a(data_collator, X)

The definition above illustrates most of the features of the LCC language. It provides constants (beginning with lower-case characters), variables (beginning with upper-case characters) and tree-structured data structures (not shown in this example, but allowing it to represent structured terms such as lists). Roles can be recursively defined (so the data source above recurses as a data source again after it has dealt with the data retriever’s request). Roles can also be parameterised with arguments (not shown in the example above) so that, taken together with the use of recursion, LCC clauses have an expressivity similar to established declarative programming languages such as Prolog. Unlike traditional declarative languages, the primitive expressions in the definition of an LCC clause are message-passing events: in our example these are the input data request or end of experiment messages, the output data report message, or the null event signifying a change in the peer without an input or output message. Conditions attached to message-passing events (using the ‘<-’ operator) connect the events to OpenKnowledge components (OKCs) which link to the programs run by the peer concerned. The full LCC specifications for the protocols enacted in the two bioinformatics experiments (Sections 4.1 and 5) are available in Figures 2 and 23. It should be noted that interaction models do not specify how to solve constraints. The peers solve constraints using plug-in components, called OpenKnowledge Components (OKCs), which can be shared along with the interaction models and/or developed anew. In the current implementation of the OpenKnowledge system an OKC is a Java archive (jar) file which contains a facade class exposing methods that are matched onto the constraints upon subscription by the peer hosting the OKC. Currently this includes methods that can access databases, process parameters, and wrap legacy applications or web services.
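The matching of constraints onto facade methods can be pictured with a small Python sketch (the real facade is a Java class packaged in a jar; all names below are illustrative only). A constraint such as denovoanalysis(F, D, Ds) is matched by name and argument count; the input variables are passed in and the return value binds the output variable:

import inspect

class DeNovoOKC:
    """Facade exposing one method per constraint this peer can solve."""

    def denovoanalysis(self, f, d):
        # Run a de novo sequencing tool on spectrum D from file F and
        # return the data report Ds (stubbed here).
        return [("TQLLDVLSNK", 0.9)]

def solve_constraint(okc, name, *in_args):
    """Match a constraint onto a facade method by name and arity."""
    method = getattr(okc, name, None)
    if method is None or \
            len(inspect.signature(method).parameters) != len(in_args):
        return None               # this peer cannot solve the constraint
    return method(*in_args)       # value bound to the output variable

print(solve_constraint(DeNovoOKC(), "denovoanalysis", "spec1.dta", "..."))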

4. Protein Identification by MS/MS

In this section we describe the bioinformatics scenarios in the field of protein identification by MS/MS, including consistency checking of de novo sequencing tools to improve de novo sequencing accuracy, peer ranking to measure the relative popularity of protein identification approaches/tools, and finally the OK-omics scenario of data sharing for protein identification in proteomics, all deployed on top of the OpenKnowledge infrastructure.

4.1. Cross-checking for de novo sequencing by MS/MS

De novo methods for MS/MS protein identification have received considerable interest, as they are the only way to identify peptides when no appropriate database is available. Useful de novo methods do exist. Lutefisk (Taylor and Johnson, 1997; Johnson and Taylor, 2002) performs de novo sequencing based on a spectrum graph approach combined with a heuristic algorithm to compute the optimal path connecting the N- and C-termini. It traces sequences starting from the N-terminus until a sequence’s mass matches the peptide’s molecular mass. PEAKS (Ma et al. 2003; Ma et al., 2005), unlike the graph-based approach, works directly on mass spectra. It first computes, for each mass value m, a reward/penalty score reflecting the evidence that a y- or b-ion has mass m. PEAKS then generates, through a dynamic programming algorithm, potential sequences whose y and b ions maximize the total rewards at their mass values, followed by re-evaluation of these sequences with a new scoring scheme. Finally PEAKS computes a confidence score for each of the top-scoring peptide sequences. PepNovo (Frank and Pevzner, 2005) uses a probabilistic network representing the rules governing peptide fragmentation to calculate the likelihood that a y- or b-ion match is true. The likelihood ratios between observed matches and a random match are then calculated, and their logarithms are summed as the final confidence score.

The accuracy of de novo interpretation depends strongly on the quality of the mass spectra, which are often compromised by poor fragmentation and by mass-shift inaccuracies caused by temperature and other instrumental parameters. Several research groups (Resing et al., 2004; Rogers, 2005; Searle and Turner, 2005) have succeeded in improving the sensitivity and accuracy of peptide sequence identification through consistency checking of the results from different database searching (not de novo) methods for MS/MS interpretation. In this scenario, we investigate the possibility of improving the accuracy of peptide sequence identification through consistency checking of the results from different de novo sequencing methods for MS/MS interpretation.

4.1.1 Roles of Peers in the Experiment

Data Sources/Providers: OpenKnowledge Components (OKCs) were developed to access and manipulate the Web-based PEAKS Online 2.0 and the local programs PepNovo Win32 Executable and Lutefisk XPv1.0, handling MS/MS spectrum inquiries and de novo sequencing results.

Data Comparer: This peer is in charge of comparing the results given by the data sources. A weighting scheme is applied for comparison between de novo interpretations produced by different de novo sequencing algorithms. The weight score W(n) for a candidate sequence is based on the combination of the individual performance P(k) of the de novo sequencing algorithm (Table 1) and the original confidence score C(n) assigned to the candidate sequence by the algorithm, as shown by the following formula:

W(n) = (C(n) / C(1)) × P(k)

where C(1) is the confidence score of the top sequence in the candidate sequence list. P(k) is equal to 1 for candidate sequences predicted by PEAKS and PepNovo, and 0.5 for those predicted by Lutefisk.

Cross-checking was done by pair-wise comparison of the candidate sequences coming from the different de novo sequencing algorithms. The pair-wise comparison uses approximate matching to allow mismatches of one to two amino acids (Levenshtein edit distance ≤ 2). If two peptide sequences are consistent under approximate matching, the longer sequence is assigned a new weight score equal to the sum of the weights of both sequences in the comparison, and the shorter sequence is removed from the list. The other sequences are retained with their original weights. The pair-wise comparison then continues with the remaining sequences. After re-ranking all the candidate sequences, a candidate sequence list can be generated as the combined result from all participating de novo sequencing peers.
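A minimal Python sketch of this weighting and merging scheme, under the assumptions stated in the text (P(k) values of 1 and 0.5, edit-distance threshold 2); the helper names and the exact merging order are our own simplifications:

from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cross_check(candidate_lists):
    """candidate_lists: one entry per algorithm, as (P_k, [(seq, conf), ...])
    with each list ordered by confidence. Returns the re-ranked combined list."""
    weights = {}
    for p_k, seqs in candidate_lists:
        c_1 = seqs[0][1]                        # confidence of top sequence
        for seq, c_n in seqs:
            w = (c_n / c_1) * p_k               # W(n) = (C(n)/C(1)) * P(k)
            weights[seq] = weights.get(seq, 0.0) + w
    merged = True
    while merged:                               # merge consistent pairs
        merged = False
        for s1, s2 in combinations(sorted(weights, key=len, reverse=True), 2):
            if s1 in weights and s2 in weights and levenshtein(s1, s2) <= 2:
                weights[s1] += weights.pop(s2)  # longer keeps summed weight
                merged = True
    return sorted(weights.items(), key=lambda kv: -kv[1])

ranked = cross_check([
    (1.0, [("TQLLDVLSNK", 95.0), ("TQLLDVLAEK", 70.0)]),  # e.g. PEAKS
    (1.0, [("TKLLDVLSNK", 0.90), ("AQLLDVLSNK", 0.55)]),  # e.g. PepNovo
    (0.5, [("TQLLDVLS", 0.40)]),                          # e.g. Lutefisk
])
print(ranked)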

The detailed definition of the interactions between the peers performing these roles and the other systematic roles is given in Figure 2.

4.1.2 Experiment Setup

Benchmark Data: 280 doubly charged tryptic peptides obtained from low-energy ion trap LC/MS/MS runs (Frank and Pevzner, 2005) were used to illustrate the cross-checking algorithm and its advantages. The sequences of these MS/MS spectra had been identified through SEQUEST (Eng et al., 1994) searching with high confidence scores. 210 spectra in this data set were randomly selected as training data to evaluate the performance of the three data sources and to devise the weight scores for cross-checking. The other 70 spectra were used as testing data to evaluate the performance of cross-checking.

Individual Program Performance: For each spectrum inquiry, each de novo sequencing program returned a ranked list of candidate peptide sequences in order of the confidence score defined by the program. Using a performance evaluation strategy similar to Frank and Pevzner (2005), the individual performance of PEAKS Online 2.0, PepNovo Win32 Executable and Lutefisk XPv1.0 was evaluated on the training dataset of 210 spectra. The prediction precision of the three programs for the given 210 spectra was evaluated based on the top sequences in their results. PEAKS Online 2.0 was run with parent and fragment tolerances both of 0.5 Da, and trypsin digestion. Lutefisk and PepNovo were run with their default parameters for doubly charged tryptic peptides on ion trap MS. We did not differentiate between the amino acids Q and K, or between I and L, when calculating the precision.


// Dtas is a list of DTA file contents
a(experimenter, E)::
  null <- ask_files(Lst_dtas_file, Dtas) then
  data_request(Lst_dtas_file, Dtas) => a(data_coordinator, X) then
  data_compared(Lst_dtas_file, SdF) <= a(data_coordinator, X) then
  null <- printresults(SdF)

a(data_coordinator, X)::
  data_request(Lst_dtas_file, Dtas) <= a(experimenter, E) then
  null <- getInterestedRole(R) and getPeers(R, Sp) and makeEmptyList(DfBuildUp) then
  null <- debugVar(Sp) then
  a(data_collector(Lst_dtas_file, Dtas, Sp, DfBuildUp, DfF), X) then
  a(experiment_end_marker(Sp), X) then
  end_of_experimentB(DfF) => a(data_comparer, C) then
  data_compared(Lst_dtas_file, DfF) => a(experimenter, E)

// Lst_dtas_file: list of file names
// Dtas: list of file contents
// Sp: list of sources
// DfF: final list of comparisons
a(data_collector(Lst_dtas_file, Dtas, Sp, DfBuildUp, DfF), X)::
  null <- Lst_dtas_file = [] and assign(DfBuildUp, DfF)
  or
  ( null <- Lst_dtas_file = [F|Rdtas_filesTail] and Dtas = [D|DtasTail] then
    null <- makeEmptyList(SBuildUp) then
    a(data_retriever(F, D, Sp, SBuildUp, SF), X) then
    data_tocompare(F, SF) => a(data_comparer, C) then
    data_compared(F, Df) <= a(data_comparer, C) then
    null <- NewDf = [Df|DfBuildUp] then
    a(data_collector(Rdtas_filesTail, DtasTail, Sp, NewDf, DfF), X) )

a(data_retriever(F, D, Sp, SBuildUp, SF), X)::
  null <- Sp = [] and assign(SBuildUp, SF)
  or
  ( null <- Sp = [P|Rp] then
    data_request(F, D) => a(data_source, P) then
    data_report(F, Ds) <= a(data_source, P) then
    null <- NewSBuildUp = [Ds|SBuildUp] then
    a(data_retriever(F, D, Rp, NewSBuildUp, SF), X) )

a(experiment_end_marker(Sp), X)::
  null <- Sp = []
  or
  ( end_of_experimentA(DfF) => a(data_source, P) <- Sp = [P|T] then
    a(experiment_end_marker(T), X) )

a(data_source, P)::
  ( data_request(F, D) <= a(data_retriever, X) then
    data_report(F, Ds) => a(data_retriever, X) <- denovoanalysis(F, D, Ds) then
    a(data_source, P) )
  or
  end_of_experimentA(DfF) <= a(data_coordinator, X)

a(data_comparer, C)::
  ( data_tocompare(F, SF) <= a(data_collector, X) then
    data_compared(F, Df) => a(data_collector, X) <- threeway_check(F, SF, Df) then
    a(data_comparer, C) )
  or
  end_of_experimentB(DfF) <= a(data_coordinator, X)

Figure 2: LCC definition for consistency checking between different de novo approaches.


Table 1: Performance of de novo protein sequencing by PEAKS Online 2.0, PepNovo Win32 Executable, and Lutefisk XPv1.0.

Algorithm   Precision   Recall
PEAKS       0.736       0.728
PepNovo     0.737       0.692
Lutefisk    0.402       0.461

The precision and recall of prediction with mass tolerance 2.0 Da are calculated as:

Precision = (number of correct amino acids) / (number of predicted amino acids)

Recall = (number of correct amino acids) / (number of all amino acids in the test set)

The predicted sequences were checked against the correct sequence for each spectrum (Table 1). Two amino acids being compared were counted as in agreement if their difference in mass position between the experimental spectrum and the predicted spectrum was ≤ 2.0 Da. Precision was given by dividing the number of correct amino acids by the total number of amino acids in the predicted peptide sequences being compared. Recall was given by dividing the number of correct amino acids by the number of all amino acids in the test sequences being compared. For example, if the correct sequence for a mass spectrum is “TQLLDVLSNK” and the predicted sequence is “TKLLDVLAE”, the precision of this prediction is 7 (correctly predicted amino acids) divided by 9 (the length of the predicted sequence) = 77.8%, and its recall is 7 divided by 10 (the length of the correct sequence) = 70%.
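The worked example can be checked numerically with the simplified position-wise comparison below (the full procedure aligns residues by mass position within 2.0 Da; here aligned positions are compared directly, treating Q/K and I/L as equivalent, as in the text):

EQUIVALENT = [{"Q", "K"}, {"I", "L"}]

def same(a, b):
    return a == b or any({a, b} <= group for group in EQUIVALENT)

def precision_recall(correct, predicted):
    n_correct = sum(same(c, p) for c, p in zip(correct, predicted))
    return n_correct / len(predicted), n_correct / len(correct)

p, r = precision_recall("TQLLDVLSNK", "TKLLDVLAE")
print(f"precision = {p:.3f}, recall = {r:.3f}")   # 0.778, 0.700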

Figure 3: Peer network architectures for consistency-checking of de novo sequence derivation from MS/MS spectra. Peers (PepNovo, Lutefisk) take on data provider roles and manage OKCs to execute specific programs interpreting MS/MS spectra; the PEAKS peer takes on a data provider role and manages an OKC to access a specific web server delivering the results of a third MS/MS interpretation program. Peer X acts as data collator, data collector, data retriever, and experiment end marker. The experimenter peers manage OKCs sending out requests for information to the network. Finally, the comparer peer (CDNS) manages OKCs to invoke the consistency-checking programs. (The original caption also keys the figure's symbols: interaction roles defined in the LCC specification for the experiment (Suppl. Mat. S1), peers coloured by resource type, OKCs, local databases, locally executed programs, and utility scripts in Perl or Java.)

4.1.3 Experiment Results

Based on the test dataset with 70 spectra, we obtained a re-ranked list of candidate peptide sequences at three different confidence levels. The results show improved precision (Table 2):


Table 2: Improvement of precision by consistency checking.

Confidence Level   Precision   Subset/Test Set
Low                0.754       1.00
Medium             0.848       0.600
High               0.921       0.143

Precision is the average over the predicted amino acids in the first sequences of the re-ranked results for the benchmark dataset; Subset/Test Set gives each subset's size as a fraction of the total test set.

At the low confidence level, the average precision was still higher than that of any single de novo algorithm (Table 2), and there was no loss of sequences. At the medium confidence level, a subset (40 spectra, 60.0% of the total test data set) of sequences was obtained, in which the top sequences were agreed by at least two de novo sequencing peers; the average precision of this subset increased to 0.848. At the high confidence level, a smaller subset (10 spectra, 14.3%) was obtained, but it gave an even higher average precision of 0.921. In this smaller subset, the top sequences in the re-ranked sequence list were in agreement among all peers.

As seen from the results of the pair-wise comparisons (Table 3), the improvement in average precision at the medium confidence level might be attributed to the consensus between PEAKS and PepNovo. The high average precision (0.921) of the smaller subset at the high confidence level might be due to the consensus between PepNovo and Lutefisk.

Table 3: Average precision of predicted amino acids in the top consensus sequences in pair-wise comparison.

Algorithm   Lutefisk     PepNovo      PEAKS
PEAKS       0.847 (18)   0.858 (35)   0.736
PepNovo     0.908 (10)   0.737
Lutefisk    0.402

The data within the parentheses are the numbers of pair-wise consensuses for the two corresponding programs being compared.

4.2. Peer Ranking of Protein Identification Tools

There is a level of confusion surrounding the selection of a specific protein identification approach, and of a specific protein identification tool, for particular MS/MS tasks. In this section, we introduce a novel approach that helps to measure the relative popularity/importance of different protein sequence identification tools employing different approaches (the PFF approach, and the combination of de novo sequencing followed by database similarity searching), based on a peer ranking algorithm integrated into the OpenKnowledge system.


a. LCC definition for de novo + database similarity searching approach:

a(spectra_input, SI)::
  mass_spectra(SN, Spec) => a(denovo_approach, NOVO) <- uploadSpec(SN, Spec) then
  result(Val) <= a(output_interface, OI) then
  null <- Val == 1

a(denovo_approach, NOVO)::
  mass_spectra(SN, Spec) <= a(spectra_input, SI) then
  denovo_result(Denovo_res) => a(similarity_search, SS) <- denovo_analysis(SN, Spec, Denovo_res)

a(similarity_search, SS)::
  denovo_result(Denovo_res) <= a(denovo_approach, NOVO) then
  results(SS, Sim_res) => a(output_interface, OI) <- similarity_analysis(Denovo_res, Sim_res)

a(output_interface, OI)::
  results(SS, Res) <= a(similarity_search, SS) then
  ( null <- filter_result2(SS, Res) then
    result(1) => a(spectra_input, SI) )
  or
  result(0) => a(spectra_input, SI)

b. LCC code for PFF approach:

a(spectra_input, SI)::
  mass_spectra(SN, Spec) => a(pff_approach, PFF) <- uploadSpec(SN, Spec) then
  result(Val) <= a(output_interface, OI) then
  null <- Val == 1

a(pff_approach, PFF)::
  mass_spectra(SN, Spec) <= a(spectra_input, SI) then
  null <- pff_analysis(SN, Spec, Res) then
  results(PFF, Res) => a(output_interface, OI)

a(output_interface, OI)::
  results(PFF, Res) <= a(pff_approach, PFF) then
  ( null <- filter_result1(PFF, Res) then
    result(1) => a(spectra_input, SI) )
  or
  result(0) => a(spectra_input, SI)

Figure 4: LCC code for the peer ranking test experiment


4.2.1 Roles of Peers in the Experiment

Five roles are defined in two interaction models (Figure 4). One interaction model describes the process of de novo sequencing followed by database similarity searching, while the other describes the PFF approach to protein identification by MS/MS.

Initiator/Spectra Input: This is a common role initiating both the interaction model for PFF analysis (routine 1) and the interaction model for the combined de novo sequencing and database similarity searching analysis (routine 2). OpenKnowledge Components have been developed for users to upload the MS/MS spectra and to select the interaction model they prefer to execute. If no selection is made, one of the two IMs is randomly selected by the OpenKnowledge system for execution.

De novo Approach: Two local programs, PepNovo Win32 Executable and Lutefisk XPv1.0, are accessible through OKCs for de novo sequencing analysis if routine 2 is selected for execution. Both Lutefisk and PepNovo were run with their default parameters for doubly charged tryptic peptides on ion trap MS.

Similarity Search: the result yielded by the de novo sequencing analysis is passed to a database searching program, in this experiment MS-BLAST, to scan databases of known genomes for homologous sequences. MS-BLAST was run with default parameters against the database nrdb95.

PFF Approach: OpenKnowledge Components have been developed to access and manipulate two web-based servers, MASCOT and OMSSA, handling MS/MS spectrum inquiries and PFF analysis if routine 1 is selected by the spectra_input peer. MASCOT was run with the database NCBInr, the enzyme trypsin, and all the other default parameters it provides for the peptide mass fingerprint approach. OMSSA was run with the database nr, the enzyme trypsin, maximum missed cleavages equal to 2, minimum charge to start using multiply charged products equal to 2, all the optional species selected, and all the other default parameters for ion trap spectrometers.

Output Handling: the output_interface role is in charge of filtering the results received from the similarity searching peer or from the pff_approach peer. If the result yielded by an executed interaction model fails to pass the filtering criteria, it is treated as a failed IM. The filtering criterion varies with the individual peers, to ensure that the false discovery rate of the final filtered result is less than 0.1 (Table 4).

Table 4: The score name and threshold value for filtering of results given by peers MASCOT, OMSSA, and MS-BLAST

Peer Name   Score Name       Threshold
MASCOT      MASCOT score     ≥ 30 (a)
OMSSA       E-value          ≤ 0.1 (b)
MS-BLAST    MS-BLAST score   ≥ 57 (c)

a. Perkins et al, 1999; b. Geer et al, 2004; c. Habermann et al, 2004.

4.2.2 Experiment Setup

Peer ranking, like PageRank, works by assigning a ranking to a peer at any given time as a function of its previous ranking, modified by the rankings of each peer with which it has interacted, with the purpose of measuring relative popularity within the peer set. The ranking of a peer thus depends on the number of successful interaction models in which the peer has been involved.
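The report does not spell out the exact update formula, so the following Python sketch only illustrates the bookkeeping, under our simplifying assumption that a peer's score is its share of all successful runs, normalised over the peer set:

def update_rankings(stats, interaction_peers, succeeded):
    """stats: peer -> [successes, total_runs]. Called after every run with
    the peers that took part; returns scores normalised to sum to one."""
    for p in interaction_peers:
        s, t = stats.setdefault(p, [0, 0])
        stats[p] = [s + (1 if succeeded else 0), t + 1]
    total = sum(s for s, _ in stats.values()) or 1
    return {p: s / total for p, (s, _) in stats.items()}

stats = {}
scores = update_rankings(stats, ["spectra_input", "MASCOT",
                                 "output_interface"], succeeded=True)
print(scores)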

The peer ranking test experiment, consisting of 280 runs, was based on the entire benchmark dataset used in Section 4.1.2 (280 doubly charged tryptic peptides obtained from low-energy ion trap LC/MS/MS runs), in order to reduce random errors. Each round went through only one of the possible routes, with one peer performing each of the roles specified in the associated LCC code (Figures 4 and 5). A single round started with uploading the MS/MS spectra data of a peptide to the spectra_input peer. In addition to spectra uploading, the OKC developed for this peer allows it to select its preferred analysis approach for protein identification from the MS/MS spectral data: either the PFF approach (Route 1) or the de novo sequencing plus database similarity searching approach (Route 2). If neither approach was selected, the system randomly selected one for further interactions, as was done in this test experiment. In Route 1 of this experiment, one of the two peers subscribed to the role pff_approach, OMSSA or MASCOT, was randomly selected by the system to perform the role. The selected peer processed the spectra data and sent its result to the output_interface peer. Similarly, in Route 2, the role denovo_approach was performed by one of the de novo sequencing peers subscribed to that role, PepNovo or Lutefisk. The de novo sequencing result was then sent to the peer performing the role similarity_search, MS-BLAST in this experiment, to find homologous sequences in the database nrdb95, before the similarity search result was sent to the output_interface peer. After reformatting the obtained result, the output_interface peer filtered out the results that did not pass the filtering criteria (Table 4) and sent the failure (0) or success (1) information back to the initiator (spectra_input) peer, calling for the end of the interaction model. If no result was passed to the output_interface peer, that is, if the routine performed gave no result, the executed interaction model was also treated as a failed interaction. For the detailed LCC code specifying routines 1 and 2, please see supplementary material S2. Interactions running longer than 1200 seconds were forced to shut down, for efficiency.

Figure 5: Illustration of the sequences of operations/roles for the peer ranking testing experiment.

4.2.3 Initial Experiment Result

The initial peer ranking based on the 280 rounds of interactions with the benchmark dataset is given in Table 5. The two peers spectra_input and output_interface appear in all the executed interactions and are ranked at the top. Peer MS-BLAST, which is involved in every interaction in Route 2, is in third position on the ranking ladder; if more peers subscribed to the role similarity_search, the ranking of MS-BLAST would drop lower. The PFF peer MASCOT is ranked higher than the de novo sequencing peers PepNovo and Lutefisk, which is expected, as the PFF approach is generally more accurate than de novo sequencing. However, the other PFF peer, OMSSA, is at the bottom of the ranking, owing to its very low sensitivity (no result was returned for most of the interactions it executed). The ranking score of the de novo sequencing tool PepNovo is nearly double that of Lutefisk, which is consistent with the performance evaluation of the de novo sequencing programs in Section 4.1.2.


Table 5: The ranking for the seven peers involved in the peer ranking test experiment

Rank   Peer ID            Score   Total Runs   Succeeded Runs   Failed Runs
1      output_interface   0.308   239          110              129
2      spectra_input      0.308   239          110              129
3      MS-BLAST           0.162   130          47               83
4      MASCOT             0.122   68           53               15
5      PepNovo            0.104   61           30               31
6      Lutefisk           0.059   69           17               52
7      OMSSA              0.023   41           10               31

4.3. OK-omics – Data Sharing for Protein Identification in Proteomics

Nowadays individual proteomics labs usually hold "trash" datasets of un-interpreted MS/MS spectra that are inaccessible to other groups involved in the identification of the same or homologous proteins. If data coming from different laboratories were compared, new matches and/or useful data could eventually be discovered. We envision many advantages with this new idea: other laboratories (peers) could provide the missing information for an incomplete spectrum or sequence, completing the identification process; moreover, matches could help to recognise new proteins and to identify PTMs. Hence, as already reported by Abian et al 2008, we have drawn up a new scenario where the information to be searched is no longer centralised in a few repositories, but where information gathered from the experiments of proteomics laboratories, such as mass spectra and de novo interpreted sequences, is made available in a P2P network. An interaction model specifies how this search should be effected.

In this deliverable we report on the advances with respect to the data sharing experiment described in Abian et al 2008. A community of peer proteomics laboratories may now share mass spectra and de novo interpreted sequences using the functionalities provided by the OK platform.

The OKCs for search and result visualization have been enhanced with user-friendly GUIs that have been tested by proteomics researchers at the Institute for Biomedical Research of Barcelona, CSIC. Furthermore, visualization GUIs are now launched via annotation in the LCC interaction model as described by Dupplaw et al 2006. In addition, we have included a proteomics lab trust evaluation mechanism into the framework according to the trust model and algorithm presented by Giunchiglia et al 2007.

The data sharing experiment has been carried out on real mass spectrometry data that was generated in a distributed manner at nine proteomics labs of the Spanish ProteoRed network, and has shown an improvement in identifications with respect to conventional methods.

4.3.1 Trust Evaluation

In this scenario, the biochemical researcher sends queries to, and receives data from, different proteomics laboratories. For the researcher it is important to have a mechanism to distinguish which laboratories return more significant and relevant data. This can be achieved by measuring the confidence (or trust) in a laboratory. According to Giunchiglia et al 2007, the trust in a laboratory (i.e., a peer) can be measured by taking into account previous experiences with that laboratory. In our case each experience with a laboratory is based on the quality of the match between the spectrum sent in the query (i.e., the expectation) and the hits (spectra) in the answer (i.e., the commitment). More specifically, an experience in OK-omics is a tuple with the terms:

<biochemicalResearcherName, proteomicLaboratoryName, OmicsInteractionModel, LaboratoryRole, expectation HitTerm, commitment HitTerm, observed HitTerm, findHit constraintTerm, timestamp, hasSucceeded, researcherSatisfaction>


with the following meanings:

The OmicsInteractionModel is the LCC term of our interaction model (explained later).

The expectation HitTerm is the ideal hit that the researcher expects, i.e. the spectrum of the query.

The commitment HitTerm is the term associated with a hit returned by the laboratory.

The observed HitTerm in our proteomics context is the same as the commitment HitTerm. (This is because we consider that what a researcher observes is what a laboratory commits.)

The other parameters have a straightforward meaning from their name.

The similarity between a query and a laboratory hit (the basic building block of the trust calculation in this application domain) is specifically defined as the product of two factors:

spectrum similarity = spectra similarity (i.e. data similarity) × protocol similarity (i.e. metadata similarity)

where spectra similarity refers to how significant the returned spectrum is with respect to the spectrum in the query, and protocol similarity refers to how similar the spectrographic protocols followed by the researcher (when obtaining the spectrum in the query) and by the laboratory (when obtaining the spectrum on which the answer is based) are.

4.3.1.1 Data Similarity

With respect to data similarity, this is calculated by considering how significant, and how many, spectra are returned by a laboratory.

To calculate the significance of a spectrum (or its associated sequence of amino acid characters; see the note below), the proteomic search tools that work over databases of amino acid sequences [7, 4] calculate a score value and a probabilistic value, the P-value, which is the probability that a given sequence appears in the database by chance at least once (Abian et al, 2008). Instead of returning the P-value, these tools calculate and return an expectation value, the E-value, which is based on the P-value and indicates the number of times a sequence can be expected to appear by chance in the database. For our similarity purposes, the closer the E-value is to zero, the better the match. As the E-value is not normalised to [0..1], for the similarity we calculate the P-value from the E-value as (writing E and P for the E-value and P-value respectively):

P = 1 − e^(−E)

The second factor to consider is the number of significant spectra in the returned set M. For example, if laboratory A returns one spectrum having a similarity of 0.95, and laboratory B returns three spectra matching between 0.83 and 0.88, laboratory B will be considered as having more significant data than A, and therefore as more trustworthy.

The M spectra returned by the search tool are ordered by matching score. To calculate the overall spectra similarity, we take just the N spectra whose similarity to the query lies within one standard deviation (computed over the population M) of the best match, and then we calculate the mean of their similarities.
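Under our reading that the per-spectrum similarity is 1 − P = e^(−E), the overall calculation can be sketched in Python as follows (the function names are ours):

import math
from statistics import mean, pstdev

def spectrum_similarity(e_value):
    return math.exp(-e_value)             # 1 - P, with P = 1 - e^(-E)

def data_similarity(e_values):
    """Mean similarity of the hits within one standard deviation of the
    best match, over the whole returned population."""
    sims = sorted((spectrum_similarity(e) for e in e_values), reverse=True)
    cutoff = sims[0] - pstdev(sims)
    return mean(s for s in sims if s >= cutoff)

print(data_similarity([0.05]))              # a single strong hit
print(data_similarity([0.10, 0.15, 0.20]))  # several weaker hits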

Note: In proteomics a spectrum can be represented in two ways: as a series of intensity peaks, or as its equivalent sequence of amino acid characters. Here "chance" means the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties; or (iii) sequences generated randomly based upon a DNA or protein sequence model (the statistics of sequence similarity scores were retrieved on 19th November 2008 from http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).


4.3.1.2 Metadata Similarity

A spectrum in proteomics is represented as a histogram of different types of ions. In addition to the spectra returned, a biochemical researcher needs some confidence that the way in which the spectrum in the query was obtained is comparable to the way in which the expected spectrum was obtained. This is because, although the protocol followed by a laboratory is well defined (Klein & Thongboonkerd, 2003), the protocol itself admits certain variations that will produce spectra with different ion types. These variations include:

• the enzyme used to modify the amino acids

• the enzyme used to digest the amino acids

• the type of mass spectrometer used to produce the spectra

Another important factor is the organism from which the protein has been obtained. All this information is provided in the form of metadata. The similarity of the metadata is calculated as:

metadata similarity = organism similarity × modification similarity × digestion similarity × mass spectrometer similarity

The next sections explain how each factor of the similarity is calculated.

4.3.1.3 Organism Similarity

The semantic similarity among organisms is based on the taxonomy tree of organisms according to the NCBI lineage (NCBI, National Center for Biotechnology Information). Figure 6 represents part of the lineage tree. To define a similarity measure we have used the semantic similarity equation described in [5]:

sim(c1, c2) = e^(−κ1·ι) × (e^(κ2·h) − e^(−κ2·h)) / (e^(κ2·h) + e^(−κ2·h))

where ι is the length (i.e., number of hops) of the shortest path between the two nodes, h is the depth of the deepest node subsuming both nodes, and κ1 and κ2 are parameters balancing the contributions of shortest path length and depth, respectively. For example, the term of the human lineage is:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homo/Pan/Gorilla group; Homo; Homo sapiens

and the terms of some other species are: rattus=cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Rattus


Figure 6: Fragment of the organism lineage tree

soleaSenegalensis=cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Elopocephala; Clupeocephala; Euteleostei; Neognathi; Neoteleostei; Eurypterygii; Ctenosquamata; Acanthomorpha; Euacanthomorpha; Holacanthopterygii; Acanthopterygii; Euacanthopterygii; Percomorpha; Pleuronectiformes; Soleoidei; Soleidae; Solea; Solea senegalensis

macaca = cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Cercopithecoidea; Cercopithecidae; Cercopithecinae; Macaca; Macaca mulatta

and their similarities with the parameter values κ1 = 0.02 and κ2 = 0.6 are:

Sim(Homo sapiens, Solea senegalensis) = 0.46

Sim(Homo sapiens, Rattus) = 0.72

Sim(Homo sapiens, Macaca mulatta) = 0.81
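These values can be reproduced in Python from the quoted lineages with the equation above; note that tanh(κ2·h) equals the quotient of exponentials in the formula, and that the quoted values appear to be truncated rather than rounded:

import math

def lineage_similarity(lin1, lin2, k1=0.02, k2=0.6):
    shared = 0                         # depth h of deepest common ancestor
    for a, b in zip(lin1, lin2):
        if a != b:
            break
        shared += 1
    path = (len(lin1) - shared) + (len(lin2) - shared)   # hops, i.e. iota
    return math.exp(-k1 * path) * math.tanh(k2 * shared)

SHARED = ["cellular organisms", "Eukaryota", "Fungi/Metazoa group",
          "Metazoa", "Eumetazoa", "Bilateria", "Coelomata", "Deuterostomia",
          "Chordata", "Craniata", "Vertebrata", "Gnathostomata",
          "Teleostomi", "Euteleostomi", "Sarcopterygii", "Tetrapoda",
          "Amniota", "Mammalia", "Theria", "Eutheria", "Euarchontoglires"]
HUMAN = SHARED + ["Primates", "Haplorrhini", "Simiiformes", "Catarrhini",
                  "Hominoidea", "Hominidae", "Homo/Pan/Gorilla group",
                  "Homo", "Homo sapiens"]
RATTUS = SHARED + ["Glires", "Rodentia", "Sciurognathi", "Muroidea",
                   "Muridae", "Murinae", "Rattus"]
MACACA = SHARED + ["Primates", "Haplorrhini", "Simiiformes", "Catarrhini",
                   "Cercopithecoidea", "Cercopithecidae", "Cercopithecinae",
                   "Macaca", "Macaca mulatta"]

print(round(lineage_similarity(HUMAN, RATTUS), 3))   # 0.726, quoted as 0.72
print(round(lineage_similarity(HUMAN, MACACA), 3))   # 0.819, quoted as 0.81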

The database of organisms and lineages is dynamic and very large (currently there are more than 300,000 organisms). The NCBI database provides a REST web service (a method of making web service queries in which the query is written as a URL over HTTP) to obtain lineage information without the need to download the entire database.
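For illustration, a lineage can be fetched in Python via the Entrez E-utilities efetch endpoint (the present-day URL is shown; taxonomy id 9606 is Homo sapiens):

import urllib.request
import xml.etree.ElementTree as ET

def fetch_lineage(taxid):
    """Query the NCBI taxonomy database over REST and return the lineage
    as a list of taxon names."""
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           f"?db=taxonomy&id={taxid}&retmode=xml")
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return tree.findtext(".//Lineage").split("; ")

print(fetch_lineage(9606))   # ['cellular organisms', 'Eukaryota', ...]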

4.3.1.4 Modification Similarity

To calculate the similarity of modification terms, we have used a binary similarity table rather than a tree distance. This is because there are situations that cannot be compared: for example, a peptide modified by oxidation of an amino acid cannot be compared with a peptide that has not been modified at all. The table is depicted in Figure 7. In the table, the last row, with XXX and YYY, refers to two different modification codes and is considered as having null similarity.


Figure 7: Similarity table for peptide modification

4.3.1.5 Digestion Similarity

Similarly, to calculate the similarity of the digestion terms we have used a similarity table to support incompatible cases. For example, a peptide digested with the Trypsin enzyme cannot be compared with another peptide digested with Glu-C. This table is depicted in Figure 8. In the table, XXX and YYY refer to two different digestion terms.

4.3.1.6 Mass Spectrometer Similarity

Finally, the similarity of the mass spectrometer terms is calculated based on the taxonomy tree depicted in Figure 9. The tree classifies the spectrometers according to the type of ion fragmentation they perform.

Figure 8: Similarity table for peptide digestion


Figure 9: Similarity tree for different mass spectrometers

4.3.2 Integration with OK Infrastructure

4.3.2.1 The Interaction Model

Figure 10 shows the interaction model for sequenced MS spectra sharing, as specified in the Lightweight Coordination Calculus (LCC). This interaction model is based on the one described by Abian et al 2008, and defines the query answering protocol from one peer playing the querying role of researcher to various other peers playing the answering role of omicslab.

The interaction proceeds as follows: a peer in the omicslab role first waits for a message containing a query from a peer in the researcher role, then answers the query by solving a constraint in which the matching hits are returned, sending these back to the querying peer via another message.

A peer in the researcher role first solves a constraint that lets the user select the omicslab role-playing peers to which it wants to send the query, then prompts the user for the query, and finally iterates through all the chosen omics labs in the interaction to aggregate all the different results, showing them to the user. The iteration through omics laboratories is done by changing to a recursive researcher subrole, which receives the query and a list of omics laboratory role players. The peer in this subrole sends a message containing the query to the first omics laboratory in the list, then aggregates the response it receives in the message from that omics laboratory.

Constraints of the researcher role that require user input (such as selecting candidate omics labs or introducing the query) or that generate output to the user (such as displaying results) are handled via visual constraints. That is, these constraints are annotated in the LCC specification to be solved by means of domain-specific GUIs, specially tailored for sequenced MS spectra sharing.


r(researcher, initial)
r(omicslab, necessary, 1)

//--- visualisation annotations ---

@annotation( @role( researcher ),
  @annotation( @constraint( selectLabs( LabList, SelectedLabList ) ),
    visual( selectLabsVis( LabList, SelectedLabList ) )))

@annotation( @role( researcher ),
  @annotation( @constraint( getQuery( SearchType, SearchArguments, InputFormat, Input ) ),
    visual( getQueryVis( SearchType, SearchArguments, InputFormat, Input ) )))

@annotation( @role( researcher ),
  @annotation( @constraint( showResults( End ) ),
    visual( showResultsVis( End ) )))

a( researcher, Researcher ) ::
  null <- getOmicsLabRole( RoleName ) and getPeers( RoleName, LabList ) then
  null <- selectLabs( LabList, SelectedLabList ) then
  null <- getQuery( SearchType, SearchArguments, InputFormat, Input ) then
  a( researcher( SelectedLabList, SearchType, SearchArguments, InputFormat, Input ), Researcher ) then
  null <- showResults( End )

a( researcher( LabList, SearchType, SearchArguments, InputFormat, Input ), Researcher ) ::
  ( query( SearchType, SearchArguments, InputFormat, Input ) => a( omicslab, H ) <- LabList = [H|T] then
    answer( Result, ResultInfo ) <= a( omicslab, H ) then
    null <- processResult( Result, ResultInfo ) then
    a( researcher( T, SearchType, SearchArguments, InputFormat, Input ), Researcher ) )
  or
  null <- LabList = []

a( omicslab, OmicsLab ) ::
  query( SearchType, SearchArguments, InputFormat, Input ) <= a( researcher, Researcher ) then
  answer( Result, ResultInfo ) => a( researcher, Researcher )
    <- findHit( SearchType, SearchArguments, InputFormat, Input, Result, ResultInfo ) then
  a( omicslab, OmicsLab )

Figure 10: LCC sequenced MS spectra sharing interaction model


4.3.2.2 Role OKCs

a. The researcher OKC

This OKC implements the constraints relevant for a peer that wants to participate in the interaction playing the researcher role. Hence, the OKC's main task is to ask the human researcher for the proteomic query to solve, forward it to the laboratories, fetch the results, and present them back to the human researcher.

The getOmicsLabRole(RoleName) and getPeers(RoleName, LabList) constraints are used to obtain the list of omics lab peers participating in a particular execution of the interaction model. The getOmicsLabRole(RoleName) constraint asks for the role name used by the omics lab peers, and getPeers(RoleName, LabList) returns the list of all peers participating under that role name.

The selectLabs( LabList, SelectedLabList ) constraint filters the laboratories to which the user wants the proteomic query to be sent. It shows a visualization (Figure 11) through which the user can identify and choose the laboratories.

Figure 11: GUI to select the omicslab peers

The getQuery( SearchType, SearchArguments, InputFormat, Input ) constraint asks the human researcher for the proteomic query to solve. It is called with four arguments that must be provided by the researcher:

- SearchType: the type of search to be performed (BLAST or OMSSA).

- SearchArguments: the parameters to be used by the laboratories when executing their locally installed search engines.

- InputFormat: the proteomic search engines (BLAST and OMSSA) accept different input formats; this argument informs the search engines of the format used in the input.

- Input: the proteomic sequences or mass spectra used as input for the search engines.

To solve the getQuery constraint, a custom visualization (Figure 12) is shown to the user. With these GUIs the user builds the proteomic query to be submitted by the system, thereby filling in the arguments of the constraint.


Figure 12: GUIs to build BLAST or OMSSA queries

Once the researcher has built the query, it can be sent to each of the laboratories in the filtered laboratory list. This task is done by the a( researcher( LabList, SearchType, SearchArguments, InputFormat, Input ), Researcher ) subrole. This subrole iterates over LabList using recursion: at each iteration the query is sent to a laboratory via query( SearchType, SearchArguments, InputFormat, Input ) => a( omicslab, H ) <- LabList = [H|T], and the role then waits for the laboratory response answer( Result, ResultInfo ) <= a( omicslab, H ) in order to aggregate it.

When all the results from the omics labs have been collected, the showResults( End ) constraint is invoked. This constraint launches another custom visualization (Figures 13 and 14) in which the human researcher can examine the different results returned by the omics labs.

The OKC architecture: The researcher OKC has been divided into three main components: the OKC layer, the researcher kernel and the visualisation component. With this division, changes to the interaction model or to the visualization requirements can be applied quickly. The OKC layer acts as a thin interface between the OK P2P network and the researcher kernel, translating the incoming constraints into researcher kernel method invocations. The researcher kernel contains all the utilities to build the proteomics queries, parses the laboratories' results and invokes the visualizations when needed. A schematic view of the architecture is shown in Figure 15.
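The sketch below illustrates this three-layer split in Java. All class and method names are ours, chosen for illustration; they are not the actual OpenKnowledge codebase:

import java.util.List;

// visualisation component: the domain-specific GUIs of Figures 11-14
interface Visualisation {
    List<String> selectLabs(List<String> labs);
    String[] buildQuery();
    void showResults(List<String> results);
}

// researcher kernel: query building, result parsing, GUI invocation
class ResearcherKernel {
    private final Visualisation gui;
    ResearcherKernel(Visualisation gui) { this.gui = gui; }

    List<String> selectLabs(List<String> labList) { return gui.selectLabs(labList); }
    String[] getQuery()                           { return gui.buildQuery(); }
    void processResult(String result, String resultInfo) { /* parse and store */ }
    void showResults(List<String> results)        { gui.showResults(results); }
}

// OKC layer: thin interface between the OK P2P network and the kernel,
// translating incoming constraints into kernel method invocations
class ResearcherOKC {
    private final ResearcherKernel kernel;
    ResearcherOKC(ResearcherKernel kernel) { this.kernel = kernel; }

    @SuppressWarnings("unchecked")
    boolean solve(String constraint, Object... args) {
        switch (constraint) {
            case "selectLabs":    kernel.selectLabs((List<String>) args[0]);   return true;
            case "getQuery":      kernel.getQuery();                           return true;
            case "processResult": kernel.processResult((String) args[0],
                                                       (String) args[1]);     return true;
            case "showResults":   kernel.showResults((List<String>) args[0]); return true;
            default:              return false;   // constraint not handled by this OKC
        }
    }
}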

Figure 13: BLAST and OMSSA result windows, showing the laboratories' results


Figure 14: OMSSA peptide detail and mass spectrum view

b. The omicslab OKC

This OKC implements the only constraint relevant for a peer that wants to participate in the interaction playing the omicslab role, namely findHit( SearchType, SearchArguments, InputFormat, Input, Result, ResultInfo ). This constraint solves the query received from a peer in the researcher role and returns the result. The SearchType, SearchArguments, InputFormat and Input arguments are supplied by the researcher peer, and they are used as input to solve the query. The Result and ResultInfo arguments are filled by the omicslab peer. When the findHit constraint is solved, Result contains the output of the proteomic search engine execution and ResultInfo contains metadata about the result.

The OKC architecture: As with the researcher OKC, the omicslab OKC has been split into three main components: the OKC layer, the omicslab kernel and the search engine wrapper component. The OKC layer acts as an interface between the OK P2P network and the omicslab kernel, mapping the constraints onto omicslab kernel functions. The kernel's task is to identify the incoming query and to send it to the search engines through the search engine wrapper component.

Figure 15: Schematic view of the researcher OKC architecture

This latter component executes the locally installed proteomic search engine and returns the result to the omicslab kernel. A schematic view of the architecture is shown in Figure 16.
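As an illustration, the Java sketch below shows an omicslab kernel dispatching a findHit query to a search engine wrapper that runs the locally installed engine as an external process. The class names are ours; the commands and database path are taken from the configuration fragment in Figure 17:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

class SearchEngineWrapper {
    // runs the locally installed engine and returns its raw output
    String run(List<String> command, String input) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (Writer w = new OutputStreamWriter(p.getOutputStream())) {
            w.write(input);   // blastall reads the query from stdin by default;
        }                     // a real wrapper would write input files for OMSSA
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            for (String line; (line = r.readLine()) != null; ) out.append(line).append('\n');
        }
        p.waitFor();
        return out.toString();
    }
}

class OmicsLabKernel {
    private final SearchEngineWrapper wrapper = new SearchEngineWrapper();

    // solves the findHit constraint: identify the query type and delegate
    String findHit(String searchType, String input) throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        if ("BLAST".equals(searchType)) {
            cmd.add("/usr/bin/blastall");                              // Figure 17, <blastp>
            cmd.add("-p"); cmd.add("blastp");
            cmd.add("-d"); cmd.add("/var/omicsdb/labs_fasta/cnb/cnb.fasta");
        } else {
            cmd.add("/usr/bin/omssacl");                               // Figure 17, <omssa>
            cmd.add("-d"); cmd.add("/var/omicsdb/labs_fasta/cnb/cnb.fasta");
        }
        return wrapper.run(cmd, input);
    }
}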

The omicslab OKC setup: In order to connect an omicslab role-playing peer to the OK P2P network it is necessary to install some basic proteomic search engines, to set up the proteomic database, and to configure the peer to use the search engines.

Search engines: The required BLAST and OMSSA packages can be freely downloaded from the NCBI site (http://www.ncbi.nlm.nih.gov/). They must be installed locally on every machine acting as an omicslab. The packages could not be bundled with the OKCs because they are platform-dependent.

Setup of proteomic database: The BLAST and OMSSA software need protein sequence databases to work. There are two options for obtaining one: download an existing proteomic database (some institutions, such as the NCBI, share their protein databases) or build a new one from the mass-spectra data returned by the mass spectrometers. Mass spectrometers return a set of mgf files (a common format for collecting mass spectra). Each mgf file must be processed by a de novo interpreter tool (we used the PEAKS software, which was available to all ProteoRed lab members, see Section 4.3.3.1) to obtain a corresponding set of amino acid sequences.

Before building the database of sequences we can apply a filter over the de novo results, discarding short sequences (fewer than 4 residues) and duplicates. We assume the first (highest-score) sequence to be the best de novo interpretation; beyond that, the de novo score is not taken into account any further, although its value is annotated in the database as header information. The final step consists of formatting these plain-text FASTA files into a binary BLAST-formatted database (with the formatdb program provided in the BLAST package).
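The sketch below illustrates this filtering and formatting pipeline in Java. The input record layout (one tab-separated de novo interpretation per line) and the file names are assumptions made for the example:

import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class BuildSequenceDb {
    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<>();
        int n = 0;
        try (BufferedReader in = Files.newBufferedReader(Paths.get("denovo.tsv"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("cnb.fasta")))) {
            String line;
            while ((line = in.readLine()) != null) {
                // assumed record: spectrumId <tab> rank <tab> score <tab> sequence
                String[] f = line.split("\t");
                if (!"1".equals(f[1])) continue;                  // best interpretation only
                String seq = f[3];
                if (seq.length() < 4 || !seen.add(seq)) continue; // drop short or duplicate
                // de novo score is kept only as header annotation
                out.printf(">seq%d denovo_score=%s%n%s%n", ++n, f[2], seq);
            }
        }
        // format the plain-text FASTA file as a binary BLAST database
        new ProcessBuilder("formatdb", "-i", "cnb.fasta", "-p", "T")
                .inheritIO().start().waitFor();
    }
}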

Figure 16: Schematic view of the omicslab OKC architecture

Connecting an omicslab OKC with the search engines: The last step in setting up an omicslab peer is to configure it to use the search engines. Each machine acting as an omicslab peer contains a configuration file. By reading this configuration file the omicslab peer knows where the search engines are locally installed, which database must be used for the search, and the default parameters to use with the search engines. A fragment of the configuration file is shown in Figure 17.

4.3.3 Experiment Setup

4.3.3.1 The ProteoRed Scientific Community

The National Institute for Proteomics, ProteoRed, is a network for the coordination, integration and development of the Spanish proteomics facilities, providing services that support Spanish researchers in the fields of genomics and proteomics. It integrates and supports 15 well-established proteomics facilities giving services all over Spain, and 5 associated facilities that will become operative and offer their proteomics services in the near future.

ProteoRed offers the major services necessary in all stages of the protein analysis process: protein fractionation, separation and purification of peptides and proteins, protein identification, quantitative determination of peptides and proteins, protein sequencing, complex analysis of proteins, peptide mass fingerprinting, analysis of mass spectra, image analysis and peptide synthesis.

<omicsLab>
  <name>cnbLab</name>
  <blastp>
    <command>/usr/bin/blastall</command>
    <param key="-p" value="blastp" />
    <param key="-d" value="/var/omicsdb/labs_fasta/cnb/cnb.fasta" />
    <param key="-M" value="PAM30" />
    <param key="-m" value="7" />
    <param key="-W" value="2" />
    <param key="-G" value="9" />
    <param key="-E" value="1" />
    <param key="-e" value="2000" />
  </blastp>
  <omssa>
    <command>/usr/bin/omssacl</command>
    <param key="-w" value="" />
    <param key="-d" value="/var/omicsdb/labs_fasta/cnb/cnb.fasta" />
    <param key="-e" value="0" />
    <param key="-v" value="1" />
    <param key="-he" value="100" />
    <param key="-ht" value="30" />
  </omssa>
  <openknowledge>
    <interactionModel name="omics" path="/home/user/lcc/omics.lcc" />
    <omicsLabOKC name="omicslab" subsDescription="cnbLab"
        class="org.openk.proteomics.okc.OmicsLabOKC"
        path="/home/user/okcs/omicsLab.jar" />
  </openknowledge>
</omicsLab>

Figure 17: Fragment of configuration file used by the omicslab peers
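For illustration, the following Java sketch reads the <blastp> section of such a configuration file. Only the file layout comes from Figure 17; the parsing code itself is ours:

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OmicsLabConfig {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("omicslab.xml"));
        Element blastp = (Element) doc.getElementsByTagName("blastp").item(0);
        String command = blastp.getElementsByTagName("command")
                .item(0).getTextContent().trim();
        Map<String, String> params = new LinkedHashMap<>();
        NodeList ps = blastp.getElementsByTagName("param");
        for (int i = 0; i < ps.getLength(); i++) {
            Element p = (Element) ps.item(i);
            params.put(p.getAttribute("key"), p.getAttribute("value"));
        }
        // e.g. /usr/bin/blastall {-p=blastp, -d=/var/omicsdb/..., ...}
        System.out.println(command + " " + params);
    }
}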


The main objective of ProteoRed is to increase the specialisation and competitiveness of proteomics facilities, considering the types of technologies and equipment available, and the types of customers, their expertise and their geographical situation. Customers are research groups from universities, the CSIC, hospitals and other public institutions, as well as private companies (biotech and pharmaceutical companies). ProteoRed also has the objective of testing new technological developments in order to provide new proteomics methodologies and equipment to the Spanish proteomics facilities. It also establishes open channels with the customers of these proteomics services to learn their technological needs, required data accuracy, quality requirements, price scales, and the new services needed in the future. ProteoRed also takes care of coordinating courses, workshops and meetings to promote and enhance the quality of proteomics knowledge throughout the scientific community, ProteoRed technical staff and governmental agencies.

The ProteoRed consortium coordinates with other technological platforms promoted by Genoma España, such as the National Bioinformatics Institute (NBI), through a common working group in the bioinformatics area composed of personnel from the NBI and from ProteoRed. The international focus of ProteoRed is reflected in the relationships and collaborations established with other proteomics networks, international platforms and projects, such as the Human Proteome Organization (HUPO) for the identification and characterisation of all human proteins, the European Proteomics Association (EuPA), the Portuguese Mass Spectrometric Network (ITQB), the Portuguese Proteomics Network (ProCura) and the Turku Center for Biotechnology.

4.3.3.2 The Test Data

For our test data we decided to use preexisting MS/MS data reservoirs from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample. This sample consists of a mixture of 48 purified and recombinant proteins (plus an unknown number of protein contaminants) that was extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey.

78 laboratories participated in the analysis of these mixtures; only 35% of them could correctly identify more than 40 protein components. Thus the sample, while relatively convenient for the purpose of testing the OK system, is of a complexity not far from that found in real proteomics work.

This sample was prepared by combining five-picomole aliquots of each protein. For this purpose the individual proteins were previously purified to assure a purity of >95%, and the protein concentration was determined by amino acid analysis. The combined sample was lyophilized in 1 mL polypropylene tubes for storage before analysis. The presence of low levels of impurities in the mixture added a further challenge to the analysis: in addition to the 48 standard proteins, most laboratories reported the identification of many other proteins. These identifications could be due either to real contaminants or to false-positive identifications. Ascertaining which is the case requires a careful analysis of the full data obtained by a laboratory.

This ambiguity could be rapidly resolved by querying the OK system and searching for other laboratories reporting the same unexpected identifications, as we will try to show in the test experiment (see Section 4.3.5). This case exemplifies a more general situation in which a laboratory needs to evaluate the confidence of results that cannot be supported by other means (such as a high-confidence match in a database) by checking whether the same data have been obtained by a number of independent partners working with similar biological samples.

4.3.4 Experiment Execution

In order to test whether the OK system could be used by ProteoRed and could speed up the protein detection process, we created a simulated environment in which peers of the OK system emulate the different laboratories. Through this environment a researcher can query these peers to retrieve data from their local databases.


4.3.4.1 Omicslab peers

The experiment consists of nine peers connected to the OK P2P network. Each of these peers manages a database containing protein data extracted from the ABRF sample (see Section 4.3.3.2). An OpenKnowledge Component (OKC) called omicslab has been developed to serve as an interface between the ABRF database and the OK system. This omicslab OKC has been installed on each of the peers and subscribed to play the role in the MS spectra sharing interaction model that is in charge of replying to incoming proteomic queries.

4.3.4.2 Researcher view

A researcher in the field of proteomics now has a new tool for searching proteomic information. Through the OK system any researcher can send queries to the omicslab peers and retrieve data from their databases. In order to execute these queries the researcher must have access to the OK system and carry out the steps detailed below:

OK kernel download and installation. First of all, the researcher must have access to a computer that is part of the OK system. This is achieved by installing the OpenKnowledge Kernel on a computer with an internet connection. The latest version of the Kernel can be downloaded from the OpenKnowledge web site (www.openk.org) and installed on any operating system; the only requirement is that the Java 1.5 suite is installed.

Search for the Interaction Model. The OK system defines how different peers may interact through interaction models (IMs). These IMs are developed and published to the OK system by programmers. A user of the OK system has to search for an IM that defines the type of interaction she wants to realise. In our current scenario the researcher uses the browsing facility provided by the user interface of the OK Kernel application to search for appropriate IMs. Searching is achieved by sending a query with a set of keywords to the discovery service, which is in charge of storing all of the published data. The discovery service then retrieves all the IMs whose metadata match the query. The researcher can then read the descriptions associated with the retrieved IMs in order to select the one that suits her best. If no IM suits her, she can refine the query by using different keywords.

Install the researcher OKC. The IM provided in our experiment contains two roles: one played by the laboratory, which is in charge of replying to queries, and another played by the researcher, who sends the queries. If the researcher peer wants to query the omicslab peers, she will have to do so by playing the researcher role in the IM provided. In order to play this role she must have installed an OKC that is able to play it. OKCs can also be published in the OK system, so that it is easy to find them and install them on the local computer. Downloading and installing an OKC from the OK system can be done directly from the Kernel's user interface: once the IM and the role the user wants to play have been selected, the user simply selects the option to find implementations of the given role. Although this is the simplest way to install the needed OKC, users may also find or develop an OKC through other means and plug it into the Kernel. Either way, once the OKC is installed the researcher is ready to start launching proteomics queries.

Subscribe to the researcher role. Once the researcher has installed the OKC needed to play the researcher role, she can start the interaction process through a subscription. The researcher selects the IM and the role she wants to play and then runs the subscription command from the user interface. This command sends the subscription information to the discovery service, which will find a coordinator to run the interaction between the researcher role and those peers that have subscribed to play the other roles in the IM. In our case these will be the omicslab peers subscribed to play the laboratory role.


4.3.4.3 Interaction model execution

Laboratories selection. When the interaction starts, the list of all the omicslab peers that have subscribed is received by the researcher peer. This list is shown to the researcher, who has to select the subset of the labs that she wants to query.

Build proteomic query. Having selected the laboratories to which the query will be sent, the researcher then has to create the query. This is also done through the user interface, which shows the user a form in which to introduce the query. This form contains:

- an item for selecting the type of search (BLAST or OMSSA);

- a text box in which to write the proteomic text query or import it from a file;

- another text box where the researcher can add the annotations used to calculate the trust of the returned results;

- a subform where the user can enter custom search arguments to be used by the search engines.

Once the query has been introduced by the researcher it is sent to all the selected omicslab peers so that they can process it and reply with the set of matching proteomic data.

Show laboratories results. Every time an omicslab peer replies to the query, the researcher OKC stores the results. When all the omicslab peers to which the query was sent have replied, the results are shown to the user through a custom visualisation. Through this custom GUI the researcher can browse the different results and compare them. This is the final step of the IM execution; if the researcher wants to run another query, she can start another interaction. This will now be easier, because she will already know the correct IM and role to search for, and the appropriate OKC will already be installed in the OK Kernel.

4.3.5 Experiment Evaluation

To play the researcher role we randomly selected 38 peptide sequences obtained by PEAKS de novo analysis of the mass spectrometric data obtained by the LP CSIC/UAB proteomics laboratory from the ABRF sample. Additionally, we included data derived from spectra that, during the original analysis, matched proteins not included in the standard ABRF list.

The mass spectrometric data set from LP CSIC/UAB was obtained by LC-MS/MS analysis of the tryptic digest of the protein mixture in the ABRF sample (see Section 4.3.3.2). This data set included 2000 spectra from which 48 of the 49 proteins in the ABRF standard could be identified by conventional proteomic data analysis. Each protein was identified from the sequence of one or more of its tryptic peptides.


Figure 18: Query window and BLAST search parameters used for this study. The sequences shown in the image correspond to the first group of queries.

Queries were performed against 10 databases, including those of 7 proteomics labs, the NCBI database and the database of the researcher itself (uab column). Including the researcher as an omicslab role-playing peer serves as a true-positive control to check for failures in the functionality of the program.

The 38 sequences were divided into 4 groups in order to mimic the building of a trust evolution history. Although this setup does not produce a valid history (the sequences were grouped randomly), it allows the functionality of the trust calculation to be evaluated.

Each group was queried with the same parameters (Figure 18) and the results analysed in the OK-omics prospector window (Figure 19). As expected, searches in the uab database always produced full coincidences. By contrast, the other proteomics labs and the NCBI database produced diverse results. Most of the queries produced high percentage-identity values in the NCBI search. These hits give direct information about the identity of the peptide and the source protein ("id" and "desc" text windows in Figure 20). One of the queries in Figure 20 (Query 10) produced a 100% coincidence in the NCBI database. The expectation values for this match indicated that it was not due to chance. The protein tentatively identified, P20160 (Azurocidin precursor), was however not included in the standard ABRF sample list of components.


Figure 19: OK-omics prospector window and response to the first group of queries (% coincidence).

The analysis of the answers from other proteomics labs for this sequence showed that other laboratories found identical (cpif) or highly homologous sequences (ucm). This fact indicated that several laboratories had observed the presence of the same component in their samples and supported the fact that the queried sequence was not the result of noise or lab-specific sample preparation artifacts.

More detailed analysis of the results indicated that the presence of protein P20160 was supported by other NCBI matches (Query 11 in Figure 21) and that the corresponding query had also produced high-scoring matches in many proteomics labs.

This analysis clearly indicates that P20160 is a contaminant present at relatively high concentration (several laboratories detected several of its tryptic peptides) that could be present in the ABRF sample.

This fact is evident from the information that can be derived from the OK-omics system. It is, however, not straightforward to arrive at the same conclusion by conventional means, as it is difficult to rule out organic contamination of the samples. The actual presence of this contaminant was only established after the ABRF study of all the identifications reported by the laboratories.

The confidence in the results of each proteomics lab (i.e., the trust) was evaluated over the 4 query rounds performed (Figure 22). For these calculations the "number of experiences" (the number of queries that must be sent to a lab before the calculated trust is highly confident) was set to the number of query rounds performed (4). Trust values for the different laboratories range from near 0 to near 1, indicating different efficiencies in returning high-score matches for the queried sequences (Figure 22). As expected from the origin of the data (sequences randomly taken from the uab data set), the trust values are stable over the experiments.
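The trust model itself is specified elsewhere in the OpenKnowledge deliverables; purely as an illustrative stand-in, the Java sketch below estimates a lab's trust as its rate of high-score answers, discounted until the configured number of experiences (4 in this experiment) has been reached:

public class LabTrust {

    private final int experiences;   // queries needed before trust is highly confident
    private int asked;
    private int highScoreAnswers;

    public LabTrust(int experiences) { this.experiences = experiences; }

    public void record(boolean highScoreMatch) {
        asked++;
        if (highScoreMatch) highScoreAnswers++;
    }

    // trust in [0, 1]; discounted while fewer than `experiences` queries were sent
    public double trust() {
        if (asked == 0) return 0.0;
        double rate = (double) highScoreAnswers / asked;
        return rate * Math.min(1.0, (double) asked / experiences);
    }
}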


Figure 20: Match data from ncbi dataset for query 10.

Figure 21: Match data from ncbi dataset for query 11.


Figure 22: Trust values calculated for the participating laboratories over the query rounds.

No improvement in the quality of the information derived from the OK-omics system was observed when only the 2-3 most trusted labs were selected for these queries. Due to the small size of the databases, an important fraction of the processing time was spent on the public NCBI database search. Selecting a few laboratories of high trust could, however, increase performance when a larger number of peers is involved in the interaction.

5. Consistency Checking of 3-D Models for Yeast Protein Structure Prediction

In this experiment we aimed to check consistency among pre-computed comparative models from three public repositories for the proteins encoded by the genome of the budding yeast Saccharomyces cerevisiae. The SWISS-MODEL repository (Kopp and Schwede, 2006) provides annotated modelled protein structures generated by the SWISS-MODEL comparative modelling pipeline (Schwede et al., 2003). SWISS-MODEL only offers models for proteins with unsolved structures. The ModBase repository (Pieper et al., 2004) contains comparative models generated by the ModPipe pipeline using MODELLER (Sali and Blundell, 1993). In ModBase a single target protein may be represented by multiple comparative models; these can be redundant and, if selected at default quality thresholds, meet only minimal quality standards. A web-hosted database of structural models for yeast proteins generated by the prediction program SAM-T02 (Karplus et al., 2003) was chosen as the third model resource. This repository was created a few years earlier through a combination of local-structure prediction, hidden Markov model (HMM)-based fold recognition, and ab initio prediction, i.e. using an approach not specifically designed for comparative modelling. It offers provisional models for all predicted yeast proteins. A more detailed description of the databases can be found in Quan et al. (2006).

5.1. Roles of Peers in the Experiment

Data Sources/Providers: Three databases of yeast protein 3-D models were downloaded to local servers from the web-based SWISS-MODEL, ModBase, and SAM-T02/Undertaker resources. The downloaded databases were accessible through specific data providers in this experiment. The systematic open reading frame (ORF) names of all predicted protein-encoding genes in the yeast genome, commonly referred to as Yids, serve as the keys for retrieving yeast protein structure models from these data sources.

Data Comparers: The role of data comparers for consistency checking consists of two parts: (1) using the MaxSub program to perform pair-wise comparisons between different 3-D structures with the same Yids, considering only the Cα atoms and ignoring all other backbone and side-chain atoms; and (2) extracting multi-way consensus substructures. The pair-wise comparison uses MaxSub in both the forward and backward directions with a distance threshold parameter of 3.5 Å. Based on the pair-wise comparisons, multi-way "MaxSub-supported substructures", i.e. the maximum overlap between all pair-wise matched regions for the same protein sequence, can be extracted. The minimum length of a "MaxSub-supported substructure" is set at 45 amino acids. Gaps sometimes found within the regions matched by the MaxSub comparisons between two 3-D models are ignored in the extraction of "MaxSub-supported substructures" if the gap lengths are less than 35 amino acids.
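The extraction rule can be summarised by the following Java sketch, which intersects the residues matched in every pair-wise comparison, bridges internal gaps shorter than 35 residues, and keeps segments of at least 45 residues. The boolean-mask representation is our simplification; MaxSub itself is run as an external program:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConsensusSubstructures {

    // matched[k][i] is true if residue i is matched in pair-wise comparison k
    public static List<int[]> extract(boolean[][] matched, int len) {
        boolean[] common = new boolean[len];
        Arrays.fill(common, true);
        for (boolean[] m : matched)
            for (int i = 0; i < len; i++) common[i] &= m[i];

        // bridge internal gaps shorter than 35 residues
        for (int i = 0; i < len; ) {
            if (common[i]) { i++; continue; }
            int j = i;
            while (j < len && !common[j]) j++;
            boolean internal = (i > 0) && (j < len);   // gap flanked by matched regions
            if (internal && j - i < 35) Arrays.fill(common, i, j, true);
            i = j;
        }

        // keep contiguous segments of at least 45 residues, as [start, end] pairs
        List<int[]> segments = new ArrayList<>();
        for (int i = 0; i < len; ) {
            if (!common[i]) { i++; continue; }
            int j = i;
            while (j < len && common[j]) j++;
            if (j - i >= 45) segments.add(new int[]{i, j - 1});
            i = j;
        }
        return segments;
    }
}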

Additional roles are needed to coordinate the communication among different peers, retrieve data from data sources, and collect retrieved data together for analysis, as defined in the LCC interaction protocol (Figure 23).

a(experimenter, E) ::
  data_request(Is) => a(data_coordinator, X) <- ask_user(Is)
  then
  data_compared(Is, Sd) <= a(data_coordinator, X)
  then
  null <- printresults(Sd)

a(data_coordinator, X) ::
  data_request(Is) <= a(experimenter, E)
  then
  null <- getInterestedRole(R) and getPeers(R, Sp) and makeEmptyList(SdS)
  then
  a(data_collector(Is, Sp, SdS, SdF), X)
  then
  filter(Is, Sp, SdF) => a(data_filter, Y)
  then
  filtered(S) <= a(data_filter, Y)
  then
  filtered(Is, Sp, S) => a(data_comparer, C)
  then
  data_compared(Is, Df) <= a(data_comparer, C)
  then
  data_compared(Is, Df) => a(experimenter, E)

// Is is the list of yeast IDs
// Sp is the list of sources
// SdS is the build-up list of 3-D structure models
// SdF is the final list, passed back as output parameter

a(data_collector(Is, Sp, SdS, SdF), X) ::
  null <- Is = [] and assign(SdS, SdF)
  or
  (
    null <- Is = [I|Ri]
    then
    data_request(I) => a(data_source, P)
    then
    data_report(I, Ds) <= a(data_source, P)
    then
    null <- NewSdS = [Ds|SdS]
    then
    a(data_collector(Ri, Sp, NewSdS, SdF), X)
  )

a(data_retriever(I, Sp, SBuildUp, SF), X) ::
  null <- Sp = [] and assign(SBuildUp, SF)
  or
  (
    null <- Sp = [P|Rp]
    then
    data_request(I) => a(data_source, P)
    then
    data_report(I, Ds) <= a(data_source, P)
    then
    null <- NewSBuildUp = [Ds|SBuildUp]
    then
    a(data_retriever(I, Rp, NewSBuildUp, SF), X)   // recursion
  )

a(data_filter, Y) ::
  filter(Is, Sp, Sd) <= a(data_coordinator, X)
  then
  filtered(S) => a(data_coordinator, X) <- apply_filter(Sd, S)

a(data_source, P) ::
  data_request(I) <= a(data_collector, X)
  then
  data_report(I, Ds) => a(data_collector, X) <- lookup(I, Ds)

a(data_comparer, C) ::
  filtered(Is, Sp, S) <= a(data_coordinator, X)
  then
  data_tocompare(Is, Sp, S) => a(pairwise_comparer, M)
  then
  data_pairwise(Is, Pc) <= a(pairwise_comparer, M)
  then
  data_compared(Is, Df) => a(data_coordinator, X) <- threeway_check(Is, Pc, Df)

a(pairwise_comparer, M) ::
  data_tocompare(Is, Sp, S) <= a(data_comparer, C)
  then
  data_pairwise(Is, Pc) => a(data_comparer, C) <- pairwise_check(Is, S, Pc)

Figure 23: LCC code for consistency-checking between different yeast protein 3-D model resources


5.2. Preparatory Work

To efficiently run the structural comparison between the 3-D structure models from multiple, highly redundant data sources, minimal data pre-processing and filtering was undertaken as described in a previous study (Quan et al., 2006). For instance, the models retrieved from ModBase for each Yid query typically include multiple 3-D models for the same protein sequence.

Figure 24: Peer network architectures for consistency-checking of yeast protein 3-D models (A) and de novo sequence derivation from MS/MS spectra (B). The symbols of the original figure denote the interaction roles defined in the LCC specification for the experiment, peers (colour varying by type of resource), OKCs (functionality depending on the peer), local databases, locally executed programs, utility scripts in Perl or Java, and web servers. Peers (SAM, ModBase, SWISS) take on data provider roles and manage specific OKCs to access web databases containing modelled protein structures; peer X acts as data collator, data collector, data retriever, and experiment end marker. The experimenter peers manage OKCs sending out requests for information to the network. Finally, the comparer peers (CYSP) manage OKCs to invoke the relevant consistency-checking programs.

5.3. Experiment Results

Systematic retrieval and comparison across the three data sources was carried out for the yeast proteins. The results were made available as a new resource called the Comparison of Yeast 3-D Structure Prediction (CYSP), accessible from http://www.openk.org/cysp.html. CYSP records the yeast proteins with at least one structure model retained after data filtering from multiple sources (Table 1). We also obtained other information, such as the pair-wise matched regions between 3-D models of the same protein obtained by different methods according to MaxSub (Table 6), and the Mscores attained by the matches. The 3-D coordinates of 584 MaxSub-supported substructures for 545 yeast protein sequences with non-identical Yids are also given at the CYSP website. These 545 yeast protein sequences make up 97.5% of those represented in SWISS-SAM, and about 91.8% of those in SWISS-ModBase and ModBase-SAM.


Table 6. Number of pair-wise matched regions between models from the three data sources.

            SWISS          ModBase        SAM
SWISS       769 (717)      649 (594)      585 (559)
ModBase                    2546 (2280)    620 (594)
SAM                                       2211 (2211)

In parentheses are the numbers of represented yeast proteins. On the diagonal are the total numbers of models (and represented yeast proteins) in each filtered data source.

References

Abian, J., Atencia, M., Besana, P., Bernacchioni, L., Gerloff, D., Leung, S.W., Magasin, J., Perreau de Pinninck, A., Quan, X., Robertson, D., Schorlemmer, M., Sharman, J. and Walton, C. (2008) Bioinformatics interaction models. Deliverable D6.3, OpenKnowledge.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res. 28:235-242.

Dupplaw, D., Robertson, D., Croitoru, M., Dasmahapatra, S., Hu, B., Lewis, P. and Xiao, L. (2006) Extension of LCC as a visualisation language. Deliverable D5.1, OpenKnowledge.

Eng, J.K., McCormack, A.L. and Yates, J.R. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spec. 5:976-989.

Frank, A. and Pevzner, P. (2005) PepNovo: de novo peptide sequencing via probabilistic network modelling. Anal Chem. 77:964-973.

Geer, L.Y., Markey, S.P., Kowalak, J.A., Wagner, L., Xu, M., Maynard, D.M., Yang, X., Shi, W. and Bryant, S.H. (2004) Open mass spectrometry search algorithm. J. Proteome Res. 3(5):958-964.

Giunchiglia, F., Sierra, C., McNeill, F., Osman, N. and Siebes, R. (2007) Good enough answer algorithms. Deliverable D4.5, OpenKnowledge.

Habermann, B., Oegema, J., Sunyaev, S. and Shevchenko, A. (2004) The power and limitations of cross-species protein identification by mass spectrometry-driven sequence similarity searches. Mol. Cell. Proteomics. 3:238-249.

Johnson, R.S. and Taylor, J.A. (2002) Searching sequence databases via de novo peptide sequencing by tandem mass spectrometry. Mol Biotech. 22:301-316.

Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y., Diekhans, M. and Hughey, R. (2003) Combining local-structure, fold recognition, and new fold methods for protein structure prediction. Proteins. 53:491-496.

Klein, J.B. and Thongboonkerd, V. (eds) (2003) Proteomics in Nephrology. Karger.

Kopp, J. and Schwede, T. (2006) The SWISS-MODEL repository: new features and functionalities. Nucleic Acids Res. 31:216-218.

Krieger, E., Nabuurs, S.B. and Vriend, G. (2003) Homology modeling. In Bourne, P.E. and Weissig, H. (eds), Structural Bioinformatics. Wiley & Sons Inc, 509-525.

Ma, B., Zhang, K.Z., Hendrie, C., Liang, C.Z., Li, M., Doherty-Kirby, A. and Lajoie, G. (2003) PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Comm Mass Spec. 17:2337-2342.

Ma, B., Zhang, K.Z. and Liang, C.Z. (2005) An effective algorithm for the peptide de novo sequencing from MS/MS spectrum. J. of Computer and System Sciences. 70:418-430.

Madden, T. (2003) The BLAST sequence analysis tool. In The NCBI Handbook, chapter 16. NCBI.

Moult, J. (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol. 15:285-289.

Oinn, T., Addis, M.J., Ferris, J., Marvin, D.J., Senger, M., Carver, T., Greenwood, M., Glover, K., Pocock, M.R., Wipat, A. and Li, P. (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 20:3045-3054.

Overbeek, R., Begley, T., Butler, R.M., Choudhuri, J.V., Chuang, H.Y., Cohoon, M., de Crécy-Lagard, V., Diaz, N., Disz, T., Edwards, R., Fonstein, M., Frank, E.D., Gerdes, S., Glass, E.M., Goesmann, A., Hanson, A., Iwata-Reuyl, D., Jensen, R., Jamshidi, N., Krause, L., Kubal, M., Larsen, N., Linke, B., McHardy, A.C., Meyer, F., Neuweger, H., Olsen, G., Olson, R., Osterman, A., Portnoy, V., Pusch, G.D., Rodionov, D.A., Rückert, C., Steiner, J., Stevens, R., Thiele, I., Vassieva, O., Ye, Y., Zagnitko, O. and Vonstein, V. (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33:5691-5702.

Perkins, D.N., Pappin, D.J.C., Creasy, D.M. and Cottrell, J.S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 20(18):3551-3567.

Pieper, U., Eswar, N., Braberg, H., Madhusudhan, M.S., Davis, F.P., Stuart, A.C., Mirkovic, N., Rossi, A., Marti-Renom, M.A. and Fiser, A. (2004) MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res. 32:217-222.

Quan, X., Walton, C., Gerloff, D.L., Sharman, J.L. and Robertson, D. (2006) Peer to peer experimentation in protein structure prediction: an architecture, experiment and initial results. GCCB 2006: 75-98.

Resing, K.A., Meyer-Arendt, K., Mendoza, A.M., Aveline-Wolf, L.D., Jonscher, K.R., Pierce, K.G., Old, W.M., Cheung, H.T., Russell, S., Wattawa, J.L., Goehle, G.R., Knight, R.D. and Ahn, N.G. (2004) Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem. 76:3556-3568.

Robertson, D. (2004) Multi-agent coordination as distributed logic programming. Intl Conf on Logic Programming, Saint-Malo, France.

Rogers, I. (2005) Assessment of an amalgamative approach to protein identification. ASMS Conf on Mass Spec. San Antonio, TX, USA, Abstract# WP379. 5-9 June.

Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 234:779-815.

Schlueter, S.D., Wilkerson, M.D., Dong, Q. and Brendel, V. (2006) xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features. Genome Biol. 7:R111.

Schwede, T., Kopp, J., Guex, N. and Peitsch, M.C. (2003) SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res. 31:3381-3385.

Searle, B.C., Brundege, J.M. and Turner, M. (2005) Scaffold: a program to probabilistically combine results from multiple MS/MS database search engines. ABRF'05. Savannah, GA, USA, Abstract# P87-T.

Shevchenko, A., Sunyaev, S., Loboda, A., Shevchenko, A., Bork, P., Ens, W. and Standing, K.G. (2001) Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal Chem. 73(9):1917-1926.

Siebes, R., Dupplaw, D., Kotoulas, S., de Pinninck, A.P., van Harmelen, F. and Robertson, D. (2007) The OpenKnowledge system: an interaction-centered approach to knowledge sharing. Lecture Notes in Computer Science. 4803:381-390.

Taylor, J.A. and Johnson, R.S. (1997) Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Comm. Mass Spec. 11:1067-1075.

Xu, C. and Ma, B. (2006) Software for computational peptide identification from MS/MS. Drug Discovery Today. 11:595-600.

* The statistics of sequence similarity scores in Section 4.3.1.1 were retrieved on 19th November 2008 from http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.