engineering agent-mediated integration of bioinformatics analysis

15
Engineering Agent-Mediated Integration of Bioinformatics Analysis Tools Vassilis G. Koutkias, Andigoni Malousi, and Nicos Maglaveras Lab. of Medical Informatics, Faculty of Medicine, Aristotle University of Thessaloniki, P.O. BOX 323, 54124, Thessaloniki, Greece {bikout, andigoni, nicmag}@med.auth.gr Abstract. The availability of massive amounts of biological data, dis- tributed in various data sources, has prompted the development of a wide range of data analysis tools. However, due to the heterogeneity and complexity of the domain, performing a specific analysis by combin- ing tools is complicated. In this paper, we argue that the software agent metaphor constitutes a flexible approach towards integration of Bioinfor- matics tools and dynamic composition-execution of data analysis work- flows. To substantiate this assertion, a scalable engineering methodology is presented, aiming to address simple as well as complex cases, in which generic Multiagent Systems are described to address different functional scenarios, incorporating agent coordination strategies and semantic do- main descriptions. Based on this methodology, test cases are discussed on gene prediction tools to illustrate the applicability of the proposed systems. 1 Introduction The massive amounts of biological data, available through diverse distributed data sources, facilitated the development of a wide range of data analysis tools. Despite their beneficial public accessibility, the interpretation of raw sequence data into meaningful knowledge usually comes with unexcelled computational complexities associated with biological simplifications, still too significant to dis- regard [1]. From the computational perspective, biological data analysis involves the implementation of algorithmic approaches that handle distributed and usually heterogeneous data, deposited in multi-institutional databases. Technically, the efficiency of these methods is bounded by the lack of standard exchange protocols and controlled vocabularies to describe data structures and semantics [2]. From the biological point of view, researchers usually perform multiple, com- bined submissions to a variety of resources, in order to analyze query data [3]. Each tool has its own interface and is characterized by various requirements that may diverse even within the same resource class. Currently, in order to perform an analysis workflow, researchers have to select the appropriate tool(s), and de- fine the succession and data-passing between workflow nodes. This procedure is

Upload: others

Post on 18-Mar-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Engineering Agent-Mediated Integration ofBioinformatics Analysis Tools

Vassilis G. Koutkias, Andigoni Malousi, and Nicos Maglaveras

Lab. of Medical Informatics, Faculty of Medicine,Aristotle University of Thessaloniki, P.O. BOX 323, 54124, Thessaloniki, Greece

{bikout, andigoni, nicmag}@med.auth.gr

Abstract. The availability of massive amounts of biological data, dis-tributed in various data sources, has prompted the development of awide range of data analysis tools. However, due to the heterogeneityand complexity of the domain, performing a specific analysis by combin-ing tools is complicated. In this paper, we argue that the software agentmetaphor constitutes a flexible approach towards integration of Bioinfor-matics tools and dynamic composition-execution of data analysis work-flows. To substantiate this assertion, a scalable engineering methodologyis presented, aiming to address simple as well as complex cases, in whichgeneric Multiagent Systems are described to address different functionalscenarios, incorporating agent coordination strategies and semantic do-main descriptions. Based on this methodology, test cases are discussedon gene prediction tools to illustrate the applicability of the proposedsystems.

1 Introduction

The massive amounts of biological data, available through diverse distributeddata sources, facilitated the development of a wide range of data analysis tools.Despite their beneficial public accessibility, the interpretation of raw sequencedata into meaningful knowledge usually comes with unexcelled computationalcomplexities associated with biological simplifications, still too significant to dis-regard [1].

From the computational perspective, biological data analysis involves theimplementation of algorithmic approaches that handle distributed and usuallyheterogeneous data, deposited in multi-institutional databases. Technically, theefficiency of these methods is bounded by the lack of standard exchange protocolsand controlled vocabularies to describe data structures and semantics [2].

From the biological point of view, researchers usually perform multiple, com-bined submissions to a variety of resources, in order to analyze query data [3].Each tool has its own interface and is characterized by various requirements thatmay diverse even within the same resource class. Currently, in order to performan analysis workflow, researchers have to select the appropriate tool(s), and de-fine the succession and data-passing between workflow nodes. This procedure is

time-consuming, given that it is usually repeated, and often error-prone, due tothe extended parameterizations involved in multiple processing steps.

Being able to co-relate different types of biological data is necessary, in orderto increase our understanding on the underlying biological processes [2]. To-wards this aim, considering the widespread relevant resources, as well as theirincompatibilities in structure and semantics, the need for automated workflowexecutions, built within a configurable integration framework, is substantial [4].

In this paper, we assert that the software agent paradigm constitutes a flexibleapproach towards integration of Bioinformatics tools and automated enactmentof data analysis workflows, and we sustain this claim by presenting a scalableengineering methodology of Multiagent Systems, which is applied to address par-ticular integration scenarios. Our approach is based on (a) coordination amongagents, following the mediation-wrapper paradigm, (b) ontology-based classi-fications of tools associated with semantic descriptions, and (c) matchmakingmechanisms.

2 Agent Technology and Bioinformatics ResourcesIntegration

Software agent technology is a rapidly evolving interdisciplinary field that re-lies on the concept of active software entities, situated in an environment, inwhich they are capable of autonomous, reactive and/or proactive, and socialbehavior (communication with other agents, systems or humans) [5]. Typically,agent-based systems consist of a community of software agents, interacting witheach other (either cooperating or antagonizing), formulating this way electronicsocieties, commonly known as Multiagent Systems [6].

Among various application and research domains [5], agent technology hasalso been considered favorable as an integration approach for legacy softwaresystems [7]. In particular, several agent-based approaches have been proposed toaddress the complexity of biological data and the heterogeneity of bio-resourcesin general [2]. The software agent metaphor constitutes a beneficial approach forintegrating Bioinformatics resources, due to the following reasons:

– Software agents may hide the distribution of biological data and tools, basedon appropriate mediation-wrapping mechanisms [8], [9].

– In order to support data acquisition from heterogenous computational re-sources, appropriate query mechanisms have to be introduced [10]. In addi-tion, the vast amount of biological data makes data filtering a prerequisitefor effective information extraction. Software agents may effectively embodythese functionalities [11], [12].

– Complexities in workflow composition and execution may be encapsulatedwithin tasks delegated to agents and facilitated by agent interactions [13].

– Ontology-based semantic descriptions of Bioinformatics related domains,may be utilized by software agents, offering enriched data handling and rea-soning capabilities [12], [14], [15].

Hence, agent technology has been adopted in several research projects forintegration of Bioinformatics resources. For example, GeneWeaver works on theidea of integrating genomic databases and data analysis tools via a communityof software agents [16]. BioMAS is a Multiagent System for genomic annota-tion that relies on an information gathering system, modeled as an extendeddistributed query processing mechanism [17]. In [4], Bioinformatics tools’ in-tegration is addressed by adopting an agent-based middleware, originated byanalysis at different abstraction levels. Finally, in [18], the benefits from adopt-ing agent technology are discussed, in the context of myGrid research project,which aims to help biologists and bioinformaticians to perform workflow-basedin silico experiments by automating their management.

In this paper, we focus on Web-based Bioinformatics data analysis toolsand present the virtue of applying agent technology to cope with the intrinsiccomplexity and heterogeneity of the domain, by presenting a scalable engineeringmethodology, taking into account actual integration scenarios.

3 Agent-based Mediation for Bioinformatics ToolsIntegration

In the scope of this work, we categorize Bioinformatics analysis tools into classes,in terms of their functionality. In the following, three Multiagent Systems arepresented, each one targeting to a particular integration scenario. Specifically,the first one addresses a horizontal integration scenario (i.e., integration of com-plementary tools), the second comprises an approach for vertical integration (i.e.,integration of tools with similar functionality), and finally, the third combinesthe aforementioned architectural approaches to address a complex data analysisscenario of both horizontal and vertical integration [19].

3.1 Workflow Formulation and Enactment with Predefined Tools

A rather common scenario for biological data analysis, is the one in which aresearcher has to use a predefined set of complementary tools. The tasks incor-porated in the pipeline are conditionally executed, according to the previouslyextracted outcome.

Specifically, given a set of tools T = {T1, T2, . . . , TK}, a workflow W has tobe dynamically determined and executed with

W ≡ N1 → N2 → . . . → Nj , (1)

where Ni denotes the ith workflow node, corresponding to a tool from set T , and1 ≤ j ≤ K.

To automatically handle the potential diversities among the successive steps,a rather typical approach is to construct a set of Wrapper Agents [9], one for eachresource, that will hide the heterogeneities among the tools under considerationin a flexible and transparent way [4], [20]. The definition and coordination of the

Fig. 1. The Multiagent architecture enabling workflow formulation and enactment withpredefined tools.

analysis workflow is based on a Broker Agent [8], which encapsulates the under-lying reasoning required, in a dynamic interaction protocol with the WrapperAgents [11]. The Broker Agent incorporates filtering and processing mechanisms,applied to the information obtained from each Wrapper Agent [21]. Based onthis information, the successive workflow steps are determined. The dynamiclocalization of the Wrapper Agents, is addressed by a Directory Facilitator thatregisters their capabilities, in a yellow-page service [7].

To facilitate data exchange among agents, an application ontology is con-structed and incorporated in the agent communication acts, providing a com-mon vocabulary of terms associated with the semantics of the specific agentinteractions [22]. Agent messages encoded in such an ontology contain complexinformation about agent actions or predicates. Specifically, within the content ofthese messages, information such as the description of the tool to be wrapped,or query parameters, are effectively encapsulated.

To illustrate this scenario, the architecture of the proposed Multiagent Sys-tem is presented in Fig. 1. The User Agent initiates an interaction protocol,based on the user request for data analysis, requesting from the Broker Agentto configure and coordinate a relevant workflow [23]. The Broker Agent assessesthe user request, encapsulates its parameters in an agent action, according tothe application ontology, and delegates to the first Wrapper Agent defined, tosubmit the request to its corresponding analysis tool. Then, the Wrapper Agenttransforms the received request in the proprietary query language of the partic-ular analysis tool, submits it, and sends the results to the Broker Agent. Basedon the analysis outcome, the Broker Agent determines whether to continue theworkflow, by proceeding with other Wrapper Agents of the suite, or providethe analysis outcome to the user, via the User Agent. In any case, the analysisoutcomes are appropriately compiled and provided to the end-user.

Using a fixed set of tools to construct biological workflows, is a rather in-adequate approach, considering the variety of the available techniques. A morefavorable approach, is to compose workflows of dynamically-defined tools, basedon user-defined criteria. To cope with this requirement, an intermediate archi-

tecture is proposed, in which the candidate tools are semantically described,enabling efficient matchmaking.

3.2 Matchmaking Candidate Tools via Semantic Descriptions

Generally, in an open environment, finding the appropriate agents/services whichmight have the information or the capabilities an agent/human needs, is referredin the literature as the connection problem [8]. There are three special types ofinformation that have to be taken into account to tackle this problem: capabili-ties, pre-conditions and preferences [24]. In our problem space, in order to selecta tool from a common functional class for participation in a data analysis work-flow, a set of criteria have to be defined. The majority of existing Bioinformaticsanalysis tools are limited by a set of functional constraints and requirements,such as:

– Each tool may perform analysis on species-specific datasets. Hence, the or-ganism(s) of interest together with other input parameters have to be definedin the request.

– Restrictions related to the input data and their format may apply. Similarly,the results obtained are encoded in different formatted outputs, e.g., eitherdisplayed in Web pages, or delivered to a user-defined e-mail address.

– The resulting outcome diverge, even within the same class of tools. For ex-ample, some gene prediction tools’ outputs include only the identificationof the predicted coding regions, while others provide additional information,such as the amino acid sequence of the predicted peptide(s), and/or the typeand position of the signal sensors detected.

– Most Bioinformatics analysis tools are accessible through HTTP requests onincompatible interfaces, based on particular conventions for structuring anddescribing concepts.

Considering the aforementioned issues, we propose the construction of a domainontology to provide the schema for semantic annotation of a potential analysisclass, under consideration [12]. Based on this ontology, a Knowledge Base isconstructed that enables meta-searches on tools’ descriptions via user-definedand implied criteria [25], [26].

Specifically, let T = {T1, T2, . . . , TN} be the set of all analysis tools (of thesame class) registered with the Knowledge Base. If Pr = {Pr1, P r2, . . . , P rM}is the set of user-defined preferences on available analysis options, according tothe ontological schema, and C = {C1, C2, . . . , CN}, Pc = {Pc1, P c2, . . . , P cN}are the sets of capabilities and pre-conditions respectively, corresponding to eachtool in the Knowledge Base, then the matchmaking process f maps Pr to C andPc, such that:

f : Pr → 〈C, Pc〉. (2)

In general, a query in the Knowledge Base is expressed as:

matching query : Pr1 ¦ Pr2 ¦ . . . ¦ PrM ≡ SelectNi=1(Ti), (3)

Fig. 2. The Multiagent architecture enabling integration of a class of analysis toolsbased on semantic annotations.

where the operator ¦ is either u or t, in the general case (taking into accountthe priorities of conjunction and disjunction, appropriately) [27], and Selectcorresponds to the matchmaking mechanism among the N candidates of set T .The output of such a query is a set of tools T ∗ ⊆ T , instances of the KnowledgeBase, that correspond to the specific user preferences Pr.

Comparing to the Multiagent architecture of Fig. 1, we modified the me-diation layer by incorporating the Knowledge Base consisting of the semanticdescriptions of tools (based on the domain ontology) and the Knowledge-BaseWrapper Agent (Fig. 2). Specifically, the role of each module within this medi-ation layer is the following:

– The Broker Agent controls and coordinates access to data analysis tools uponrequest of a User Agent. It cooperates with the Knowledge-Base WrapperAgent to map requests into selection of tools, and compiles the correspondingoutcome in a transparent way, provided by the Wrapper Agents.

– The Knowledge Base module contains a semantic classification of tools alongwith their functional descriptions. This approach enables the execution of thematchmaking procedure between user-defined requirements and the func-tional specification of tools [8], constituting a meta-search mechanism in theKnowledge Base. Knowledge is retrieved/acquired from/to the KnowledgeBase via the Knowledge-Base Wrapper Agent.

– The Knowledge-Base Wrapper Agent applies the matching procedure be-tween the analysis parameters and the tools that fulfill the user requirementsby accessing the Knowledge Base. For this reason, it translates requests intoappropriate Knowledge Base queries, and forwards the list of matching tools(if any) to the Broker Agent. In addition, it registers analysis tools to theKnowledge Base upon request of a Wrapper Agent, via the Broker Agent.

Regarding the annotation of analysis tools, a registration mechanism is ap-plied, according to which each corresponding Wrapper Agent interacts with theBroker Agent to register the tool with the Knowledge Base (e.g., upon an ad-ministrator’s request or on system start-up) [25].

Fig. 3. The brokering protocol for semantic selection of the appropriate analysis tool(s).

To illustrate the brokering protocol applied for semantic selection of theappropriate candidate tool(s), an AUML sequence diagram [28] is presented inFig. 3.

The Multiagent System architecture presented in this section provides atransparent solution to the problem of selecting the appropriate analysis tool(s)from a set of candidates. Consequently, the intrinsic complexity originated bythe pre-conditions and capabilities of each tool are hidden from the end-user. Itis noteworthy, that the current architecture may be also utilized for evaluatinganalysis tools, by extending the behavior of the Broker Agent. Specifically, as-suming that a set of tools matches the request criteria, the Broker Agent couldretrieve the analysis outcome from all corresponding Wrapper Agents and com-bine the results obtained in a comparative view.

Based on the modularity and reusability provided by the agent metaphor,the systems described in sections 3.1 and 3.2 may be utilized to construct ageneralized Multiagent architecture, aiming to address more complex scenarios,in terms of both horizontal and vertical integration. The proposed architectureis presented in the following.

3.3 Coping with Complex Integration Scenarios

In a more generic scenario, several Bioinformatics analysis tools from differentcategories have to be orchestrated, in order to perform a particular computa-tional analysis. Usually, the tools comprising the workflow nodes are not knowna priori. Thus, the appropriate tool from each class has to be selected, accordingto the analysis’s requirements. To support this functional scenario in an auto-mated fashion, we propose a two-level mediation architecture [29], illustrated inFig. 4. Considering the Multiagent architecture presented in section 3.1 as anextendable unit and the one presented in 3.2 as a reusable module, the currentarchitecture enables (1) selection of the appropriate tool from each category and(2) dynamic enactment of the corresponding workflow. This is feasible due toflexibility and scalability offered by the component-based agent paradigm.

The key-module in the current architecture is the Mediator Agent [11], [24].In each class of analysis tools, a Broker Agent, a Knowledge Base, a Knowledge-Base Wrapper Agent, and the Wrapper Agents of the corresponding tools are

Fig. 4. The Multiagent architecture for horizontal and vertical integration, enablingcomplex workflow execution scenarios.

included (following the concepts described in section 3.2). Upon a user request foranalysis, the Mediator Agent decomposes the request into sub-queries, accordingto the available classes, and initiates an interaction protocol by requesting fromthe Broker Agent, corresponding to the first class determined in the workflow, tomatch the defined criteria, encapsulated in the sub-query, against the semanticdescription of its class.

Assume K classes of analysis tools. According to (2), the Broker Agent #i(Fig. 4) applies the following mapping, in order to select the appropriate(s)among the candidate tools of its class:

fi : Pri → 〈Ci, P ci〉, (4)

where fi is the mapping mechanism for class i, Pri are the preferences relatedto the user request that correspond to class i, and Ci, Pci are the capabilitiesand pre-conditions of tools in class i, respectively, with i ∈ {A, . . . , K}.

In Fig. 5, the relevant mediator-broker(s) interactions are illustrated (in-teractions among each class’s Broker Agent, the corresponding Knowledge-BaseWrapper Agent, the Knowledge Base and the Wrapper Agents are omitted, sincethey are similar with the ones presented in Fig. 3). Accordingly, this BrokerAgent initiates a brokering protocol, like the one described in section 3.2, andprovides the analysis results to the Mediator Agent. In the following, the Media-tor Agent determines the next node of the workflow by requesting from anotherclass’s Broker Agent to apply the analysis required. The procedure is iterated,until the dynamically defined workflow reaches a termination condition.

Thus, a potential analysis workflow W is defined as follows (Fig. 4):

W ≡ SelectMi=1(Ai) → . . . → SelectNi=1(Ki), (5)

where Select denotes the matchmaking mechanism, and M , N correspond tothe number of candidate tools in classes A, K respectively.

The architectures illustrated in Fig. 1, 2 and 4, indicate that the modularityof the agent approach facilitates a scalable and extendable integrated solution, to

Fig. 5. An AUML sequence diagram depicting the two-level mediation protocol.

address both horizontal and vertical integration requirements. In the following,three example scenarios are presented, applied to computational gene prediction.

4 Test Cases on Gene Prediction

The applicability of the integration schemas described above has been tested onthe problem of predicting gene structures in eukaryotic genomes.

The identification of gene structures, lying among uncharacterized genomicregions in eukaryotes, is a well-known problem in the Bioinformatics field andseveral computational methods have been developed to automate or facilitategene predictions based on similarity-based or ab-initio approaches [30]. Similarity-based or extrinsic methods exploit the evidence coming from databases of anno-tated ESTs/cDNA or protein sequences, and identify potential coding regions bytheir homology levels. Based on the type of similarity exploited, the homologiescan be identified through various Web-based applications of BLAST alignmentalgorithms (BlastN, BlastP, BlastX, etc.).

Ab-initio or intrinsic approaches employ probabilistic methods to detect po-tential coding regions by identifying specific content sensors and signal-basedfeatures, upon which they build the most probable gene structure [31]. For in-stance, Fgenesh, Genscan, HMMgene, etc., implement ab-initio methods thatgenerate gene assemblies by evaluating the highest-scoring coding segments andsignals detected within the query sequence. The predictive power of these meth-ods is bounded by the alternations in the structure of the predicted features.Although coding regions are highly conserved, the detection of the exact 5’ and3’ end exons, as well as the actual splice junctions, appear to have considerablelower accuracy [30]. In human genome, for example, it is estimated that morethan 35% of the identified genes are alternatively spliced giving more than oneexon assemblies [32].

Along with the ab-initio gene structure finders, additional evidence can be ex-tracted by signal-specific detectors that implement various probabilistic methodsto identify signal sites that regulate gene expression such as promoter regions,splice sites and poly-A signals [31]. FirstEF, NNPP, HCPolyA, and NetServer2are examples of Web-based tools that end-up with a set of probable signal sites

Fig. 6. Classification of analysis tools related to gene prediction, and illustration oftest cases.

that, combined with ab-initio gene prediction methods, may help improve theoverall predictive accuracy.

Each tool independently emerges significant variations in its predictive power,depending on the training models used and the configuration settings [33]. Newlyappeared algorithms propose various refinements, mainly through combinationsof multiple resources of evidence [34]. The basic idea is that associating intrinsicmethods with the added-value offered by homology searches can help gain morereliable insights on the structural and compositional features that regulate geneexpression in eukaryotic genomes.

In the following test cases, we address the gene annotation problem by consid-ering three classes of tools, namely, extrinsic class (ExtC), intrinsic class (IntC)and signal-site class (SigC). Fig. 6 illustrates orchestration scenarios among in-stances of the aforementioned classes. The test cases were implemented by usingtools like the aforementioned, for each analysis class. JADE was adopted as anagent construction and execution environment [35]. The ontologies that seman-tically describe the three analysis classes were developed in Protege [36].

4.1 Test Case 1

To illustrate the applicability of the Multiagent architecture described in section3.1, a fixed set of instances, lets say ExtCi, IntCj and SigCk, is defined toautomate the cascading tasks followed during the annotation of an unknowngenomic sequence [21]. The processing steps involved in this procedure are: (a)perform a specific BLAST-based search (ExtCi) to find evidence in databasesof annotated sequences, (b) execute an ab-initio analysis tool (IntCj) to predictthe optimal gene assembly, in case of no homologies found in step (a), and (c)associate strong indications of a signal sensor (SigCk).

In Fig. 6, the nodes connected with thicker arcs are parts of the workflowW , where W ≡ ExtCi → IntCj → SigCk, according to (1). Going throughstep (a), if the statistical significance of a match reported by ExtCi indicatesa strong homology (Expect value below a predefined stringent threshold), thenit is likely that the query sequence shares common features with the identifiedhomologous. Otherwise, the IntCj tool is executed to detect a potential new gene

structure assembly (step (b)). The extracted features along with their associatedprobabilities can be further assessed with the execution of SigCk, providing morein-depth insight on a specific signal appearing within the query sequence (step(c)).

Related to the architecture presented in section 3.1, for each tool an appro-priate Wrapper Agent is constructed, and the logic according to which the tasksare conditionally cascaded is encapsulated in the reasoning scheme of the BrokerAgent.

4.2 Test Case 2

A more sophisticated approach to deal with effective gene annotations is todynamically handle requests and decide upon the most suitable tool, instead ofrelying on a fixed-closed instance group. The integration of functionally similartools is illustrated in Fig. 6, on a set of M ab-initio gene finders defined asinstances of class IntC [25]. Following the mediation architecture in section3.2, a domain ontology was constructed, to capture the specifications of theincorporated intrinsic tools, in terms of their capabilities and pre-conditions(input, output, configuration).

Fig. 7 depicts an abstract view of the defined classes, as well as their rela-tions (pointed as directed arcs), where IntC corresponds to AbInitioTool class.AgentAction and Predicates classes describe the agent tasks and related con-cepts. Based on the semantic description of class IntC, the matchmaking mech-anism f (provided by (2)) maps user preferences to one or more instance toolsIntCi, where 1 ≤ i ≤ M , according to the annotated capabilities and pre-conditions of tools. An example query follows:

Find all ab-initio gene finders that support human gene predictions and canidentify multiple gene structures on both strands within a genomic sequencelonger than 30.000 base-pairs.

The corresponding range definition and query statement are expressed inPAL (Protege Axiom Language) [36], as follows:

Range Definition:(defrange ?tool :FRAME AbInitioTool)(defrange ?spec :FRAME Organism)

Query Statement:(findall ?tool

(exists ?spec(and

(OrganismName ?spec ‘‘Homo Sapiens’’)(Organisms ?tool ?spec)(NoGenes ?tool (coerce-to-symbol ‘‘Multiple’’))(Strand ?tool (coerce-to-symbol ‘‘both’’))(>(SequenceLength ?tool) 30000))))

Fig. 7. Part of the ontology for semantic annotation of ab-initio gene prediction tools.

Related to the Multiagent architecture presented in 3.2, this query is com-piled by the Knowledge-Base Wrapper Agent and applied to the correspondingKnowledge Base (Fig. 2), in order to identify the tools that match the requestedanalysis. The matchmaking procedure is encapsulated in the reasoning schemeof the Broker Agent.

4.3 Test Case 3

Taking advantage of the modularity and scalability of the mediation architec-tures described in sections 3.1 and 3.2, a generalized integration schema wasdeveloped to illustrate the applicability of the architecture described in section3.3. Given a set of instances, corresponding to ExtC, IntC and SigC classes,the scenario described in section 4.2 is extended to incorporate all stages ofthe workflow presented in section 4.1. In the particular setting, the matchmak-ing mechanism fi is iteratively applied as defined in (4), for all three classes oftools, resulting in a dynamically constructed workflow that extracts the optimaloutcome, according to the specified requirements (Fig. 6).

In this case, a Knowledge Base is constructed for each class, containing de-scriptions regarding tools’ capabilities and pre-conditions [8], similar to the onepresented in section 4.2. Appropriate Wrapper Agents are assigned, for all tools.The overall workflow coordination is delegated to the Mediator Agent, whichdecomposes gene annotation queries and cooperates accordingly with the Bro-ker Agents assigned to the classes. In this context, workflows are defined andexecuted, in the form of (Fig. 6):

W ≡ SelectLi=1(ExtCi) → SelectMi=1(IntCi) → SelectNi=1(SigCi), (6)

where Select denotes the matchmaking mechanism.

5 Discussion

The emergence of high-throughput information acquisition in biological prob-lems has directed research to knowledge extraction techniques. A wide rangeof computational tools has been developed for this purpose, most of them be-ing publicly available over the Internet, aiming to accelerate biological analy-sis procedures. However, their diversity and heterogeneity, regarding functionalcharacteristics and requirements of use, makes their utilization difficult at leastfor non-experienced users, and complicates manual computational workflows inwhich several tools have to be sequentially executed, in order to perform thedesired analysis [25]. These characteristics result in the emergence of severalintegration techniques to facilitate transparent and advanced access to Bioinfor-matics analysis tools, aiming to hide the intrinsic complexities and regularitiesrequired, towards the envisioned knowledge discovery.

Agent technology has gained wide interest as a potential solution to inte-gration of Bioinformatics analysis tools, due to its underlying design conceptu-alization [2]. Specifically, the inherent attributes of the agent metaphor, suchas autonomy, communication, coordination, as well as knowledge-sharing andreuse, provide significant advantages towards a flexible and applicable integra-tion paradigm [4]. Especially, Multiagent Systems through workload distributionand cooperation [23], comprise a modular and flexible approach for large-scaletools integration, and accordingly, workflow composition-execution.

In this paper, we presented an engineering methodology for agent-mediatedintegration of Bioinformatics tools. First, a horizontal integration scenario wasaddressed [19], through a Multiagent architecture for workflow composition andenactment of a fixed set of analysis tools. The approach was based on a brokeringmechanism, which coordinated a set of Wrapper Agents, each corresponding to aparticular tool. In the second system, the focus was on semantic selection of toolsfrom the same class (vertical integration). This was achieved by constructing aKnowledge Base of tools, which facilitated the matchmaking procedure amonguser preferences and tools’ capabilities and pre-conditions.

Following the design principles of the aforementioned systems, a generalizedMultiagent architecture was constructed, in order to address a more complex sce-nario, related to both horizontal and vertical integration. Specifically, automatedworkflow composition and execution was enabled, by dynamically and seman-tically integrating several analysis tools from different classes. In this case, theworkflow was coordinated by a two-level mediation architecture. Through thisengineering methodology, the modularity offered by the agent-based paradigmwas indicated by extending/scaling the first two Multiagent Systems presented,to support the third, more complex integrated scenario. For each one of thesystems presented, test cases were provided to illustrate their applicability andvirtue, applied to the challenging task of predicting genes in eukaryotic genomes.

Due to the persistent interoperability concerns related to biological resources,significant interest has emerged in service-oriented approaches, such as WebServices [37]. These approaches enable a high level of automation in interactionwith, and orchestration among analysis tasks. In the test cases presented, the

Wrapper Agents’ construction was based on screen-scrapping techniques, whichare “sensitive” to updates or modifications. To overcome this issue, we currentlyfocus on standardized and machine-readable resource descriptions, offered byservice-oriented architectures. Particularly, we plan to adopt technologies such asOWL-S [38], to formally describe capabilities and pre-conditions of services, andfacilitate definition of service choreographies, associated with enriched semantics.

References

1. Ng, S.-K., Wong, L.: Accomplishments and challenges in Bioinformatics. IEEE ITProfessional 6(1) (2004) 44–50

2. Karasavvas, K., Baldock, R., Burger, A.: Bioinformatics system integration andagent technology. Journal of Biomedical Informatics 37(3) (2004) 205–219

3. Stevens, R., Goble, C., Baker, P., Brass, A.: A classification of tasks in Bioinfor-matics. Bioinformatics 17(2) (2001) 180–188

4. Corradini, F., Mariani, L., Merelli, E.: An agent-based layered middleware as toolintegration. International Journal on Software Tools for Technology Transfer 6(3)(2004) 231–244

5. Luck, M., McBurney, P., Preist, C.: A manifesto for agent technology: Towardsnext generation computing. Autonomous Agents and Multi-Agent Systems 9(3)(2004) 203–252

6. Wooldridge, M.: An introduction to Multiagent Systems. J. Wiley and Sons (2002)7. Foundation of Intelligent Physical Agents.: FIPA Agent Software Integration Spec-

ification (2001) Available at: http://www.fipa.org/specs/fipa00079/8. Wong, H.C., Sycara, K.: A Taxonomy of Middle-Agents for the Internet. Proceed-

ings of the 5th International Conference on Multi-Agent Systems (2000) 465-4669. Tork Roth, M., Schwarz, P.: Don’t scrap it, wrap it! A wrapper architecture for

legacy sources. Proceedings of the 23rd VLDB Conference Greece (1997) 266–27510. Buttler, D., et al.: Querying multiple Bioinformatics information sources: Can Se-

mantic Web research help?. ACM SIGMOD Record 31(4) (2002) 59–6411. Aldana, J.F., Roldan, M., Navas, I., Perez, A.J., Trelles, O.: Integrating biological

data sources and data analysis tools through mediators. Proceedings of the 2004ACM Symposium on Applied Computing Nicosia Cyprus (2004) 127

12. Lu, J.-J., Hsu, C.-N.: Query answering using ontologies in agent-based resourcesharing environment for biological Web information integrating. Proceedings ofthe IJCAI Workshop on Information Integration on the Web Acapulco Mexico(2003) 197–202

13. Bryson, K., Luck, M., Joy, M., Jones, D.T.: Agent interaction for Bioinformaticsdata management. Applied Artificial Intelligence 15(10) (2001) 917–947

14. Baker, P.G., Goble, C.A., Bechhofer, S., Paton, N.W., Stevens, R., Brass, A.: Anontology for Bioinformatics applications. Bioinformatics 15(6) (1999) 510–520

15. Miled, Z.B., Webster, Y.W., Liu, Y.: An ontology for semantic integration of lifescience Web databases. International Journal of Cooperative Information Systems12(2) (2003) 275–294

16. Bryson, K., Luck, M., Joy, M., Jones, D.T.: Applying agents to Bioinformaticsin GeneWeaver. Cooperative Information Agents IV. LNAI, Vol. 1860. Springer-Verlag, Berlin Heidelberg New York (2000) 60-71

17. Decker, K., Khan, S., Schmidt, C., Situ, G., Makkena, R., Michaud, D.: BioMAS: Amulti-agent system for genomic annotation. International Journal of CooperativeInformation Systems 11(3-4) (2002) 265–292

18. Moreau, L., et. al.: On the use of agents in a Bioinformatics Grid. NETTAB 2002,Agents in Bioinformatics, Bologna Italy (2002)

19. Hernandez, T., Kambhampati, S.: Integration of biological sources: Current sys-tems and challenges. ACM SIGMOD Record 33(3) (2004) 51–60

20. Chang, C.-H., Sick, H., Lu, J.-J., Hsu, C.-N., Chiou, J.-J.: Reconfigurable Webwrapper agents. IEEE Intelligent Systems 18(5) (2003) 34–40

21. Koutkias, V., Malousi, A., Chouvarda, I., Maglaveras, N.: A Multi-agent systemto automate human gene prediction by integrating heterogeneous computationaltools. Proceedings of the 3rd Hellenic Conference on Artifical Intelligence SamosGreece (2004) 139–148

22. Labrou, Y., Finin, T., Peng, Y.: Agent communication languages: The currentlandscape. IEEE Intelligent Systems 14(2) (1999) 45–52

23. Vidal, J.M., Buhler, P., Stahl, C.: Multiagent systems with workflows. IEEE In-ternet Computing 8(1) (2004) 76–82

24. Klusch, M., Sycara, K.: Brokering and matchmaking for coordination of agentsocieties: a survey. Coordination of Internet agents: models, technologies, and ap-plications. Springer-Verlag, London (2001) 197–224

25. Koutkias, V., Malousi, A., Maglaveras, N.: Performing ontology-driven gene pre-diction queries in a multi-agent environment. In: Barreiro J. M., Martin-Sanchez,F., Maojo, V., Sanz, F. (eds.): Biological and Medical Data Analysis. LNCS, Vol.3337. Springer-Verlag, Berlin Heidelberg New York (2004) 378–387

26. Castano, S., Ferrara, A., Montanelli, S., Racca, G.: Matching techniques for re-source discovery in distributed systems using heterogeneous ontology descriptions.Proceedings of the International Conference on Information Technology: Codingand Computing 1 (2004) 360–366

27. Li, L., Horrocks, I.: Matchmaking using an instance store: some preliminary results.Proceedings of the 2003 International Workshop on Description Logics.

28. Odell, J., Van Dyke Parunak, H., Bauer, B.: Representing agent interaction proto-cols in UML. In: Wooldridge, M., Weiss, G., Ciancarini, P. (eds.): Agent-OrientedSoftware Engineering II. LNCS, Vol. 2222. Springer-Verlag, Berlin Heidelberg NewYork (2001) 121–140

29. Durfee, E.H.: Scaling up agent coordination strategies. IEEE Computer 34(7)(2001) 39–46

30. Mathe, C., Sagot, M.F., Schiex, T., Rouze, P.: Current methods of gene prediction,their strengths and weaknesses. Nucleic Acids Research 30 (2002) 4103-4117

31. Zhang, M.Q.: Computational prediction of eukaryotic protein-coding genes. Nature3 (2002) 698–710.

32. Mironov, A.A., Fickett, J.W., Gelfand, M.S.: Frequent alternative splicing of hu-man genes. Genome Research 9 (1999) 1288–1293

33. Rogic, S., Mackworth, A.K., Ouellette, F.: Evaluation of gene-finding programs onmammalian sequences. Genomic Research 11 (2001) 817-832

34. Allen, J.E., Pertea, M., Salzberg, S.L.: Computational gene prediction using mul-tiple sources of evidence. Genome Research 14 (2001) 142–148

35. Bellifimine, F., Poggi, A., Rimassa, G.: JADE - A FIPA-compliant Agent Frame-work. Proceedings of the 4th International Conference and Exhibition on the Prac-tical Application of Intelligent Agents and Multi-Agents (1999) 97–108

36. Noy, N.F., et al.: Creating Semantic Web contents with Protege-2000. IEEE Intel-ligent Systems 16(2) (2001) 60–71

37. Stein, L.: Creating a bioinformatics nation. Nature 417 (2002) 119–12038. Singh, M.P., Huhns, M.N.: Service-oriented computing: Semantics, processes,

agents. J. Wiley and Sons (2005)