
EXPLORING APPLICABILITY OF GRID TECHNOLOGIES TO NET CENTRIC “EDGE” SYSTEMS

Sumit Ray, BAE Systems, Johnson City, New York

Madhusudhan Govindaraju, Grid Computing Research Laboratory, Department of Computer Science, State University of New York, Binghamton, New York

This work was supported in part by the NSF under award number CNS 0454298.

Abstract

Scientific grids are often characterized by an infrastructure made up of high-speed and reliable networks that connect heterogeneous and distributed compute and storage devices. This is in stark contrast to the concept behind the Global Information Grid (GIG). The GIG is an interconnected information transfer infrastructure meant to provide seamless interoperability and information dominance to the war-fighter. This paper explores the applicability of Web services based grid technologies for monitoring and discovery, workflow, and QoS to net centric “edge” applications on the GIG. We discuss the salient features of current grid technologies, with examples wherever applicable, highlight the ideas that can be readily utilized for “edge” systems, and point out the gaps that remain for other required features.

1.0 Introduction

Scientific grid technologies often assume an infrastructure made up of high-speed and reliable networks connecting super-computers or equivalent clusters with petabytes of storage. This is in stark contrast to the concept behind the Global Information Grid (GIG) [1][2]. The GIG is an interconnected information transfer infrastructure, comprising a GIG backbone or “core” systems and highly mobile “edge” systems, that together are meant to provide seamless interoperability and information dominance to the war-fighter. In the DoD’s vision, the GIG will connect the entire gamut of computing systems used by the military in the execution of a war, a battle, or an action. At the “edge,” the GIG will consist of low-power intelligent sensors and resource-starved hand-held computing devices carried by the foot soldier. It will also include resource-constrained computing systems on satellites and aircraft in which the software is kept artificially simple and deterministic to pass regulatory hurdles. The networks connecting these “edge” devices are typically lossy, mobile, and ad-hoc. At the other end of the spectrum, the resources and networks available at the “core” are expected to rival those of scientific grids in capacity, capability, and reliability.

Along with connecting a heterogeneous set of resources, another key tenet of the GIG is “timely” information transfer that provides the right information to decision makers in time for them to take the appropriate action. Thus, the GIG infrastructure must inherently support non-real-time, soft real-time, and hard real-time transfer and management of information. Scientific grids, on the other hand, typically employ best-effort semantics: no timing guarantee is given; instead, the middleware relies on the quality of the resources and networks to make progress. Despite these differences, the DoD intends the infrastructure to consist of commercially available networking, computing, and storage technologies, in order to take advantage of the larger market forces that are fueling advances in commercial technology.

Areas of exploration within the scientific grid that may be particularly relevant for use in the GIG are 1) mechanisms for negotiating and managing Quality of Service (QoS), 2) recording the negotiated properties in contracts between service providers and clients as Service Level Agreements (SLAs), 3) resource brokering and selection based on QoS properties, 4) workflow management to coordinate and compose complex services and bind them with SLAs, and 5) monitoring execution to detect and react to violations of the SLAs.

A service provided by resource-constrained “edge” systems may require cooperative activity among several entities to cumulatively deliver all of the needed resources. In addressing this problem of cooperative activity, the GIG community may benefit from the workflow management techniques and languages being researched in the scientific grid community to allow service composition. SLA negotiations involving all of the service providers in the workflow group may be necessary to effectively co-allocate resources and coordinate cooperative activity. Resource discovery and brokering techniques developed to dynamically select the members of the group and to co-allocate their resources in advance will be important to providing timeliness and QoS guarantees on the GIG.

The Grid and Web services communities are working together to develop formal specifications for monitoring and enforcement of SLAs. An important recent development in the grid community is the WS-Agreement initiative, being considered by the Global Grid Forum (GGF), to produce a specification for the management of resources and services using negotiated SLAs. The use of enforceable SLAs along with workflows has the potential to address many of the challenges in the GIG. While many specifications have been proposed for workflow management, the Business Process Execution Language for Web services (BPEL4WS) [3] has emerged as a standard for defining logic in terms of execution order and conditions for the invocation of orchestrated Web services.

This paper explores the applicability of the Web services based grid technologies for discovery and monitoring, workflow, and QoS, to the net centric “edge” systems. We discuss the salient features of current grid technologies, along with examples wherever applicable, and highlight the ideas that can be readily utilized for “edge” systems and point out the gap that exists for other required features. The services or the monitors for these “edge” systems may or may not be implemented using the Web services architecture; however, the negotiation or the service startup will use Web services.

The rest of this paper is organized as follows. Section 2 discusses ideas related to QoS in the grid and how they apply to “edge” systems. In Section 3 we discuss workflow systems and explore their applicability in a dynamic and mobile environment. Section 4 focuses on grid monitoring and discovery systems and explores the features that directly address requirements of “edge” systems. Finally, in Section 5 we present the conclusions of this paper.

2.0 QoS Framework

Fundamentally, quality of service (QoS) can be defined as constraints on resource management needed to provide end-to-end assurances to the user about one or more properties of a service. Complex collaborations and the need for performance among distributed and heterogeneous services make assurances in terms of fault tolerance, timeliness, and security important to the proper functioning of the grid. However, unlike distributed multimedia systems [4], a comprehensive treatment of QoS frameworks for grid systems is still in its infancy. Instead, these systems typically rely on high-performance computing and network resources to provide adequate service quality, but without any guarantees.

In Section 2.1, we first provide some background on QoS frameworks as developed for distributed multimedia systems, followed by some current work in this area for the Grid. We then highlight, in Section 2.2, some of the QoS requirements necessary for proper functioning of “edge” systems in the GIG.

2.1 Design of a QoS Framework

A QoS framework consists of mechanisms for specifying, binding, and managing QoS properties at various levels of abstraction. Application-level abstractions concern service properties of interest to the user. Examples are end-to-end latency, reliability, information transfer size and throughput, and traffic characteristics like isochronous flow for multimedia applications. Providing end-to-end guarantees means that these abstract specifications must be mapped into constraints down through the layers and distributed among the resources belonging to the various nodes participating in the service. For instance, both service reliability and latency requirements may translate into specifications for replication at the network or transport layer, whereas a combination of performance requirements such as latency, size, throughput, and traffic characteristics may translate into load, bandwidth, jitter, and flow synchronization criteria managed by various elements of the middleware and network protocol stacks. The criteria must also be distributed amongst end systems, monitoring and brokering middleware, and the network to satisfy end-to-end constraints.

QoS-A (Quality of Service Architecture) [4][5] and XRM (Extended Integrated Reference Model) [6] are two example QoS frameworks for multimedia systems. Composed of four architectural layers and three management planes, the QoS-A architecture supports active end-to-end QoS management, tightly integrating end-system scheduling and I/O resource management with communication protocols and networks. In addition, the framework provides layer-specific QoS-mapper services to ease QoS parameter specification and to translate the parameters into the distinct QoS abstractions and mechanisms at each layer.

Possibly the most important concept incorporated within the QoS-A architecture is that of specifying, binding, and managing QoS properties with respect to a flow representing a unicast or multicast multimedia stream. The framework provides QoS related services for the entire lifetime of the flow. The key idea associated with the flow is the concept of negotiating an end-to-end contract that has two components: the user makes a “not to exceed resource level demands” commitment, and the various components cooperating to provide the service commit to “a level of resource availability.” The degree of commitment may also be negotiated based on priority levels or absolute measures such as deterministic, probabilistic, or best-effort. These concepts are supported by the following management services: 1) admission control and resource allocation at flow establishment; 2) monitoring and policing to verify provider and user commitments, respectively, for maintaining the flow; 3) adapting through filtering and scaling, or renegotiating, if violations are detected; and finally, 4) releasing resources to shut down the flow.
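To make the flow-contract lifecycle concrete, the sketch below renders the four management services in plain Python. The class and method names (FlowContract, admit, police, adapt, release) and the bandwidth-only resource model are our own illustration, not part of the QoS-A specification.

```python
# Minimal sketch of the QoS-A flow lifecycle: admission control,
# policing, adaptation, and release. All names are illustrative.
from dataclasses import dataclass

@dataclass
class FlowContract:
    user_demand_kbps: int        # user's "not to exceed" commitment
    provider_capacity_kbps: int  # provider's availability commitment
    commitment: str              # "deterministic" | "probabilistic" | "best-effort"

class FlowManager:
    def __init__(self, link_capacity_kbps: int):
        self.free_kbps = link_capacity_kbps
        self.flows: list[FlowContract] = []

    def admit(self, c: FlowContract) -> bool:
        """Admission control and resource allocation at flow establishment."""
        if c.provider_capacity_kbps <= self.free_kbps:
            self.free_kbps -= c.provider_capacity_kbps
            self.flows.append(c)
            return True
        return False

    def police(self, c: FlowContract, observed_kbps: int) -> bool:
        """Policing: verify the user's 'not to exceed' commitment."""
        return observed_kbps <= c.user_demand_kbps

    def adapt(self, c: FlowContract, shortfall_kbps: int) -> None:
        """Adaptation by scaling the flow down instead of dropping it."""
        c.provider_capacity_kbps -= shortfall_kbps
        self.free_kbps += shortfall_kbps

    def release(self, c: FlowContract) -> None:
        """Release resources to shut down the flow."""
        self.free_kbps += c.provider_capacity_kbps
        self.flows.remove(c)
```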

XRM is a multimedia networking framework developed by the COMET group at Columbia University that provides end-to-end QoS guarantees [4][6]. XRM comprises five distinct planes that are based on the principle of separation between communication, management, and traffic control architectures, with a common set of abstractions provided by the data architecture [7]. In this framework, QoS mechanisms are components of the management and traffic control architectural planes. An interesting insight presented in this framework is that the primary difference between these two planes is due to the different time-scales at which they operate. The management architecture, which operates at a much slower time-scale, consists of mechanisms for coordination and cooperation in a decentralized and distributed system, such as locating, instantiating, and configuring resources based on QoS properties, along with the fault handling and accounting management needed for adaptation. The traffic control architecture consists of mechanisms for managing local resources both at end systems and at intermediate network switches. Example services are scheduling, buffer management, routing, admission control, and flow control.

These requirements for global coordination and local resource management lead to the two key concepts in XRM needed to provide scalable, end-to-end QoS guarantees in multimedia networks. One is the concept of associating QoS constraints per flow or stream, similar to the flow concept in QoS-A. The other is the use of multiple traffic classes to manage the complexity of dynamically allocating local resources that satisfy the flows within the network as a whole [7]. Efficiency is gained by enabling statistical multiplexing of the local resource pools allocated to the traffic classes. Four traffic classes are defined within XRM based on traffic characteristics and QoS constraints such as acceptable loss and end-to-end delays. Dynamic resource allocation and admission control rely on the specification of resource capacity regions. A region is defined as an n-dimensional space of possible flows through a resource belonging to the n traffic classes for which quality of service can be guaranteed locally [8]. Provided by the data architecture, this abstraction for network resources is called the schedulable region; the analogous abstraction for multimedia devices is called the multimedia capacity region. Using these capacity regions, a quantitative solution for locally scheduling flows that meet end-to-end QoS requirements is described using standard solutions for a real-time bin packing problem.
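The schedulable region reduces local admission control to a geometric test: a new flow in a given class is admitted only if the resulting vector of per-class flow counts remains inside the region. A minimal sketch follows, with a hypothetical linear boundary standing in for a measured region:

```python
# Admission test against a schedulable region: the vector of per-class
# flow counts must stay inside the region for QoS to be guaranteed
# locally. The linear boundary below is a stand-in for a measured region.

def in_region(flows_per_class: list[int], weights: list[float]) -> bool:
    """Hypothetical region: the weighted flow count must not exceed 1.0,
    where weights[i] is the capacity fraction one class-i flow consumes."""
    return sum(n * w for n, w in zip(flows_per_class, weights)) <= 1.0

def admit(flows_per_class: list[int], new_class: int,
          weights: list[float]) -> bool:
    candidate = list(flows_per_class)
    candidate[new_class] += 1
    return in_region(candidate, weights)

# Two traffic classes: a video flow uses 10% of capacity, a telemetry
# flow 1%. With nine video flows active a tenth is admitted; with ten
# active, an eleventh is rejected.
print(admit([9, 0], 0, [0.10, 0.01]))   # True
print(admit([10, 0], 0, [0.10, 0.01]))  # False
```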

Although XRM is defined with respect to ATM switches, and efficiency is partially gained through hardware-based intelligent resource allocation at the data link layer [9], the concept of associating QoS properties with a flow to provide an end-to-end perspective and the concept of QoS-aware traffic classes to support more scalable local resource control are equally applicable to other communication protocols and architectures. The corresponding Internet architectures are Integrated Services (IntServ) (RFC 1633) and the Differentiated Services (DiffServ) architecture (RFC 2475 & 2638). IntServ supports per-flow QoS requirements through an explicit signaling and resource reservation mechanism provided by RSVP (RFC 2205 & 2210). Since state information grows proportionally to the number of flows, the resulting increase in complexity and the large storage and processing overhead limit the scalability of the IntServ architecture. DiffServ addresses these issues by aggregating flows into traffic classes or services based on their QoS properties, allowing a simple and efficient implementation at the core network. Resource requirements are managed through service level agreements (SLAs) at domain boundaries, which can be either static or dynamic. However, the scheduling and the dynamic resource provisioning, reservation, and allocation necessary to develop adaptable systems that maintain end-to-end service levels under variable load conditions remain unsolved problems [10]. Another limitation is that implementing guaranteed service in either the IntServ or the DiffServ architecture requires ubiquitous deployment, an unrealistic expectation in the heterogeneous Grid or GIG environment [11].

Grid QoS Management (G-QoSM) represents some initial work on a QoS framework in the Grid environment [12]. Compatible with the Open Grid Services Architecture (OGSA) specification and the Globus Toolkit 3.0, G-QoSM provides a QoS management overlay on top of existing Grid architectures. The framework consists of a distributed QoS Grid Service (QGS), a QoS policy service, various local resource managers, and an extended form of the Universal Description, Discovery and Integration (UDDIe) registry. This extension supports publishing and, subsequently, searching and discovering services based on QoS properties.

QGS brokers a service request with QoS constraints by first using the registry in combination with a selection service to select the “best matched” service. Admission control is provided by the corresponding resource managers and the QoS policy service. To enhance scalability and support run-time resource policy management, the policy controlling who may reserve resources, and how and when reservations should be granted, is decoupled from resource management into a separate service. After co-reserving all of the necessary resources, QGS offers a service level agreement, or SLA, to the client.
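The brokering step can be read as filter-then-rank over a QoS-annotated registry, followed by an SLA offer. The sketch below is our paraphrase of that flow; the field names and functions are illustrative, not the G-QoSM API.

```python
# Sketch of QGS-style brokering: filter registry entries on hard QoS
# constraints, rank the survivors, and offer an SLA for the best match.
# Field and function names are illustrative, not the G-QoSM API.

registry = [
    {"name": "imgfusion-a", "latency_ms": 40, "availability": 0.999},
    {"name": "imgfusion-b", "latency_ms": 15, "availability": 0.990},
]

def broker(request: dict) -> dict | None:
    matches = [s for s in registry
               if s["latency_ms"] <= request["max_latency_ms"]
               and s["availability"] >= request["min_availability"]]
    if not matches:
        return None                        # admission control rejects
    best = min(matches, key=lambda s: s["latency_ms"])
    return {"provider": best["name"],      # the SLA offered to the client
            "latency_ms": best["latency_ms"],
            "availability": best["availability"]}

print(broker({"max_latency_ms": 50, "min_availability": 0.99}))
# {'provider': 'imgfusion-b', 'latency_ms': 15, 'availability': 0.99}
```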

Once the SLA is approved, the QGS instantiates the composite service or workflow via the resource managers, namely, the user-level, priority-based Dynamic Soft Real-Time CPU Scheduler (DSRT) and the DiffServ-based Network Resource Manager (NRM). The QoS techniques needed to enforce the agreed SLA under variable load conditions, such as monitoring, renegotiation, and adaptation, are listed as future enhancements; in the existing framework, they are primarily left up to the resource managers and the application.

A promising approach to building a QoS-aware distributed workflow management system, which requires modeling, estimating, and synthesizing QoS metrics over multi-domain composite Web services, is presented in Service-Oriented Middleware (SOM). The main contribution in SOM is a “stochastic workflow reduction algorithm” (SWR) for synthesizing a workflow’s combined QoS metrics from those of its components [13].

The notion of “controlled load” and the handling of service costs are developed further in work on QoS-sensitive applications and QoS-driven service planning [14]. In this work, a computational model is defined to trade off different QoS parameters when developing a feasible solution to a multi-objective optimization problem. The framework maps services to resources and quality objectives, derives an objective function that maximizes each quality parameter, and then defines a weighted-sum solution to combine all of the quality objectives into a feasible solution [14].
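Weighted-sum scalarization reduces the multi-objective problem to picking the candidate that maximizes a single weighted score. A toy sketch, assuming quality parameters already normalized to [0, 1] with higher values better (the plans, weights, and values are invented for illustration):

```python
# Weighted-sum scalarization of a multi-objective QoS decision.
# Assumes each quality parameter is normalized to [0, 1], higher = better.

def score(qualities: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[k] * qualities[k] for k in weights)

candidates = {
    "plan-1": {"timeliness": 0.9, "reliability": 0.6, "cost": 0.4},
    "plan-2": {"timeliness": 0.7, "reliability": 0.9, "cost": 0.8},
}
weights = {"timeliness": 0.5, "reliability": 0.3, "cost": 0.2}

best = max(candidates, key=lambda p: score(candidates[p], weights))
print(best, score(candidates[best], weights))  # plan-2 0.78
```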


Web Service Level Agreement (WSLA) is a framework to define and monitor SLAs in multi-domain environments [15]. One characteristic of this framework is that the resources provide their own metrics, which are aggregated according to some function. The metrics are the typical ones that can be retrieved from the provider’s infrastructure, such as routers, servers, and middleware. SLA parameters are based on these aggregated or composite metrics, each of which has a high and a low threshold, and these SLA parameters drive the runtime configuration of resources. An SLA is attached to the WSDL description of a service, though it may also refer to a composite Web service. Monitoring may involve one or more third parties who may act as either proxies or agents to build trust among the parties that establish the SLA. The runtime architecture consists of five services: 1) an SLA Establishment Service, 2) an SLA Deployment Service, 3) a Measurement Service, 4) a Conditional Evaluation Service, and 5) a Management Service. The SLA Establishment Service establishes the SLA parameters and defines the roles of all parties. The SLA Deployment Service, the Measurement Service, and the Conditional Evaluation Service jointly provide the monitoring of SLA compliance. These services are implemented as Web services. The Management Service, along with the Business Entity, which “embodies the business knowledge, goals and policies,” provides the adaptation logic, taking into account the cost of terminating an SLA when a violation is detected. The authors then describe the WSLA language, based on an XML Schema that defines the abstract data types needed to specify the SLA. The main components of the language are 1) the Parties involved in the SLA, 2) the Service Description, which characterizes the service and the monitoring parameters, and 3) the Obligations, which specify the constraints on the SLA parameters as well as the actions to take upon a violation.
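The essentials of a WSLA obligation (an SLA parameter derived from an aggregated metric, bounded by thresholds, with an action on violation) fit in a few lines. The sketch below is a schematic reading of those concepts in Python, not the WSLA XML schema itself:

```python
# Schematic of a WSLA-style obligation: an SLA parameter built from an
# aggregated (composite) metric, checked against low/high thresholds,
# with an action on violation. Mirrors the concepts, not the WSLA schema.
from statistics import mean

def avg_response_time(samples_ms: list[float]) -> float:
    """Composite metric: aggregation over provider-supplied measurements."""
    return mean(samples_ms)

class Obligation:
    def __init__(self, low: float, high: float, on_violation):
        self.low, self.high = low, high
        self.on_violation = on_violation   # hook into the Management Service

    def evaluate(self, value: float) -> None:
        if not (self.low <= value <= self.high):
            self.on_violation(value)

ob = Obligation(0, 200, lambda v: print(f"SLA violated: {v:.0f} ms"))
ob.evaluate(avg_response_time([120, 180, 150]))  # within thresholds
ob.evaluate(avg_response_time([240, 260, 250]))  # triggers the action
```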

2.2 Requirements and Applicability of QoS Frameworks for “Edge” Applications

“Edge” systems in the GIG environment may be highly heterogeneous, varying in terms of resource capacities, functional requirements, and domains of control, and thus resulting in a plethora of systems, protocols, and policies. Just as importantly, these systems are expected to be highly mobile, self-organizing into ad-hoc networks. For example, in the airborne environment, “edge” systems join, leave, and self-organize into multi-hop ad-hoc networks at speeds reaching 2000 knots. This environment is further complicated by the fact that “edge” systems typically communicate over wireless links that have orders of magnitude less capacity than their wired counterparts and are susceptible to interference and mobility effects, resulting in rapid variation in bandwidth, connectivity, latency, and link quality over time. Furthermore, because “edge” systems are both producers of sensor data and consumers of the aggregated and fused information needed for decision making, they may act as both clients and servers in the Global Information Grid.

A collaboration framework enabling effective information flow with QoS assurances among disparate and highly mobile “edge” systems must thus satisfy the following requirements: 1) a rich specification language that supports the mapping of complex service demands with QoS attributes onto a heterogeneous, highly dynamic, and mobile set of resources; 2) a selection service that extends the QoS-aware UDDIe registry with mobility-related QoS properties, to support publishing and, subsequently, searching for and discovering services based on QoS properties that include mobility characteristics; 3) predictive and intelligent approaches for admission control and co-reservation based on negotiated SLAs and QoS properties, including an accurate prediction of the establishment, management, and execution costs for an optimal set of mobile components needed in the collaboration; 4) a distributed monitoring approach with aggregation, hierarchies, and proxy agents to manage scalability and mobility; and finally, 5) an AI planning approach to synthesize the complex task sequences required for controlled adaptation, enabling migration of services in a predictable and time-phased manner to support fault handling and the mobile nature of “edge” systems.

Currently available Grid workflow languages already provide a rich set of features for collaboration and composition, described in more detail in Section 3. To enhance support for providing assurances in this highly mobile and heterogeneous environment, the specification language for the GIG should extend these workflow languages with the per-flow QoS attribute mapping features explored in multimedia frameworks such as QoS-A and XRM. Scalability and performance will require language support for specifying aggregation of the properties into resource or service pools and service classes, as described in XRM. Language support for mobility, such as locale, relative positions, and a path, i.e., a service locale that changes over time, will be required to support highly mobile “edge” systems. Specifically, mobility language extensions will ease the specification of the following QoS capabilities: 1) capacity planning support, enabling the generation of multipath configurations comprising close entities capable of providing the same service; 2) fault tolerance support, enabling selection of redundant service providers that are close and potentially stably connected; and 3) service continuity support, enabling scheduled migrations in an unstable and highly mobile environment.

The execution architecture for this highly dynamic and mobile “edge” environment is likely to resemble that of a supervisory control system with a standard inner and outer loop structure, similar to the control and management planes defined in XRM. The inner loop provides the performance-centric assured services, that is, the functionality of the system being controlled. This loop may rely on predictive knowledge to pre-initiate migrations for service continuity in a mobile environment. The outer loop, which operates at a much slower time scale, provides the complex fault detection and isolation logic needed to monitor for and detect SLA violations within the distributed workflow. It will also contain the AI planning systems that address the multi-objective optimization problem of safely adapting the system with minimal QoS degradation and maximum stability when possible, and renegotiating the violated SLAs otherwise. Both forms of fault handling may involve re-initiating the discovery process.
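The two-loop structure can be sketched as two periodic tasks operating on different time scales: a fast inner loop that services flows and pre-initiates migrations, and a slow outer loop that checks SLAs and replans. The skeleton below illustrates only that shape; the periods and function names are invented for illustration.

```python
# Skeleton of the supervisory structure: a fast inner loop for assured
# service delivery and a slow outer loop for SLA checking and replanning.
# Loop periods and body logic are placeholders, purely illustrative.
import threading
import time

def inner_loop(stop: threading.Event) -> None:
    while not stop.is_set():       # fast time scale: assured services
        # deliver flows; pre-initiate migrations from predicted mobility
        time.sleep(0.1)

def outer_loop(stop: threading.Event) -> None:
    while not stop.is_set():       # slow time scale: supervision
        # evaluate SLAs; on violation, adapt the workflow or renegotiate
        time.sleep(5.0)

stop = threading.Event()
for loop in (inner_loop, outer_loop):
    threading.Thread(target=loop, args=(stop,), daemon=True).start()
time.sleep(1.0)                    # run briefly for demonstration
stop.set()
```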

3.0 Workflow

A workflow is a way of specifying and orchestrating a sequence of tasks, managing the flow of data between various entities, and handling exceptions whenever necessary. Workflow frameworks provide the necessary tools to ensure that the activities in the workflow script are executed with the specified sequence, timing, service, and location. In the standard Web service model, the service provider is responsible for managing the lifetime of the service and the execution of user jobs in the service; users are only responsible for accessing and using the services.

3.1 Design of Workflow Systems

The currently available Grid middleware consists of the modules necessary to enable scientists to access, manage, execute, authenticate, and authorize the use of services. Successful application development and execution often needs complex composition of these modules. Workflows play an important role in capturing the details of application composition, so that end users, such as e-scientists, can be shielded from the complex details of composing the various components of an application using modules in grid middleware technologies.

Kepler [16] is designed to improve the reusability of scientific workflow systems by abstracting widely used individual grid functions, such as job scheduling, data movement, and remote execution, into reusable components. Components within the same domain can be interchangeably bound to different implementations, such as GridFTP or scp for data movement. This enables scientists to compose workflows from components and sub-workflows in a plug-n-play fashion. Kepler is also designed to interoperate with Web services standards; for example, a Kepler component (also called an actor) can accept the URL of a WSDL document and allow the instantiation of any operation specified in the document. The actor serves as a client to the Web service described by the WSDL document and can be incorporated into a scientific workflow as a local component. Another popular toolkit, Taverna [17], provides a language and the necessary tools for scientists to effectively use distributed computing technology via workflows. These workflows can be composed using components located locally as well as on remote machines. Apart from executing these workflows, Taverna provides hooks that enable end users to graphically compose, edit, and visualize workflows. This enables scientists to carry out experiments in a systematic and repeatable manner.
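Kepler's plug-n-play binding of an abstract function to interchangeable implementations is essentially dependency injection. The Python sketch below shows only the pattern (Kepler actors themselves are Java components; the class and function names here are hypothetical):

```python
# Hypothetical analogue of Kepler's interchangeable bindings: one
# abstract "data movement" role, two concrete implementations that a
# workflow step can use without knowing which is bound.
from typing import Protocol

class DataMover(Protocol):
    def move(self, src: str, dst: str) -> None: ...

class GridFTPMover:
    def move(self, src: str, dst: str) -> None:
        print(f"gridftp: {src} -> {dst}")

class ScpMover:
    def move(self, src: str, dst: str) -> None:
        print(f"scp: {src} -> {dst}")

def stage_inputs(mover: DataMover) -> None:
    # A workflow step written against the abstract role only.
    mover.move("site-a:/data/run1", "site-b:/scratch/run1")

stage_inputs(GridFTPMover())   # rebind the implementation without
stage_inputs(ScpMover())       # changing the workflow itself
```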


The industry standard for workflows is BPEL [3], which relies heavily on Web services based technologies and is tailored for the definition and execution of business processes. It provides support for long-running transactions and event management, for example. The workflow specification uses an expressive language with a multitude of features, including mature exception handling, support for events, and persistence for long-running workflows.

Sahai et al. of HP Laboratories provide a comprehensive framework for monitoring Web services SLAs [18]. They observe that although business process languages such as ebXML exist to support interactions amongst Web services, and WSFL or XLANG can define a process for that interaction, guarantees of SLA compliance are still missing from Web services. Their work formalizes the definition of SLAs in terms of Service Level Objectives (SLOs). Attributes within these objectives are tied to evaluation functions using measurement or monitoring clauses such as “evalWhen” and “measuredAt,” and callback handlers are used to process violations. The framework is made up of a service level monitoring (SLM) engine combined with a management proxy wrapper for specifying the instrumentation and managing the distributed collection of measurements. The authors recognize a need to monitor SLAs for cross-domain applications. To address this need, and to specify what, how often, and where measurements will take place, they introduce a Web Services Flow Language (WSFL) based “SLA monitoring process flow” and the Measurement Exchange Protocol. The collection process is initiated by the Web service itself. To keep the application logic simple and to ensure control over detecting compliance with the SLAs, the authors argue that the best approach is to add the instrumentation logic within the SOAP toolkit itself. Although effective, this approach unfortunately limits the method to the toolkits that support the modifications.
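An SLO in this formalization ties an attribute to where it is measured and when it is evaluated, with a callback handler on violation. A minimal Python rendering follows; the clause names come from the paper, while the surrounding class and values are our illustration.

```python
# Minimal rendering of an SLO with "measuredAt" and "evalWhen" clauses
# and a violation callback handler, after Sahai et al. [18]. The clause
# names are from the paper; the class and sample values are illustrative.
class SLO:
    def __init__(self, attribute: str, threshold: float,
                 measured_at: str, eval_when: str, on_violation):
        self.attribute, self.threshold = attribute, threshold
        self.measured_at = measured_at    # e.g., a management proxy
        self.eval_when = eval_when        # e.g., "on-message-arrival"
        self.on_violation = on_violation  # callback handler

    def evaluate(self, measurement: float) -> None:
        if measurement > self.threshold:
            self.on_violation(self.attribute, measurement)

slo = SLO("responseTime", 200.0, "service-proxy", "on-message-arrival",
          lambda attr, v: print(f"violation: {attr} = {v} ms"))
slo.evaluate(180.0)   # compliant, no action
slo.evaluate(250.0)   # triggers the callback handler
```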

3.2 Requirements and Applicability of Workflows for “Edge” Applications

The requirements of “edge” systems include the following: (1) a specification language rich enough to express the various possible causal and temporal dependencies among the tasks of a GIG application; (2) a workflow manager (or execution framework) that abstracts away the complex details of remote and local execution of tasks; (3) a framework that is capable of interacting with monitoring systems and can route events between the different components of the workflow; (4) a user interface (UI) that enables intuitive and dynamic steering of “edge” applications, for situations when a commander or an intelligent “edge” system agent needs to react to urgent events; and (5) the capability to dynamically adapt the workflow specification and replace individual components with different ones, so that mix-and-match operations can be carried out in search of an efficient and optimal execution of the workflow.

The grid workflow systems described in Section 3.1 provide a rich set of features that meet many of the requirements listed for “edge” systems. BPEL, Taverna, and Kepler have powerful workflow specification languages and execution frameworks that have been successfully used in a wide variety of scientific applications. Taverna and Kepler have intuitive and powerful Graphical User Interfaces (GUIs) that make it easy to save configurations and repeat experiments multiple times. This allows scientists to inspect, reuse, and modify past workflow scripts along with the provenance and metadata for each run. Support for dynamically swapping workflow components is well designed and tested in the Kepler framework. Most grid workflow systems allow some form of interaction with and response to events.

An example of a grid application whose workflow requirements match those of the GIG and “edge” systems is the NSF ITR project Linked Environments for Atmospheric Discovery (LEAD) [19]. The LEAD project, which uses and extends BPEL for workflow, tackles the workflows associated with tornado and hurricane prediction. These predictions can be carried out both in a static mode (using stored data samples) and in a dynamic mode driven by weather patterns as they develop. The developed framework takes into account the various scenarios that emerge as a result of data gathered from radars that scan surface heat conditions in real time.

The rate at which sensor data may be produced in “edge” systems can match, and sometimes exceed, that found in traditional grid applications. Additionally, the dynamic and ad-hoc nature of “edge” systems, along with the need for performance, drives the need to optimize workflows. Thus, it is important to model, estimate, and synthesize QoS metrics over multi-domain Web services, which is not currently addressed by the widely used workflow systems in the grid. A promising approach could be to combine the work on the stochastic workflow reduction algorithm (SWR) [13] with the workflow framework used in LEAD.
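The core of SWR-style synthesis is that composite QoS follows the workflow structure. A toy sketch under the standard composition rules (times add along a sequence, a parallel branch costs its slowest member, reliabilities multiply); this illustrates the idea only, not the full SWR algorithm of [13]:

```python
# Toy synthesis of workflow QoS by structural reduction: sequences add
# latency, parallel branches take the max, and reliabilities multiply.
# Standard composition rules only, not the full SWR algorithm.

def prod(xs):
    result = 1.0
    for x in xs:
        result *= x
    return result

def seq(*tasks):
    return {"latency": sum(t["latency"] for t in tasks),
            "reliability": prod(t["reliability"] for t in tasks)}

def par(*tasks):
    return {"latency": max(t["latency"] for t in tasks),
            "reliability": prod(t["reliability"] for t in tasks)}

sense  = {"latency": 20, "reliability": 0.99}   # hypothetical tasks
fuse   = {"latency": 50, "reliability": 0.95}
route1 = {"latency": 30, "reliability": 0.98}
route2 = {"latency": 45, "reliability": 0.97}

print(seq(sense, par(route1, route2), fuse))
# {'latency': 115, 'reliability': 0.894...}
```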

4.0 Discovery and Monitoring Systems

Grid discovery and monitoring systems are designed with a robust and scalable architecture to address the various challenges that stem from the widely distributed and heterogeneous nature of grids. The goal of these systems is to provide the necessary modules to search, query, collect, aggregate, process, and sometimes react to monitoring data with specific actions. Determining the current state of the various distributed components, including detection of new failures, is vital for ensuring the smooth functioning of a grid.

4.1 Design of Grid Discovery and Monitoring Systems

Scalability is a key concern in the design of grid monitoring and discovery systems. The architecture should be capable of handling a large number of messages, the addition of new clients and servers, and the incorporation of new services in an efficient manner. MDS4 is an example of a discovery and monitoring system that uses a two-layered approach to support scalability, domains of control, and performance [20]. The two layered services are (1) the Grid Index Information Service (GIIS) and (2) the Grid Resource Information Service (GRIS). The GRIS service manages the publication and availability of the resources and services at each local node, whereas a centralized GIIS server, managed by a lead site for each organization, provides the interface to the discovery services for that organization, which can be defined as a domain or a topologically co-located set of nodes. These organization-centric GIIS servers work together, typically interacting through the distributed Lightweight Directory Access Protocol (LDAP), to provide the discovery services infrastructure.
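The two layers behave like a local inventory per node fronted by a per-organization index. The sketch below shows only that delegation pattern; the classes and record fields are hypothetical, not the MDS4 or LDAP interfaces.

```python
# Delegation pattern of the MDS4-style two-layer hierarchy: a per-node
# GRIS publishes local resources, and a per-organization GIIS fans a
# discovery query out to its members. Classes and fields are hypothetical.

class GRIS:                        # local node information service
    def __init__(self, node: str, resources: list[dict]):
        self.node, self.resources = node, resources

    def query(self, kind: str) -> list[dict]:
        return [r for r in self.resources if r["kind"] == kind]

class GIIS:                        # organization-level index
    def __init__(self, members: list[GRIS]):
        self.members = members

    def discover(self, kind: str) -> dict[str, list[dict]]:
        return {g.node: g.query(kind) for g in self.members}

org = GIIS([GRIS("node-1", [{"kind": "cpu", "free_cores": 8}]),
            GRIS("node-2", [{"kind": "storage", "free_gb": 500}])])
print(org.discover("cpu"))         # {'node-1': [...], 'node-2': []}
```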

An example of a monitoring system is MonALISA (Monitoring Agents using a Large Integrated Services Architecture), which is designed as a collection of autonomous multi-threaded, self-describing agents [21]. These dynamic agents can discover each other and then cooperate and collaborate in a distributed fashion to perform various tasks related to information collection and processing. This loosely coupled design of MonALISA aligns well with the scalability requirement of the Grid and provides an effective way to efficiently handle the complexity of large-scale monitoring.

Web services based standards and protocols, such as WSDL [22], SOAP [23], WSRF [24], WS-Discovery [25], and other XML technologies, are ideal to facilitate interoperable communication between the various components of the monitoring and discovery system. Due to the self-describing nature of many Web services based specifications, new services can be discovered and connected at run-time, resulting in an autonomous design that can provide reliability in grids where it is not uncommon for some components to fail. For example, MDS4 provides mechanisms for service and resource discovery, and is compliant with the interfaces and behaviors defined in the WS-Notification and WS-Resource Framework specifications.

Even though grids allow mobile and wireless devices to be connected on the “edge,” most of the current grid systems primarily use wired connectivity for a majority of the resources. As a result, to enable seamless interaction between various monitoring components and bootstrapping for new clients and services, it is assumed in monitoring systems, such as MonALISA, that some dedicated lookup and registry services exist at well known locations. Such a distributed registry set can be effectively used to obtain pointers to other services in close proximity to any given client. Triggers are associated with these registries to immediately react to specific resource conditions.

Load balancing is important to manage the vast amount of monitoring information that is collected at various granularities. Effective dissemination of this information, along with active replication and re-activation of services, is necessary to ensure that timely responses are provided for client queries. Brokers, global schedulers, and local schedulers need information of different granularity to correctly match the requirements of pending jobs with available resources. Grid monitoring systems, as a result, provide aggregation services as part of the middleware and allow cooperation and composition to build composite services. Aggregation is also important to prevent the flooding of the grid network with monitoring information. By carefully aggregating information based on current and anticipated future needs, the available network bandwidth can be effectively utilized.

Many tools exist for obtaining local resource information, and monitoring systems are designed to allow the integration of these tools into their frameworks. The Nagios toolkit [26] has daemons that run periodically to monitor the status of hosts, networks, and user-specified services. Monitoring information, along with logs and reports, is made available for visualization via a browser. Hawkeye [27], developed as part of the Condor project, uses a language (called ClassAds) for collecting, describing, and reporting information on resources. ClassAds can be used to monitor various attributes, including available memory, free disk space, load, and process state. Ganglia [28] is a monitoring system targeted at high-performance computing systems such as clusters and grids. It balances the need for interoperability and performance by using XML for data representation and XDR for data transport.

To prevent the use of stale monitoring information for resource parameters that may change at a rapid rate, monitoring information is typically provided along with a time-to-live (TTL) value. This ensures that all decisions that depend on the current state of a resource take into account the freshness of the monitoring information. Additionally, the TTL value plays a key role in determining whether, where, and for how long the information should be cached by receiving nodes. Dedicated and specialized registries and services use soft-state mechanisms for automated clean-up and management of information such as client references.
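The TTL mechanism amounts to a freshness stamp on each record that both consumers and caches honor. A minimal sketch, with hypothetical names:

```python
# Minimal TTL-stamped monitoring record: consumers and caches discard
# the value once its time-to-live has elapsed. Names are illustrative.
import time

class MonitoringRecord:
    def __init__(self, value, ttl_s: float):
        self.value = value
        self.expires = time.monotonic() + ttl_s

    def fresh(self) -> bool:
        return time.monotonic() < self.expires

load = MonitoringRecord(value=0.87, ttl_s=2.0)  # fast-changing: short TTL
if load.fresh():
    print("schedule using load =", load.value)
else:
    print("stale; re-query the monitor")        # soft-state expiry path
```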

Publish-subscribe is a simple and effective model to decouple the collection and dissemination of monitoring information. Collecting modules send the monitoring information to Monitoring Channels, which provide many features to clients, including querying for historical information, filtering based on a specified set of rules, and persistence beyond the lifetime of the source components. In some cases, the Channel can directly connect clients and information sources to allow the transmission of sensitive data via negotiated security protocols. An example of a monitoring system based on the publish-subscribe model is R-GMA [29].
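A monitoring channel in this model sits between producers and consumers, applying each subscriber's filter rules and retaining history beyond a producer's lifetime. A compact sketch of that behavior (our names, not the R-GMA interfaces):

```python
# Compact publish-subscribe monitoring channel: producers publish,
# each subscriber receives only events matching its filter rule, and
# history persists in the channel. Our names, not the R-GMA interfaces.

class MonitoringChannel:
    def __init__(self):
        self.history: list[dict] = []     # persists past the producers
        self.subscribers = []             # (filter_fn, callback) pairs

    def subscribe(self, filter_fn, callback) -> None:
        self.subscribers.append((filter_fn, callback))

    def publish(self, event: dict) -> None:
        self.history.append(event)
        for filter_fn, callback in self.subscribers:
            if filter_fn(event):
                callback(event)

chan = MonitoringChannel()
chan.subscribe(lambda e: e["cpu"] > 0.9,   # rule-based filtering
               lambda e: print("hot node:", e["node"]))
chan.publish({"node": "n1", "cpu": 0.95})  # delivered to the subscriber
chan.publish({"node": "n2", "cpu": 0.40})  # filtered out, kept in history
```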

4.2 Requirements and Applicability of Grid Monitoring Systems on the “Edge”

The requirements for discovery and monitoring for “edge” systems include the following: (1) a load balancing mechanism: “edge” applications may consist of a vast number of sensors collecting information of different sizes at varying frequencies, and the capacity of a single monitoring or discovery server may not suffice to handle this load; (2) timely response to monitoring information, which can often be critical to the success of a mission; (3) dissemination of monitoring information that does not flood the network; and (4) a mechanism for effective caching of information, along with means for automating the cleanup and management of the registries used for monitoring and discovery.

The MonALISA system, with its use of a collection of autonomous agents to provide services, aligns well with the load balancing requirement. Its multi-threaded engine will allow the various GIG components to host loosely coupled services, register themselves in registries, and be automatically discovered by interested clients. This autonomous functioning of the monitoring environment is also suitable for the GIG. However, just like most other grid systems, MonALISA requires the use of a few dedicated lookup and registry services. This particular design feature may prove to be a major impediment in the functioning of the GIG, which is characterized by a highly mobile, dynamic, and ad-hoc network environment. A feature that is unique to “edge” systems is the need for services that provide the scheduled location of a service, a plane's flight path for example, as opposed to a single reference in a registry. An interested client can then consult the schedule to contact the appropriate local registry and obtain a temporary reference.
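Such a schedule service could map mission time to the registry responsible for a platform's planned position. The sketch below is a hypothetical illustration of that lookup, with a flight plan rendered as timed waypoints; all names and values are invented.

```python
# Hypothetical scheduled-location service: a flight plan maps time
# windows to the local registry covering the platform's planned
# position. All names and data below are invented for illustration.

flight_plan = [                    # (start_s, end_s, local registry)
    (0,    600,  "registry.sector-a.example"),
    (600,  1200, "registry.sector-b.example"),
    (1200, 1800, "registry.sector-c.example"),
]

def registry_at(t_s: float) -> str | None:
    """Which local registry should a client consult at mission time t_s?"""
    for start, end, registry in flight_plan:
        if start <= t_s < end:
            return registry
    return None                    # platform outside the planned schedule

print(registry_at(750))  # registry.sector-b.example -> temporary reference
```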

Due to the highly critical functionality provided by the GIG, it requires guarantees on the Quality of Service (QoS) for timely delivery of monitoring information. These requirements include non-real-time, soft real-time, as well as hard real-time guarantees. Currently, most grid monitoring systems are designed to provide only best-effort assurances. Moreover, the use of the Web services stack adversely affects the performance of the system. This is not adequate for the GIG. The performance optimizations employed by MDS4 and MonALISA need to be coupled with new algorithms and techniques to ensure the QoS required by the GIG.

The GIG may require the management of a large volume of data. The design and implementation of MDS4 is well suited to handling data at this scale. MDS4 has been successfully used in many large-scale applications, including climate modeling via the Earth System Grid (ESG) community resource, which has a portal that interacts with thousands of users downloading many terabytes of data [30].

The heterogeneity of the GIG, in terms of software, hardware, metrics, formats, domains, and governance, requires the use of interoperable, self-describing protocols that adhere to well-established ontologies. Web services based standards provide a good starting point for these protocols. In recent years, the GLUE schema [31] has emerged as an abstract model of grid resources for use in Grid Information Services. This work needs to be further enhanced to design an ontology that captures all the requirements of a mobile and ad-hoc environment.

5.0 Conclusions

The GIG and “edge” system infrastructure can be classified into two categories: (1) the backbone, with resources that can match the state of the art in grid computing; and (2) the “edge” services, which need specialized design and implementation that is not fully available in grid middleware. In this paper we chose three important aspects of grid computing, namely monitoring, workflow, and QoS, and studied the applicability of current technology to applications on the “edge.”

Grids consist of heterogeneous resources from geographically distributed locations with a software stack that is often uniform across all resources, though each host site may have different local policies on how its resources are used. In the GIG, even though the resources may be distributed, the domains are coarser: the Army, Air Force, and Navy have different ontologies, and the coalitions they work in may be less well defined. The capabilities differ even within a domain. As a result, further work is needed to design a light-weight software stack for devices on the “edge” that takes this heterogeneity into account. Also, in contrast to grids, “edge” systems need to be designed for networks where bandwidth is often limited.

The widespread use of XML based technology in grid middleware implies a software stack that is memory intensive and has performance limitations. While the infrastructure of the GIG backbone can match the computation and storage resources available on the Grid, the “edge” devices need a software stack that takes into account limited memory and processing power.

It will be useful to determine a way to bound, based on characteristics of the problem domain, the negotiation cost and time for an SLA. Such a bounding function can then be used as an input to the decision problem of when to use the Web services negotiation architecture. Other important considerations include fault tolerance, scalability, performance (both in terms of XML processing and security overhead), and compliance monitoring and auditing.

6.0 References

[1] Core Services - Net-Centric Enterprise Services, Defense Information Systems Agency, Department of Defense, http://www.disa.mil/main/prodsol/cs_nces.html.

[2] “Global Information Grid - The GIG Vision Enabled by Information Assurance,” National Security Agency, http://www.nsa.gov/ia/industry/gig.cfm?MenuID=10.3.2.2.


[3] BPEL: Business Process Execution Language for Web services, Version 1.1, 2005, http://www-28.ibm.com/developerworks/library/specification/ws-bpel/.

[4] Aurrecoechea, Cristina, Andrew T. Campbell, Linda Hauw, “A Survey of QoS Architectures,” in 4th IFIP International Conference on Quality of Service, Paris, France, March 1996.

[5] Campbell, Andrew, Geoff Coulson, Francisco García, David Hutchison, Helmut Leopold, 1993, Integrated Quality of Service for Multimedia Communications, Proceedings of the IEEE INFOCOM '93.

[6] Lazar, Aurel, Shailendra K. Bhonsle, Koon Seng Lim, 1994, A Binding Architecture for Multimedia Networks, Proceedings of COST-237 Conference on Multimedia Transport and Teleservices, Vienna, Austria, 1994.

[7] Hyman, Jay, Aurel Lazar, Giovanni Pacifici, 1991, Real-Time Scheduling with Quality of Service Constraints, IEEE Journal on Selected Areas in Communications, Vol. 9, No. 7, September 1991.

[8] Lazar, Aurel, 1994, Challenges in Multimedia Networking, Proceedings of the International Hi-Tech Forum, Osaka, Japan, February 24-25, 1994, pp. 24-33.

[9] Lazar, Aurel, Adam Temple, Rafael Gidron, 1990, An Architecture for Integrated Networks that Guarantees Quality of Service, International Journal of Digital and Analog Communication Systems, Vol. 3, pp. 229-238.

[10] Karagiannis, Georgios, Vlora Rexhepi, Geert Heijenk, 2000, A Framework for QoS & Mobility in the Internet Next Generation, Ericsson Business Mobile Networks B.V., Internet Next Generation Report, 2000.

[11] Xiao, Xipeng, Lionel Ni, 1999, Internet QoS: A Big Picture, IEEE Network, Vol. 13, No. 2, pp. 8-18.

[12] Al-Ali, Rashid, Kaizar Amin, Gregor von Laszewski, Omer Rana and David Walker, 2003, An OGSA-Based Quality of Service Framework, Proceedings of the Second International Workshop on Grid and Cooperative Computing (GCC2003), Shanghai, China, December 2003.

[13] Sheth, Amit, Jorge Cardoso, John Miller, Krys Kochut, 2002, QoS for Service-oriented Middleware, Proceedings of the Conference on Systemics, Cybernetics and Informatics, Orlando, FL, July 2002.

[14] Musunoori, Sharath B., Frank Eliassen, Viktor S. Wold Eide, 2005, QoS-Driven Service Configuration in Computational Grids, Grid 2005 - 6th IEEE/ACM International Workshop on Grid Computing, Seattle, Washington, USA, November 13-14, 2005.

[15] Keller, Alexander, Heiko Ludwig, Defining and Monitoring Service Level Agreements for dynamic e-Business, Proceedings of LISA '02: Sixteenth Systems Administration Conference, Philadelphia, PA, USA, November 2002.

[16] Altintas, I., C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock, Kepler: An Extensible System for Design and Execution of Scientific Workflows, 2004.

[17] http://taverna.sourceforge.net/, 2006.

[18] Sahai, Akhil, Vijay Machiraju, Mehmet Sayal, Aad van Moorsel, Fabio Casati, “Automated SLA Monitoring for Web services,” IEEE/IFIP DSOM 2002, Montreal, Canada, Oct. 2002 (also HPL-2002-191).

[19] Droegemeier, K. K., et al., Linked environments for atmospheric discovery (LEAD): A cyberinfrastructure for mesoscale meteorology research and education, in 20th Conf. on Interactive Info. Processing Systems for Meteorology, Oceanography, and Hydrology, January 2004.

[20] “A Performance Study of Monitoring and Information Services for Distributed Systems,” X. Zhang, J. Freschl, and J. Schopf, Proceedings of HPDC, August 2003.

[21] “MonALISA: An Agent based, Dynamic Service System to Monitor, Control and Optimize Grid based Applications,” C. Legrand, H. B. Newman, R. Voicu, C. Cirstoiu, C. Grigoras, M. Toarta, C. Dobre, In CHEP 2004, Interlaken, Switzerland, September 2004.

[22] Christensen, E., F. Curbera, G. Meredith, S. Weerawarana, “Web services Description Language (WSDL) 1.1,” http://www.w3.org/TR/wsdl, 2001.


[23] W3C, SOAP Version 1.2, http://www.w3.org/TR/soap/.

[24] Snelling, David, Ian Robinson, Tim Banks, Oasis Web Services Resource Framework (WSRF), 2006.

[25] Web Services Dynamic Discovery, http://schemas.xmlsoap.org/ws/2005/04/discovery/, April 2005.

[26] http://nagios.org/, 2006.

[27] A Monitoring and Management Tool for Distributed Systems, http://www.cs.wisc.edu/condor/hawkeye/.

[28] http://ganglia.sourceforge.net/, 2006.

[29] “The Relational Grid Monitoring Architecture: Mediating Information about the Grid,” A. W. Cooke et al., Journal of Grid Computing, Vol. 2, No. 4, December 2004.

[30] The Earth System Grid: Supporting the Next Generation of Climate Modeling Research, D. Bernholdt, S. Bharathi, D. Brown, K. Chancio, M. Chen, A. Chervenak, L. Cinquini, B. Drach, I. Foster, P. Fox, J. Garcia, C. Kesselman, R. Markel, D. Middleton, V. Nefedova, L. Pouchard, A. Shoshani, A. Sim, G. Strand, D. Williams, Proceedings of the IEEE, 93:3, March, 2005, 485-495.

[31] Andreozzi, S., GLUE Schema Implementation for the LDAP Data Model, Technical Report INFN/TC-04/16, 30 September 2004.
