
Fall 2008

UA CAC Technews

Inside this issue

About the Center

Letter From the Director

Autonomic Computing: The Next Era to Design Self* Systems and Applications

A Physics Aware Programming Paradigm

Scale-Right Provisioning

Autonomic Performance/Power Optimizations in Next Generation Datacenters

On-Going Projects

About the Center

Autonomic Computing (AC) refers to a broad range of scientific and engineering R&D on methods, architectures and technologies for the design, implementation, integration and evaluation of special- and general-purpose computing systems, components and applications that are capable of autonomously achieving desired behaviors. AC systems aim to govern and manage themselves in order to enable independent operation, minimize cost and risk, accommodate complexity and uncertainty, or enable systems of systems with large numbers of components. Hence, system integration and automation of management are important areas of research whose contexts subsume other AC research topics. These might include, to varying degrees, self-organization, self-healing, self-optimization (e.g., for power or speed), self-protection and other so-called self-* behaviors. CAC research activities will advance several disciplines that impact the specification, design, engineering and integration of autonomic computing and information processing systems. They include design and evaluation methods, algorithms, architectures, information processing, software, mathematical foundations and benchmarks for autonomic systems. Solutions will be studied at different levels of both centralized and distributed systems, including the hardware, network, storage, middleware, services and information layers. Collectively, the participating universities have research and education programs whose strengths cover the technical areas of the center. Within this broad scope, the specific research activities will vary over time as a reflection of center member needs and the evolution of the field of autonomic computing.

Volume 1, Issue 1 Visit our website: http://nsfcac.arizona.edu

Letter From the Director

It is my pleasure to introduce to you the National Science Foundation Center for Autonomic Computing (NSF CAC), a center funded through the NSF Industry/University Cooperative Research Centers program, industry, government agencies and matching funds from member universities, which currently include the University of Florida (lead), the University of Arizona and Rutgers, The State University of New Jersey. The mission of the center is to advance the knowledge of designing Information Technology (IT) systems and services to make them self-governed and self-managed. This will be achieved by developing innovative designs, programming paradigms and capabilities for computing systems.

The explosive growth of IT infrastructures, coupled with the diversity of their components and a shortage of skilled IT workers, has resulted in systems whose control and timely management exceed human ability. Current IT management solutions are costly, ineffective and labor intensive. According to estimates by the International Data Corporation (IDC), worldwide costs of server management and administration will reach approximately US $150 billion by 2010 and are estimated to represent more than 60% of the total spending needed to establish, maintain and operate large IT infrastructures. Autonomic Computing (AC) models IT infrastructures and their applications as closed-loop control systems that are continuously monitored and analyzed, with corrective actions taken whenever any of the desired behavioral properties (e.g., performance, fault tolerance, security) is violated. Such AC techniques are inspired by the strategies biological systems use to deal with complexity, dynamism, heterogeneity and uncertainty. The design space for AC systems spans multiple disciplines such as distributed computing, virtualization, control theory, artificial intelligence, statistics, software architectures, mathematical programming and networking. Our research efforts will accelerate the research and development of core autonomic technologies and services. The center will also establish strong collaboration programs with the US IT industry, specifically the Arizona IT industry (e.g., IBM, Raytheon, Intel and Avirtec), to develop next-generation information and communication technologies and services that are inherently autonomic.

In this newsletter, we highlight our ongoing autonomic research activities and how they apply to a wide range of applications and domains with profound impact on the economy and industry, such as autonomic management of IT infrastructure (specifically performance/power of large-scale datacenters, network defense systems, survivable systems and services, and control and management of wireless networks), autonomic scientific/engineering applications (specifically accelerating the research and discovery of grand-challenge water and climate issues, and optimizing the design and engineering of net-centric weapon systems), and autonomic healthcare systems and services.

In summary, the NSF-CAC will not only advance the science of autonomic computing, but also accelerate the transfer of technology to industry and contribute to the education of a workforce capable of designing and deploying autonomic computing systems in many sectors, from homeland security and defense, to business agility, to global change science. Furthermore, the center will leave its mark on the education and training of a new generation of scientists who have the skills and know-how to create knowledge in new, transformative ways.

As we move forward, we would like to invite you to join the center so together we can develop innovative autonomic technologies and services that will revolutionize how next-generation information and communication services are designed and deployed.

Salim Hariri, Director
UA NSF Center for Autonomic Computing

"The mission of the NSF-CAC is to advance the knowledge of designing Information Technology systems & services to make them self-governed and self-managed"


Autonomic Computing: The Next Era to Design Self* Systems and Applications

by Salim Hariri

The advances in computing and communication technologies and software tools have resulted in an explosive growth in networked applications and information services that cover all aspects of our life. These services and applications are inherently complex, dynamic and heterogeneous. In a similar way, the underlying information infrastructure, e.g. the Internet, is large, complex, heterogeneous and dynamic, globally aggregating large numbers of independent computing and communication resources, data stores and sensor networks. The combination of the two results in application development, configuration and management complexities that break current computing paradigms, which are based on static behaviors, interactions and compositions of components and/or services. As a result, applications, programming environments and information infrastructures are rapidly becoming brittle, unmanageable and insecure. This has led researchers to consider alternative programming paradigms and management techniques that are based on strategies used by biological systems to deal with complexity, dynamism, heterogeneity and uncertainty.

Autonomic computing is inspired by the human autonomic nervous system that handles complexity and uncertainties, and aims at realizing computing systems and applications capable of managing themselves with minimum human intervention. In this paper we first give an overview of the architecture of the autonomic nervous system and use it to motivate our approach to develop the autonomic computing paradigm. We then illustrate how this paradigm can be used to control and manage complex applications.

The Need for Integration and Automation

The control and management of computing systems have evolved from an environment in which a single process runs on one computer system to a large, complex and dynamic environment in which multiple processes run on geographically dispersed, heterogeneous computers that could span several continents (e.g., a Grid). The techniques to design computing systems and services that meet their requirements have been mainly ad hoc. Initially, designers focused on developing efficient parallel processing and high-performance architectures to improve system and application performance. As the deployment of computing systems and applications spread to many areas, especially those where failures can be catastrophic and life threatening, the reliability and availability of such systems and applications became a major concern. This requirement has driven separate research activities that have focused on reliability and fault-tolerant computing. In a similar manner, research in computing security has mainly addressed the need to protect the integrity and confidentiality of computing systems and their services without consideration of other important system attributes such as performance, reliability, and configuration. Consequently, this has led to the development of specialized and isolated computing systems and applications that can efficiently optimize a few of the system attributes or functionalities, but not all of them.

However, the next generation of computing systems and applications needs to run fast, reliably, securely and cost-effectively. The ad hoc integration of the techniques that have been developed to manage performance, security, and fault tolerance results in systems that are costly, insecure and unmanageable, as shown in Figure 1; the actions performed by the security technique might cancel the actions taken by the high-performance computing technique, or their impacts on other management techniques may be unpredictable. Furthermore, the performance, security and fault tolerance requirements of such systems and applications might change continuously at runtime. Hence, it is critical for next-generation computing systems and/or software architectures to be holistic in addressing system and application requirements. In addition to the need to address system and application requirements in an integrated manner, it is also critically important to control and manage these systems in a timely manner. The complexity, heterogeneity and dynamism of networked systems and their applications have resulted in systems whose control and timely management exceed human ability.

Figure 1: Integration of isolated solutions is inefficient, costly, insecure and unmanageable.

Autonomic Components and Systems

This has led researchers to consider alternative design paradigms and management techniques that are based on strategies used by biological systems to deal with complexity, dynamism, heterogeneity and uncertainty – a vision that has been referred to as autonomic computing. Autonomic computing is inspired by the human autonomic nervous system and aims at realizing computing systems and applications that are capable of managing themselves with minimum human intervention. There have been several efforts to characterize the main features that make a computing system or an application autonomic. However, most of these efforts agree that an autonomic system must at least support the following four features:

Self-Protecting: an autonomic system should be able to detect attacks and protect its resources from both internal and external attacks.

Self-Optimizing: an autonomic system should be able to detect sub-optimal behaviors and intelligently perform self-optimization functions.

Self-Healing: an autonomic system must be able to detect hardware and/or software failures and should have the ability to reconfigure itself to continue its operations in spite of failures.

Self-Configuring: an autonomic system must have the ability to dynamically change the configuration of its resources in order to maintain the overall system and application requirements.

Large-scale autonomic computing systems can be dynamically composed from smaller Autonomic Components (ACs), where each component supports in a seamless manner any combination of the four properties (self-protecting, self-optimizing, self-healing and self-configuring) mentioned above. That means each AC can be dynamically and automatically configured, seamlessly tolerate any component failure, automatically detect component attacks and protect against them, and automatically change its configuration parameters to improve performance once it deteriorates beyond a certain performance threshold. Once these autonomic components become available, we can dynamically build an autonomic computing system to meet any static or dynamic requirement, such as cost-effective high-performance systems or high-performance and secure systems, as shown in Figure 2.
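To make the four self-* properties concrete, here is a minimal, hypothetical sketch of an Autonomic Component interface with a closed monitoring/analysis/execution loop. The class, method and metric names are illustrative only; the article does not prescribe a concrete API.

```python
# A minimal, hypothetical sketch of an Autonomic Component (AC).
# Names and thresholds are illustrative; the newsletter only requires that
# each AC support the four self-* properties and be composable.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AutonomicComponent:
    name: str
    config: Dict[str, float] = field(default_factory=dict)
    sensors: List[Callable[[], Dict[str, float]]] = field(default_factory=list)

    def monitor(self) -> Dict[str, float]:
        """Collect metrics from all attached sensors (self-monitoring)."""
        metrics: Dict[str, float] = {}
        for sensor in self.sensors:
            metrics.update(sensor())
        return metrics

    def analyze_and_plan(self, metrics: Dict[str, float]) -> Dict[str, float]:
        """Map observed anomalies to configuration changes.

        self-protecting : react to a high 'attack_score'
        self-healing    : react to 'failure' flags
        self-optimizing : react to degraded 'throughput'
        self-configuring: return the new configuration to apply
        """
        new_config = dict(self.config)
        if metrics.get("attack_score", 0.0) > 0.8:
            new_config["firewall_level"] = 2.0
        if metrics.get("failure", 0.0) > 0.0:
            new_config["replicas"] = self.config.get("replicas", 1.0) + 1
        if metrics.get("throughput", 1.0) < 0.5:
            new_config["workers"] = self.config.get("workers", 1.0) * 2
        return new_config

    def execute(self, new_config: Dict[str, float]) -> None:
        """Apply the planned configuration (self-configuring)."""
        self.config = new_config


def control_loop(component: AutonomicComponent, steps: int = 3) -> None:
    """One closed monitor/analyze/plan/execute pass per step."""
    for _ in range(steps):
        metrics = component.monitor()
        component.execute(component.analyze_and_plan(metrics))
```

Composing several such components under a higher-level manager is one way the larger autonomic systems described above could be assembled.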

A Physics Aware Programming Paradigm
by Salim Hariri and Yeliang Zhang

Large-scale scientific applications generally experience different execution phases at runtime, and each phase has different computational, communication and storage requirements as well as different physical characteristics. An optimal solution or numerical scheme for one execution phase might not be appropriate for the next phase of the application execution. Choosing the ideal numerical algorithms and solutions for all application runtime phases remains an active research area. A novel approach to developing and implementing applications based on autonomic computing principles is critically needed. The Autonomic Programming (AP) paradigm enables application developers to identify the appropriate solution methods to exploit the heterogeneity and the dynamism of the application execution states. Once an application is developed based on the AP paradigm, an Autonomic Runtime Manager (ARM) can then periodically monitor and analyze the runtime characteristics of the application to identify its current execution phase (state). For each change in the application execution phase, the ARM will exploit the spatial and temporal attributes of the application in the current state to identify the ideal numerical algorithms/solvers that optimize its performance. We have evaluated this programming paradigm using a real-world application, Variable Saturated Aquifer Flow and Transport (VSAFT2D), commonly used in subsurface modeling. We evaluated the performance gain of the AP paradigm with up to 2,000,000 nodes in the computation domain implemented on 32 processors. Our experimental results show that by exploiting the application's physics characteristics at runtime and applying the appropriate numerical scheme with adapted spatial and temporal attributes, a significant speedup can be achieved (around 80%), and the overhead injected by the ARM is negligible. We also show that the results using AP are as accurate as the numerical solutions that use a fine grid resolution.

Figure 1: Holistic Approach: Autonomic Computing System

Motivation

In the domain of scientific computations, discretization in time and space is usually encountered in a large class of problems such as hydrology underground water studies, Stokes problems, thermomechanics, Computational Fluid Dynamics (CFD) and Elastohydrodynamic Lubrication problems. Most of these problems are solved by dividing the computation domain into small grids (represented by spatial characteristics Dx, Dy and Dz) and advancing the computation in small time steps (represented by the temporal characteristic Dt) until the desired results are obtained. However, most of these applications are generally time dependent, and their volatile nature makes them hard to solve. As time evolves, these problems will evolve into different phases with different physical characteristics. For example, in wildfire simulation, attempts have been made for over fifty years to understand and predict the behavior (intensity, propagation speed and direction, and modes of spread) of wildfires. However, the factors that determine wildfire behavior are complex; they include fuel characteristics and configurations, chemical reactions, balances between different modes of heat transfer, topography, and fire/atmosphere interactions. These factors influence a fire's behavior over a wide range of time and spatial scales, while the dynamism of the problem and the complicated interactions between these factors make accurate wildfire simulation difficult. Multiple physical phases exist not only in wildfire simulation but also in forefront astronomical research such as supernovae. The supernova core-collapse problem being modeled is inherently multi-phased, heterogeneous (in time, space and computational complexity) and dynamic. It involves hydrodynamics, nuclear fusion and transport phases. Each simulation phase requires different computational models and computational resources, and the transition from one phase to the next, the computational models applicable at each phase and their computational requirements are determined by criteria based on local state and known only at runtime.

Autonomic Programming (AP) Paradigm

Most current execution techniques use one algorithm to implement all the phases of the application execution, but a static solution or numerical scheme for every execution phase might not be ideal for handling the dynamism of the problem discussed in the Motivation section. Some techniques use application errors to refine the solution, as in Structured Adaptive Mesh Refinement (SAMR), where the mesh size is refined to achieve a smaller error rate on a particular computation domain. In this scheme, the error is used to drive the dynamic changes in the grid point size, which might or might not optimize the application performance. In our AP approach, we apply a novel programming paradigm that takes into consideration the current application's physical properties and derives spatial and temporal characteristics from those properties at runtime. In this programming paradigm, the appropriate solution that can meet the desired accuracy and improve performance is determined for each application phase at runtime. This programming technique is general and can be applied to the Finite Element Method, the Domain Decomposition Method, and Finite Volume Approximations.

Table 1

Grid point selection (computational requirement): Assume that there are 500,000 grid points along x and y. The computational requirement is (500,000 x 500,000) x 1000 x 1000 = 25x10^16 floating point operations; on a 10 Gflops processor, 25x10^6 seconds = 8 years!

Science-driven solution: Since it is a saturated situation, Ks and Kc are constants and Kc < Ks. We can enlarge Dx and Dy by 10 and reduce the number of time steps by 10. The computational requirement then becomes (50,000 x 50,000) x 100 x 1000 = 25x10^13 floating point operations, or 25x10^3 seconds = 4 days!

Computational Reduction Ratio (CRR): CRR = (n^2 x N_iteration x N_time_step) / ((n/Reduction_spatial)^2 x N_iteration x (N_time_step/Reduction_time)) = (Reduction_spatial)^2 x Reduction_time. If Reduction_spatial = Reduction_time = 10, then CRR = 1000, i.e., a 1000x reduction obtained just by considering the physics properties of the problem.

For example, let us consider the Variable Saturated Aquifer Flow and Transport (VSAFT2D) application kernel developed at The University of Arizona. The major computing step in this kernel is a matrix-solving routine. For two different hydraulic media, silt and sand, the routine goes through several phases, and in some phases it is possible to enlarge Dx and Dy by 10 and reduce the number of time steps by 10 without affecting the stability of the algorithm or its accuracy. Table 1 shows the performance gains that can be obtained when this is exploited in the AP paradigm. It is clear from this table that several orders of magnitude can be gained just by exploiting the physics properties of the application and identifying the right solution for each phase.
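To make the reduction ratio in Table 1 concrete, the back-of-the-envelope sketch below recomputes it in Python. The grid sizes and step counts are the illustrative values from the table, and the helper names (work, reduction_ratio) are hypothetical.

```python
# A small sketch of the reduction ratio in Table 1.
# All names (work, reduction_ratio) and the processor speed are illustrative,
# not taken from the VSAFT2D code itself.

def work(grid_points_per_axis: int, n_time_steps: int, n_iterations: int) -> float:
    """Total floating point operations for a 2-D grid advanced in time."""
    return (grid_points_per_axis ** 2) * n_time_steps * n_iterations


def reduction_ratio(spatial_reduction: int, temporal_reduction: int) -> int:
    """CRR = (Reduction_spatial ** 2) * Reduction_time."""
    return (spatial_reduction ** 2) * temporal_reduction


if __name__ == "__main__":
    fine = work(500_000, 1000, 1000)    # fine grid, small time step
    coarse = work(50_000, 100, 1000)    # Dx, Dy enlarged by 10, 10x fewer steps
    print(f"fine grid:   {fine:.3e} flops")
    print(f"coarse grid: {coarse:.3e} flops")
    print(f"measured reduction: {fine / coarse:.0f}x")   # 1000x
    print(f"CRR formula:        {reduction_ratio(10, 10)}x")
```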

To exploit this programming paradigm, an Autonomic Runtime Manager (ARM) has been developed to determine the application execution phase by monitoring the application execution, identify application phase changes by exploiting knowledge about the application physics and how it behaves at runtime, and then use the appropriate numerical algorithm/solver for each detected phase during the application execution. For each execution phase of a numerical application, the numerical schemes and solvers that can best exploit its physics characteristics and its state are chosen from a knowledge base. In wildfire simulation, we use AP to decompose the computational domain into several natural regions (e.g., burning, unburned and burned) at runtime according to the wildfire phase. The number of burning, unburned and burned cells determines the current state of the fire simulation and can then be used to accurately predict the computational power required for each region. By regularly monitoring and analyzing the state of the simulation and the phase transitions, and driving the runtime optimization with this information, we can achieve significant performance gains.

Figure 2: Autonomic Runtime Manager Architecture

Figure 3: VSAFT2D Execution Time

Our preliminary results show that the AP paradigm introduces very little overhead and yields an error rate comparable to the solutions using the finest grid size, but with superior speedup.

The Monitor Engine monitors the application execution to identify the computational characteristics and the physical properties of the current application execution phase. In the example discussed above, the Monitor Engine will identify changes in the heat diffusion acceleration and the heat conductivity. The application properties are then fed into the Planning Engine.

The Planning Engine determines the optimal spatial (Dx, Dy, Dz) and temporal (Dt) characteristics for the application solution while maintaining the desired accuracy. Consequently, the format of the linear system is projected based on the numerical method the application uses. The Knowledge Base identifies the optimal solution for each execution phase based on analytical and historical data; for example, if an application uses an implicit Finite Difference Method, the linear system will involve solving a tri-diagonal matrix, and Conjugate Gradient with a block Jacobi preconditioner will be stored in the knowledge base as the best linear solver. After the optimal algorithm is selected, the Configuration Engine generates the data needed by the new algorithm based on the previous phase's execution results and the temporal and spatial characteristics of the new solution; for example, it regenerates grid values using interpolation and extrapolation after the grid size changes. After the Configuration Engine finishes its job, the application resumes its execution with the new configuration.
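A simplified, hypothetical sketch of this monitor/plan/reconfigure loop is shown below. The phase names, solver names, saturation threshold and knowledge-base entries are illustrative; the real ARM derives them from the application's physics and its knowledge base.

```python
# A hypothetical sketch of an ARM-style phase-detection loop.
# Phase names, solvers and thresholds are assumptions for illustration only.

from typing import Dict, Tuple

# Knowledge base: execution phase -> (solver, spatial step scale, time step)
KNOWLEDGE_BASE: Dict[str, Tuple[str, float, float]] = {
    "saturated":   ("conjugate_gradient_block_jacobi", 10.0, 1.0),
    "unsaturated": ("picard_iteration_fine_grid",       1.0, 0.1),
}


def detect_phase(metrics: Dict[str, float]) -> str:
    """Monitor Engine: classify the current phase from a physics indicator."""
    return "saturated" if metrics.get("saturation", 0.0) > 0.95 else "unsaturated"


def plan(phase: str) -> Tuple[str, float, float]:
    """Planning Engine + Knowledge Base: pick solver and Dx/Dy/Dt for the phase."""
    return KNOWLEDGE_BASE[phase]


def reconfigure(state: Dict, spatial_step: float) -> Dict:
    """Configuration Engine: regrid the previous phase's results, e.g. by
    interpolating old grid values onto the new grid (placeholder here)."""
    return {"grid": state.get("grid"), "dx": spatial_step}


def arm_step(metrics: Dict[str, float], state: Dict) -> Dict:
    phase = detect_phase(metrics)
    solver, dx, dt = plan(phase)
    new_state = reconfigure(state, dx)
    new_state.update({"solver": solver, "dt": dt, "phase": phase})
    return new_state


print(arm_step({"saturation": 0.99}, {"grid": None}))
```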

We evaluate the performance of the PAP approach with a transient problem setting as shown in Figure 3. For simplicity, we only consider the spatial characteristics of the application, such as determining the ideal grid size for each phase, although our approach can support both adaptation types (spatial and temporal). Furthermore, we assume the area is divided evenly between silt and sand (each represents 50% of the computational domain). In reality, this distribution varies depending on the area being modeled. In fact, the more heterogeneous the computational domain is, the more performance gain can be achieved by using the PAP paradigm; traditional programming techniques suffer more degradation in performance because the adopted solution must satisfy all domains by choosing the most conservative solution. We compare the performance of the PAP implementation of VSAFT2D with the implementation that uses the finest grid resolution. We execute the code with the finest grid size and with the PAP approach, in which the ARM dynamically chooses the optimal grid size for each phase of the application execution, for a total simulation period of 0.3 day.

The execution time of this application on different numbers of nodes is shown in Figure 3. For 2M nodes, the PAP approach achieved an 81% performance gain compared with the finest grid implementation.

Importance of the research problem

This research will formulate methodologies and develop infrastructures for building the next generation of scientific and engineering simulations of complex physical phenomena on widely distributed, highly heterogeneous and dynamic, networked computational environments. The simulations targeted by this effort will be built as dynamic compositions of autonomous components that integrate scalable distributed (and heterogeneous) computing with interactive control and computational steering, collaborative analysis, and scientific databases and data archives. Composing, configuring and managing the execution of these applications to exploit the underlying computational power in spite of its heterogeneity and dynamism will present significant challenges. Our programming paradigm, execution model, component frameworks and runtime infrastructures will help researchers better understand the operations and performance issues of applications and the limitations of algorithms and architectures. They will also enable the development of "smarter" algorithms and applications that are capable of sensing the state of their environments and reacting to optimize overall execution, utilization and performance. Finally, this research will support the trend toward the next generation of an integrated software development life cycle that allows users to describe the requirements of the application components at each phase of the life cycle. These requirements can then be used by compilers and runtime systems to produce applications that are controllable, observable, and maintainable.


Scale-Right Provisioning
by Mazin Yousif

Scale-Right Provisioning (SRP) is intended to streamline the deployment, allocation and management of compute, memory and I/O resources, and to scale them so as to react efficiently to runtime changes (e.g., failures and workload variations). Adopting SRP will mitigate many datacenter inefficiencies such as: (i) provisioning for peak loads, and sometimes for multiple peak loads, resulting in low average resource utilization (< 25%); (ii) lack of autonomic features, such as the inability to self-optimize when runtime workloads vary, to self-configure when resources are added or removed, and to self-heal when failures happen; and (iii) manageability, which has remained ad hoc with considerable overhead on Total Cost of Ownership (TCO).

The premise of this project is to decompose the physical resources (e.g., servers, network switches and SAN or NAS devices) in an enclosure (or server farm or rack or ...) into pools of virtual resources. Then, based on the workload's resource requirements and its execution constraints, such as Service Level Agreement (SLA), security and availability, a set of compute, memory, network and storage virtual resources is selected and launched as a dynamic virtual platform (a virtual machine) to run the workload. We also provide the ability to automatically scale the resources allocated to a dynamic platform up or down as runtime demands change. For example, when a workload running on a dynamic platform requires more compute power, one or more virtual logical threads from the resource pool are added to this VM. Similar arrangements can be made for other runtime changes such as power, SLA or availability. The specific features we plan to support include, but are not limited to: (i) the ability to manage an enclosure of compute and I/O nodes independent of system software, relying on high-level policies; (ii) the ability to create dynamic platforms from pools of compute/memory and I/O resources with capacities that match workloads' requirements; (iii) the ability to dynamically scale the resources allocated to a dynamic platform up or down based on runtime changes (e.g., workload and failures); (iv) the ability to enforce various power budgeting schemes within the enclosure, as well as thermal-gradient characteristics; (v) the ability to mitigate failures through the integration of Fault Prediction Agents that predict failures and gracefully migrate workloads from one dynamic platform to another; and (vi) the ability to perform fabric topology configuration to optimize resource allocation to dynamic platforms. The Proof of Concept (PoC) for this project will include mechanisms to create: (i) pools of virtual resources; and (ii) a comprehensive autonomic management infrastructure. The first could be established through Virtual Machine Monitors (VMMs) in each server or by creating an enclosure-wide VMM (EVMM). To create the EVMM, it is necessary to enhance existing VMMs with capabilities allowing them to communicate with each other and collaborate to project one aggregate VMM. Specifically, the EVMM will: (i) enable arbitrary grouping of resources to create dynamic platforms, and help create multiple such platforms within an enclosure; (ii) present an abstraction of the platforms to enable decoupling of hardware and software resources while guaranteeing security and isolation; (iii) help to dynamically allocate, de-allocate and reallocate resources within an enclosure and as assigned to a platform; (iv) help manage the enclosure independently of system software; and (v) help redirect events to the appropriate blade where a platform resides or to the management platform.
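The sketch below illustrates the provisioning and scaling idea in a few lines of Python: a pool of virtual resources, a dynamic platform carved out of it, and a scale-up triggered when observed utilisation crosses an assumed SLA threshold. The resource classes, metric and threshold are hypothetical simplifications of the EVMM/EAM behaviour described above.

```python
# An illustrative sketch of Scale-Right Provisioning: compose a dynamic
# platform from enclosure-wide pools of virtual resources and scale it as
# runtime demand changes. Classes, metrics and thresholds are assumptions.

from dataclasses import dataclass
from typing import Dict


@dataclass
class ResourcePool:
    """Virtual resources available in one enclosure."""
    free: Dict[str, int]  # e.g. {"vcpu": 64, "memory_gb": 512}

    def allocate(self, request: Dict[str, int]) -> Dict[str, int]:
        for kind, amount in request.items():
            if self.free.get(kind, 0) < amount:
                raise RuntimeError(f"pool exhausted for {kind}")
            self.free[kind] -= amount
        return dict(request)

    def release(self, resources: Dict[str, int]) -> None:
        for kind, amount in resources.items():
            self.free[kind] = self.free.get(kind, 0) + amount


@dataclass
class DynamicPlatform:
    """A virtual platform (VM) launched to run one workload under an SLA."""
    allocated: Dict[str, int]
    pool: ResourcePool

    def scale(self, kind: str, delta: int) -> None:
        """Add (delta > 0) or return (delta < 0) virtual resources."""
        if delta > 0:
            self.pool.allocate({kind: delta})
        else:
            self.pool.release({kind: -delta})
        self.allocated[kind] += delta


# Example: launch a platform, then scale it up when utilisation exceeds the
# (assumed) SLA threshold of 80% CPU utilisation.
pool = ResourcePool(free={"vcpu": 64, "memory_gb": 512})
platform = DynamicPlatform(allocated=pool.allocate({"vcpu": 8, "memory_gb": 32}),
                           pool=pool)
observed_cpu_util, sla_threshold = 0.92, 0.80
if observed_cpu_util > sla_threshold:
    platform.scale("vcpu", 4)
print(platform.allocated, pool.free)
```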


The vision of the capabilities provided by the EVMM and EAM is to allow future backplanes to mimic current server motherboards. Specifically, a current server motherboard carries a collection of chips (processors, chipsets, ...) and a BIOS that enables them to expose themselves as a platform to run system software (an OS or a VMM). Similarly, the EVMM enables compute and I/O resources in various physical platforms to be aggregated and exposed as a dynamic platform on which system software will run.

The autonomic management infrastructure includes multiple components: (i) a Resource Autonomic Manager (RAM) in each server with monitoring and limited decision capabilities based on locally collected data; (ii) a Platform Autonomic Manager (PAM) with the ability to make decisions at the platform level; and (iii) an Enclosure Autonomic Manager (EAM) that has visibility over the whole enclosure. Specifically, the EAM is tasked with capabilities such as: (i) self-discovery and configuration of servers, with inventory and cataloging capabilities at the enclosure level; (ii) provisioning platforms to boot OS images and applications based on system-level policies; (iii) selecting virtual resources and launching dynamic platforms to run workloads; (iv) transparently migrating or scaling dynamic platform resources up or down based on runtime changes; (v) enforcing high-level policies such as those related to power and SLA; and (vi) projecting all information onto a user interface to allow the user to view both physical and virtual configurations in the enclosure.

Autonomic Performance/Power Optimizations in Next Generation Datacenters

by Mazin Yousif

The goal of this project is to design an innovative autonomic framework and architecture to integrate into traditional server platforms and enclosures to intelligently optimize their performance/watt. To achieve this goal, we plan to extend our performance/watt optimization at the resource level to the platform, enclosure and datacenter levels. The central ideas behind this research, which includes power and thermal optimizations, are: (i) to proactively detect and reduce resource over-provisioning in server platforms such that each platform is right-sized to handle the requirements of the application; and (ii) to migrate virtual machines from one physical server in an enclosure to another to eliminate thermal hot spots and smooth thermal gradients within the enclosure, reducing cooling costs. This holistic, multi-variable, mathematically rigorous optimization approach for determining optimal performance/watt lends itself best to achieving the desired goals.

Most early work on server power management has either focused on specific components such as the processor, used heuristics to address base power consumption in server clusters, or ignored thermal ramifications completely. This motivated us to adopt a holistic approach to system-level power management within a server farm or enclosure, where we exploit interactions and dependencies among different resources and platforms.

We consider an enclosure with multiple servers, each consisting of multi-core processors and multi-rank memory subsystems plus other resources. The autonomic enclosure consists of three hierarchies of management, as shown in the first figure: the Enclosure Autonomic Manager (EAM) at the enclosure level; the Platform Autonomic Manager (PAM) at the platform level; and the Core Manager (CM) and Rank Manager (RM) at the individual processor core and memory rank, respectively. The EAM ensures that all platforms within the enclosure operate within the pre-determined thermal gradient and thermal/power envelope by migrating virtual machines running specific workloads from one platform to another. The PAM's objective is to ensure that platform resources (processor/memory) are configured to meet the dynamic application resource requirements such that additional platform capacity can be transitioned to low-power states. In this manner both the EAM and PAM save total power without hurting application performance. The platform power and performance parameters together determine the platform operating point in an n-dimensional space at any instant during the lifetime of the application. The PAM manages platform power and performance by maintaining the platform operating point within a predetermined safe operating zone, as in the example shown in the second figure. The PAM predicts the trajectory of the operating point as it changes in response to changes in the nature and arrival rate of the incoming workload and triggers a platform reconfiguration whenever the operating point drifts outside of the safe operating zone. We used this approach for memory performance/watt optimizations, where we specifically monitor additional parameters such as memory miss ratio, memory end-to-end delay and memory request loss to determine the best memory configuration that would maintain the platform response time within the safe operating region.

Figure: Trajectory of the platform operating point, showing transient and steady-state behaviour relative to the safe operating zone, the anomalous operating zone and the decision zone; monitored parameters include platform energy, platform waitTime, platform procTime, platform reqLoss, memory miss ratio, memory end-to-end delay and memory reqLoss.

The platform state is defined by the number of processor cores in the active state, the number of memory ranks in the active state, and the I/O devices. Since the performance of the platform depends on its physical configuration, a PAM reconfiguration decision actually involves a platform state transition from the current state to a target state that maintains the performance while giving the smallest power consumption. The search for this ideal target state is formulated as an optimization problem.
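As a toy illustration of that optimization, the sketch below searches the space of platform states (active cores, active memory ranks) for the lowest-power state whose predicted response time stays within an assumed SLA. The power and performance models, constants and SLA value are made-up placeholders, not the project's actual models.

```python
# A toy formulation of the PAM reconfiguration decision: minimum-power
# platform state subject to a response-time constraint. All models and
# constants are hypothetical placeholders.

from itertools import product

MAX_CORES, MAX_RANKS = 8, 4
CORE_POWER_W, RANK_POWER_W, BASE_POWER_W = 12.0, 4.0, 60.0
RESPONSE_TIME_SLA_MS = 50.0


def predicted_power(cores: int, ranks: int) -> float:
    return BASE_POWER_W + cores * CORE_POWER_W + ranks * RANK_POWER_W


def predicted_response_time(cores: int, ranks: int, load: float) -> float:
    """Crude stand-in for a real performance model (queueing, regression, ...)."""
    return load * 400.0 / (cores * min(1.0, ranks / 2.0) + 1e-9)


def best_state(load: float):
    """Return the (cores, ranks) state with minimum power that meets the SLA."""
    feasible = [
        (predicted_power(c, r), c, r)
        for c, r in product(range(1, MAX_CORES + 1), range(1, MAX_RANKS + 1))
        if predicted_response_time(c, r, load) <= RESPONSE_TIME_SLA_MS
    ]
    if not feasible:
        return MAX_CORES, MAX_RANKS  # no low-power state meets the SLA
    _, cores, ranks = min(feasible)
    return cores, ranks


print(best_state(load=0.3))  # light load -> few active cores/ranks
print(best_state(load=1.0))  # heavy load -> more capacity kept active
```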

The EAM will rely on thermal sensors placed at critical positions within the enclosure and platforms. Data from all these sensors will be collected by the EAM, which will make decisions based on the thermal gradient and the presence or absence of thermal hot spots within the enclosure, the set of resources required by the running workloads, and the decisions the PAM has undertaken. One interesting research challenge is the sensitivity to data inaccuracies when moving up the hierarchy from the component managers to the platform managers to the enclosure manager.

This project will develop innovative management techniques at the resource level to address the following research challenges: 1) How can we efficiently and accurately model power and energy consumption from a system-level perspective that involves complex interactions of different classes of devices such as processors, memory, network and I/O? A system-level view of these components would present more opportunities for power savings, since we can exploit their non-mutually-exclusive behaviors to set them at power states such that the global system power consumption is minimal. 2) How can we predict, in real time, the behavior of system resources and their power consumption as workloads change dynamically by several orders of magnitude within a day or a week? 3) How can we design efficient and self-adjusting optimization mechanisms that can continuously learn, execute, monitor, and improve themselves in meeting the collective objectives of power, thermal and performance optimization? Game theory and data mining techniques will be exploited to address this research challenge.


On-Going Projects

Autonomic Defense System

by Youssif Al-Nashif

The complexity, multiplicity, and impact of cyber attacks have been increasing at an alarming rate in spite of significant research and development investment in cyber security products and tools. The current techniques to detect and protect cyber infrastructures from these smart and sophisticated attacks are mainly characterized as ad hoc, manually intensive, and too slow. We are developing an Autonomic Defense System (ADS) based on an alternative approach inspired by biological systems, which can efficiently handle complexity, dynamism and uncertainty. In our approach, online monitoring and multi-level analysis are used to analyze the anomalous behaviors of networks with respect to connection flows, protocols and application/payload contents. By combining the results of the different types of analysis using a statistical decision fusion approach, we can accurately detect any type of cyber attack with high detection and low false alarm rates and proactively respond with corrective actions to mitigate their impacts and stop their propagation. We also apply risk and impact analysis to determine the most effective sequence of actions that will have minimal impact on normal network operations while at the same time minimizing the impact of the detected cyber attack.

Intrusion detection systems can be broadly classified into signature-based, classification-based and anomaly-based systems. The last two techniques differ significantly from signature-based approaches in that they do not rely on a database of fixed 'signatures' to identify malicious activity.

The main concept behind the ADS is the integration and fusion of several types of independent behavior analysis (applications, protocols, networks, and data link protocols). The cyber infrastructure is monitored, the correct features are selected to improve detection, and then aggregation and correlation are used to reduce the number of analyzed records with minimum loss of information. The anomaly-based detection technique is then used to detect abnormal behavior in the cyberspace. Once an abnormal behavior is detected, the risk and impact analysis process is triggered to recommend the best set of actions to be applied with minimum loss of cyber operation functionality. Finally, the recommended set of actions is either applied automatically or prompted for administrator confirmation in the visualization and management console. Our experimental results show that our ADS prototype can automatically detect and protect against any type of TCP attack (known or unknown) and a wide range of network attacks (DoS, scanning, R2L and worms, just to name a few) with almost zero false alarms.
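A minimal sketch of the statistical decision-fusion step is shown below: several independent analysis engines (flow, protocol, payload) each emit an anomaly score, and a weighted fusion rule decides whether to raise an alert. The weights and threshold are hypothetical; in the ADS they would be derived from training traffic.

```python
# A minimal sketch of statistical decision fusion across independent
# anomaly analyzers. Weights and the alert threshold are assumptions.

from typing import Dict

FUSION_WEIGHTS = {"flow": 0.4, "protocol": 0.3, "payload": 0.3}
ALERT_THRESHOLD = 0.6


def fuse(scores: Dict[str, float]) -> float:
    """Weighted average of per-analyzer anomaly scores in [0, 1]."""
    return sum(FUSION_WEIGHTS[k] * scores.get(k, 0.0) for k in FUSION_WEIGHTS)


def decide(scores: Dict[str, float]) -> str:
    return "alert" if fuse(scores) >= ALERT_THRESHOLD else "normal"


# Example: payload analysis is confident, flow analysis is suspicious.
print(decide({"flow": 0.7, "protocol": 0.4, "payload": 0.9}))  # -> "alert"
```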

Autonomic Application Security Management
by Ram Prasad V

Network monitoring systems can be broadly classified into signature-based and anomaly-based systems. Signature-based systems are limited by the number of anomalies they can detect (which depends on the number of signatures in the database), while anomaly-based systems have a high false positive rate. Existing payload/application anomaly detection systems either use byte distributions or work only on the first line of the payload. Such an approach limits the number of attacks that can be detected and works only for certain protocols (e.g., the GET request of HTTP).

Our application anomaly detection system, which is part of the Autonomic Network Defense (AND) System, classifies network traffic into various objects such as headers, text, images, audio and video. The system consists of three major routines: (i) a sniffer module that collects the normal traffic and stores it in a database, (ii) a model generator routine which uses the collected traffic to model each of the objects mentioned above, and (iii) a detector routine which then scans the traffic to detect deviations from the normal behavior. We are currently analyzing the anomalous behavior of the HTTP protocol. We have implemented the following HTTP header models:

· Language model: This model is used to profile the byte distribution of the HTTP headers. It helps us detect anomalies such as shellcode injection, since HTTP uses ASCII-based headers and the presence of code will alter the byte distributions present in the packets.

· Keyword/Value based model: These models divide the headers into "keyword-value" pairs. The standard keywords are specified in the HTTP specification (RFC 2616). The keywords are profiled and various statistics are generated for the values of these keywords. From these statistics, we build the following models:

· Keyword average, maximum and mode: The mode, the average length and the maximum length of the value for each keyword are determined during the model building stage. During the detection phase, the profiled values are checked against the real time values and any deviation is flagged.

· New keywords: HTTP allows users to define new keywords. But the interpretation of the keywords depends on the server and the client. Hence, the presence of any new keyword denotes a deviation from the normal behavior (especially since we profile the normal behavior of the traffic).

· Keyword ordering: We study the ordering of keywords in the normal case. The ordering is profiled and any deviation from the normal ordering is flagged. The change in the ordering can be attributed to hand crafted packets or buffer overflow attacks.

· Time window based model: Our detection routine works in a time window. The traffic is collected for a certain duration, say 10 seconds, and then the models are verified. At the same time, the traffic is analyzed over the time window. Any similarity of packets is flagged, as it can correspond to scanning or denial of service attacks. A sketch of these header models follows this list.
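As an illustration of the keyword/value and keyword-ordering models described above, here is a minimal sketch that profiles per-keyword value lengths and keyword order from normal headers and flags deviations. The tiny training set, the 2x length tolerance and the helper names are hypothetical.

```python
# A sketch of keyword/value and keyword-ordering profiling for HTTP headers.
# The training data, tolerance factor and function names are assumptions.

from collections import defaultdict
from typing import Dict, List, Tuple


def parse_headers(raw: str) -> List[Tuple[str, str]]:
    pairs = []
    for line in raw.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            pairs.append((key.strip().lower(), value.strip()))
    return pairs


def build_profile(training_headers: List[str]) -> Dict:
    lengths = defaultdict(list)
    orderings = set()
    for raw in training_headers:
        pairs = parse_headers(raw)
        orderings.add(tuple(k for k, _ in pairs))
        for key, value in pairs:
            lengths[key].append(len(value))
    return {
        "max_len": {k: max(v) for k, v in lengths.items()},
        "avg_len": {k: sum(v) / len(v) for k, v in lengths.items()},
        "orderings": orderings,
    }


def detect(raw: str, profile: Dict) -> List[str]:
    alerts = []
    pairs = parse_headers(raw)
    for key, value in pairs:
        if key not in profile["max_len"]:
            alerts.append(f"new keyword: {key}")          # new-keyword model
        elif len(value) > 2 * profile["max_len"][key]:
            alerts.append(f"oversized value for {key}")    # keyword statistics
    if tuple(k for k, _ in pairs) not in profile["orderings"]:
        alerts.append("unusual keyword ordering")          # ordering model
    return alerts


normal = ["Host: example.com\nUser-Agent: test\nAccept: */*"]
profile = build_profile(normal)
print(detect("Host: example.com\nUser-Agent: " + "A" * 500 + "\nAccept: */*",
             profile))  # -> ['oversized value for user-agent']
```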

We have developed a framework, called the AND-Tcpdump framework, to monitor multiple application-level protocols simultaneously. Based on which applications are to be monitored, the framework segregates the payload information and hands the relevant packets over to different application protocol handlers. The framework now has the capability to monitor the control port of FTP and detect all the data ports where transfers are scheduled to happen.

Accelerated Discovery Cycle Development Environment
by Yaser Jararweh

In the face of rapid global changes, the considerable recent advances in ecosystem science have nonetheless failed to keep pace with the accelerating need to better understand "how the world works". The science of global change requires a new model for the comprehensive study of complex systems that quantitatively combines: (a) experimental infrastructure that encompasses realistically complex earth systems but simultaneously allows for precise manipulation of that system, with (b) networks of field monitoring stations measuring whole-ecosystem dynamics using a diverse array of instrumentation. To realize this integration between experimental and observational data, information integration and informatics capacity are needed both to ingest the massive datasets required to capture large-scale dynamic ecosystem complexity and to instantly process and update them in order to test contrasting mechanistic models and drive the next set of experiments. Furthermore, to ensure that such an approach is accessible to the entire scientific community, the data, models, and knowledge generated need to be easily accessible and visualized.

Currently, no coupled physical and computational infrastructure exists to address this pressing need. A computational collaboration that can support such integration would generate immediate transformative science opportunities.

Cyber-infrastructure and information technology are accelerating the way many sciences are conducted in rapidly evolving multi-disciplinary fields such as biomedical informatics, ecosystem sciences, geosciences, and bioinformatics. These scientific fields are characterized by huge amounts of data streaming and data processing. For example, in the context of an ecosystem observation architecture, hundreds of sensors scattered in the field collect temperature, humidity, respiration and other information, which leads to a large amount of data being collected from the sensor network.

There is a need for Accelerated Discovery Cycles (ADCs) for integrating experimental and observational data to capture large-scale dynamic ecosystem complexity, to instantly process massive datasets, to test contrasting mechanistic models and to drive the next set of experiments. The overarching objective is to enable ADCs by coupling recent advances in computational models and cyber-systems with the unique experimental infrastructure of Biosphere 2 (B2), a large-scale earth system science facility now under management by the University of Arizona.

In the context of ADCs, there is a need for a software development environment for modeling complex systems and a middleware for streaming data from the field into the models. Kepler is an open source tool that enables the end user to design scientific workflows in order to manage scientific data and perform complex analysis on the data. Ring Buffered Network Bus (RBNB) DataTurbine is a middleware system that is used to integrate sensor-based environment observing systems with data processing systems. Currently, the integration between Kepler and DataTurbine is limited to reading from DataTurbine only.

In an ADC, multiple hypotheses are tested with different assimilation models. These models run on a distributed computing environment. Our ongoing research activities focus on integrating our Autonomia environment with Kepler, RBNB and MATLAB to enable the development of sophisticated autonomic workflows and thus accelerate research and discovery on global climate change and complex ecosystems.

Anomaly-based Fault Management in Distributed Systems
by Byoung Kim

Reliability, availability, and robustness to hardware and software failures are among the important design criteria for parallel/distributed computing systems. The increase in complexity, interconnectedness, dependency and asynchronous interactions between components that include hardware resources (computers, servers, network devices) and software (application services, middleware, web services, etc.) makes fault detection and tolerance a challenging research problem. In this project, we are developing an innovative concept to achieve self-healing by analyzing transition sequences of length n during a window interval to detect hardware/software faults and perform root-cause analysis. In our approach, we monitor and analyze all the interactions between all the components of a distributed computing system. We use an innovative data structure, AppFlow, which is an n-dimensional array of features that captures spatial and temporal variability. The AppFlow is then analyzed by an anomaly behavior analysis engine that produces an alert whenever a transition sequence pattern violates normal patterns due to a software or hardware failure.


AppFlow

AppFlow characterizes the dynamic behaviors of applications and systems simultaneously with respect to key operational features. These features can be categorized into three distinct classes: Hardware Flow (HWFlow), Software Flow (SWFlow), and Network Flow (NETFlow). Keys can be viewed as links connecting different spaces. For example, if there are thousands of application instances running in a distributed system (a large-scale data center), we need information allowing us to differentiate each instance after fault detection. By classifying and tracing the keys from the AppFlows, we can instantaneously identify the source of faults once an anomalous behavior is detected. We have successfully implemented a self-healing engine and evaluated its performance using an e-commerce application (TPC-W). Our preliminary results show that our approach can detect various faults that we injected asynchronously, and obtains a detection rate of 99.9% with no occurrences of false alarms for a wide range of fault scenarios.
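A hypothetical sketch of the transition-sequence analysis is shown below: component interactions observed during a window are turned into sequences of length n, and any sequence never seen during normal operation is flagged. The event names, the value of n and the training trace are illustrative.

```python
# A sketch of transition-sequence (n-gram) anomaly detection over component
# interactions. Event names, n and the training windows are assumptions.

from typing import List, Set, Tuple

N = 3  # length of the transition sequences analysed


def ngrams(events: List[str], n: int = N) -> Set[Tuple[str, ...]]:
    return {tuple(events[i:i + n]) for i in range(len(events) - n + 1)}


def train(normal_windows: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Build the set of transition sequences seen during fault-free runs."""
    normal: Set[Tuple[str, ...]] = set()
    for window in normal_windows:
        normal |= ngrams(window)
    return normal


def detect(window: List[str],
           normal: Set[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Return the transition sequences in this window that violate the profile."""
    return sorted(ngrams(window) - normal)


normal_model = train([
    ["web_recv", "app_query", "db_read", "app_reply", "web_send"],
    ["web_recv", "app_query", "db_write", "app_reply", "web_send"],
])
suspect = ["web_recv", "app_query", "db_timeout", "app_retry", "web_send"]
print(detect(suspect, normal_model))  # unseen sequences point at the faulty part
```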

Survivability Modeling and Analysis
by Seungchan Oh

Our dependence on Information Technologies (IT) has introduced a new form of vulnerability that gives cyber attackers the opportunity to launch attacks against our national infrastructure (national defense systems, air traffic control systems, power grid control systems, etc.). Understanding the vulnerability of our cyber infrastructure, and how to quantify it, is becoming critically important to securing and protecting our IT services and resources. The formal definition of survivability is the ability of a system to provide essential services in a timely manner in the face of faults, attacks or accidents. Security in general focuses on the recognition of and resistance to attacks, whereas survivability also covers faults and accidents.

The quantification of survivability can be used to analyze the robustness and survivability of different topologies and distributions of cyberspace resources. Additionally, we can improve the survivability of a system by locating the vulnerable hardware and/or software components so they can be hardened. Very little research has been conducted to quantify the survivability of IT systems and their services because of its challenging complexity. In our approach, we adopt Ellison's description of survivable systems, which should have three properties (3R): Resistance, Recognition and Recovery. The possible attacks (as well as faults and accidents) are divided into sub-events, and we then calculate how the system responds to those events in resisting, recognizing and recovering from them.
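A toy quantification in the spirit of this 3R decomposition is sketched below: each adverse sub-event is assigned probabilities of being resisted, recognised if not resisted, and recovered from once recognised. The event list, probabilities and the simple likelihood-weighted combination rule are all assumptions, not the project's actual model.

```python
# A toy 3R-style survivability estimate. Events, probabilities and the
# combination rule are hypothetical illustrations only.

from dataclasses import dataclass
from typing import List


@dataclass
class SubEvent:
    name: str
    likelihood: float   # probability the sub-event occurs
    p_resist: float     # probability the system resists it outright
    p_recognize: float  # probability it is recognised if not resisted
    p_recover: float    # probability essential services are restored in time


def survives(e: SubEvent) -> float:
    """Probability essential services stay available given the sub-event occurs."""
    return e.p_resist + (1 - e.p_resist) * e.p_recognize * e.p_recover


def survivability(events: List[SubEvent]) -> float:
    """Likelihood-weighted survivability over the considered sub-events."""
    total = sum(e.likelihood for e in events)
    return sum(e.likelihood * survives(e) for e in events) / total


scenario = [
    SubEvent("ddos_flood",   0.5, 0.6, 0.9, 0.8),
    SubEvent("disk_failure", 0.3, 0.2, 0.95, 0.9),
    SubEvent("worm_spread",  0.2, 0.4, 0.7, 0.6),
]
print(f"estimated survivability: {survivability(scenario):.2f}")
```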


Autonomia: An Autonomic Control and Management Environment
by Sankaranarayanan Veeramoni Mythili

The increased complexity, heterogeneity and dynamism of networked systems and applications make current configuration and management tools ineffective. A new paradigm to dynamically configure and manage large-scale, complex and heterogeneous networked systems is critically needed. In this project, we are implementing an autonomic system, Autonomia, based on the principles of autonomic computing, that can efficiently handle complexity, dynamism and uncertainty in configuring networked systems and their applications. Autonomia provides dynamic programmable control and management services to support the development and deployment of autonomic applications. It can automate the dynamic allocation of resources to achieve high performance and fault-tolerant operation. It provides a secure, open computing environment with automated deployment, registration and discovery of components. Our autonomic management is implemented using two software modules: the Component Management Interface (CMI), which enables us to specify the configuration policies and operational policies associated with each component, where a component can be a hardware resource or a software component; and the Component Runtime Manager (CRM), which monitors the component's operational state through well-defined management interface ports.

The main Autonomia modules include the System Management Editor, the Autonomic Management Library, the Component Runtime Manager (CRM) and the Compound Component Runtime Manager (CCRM). The System Management Editor is used to specify the component management requirements according to the specified CMI schema. The Autonomic Management Library is a set of common services/functionalities (e.g., a fault tolerance service) we have developed that can be invoked by the Compound Component Runtime Manager (CCRM). The CRM is a runtime manager that monitors the component's behavior and controls its operation in order to maintain the desired component attributes and functionalities. Several autonomic components (e.g., autonomic servers, clusters, and software systems) can be controlled and managed by one autonomic system that we refer to as an Autonomic Compound Component (ACC), and the corresponding CRM is referred to as the CCRM. In a similar way, larger autonomic systems can be built by composing several autonomic compound components so as to create a hierarchical management structure.

System Management Editor (SME): It is used to specify the component management requirements according to the specified CMI schema. Each autonomic component requires a CMI associated with it, whether it is a basic component or a compound component.

Autonomic Management Library (AML): It is a set of common services/functionalities (e.g., a fault tolerance service) we have developed that can be invoked by any Component Runtime Manager in an autonomic component. The library includes machine learning algorithms, which can be used by the decision-making module; fault detection algorithms, which help identify faults; system and application performance measurement functions; configuration interface functions, which measure the specified attributes and effect configuration changes using the format we defined (the component information base); and some security detection utilities we developed, such as checking whether a host is alive or a port is open.

Component Runtime Manager (CRM): It monitors the component's behavior and controls its operation in order to maintain its desired operational requirements.

Compound Component Runtime Manager (CCRM): Several autonomic components can be hierarchically controlled and managed by the CCRM.
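The sketch below illustrates one possible reading of the CMI/CRM relationship: a CMI declares the attributes to watch and the policies to enforce, and a CRM polls the component's management port and invokes corrective actions drawn from the AML. The attribute names, policy structure and the fake sensor are all hypothetical, not Autonomia's actual schema.

```python
# A hypothetical sketch of a CMI (policy specification) driving a CRM
# (runtime monitoring and control). Names and the policy format are assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Policy:
    attribute: str                  # e.g. "response_time_ms"
    limit: float                    # threshold from the CMI specification
    action: Callable[[], None]      # corrective action, e.g. from the AML


@dataclass
class ComponentManagementInterface:
    component_name: str
    policies: List[Policy]


class ComponentRuntimeManager:
    def __init__(self, cmi: ComponentManagementInterface,
                 read_port: Callable[[str], float]):
        self.cmi = cmi
        self.read_port = read_port  # management-port read function

    def step(self) -> None:
        """One monitoring pass: check every policy and react to violations."""
        for policy in self.cmi.policies:
            observed = self.read_port(policy.attribute)
            if observed > policy.limit:
                policy.action()


# Example wiring with a fake management port and a trivial corrective action.
metrics: Dict[str, float] = {"response_time_ms": 120.0}
cmi = ComponentManagementInterface(
    component_name="web_server",
    policies=[Policy("response_time_ms", 100.0,
                     lambda: print("CRM: restarting web_server replica"))],
)
crm = ComponentRuntimeManager(cmi, read_port=lambda attr: metrics[attr])
crm.step()  # prints the corrective action because the limit is exceeded
```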


Benefits of NSF CAC Membership

CAC members are afforded access to leading-edge developments in autonomic computing and to knowledge accumulated by academic researchers and other industry partners. New members will join a growing list of founding members that currently includes BAE Systems, EWA Government Systems, IBM, Intel, Merrill Lynch, Microsoft, Northrop Grumman, NEC, Raytheon, Xerox, Avirtech, Citrix, Imaginestics, and ISCA Technologies. Benefits of membership include:

Collaboration with faculty, graduate students, post-doctoral researchers and other center partners;

Choice of project topics to be funded by members' own contributions;

Formal periodic project reviews along with continuous informal interaction and timely access to reports, papers and intellectual property generated by the center;

Access to unique world-class equipment, facilities, and other CAC infrastructure;

Recruitment opportunities among excellent graduate students;

Leveraging of investments, projects and activities by all CAC members;

Spin-off initiatives leading to new partnerships, customers or teaming for competitive proposals to funded programs.

Funding

Per NSF guidelines, industry and government contributions in the form of annual CAC memberships ($35K/year per regular membership), coupled with baseline funds from NSF and university matching funds, directly support the Center's expenses for personnel, equipment, travel, and supplies. Memberships provide funds to support the Center's graduate students on a one-to-one basis, and thus the size of the annual membership fee is directly proportional to the cost of supporting one graduate student, while NSF and university funds support various other costs of operation. Multiple annual memberships may be contributed by any organization wishing to support multiple students and/or projects. The initial operating budget for CAC is projected to be approximately $1.5M/year, including NSF and university contributions, in an academic environment that is very cost effective. Thus, a single regular membership is an exceptional value: it represents less than 3% of the projected annual budget of the Center yet reaps the full benefit of Center activities, a research program that could be significantly more expensive in an industry or government facility.

To Become a Member, Contact Us

Director: Salim Hariri, (520) 621-4378, [email protected]
Mazin Yousif, (503) 819-4638, [email protected]
Director: Youssif Al-Nashif, (520) 621-9915, [email protected]
Manager: Firas Barakat, (520) 548-6325, [email protected]

Dept., 1230 E. Speedway, Tucson, AZ 85721-0104
http://nsfcac.arizona.edu

The University of Florida, the University of Arizona and Rutgers, the State University of New Jersey, have established a national research center for autonomic computing (CAC). This center is funded by the Industry/University Cooperative Research Center program of the National Science Foundation, CAC members from industry and government, and university matching funds.