research challenges in autonomic computing

IBM Research

© 2003 IBM Corporation

Research Challenges in Autonomic Computing

Jeff Kephart, IBM Research

[email protected]/autonomic


© 2003 IBM Corporation | Research Challenges in Autonomic Computing | Rutgers University, November 13, 2003

Outline

Background and Motivation

Autonomic Computing Research at IBM
  Architecture
  Overview of Research Program

Autonomic Computing Research Challenges

Conclusions



Background and Motivation (Kephart)

My role in autonomic computing
  My group does research on agents and multi-agent systems
    – Architecture, communication, negotiation, machine learning
  AC research strategy; joint program manager
  University relations; faculty awards, equipment grants
  Chair, Autonomic Computing Advisory Board

What I hope to achieve here
  Stir up interest in autonomic computing research
  Explore collaborations with IBM Research
  Learn from you: new viewpoints, new approaches



Complex heterogeneous infrastructures are a reality!

[Figure: a heterogeneous enterprise infrastructure. Labeled parts: directory and security services, existing applications and data, business data, data server, web application server, storage area network, BPs and external services, web server, DNS server, data.]

Dozens of systems and applications

Hundreds of components

Thousands of tuning parameters



Autonomic Computing: Motivation

Individual system elements are increasingly difficult to maintain and operate
  100s of config, tuning parameters for commercial databases, servers, storage
Heterogeneous systems are becoming increasingly connected
  Integration becoming ever more difficult
Architects can't intricately plan component interactions
  Increasingly dynamic; more frequently with unanticipated components
This places a greater burden on system administrators, but
  they are already overtaxed
  they are already a major source of cost (6:1 for storage) and error
We need self-managing computing systems
  Behavior specified by sys admins via high-level policies
  System and its components figure out how to carry out policies



Facets of Self-Management

Self-configure
  The human-intensive present: Corporate data centers are multi-vendor, multi-platform. Installing, configuring, integrating systems is time-consuming, error-prone.
  The autonomic future: Automated configuration of components, systems according to high-level policies; rest of system adjusts seamlessly.

Self-heal
  The human-intensive present: Problem determination in large, complex systems can take a team of programmers weeks.
  The autonomic future: Automated detection, diagnosis, and repair of localized software/hardware problems.

Self-optimize
  The human-intensive present: Web servers, databases have hundreds of nonlinear tuning parameters; many new ones with each release. Adjusted manually.
  The autonomic future: Components and systems will continually strive to improve their own performance and efficiency.

Self-protect
  The human-intensive present: Manual vulnerability analysis. Manual detection and recovery from attacks, cascading failures.
  The autonomic future: Automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.

Business case:
  Increased resiliency, responsiveness, efficiency, ROI
  Reduced down-time, risk, time-to-value, cost



Evolving towards Autonomic Computing Systems

Five levels, from manual (Level 1) to autonomic (Level 5):

Level 1
  Characteristics: Multiple sources of system-generated data
  Skills: Extensive, highly skilled IT staff
  Benefits: Basic requirements met

Level 2
  Characteristics: Data & actions consolidated through mgt tools
  Skills: IT staff analyzes & takes actions
  Benefits: Greater system awareness; improved productivity

Level 3
  Characteristics: System monitors, correlates & recommends actions
  Skills: IT staff approves & initiates actions
  Benefits: Less need for deep skills; faster/better decision making

Level 4
  Characteristics: System monitors, correlates & takes action
  Skills: IT staff manages performance against SLAs
  Benefits: Human/system interaction; IT agility & resiliency

Level 5
  Characteristics: Components dynamically respond to business policies
  Skills: IT staff focuses on enabling business needs
  Benefits: Business policy drives IT mgt; business agility and resiliency



Autonomic Computing Architecture: The Autonomic Element

AEs are the basic atoms of autonomic systems
An AE contains
  Exactly one autonomic manager
  Zero or more managed element(s)
    – E.g. database, storage, server, software app, workload mgr, sentinel, arbiter, OGSA infrastructure elements
An AE is responsible for
  Managing its own behavior in accordance with policies
  Interacting with other autonomic elements to provide or consume computational services

[Figure: An Autonomic Element. An Autonomic Manager (Monitor, Analyze, Plan, Execute, arranged around Knowledge) sits above a Managed Element and connects to it through the ports labeled E and S.]

Service-oriented architecture
Software agents
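The monitor, analyze, plan, execute loop around shared knowledge can be sketched in a few lines. This is only an illustrative sketch, not code from the Autonomic Manager Toolkit: the `ManagedElement` worker pool, its `sense`/`effect` methods, and the `max_queue_per_worker` policy parameter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedElement:
    """Hypothetical managed resource: a worker pool draining a request queue."""
    queue_length: int = 0
    workers: int = 1

    def sense(self) -> dict:
        # Sensor side: expose current state to the manager.
        return {"queue_length": self.queue_length, "workers": self.workers}

    def effect(self, workers: int) -> None:
        # Effector side: let the manager change a setting.
        self.workers = workers


@dataclass
class AutonomicManager:
    element: ManagedElement
    # Knowledge: here a single assumed policy parameter.
    knowledge: dict = field(default_factory=lambda: {"max_queue_per_worker": 10})

    def monitor(self) -> dict:
        return self.element.sense()

    def analyze(self, reading: dict) -> int:
        limit = self.knowledge["max_queue_per_worker"]
        # Ceiling division: workers needed to keep each share of the queue under the limit.
        return max(1, -(-reading["queue_length"] // limit))

    def plan(self, reading: dict, needed: int):
        # Plan a change only when the current setting differs from what is needed.
        return needed if needed != reading["workers"] else None

    def execute(self, new_workers) -> None:
        if new_workers is not None:
            self.element.effect(new_workers)

    def run_once(self) -> None:
        reading = self.monitor()
        self.execute(self.plan(reading, self.analyze(reading)))


element = ManagedElement(queue_length=35, workers=1)
AutonomicManager(element).run_once()
print(element.workers)  # 4 workers for a queue of 35 at 10 requests per worker
```

A real autonomic manager would run this loop continuously and would also talk to other autonomic elements; the sketch covers only the internal control loop.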



Autonomic Computing Architecture: Element Interactions

System self-* properties, behavior arise from interactions among autonomic managers
Interactions are
  Dynamic, ephemeral
  Formed by (negotiated) agreement
  Flexible in pattern; determined by policies
  Based on OGSA and specific AC extensions
    – Required messages
    – Optional but standard
    – Application-specific
For advanced interactions: conversation support
  “Choreography” defines structure of multi-step interactions

A multi-agent system!



Overview of IBM’s Autonomic Computing Research Program

Over 150 researchers working on various aspects of Autonomic Computing
  Some projects predate the AC initiative; now trying to realign them with AC architecture
Technologies for specific autonomic elements
  Database, storage, server, client…
Generic element technologies for autonomic elements
  Autonomic Manager Toolset integrates many element-level technologies
    – Modeling, analysis, forecasting, optimization, planning, feedback control, etc.
  Uses Open Grid Services Architecture standards for inter-element communication
  Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later
Generic system-level technologies
  Dependency management, problem determination and remediation, workload management, provisioning, …
System scenarios and prototypes
  Small- to medium-scale autonomic systems
  Demonstrate self-* arising from AC architecture + technology
  Identify gaps, necessary modifications



LEarning Optimizer for DB2 (LEO)
G. Lohman, Almaden

[Figure: LEO feedback loop. A query passes through SQL compilation, where the optimizer uses statistics to pick the best plan, and then through plan execution. Four numbered steps close the loop: 1. Monitor (actual cardinalities), 2. Analyze (estimated vs. actual cardinalities), 3. Feedback (adjustments), 4. Exploit (adjustments used by the optimizer).]
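One simple way to picture the monitor, analyze, feedback, exploit cycle is a table of multiplicative correction factors, one per predicate. The sketch below is an assumption made for illustration, not LEO's actual algorithm or any DB2 interface; the class name `CardinalityFeedback` and the predicate strings are hypothetical.

```python
class CardinalityFeedback:
    """Toy feedback store: remembers how far off past cardinality estimates were."""

    def __init__(self):
        # predicate text -> multiplicative adjustment learned from execution
        self.adjustments = {}

    def estimate(self, predicate: str, base_estimate: float) -> float:
        # Exploit: scale the optimizer's base estimate by the learned factor.
        return base_estimate * self.adjustments.get(predicate, 1.0)

    def record(self, predicate: str, base_estimate: float, actual: float) -> None:
        # Monitor + analyze + feedback: compare estimate with reality, keep the ratio.
        if base_estimate > 0:
            self.adjustments[predicate] = actual / base_estimate


feedback = CardinalityFeedback()
first = feedback.estimate("age > 60", 1000)    # no history yet: estimate unchanged
feedback.record("age > 60", 1000, 250)         # execution saw only 250 rows
second = feedback.estimate("age > 60", 1000)   # later compilations use the adjustment
print(first, second)
```

The point of the sketch is the loop structure: estimates that turn out wrong at execution time change what the optimizer believes the next time the same predicate is compiled.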



IBM IceCube Server
R. Freitas, Almaden

[Figure: a prototype “brick” (6" size label; (12) 2.5" disks, 8-port switch, Linux on fast CPU) with (6) 10 Gbit/s capacitive “couplers” per brick and a “thermal bus array”; and the full IceCube system, a 3D mesh @ 10 Gb/s per link of storage bricks (blue) and compute bricks (yellow). No connectors, wires, fibers, lasers or fans.]

Lego-like collection of “intelligent bricks”
Fail-in-place policy: bad bricks are left in place
7x smaller than equivalent standard systems
Fast, power-hungry components (CPU etc.) ok
Includes resource allocation software
First application: petabyte-class storage server intended to be managed by one person



SLEDS (SLA-based management of storage performance)
D. Chambliss, Almaden

Storage customers establish SLAs w/ storage system
Storage system throttles optimally in accord w/ SLAs

[Figure: response time (ms, log scale 1 to 1000) vs. demand (k IOPS, 0 to 100), with regions marked “On Target” and “Missed Target” relative to each customer policy.]

[Figure: system diagram with storage customers, SAN fabric, storage server, SLA server and manager.]



Personal software configuration
D. Bantz & D. Frank, Watson

[Figure: pipeline labeled PPE. Inventory collection feeds Analysis (characterize inventory, driven by analysis rules), which feeds Plan (choose components, resolve dependencies; driven by planning rules, policies and a SmartCatalog; questions: clean up? new software? make space?), which ends in install/uninstall.]

Automate SW maint & migration on personal devices
  “Upgrade all my applications”
  “Make my new laptop work like the old one”
  “Migrate most valuable Palm apps to my PC”



Autonomic Manager Toolkit
W. Arnold et al., Watson

Facilitates autonomic mgr construction
  In accordance w/ AC architecture
Catcher for generic AM technologies
  OGSA messaging
  Policy tools
  Monitoring technologies
  AI tools for knowledge representation, reasoning
  Math libraries for modeling, analysis, planning
  Feedback control
V1.0 available as part of Emerging Technologies Toolkit v1.1 on IBM alphaWorks (www.alphaworks.ibm.com)
  Considering open source

[Figure: An Autonomic Element, with an Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) above a Managed Element and E/S ports on both.]



Policies and Autonomic Computing
D. Verma and D. Kandlur, Watson

Policy: Set of guidelines or directives provided to an autonomic element to influence its behavior.
Key challenge:
  Move away from low-level controls
  Move towards high-level directives (policies) over autonomic decisions
Developing scenarios, standards and technologies to support policies for autonomic computing

[Figure: two autonomic elements, each drawn with M, A, P, E and K blocks and S and E ports, annotated with the numbered callouts below.]

1. External policies are delivered through effectors.
2. Policies are stored as knowledge.
3. Analyze: analyzes system operation w.r.t. policies; creates reports as dictated by policy.
4. Plan: assigns tasks based on policies; assigns resources based on policies; enables sensors; adds/modifies/deletes policies.
5. Enabled/disabled based on policies.
6. Enabled/disabled based on external policies.



Mathematical Modeling and Optimization

M. Squillante, Watson

Develop and implement sophisticated mathematical methods and algorithms to support AC systems
  Modeling
    – Statistical analysis
    – Stochastic models
    – Forecasting
  Optimization
    – Discrete
    – Stochastic
    – Nonlinear
  Control
    – Control theory
    – Dynamical systems
    – Chaos

[Figure: model diagram with think times, a router and servers.]



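Generic Adaptive Control
J. Hellerstein, Watson

[Figure: feedback loop around an Apache server handling web service requests. The admin supplies targets CPU* and Mem*; the controller compares them with the measured CPU and Mem and sets the KeepAlive and MaxClients effectors. The blocks are also tagged with the M, A, P, E stages and the E/S ports.]

Feedback control to tune effectors
Based on high-level behavioral specs
  Multiple goals
  Multiple effectors
  Time-varying demand
Various database and server applications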
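Feedback control of an effector can be illustrated with a single-input integral controller that nudges a MaxClients-style setting until measured CPU utilization reaches a target. Everything numeric here is an assumption: `simulate_cpu` is a toy linear plant, not measured Apache behavior, and the gain was picked by hand so the toy loop is stable. The multi-goal, multi-effector case named on the slide is not covered.

```python
def simulate_cpu(max_clients: int) -> float:
    """Toy plant model (assumed): CPU utilization grows linearly with MaxClients."""
    return min(1.0, 0.004 * max_clients)


def integral_controller(target: float, gain: float, start: int, steps: int):
    """Repeatedly adjust max_clients in proportion to the remaining CPU error."""
    max_clients = start
    history = []
    for _ in range(steps):
        cpu = simulate_cpu(max_clients)          # monitor
        error = target - cpu                     # analyze: distance from the goal
        max_clients = max(1, round(max_clients + gain * error))  # plan + execute
        history.append((max_clients, cpu))
    return history


history = integral_controller(target=0.6, gain=100.0, start=50, steps=30)
final_clients, _ = history[-1]
print(final_clients, simulate_cpu(final_clients))
```

With these assumed numbers the setting climbs from 50 and settles close to the value where the toy model reports 60% CPU; rounding to whole clients leaves a small residual error.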



Utility Functions and Autonomic Computing
W. Walsh, Watson

Utility functions can guide autonomic decision making within an element
  Self-optimization: natural and flexible way to express optimization criteria based on business objectives
    – Avoids hard-coded preferences, special-purpose algorithms
Basis for translating business-level objectives into resource allocation objectives
  Algorithms based on modeling and optimization

[Figure: utility function V(RT) plotted against response time RT.]
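To make the link from a utility function V(RT) to a resource allocation concrete, the sketch below splits a fixed pool of servers between two service classes so that total utility is maximal. The response-time model (a single-queue approximation), the shape of the utility curve, and all class parameters are assumptions for illustration only.

```python
import math


def response_time(servers: int, arrival_rate: float, service_rate: float = 10.0) -> float:
    """Assumed performance model: RT = 1 / (capacity - load), infinite when overloaded."""
    capacity = servers * service_rate
    if capacity <= arrival_rate:
        return math.inf
    return 1.0 / (capacity - arrival_rate)


def utility(rt: float, target: float, value: float) -> float:
    """Assumed V(RT): full value up to the target, linear decay to zero at twice the target."""
    if rt <= target:
        return value
    if rt >= 2 * target:
        return 0.0
    return value * (2 * target - rt) / target


def best_split(total_servers: int, classes):
    """Enumerate every split between two classes and keep the one with highest total utility."""
    best = None
    for n in range(total_servers + 1):
        alloc = (n, total_servers - n)
        total = sum(
            utility(response_time(k, c["arrival_rate"]), c["target"], c["value"])
            for k, c in zip(alloc, classes)
        )
        if best is None or total > best[1]:
            best = (alloc, total)
    return best


classes = [
    {"arrival_rate": 25.0, "target": 0.05, "value": 100.0},  # hypothetical "gold" class
    {"arrival_rate": 15.0, "target": 0.25, "value": 40.0},   # hypothetical "silver" class
]
alloc, total = best_split(6, classes)
print(alloc, round(total, 2))
```

Changing the business-level inputs (the `value` and `target` of each class) changes the allocation without touching the algorithm, which is the property the slide attributes to utility functions.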



Dependency Mgt & Self-Healing
G. Kar, Watson and H. Lee & S. Ma, Watson

Determine functional dependencies among elements
  Mine design docs, system config metadata, log files
  Actively probe running system
Use dependency information for system management
  Localize problem (real-time active inference & learning)

Dependency Matrix (rows: probes; columns: components)

           WS   AS   DBS  R    HWS  HAS  HDBS
pWS        1    1    1    1    1    1    1
pAS        0    1    1    1    0    1    1
pDBS       0    0    1    1    0    0    1
pingR      0    0    0    1    0    0    0
pingWS     0    0    0    1    1    0    0
pingAS     0    0    0    1    0    1    0
pingDBS    0    0    0    1    0    0    1

[Figure: a probe station with analysis & control sends probes through the router to the web server, app server and DB server, which run on HWS, HAS and HDBS.]
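The dependency matrix can drive a very simple form of problem localization. The sketch below assumes that a 1 in a probe's row means the probe fails whenever that component is down, and that exactly one component is faulty at a time; under those assumptions the faulty component is the one whose column matches the observed pattern of probe failures. The function name `localize` is hypothetical, and the slide's real-time active inference and learning are not modeled.

```python
COMPONENTS = ["WS", "AS", "DBS", "R", "HWS", "HAS", "HDBS"]

# Rows copied from the dependency matrix: probe -> dependence on each component.
DEPENDENCY = {
    "pWS":     [1, 1, 1, 1, 1, 1, 1],
    "pAS":     [0, 1, 1, 1, 0, 1, 1],
    "pDBS":    [0, 0, 1, 1, 0, 0, 1],
    "pingR":   [0, 0, 0, 1, 0, 0, 0],
    "pingWS":  [0, 0, 0, 1, 1, 0, 0],
    "pingAS":  [0, 0, 0, 1, 0, 1, 0],
    "pingDBS": [0, 0, 0, 1, 0, 0, 1],
}


def localize(failed_probes):
    """Return components whose column equals the observed failure vector (single-fault assumption)."""
    observed = [1 if probe in failed_probes else 0 for probe in DEPENDENCY]
    suspects = []
    for i, component in enumerate(COMPONENTS):
        column = [row[i] for row in DEPENDENCY.values()]
        if column == observed:
            suspects.append(component)
    return suspects


print(localize({"pWS", "pAS", "pingAS"}))  # only the HAS column matches this pattern
```

Because all seven columns of this matrix are distinct, every single fault produces a unique failure signature; with fewer probes, several components could share a column and the result would be a list of suspects.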



Human Interaction with Autonomic Systems
P. Maglio, Almaden

Basic questions
  What do middleware administrators do?
  How can we better support the problems and practices they have?
Learn answers to these questions via ethnographic studies
Use insights to develop new ways to interact with complex computing systems

… but we thought that was the return port!

We had it wrong. Our assumption of how it worked was incorrect.

We start with looking at the proxy server log files, then the web server log files, then the application server admin log files, then the application log files.



Enterprise Workload Management
D. Dillenberger

[Figure: Internet, appliance servers, web application servers, data and transaction servers, Internet/extranet, business partners.]

Large, distributed, heterogeneous system
Achieves end-to-end performance via adaptive algorithms
  Administrator defines policy
    – Desired response times for various classes of users, apps
  eWLM managers on each resource cooperate to adaptively tune parameters
    – OS, network, storage, virtual server knobs
    – JVM heap size, # garbage collection threads
    – Workload balancing, routing parameters



Example scenario: Autonomic Data Center

[Figure: Autonomic Data Center. Two application environments, each with an application manager, router, server, database and storage, serve clients 1-1 and 1-2 and clients 2-1 and 2-2. A resource arbiter, registry, policy repository and system manager complete the data center. Arrows are labeled “Resource-level utility” and “Service-level utility”.]



Outline

Scenarios
Autonomic Computing Research Challenges
  Systems and Software
    – Architecture, software engineering & tools, testing/validation
    – Prototyping a large-scale self-* system
  Human-Computer Interaction
    – Policies, interfaces
  Artificial Intelligence
    – Learning, negotiation, self-healing, emergent behavior
Conclusions



Challenge: Architecture

AE: How to coordinate multiple threads of activity?
  AEs live in complex environments
  Multiple task instances and types
    – concurrent, asynchronous
  Multiple interacting expert modules
AE: How to detect/resolve conflicts arising from
  Internal decisions by independent expert modules
  External directives (possibly asynchronous)
  Internal policies vs. external directives
System-level: Enable more flexible, service-oriented patterns of interaction
  As opposed to traditional top-down, hierarchical systems management
  Multi-agent architecture
    – Communication
    – Representing and reasoning about needs, capabilities, dependencies

[Figure: autonomic element, with an Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) above a Managed Element and E/S ports on both.]

Define a set of fundamental architectural principles from which self-* emerges



Challenge: Software engineering and programming tools

Develop appropriate software engineering concepts and programming tools for composing autonomic elements and systems; support for
  Monitoring, analysis, planning and execution
  Expressing and understanding policies
  Interactions with other elements
    – Negotiation
    – Monitoring and enforcing agreements



Challenge: Testing and Verification

Develop methods for testing and verifying behavior of autonomic elements
  Testbeds and simulation environments
  In situ mechanisms that permit new versions of software to run alongside old versions until they have established their trustworthiness



Challenge: Policy

Policy: “Set of guidelines or directives provided to autonomic element to influence its behavior”

[Figure: autonomic element, with an Autonomic Manager (Monitor, Analyze, Plan, Execute, Knowledge) above a Managed Element and E/S ports on both.]

Human interface
  Authoring and understanding policies
  Avoiding or ameliorating specification errors
Developing a universal representation and grammar
  Many different application domains, disciplines
  Many different flavors of policy
  Covers service agreements too?
Algorithms that operate upon policies (and agreements?)
  Automated derivation of actions (e.g. planning, optimization)
  Automated derivation of lower-level policies from high-level policies
    – E.g. “Maximize profit from this set of service contracts”
Conflict resolution
  Both design time and run time
  Need to establish protocols, interfaces, algorithms



Three flavors of policy (policy = “decision-making guide”)

[Figure: from the current state S, actions a1, a2 and a3 lead to possible states 1, 2 and 3.]

Action rule
  If (S) then do a2
  Results implicitly in desired state 2
Goal
  Achieve a most desired state 2
  Compute the action a2 most likely to result in state 2
  Assumes that the most desired state can be determined a priori
Utility function
  Achieve the state with maximal net value: V(state) – C(action taken from S)
  Benefit and burden of being explicit about value
  States have intrinsic value; value of policy is a derived quantity

[Figure: hierarchy of specifications, from machine code, rules and actions up through workflows and element goals to element utility functions and system utility functions (“higher-level specifications”, “more levels of code hierarchy”). The levels are linked by programming, adapters/translators, generative planning, decision-theoretic planning, modeling and optimization.]
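The difference between the three flavors shows up as soon as values and costs are written down. In the sketch below every number and name is hypothetical, and transitions are deterministic for simplicity (the slide's goal flavor speaks of the action "most likely" to reach the state). The action rule hard-codes a2, the goal policy searches for an action that reaches state 2, and the utility policy compares net value across all actions.

```python
# Hypothetical model: each action taken from current state "S" leads to one state.
TRANSITIONS = {"a1": "state1", "a2": "state2", "a3": "state3"}
VALUE = {"state1": 40.0, "state2": 100.0, "state3": 70.0}   # V(state), assumed
COST = {"a1": 5.0, "a2": 50.0, "a3": 10.0}                  # C(action), assumed


def action_rule(current_state: str):
    """Action rule: 'if (S) then do a2'. The desired state is only implicit."""
    return "a2" if current_state == "S" else None


def goal_policy(goal_state: str) -> str:
    """Goal: name the desired state and compute an action that reaches it."""
    return next(a for a, s in TRANSITIONS.items() if s == goal_state)


def utility_policy() -> str:
    """Utility function: pick the action with maximal net value V(state) - C(action)."""
    return max(TRANSITIONS, key=lambda a: VALUE[TRANSITIONS[a]] - COST[a])


print(action_rule("S"), goal_policy("state2"), utility_policy())
```

With these assumed numbers state 2 has the highest value, so the rule and the goal both select a2, but a2 is expensive and the utility policy selects a3, whose net value (60) beats a2's (50). That is the "benefit and burden of being explicit about value".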



Policies: Theory meets Reality

We can’t specify the full state of the world
  Policy conflicts can arise from incomplete descriptions of state
  E.g. different action-rule antecedents can apply to the same state, but have conflicting consequents
  Goal-type policies can conflict too (sets of acceptable and feasible states don’t intersect)
It’s hard to elicit a full specification of desired behavior from people
  Preference elicitation is difficult when there are many attributes
  But people are good at noticing when the system isn’t behaving as they like
    – “Complaint-based tuning” (Ganger, CMU)
Can a universal representation and calculus handle such a broad range?
  Storage, network, database, server, etc.
  Temporal conditions; correlations
  Access control
  Classification
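The action-rule conflict mentioned above (different antecedents applying to the same state with conflicting consequents) can be detected at run time by checking which rules fire together. The two rules, their thresholds, and the representation of an action as a (target, operation) pair are hypothetical, chosen only to show the mechanism.

```python
from itertools import combinations

# Hypothetical action rules: each looks at a different part of the state.
RULES = [
    {"name": "save-power", "when": lambda s: s["load"] < 0.3,
     "action": ("servers", "remove")},
    {"name": "protect-sla", "when": lambda s: s["response_ms"] > 200,
     "action": ("servers", "add")},
]


def conflicts(state: dict, rules):
    """Return pairs of rules that fire in this state but demand opposite operations on one target."""
    fired = [r for r in rules if r["when"](state)]
    return [
        (a["name"], b["name"])
        for a, b in combinations(fired, 2)
        if a["action"][0] == b["action"][0] and a["action"][1] != b["action"][1]
    ]


# Low load but slow responses: both antecedents hold and the consequents disagree.
print(conflicts({"load": 0.2, "response_ms": 350}, RULES))
print(conflicts({"load": 0.2, "response_ms": 100}, RULES))
```

Neither rule is wrong in isolation; the conflict appears only because each antecedent describes the state incompletely, which is the situation the slide warns about.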



Challenge: Human-System Interface

Develop new languages, metaphors and translation technologies that enable humans to monitor, visualize, and control AC systems
  Specify goals and objectives to AC systems, and visualize their potential effect
  Techniques must be
    – Sufficiently expressive of preferences regarding cost vs. performance, security, risk and reliability
    – Sufficiently structured and/or naturally suited to human psychology and cognition to keep specification errors to an absolute minimum
    – Robust to specification errors



Challenge: Learning

Single element level
  AE needs to learn a model of itself and its environment quickly; the environment is noisy, and dynamic in both state and structure
  On-line, so exploration of the space can be costly and/or harmful
  May be several hundreds of tunable parameters!
    – Maybe only a few dozen are relevant, but which ones?
    – Some of them can only be changed upon reboot – is it worthwhile?
System level
  Multi-agent system: several interacting learners
  What are good learning algorithms for cooperative, competitive systems?
    – What are conditions for stability?
    – What is sensitivity to perturbations?
  Opportunities for layered learning

Establish theoretical foundation for understanding and performing learning and optimization in multi-agent systems.



Challenge: Negotiation

Develop and analyze
  Methods for expressing or computing preferences
  Negotiation protocols
  Negotiation algorithms
Establish theoretical foundation for negotiation
  Explore conditions under which to apply
    – Bilateral
    – Multi-lateral (mediated, or not)
    – Supply-chain
  Study how system behavior depends on mixture of negotiation algorithms in AE population



Challenge: Self-Healing Systems

[Figure: self-healing architecture. Labeled parts: GUI; inference & learning engines; probe driver; real-time event mgr; diagnostic state, dependency info and config; problem diagnosis/localization mgr; simulator & action mgr; problem determination DB; probe station; remediator; dependency agent; and the managed web server, database and network router.]

Develop robust, scalable approaches to monitoring/controlling health, security and performance of autonomic systems
  Automated capture of human expert knowledge about problem diagnosis and recovery
  Predictive, adaptive diagnosis/recovery
  Data mining to learn correlated event patterns for diagnosis
  Automated learning and execution of appropriate recovery plan
  Construction and learning of adaptive statistical models of large networked systems
  And do it all without being too invasive!



Challenge: Control and Harness Emergent Behavior

Understand, control, and exploit emergent behavior in autonomic systems
  How do self-*, stability, etc. depend on
    – Behaviors and goals of the autonomic elements
    – Pattern and type of interactions among AEs
    – External influences and demands on system
  Invert relationship to attain desired global behavior
    – How?
    – Are there fundamental limits?
Develop theory of interacting feedback loops
  Hierarchical
  Distributed



Conclusions

Autonomic Computing is a grand challenge, requiring advances in several fields of science and technology
  Policy, planning, learning, knowledge representation, multi-agent systems, negotiation, emergent behavior
  Human-system interfaces
Integrating these technologies to support self-management in complex, realistic environments is a research challenge in itself
  What are the best architectures and design patterns? Role of (multi-)agent systems?
  Building system prototypes is key to developing and validating AC technology and architecture
What to do if you’re interested in working on these problems
  Just go do it and publish your results
  Find an IBM Researcher who is interested in collaborating with you (I can help)
    – Get them to help you pursue a faculty award or equipment grant
How can we establish a research community around autonomic computing?
  International Conference on Autonomic Computing, May 17-18, 2004, New York City
    – Co-located with WWW 2004
    – Co-chair: Manish Parashar
  What about defining challenge problems?
    – We have developed several realistic industry scenarios that could serve as a basis



Additional Information

A Vision of Autonomic Computing
  IEEE Computer, January 2003
IBM Systems Journal special issue on Autonomic Computing
  http://www.research.ibm.com/journal/sj42-1.html
Web site
  www.research.ibm.com/autonomic
International Conference on Autonomic Computing
  www.autonomic-conference.org
  May 17-18, 2004, New York City
  Submission deadline: January 12, 2004



Backup Slides



Other Autonomic Computing Workshops and Conferences

First Workshop on Algorithms and Architectures for Self-Managing Systems (at FCRC ’03)
  June 11, 2003 in San Diego, CA
5th Annual International Conference on Active Middleware Services: Autonomic Computing Workshop
  June 25, 2003 in Seattle, WA
IJCAI-03 AI and Autonomic Computing: Developing a Research Agenda for Self Managing Computer Systems
  August 10, 2003 in Acapulco, Mexico
First International Workshop on Autonomic Computing Systems at 14th International Conference on Database and Expert Systems Applications (DEXA'2003)
  1-5 September, 2003 in Prague, Czech Republic
14th IFIP/IEEE International Workshop on Distributed Systems: Operations & Management (DSOM-03)
  October 20-22, 2003 in Heidelberg, Germany



Challenge: Putting it all together into a self-managing system
Autonomic Thermostat scenario

[Figure: one controller above four thermostats, each attached to an AC mechanism.]

Controller
  Locus of high-level policy optimization
  Authority over thermostats in domain
Thermostat
  Local knowledge of environment
  Direct control of cooling mechanism
  Varying degrees of sophistication



Scenario: Autonomic Thermostat

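[Figure: Value function. Value ($) vs. temperature (axis 50 to 90); marked points $100 at 72 and $80 at 76. Caption: How much would you pay to get temperature T?]

[Figure: Cost function. Cost ($, axis up to $100) vs. temperature (axis 50 to 90); marked temperatures 72 and 76, marked costs $25 and $35. Caption: How costly is it to attain temperature T?]

Controller Policy: Choose the temperature that maximizes
U(Temperature) = Value(Temperature) – Cost(Temperature)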
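The controller policy amounts to a one-dimensional search for the temperature with the highest net utility. In the sketch below both curves are assumptions: the value function is a tent shape chosen to pass through the marked points ($100 at 72, $80 at 76), and the cost function is a straight line that assumes, without confirmation from the slide, that $35 goes with 72 and $25 with 76 (cooling further costs more).

```python
def value(temperature: float) -> float:
    """Assumed value curve: $100 at 72 degrees, dropping $5 per degree away from it."""
    return max(0.0, 100.0 - 5.0 * abs(temperature - 72))


def cost(temperature: float) -> float:
    """Assumed cooling cost: $35 at 72 degrees, $2.50 cheaper per degree warmer."""
    return max(0.0, 35.0 - 2.5 * (temperature - 72))


def net_utility(temperature: float) -> float:
    # U(Temperature) = Value(Temperature) - Cost(Temperature)
    return value(temperature) - cost(temperature)


def best_temperature(low: int = 50, high: int = 90) -> int:
    """Search whole-degree settings over the plotted range for the maximum of U."""
    return max(range(low, high + 1), key=net_utility)


t_star = best_temperature()
print(t_star, net_utility(t_star))
```

With these assumed curves the optimum stays at 72 because value falls faster than cost as the setting warms; a steeper cost curve would push the optimum upward, which is the trade-off the controller policy is meant to capture.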



Scenario: Autonomic Thermostat

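[Figure: message flow in the thermostat scenario, with one controller and three thermostats, each attached to an AC mechanism. The controller asks a thermostat “kWh1(T)?”; the thermostat, which knows kWh(Tcurrent, Textern, T), returns kWh1(T) (plot: kWh vs. temperature, axis 50 to 90, with values 2.5 and 3.5 marked at temperatures 72 and 76). The policy repository supplies the value function (plot: $100 at 72, $80 at 76). The power company supplies C(kWh) (plot: cost vs. kWh, axis 0 to 10). From these the controller evaluates V1(T) – C1(T).]

Determine T* that maximizes V1(T) – C1(T)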



Scenario: Autonomic Thermostat
Conflict Resolution

[Figure: controller above a thermostat and its AC mechanism, with a separate manual control input.]

Controller: Temp. goal = T* ± δ*
Man. Control: Temp. goal = T’ ± δ’

Action Policies
1. If (in cooling mode && Tcurr < T* - δ*) then turn AC off
2. If (in cooling mode && Tcurr > T* + δ*) then turn AC on

Priority Policies
1. Abide by temp goal from entity with higher authority
2. If (cost exceeds X) reset temp goal to affordable value

[Figure: value function (value in $ vs. temperature, axis 50 to 90; $100 at 72, $80 at 76) and cost function (cost in $ vs. temperature, axis 50 to 90; marked costs $25 and $35 at marked temperatures 72 and 76).]
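The two action policies and two priority policies of this scenario fit in a pair of small functions. The sketch below is illustrative only: the authority ranks, the cost model passed in as `cost_of`, the cost limit and the affordable fallback goal are all hypothetical values, and the tolerance argument stands for the "+/-" band around the temperature goal.

```python
def ac_command(cooling_mode: bool, t_curr: float, t_goal: float,
               tolerance: float, ac_on: bool) -> bool:
    """Action policies: switch the AC only when the temperature leaves the goal band."""
    if cooling_mode and t_curr < t_goal - tolerance:
        return False          # rule 1: too cold, turn AC off
    if cooling_mode and t_curr > t_goal + tolerance:
        return True           # rule 2: too warm, turn AC on
    return ac_on              # inside the band: leave the AC as it is


def effective_goal(goals, cost_of, cost_limit: float, affordable_goal: float):
    """Priority policies: obey the highest authority, unless its goal costs too much."""
    _, t_goal, tolerance = max(goals, key=lambda g: g[0])   # (authority, goal, tolerance)
    if cost_of(t_goal) > cost_limit:
        t_goal = affordable_goal
    return t_goal, tolerance


# Hypothetical inputs: the controller (authority 1) wants 72 +/- 1,
# manual control (authority 2) wants 68 +/- 0.5.
goals = [(1, 72.0, 1.0), (2, 68.0, 0.5)]
goal, band = effective_goal(goals, cost_of=lambda t: 35.0 + 2.5 * (72 - t),
                            cost_limit=40.0, affordable_goal=70.0)
print(goal, band, ac_command(True, 71.0, goal, band, ac_on=False))
```

Here the manual goal wins on authority, but its assumed cost (45) exceeds the limit (40), so the second priority policy resets the goal to the affordable 70 while keeping the manual tolerance.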
