johan tordsson department of computing...

Introduction to Autonomic Computing

Johan Tordsson

Department of Computing Science

www.cloudresearch.org

About me
•  MSc (Civ.Ing) Computer Science (2004)
•  PhD Umeå, Grid computing (2009)
•  Postdoc in Madrid, Spain (2009), OpenNebula
•  Architect etc. in misc. EC projects (2009-2013)
•  Associate professor (2014 - now)
•  Research
   –  Autonomic cloud and data center management
   –  How to make clouds run themselves faster/better/cheaper?
•  Spare time job:
   –  CTO & co-founder of Elastisys (UMU cloud research spinoff)
   –  Evangelizing that computers (will) beat humans at IT operations

Outline
•  Why
   –  do we need autonomic computing?
•  What
   –  are autonomic systems?
•  How
   –  to build these autonomic systems?
•  When
   –  will they happen?
•  Who
   –  will build them?

Motivation: software complexity

Motivation: scale

•  Enormous buildings with servers, storage equipment, networking, cooling

•  A factory for IT services


Motivation: faults
•  Question: what is the probability of a hard drive failure?
   –  In my laptop?
      •  Will happen every few years, hopefully not right now…
   –  In a large supercomputer or data center?
      •  More than 100k nodes
      •  Will happen during this talk!
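As a rough back-of-the-envelope check of the laptop vs. data center comparison, the sketch below estimates the chance that at least one drive fails during a one-hour talk. The 2% annual failure rate, the four drives per node, and the independence of failures are illustrative assumptions, not figures from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def p_any_failure(n_drives: int, annual_failure_rate: float, hours: float) -> float:
    """Probability that at least one of n independent drives fails within `hours`."""
    # Per-drive survival probability over the window, assuming a constant failure rate.
    p_survive_one = (1.0 - annual_failure_rate) ** (hours / HOURS_PER_YEAR)
    return 1.0 - p_survive_one ** n_drives


# Assumed numbers: 2% annual failure rate, a one-hour talk.
laptop = p_any_failure(1, 0.02, 1.0)
# Assumed numbers: 100k nodes with 4 drives each.
datacenter = p_any_failure(100_000 * 4, 0.02, 1.0)
print(f"laptop: {laptop:.2e}, data center: {datacenter:.2f}")
```

Under these assumptions the laptop risk for one hour is negligible, while a failure somewhere in the data center is more likely than not.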

Motivation: costs

•  Question: How many servers can be handled by a system administrator?

•  Very old question…
•  Some numbers:
   –  10 - very complex systems
   –  ~300 - standard large-scale organization
   –  Several 1000s - virtualized data center
   –  26k (Facebook 2013)
•  Higher-level management and better abstractions are needed
   –  Alternative: exponential increase in need for systems management

Autonomic option

•  Autonomic computing
   –  Named after autonomic nervous system
   –  Systems manage themselves according to admin goals
   –  Self-governing operation of entire system, not just parts of it
   –  New components integrate effortlessly - as a new cell establishes itself in the body

Autonomic Computing

•  IBM initiative in early 2000s
•  Landmark paper published 2003 in IEEE Computer by Kephart and Chess @ IBM
•  Active research field since, during 2003-2013:
   –  200 conferences/workshops
   –  8000+ papers
•  Lots of funding
   –  EC FP6, FP7, H2020
   –  WASP…
•  Industry uptake
   –  Many big IT vendors & startups
•  Key point
   –  Self-management of IT systems

Self-management (1/3)

•  Self-management
   –  Changing components
   –  External conditions
   –  Hardware/software failures
•  Ex. component upgrade
   –  Continually check for component upgrades
   –  Download and install
   –  Reconfigure itself
   –  Run a regression test
   –  When it detects errors, revert to the older version
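The component-upgrade example can be sketched as a small upgrade-with-rollback routine. This is a minimal illustration only: `Component`, `self_upgrade`, and the `regression_test` callback are hypothetical names, and real download, install, and reconfiguration steps are reduced to a version bump.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    name: str
    version: int


def self_upgrade(component: Component, latest_version: int,
                 regression_test: Callable[[Component], bool]) -> bool:
    """Install a newer version if one exists; revert if the regression test fails."""
    if latest_version <= component.version:
        return False                       # nothing newer to install
    previous = component.version
    component.version = latest_version     # stands in for download, install, reconfigure
    if regression_test(component):
        return True                        # upgrade kept
    component.version = previous           # errors detected: revert to the older version
    return False


db = Component("db", version=3)
kept = self_upgrade(db, 4, regression_test=lambda c: False)  # simulate a failing test
print(kept, db.version)
```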

Self-management (2/3)
•  Four aspects of self-management
   –  Self-configuration
      •  Configure themselves automatically
      •  High-level policies (what is desired, not how)
   –  Self-optimization
      •  Continually seek ways to improve their operation
      •  Hundreds of tunable parameters
   –  Self-healing
      •  Handle faults and errors
      •  Analyze information from logs and monitors
   –  Self-protection
      •  Malicious attacks
      •  Cascading failures
      •  Admin mistakes

Self-management (3/3)
•  Autonomic computing achievable without self-awareness?
   –  Without hard artificial intelligence
•  (Hollywood) Misconception: machines will take over all human tasks
   –  AI could be a “real danger” (S. Hawking)
   –  Unemployment?
•  Actual idea: machines will free people to manage systems at a higher level

IBM Research

Policy 2011 Keynote © 2009 IBM Corporation June 7, 2011 3

Machines will take over all management tasks, rendering humans superfluous

Visions of Autonomic Computing

Hal 9000, 2001

Terminator

Wrong!

Machines will free people to manage systems at a higher level

Right!

Star Trek: The Next Generation


Autonomic elements

•  Fundamental atom of the architecture
   –  Managed element(s)
      •  Server, database, storage system, etc.
   –  Autonomic manager
•  Responsible for:
   –  Providing its service
   –  Managing behavior according to goals
   –  Interacting with other autonomic elements

44 Computer

interactions among autonomic elements as it will from the internal self-management of the individual autonomic elements, just as the social intelligence of an ant colony arises largely from the interactions among individual ants. A distributed, service-oriented infrastructure will support autonomic elements and their interactions.

As Figure 2 shows, an autonomic element will typically consist of one or more managed elements coupled with a single autonomic manager that controls and represents them. The managed element will essentially be equivalent to what is found in ordinary nonautonomic systems, although it can be adapted to enable the autonomic manager to monitor and control it. The managed element could be a hardware resource, such as storage, a CPU, or a printer, or a software resource, such as a database, a directory service, or a large legacy system.

At the highest level, the managed element could be an e-utility, an application service, or even an individual business. The autonomic manager distinguishes the autonomic element from its nonautonomic counterpart. By monitoring the managed element and its external environment, and constructing and executing plans based on an analysis of this information, the autonomic manager will relieve humans of the responsibility of directly managing the managed element.

Fully autonomic computing is likely to evolve as designers gradually add increasingly sophisticated autonomic managers to existing managed elements. Ultimately, the distinction between the autonomic manager and the managed element may become merely conceptual rather than architectural, or it may melt away, leaving fully integrated, autonomic elements with well-defined behaviors and interfaces, but also with few constraints on their internal structure.

Each autonomic element will be responsible for managing its own internal state and behavior and for managing its interactions with an environment that consists largely of signals and messages from other elements and the external world. An element’s internal behavior and its relationships with other elements will be driven by goals that its designer has embedded in it, by other elements that have authority over it, or by subcontracts to peer elements with its tacit or explicit consent. The element may require assistance from other elements to achieve its goals. If so, it will be responsible for obtaining necessary resources from other elements and for dealing with exception cases, such as the failure of a required resource.

Autonomic elements will function at many levels, from individual computing components such as disk drives to small-scale computing systems such as workstations or servers to entire automated enterprises in the largest autonomic system of all: the global economy.

At the lower levels, an autonomic element’s range of internal behaviors and relationships with other elements, and the set of elements with which it can interact, may be relatively limited and hard-coded. Particularly at the level of individual components, well-established techniques (many of which fall under the rubric of fault tolerance) have led to the development of elements that rarely fail, which is one important aspect of being autonomic. Decades of developing fault-tolerance techniques have produced such engineering feats as the IBM zSeries servers, which have a mean time to failure of several decades.

At the higher levels, fixed behaviors, connections, and relationships will give way to increased dynamism and flexibility. All these aspects of autonomic elements will be expressed in more high-level, goal-oriented terms, leaving the elements themselves with the responsibility for resolving the details on the fly.

[Figure labels: Autonomic manager (Monitor, Analyze, Plan, Execute, Knowledge); Managed element]

Figure 2. Structure of an autonomic element. Elements interact with other elements and with human programmers via their autonomic managers.

Autonomic element details

•  Sensors: monitor environment •  Effectors: tune managed element •  MAPE loop:

–  Process for self-management of autonomic element

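[Figure: the MAPE loop. Inside the Autonomic Manager, Monitor, Analyze, Plan, and Execute are arranged around shared Knowledge; Sensors and Effectors connect the Autonomic Manager to the Managed Element, which has its own Sensors and Effectors.]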

The MAPE loop

1. Monitor
   –  Collect information about state of system
   –  Lots of metrics around
   –  Which ones to gather?
   –  How often to monitor?
4. Execute
   –  Turn the “knobs” of the managed element
   –  Interactions between knobs?
      •  Unknown, even to human operators
      •  At Google, 238 knobs in each managed entity

The MAPE loop (cont.)
2. Analyze
   –  Estimate current state based on monitoring data
   –  Commonly use model of the world for this
      •  “All models are wrong, but some are useful”
      •  What part of system to model? How?
      •  Correlations?
3. Plan
   –  Select action(s), i.e., which knobs to turn?
   –  Can be formulated as optimization problem
   –  Reactive vs. Predictive/Proactive methods
•  Knowledge management
   –  Update model dynamically (monitoring)
   –  Evaluate effects of actions (execution)
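The four MAPE steps can be sketched as one small control loop. This is a minimal illustration under stated assumptions: the managed element is a hypothetical service with a single knob (replica count), the "model" is a three-sample moving average of the request rate, and names such as `ManagedElement`, `AutonomicManager`, and `capacity_per_replica` are invented for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ManagedElement:
    """Hypothetical managed element with one sensor and one effector."""
    replicas: int = 2
    request_rate: float = 100.0              # requests/s, set by the environment

    def read_request_rate(self) -> float:    # sensor
        return self.request_rate

    def set_replicas(self, n: int) -> None:  # effector (the "knob")
        self.replicas = max(1, n)


@dataclass
class AutonomicManager:
    element: ManagedElement
    capacity_per_replica: float = 60.0       # assumed model parameter (knowledge)
    target_utilization: float = 0.7          # admin goal
    history: list = field(default_factory=list)

    def monitor(self) -> None:
        self.history.append(self.element.read_request_rate())

    def analyze(self) -> float:
        recent = self.history[-3:]           # moving average as a very simple model
        return sum(recent) / len(recent)

    def plan(self, estimated_rate: float) -> int:
        usable = self.capacity_per_replica * self.target_utilization
        return max(1, math.ceil(estimated_rate / usable))

    def execute(self, replicas: int) -> None:
        self.element.set_replicas(replicas)

    def step(self) -> None:
        self.monitor()
        self.execute(self.plan(self.analyze()))


element = ManagedElement()
manager = AutonomicManager(element)
manager.step()                               # reacts to the initial load
element.request_rate = 300.0                 # demand increases
for _ in range(3):
    manager.step()
print(element.replicas)
```

The moving average makes the loop react gradually to the load increase, which is one simple way to trade responsiveness against oscillation.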

Engineering challenges (1/3)

•  Life cycle of an autonomic element
   –  Design, test, and verification
      •  Testing autonomic elements a challenge
   –  Installation and configuration
      •  Element registers itself in a directory service
   –  Monitoring and problem determination
      •  Elements will continually monitor themselves
      •  Adaptation, optimization, reconfiguration
   –  Upgrading
   –  Uninstallation or replacement


Engineering challenges (2/3)

•  Relationships among autonomic elements
   –  Specification
      •  Set of output/input services of autonomic elements
      •  Expressed in a standard format
      •  Description syntax and semantics
   –  Location
      •  Find input services that autonomic element needs
   –  Negotiation
   –  Provision
   –  Operation
      •  Autonomic manager oversees the operation
   –  Termination


Engineering challenges (3/3)

•  System-wide issues
   –  Authentication, encryption, signing
   –  Autonomic elements can identify themselves
   –  Autonomic system must be robust against insidious forms of attack
•  Goal specification
   –  Humans provide the goals and constraints
   –  Ensure that goals are specified correctly in the first place
   –  Autonomic systems need to protect themselves from bad input goals:
      •  Inconsistent, implausible, dangerous, or unrealizable

Specifying goals (1/3)

•  Rules
   –  Often simple condition-action pairs
      •  If something happens, do this
      •  If something else happens, do that
      •  …
   –  Can use more complex languages to express states, context, etc.
   –  Explicit enumeration tedious
   –  Very limited ability to express complex actions
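Condition-action rules of this kind can be sketched as a list of (condition, action) pairs evaluated against the current state. The metric names, thresholds, and action names below are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Rule = Tuple[Callable[[State], bool], str]   # (condition, action name)

# Explicitly enumerated rules; thresholds are illustrative assumptions.
RULES: List[Rule] = [
    (lambda s: s["cpu"] > 0.9, "add_server"),
    (lambda s: s["cpu"] < 0.2, "remove_server"),
    (lambda s: s["disk_free"] < 0.1, "clean_logs"),
]


def fired_actions(state: State, rules: List[Rule] = RULES) -> List[str]:
    """Return the actions of all rules whose condition holds in `state`."""
    return [action for condition, action in rules if condition(state)]


print(fired_actions({"cpu": 0.95, "disk_free": 0.05}))
```

Every new situation needs its own rule, which is where the tedium of explicit enumeration shows up.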


Specifying goals (2/3)

•  Utility functions
   –  Mathematical expressions
   –  Map system state to scalar value
   –  Represent high-level objectives
   –  What parts of system state to include?
   –  What should function look like?
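A utility function can be sketched as a plain function from system state to a scalar. In this hypothetical example the state is reduced to a response time and a server count, and the function shape, the 1-second goal, and the per-server cost are all assumptions made for illustration.

```python
def utility(response_time_s: float, servers: int,
            rt_goal_s: float = 1.0, cost_per_server: float = 0.1) -> float:
    """Map a system state (response time, server count) to one scalar score."""
    # Performance term: 1.0 when the goal is met, decaying as response time grows.
    performance = min(1.0, rt_goal_s / response_time_s)
    # Cost term: every server reduces the score a little.
    return performance - cost_per_server * servers


def best_state(candidates):
    """Pick the (response_time_s, servers) pair with the highest utility."""
    return max(candidates, key=lambda c: utility(*c))


# Three candidate configurations: slow and cheap, balanced, fast and expensive.
print(best_state([(4.0, 2), (1.0, 4), (0.5, 8)]))
```

Choosing which state variables enter the function, and how they are weighted, is exactly the open design question raised on the slide.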


Specifying goals (3/3)

•  Policies
   –  (Higher-level) descriptions of goals and constraints for operation
   –  How to map to lower-level behavior?
   –  Composition of multiple policies
   –  What high-level language to use?
      •  Turing-complete?
      •  No widely used languages available today
•  Human operators used to explicit steering
   –  Not used to indirect goal specification

Autonomic management techniques - requirements

•  Robustness
   –  Avoid oscillations or behavioral changes
•  Scalability
   –  Internet-scale: millions of servers and networks, even more autonomic agents (50 billion devices?)
•  Adaptive to changing workloads
   –  Some methods reliable for certain load patterns, but unstable once the load or system dynamics change
•  Performance
   –  Need to make decisions fast enough to react in time
   –  Optimal solutions vs. approximations
•  Simplicity
   –  Key to adoption
   –  Complex models vs. model-free?
   –  Learning phase required before deployment?

Autonomic management - sample techniques

•  Heuristic frameworks
   –  Fast and simple, rules of thumb
•  Control theory
   –  Used to steer, e.g., industrial plants, embedded systems, etc.
   –  Discretization for data packet flows (queuing theory)
•  Machine learning
   –  Evolve behavior based on empirical (monitor) data
   –  Examples: Neural networks, genetic algorithms, reinforcement learning

Heuristics

•  Rules of thumb
   –  Often lack theoretical background
•  Often used to handle very complex (NP-hard) problems
   –  Scalable, find solutions fast
   –  Greedy:
      •  Local decisions that make sense right here/now
      •  May not result in optimal solution
   –  Hill climbing
      •  Steer search (manage system in this case) towards steepest slope
   –  Often no upper bound
      •  Not possible to know distance from optimal solution
   –  “The O-word…”
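Hill climbing over a single integer knob can be sketched as follows. The scoring function with two peaks is a made-up example, chosen to show how a greedy search can stop at a local optimum without knowing how far it is from the best setting.

```python
from typing import Callable


def hill_climb(score: Callable[[int], float], start: int, lo: int, hi: int) -> int:
    """Greedy local search over an integer knob in [lo, hi]."""
    current = start
    while True:
        neighbors = [n for n in (current - 1, current + 1) if lo <= n <= hi]
        if not neighbors:
            return current
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current              # no neighbor is better: local optimum
        current = best


def score(x: int) -> float:
    """Hypothetical objective: a small peak at x=3 and a higher peak at x=12."""
    return -(x - 3) ** 2 if x < 8 else 20 - (x - 12) ** 2


print(hill_climb(score, start=0, lo=0, hi=15))   # stops at the small peak
print(hill_climb(score, start=9, lo=0, hi=15))   # finds the higher peak
```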

Control theory
•  Mathematical models to monitor and steer dynamic systems
   –  Real-time allocation of CPU, memory, etc.
•  Some simple examples:
   –  Proportional control
      •  Adjust signal proportionally to compensate error
   –  PID (Proportional Integral Derivative) control:
      •  Integral: adjustment w.r.t. error over time
      •  Derivative: adjustment w.r.t. error trend
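A discrete-time PID controller can be sketched in a few lines. The gains and the toy plant below (a value that simply accumulates the control signal) are illustrative assumptions; tuning for a real system would require a model or experiments.

```python
from dataclasses import dataclass


@dataclass
class PID:
    kp: float                      # proportional gain: reacts to the current error
    ki: float                      # integral gain: reacts to accumulated error
    kd: float                      # derivative gain: reacts to the error trend
    integral: float = 0.0
    previous_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy plant: the measured value accumulates the control signal each step.
controller = PID(kp=0.5, ki=0.1, kd=0.05)
measured = 0.0
for _ in range(50):
    measured += controller.update(setpoint=1.0, measured=measured)
print(round(measured, 3))
```

Setting `ki` and `kd` to zero reduces this to plain proportional control.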

Neural networks
•  Mimic the brain’s neuron systems
•  Input/hidden/output layers of neurons:
   –  Neurons in hidden layer: activation functions map input signal to output signal
   –  Neuron weights tuned based on error in output layer (errors are propagated back for tuning)
•  Often used to capture multi-dimensional problems that are hard to model with other techniques
•  Hard to train (need representative training data)
•  Hard to understand cause/effect (hidden layers)

& Deep learning

Genetic algorithms
•  Inspired by natural evolution
•  Ingredients:
   –  Population with genetic representations (behaviors) for candidate solutions (can be hard)
   –  Inheritance, crossover, and mutation operators
   –  A rating function to compare solutions and select
•  Termination?
   –  Only compares to prev. generation
   –  Optimal solution?
•  Adaptable to dynamic environments?
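The ingredients listed above can be sketched as a tiny genetic algorithm over bit strings, where each gene is imagined as one on/off knob. The genome encoding, population size, fixed generation budget, and the use of `sum` as a rating function are all assumptions made to keep the example short.

```python
import random
from typing import Callable, List

Genome = List[int]


def evolve(fitness: Callable[[Genome], float], genome_len: int = 12,
           pop_size: int = 20, generations: int = 40, seed: int = 0) -> Genome:
    """Evolve bit-string genomes; returns the best genome in the final population."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):               # fixed budget as termination criterion
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len) # one-point crossover
            child = mum[:cut] + dad[cut:]
            i = rng.randrange(genome_len)      # mutation: flip one random gene
            child[i] ^= 1
            children.append(child)
        population = parents + children        # parents survive unchanged
    return max(population, key=fitness)


best = evolve(sum)                             # rating function: number of knobs "on"
print(best, sum(best))
```

The fixed generation budget sidesteps the termination question: the algorithm never learns whether the returned genome is optimal, only that it was the best one found.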

Reinforcement learning

•  Previous methods use a model (internal representation) of the world
•  Reinforcement learning (can be) model free
•  System learns dynamically to
   –  select the best action for a given state
   –  based on reward (reinforcement) function
•  How to:
   –  Assign value to actions?
   –  Balance exploration (learning) vs. exploitation (benefit from good, known actions)
•  What if environment is too dynamic?
   –  Most states have not been seen before?
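One common model-free method is tabular Q-learning with epsilon-greedy action selection, sketched below; it is used here as an example and is not necessarily the method the talk has in mind. The toy environment (1 to 5 servers, an ideal count of 3, reward equal to the negative distance from that count) and all hyperparameters are assumptions.

```python
import random

STATES = range(1, 6)        # number of running servers
ACTIONS = (-1, 0, 1)        # remove one, keep, add one
TARGET = 3                  # hypothetical ideal server count


def step(state: int, action: int):
    """Toy environment: apply the action and reward closeness to TARGET."""
    next_state = min(5, max(1, state + action))
    return next_state, -abs(next_state - TARGET)


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice(list(STATES))
        for _ in range(10):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                        # exploration
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploitation
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def policy(q, state: int) -> int:
    """Greedy action for a state according to the learned values."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_table = train()
print({s: policy(q_table, s) for s in STATES})
```

With only five states every state is visited many times; the slide's concern about environments where most states have never been seen does not arise in such a small example.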

Autonomic element(s)
•  Autonomic element seems doable
•  Autonomic elements?
•  Multi-agent systems as inspiration
   –  Behaviors and goals of the systems
   –  Pattern and type of interactions among agents
•  How to achieve high-level goals in a decentralized way?
   –  Understand, control, and exploit emergent behavior
   –  Convergence?

Autonomic elements interaction

•  Relationships
   –  Dynamic, short-lived
   –  Formed by agreement?
      •  May be negotiated
   –  Full spectrum
      •  Peer-to-peer
      •  …
      •  Hierarchical
   –  Subject to policies
      •  Compare single-element policies

Interacting control/optimization loops

[Figure: Transaction Requests reach Server 1 and Server 2, which depend on a DB Service and a File System, backed by Storage Service 1 and Storage Service 2. Annotations: "Increase demand", "Increase service".]

•  Feedback control & optimization of single autonomic elements
   –  Done for 1-2 variables
•  What happens when feedback loops interact?

Interacting control/optimization loops

[Figure, same topology. Annotation: "Capacity limit reached: Get more storage".]

Interacting control/optimization loops

[Figure, same topology. Annotations: "Demand not being met: Find alternate supplier", "Getting more storage".]

Interacting control/optimization loops

[Figure, same topology. Annotations: "Sorry; already found an alternative", "Ready to give you that extra service".]

[Figure: Transaction Requests, Server 1, Server 2, DB 1, File System 1, Storage 1, Storage 2.]

Negotiation and resource allocation

Request( QueryService, Queries = 800/sec, Type = 2, RT = 5 sec)

Request( QueryService, Queries = 400/sec, Type = 5, RT = 3 sec)

Request( TableSpace, Size = 3 GBytes, Reads = 2000/sec, Writes = 100/sec)

Request( LogicalVolume, Size = 12 Gbytes, Reads = 500/sec, Writes = 500/sec)

Counterpropose( TableSpace, Size = 3 GBytes, Reads = 1600/sec, Writes = 100/sec)

Counterpropose( QueryService, Queries = 320/sec, Type = 5, RT = 4 sec)

•  Should all requests be met?
•  Compute costs and benefits, propagate them down
•  Forms of negotiation:
   –  Bilateral
   –  Multilateral
   –  Auction
   –  Supply chain
   –  Competitive/coop
•  Learning
   –  During negotiation
   –  Strategy evolution
   –  Collective behavior?
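A single bilateral Request/Counterpropose exchange like the TableSpace one above can be sketched as follows. The `Offer` type, the `respond` function, and the provider's read limit of 1600/sec are hypothetical; a real negotiation would also weigh costs and benefits and might run for several rounds.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Offer:
    service: str
    size_gb: float
    reads_per_s: int
    writes_per_s: int


def respond(request: Offer, max_reads_per_s: int):
    """Accept the request, or counterpropose with the read rate the provider can sustain."""
    if request.reads_per_s <= max_reads_per_s:
        return "accept", request
    return "counterpropose", replace(request, reads_per_s=max_reads_per_s)


request = Offer("TableSpace", size_gb=3, reads_per_s=2000, writes_per_s=100)
verdict, offer = respond(request, max_reads_per_s=1600)   # assumed provider limit
print(verdict, offer)
```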

Autonomic Computing Adaptation?

•  Fully autonomic computing
   –  Evolves as increasingly sophisticated autonomic managers are added to existing managed elements
•  Autonomic elements will function at many levels
   –  At the lower levels
      •  Limited range of internal behaviors
      •  Hard-coded behaviors
   –  At the higher levels
      •  Increased dynamism and flexibility
      •  Goal-oriented behaviors
•  Hard-wired relationships will evolve into flexible relationships that are established via negotiation

Adaptation (cont.)

1. Collect and aggregate information
   –  Support decisions by human administrators
2. Advisors suggesting possible actions by humans
3. Autonomic systems entrusted with lower-level decisions
4. Over time, less frequent and more high-level decisions by operator
   –  Carried out by numerous autonomic actions at lower level

Autonomic computing – a developer perspective
•  Delegation of human operator responsibility
   –  Trust
•  How the MAPE loop can break down:
   –  Monitoring: Delayed? Missing? Incorrect?
   –  Analyze & Plan: model is wrong!
   –  Execute:
      •  What if actuators (knobs) do not act as expected?
      •  The underlying system is likely (autonomically) trying to counteract actuators
      •  And your autonomic system is being steered by a higher-level one

Developer perspective (cont.)

•  In autonomics, so much more can go wrong
   –  All computer systems fail
   –  Autonomic systems actively steer other systems, i.e., can actively make other systems fail
      •  “Intelligent” actions can be harmful
      •  Cascading failures
•  #1 feature: turn it off
•  #2 feature: add an “I don’t understand” mode
•  “What can go wrong?”
   –  If your automated system cannot handle odd inputs/configs/etc., you should not build it…
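The two "features" can be sketched as guards in front of any automated decision. The `Mode` names, the utilization threshold, and the input checks below are illustrative assumptions; the point is only that the controller refuses to act when switched off or when its inputs look wrong.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    CONFUSED = "i_dont_understand"   # hand control back to a human operator
    OFF = "off"


def decide(enabled: bool, cpu_utilization, servers: int):
    """Return (mode, new_server_count); never act on inputs that look wrong."""
    if not enabled:                                  # feature #1: the off switch
        return Mode.OFF, servers
    valid = (isinstance(cpu_utilization, (int, float))
             and 0.0 <= cpu_utilization <= 1.0)      # also rejects NaN and None
    if not valid or servers < 1:                     # feature #2: "I don't understand"
        return Mode.CONFUSED, servers
    if cpu_utilization > 0.8:                        # assumed scale-up threshold
        return Mode.ACTIVE, servers + 1
    return Mode.ACTIVE, servers


print(decide(True, 0.9, 3))
print(decide(True, float("nan"), 3))
print(decide(False, 0.9, 3))
```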

Autonomic Computing Research Trends

(Selected) Research trends
•  Cyber-physical systems
   –  Datacenters: building + hardware + software
•  Interacting autonomic systems
   –  Hierarchical & distributed
   –  Understanding and controlling these
•  Multi-criteria
   –  Multiple goals (cost, energy, performance, …)
   –  Multiple stakeholders
      •  Datacenter owners, application owners, end-users

Even more research trends

•  Data-driven, predictive, & proactive
   –  Feedback control not enough
•  Self-aware systems
   –  Self-reflective, self-predictive, self-adaptive
   –  Context, correlations, and online models
•  Need for benchmarks
   –  Not only performance, but other self-* aspects


Summary
•  Autonomic computing needed for management of complex systems such as clouds
•  Systems manage (config, repair, optimize, protect) themselves according to admin goals
   –  Achievable w/o solving hard AI problem
•  Many different techniques for autonomic management
•  Goal specification can be hard
•  Interacting autonomic elements complicate matters
•  Great care needed to build autonomic systems
•  Many unsolved research questions

Thanks!

Questions?

Capacity planning is hard!
