Openstack Upgrade Without Down Time

November 5, 2014

Takashi Natsume, Software Innovation Center, NTT

Yankai Liu, Canonical


Agenda

● Introduction
● Live Upgrade Test Strategy and Plan
  ○ Pre-upgrade Investigation
  ○ Considerations in Creating Upgrade Procedure
  ○ Concrete Upgrade Procedure
  ○ Testing
  ○ Upgrade Test Results and Issues
● Summary


Introduction

Who We Are:

Takashi Natsume
Takashi Natsume has been working for NTT Corporation since April 2013. He is engaged in the system design of public cloud systems based on OpenStack and in functional verification of OpenStack. Previously, he worked on performance analysis and performance troubleshooting for systems.

Yankai Liu
Yankai Liu is a Cloud Architect at Canonical, responsible for cloud architecture design and delivery. He worked with the NTT team to provide consultancy on the upgrade test project.


Openstack Upgrade Overview

With OpenStack releases rolling out quickly, upgrading has become one of the key operational concerns for deployments. An upgrade can be performed off-line or as a live upgrade.

For production deployments, live upgrade is desired in order to achieve these goals:

● Minimal or no down time

● Keep up with the short release cycle of OpenStack [1]

● Stay within maintenance support (the maintenance period is short [2])

● Reduce the cost compared to an off-line upgrade

In this session, we introduce how NTT designed and tested a live upgrade from Havana to Icehouse, service by service.


The Goal of NTT Cloud Live Upgrade

No impact on users’ resource usage
● Users can use the resources (VMs, virtual volumes, virtual networks) that they have already created or that are running, without any interruption during the live upgrade. Examples of interruption: VM stop and network communication interruption.
● No performance problem that significantly affects users’ resource utilization.

No impact on users’ API calls
● During the live upgrade, users can use the OpenStack API services as usual, with:
  ○ No errors or failures
  ○ No incorrect results
  ○ No performance problem that significantly affects users’ operations


Upgrade environment and components

• System environment
  • Built a test environment based on the NTT production public cloud system architecture (see the figure on the next page).

• Upgrade components
  • OpenStack components
    • Nova, Cinder, Glance, Neutron, Keystone, Heat
  • Non-OpenStack components such as MySQL, RabbitMQ, the load balancer (ldirectord) and the OS were NOT included.

• Upgrade version
  • stable/havana (2013.2.2) to icehouse-1 (icehouse-3 for Nova)


System Architecture Built for Upgrade Testing

Active/Active: processes that do not retain their state
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts

Processes that receive REST API requests can be blocked by deploying load balancers in front of them.

OS: Ubuntu Server 12.04 LTS


Live Upgrade Test Strategy and Plan


NTT Cloud Live Upgrade Test Strategy and Plan

Overall Strategy
● A step-by-step (rolling) upgrade is needed for live upgrade
● OpenStack components of different versions co-exist during the upgrade

Live Upgrade Test Plan
1. Pre-upgrade investigation: items that should be considered in advance
2. Considerations in creating the detailed procedure
3. Concrete upgrade procedure
4. Testing
5. Upgrade test results and issues

Live Upgrade Test Strategy and Plan - Pre-upgrade investigation -

Pre-Investigation for Live Upgrade

A) Database schema
• In some cases, the OpenStack database schemas differ between the new version and the old version.
• Investigate the DB schema changes before creating the upgrade plans.

B) Consistency of APIs between components
C) Consistency of APIs in each component
• REST API
• RPC API

Live Upgrade Test Strategy and Plan - Considerations in Creating Upgrade Procedure -

Considerations in Creating Upgrade Procedure

• User resources
  • User resources on hosts that are about to be upgraded need to be migrated to another host.


The order of upgrade

Decide the upgrade order based on RPC API version compatibility within the component.

[Figure: Process A, Process B and Process C running on servers and connected by RPC calls. Legend: process, RPC call, server.]

A caller is upgraded after its callee. In this case, the upgrade is performed in the order of process A, process B and process C.
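The ordering rule above (callee first, caller afterwards) is a topological ordering of the RPC call graph. The following is a minimal sketch of that idea, not the procedure NTT used. The call graph is an assumption, since the figure did not survive: it supposes that process C calls process B and process B calls process A, which is consistent with the stated order. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def upgrade_order(rpc_calls):
    """Return an upgrade order in which every callee precedes its callers.

    rpc_calls maps each process to the set of processes it calls over RPC.
    """
    # TopologicalSorter expects node -> predecessors. A callee must be
    # upgraded before its caller, so the callees are the predecessors.
    return list(TopologicalSorter(rpc_calls).static_order())


# Hypothetical call graph: C calls B, and B calls A.
calls = {
    "process C": {"process B"},
    "process B": {"process A"},
    "process A": set(),
}
print(upgrade_order(calls))
```

If two processes call each other, no valid order exists and `static_order()` raises `graphlib.CycleError`; such a pair would have to be upgraded together.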


Operations Required for Step by Step Upgrade

• Blockade (blocking requests)
  • Load balancer (ldirectord (LVS))
  • Disable service (nova-compute, cinder-volume)

• Check processing in progress
  • Check connections at the load balancer
    • e.g. glance-api
  • Check child processes
    • e.g. nova-novncproxy
  • If a graceful shutdown function is available, it should be used.
    • Nova: icehouse-1 or later
    • Cinder: icehouse-1 or later
    • Neutron: icehouse-2 or later
    • Heat: havana-3 or later (we fixed a bug in juno-1)
    • Glance: not needed in our environment
    • Keystone: not needed in our environment
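The "block, then check processing in progress" step can be pictured as a polling loop that waits until the blocked node has no work left before it is stopped. The sketch below is illustrative only. `count_active` is a hypothetical callable; in a real setup it might count connections at the load balancer or child processes of a service, and how to obtain that number is not shown here.

```python
import time


def wait_until_drained(count_active, timeout=60.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll count_active() until it returns 0 or the timeout expires.

    Returns True if the node drained in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if count_active() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated connection counts standing in for a real measurement.
samples = iter([3, 1, 0])
print(wait_until_drained(lambda: next(samples), timeout=5.0, interval=0.0))
```

A `False` result means requests are still in flight, so the operator can choose between waiting longer and accepting that those requests will fail.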


Database Schema
• Change the database schema at the beginning and at the end of the procedure
  • The beginning
    • Add tables, add columns and add indexes
  • The end
    • Drop tables, delete columns and delete indexes
  • In the current nova live upgrade procedure (community), nova-conductors are upgraded at the same time. (New-version and old-version nova-conductors don’t run at the same time.)

• Conversion of data format should be considered
  • We did not need to convert the data format in our trial, so there was no problem.
  • Thoroughly check the code that defines the database schema.
    • For example, in nova
      • nova/db/sqlalchemy/migrate_repo/versions/*
  • Data conversion may be needed in some cases.
    • Adding 'triggers' in database tables?
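The split described above keeps the schema usable by both versions while they co-exist: additive changes go first, and destructive changes wait until no old-version process remains. A minimal sketch of that classification follows. The operation names and the table and column names are hypothetical, and this is not how the nova migration scripts are organized.

```python
ADDITIVE = {"add_table", "add_column", "add_index"}
DESTRUCTIVE = {"drop_table", "drop_column", "drop_index"}


def split_schema_changes(changes):
    """Split (operation, target) pairs into a beginning and an end phase."""
    beginning, end = [], []
    for op, target in changes:
        if op in ADDITIVE:
            beginning.append((op, target))
        elif op in DESTRUCTIVE:
            end.append((op, target))
        else:
            # Anything else (e.g. a data conversion) needs a human decision.
            raise ValueError(f"unclassified schema change: {op}")
    return beginning, end


# Hypothetical changes; the names are illustrative only.
changes = [
    ("add_column", "instances.new_col"),
    ("drop_column", "instances.old_col"),
    ("add_index", "instances.new_col_idx"),
]
beginning, end = split_schema_changes(changes)
print("beginning of procedure:", beginning)
print("end of procedure:", end)
```

Raising on unclassified operations is deliberate in this sketch: a change such as a column type conversion fits neither phase and corresponds to the "data conversion" case that needs separate consideration.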


Database Schema (cont’d)
• Avoid locking the database for a long time
  • We can use some tools
    • pt-online-schema-change [3]
    • oak-online-alter-table [4]


HA Configuration

• From the point of view of live upgrade, an Active/Active configuration is better.

• But in some cases Active/Active cannot be configured, so Active/Standby is forced:
  • cinder-volume (depends on backends)
    • Active/Active can be configured by using Ceph (refer to the discussion at https://bugs.launchpad.net/cinder/+bug/1280367)
    • However, Active/Active setup is not supported by all drivers: https://bugs.launchpad.net/cinder/+bug/1322190
  • neutron-server (depends on plugin)
  • neutron-l3-agent/neutron-dhcp-agent
  • nova-consoleauth
  • heat-engine (but a multiple-engine function has been implemented in icehouse-2)


HA Configuration (cont’d)

• In the Active/Active case (controller)
  • At the load balancer, block the node that is being upgraded

• In the Active/Standby case
  • When switching Active/Standby, there is service down time of the component, as expected.


Upgrade Procedure by HA Configuration

Active/Active configuration
1. Block requests/connections to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host

No HA (Single)
1. Block requests to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host

Active/Standby configuration
1. Block requests to ‘Active’ host
2. Upgrade ‘Standby’ host
3. Switch Active/Standby
4. Unblock
Repeat on each target host (if possible)
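The three flows above share one shape: a fixed list of steps applied host by host. As a rough sketch, they can be written down as data and expanded into a per-host plan. The step texts follow the slide; the dictionary keys and the host names are hypothetical.

```python
PROCEDURES = {
    "active/active": [
        "block requests/connections to target host",
        "migrate users' resources",
        "upgrade host",
        "unblock",
    ],
    "no-ha": [
        "block requests to target host",
        "migrate users' resources",
        "upgrade host",
        "unblock",
    ],
    "active/standby": [
        "block requests to 'Active' host",
        "upgrade 'Standby' host",
        "switch Active/Standby",
        "unblock",
    ],
}


def plan(ha_mode, hosts):
    """Expand the steps for one HA mode into an ordered (host, step) list."""
    steps = PROCEDURES[ha_mode]
    # One host is finished completely before the next one is started.
    return [(host, step) for host in hosts for step in steps]


# Hypothetical controller host names.
for host, step in plan("active/active", ["controller-1", "controller-2"]):
    print(f"{host}: {step}")
```

Finishing one host before starting the next is what keeps at least one node of the old or new version serving requests at any moment in the Active/Active case.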

Live Upgrade Test Strategy and Plan - Concrete upgrade procedure -


Overall Upgrade Procedure

Live Upgrade Test Strategy and Plan - Testing -

Create test plans, test tools and test data

• Background workload during upgrade test
  • The background workload (API requests) covered the patterns of calls between components, and between processes within components, in our use case.
  • Network communication (ping)
    • North-South
    • East-West
  • Keep a VNC console connected during the upgrade test
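One way to turn a background ping workload into a pass/fail number is to compute the longest interruption seen during the upgrade. The sketch below assumes a hypothetical log format of `(timestamp_seconds, ok)` pairs, one per probe; it is not the test tool used in this project.

```python
def longest_interruption(samples):
    """Return the longest run of failed probes, in seconds.

    samples is a time-ordered list of (timestamp_seconds, ok) pairs. An
    interruption lasts from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Still failing when the log ends: count up to the last probe.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical one-second ping log: the probes at t=3 and t=4 were lost.
log = [(0, True), (1, True), (2, True), (3, False),
       (4, False), (5, True), (6, True)]
print(longest_interruption(log))
```

Running the same calculation separately on the North-South and East-West logs shows which path an upgrade step interrupted.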


Build a test environment

• Build a test environment
  • Same configuration as the production environment
    • HA configuration (Active/Active, Active/Standby) required.
  • In order to repeat upgrade testing, we built the environment with Chef so that it could easily be restored to its initial state.


Execute (test) the procedure

• Evaluation criteria
  • No impact on users’ resources
    • Users can use the resources (VMs, virtual volumes, virtual networks) that they have already created or that are running, without any interruption.
    • No performance problem that significantly affects users’ resource utilization.
  • No impact on users’ API calls
    • No errors
    • No ‘wrong’ results
    • No performance problem that significantly affects users’ operations
  • Operation steps do not take a lot of time
  • Consistency between the records that OpenStack manages and the actual resources

Live Upgrade Test Strategy and Plan - Upgrade test results and issues -

Identify issues

• Solved issues
  • Heat graceful shutdown issue
    • NTT team fixed it in juno-1
    • https://bugs.launchpad.net/heat/+bug/1304244

• Remaining issues
  • Errors due to Active/Standby switchover
    • Volume resource creation failure (ERROR state)
  • Errors due to mismatch of RPC API major versions
    • From nova-compute to nova-consoleauth
    • From nova-novncproxy to nova-consoleauth
  • Communication interruption (expected to be resolved in Juno)
    • neutron-l3-agent
    • Changing ‘admin_state_up’ of neutron-l3-agent to False solves the ‘scheduling’ issue, but communication interruption occurred.
  • Interruption of the console connection
    • VM live migration/nova-novncproxy upgrade
  • Impossible to fall back after changing the DB schema at the beginning
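The RPC errors listed above come down to a version check on the receiving side. The sketch below shows the usual major/minor rule under an assumption: a request is accepted only when the major versions are equal and the server implements at least the requested minor version. The function name and the version numbers are hypothetical, and this is not the oslo.messaging API.

```python
def rpc_version_compatible(server_version, requested_version):
    """Check a requested RPC API version against what the server implements.

    Versions are "major.minor" strings. Assumed rule: the major versions
    must be equal and the server minor must be at least the requested minor.
    """
    server_major, server_minor = (int(p) for p in server_version.split("."))
    req_major, req_minor = (int(p) for p in requested_version.split("."))
    return server_major == req_major and server_minor >= req_minor


# Illustrative version numbers, not taken from any real release.
print(rpc_version_compatible("2.3", "2.1"))  # same major, newer server minor
print(rpc_version_compatible("1.9", "2.0"))  # major version mismatch
```

Under this rule, a minor version bump is tolerated when the callee is upgraded first, while a major version bump fails in either direction, which is why mixed-version operation breaks for the caller/callee pairs listed above.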


Lessons learned

• Clean install
  • Some source code directories/files should be removed during the upgrade and fallback. Otherwise they will cause errors and issues.
  • When OpenStack components’ files were overwritten in place, errors occurred.
    • AttributeError: type object 'foo' has no attribute 'bar'


Summary

● The goal of the upgrade test was to achieve an upgrade without down time, but some issues prevented us from upgrading OpenStack without down time.

● During our upgrade test, the following service down time occurred:
  ○ Network downtime
    ■ neutron-l3-agent (expected to be fixed in Juno)
      ● Trade-off between new vRouter creation failure and VM communication, e.g. a few minutes of downtime for scheduling new vRouter creation OR a few minutes of communication interruption for some VMs
  ○ Downtime of some API requests during the Active/Standby switchover
    ■ Neutron server
    ■ Heat engine
    ■ Cinder volume
  ○ Nova instance console connection interruption
    ■ Users need to reconnect or get the console URL again.


Suggestions for communities

• Cinder-volume drivers Active/Active HA support
  • Presently, some drivers for commercial products prevent configuring Active/Active
• Consistency of RPC API major versions
  • One-version rolling upgrade has (limited) support in Nova.
  • It should be considered in all core projects.
  • If OpenStack components utilize oslo.messaging, errors caused by RPC API major version differences might occur during live upgrade.
• Seamless console connection
  • There was a discussion at the Juno summit on seamless console migration [5]
• Consider live upgrade when deprecating REST API versions
• SDN controller Active/Active HA support should be considered when integrating into Neutron as a plugin
• Although Ceilometer is not in the test scope, there are still gaps to support Active/Active HA
• Graceful shutdown of all services


Reference

• [1] Release Cycle: https://wiki.openstack.org/wiki/Release_Cycle
• [2] Releases: https://wiki.openstack.org/wiki/Releases
• [3] Percona Toolkit: http://www.percona.com/software/percona-toolkit
• [4] openark kit: http://code.openark.org/forge/openark-kit
• [5] Improve performance of live migration on KVM: https://etherpad.openstack.org/p/juno-nova-kvm-live-migration
