OpenStack Upgrade Without Down Time
November 5, 2014
Takashi Natsume, Software Innovation Center, [email protected]
Yankai Liu, [email protected]
Agenda
● Introduction
● Live Upgrade Test Strategy and Plan
  ○ Pre-upgrade Investigation
  ○ Considerations in Creating Upgrade Procedure
  ○ Concrete Upgrade Procedure
  ○ Testing
  ○ Upgrade Test Results and Issues
● Summary
Introduction
Who We Are:
Takashi Natsume
Takashi Natsume has been working for NTT Corporation since April 2013. He is engaged in the system design of public cloud systems based on OpenStack and in the functional verification of OpenStack. Previously he worked on performance analysis and performance troubleshooting for systems.
Yankai Liu
Yankai Liu is a Cloud Architect at Canonical, responsible for cloud architecture design and delivery. He worked with the NTT team as a consultant on the upgrade test project.
OpenStack Upgrade Overview
With OpenStack releases arriving at a fast pace, upgrading has become a key operational concern for deployments. An upgrade can be performed off-line or as a live upgrade.
For production deployments, a live upgrade is desirable to achieve these goals:
● Minimal or no down time
● Keep up with the short release cycle of OpenStack [1]
● Stay within maintenance support (the maintenance period is short [2])
● Reduce cost compared to an off-line upgrade
In this session, we present how NTT designed and tested a live upgrade from Havana to Icehouse, service by service.
The Goal of NTT Cloud Live Upgrade
No impact on users’ resource usage
● Users can keep using resources (VMs, virtual volumes, virtual networks) that have already been created or are running, without any interruption during the live upgrade. Examples of interruptions are a VM stop or a break in network communication.
● No performance problem that significantly affects users’ resource utilization.
No impact on users’ API calls
● During the live upgrade, users can use the OpenStack API services as usual, with:
  ○ No errors or failures
  ○ No incorrect results
  ○ No performance problem that significantly affects users’ operations
Upgrade environment and components
• System environment
  • Built a test environment based on the NTT production public cloud system architecture (see the figure on the next page).
• Upgrade components
  • OpenStack components
    • Nova, Cinder, Glance, Neutron, Keystone, Heat
  • Non-OpenStack components such as MySQL, RabbitMQ, the load balancer (ldirectord) and the OS were NOT included.
• Upgrade version
  • stable/havana (2013.2.2) to icehouse-1 (nova: icehouse-3)
System Architecture Built for Upgrade Testing
[Figure: system architecture of the test environment]
Active/Active: processes that do not retain their state
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts
Processes that receive REST API requests can be blocked by deploying load balancers in front of them.
OS: Ubuntu Server 12.04 LTS
Live Upgrade Test Strategy and Plan
NTT Cloud Live Upgrade Test Strategy and Plan
Overall Strategy
● A step-by-step (rolling) upgrade is needed for live upgrade
● OpenStack components of different versions co-exist during the upgrade
Live Upgrade Test Plan
1. Pre-upgrade investigation: items that should be considered in advance
2. Considerations in creating the detailed procedure
3. Concrete upgrade procedure
4. Testing
5. Upgrade test results and issues
Live Upgrade Test Strategy and Plan - Pre-upgrade Investigation -
Pre-Investigation for Live Upgrade
A) Database schema
  • In some cases the OpenStack database schema differs between the new version and the old version.
  • Investigate the DB schema changes before creating the upgrade plan.
B) Consistency of APIs between components
C) Consistency of APIs within each component
  • REST API
  • RPC API
Live Upgrade Test Strategy and Plan - Considerations in Creating Upgrade Procedure -
Considerations in Creating Upgrade Procedure
• User resources
  • User resources on hosts that are to be upgraded need to be migrated to another host.
The order of upgrade
[Figure: processes A, B and C running on servers and connected by RPC calls]
Decide the upgrade order based on RPC API version compatibility within the component.
A caller is upgraded after its callee. In this case, the upgrade is performed in the order process A, process B, process C.
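The callee-before-caller rule can be read as a topological sort of the RPC call graph. The following is a minimal sketch under stated assumptions: the slide's figure is not reproduced in this transcript, so the call edges used here (process C calls process B, process B calls process A) are hypothetical, as is the function name `upgrade_order`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def upgrade_order(rpc_calls):
    """Return processes ordered so that every callee precedes its callers.

    rpc_calls maps each caller to the set of processes it calls over RPC.
    """
    # TopologicalSorter expects {node: predecessors}. A callee must be
    # upgraded before its caller, so the callees act as the predecessors.
    return list(TopologicalSorter(rpc_calls).static_order())


# Hypothetical call graph (caller -> callees): C calls B, B calls A.
calls = {
    "process C": {"process B"},
    "process B": {"process A"},
    "process A": set(),
}
order = upgrade_order(calls)
print(order)
```

A call graph with a cycle has no valid order; `TopologicalSorter` raises `graphlib.CycleError` in that case, which would indicate that the processes involved have to be upgraded together.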
Operations Required for Step by Step Upgrade
• Blockade (blocking requests)
  • Load balancer (ldirectord (LVS))
  • Disable service (nova-compute, cinder-volume)
• Check for processing in progress
  • Check connections at the load balancer
    • e.g. glance-api
  • Check child processes
    • e.g. nova-novncproxy
  • If a graceful shutdown function is available, it should be used.
    • Nova: icehouse-1 or later
    • Cinder: icehouse-1 or later
    • Neutron: icehouse-2 or later
    • Heat: havana-3 or later (we fixed a bug in juno-1)
    • Glance: not needed in our environment
    • Keystone: not needed in our environment
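Blocking a node and then checking for in-progress work amounts to a drain loop: stop new requests, then poll until nothing is left. A minimal sketch of the polling half, assuming a hypothetical `count_active` callable that the operator supplies (for example one that reads the active connection count for the blocked node from the load balancer); the function name and parameters are illustrative and not part of any OpenStack tool.

```python
import time


def wait_until_drained(count_active, timeout=300.0, interval=1.0, sleep=time.sleep):
    """Poll count_active() until it returns 0 or the timeout elapses.

    count_active is a hypothetical hook that returns the number of
    connections or child processes still in progress on the blocked node.
    Returns True when drained, False when the timeout was reached first.
    """
    waited = 0.0
    while True:
        if count_active() == 0:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Usage with a fake counter standing in for a real load balancer query.
remaining = iter([3, 1, 0])
drained = wait_until_drained(lambda: next(remaining), timeout=10, interval=0,
                             sleep=lambda seconds: None)
print(drained)
```

Returning False on timeout leaves the decision to the operator: either keep waiting or stop the process and accept that the remaining requests fail.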
Database Schema
• Change the database schema at the beginning and at the end of the procedure
  • At the beginning
    • Add tables, add columns and add indexes
  • At the end
    • Drop tables, delete columns and delete indexes
  • In the current (community) nova live upgrade procedure, all nova-conductors are upgraded at the same time. (New-version and old-version nova-conductors do not run at the same time.)
• Conversion of data format should be considered
  • We did not need to convert the data format in our trial, and this caused no problem.
  • Thoroughly check the code that defines the database schema.
    • For example, in nova
      • nova/db/sqlalchemy/migrate_repo/versions/*
  • Data conversion may be needed in some cases.
    • Adding 'triggers' in database tables?
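Splitting schema changes into an additive phase at the beginning and a destructive phase at the end can be illustrated with an in-memory SQLite database. This is only a sketch: the table, column and index names are hypothetical and are not taken from the nova schema, and a real deployment would run the project's own migration scripts against MySQL.

```python
import sqlite3


def expand(conn):
    """Additive changes, applied before any service is upgraded."""
    conn.execute("ALTER TABLE instances ADD COLUMN new_flag INTEGER DEFAULT 0")
    conn.execute("CREATE INDEX idx_instances_new_flag ON instances (new_flag)")


def contract(conn):
    """Destructive changes, applied after every service runs the new version."""
    conn.execute("DROP INDEX idx_instances_host")
    conn.execute("DROP TABLE legacy_cache")


# Old-version schema (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, host TEXT)")
conn.execute("CREATE INDEX idx_instances_host ON instances (host)")
conn.execute("CREATE TABLE legacy_cache (k TEXT, v TEXT)")

expand(conn)
# Old-version code keeps working: it never mentions the new column.
conn.execute("INSERT INTO instances (host) VALUES ('compute1')")
conn.execute("INSERT INTO legacy_cache VALUES ('a', 'b')")
# New-version code can already use the new column.
conn.execute("INSERT INTO instances (host, new_flag) VALUES ('compute2', 1)")
contract(conn)

tables = {row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
rows = conn.execute("SELECT host, new_flag FROM instances ORDER BY id").fetchall()
print(tables, rows)
```

Between `expand` and `contract` both old and new code can read and write the same database, which is the window in which services are upgraded one by one.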
Database Schema (cont’d)
• Avoid locking the database for a long time
  • Tools are available for this
    • pt-online-schema-change [3]
    • oak-online-alter-table [4]
HA Configuration
• From the point of view of live upgrade, an Active/Active configuration is better.
• In some cases Active/Active cannot be configured, so Active/Standby is forced.
  • cinder-volume (depends on backends)
    • Active/Active can be configured by using Ceph (refer to the discussion at https://bugs.launchpad.net/cinder/+bug/1280367)
    • However, Active/Active setup is not supported by all drivers: https://bugs.launchpad.net/cinder/+bug/1322190
  • neutron-server (depends on plugin)
  • neutron-l3-agent/neutron-dhcp-agent
  • nova-consoleauth
  • heat-engine (but a multiple-engine function was implemented in icehouse-2)
HA Configuration (cont’d)
• Active/Active case (controller)
  • At the load balancer, block the node that is being upgraded
• Active/Standby case
  • When switching Active/Standby, the component has service down time, as expected.
Upgrade Procedure by HA Configuration
Active/Active configuration
1. Block requests/connections to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host
No HA (Single)
1. Block requests to target host
2. Migrate users’ resources
3. Upgrade host
4. Unblock
Repeat on each target host
Active/Standby configuration
1. Block requests to ‘Active’ host
2. Upgrade ‘Standby’ host
3. Switch Active/Standby
4. Unblock
Repeat on each target host (if possible)
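The Active/Active and No HA flows share the same per-host loop. A minimal sketch of that loop, assuming four hypothetical hooks (`block`, `migrate`, `upgrade`, `unblock`) that the operator would implement with the load balancer, the service-disable commands, live migration and the package or source upgrade; the host names are made up.

```python
def rolling_upgrade(hosts, block, migrate, upgrade, unblock):
    """Run block -> migrate -> upgrade -> unblock on one host at a time.

    All four callables are hypothetical hooks supplied by the operator.
    If a step raises, the loop stops and the current host stays blocked,
    so a half-upgraded host is not put back into service automatically.
    """
    for host in hosts:
        block(host)
        migrate(host)
        upgrade(host)
        unblock(host)


# Usage with stub hooks that only record what would be done.
log = []


def record(step):
    return lambda host: log.append((step, host))


rolling_upgrade(
    ["compute1", "compute2"],
    block=record("block"),
    migrate=record("migrate"),
    upgrade=record("upgrade"),
    unblock=record("unblock"),
)
print(log)
```

Processing hosts strictly one at a time keeps the remaining hosts available to receive the migrated resources and the redirected requests.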
Live Upgrade Test Strategy and Plan - Concrete Upgrade Procedure -
System Architecture Built for Upgrade Testing
[Figure: system architecture of the test environment]
Active/Active: processes that do not retain their state
Active/Standby: processes that retain their state
No HA (single): hypervisor hosts
Processes that receive REST API requests can be blocked by deploying load balancers in front of them.
OS: Ubuntu Server 12.04 LTS
Overall Upgrade Procedure
Live Upgrade Test Strategy and Plan - Testing -
Create test plans, test tools and test data
• Background workload during upgrade test
  • The background workload (API requests) covered the patterns of calls between components, and between processes within components, that occur in our use case.
  • Network communication (ping)
    • North-South
    • East-West
  • Keep a VNC console connected during the upgrade test
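Results from a background ping can be reduced to a single figure, the longest communication interruption observed during the upgrade. A minimal sketch, assuming the probe results are available as hypothetical `(timestamp_seconds, ok)` pairs sorted by time; how the samples are collected is left open.

```python
def longest_outage(samples):
    """Return the longest interruption, in seconds, in a series of probes.

    samples is a list of (timestamp_seconds, ok) pairs sorted by time.
    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    # An outage still open at the end is counted up to the last sample.
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical probe results: one 2-second and one 1-second interruption.
samples = [(0, True), (1, False), (2, False), (3, True), (4, False), (5, True)]
print(longest_outage(samples))
```

The resolution of the result is bounded by the probe interval, so a shorter interval gives a tighter estimate of the real interruption.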
Build a test environment
• Build a test environment
  • Same configuration as the production environment
    • HA configuration (Active/Active, Active/Standby) required.
  • To repeat upgrade testing, we built the environment with Chef so that it can easily be restored to its initial state.
Execute (test) the procedure
• Evaluation criteria
  • No impact on users’ resources
    • Users can keep using resources (VMs, virtual volumes, virtual networks) that have already been created or are running, without any interruption.
    • No performance problem that significantly affects users’ resource utilization.
  • No impact on users’ API calls
    • No errors
    • No ‘wrong’ results
    • No performance problem that significantly affects users’ operations
  • Operation steps do not take a long time
  • Consistency between the records that OpenStack manages and the actual resources.
Live Upgrade Test Strategy and Plan - Upgrade Test Results and Issues -
Identify issues
• Solved issues
  • Heat graceful shutdown issue
    • The NTT team fixed it in juno-1
    • https://bugs.launchpad.net/heat/+bug/1304244
• Remaining issues
  • Errors due to Active/Standby switchover
    • Volume resource creation failure (ERROR state)
  • Errors due to mismatch of RPC API major versions
    • From nova-compute to nova-consoleauth
    • From nova-novncproxy to nova-consoleauth
  • Communication interruption (expected to be resolved in Juno)
    • neutron-l3-agent
    • Changing ‘admin_state_up’ of neutron-l3-agent to False solves the ‘scheduling’ issue, but communication interruption occurred.
  • Interruption of the console connection
    • VM live migration/nova-novncproxy upgrade
  • Impossible to fall back after changing the DB schema at the beginning
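The RPC errors listed above come from the usual major/minor rule for versioned RPC interfaces: a request is served only when the major versions match and the server's minor version is at least the requested one. The sketch below illustrates that rule in plain Python. It is not copied from any OpenStack library, and the function name and version numbers are hypothetical.

```python
def rpc_version_compatible(server_version, requested_version):
    """Return True if a server can serve a request for requested_version.

    Both arguments are "major.minor" strings. The majors must be equal,
    and the server's minor must be at least the requested minor.
    """
    server_major, server_minor = (int(p) for p in server_version.split("."))
    wanted_major, wanted_minor = (int(p) for p in requested_version.split("."))
    return server_major == wanted_major and server_minor >= wanted_minor


# An upgraded callee that only bumped its minor version still serves old callers.
print(rpc_version_compatible("2.5", "2.3"))
# A callee that bumped its major version rejects callers from the old release.
print(rpc_version_compatible("3.0", "2.3"))
```

This is why a minor-version bump tolerates a rolling upgrade in callee-first order, while a major-version bump between two releases breaks calls from processes that have not been upgraded yet.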
Lessons learned
• Clean install
  • Some source code directories/files should be removed during the upgrade and the fallback; otherwise they cause errors.
  • When the OpenStack components’ files were simply overwritten, errors occurred.
    • AttributeError: type object 'foo' has no attribute 'bar'
Summary
● The goal of the upgrade test was to achieve an upgrade without down time, but some issues prevented us from upgrading OpenStack without down time.
● During our upgrade test, the following service down time occurred:
  ○ Network down time
    ■ neutron-l3-agent (expected to be fixed in Juno)
    ■ Trade-off between new vRouter creation failures and VM communication: either a few minutes during which new vRouters cannot be scheduled, OR a few minutes of communication interruption for some VMs
  ○ Down time for some API requests during the Active/Standby switchover
    ■ Neutron server
    ■ Heat engine
    ■ Cinder volume
  ○ Nova instance console connection interruption
    ■ Users need to reconnect or get the console URL again.
Suggestions for communities
• Cinder-volume drivers: Active/Active HA support
  • At present, some drivers for commercial products prevent an Active/Active configuration
• Consistency of RPC API major versions
  • Rolling upgrade across one version has (limited) support in Nova.
  • It should be considered in all core projects.
  • If OpenStack components use oslo.messaging, errors caused by RPC API major version differences might occur during live upgrade.
• Seamless console connection
  • There was a discussion at the Juno summit on seamless console migration [5]
• Consider live upgrade when deprecating REST API versions
• SDN controller Active/Active HA support should be considered when integrating into Neutron as a plugin
• Although Ceilometer was not in the test scope, there are still gaps in its support for Active/Active HA
• Graceful shutdown of all services
Reference
• [1] Release Cycle
  • https://wiki.openstack.org/wiki/Release_Cycle
• [2] Releases
  • https://wiki.openstack.org/wiki/Releases
• [3] Percona Toolkit
  • http://www.percona.com/software/percona-toolkit
• [4] openark kit
  • http://code.openark.org/forge/openark-kit
• [5] Improve performance of live migration on KVM
  • https://etherpad.openstack.org/p/juno-nova-kvm-live-migration