Upgrading OpenStack without breaking everything (including Neutron?)


Today we're here to talk about upgrading OpenStack

Ideally we don't want to break everything

And the session description promised you we wouldn't even break Neutron, but we'll see how that worked out.

Both principal engineers at TWC on the OpenStack team
Clayton - focus on automation, CI/CD, deployments, etc
Sean - focus on networking, compute

● Our OpenStack team started with four people about two years ago
● We did our proof of concept implementation on Havana and then after the Atlanta summit decided to switch everything to Icehouse and VXLAN-based networking before going to production in the summer

● Since then we've done an upgrade to Juno and Kilo
● These are the versions of the services we're currently running
  ○ This talk will focus on our last round of control node upgrades, which included Nova, Neutron, Glance, Cinder and Heat
  ○ Since our Kilo upgrade, we've moved Heat into a Docker container and upgraded it to Liberty
  ○ Horizon and Keystone aren't included because those were already on Kilo.

● There are a few core tenets that we feel are important and that we try to follow regarding OpenStack upgrades.

● The first one is: You really don't want to fall behind.
● We plan on upgrading every 6 months
We think you should too, even if you want to wait for bug fixes on the stable branch
The primary reason is that it's the only tested path for upgrades
And with rolling upgrades and lazy DB migrations, there are now intermediate steps that have to be done between releases
For example, in Kilo, the nova flavor migration must be run before upgrading to Liberty

Automate everything
If you don't automate everything, then when you start your testing...
You're going to feel like this guy
Test it over and over
Get your process down
Upgrades might impact customers, so try to find out what that impact is

● Our team gave an upgrades talk in Vancouver, so some of you may have been to that talk as well
  ○ We appreciate anyone that felt like they wanted to hear us talk about OpenStack upgrades twice in one year.
● We're going to try not to cover too much of the same ground; the Juno talk is on YouTube and it covers our overall approach
  ○ We're going to talk more about updates to that approach and issues we ran into while upgrading to Kilo

● So when deciding timing for our Kilo upgrade, there was one major feature we were looking forward to:

● Like most people using OpenStack, we use RabbitMQ as the message broker for all intra-service communication
● Like most people using OpenStack, we've had tons of problems with this, although it's gotten better
● The biggest remaining problem we've seen with Juno was that if anything went wrong, OpenStack services would not realize they were disconnected from Rabbit
  ○ Nova-compute was particularly bad about this.
● AMQP heartbeats are a protocol-level feature that lets the RabbitMQ server and clients check in on each other regularly
  ○ If one of them goes missing, everything gets cleaned up and clients can reconnect in a timely fashion
  ○ This was added as an experimental feature in Kilo and we'd heard good things.

Before you start down the path of upgrading, you have to know your requirements for acceptable downtime and outages.
This also requires balancing technical capabilities and desires with customer needs.
For instance...

If you can just forklift upgrade to a new environment, or even reinstall the same servers, that's the easiest approach.
We, as operators, love this. It makes our lives operationally easy.
Another option we like is to...

● ...think of the upgrade process as a pit stop...
● pulling the entire cloud out of the race and swapping workloads over a short period of time.
● It's a short outage, but a total one.
● The problem is, our customers don't want _any_ outage

● This is what our customers want. Zero downtime! That's what we need.
● These guys change two tires on the car in about 5 minutes, while the car is driving down the road the whole time.
● And, unfortunately, we don't get to change the tires on just one side of the car.

● In the end, our requirements ended up being...
● Our customers are OK with an API outage for, say, 10 or 15 minutes.
● They're not OK with any other sort of outage
● This is basically what our requirements were for both our Juno and Kilo upgrades
● For Juno, our upgrade weakness was networking.
● Let's talk about our improvement goals for our Kilo upgrade

For the Kilo upgrade we also integrated lessons learned from our Juno upgrade.
This meant...
We did our Juno upgrade in the early evening, and the feedback from our customers was that this was their peak time. For Kilo we changed our upgrade time to be 2am local time. (ugh)
We also realized that we needed to test major upgrades using production data from both regions; we did this and thankfully didn't have any issues there.

The major problem with our Juno upgrade was that we had unexpected network outages when upgrading in production:
The primary reason for this was that we had dramatically more routers in our production environment than we did in dev or staging.
In dev and staging the outage was just not long enough for us to notice it, and we weren't doing good monitoring
To address this:
We put tooling in place to spin up around 100 virtual networks and routers, and an instance behind each one, in order to give us a more realistic test environment
We also put in place high-granularity ping monitoring of those instances so we could get good metrics about what was going on during our upgrade testing.
This was really effective in letting us understand what was happening during the testing

● We talked before about how important upgrade automation is; I just want to touch on that briefly and cover how we handle it
● All of our upgrade automation is done using Ansible to drive changes via Puppet
  ○ Puppet is responsible for all package management, config changes, service restarts, etc
  ○ Ansible does everything else and handles all orchestration and ordering
● This is something we covered in a fair amount of depth in our Vancouver talk, if you're interested in more detail
● When doing our Kilo upgrade, we started with the Juno upgrade automation and we were able to reuse nearly all of it

● So let's look at what our actual upgrade process looks like

● This is what our starting point looks like for our control cluster.
● We have 3 control nodes.
  ○ Each node hosts the services we're going to be upgrading, plus a bunch of virtual routers.
  ○ They are also all part of a shared MySQL cluster and RabbitMQ cluster.
  ○ External users talk to these nodes via a hardware load balancer.
  ○ What's not shown here is that internal traffic goes through HAProxy
● So let's walk through the process of the actual upgrade.
  ○ Keep in mind that all the steps you are seeing were automated with Ansible playbooks.

● The goal here was to take two of the control nodes out of service and then upgrade the first node. Here's how we got there.

● The first thing we do is shut down and back up the database on two of the nodes
● Next we use L3 agent failover to move all the routers from the first control node to the other two.
  ○ The issue we're trying to avoid here is that when the OVS agent is restarted during the upgrade
    ■ It will drop all network flows, leading to a loss of network connectivity.
    ■ We're going to talk about that more later on
  ○ To avoid that, we shut down the L3 agent on the first control node
    ■ After the L3 agents on nodes 2 and 3 detect the "failure" of the L3 agent on the first control node, they'll start taking over those routers
  ○ Once all routers are moved, we disable the L3 agent on node 1 via the Neutron API, so that when it comes back up during the upgrade, routers don't move back automatically (see the sketch below).
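That last step is just an agent update through the Neutron API. A minimal sketch, assuming python-neutronclient 2.x; the credentials and host names are placeholders:

```python
# Sketch: disable the L3 agent on the control node we're about to upgrade,
# so routers don't fail back to it when the node's services restart.
# python-neutronclient 2.x; credentials and host names are placeholders.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone.internal:5000/v2.0')

# Find the L3 agent running on the node being upgraded.
agents = neutron.list_agents(agent_type='L3 agent', host='control1')['agents']
for agent in agents:
    # Only do this once its routers have been taken over by nodes 2 and 3.
    neutron.update_agent(agent['id'], {'agent': {'admin_state_up': False}})
    print('Disabled L3 agent %s on %s' % (agent['id'], agent['host']))
```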

● This leaves us functional, not in an outage, but with a cluster of only one.
● The last thing we do before starting the API outage is get a list of all instances with floating IPs
  ○ We set up a small script to ping all the floating IPs and report on their status while we proceed with the upgrade (a rough sketch follows)
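Roughly what that script looks like, assuming the floating IPs have already been dumped to a file, one per line (the file name is a placeholder):

```python
#!/usr/bin/env python
# Sketch: keep pinging a list of floating IPs and report which ones stop
# responding during the upgrade window. 'floating_ips.txt' is a placeholder
# for whatever file the floating IP list was dumped into beforehand.
import subprocess
import time

def is_up(ip):
    # One ping with a one-second timeout; exit code 0 means it answered.
    return subprocess.call(['ping', '-c', '1', '-W', '1', ip],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

with open('floating_ips.txt') as f:
    ips = [line.strip() for line in f if line.strip()]

while True:
    down = [ip for ip in ips if not is_up(ip)]
    print('%s  %d/%d unreachable: %s'
          % (time.strftime('%H:%M:%S'), len(down), len(ips), ', '.join(down)))
    time.sleep(5)
```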

● Start the API outage by turning off the external load balancer
  ○ We ran into some issues here, but we're going to cover that later
● Then we shut down all OpenStack services on all 3 control nodes.
  ○ The goal is to not have Juno services trying to make changes against a Kilo database
  ○ The routers continue to function because that forwarding happens in the kernel

● Run Puppet on the first control node. It upgrades all the packages, updates config file settings and finally restarts all the services
  ○ We set OS_ENDPOINT_TYPE to internalURL when running Puppet so that it can talk via the internal HAProxy load balancer instead of the external endpoints that we've disabled
  ○ This also sets the nova API compat flag so that Juno compute nodes can still talk to the Kilo control services.
● When this is complete, we run a simple smoke test via the CLI clients to verify the services have basic functionality before continuing on
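The smoke test is nothing fancy: one read-only command per CLI client, run against the internal endpoints. A sketch along those lines, assuming the usual OS_* auth variables are already exported:

```python
#!/usr/bin/env python
# Sketch: post-upgrade smoke test -- run one read-only command from each
# Kilo-era CLI client and report pass/fail. Assumes OS_USERNAME, OS_PASSWORD,
# etc. are already exported in the environment.
import os
import subprocess

CHECKS = [
    ('nova',    ['nova', 'list']),
    ('glance',  ['glance', 'image-list']),
    ('cinder',  ['cinder', 'list']),
    ('neutron', ['neutron', 'net-list']),
    ('heat',    ['heat', 'stack-list']),
]

env = dict(os.environ)
# The external endpoints are behind the disabled load balancer, so point
# the clients at the internal HAProxy endpoints instead.
env['OS_ENDPOINT_TYPE'] = 'internalURL'

failed = False
for name, cmd in CHECKS:
    rc = subprocess.call(cmd, env=env,
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print('%-8s %s' % (name, 'OK' if rc == 0 else 'FAILED (rc=%d)' % rc))
    failed = failed or rc != 0

raise SystemExit(1 if failed else 0)
```

(As we'll get to later, some of the Kilo CLI clients ignored OS_ENDPOINT_TYPE, which is exactly where this bit us.)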

● Once we've completed our smoke tests, we want to start getting things back to normal
● We enable the L3 agent on the Kilo control node; it will detect that the L3 agents on the other two nodes are dead.
● Once it's given up on them, it will start plumbing out everything needed for the routers on the first control node and they'll be moved automatically.
  ○ A little later we'll talk about the gross workarounds that were needed to make this work well.

● We re-enable the load balancer. We're out of outage and back to a one-node cluster.
  ○ The length of the API outage is basically the time to move routers, install new packages and run DB migrations
● We can now relax a bit, the worst is mostly over, but we have two more control nodes to upgrade

● The next step is to get the MySQL Galera cluster back up and running.
● When we start the database on the other nodes, Galera replication will ensure the databases on those nodes are up to date.
  ○ No more database migrations are needed.
Then we let Puppet run through the other two nodes one by one, upgrading packages to Kilo and restarting services.
Once all nodes are upgraded we're nearly done, except one node is hosting all the routers. We have a script that will rebalance the routers evenly across the nodes, while avoiding moving any high-profile tenants (sketched below).
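A condensed sketch of what that rebalance can look like; the agent-scheduler calls are python-neutronclient 2.x methods, but the balancing logic and the protected-tenant list are simplified placeholders rather than our exact script:

```python
# Sketch: spread routers roughly evenly across live L3 agents, skipping
# routers owned by tenants we don't want to disturb. Credentials and the
# protected-tenant list are placeholders.
from neutronclient.v2_0 import client

PROTECTED_TENANTS = {'high-profile-tenant-id'}  # placeholder

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone.internal:5000/v2.0')

agents = [a for a in neutron.list_agents(agent_type='L3 agent')['agents']
          if a['alive'] and a['admin_state_up']]
load = {a['id']: neutron.list_routers_on_l3_agent(a['id'])['routers']
        for a in agents}
target = sum(len(r) for r in load.values()) // len(agents)

for agent_id in list(load):
    # Move this agent's excess routers to whichever agent is least loaded.
    for router in load[agent_id][target:]:
        if router['tenant_id'] in PROTECTED_TENANTS:
            continue
        dest = min(load, key=lambda a: len(load[a]))
        if dest == agent_id:
            break
        neutron.remove_router_from_l3_agent(agent_id, router['id'])
        neutron.add_router_to_l3_agent(dest, {'router_id': router['id']})
        load[dest].append(router)
        print('Moved router %s -> agent %s' % (router['id'], dest))
```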

● And now we're done with control nodes. We do a bunch more testing here, including
  ○ Live-migrating canary instances on compute nodes
  ○ Running our regression test suite
  ○ Checking logs, etc.

● To finish the upgrade, we need to get the compute nodes upgraded
● We live-migrate all instances off of a few compute nodes and put canary instances on them
● Upgrade those nodes and do extensive testing on them
  ○ Live migration, volume attach/detach, etc
● Proceed with a normal deploy
  ○ This causes a short outage because the OVS agent drops all flows when it's restarted.
  ○ Unfortunately we can't avoid this for Kilo
● Control and compute upgrades took less than 3 hours per region, and we did the two regions on separate nights.
● The last thing we did was merge a change to remove the API compat flag on the control nodes and deploy that as part of the next normal deploy

Overview

● As we mentioned before, a big problem in our Juno upgrade was loss of customer network connectivity during the upgrade
● We tracked this down to several causes:
  ○ Tunnel MAC learning flows have a default timeout of 5 minutes and require the L2 agent to be running to refresh them. If your upgrade takes more than 5 minutes, they're going to expire and you're going to drop customer traffic.
  ○ On startup the OVS L2 agent flushes all flows.
    ■ Dropping all the flows wouldn't be too bad, except that rebuilding them on a busy control node is *really* slow
    ■ Over 10-15 minutes for a complete rebuild of 2500 flows for 50-60 routers.
  ○ The other issue we ran into was caused by our abuse of router HA agent failover beyond its design.
    ■ The router on the old control node would continue ARPing for the gateway, and blackholing the traffic
● Here's how we addressed these...

Detail

● Early in the upgrade we change the OVS MAC learning flow timeouts on all compute and control nodes from the default of 5 minutes to 30 minutes.
  ○ The reason we do this is that we know we're going to have Neutron down long enough during the upgrade that the 5-minute timers will expire and we'll start dropping traffic
  ○ There is still the remaining issue that any *new* flows may expire before the upgrade is complete
    ■ We didn't observe this being an issue in practice.

Detail
● The first workaround is to avoid ever restarting the OVS agent on a node that is actively passing traffic.
  ○ On control nodes you just move the routers to a box that's not actively being upgraded
  ○ On compute nodes you could do live migration, but we decided not to, since rebuilding flows there is much faster due to lower density.
● We use L3 agent failover to pre-build flows when we move routers.
  ○ This means that the time to build those flows occurs before we have an outage, instead of during it.
● Lastly, the long-term fix for this is in Liberty.
  ○ In Liberty, the OVS agent will tag flows with a cookie so that it can properly identify the flows in the future
  ○ On restart, instead of rebuilding everything, it will synchronize the OVS flow state with what Neutron wants it to be, instead of the brute-force approach that it used to take

Detail
● Lastly, we had to work around this issue with routers not moving properly sometimes
● After moving the routers to the new control node, we cleaned them up on the old hosting control node with the following steps (roughly sketched below):
  ○ Delete flows in the integration and tunnel bridges
  ○ Delete all the router ports
  ○ Delete the router namespaces
● This is absolutely a brute-force approach, but it was very effective in avoiding the ARP issue and very few tenants lost networking with this approach.
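Very roughly, that cleanup looks like the following; the bridge names and the qr-/qg- port prefixes match a standard OVS/L3-agent layout, but treat this as an illustrative sketch rather than our exact tooling:

```python
#!/usr/bin/env python
# Sketch: brute-force cleanup of leftover router state on the OLD control
# node after its routers have been taken over elsewhere. Only safe on a
# node that is no longer hosting routers or passing tenant traffic.
import subprocess

def lines(cmd):
    return subprocess.check_output(cmd).decode().split()

# 1. Drop all flows in the integration and tunnel bridges.
for bridge in ('br-int', 'br-tun'):
    subprocess.check_call(['ovs-ofctl', 'del-flows', bridge])

# 2. Delete leftover router ports (internal qr-* and gateway qg-* ports).
for bridge in ('br-int', 'br-ex'):
    for port in lines(['ovs-vsctl', 'list-ports', bridge]):
        if port.startswith(('qr-', 'qg-')):
            subprocess.check_call(['ovs-vsctl', 'del-port', bridge, port])

# 3. Delete the stale qrouter-* namespaces so nothing on this node keeps
#    ARPing for the gateway and blackholing traffic.
for ns in lines(['ip', 'netns', 'list']):
    if ns.startswith('qrouter-'):
        subprocess.check_call(['ip', 'netns', 'delete', ns])
```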

● So how did our testing and upgrade go?
Let's use real-world tropical storm Kilo as a metaphor for our Kilo upgrade
It slowly meandered all over the place and eventually died out after about 3 weeks.
The tropical storm was the 3rd longest-lasting tropical storm on record
We ran into a wide variety of minor and major problems, and we wish our Kilo upgrade had only lasted 3 weeks like the storm did
Even with lessons learned from Juno
Partially this was because we put more network testing in place and had to improve our tooling, and that's a worthwhile investment
But we also ran into a lot more problems with the Kilo upgrade.
Some of that was our own fault, and some of it... was other people's fault.

● After our upgrade in our second region, we realized that cinder-volume was completely broken
  ○ It was really odd, because we'd done exactly the same thing in the other region and it worked with no issues
● Eventually we tracked it down to this
  ○ The os_region_name variable is what Nova uses to determine which region's cinder endpoint it should talk to.
  ○ If you only have one region, this doesn't matter at all, there is only one cinder endpoint
    ■ If you have multiple regions, the libraries pick the endpoint with the lowest UUID
    ■ So when Nova tried to attach a volume, it was talking to cinder in the wrong data center!
    ■ So it was dumb luck that we ran into this in the second region, instead of the first.
  ○ The problem is that os_region_name used to be in the DEFAULT section.
  ○ In Kilo it moved to the [cinder] section, but we didn't catch that (a quick audit sketch follows this list)
● DEFAULT/os_region_name was deprecated in Juno, but we apparently ignored that when we did our upgrade
  ○ There was no mention of the removal of the backwards compatibility in the Kilo release notes
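A small config audit along these lines (a sketch, not our actual tooling) would have flagged this; the option/section pair is the one from this incident, and the path is the stock nova.conf location:

```python
# Sketch: warn if nova.conf still carries os_region_name under [DEFAULT]
# instead of the Kilo-era [cinder] section. Extend MOVED_OPTIONS with any
# other options you care about.
try:
    import configparser                   # Python 3
except ImportError:
    import ConfigParser as configparser   # Python 2

MOVED_OPTIONS = [
    # (old_section, option, new_section)
    ('DEFAULT', 'os_region_name', 'cinder'),
]

conf = configparser.RawConfigParser()
conf.read('/etc/nova/nova.conf')

for old_section, option, new_section in MOVED_OPTIONS:
    if conf.has_option(old_section, option):
        print('WARNING: %s is still set in [%s]; Kilo reads it from [%s]'
              % (option, old_section, new_section))
    elif not (conf.has_section(new_section)
              and conf.has_option(new_section, option)):
        print('WARNING: %s is not set in [%s] at all' % (option, new_section))
```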

● If you have more than 100-200 routers with python-neutronclient 2.3.x, you can run into this issue
  ○ It returns "Request URI too long"
● This is a bug that had already been fixed upstream, but Canonical packaged the version that was in the global requirements list
● The global requirements list had the Juno version of the neutron client until August
● Attempting to downgrade the Neutron client packages to work around this is how we ended up accidentally uninstalling Nova.

● So with the Kilo upgrade, you need to migrate flavor data after the upgrade to get things into the new way of storing that data.
● Once nova is brought up, it starts lazily migrating this data as flavors are accessed
● Shortly after the upgrade in a shared dev environment, we *accidentally* uninstalled Nova on all nodes
● We ended up with flavor data that was partially migrated because of this, and that caused Nova to crash on startup.
● We spent hours tracking this down and eventually had to fix it by hand by editing the database entries.
● After this we changed our automation to migrate the flavor data immediately after doing the upgrade, and before we brought API services back online
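In other words, the automation now forces the flavor-data migration to completion right after the schema migrations, while the APIs are still down. A sketch of that ordering; the nova-manage subcommand is the Kilo flavor migration as we understand it, so check it against your release before relying on it:

```python
#!/usr/bin/env python
# Sketch: run the Nova schema migrations and then force the Kilo flavor-data
# migration to completion *before* bringing the API services back up.
import subprocess

STEPS = [
    ['nova-manage', 'db', 'sync'],                 # schema migrations
    ['nova-manage', 'db', 'migrate_flavor_data'],  # flavor data migration
]

for cmd in STEPS:
    print('Running: %s' % ' '.join(cmd))
    subprocess.check_call(cmd)

print('Safe to start nova-api and the other control services now.')
```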

● In Kilo, Neutron added a new option, 'allow_automatic_dhcp_failover'
  ○ This provides the ability to have DHCP servers health-checked regularly, and if one fails, it will automatically be spun up on another DHCP agent.
● Unfortunately, it detects spurious failures pretty regularly, for us multiple times a day
● Unfortunately, when it does fail over, it hits another bug a good percentage of the time that causes the DHCP Neutron ports to get stuck in a creating status
  ○ So in effect this was killing good DHCP servers instead of recovering bad ones
● We don't even need this feature; we run three control nodes, and two DHCP agents per network
● However, it defaults to on, so for about a week after our upgrade we'd have tenants dropping offline because their DHCP server hit this combination of bugs and was dead until we manually cleaned things up (a detection sketch for the stuck ports follows this list)
● There was no mention of this feature in the release notes.
● Part of how we discovered that this feature existed and was buggy was by looking at the DHCP code changes on the master branch for Neutron and comparing them to the Kilo branch
  ○ We realized this feature had a lot of bugs when we found lots of fixes for it on the master branch.
  ○ Of the half dozen fixes, only one or two of them were backported.
● We ended up just turning this off
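Spotting the stuck DHCP ports is straightforward through the Neutron API. A hedged sketch with python-neutronclient (not necessarily what we actually ran) that just lists DHCP ports that aren't ACTIVE:

```python
# Sketch: find DHCP ports that are not ACTIVE, which is roughly the symptom
# the buggy failover leaves behind. python-neutronclient 2.x; credentials
# are placeholders.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone.internal:5000/v2.0')

ports = neutron.list_ports(device_owner='network:dhcp')['ports']
for port in ports:
    if port['status'] != 'ACTIVE':
        print('DHCP port %s on network %s is %s (host %s)'
              % (port['id'], port['network_id'], port['status'],
                 port.get('binding:host_id', '?')))
```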

● As I implied before, we ran into issues with validating services while the external endpoints were offline
● Normally the CLI clients get a list of service endpoints from Keystone and default to the public one
  ○ By setting the OS_ENDPOINT_TYPE environment variable, or passing the same thing in via a command-line option, you can override this and tell them to use the internalURL, which for us is separate and based on HAProxy
● The issue is that some of the CLI clients, including Neutron and Cinder, were broken and would ignore both of these.
● This broke our Puppet runs during the upgrade and it broke our smoke test scripts
● Unfortunately, because we found this issue very late in the process, we ended up deciding to just leave the external LB in place for our production upgrades.

● We also ran into schema problems with Glance.
● In Kilo, Nova started using the v2 Glance API
● The v2 API does schema validation, but the v1 API doesn't really
  ○ So it was possible to create images via the v1 API with attributes that the v2 API thought were invalid.
  ○ Like description being NULL instead of an empty string
  ○ When that happened, Nova couldn't do anything with the image, because it would fail schema validation via the v2 API
● There was no way to tell Nova to use the v1 API instead
● Flavio from the Glance team helped us get this fixed very quickly
● Canonical backported it quickly

● We ran into a similar issue with Glance, but in the schema file instead of in Glance code
● The attributes this time were kernel_id and ramdisk_id
● We changed the schema file to allow these fields to be nullable
● This has been fixed upstream in the same way.

● When doing the first upgrade in our shared dev environment, we ran into a problem with Nova migrations
● MySQL was failing to run a migration to convert a column from NULL to NOT NULL
● It was failing because MySQL 5.6 has a bug that prevents converting a column to NOT NULL if it has a foreign key constraint
● This didn't happen in all of our environments, and if we did a mysqldump and restore, the problem went away
● We opened a support case with Percona, waited for them to track it down, and got a new build from them that resolved the issue.

● If you see a problem like this when running DB migrations, your problem is probably due to existing database tables not matching the default database sort order, or collation.
What happened for us is that we had some databases using utf8_unicode_ci, and the upstream Puppet modules changed the default database collation to utf8_general_ci
That means newly created tables had a different sort order than the existing ones, and when adding foreign keys between an old and a new table, MySQL would refuse to add them
This could happen for any database in theory, for any migration that changes foreign keys.
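A quick way to spot this ahead of a migration is to compare each table's collation against the database default. A sketch using PyMySQL; the connection details and the 'nova' schema name are placeholders:

```python
# Sketch: list tables whose collation differs from their database's default,
# which is the mismatch that made our foreign-key migrations fail.
import pymysql

SCHEMA = 'nova'  # placeholder: check each OpenStack database

conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='information_schema')
with conn.cursor() as cur:
    cur.execute('SELECT default_collation_name FROM schemata '
                'WHERE schema_name = %s', (SCHEMA,))
    default_collation = cur.fetchone()[0]

    cur.execute('SELECT table_name, table_collation FROM tables '
                'WHERE table_schema = %s AND table_collation <> %s',
                (SCHEMA, default_collation))
    for table_name, collation in cur.fetchall():
        print('%s.%s uses %s (database default is %s)'
              % (SCHEMA, table_name, collation, default_collation))
conn.close()
```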

● The Keystone middleware that all projects use for token validation was moved into a separate package in Juno, but Juno still supported the old library names.
In Kilo the old names were removed, but this wasn't mentioned in the Kilo release notes.
The control nodes we had that were upgraded from Icehouse still had the old value
This was an easy fix once we found it.
Issues like this are particularly hard to find, since oslo.config's normal deprecation mechanisms can't cover this scenario

● Last but not least, we found this problem after completing our first prod upgrade and turning API services back on
There's a new feature in the Nova scheduler called "scheduler_tracks_instance_changes". This can track instance state to allow scheduler filters to make more informed decisions.
This is the commit message for the new feature
On startup, the scheduler polls all compute nodes for instance state in batches of 10 at a time
Our experience was that this meant nova-scheduler was chewing up 100% of a core until this was done, and it took forever to finish
RabbitMQ would get disconnected -- we believe because heartbeats were failing due to the thread not being scheduled
Even after turning off heartbeats, we still saw instances not being scheduled while this was enabled
We don't use any scheduler filters, so we didn't need it and turned it off
There were only vague notions of this in the release notes, and we didn't understand what was going on until we found this commit message.
The DocImpact tag definitely didn't translate into release note updates in this case.

● After all those issues, this is about how we felt by the time we were done with our prod Kilo upgrades
● If you haven't seen Groundhog Day, you should, it's literally a classic.

● So a number of the problems we ran into are because we didn't pay attention to deprecations in Juno, and when those features were removed in Kilo, we didn't know, because we only read the Kilo release notes for our Kilo upgrade, not the Juno release notes as well.
● MySQL has bugs, and we're good at finding them with OpenStack upgrades. Yay?
● Part of the reason we upgrade is that we want new features (and bug fixes), but at least two of the problems we had were because new features were on by default, and they were buggy.
● Buggy services are one thing, but in both cases there was no real documentation around these features.
  ○ One of them wasn't mentioned in the release notes at all, and the other had no detail about what it did
● And to give credit where credit is due, some projects are really good at release notes.
  ○ The Cinder Kilo release notes were widely credited as being good at the Operators' Mid-Cycle meetup
  ○ Looking through the Liberty release notes, the Nova section is really, really good. It would be nice if everyone followed their example.

● So with that litany of horrible issues, you may be wondering if we thought upgrading was worthwhile:
● After resolving these issues, overall stability has improved
● AMQP heartbeats have increased stability dramatically for us.
  ○ This has cleared up a lot of intermittent issues for us, and also allowed us to put RabbitMQ behind a load balancer.
  ○ We wanted to put Rabbit behind a load balancer because we're in the process of moving our OpenStack environments to a new network architecture, and this helps us quiesce RabbitMQ before taking it offline.

● To wrap up, let's talk about our next upgrade
● We've started some work on moving to Liberty already
  ○ We're on master for all of the Puppet modules now (except keystone)
  ○ We don't know what the timing for our Liberty upgrade will be yet, but I'll be surprised if it's not before Austin
● We've learned that no matter what, we're going to run into weird problems.
  ○ For example, we ran into MySQL bugs in both the Juno and Kilo upgrades, so apparently we should just assume that will happen and add another two weeks to get that fixed...
● We're going to continue moving services into containers. We've got Heat and Designate in containers now, and it's allowed us to upgrade them (or not) independently of other services.
  ○ This will allow us to avoid having to deal with conflicting dependencies between services
  ○ It also allows us to stage the new version of a service before the upgrade. Right now a lot of our upgrade time is actually installing packages.
● As we've mentioned before, a lot of the complexity in our upgrades has to do with the fact that upgrading the OVS agent causes it to drop all active flows.
  ○ We're really looking forward to deleting a bunch of code, assuming this works in Liberty (it's on by default)
● Lastly, we're hoping to move to using HA routers once we're on Liberty, and with that in place we hope to avoid moving any routers around during the upgrades
  ○ Hopefully that will help with our Mitaka upgrade

● That's all we've got, we appreciate everyone coming
● Hopefully we have some time for questions