© 2015 NTT Software Innovation Center
NTT’s Journey with OpenStack
Shintaro Mizuno
Takashi Natsume
NTT Software Innovation Center
OpenStack Summit Tokyo 2015
2 Copyright©2015 NTT corp. All Rights Reserved.
Outline
1. Introduction
2. How we did it before
3. How we do it now
4. Giving back to the community
5. Next steps
Introducing NTT Group
Other Businesses
R&D
OpenStack in production
Other Businesses
R&D
R&D Cloud since 2013
Multiple customer environments
E-mail servers since 2014
Public cloud service since 2013
Web service at NTT Resonant since 2014
R&D Dev environment
Field trial with a customer since 2014
Community contribution
• Total commits: 1107 (ranked 18th of 263)
• Total LOC: 127,575 (ranked 25th of 267)
• Reviews: 5937 (ranked 16th of 212)
• Draft Blueprints: 103 (ranked 16th of 212)
• Completed Blueprints: 35 (ranked 18th of 138)
• Filed Bugs: 797 (ranked 14th of 237)
• Resolved Bugs: 439 (ranked 14th of 204)
• 67 contributors in total from across NTT Group
Source: www.stackalytics.com as of 10th Sep 2015
Behind the scenes of
R&D cloud and Public cloud development
Timeline
[Figure: timeline, 2011 to 2015. Releases: Cactus, Diablo, Essex, Folsom, Grizzly, Havana, Icehouse, Juno, Kilo, Liberty. Markers: 1st production development, Current development, Joined the Community]
How we did it in the 1st development
In 2012 (Folsom era),
when people were still skeptical about the OpenStack hype,
we focused on QA tests,
including:
- Full-API function test (incl. parameter boundary tests)
- Non-API function test
- Full state transition test
- External-system failure test
- API race conditions/multiple requests
- Long-term stability test (scenario test)
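The parameter boundary tests in the list above can be sketched as follows. This is a minimal illustration, not the test suite used in the project; the helper name `boundary_values` and the 1 to 255 character limit are assumptions made for the example.

```python
def boundary_values(minimum, maximum):
    """Return integer test inputs around an inclusive [minimum, maximum] range.

    Each value is paired with whether the API is expected to accept it.
    """
    candidates = [minimum - 1, minimum, minimum + 1,
                  maximum - 1, maximum, maximum + 1]
    unique = []
    for value in candidates:
        if value not in unique:  # narrow ranges produce overlapping candidates
            unique.append(value)
    return [(value, minimum <= value <= maximum) for value in unique]


# Hypothetical limit: a resource name must be 1 to 255 characters long.
cases = boundary_values(1, 255)
```

Each pair in `cases` would then drive one API call, checking that in-range values succeed and out-of-range values are rejected with a client error.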
Network QA tests to understand the limits
- Function
- Max MAC address learning, learning speed, MTU, fragment
- Capacity
- Max routers per tenant/region, static routes per router, num ports per network, networks per region, dhcp servers per network node…
- Performance
- Throughput for: VM to VM, VM to external network via router
- Multiple tenants, multiple networks, multiple routers, short packets, long packets, noisy neighbor, DoS simulation…
- API request processing speed, time to apply changes
- Availability
- Network node switchover time, packet loss, high network load, with numbers of routers, with floating IP
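One way to estimate switchover time and packet loss is to send probes at a fixed interval and look for the longest gap in replies. The sketch below assumes the probe results have already been collected as (send time, replied) pairs; the function name and data layout are hypothetical, not taken from the original tests.

```python
def longest_outage(probes):
    """Return (outage_seconds, lost_count) for the longest run of lost probes.

    `probes` is a list of (send_time_seconds, replied) tuples in send order.
    The outage is measured from the last answered probe before the gap to the
    first answered probe after it, so it is an upper bound on the real
    switchover time.
    """
    best = (0.0, 0)
    last_ok = None  # send time of the most recent answered probe
    lost = 0        # unanswered probes since last_ok
    for sent_at, replied in probes:
        if not replied:
            lost += 1
            continue
        if last_ok is not None and lost and sent_at - last_ok > best[0]:
            best = (sent_at - last_ok, lost)
        last_ok = sent_at
        lost = 0
    return best


# Probes every 0.5 s; three are lost while the network node fails over.
sample = [(0.0, True), (0.5, True), (1.0, False),
          (1.5, False), (2.0, False), (2.5, True)]
outage = longest_outage(sample)
```

A shorter probe interval tightens the bound at the cost of more probe traffic.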
Quality level found
Major issues/weaknesses found in Folsom
- API race conditions, especially in Quantum - Lacking an appropriate locking mechanism
- E.g. create port + create port = error
- Internal error handling - Lacking exception handling in many cases
- Resources fell into the “ERROR” state very easily
- Need to clean up orphan resources, e.g. vifs, ports, instances, etc.
- State transition - No workflow management.
- No rollback mechanism (e.g. migration, resize)
- API parameter validation
- HA feature (switchover time)
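The “create port + create port = error” case is a check-then-act race. The toy allocator below is not Quantum code; the class and its names are hypothetical, and it only shows the kind of locking that keeps concurrent requests from interleaving between the check and the update.

```python
import threading


class PortAllocator:
    """Toy stand-in for a service that hands out one IP address per port."""

    def __init__(self, addresses):
        self._free = list(addresses)
        self._ports = {}
        self._lock = threading.Lock()

    def create_port(self, port_id):
        # The check ("is an address free?") and the update ("take it") run
        # under one lock, so two requests cannot interleave between them.
        with self._lock:
            if not self._free:
                raise RuntimeError("no address available")
            address = self._free.pop(0)
            self._ports[port_id] = address
            return address


allocator = PortAllocator(["10.0.0.%d" % i for i in range(2, 52)])
results = []


def worker(port_id):
    results.append(allocator.create_port(port_id))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a real multi-process service the lock would have to live in the database or a distributed lock manager rather than in process memory.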
Our answer in 2012
“Folsom has good features!”
“…but it’s too fragile for public clouds”
Our first "Folsom-based" system
[Figure: the end user/operator uses a proprietary layer (GUI/CLI/API, Resource Mgmt, Transaction Mgmt, Host Mgmt, User Mgmt, Workflow engine, DB) placed in front of Folsom (Nova, Cinder, Glance, Quantum (Neutron), Keystone), with a patch on each component and a Driver]
We built a proprietary system to be “gentle” to OpenStack
What we added
- Proprietary GUI for end-users - Provides a “business view” of resources and doesn’t let users touch OpenStack resources/features directly
- Proprietary operation GUI - Host management, monitoring, resource/user management
- Transaction Management - API workflow management using Request-id tracking/notification
- Added a “purge” feature for rollback/roll forward/clean-up after API failure
- Workflow engine - Executes scenarios composed of multiple API calls (like what Heat does)
- API parameter validation check - Strict parameter validation before handing over to the OpenStack API
- Cinder Driver for EMC VNX - There wasn’t one from EMC!
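The idea of a workflow engine that runs several API calls and cleans up after a failure can be sketched as below. This is an illustration only, not the proprietary engine; `run_workflow` and the step names are hypothetical, and the lambdas stand in for real API calls.

```python
def run_workflow(steps):
    """Run (name, do, undo) steps in order; undo completed steps on failure.

    Returns the names of the completed steps. On failure the completed steps
    are undone in reverse order and the original exception is re-raised.
    """
    done = []
    try:
        for name, do, undo in steps:
            do()
            done.append((name, undo))
    except Exception:
        for _name, undo in reversed(done):
            undo()
        raise
    return [name for name, _undo in done]


log = []


def failing_attach():
    raise RuntimeError("attach failed")


steps = [
    ("create_volume", lambda: log.append("create volume"),
     lambda: log.append("delete volume")),
    ("boot_server", lambda: log.append("boot server"),
     lambda: log.append("delete server")),
    ("attach_volume", failing_attach, lambda: log.append("detach volume")),
]

try:
    run_workflow(steps)
except RuntimeError:
    pass  # log now holds the two forward steps followed by their undo steps
```

Undoing in reverse order matters because later resources usually depend on earlier ones.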
Convincing business people
Question to answer:
“Why should we use OpenStack when we already have vCenter and CloudStack?”
What we discussed
- Cost comparison
- Compute feature comparison with vCenter
- Network feature comparison
- Future growth expectations
How we dealt with 150 OpenStack bugs
• Patches
• Live migration bug (Nova, about 13%)
• Input check improvement (about 9%)
• Log output improvement (about 7%)
• Unnecessary ‘things’ remaining (about 6%)
• Add timeout parameter (about 4%)
• API response improvement (about 4%)
  • Change HTTP Status code
• Volume boot bug (Nova, about 3%)
• Security (about 3%)
• Race condition (about 3%)
We upstreamed our patches together with Canonical because there were so many of them!
How we dealt with 150 OpenStack bugs (contd.)
• Merged (18 patches)
  • Tests (about 27%)
  • Race condition bugs (about 17%)
  • Unnecessary ‘things’ remaining (about 11%)
  • Add timeout parameter (about 11%)
• Rejected
  • Multiplicity control function
  • Input parameter check (do it in the next major API version)
• Already merged by other companies (about 60 patches)
  • Input parameter check
  • Delete namespaces when they are no longer needed
  • Multiple regions support for Quantum in nova-compute
• No need to upstream (about 50 patches)
  • The bug cannot be reproduced, etc.
Upstreaming proprietary functions
• Transaction Management and Workflow engine
  • Log-request-id-mapping
    • Enables us to analyze API calls between components by mapping each request ID
    • Our proprietary function used a common request ID, which enabled us to analyze API calls between components by tracking one request ID.
    • The spec has been approved in openstack-specs. We will implement it.
  • TaskFlow
    • Needed for our retry, rollback and API trace (checking the progress of an API process) functions
    • Work in progress
  • A lot of things to do...
    • Force delete for ‘rollback’
    • Optimization of Error Handling
• EMC driver
  • Use the driver provided to the community by EMC Corporation (we do not upstream ours)
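Mapping request IDs makes it possible to follow one user request across components. The sketch below assumes each cross-component call has been logged as a (caller request ID, callee service, callee request ID) record; that record layout, the function name and the IDs are hypothetical and do not reflect the format defined in the approved spec.

```python
from collections import defaultdict


def trace(root_id, mappings):
    """Return every (service, request_id) reached from root_id, depth first.

    `mappings` is a list of (caller_request_id, callee_service,
    callee_request_id) tuples, one per logged cross-component API call.
    """
    children = defaultdict(list)
    for caller, service, callee in mappings:
        children[caller].append((service, callee))

    found = []
    seen = {root_id}
    stack = list(reversed(children[root_id]))
    while stack:
        service, request_id = stack.pop()
        if request_id in seen:
            continue  # guard against duplicated log records
        seen.add(request_id)
        found.append((service, request_id))
        stack.extend(reversed(children[request_id]))
    return found


records = [
    ("req-nova-1", "glance", "req-glance-7"),
    ("req-nova-1", "neutron", "req-neutron-3"),
    ("req-other", "cinder", "req-cinder-9"),
    ("req-neutron-3", "keystone", "req-keystone-5"),
]
calls = trace("req-nova-1", records)
```

With a single common request ID, as in the proprietary function, the same analysis reduces to filtering all logs by that one ID.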
What we learned from the first release
• “Upstream-first” is very important
• Development and bug-fix work is wasted when the same problems have already been fixed by other companies in the community code.
• Our proprietary functions/tools have to be modified when a prerequisite function cannot be merged.
• Upstreaming our proprietary functions takes a long time, since it requires coordination and persuasion in the community.
Timeline
[Figure: timeline, 2011 to 2015. Releases: Cactus, Diablo, Essex, Folsom, Grizzly, Havana, Icehouse, Juno, Kilo, Liberty. Markers: 1st production development, Current development, Joined the Community]
How we do it now…
We had to change our mindset: “Don’t be greedy.
Find a way to live with the community code”
How we do it now
Features:
1. Try to be satisfied with what you have, or figure out how to make do with what you can get
2. Try to write a spec/RFE to realize your ideas (it’ll take quite some time, though)
3. (If upstream doesn’t work) and (if you really, really need it) and (if you can afford it), then think of building it “outside”
How we do it now
Bugs:
1. Report the bug and wait
2. If you need it quickly, pick up the bug and fix it
3. If the community won’t fix it, or if the community says “it’s a spec”, try to live with it by “writing documents”
   1. Workarounds and recovery manuals for operators
   2. FAQs for users
4. If the bug may cause a critical system failure, consider closing the relevant APIs until it gets fixed.
5. If none of the above works, create an in-house patch, but “keep it minimal” and maintain it.
What we did and didn't do
In response to requirements from the service/operation side:
We dropped everything that needed to change OpenStack specs:
- Features that would change current API behavior/specs
- “Do it like CloudStack/vCenter does” requests - Created workarounds or leveraged equivalent OpenStack features
We did what was mandatory for operation without changing OpenStack:
- Added an API filter to hide immature APIs (Apache proxy)
- Added a notification/API-log collection tool (external tool)
- Built a cascaded domain/tenant/user model using existing Keystone APIs (manual)
- Developed high availability for virtual machines (open sourced)
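The API filter itself ran in an Apache proxy. The sketch below only illustrates the allow-list logic such a filter applies; the rule set, the function name and the choice of exposed paths are assumptions for the example, not the production rules.

```python
import re

# Hypothetical allow rules: (HTTP method, path regex). Anything else is hidden.
END_USER_RULES = [
    ("GET", r"^/v2/[^/]+/servers(/[^/]+)?$"),
    ("POST", r"^/v2/[^/]+/servers$"),
    ("DELETE", r"^/v2/[^/]+/servers/[^/]+$"),
]


def is_allowed(method, path, rules):
    """Return True if (method, path) matches at least one allow rule."""
    return any(
        method == rule_method and re.match(pattern, path) is not None
        for rule_method, pattern in rules
    )


# Listing and creating servers is exposed; server actions are not.
can_list = is_allowed("GET", "/v2/tenant-a/servers", END_USER_RULES)
can_act = is_allowed("POST", "/v2/tenant-a/servers/abc/action", END_USER_RULES)
```

Keeping separate rule lists for end users and operators lets the same proxy expose a wider API surface to operation tools.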
Our current system overview
[Figure: end users/operators reach the OpenStack API of pure Juno/Kilo (Nova, Cinder, Glance, Neutron, Keystone) through a reverse proxy (Apache) with filter rules for end users and filter rules for operators. Notifications and API logs go to Notification/API log collection. Monitoring agents on the compute nodes send events to Virtual Machine High Availability (Masakari), which handles VM recovery. Other labels: OpenStack API (subset), Operation tools]
Our current OpenStack configuration (figure)
• DMZ Load Balancer (2): haproxy
  • pacemaker (1Act-1Sby): VIP (api & novncproxy endpoint)
• Controller Node (2): keystone-all, nova-api, nova-conductor, nova-novncproxy, nova-scheduler, cinder-api, cinder-scheduler, Apache (keystone), haproxy
  • pacemaker (1Act-1Sby): VIP (neutron-sv, haproxy), neutron-server, nova-consoleauth
• Network Node (4): OVS
  • pacemaker (nAct-1Sby): neutron-linuxbridge-agent, neutron-dhcp-agent, neutron-l3-agent
• Compute Node (4): nova-compute, OVS
• Backend Node (3): mysql-pxc (3Act), RabbitMQ (2Act)
  • pacemaker (3Act): VIP (MQ, PXC)
• Storage Node (2): glance-api, glance-registry
  • pacemaker (nAct-1Sby): cinder-volume (NFS, iSCSI)
• pacemaker (nAct)
• Legend: Active-Active
Our current OpenStack configuration
• stable/kilo (2015.1.0) and Ubuntu 14.04 LTS
• Host aggregates for VM scheduling
  • OS type (3 types) and memory capacity of nova flavors (2 types)
• Full HA architecture
  • HA on each node
• Multiple data center architecture
  • Support HA configuration between multiple data centers
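Scheduling with host aggregates amounts to matching properties requested by a flavor against metadata set on each aggregate. The sketch below shows that matching in plain Python; it is not Nova's scheduler code, and the metadata keys, aggregate names and host names are hypothetical.

```python
def hosts_for_flavor(flavor_specs, aggregates):
    """Return sorted hosts whose aggregate metadata satisfies every flavor spec.

    `aggregates` maps an aggregate name to {"metadata": {...}, "hosts": [...]}.
    """
    matched = set()
    for aggregate in aggregates.values():
        metadata = aggregate["metadata"]
        if all(metadata.get(key) == value
               for key, value in flavor_specs.items()):
            matched.update(aggregate["hosts"])
    return sorted(matched)


# Hypothetical aggregates keyed by OS type and memory class.
aggregates = {
    "linux-small": {"metadata": {"os_type": "linux", "memory": "small"},
                    "hosts": ["compute1", "compute2"]},
    "linux-large": {"metadata": {"os_type": "linux", "memory": "large"},
                    "hosts": ["compute3"]},
    "windows-small": {"metadata": {"os_type": "windows", "memory": "small"},
                      "hosts": ["compute4"]},
}
large_linux_hosts = hosts_for_flavor(
    {"os_type": "linux", "memory": "large"}, aggregates)
```

Grouping hosts this way keeps, for example, OS licensing and memory sizing decisions out of individual boot requests.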
Contributing to the community
• Cinder
  • Restrict users from uploading volume to image based on glance protected properties
• Glance
  • Restrict users from downloading image based on policy
  • Add multifilesystem store to support NFS servers as backend
  • Reload configuration files on SIGHUP signal
• Neutron
  • Add enable_new_agents to neutron server
  • Agent terminates services when turning admin_state_up False
Where OpenStack fits and still doesn't fit
Best fit in
- Private cloud hosting web services - Lower entrance barrier for the cattle model
Still hard but is running in production
- Public cloud for enterprise - Customers’ cattle are our precious pets
Maybe OpenStack is not the answer (at least for some time)
- Core network function virtualization
- Virtualization of legacy silo applications
Next steps
• Practical use of applications in upper layers (PaaS, etc.)
• Practical use of OpenStack in NFV
• We are now trying to upstream the following functions
  • Nova
    • Improve unshelve performance
  • Neutron
    • AZ support
  • Congress
    • Congress for OPNFV Doctor use case
  • Cross project
    • Log request-id mappings
  • Other
    • VM/HA (Masakari)
Sessions from/about NTT Group
• From NTT Group
  • Korejanai Story: How To Integrate OpenStack Into Your Business Strategy (October 29 3:30pm - 4:10pm)
  • Gohan: An Open-source Service Development Engine for SDN/NFV Orchestration (October 29 4:30pm - 5:10pm)
• About NTT Group
  • Telco OpenStack Roadmap Panel (October 29 1:50pm - 2:30pm)
Questions? Masakari wo nageru (“throw a hatchet”)