pets, cattle, and herding dogs
Copyright 2013 Alcatel-Lucent. All rights reserved. CONFIDENTIAL - SOLELY FOR AUTHORIZED PERSONS HAVING A NEED TO KNOW
PROPRIETARY – USE PURSUANT TO COMPANY INSTRUCTION Nuage Networks
Dimitri Stiliadis @dstiliadis
Pets, Cattle and Herding Dogs
Pets & Cattle
Don’t forget the herding dogs
The herding dogs keep the cattle safe
The control plane matters
Adventures with the Neutron Herd
• Goal: Push Neutron to its limits
• Maximize port activation rate
• Check stability under heavy load
• Interactions with other components (Nova, Keystone)
• Create a new Neutron benchmark
Does Neutron scale, and is it production ready?
Focus: Neutron + Nuage VSP
• Neutron consists of two components
• Core Neutron server
• Plugins
• Reference OVS/ML2 plugin used in most previous tests
• These tests only with the Nuage VSP plugin
Background (Canonical Tests)
At around 170 instances per compute server, we hit our next bottleneck; the Neutron agent status on compute nodes started to flap, with agents being marked down as instances were being created.
we took the decision to turn Neutron security groups off in the deployment and run without any VIF level iptables security.
however with Neutron in the design, we could not realistically get past 5-6 chassis of servers, so we took the decision to remove Neutron from the cloud design and run with just Nova networking.
with the revised configuration, we were able to create instances in batches of 100 at a respectable throughput of initially 4.5/sec
http://javacruc.wordpress.com/2014/06/18/168k-instances/
Nuage Networks Virtualized Services Platform (VSP)
Cloud Service Management Plane: Virtualized Services Directory (VSD)
• Network Policy Engine - abstracts complexity
• Service templates and analytics
Datacenter Control Plane: Virtualized Services Controller (VSC)
• SDN Controller, programs the network
• Rich routing feature set
Datacenter Data Plane: Virtual Routing & Switching (VRS)
• Distributed switch / router - L2-4 rules
• Integration of bare metal assets
[Diagram labels: WAN Router, MP-BGP, hypervisors, IP Fabric, Hardware GW for Bare Metal, Brooklyn Datacenter - Zone 1]
Differences from core implementation
Agent-less architecture (no L3 agent, no DHCP agent)
No network node
Distributed L2, L3, L4
Single multi-tenant bridge in compute nodes
Configuration of high level policies at compute nodes rather than ACLs
Scale-out architecture of controllers
Our Setup
• Control plane only testing in AWS
• Compute nodes use libvirt-lxc to avoid VM boot performance bottlenecks
[Diagram: test setup]
Components: Nova Ctrl (MySQL/RabbitMQ), Neutron Server, Nuage VSD, Nuage VSC, 41 Compute Nodes (Libvirt-LXC)
AMI types shown: c3.8xlarge (64 cores / 60G), c3.8xlarge, c3.8xlarge, c3.2xlarge (8 cores), c3.xlarge
Test
Create 5K networks
Activate instances randomly in the network using batch instance creation
Start 50 instances at a time, wait until they are done and continue
Where does it break?
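The batch loop described above ("start 50 at a time, wait until they are done and continue") can be sketched as below. This is only an illustration of the procedure, not the benchmark code used in the talk: `boot_batch` and `all_active` are hypothetical callables standing in for whatever issues the batch create request and polls instance status.

```python
import time
from typing import Callable, List, Sequence


def run_in_batches(
    total: int,
    batch_size: int,
    boot_batch: Callable[[int], List[str]],
    all_active: Callable[[Sequence[str]], bool],
    poll_interval: float = 1.0,
) -> float:
    """Boot `total` instances batch by batch; return instances per second."""
    started = time.monotonic()
    booted = 0
    while booted < total:
        count = min(batch_size, total - booted)
        ids = boot_batch(count)  # hypothetical: one batch create request
        if not ids:
            raise RuntimeError("batch create returned no instances")
        while not all_active(ids):  # hypothetical: poll until the batch is done
            time.sleep(poll_interval)
        booted += len(ids)
    elapsed = time.monotonic() - started
    return booted / elapsed if elapsed > 0 else float("inf")
```

Waiting for each batch to finish before starting the next keeps the offered load bounded, so the measured rate reflects how fast the control plane completes work rather than how fast requests can be queued.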
First attempt
1 instance/second
Timeouts all over the place
First Steps
Adjust nova and neutron workers
Tune Keystone (multiple workers)
MySQL connections
Improvement
Activated 4K instances in about 10 minutes (about 6.8 instances/second)
Can we do better? Where are the bottlenecks?
Nova and Neutron Server Utilization
[Chart: Neutron Server and Nova Server over time, 3:46:03 to 3:56:04; y-axis 0 to 80]
nova-scheduler
[Chart: Nova Scheduler over time, 3:46:03 to 3:56:04; y-axis 0 to 90]
mysqld
[Chart: mysql over time, 3:46:03 to 3:56:04; y-axis 0 to 800]
Query stats
Outliers (AWS EBS)
Queries take longer
First suspect for MySQL problems
mysqldumpslow -a -s r -t 5 /var/log/mysql/mysql-slow.log
Count: 20000 Time=0.06s (1142s) Lock=0.00s (2s) Rows=1.0 (20000), root[root]@ip-10-0-1-23.us-west-2.compute.internal
SELECT count(*) AS count_1 FROM (SELECT ports.tenant_id AS ports_tenant_id, ports.id AS ports_id, ports.name AS ports_name, ports.network_id AS ports_network_id, ports.mac_address AS ports_mac_address, ports.admin_state_up AS ports_admin_state_up, ports.status AS ports_status, ports.device_id AS ports_device_id, ports.device_owner AS ports_device_owner FROM ports WHERE ports.tenant_id IN ('S')) AS anon_1
Quota check gets a count of all ports for a tenant
We used just one tenant for all our ports
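To compare entries like the one above across runs, the mysqldumpslow summary line can be parsed into numbers. A minimal sketch, assuming the summary line has exactly the shape shown on the slide; the `SlowQuery` type and `parse_summary` helper are names made up for this example.

```python
import re
from typing import NamedTuple

# Matches lines such as:
# Count: 20000  Time=0.06s (1142s)  Lock=0.00s (2s)  Rows=1.0 (20000), user@host
SUMMARY = re.compile(
    r"Count:\s*(?P<count>\d+)\s+"
    r"Time=(?P<avg_time>[\d.]+)s\s+\((?P<total_time>\d+)s\)\s+"
    r"Lock=(?P<avg_lock>[\d.]+)s\s+\((?P<total_lock>\d+)s\)\s+"
    r"Rows=(?P<avg_rows>[\d.]+)\s*\((?P<total_rows>\d+)\)"
)


class SlowQuery(NamedTuple):
    count: int
    avg_time_s: float
    total_time_s: int
    avg_rows: float
    total_rows: int


def parse_summary(line: str) -> SlowQuery:
    """Extract call count, timing and row totals from one summary line."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError("not a mysqldumpslow summary line: %r" % line)
    return SlowQuery(
        count=int(m["count"]),
        avg_time_s=float(m["avg_time"]),
        total_time_s=int(m["total_time"]),
        avg_rows=float(m["avg_rows"]),
        total_rows=int(m["total_rows"]),
    )


line = ("Count: 20000  Time=0.06s (1142s)  Lock=0.00s (2s)  "
        "Rows=1.0 (20000), root[root]@ip-10-0-1-23.us-west-2.compute.internal")
entry = parse_summary(line)
print(entry.count, entry.total_time_s, entry.total_time_s / entry.count)
```

The total matters more than the average here: each call takes only about 0.06 s, but 20000 calls add up to 1142 s of cumulative query time.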
Corresponding Code
def get_ports_count(self, context, filters=None):
    return self._get_ports_query(context, filters).count()
That’s the wrong way to get a count in SQLAlchemy
Fixing the Query
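The fix shown on the slide is not preserved in this transcript. As a hedged illustration of the general idea only: SQLAlchemy's `Query.count()` wraps the full SELECT in a subquery and counts its rows, which is the shape seen in the slow log, whereas asking for `func.count()` directly (for example `session.query(func.count(Port.id)).filter(...).scalar()`, where `Port` stands for the mapped ports model) emits a plain `SELECT count(...) FROM ports WHERE ...`. The sketch below uses the standard-library `sqlite3` module and a cut-down, made-up `ports` table to show that both SQL shapes return the same number; the table contents and tenant names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ports (id TEXT PRIMARY KEY, tenant_id TEXT, "
    "name TEXT, network_id TEXT)"
)
conn.execute("CREATE INDEX ix_ports_tenant_id ON ports (tenant_id)")
conn.executemany(
    "INSERT INTO ports VALUES (?, ?, ?, ?)",
    [
        ("port-%d" % i, "tenant-a" if i % 4 else "tenant-b", "", "net-%d" % (i % 10))
        for i in range(1000)
    ],
)

# Shape of the statement in the slow log: select every column of every
# matching port, then count the rows of that derived table.
SUBQUERY_COUNT = """
SELECT count(*) AS count_1 FROM (
    SELECT ports.tenant_id, ports.id, ports.name, ports.network_id
    FROM ports WHERE ports.tenant_id IN (?)
) AS anon_1
"""

# Direct form: count matching rows without building the derived table.
DIRECT_COUNT = "SELECT count(*) FROM ports WHERE ports.tenant_id IN (?)"


def count_ports(sql: str, tenant_id: str) -> int:
    """Run one of the two count statements for a single tenant."""
    return conn.execute(sql, (tenant_id,)).fetchone()[0]


print(count_ports(SUBQUERY_COUNT, "tenant-a"), count_ports(DIRECT_COUNT, "tenant-a"))
```

Both statements give the same answer; the difference is in how much work the database may do to get there, since the wrapped form can force the inner result to be built before it is counted, and that cost grows with the number of ports per tenant.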
VSD Utilization
[Chart: VSD over time, 3:46:03 to 3:56:04; y-axis 0 to 90]
VSD MySQL Utilization
[Chart: VSD mysql over time, 3:46:03 to 3:56:04; y-axis 0 to 500]
Modified Test
Create 5K networks
Activate instances with 5 vPorts per instance
Start 50 instances at a time, wait until they are done and continue
Avoid nova-scheduler bottleneck
Push neutron-server to its limits
Improvement
Activated 4K instances with 20K vports in 10 minutes
* 500 vports on every server fully configured with DHCP served
* 34 ports/second (an order of magnitude better than Canonical)
* number of instances per second limited by Nova
* Neutron was much faster than Nova in completing the required work
* Nuage VSP was by no means the bottleneck in any of the above - lots of free capacity
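The figures above can be cross-checked with a few lines of arithmetic. A small sketch using only numbers from the slides (4K instances, 5 vPorts each, 41 compute nodes, 34 ports/second); treating "10 minutes" as exactly 600 seconds is an assumption, since the slides give rounded values.

```python
INSTANCES = 4_000
VPORTS_PER_INSTANCE = 5
COMPUTE_NODES = 41
RUN_SECONDS = 10 * 60  # assumption: "10 minutes" taken literally

vports = INSTANCES * VPORTS_PER_INSTANCE        # total vports activated
ports_per_second = vports / RUN_SECONDS         # rate if the run took exactly 600 s
instances_per_second = INSTANCES / RUN_SECONDS
vports_per_node = vports / COMPUTE_NODES        # spread over the compute nodes
implied_seconds = vports / 34                   # duration implied by 34 ports/second

print(vports, round(ports_per_second, 1), round(instances_per_second, 2))
print(round(vports_per_node), round(implied_seconds))
```

With exactly 600 seconds the rate comes out slightly under 34 ports/second, so the quoted rate suggests the run finished a little short of ten minutes; the per-node figure is likewise a rounded-up version of 20K vports over 41 nodes.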
Nova Control Node utilization
[Chart: Nova Control Node over time, 12:26:20 to 12:36:02; y-axis 0 to 70]
Neutron Server Utilization
[Chart: neutron over time, 12:26:20 to 12:36:02; y-axis 0 to 60]
Nova Scheduler
[Chart: nova-scheduler over time, 12:26:20 to 12:36:02; y-axis 0 to 90]
MySQL
[Chart: nova mysql over time, 12:26:20 to 12:35:52; y-axis 0 to 400]
Increased utilization still there, although maximum numbers are 40% better
New Log Analysis
Count: 320 Time=0.14s (45s) Lock=0.00s (0s) Rows=13749.8 (4399926), root[root]@ip-10-0-1-23.us-west-2.compute.internal
SELECT ports.tenant_id AS ports_tenant_id, ports.id AS ports_id, ports.name AS ports_name,
ports.network_id AS ports_network_id, ports.mac_address AS ports_mac_address, ports.admin_state_up
AS ports_admin_state_up, ports.status AS ports_status, ports.device_id AS ports_device_id,
ports.device_owner AS ports_device_owner,
ipallocations_1.port_id AS ipallocations_1_port_id, ipallocations_1.ip_address AS
ipallocations_1_ip_address, ipallocations_1.subnet_id
AS ipallocations_1_subnet_id, ipallocations_1.network_id AS ipallocations_1_network_id,
securitygroupportbindings_1.port_id AS securitygroupportbindings_1_port_id,
securitygroupportbindings_1.security_group_id AS securitygroupportbindings_1_security_group_id
FROM ports LEFT OUTER JOIN ipallocations AS ipallocations_1 ON ports.id = ipallocations_1.port_id
LEFT OUTER JOIN securitygroupportbindings AS
securitygroupportbindings_1 ON ports.id = securitygroupportbindings_1.port_id
WHERE ports.tenant_id IN ('10bee9ce4661476993a5c75ff7fcf016')
Long query related to IP address allocation
Time for More Iterations
Well... not exactly, since we ran out of time before the Summit
Stay tuned...
Conclusions
Neutron is far from being the bottleneck in a high performance OpenStack installation, as long as the right SDN system is used
Significant effort needed to optimize OpenStack end-to-end
Art vs science
Pay attention to SQLAlchemy statements
Call to action: End-to-end profiling of the code base across all OpenStack projects
OSProfiler looks promising