/bin/tails from OpenStack Operations: Rarm Nagalingam, Red Hat


/bin/tails from OpenStack Operations

OpenStack Australia Day

Rarm Nagalingam

DevOps J.O.A.T Engineer

May 2016

OpenStack Australia Day 2016

INTRODUCTION

Rarm Nagalingam

DevOps J.O.A.T Engineer


linkedin.com/in/rarm-nagalingam-736aa54

AGENDA

● Current Architecture, Size, Workloads

● Patch Methodology

● User Issue: Is the Cloud Slow!! today?

● egrep fail -R ./ == fail

● Let's play the blame game

● Fool me once, shame on you, fool me twice, monitor it!

● Role Play

● Questions & possibly Answers

Architecture


Current - RHELOSP 5.0 (ICEHOUSE)

• 3 x Physical Controllers

• 3 x Physical DB Nodes

• 2 x Virtual Load Balancers

• 26 x Compute Nodes (56 vCPUs and 256 GB RAM each)

• 1456 vCPUs / 6.6TB of RAM – 90% allocated

• Storage NFS via Filer
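The totals on this slide follow from the per-node figures. A minimal sketch of that arithmetic, assuming the nodes are identical and that the 90% allocation figure applies to vCPUs (the slide does not say whether it means vCPUs, RAM, or both); the names are illustrative only.

```python
# Per-node figures taken from the slide.
COMPUTE_NODES = 26
VCPUS_PER_NODE = 56
RAM_GB_PER_NODE = 256
ALLOCATED_FRACTION = 0.90  # assumption: applied to vCPUs


def cluster_capacity(nodes, vcpus_per_node, ram_gb_per_node):
    """Return (total vCPUs, total RAM in GB) for a homogeneous cluster."""
    return nodes * vcpus_per_node, nodes * ram_gb_per_node


total_vcpus, total_ram_gb = cluster_capacity(
    COMPUTE_NODES, VCPUS_PER_NODE, RAM_GB_PER_NODE
)
allocated_vcpus = round(total_vcpus * ALLOCATED_FRACTION)
free_vcpus = total_vcpus - allocated_vcpus

print(total_vcpus)   # 1456, matching the slide
print(total_ram_gb)  # 6656 GB, roughly the 6.6TB on the slide
print(free_vcpus)    # vCPUs left before the cluster is full
```

At 90% allocation the remaining headroom is less than three compute nodes' worth of vCPUs, which is worth keeping in mind when a node fails.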


Future - RHELOSPd 8.0 (LIBERTY)

● 3 x Physical Controllers

● 3 x Physical DB Nodes

● 3 x Physical CEPH Monitor Nodes

● 9 x Physical CEPH Storage Nodes (~ 36TB per node with NVMe Journals)

● 2 x Virtual Load Balancers

● (xxx) x Compute Nodes (56 vCPUs and 512 GB RAM each)


Current Workloads

● Cloud Based

● Web Apps

● Cloudy-VMs ++

https://www.flickr.com/photos/truedimensions/

Patch Methodology


Patch Methodology

https://www.flickr.com/photos/emma-lego/

Is the Cloud Slow!! today?


● Option 1: Scatter Gun

Take Aim Fire Ah...

www.safaribooksonline.com


Option 2: Become an Elite Cloud Admin

(cc) https://www.flickr.com/photos/-chuckc-/

egrep fail -R ./ == FAIL


ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data\n **args)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 439, in inner\n return catch_client_exception(exceptions, func, *args, **kwargs)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 420, in catch_client_exception\n return func(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/network/manager.py", line 573, in get_instance_nw_info\n instance_uuid)\n', ' File "/opt/stack/nova/nova/db/api.py", line 561, in virtual_interface_get_by_instance\n return IMPL.virtual_interface_get_by_instance(context, instance_id)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 138, in wrapper\n instance_get_by_uuid(context, instance_uuid)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1678, in instance_get_by_uuid\n columns_to_join=columns_to_join)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\\\n', ' Filepython2.7/dist-packages/sqlalchemy/engine/base.py", line 1449, in execute\n params)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement\n compiled_sql, distilled_params\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context\n context)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context\n context)\n', ' File 
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 331, in do_execute\n cursor.execute(statement, parameters)\n', ' File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', ' File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', 'OperationalError: (OperationalError) (1054, "Unknown column \'instances.locked_by\' in \'field list\'") \'SELECT anon_1.instances_created_at AS anon_1_instances_created_at, anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at, anon_1.instances_deleted AS anon_1_i instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS instances_vm_state, instances.task_state AS instances_task_state…...ces_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config, instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS instances_cleaned \\nFROM instances \\nWHERE instances.deleted = %s AND instances.uuid = %s \\n LIMIT %s) AS anon_1 LEFT OUTER JOIN instance_info_caches AS instance_info_caches_1
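The entry above shows why grepping for "fail" is a blunt tool: the useful parts are the level, the logger, the request ID, and the exception that ends the traceback. A minimal sketch of pulling those fields out of an entry laid out like this one. The regular expressions, the function name, and the shortened sample line are illustrative assumptions, not part of any OpenStack tooling.

```python
import re

# "LEVEL logger.name [request-id ...] message", the layout of the entry above.
ENTRY_RE = re.compile(
    r"(?P<level>[A-Z]+) (?P<logger>[\w.]+) \[(?P<request>[^\]]*)\] (?P<message>.*)",
    re.DOTALL,
)
# First "SomethingError: " or "SomethingException: " inside the message.
EXCEPTION_RE = re.compile(r"(\w+(?:Error|Exception)): ")

# Shortened stand-in for the entry on the slide (traceback frames omitted).
SAMPLE = (
    "ERROR nova.openstack.common.rpc.common "
    "[req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] "
    "['Traceback (most recent call last):', "
    "'OperationalError: (OperationalError) (1054, "
    '"Unknown column instances.locked_by in field list")\']'
)


def summarize(entry):
    """Return level, logger, request ID and exception type, or None if no match."""
    match = ENTRY_RE.match(entry)
    if match is None:
        return None
    request_parts = match.group("request").split()
    exception = EXCEPTION_RE.search(match.group("message"))
    return {
        "level": match.group("level"),
        "logger": match.group("logger"),
        "request_id": request_parts[0] if request_parts else None,
        "exception": exception.group(1) if exception else None,
    }


print(summarize(SAMPLE))
```

With the request ID extracted, the same request can be followed across services once the logs are aggregated in one place.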


http://logstash.openstack.org/#/dashboard/file/logstash.json


Got Logs

● Troubleshooting from the 90’s

● Log Aggregation FTW

● Support infrastructure just as important as the Cloud

● Testing in Prod == a resume-generating event

Difference between Metrics and Monitoring


Use metrics to prove your theories

https://www.elastic.co/blog/kibana-4-5-0-released
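One way to turn "the cloud is slow today" into a testable theory is to compare a tail percentile of today's response times against a baseline. A minimal sketch, assuming latency samples in milliseconds have already been collected; the sample values are made up for illustration, and the 95th percentile and the 1.5x tolerance are arbitrary choices, not recommendations from the talk.

```python
import math
from statistics import median


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_slower(baseline, today, pct=95, tolerance=1.5):
    """True if today's percentile exceeds the baseline's by the tolerance factor."""
    return percentile(today, pct) > tolerance * percentile(baseline, pct)


# Made-up API response times in milliseconds.
baseline_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
today_ms = [105, 115, 125, 135, 145, 155, 165, 175, 400, 900]

print(median(baseline_ms), median(today_ms))  # medians barely differ
print(is_slower(baseline_ms, today_ms))       # the tail tells another story
```

The medians of the two samples are almost identical, which is why a single average on a dashboard can hide exactly the slowness a user is complaining about.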

Let's Play the Blame Game


Let's Play the Blame Game

∙ Enforce OLAs

∙ Influence and support purchasing

Fool me once, shame on you.

Fool me twice, monitor it!


Fool me twice, monitor it!

(cc) rarm

Role Play


(cc) https://www.flickr.com/photos/d0ppler/

Role Play


Exercise 1:

You arrive at work and discover that one of your compute nodes has been hard powered off. The node was running three high-priority instances: a small 60GB Windows instance and two medium RHEL instances.

Goal:

Without rebuilding the compute node, restart the instances on another node.
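One possible approach to this exercise, sketched under stated assumptions: instance disks live on shared storage (such as the NFS filer in the current architecture), the Icehouse-era `nova` command-line client is available, and nova already reports the failed node's compute service as down. The host names and instance names are hypothetical. The sketch only builds the command lines so they can be reviewed before anyone runs them.

```python
def evacuation_plan(failed_host, target_host, instance_ids):
    """Build nova CLI calls that move instances off a dead compute node."""
    # Stop the scheduler from placing new instances on the failed node.
    plan = [["nova", "service-disable", failed_host, "nova-compute"]]
    for instance_id in instance_ids:
        # --on-shared-storage reuses the existing disk instead of rebuilding it.
        plan.append(
            ["nova", "evacuate", "--on-shared-storage", instance_id, target_host]
        )
    return plan


# Hypothetical host and instance names for the three instances in the exercise.
commands = evacuation_plan(
    "compute-07", "compute-12", ["win-small", "rhel-medium-1", "rhel-medium-2"]
)
for command in commands:
    print(" ".join(command))
```

Printing the plan first fits the spirit of the exercise: rehearse the recovery before the outage, not during it.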

Example Scenario

BackUps!


Exercise 3:

One of the admins accidentally dropped database tables. Rather than just clearing out the redundant data, they dropped all the tables from the OpenStack nova database. Thankfully, you saw the user do this and can respond quickly.

Goal:

Redirect users to a temporary site stating that an outage has occurred. Restore the database and ensure that all services can successfully interact with it before removing the redirect.
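The last step of the goal, deciding when the redirect can come down, can be made explicit. A minimal sketch, assuming you can list the tables in the backup and in the restored database and can write one health check per service; the function names and the stub checks are hypothetical, and a real check would issue a query through each service.

```python
def missing_tables(expected, restored):
    """Tables present in the backup but absent after the restore."""
    return sorted(set(expected) - set(restored))


def safe_to_remove_redirect(expected, restored, service_checks):
    """True only if every table is back and every service check passes.

    service_checks maps a service name to a zero-argument callable that
    returns True when that service can query the database.
    """
    if missing_tables(expected, restored):
        return False
    return all(check() for check in service_checks.values())


# Two nova tables stand in for the full schema; the checks are stubs.
backup_tables = ["instances", "instance_info_caches"]
restored_tables = ["instances"]
checks = {"nova-api": lambda: True, "nova-scheduler": lambda: True}

print(missing_tables(backup_tables, restored_tables))
print(safe_to_remove_redirect(backup_tables, restored_tables, checks))
```

Keeping the table comparison separate from the service checks makes it clear which half of the restore failed when the answer is still "not yet".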

BackUp Scenario


Now you are an Elite Cloud Admin

(cc) https://www.flickr.com/photos/-chuckc-/

Questions & Possibly Answers

THANK YOU

plus.google.com/+RedHat

linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews