Jeff Sly Principal IT Architect jsly@nuskin.com Case Study Nagios @ Nu Skin

Upload: dane-harmon

Post on 30-Mar-2015


Page 1: Jeff Sly Principal IT Architect jsly@nuskin.com Case Study Nagios @ Nu Skin

Jeff Sly

Principal IT Architect

jsly@nuskin.com

Case Study Nagios @ Nu Skin

Page 2:

Who is in the Audience?

How many of you are:
• Suppliers of Nagios or some value add-on for Nagios?
• Customers using Nagios?
• Just implementing Nagios or expanding an implementation?
• Using Nagios XI?

Page 3:

Who is Nu Skin?

Page 4:

Our Technology Footprint

• Ecommerce – Home grown Applications – Java, EJB, ABAP, .Net
• Databases – Oracle, MySQL, MSSQL
• OS – HPUX, Redhat, Windows, VMWare
• ERP – SAP Supply Chain, CRM, FI
• Datacenters – 6 locations in 6 countries
• Offices – 50 Countries

Page 5:

Monitoring Goals

Monitoring presents operations with a completely integrated global view.

Good monitoring is proactive; it helps teams prevent problems from becoming outages.

Good monitoring helps minimize outage downtime, quickly identify the root cause, and contact the correct people.

Page 6:

Centralized Monitoring System

Page 7:

Our Monitoring History

We tried for 10 years…

Page 8:

Do it all in ‘One Tool Projects’

One Monitoring Tool to rule them all:
• Mercury SiteScope
• Remedy Help Desk
• HP OpenView
• Quest Foglight
• Home grown (several)

One monitoring person
• He decided to quit!

Page 9:

Could never get everything

All Failed – We always gave up! Why?
• Servers and agents that were proprietary
• Huge footprint, inefficient performance
• Steep learning curve
• Very expensive
• Updates costly and very time consuming
• System Administrators like their own scripts, can see what they are doing

Page 10:

Resulting Monitoring Issues

• Tried to make Operations the clearing house for all warnings and alerts from 10+ tools
• Operations was overwhelmed
• Took 4 process steps and lots of software to notify of critical failures
• Most Administrators set up their own private monitoring to receive warnings
• Many false notifications
• Late notifications

Page 11:

As Is (start of project)

Our Business Customers were Unhappy

Page 12:

Old Monitoring Work Flow

Four steps to notify system administrator

Page 13:

Step 1: Everything Emails Operations

[Diagram. Labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk, Error]

Page 14:

Step 2: Operations Opens Email

[Diagram. Labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk, Error]

Page 15:

Step 3: Operations Checks Source

[Diagram. Labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk, Error]

Page 16:

Step 4: Operations Calls Admin

[Diagram. Labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk, Error]

Page 17:

Inventory of Existing Checks

• Regular Expression found on Web Page Monitoring
• HTTP Check - Up or Down
• Ping Host Up or Down
• PORT monitoring
• FTP checking
• SMTP checking
• SNMP monitoring - no trap catching yet
• Radius
• DNS monitoring
• Disk Space monitoring
• CPU and Load Average monitoring
• Memory Monitoring

Page 18:

Inventory of Existing Checks

• Service monitoring
• Transaction monitoring - page load times – performance graph
• Website click through (Webinject not working)
• Log File monitor – parse for Errors
• Java HEAP, Thread, Threadlock monitoring
• Apache thread and worker count monitors
• Ecommerce shop monitors
• Email can send and receive
• SQL query ODBC (catalog ODBC had bugs)

Page 19:

To Be

Happy Customers

Page 20:

Key Ideas

1. MoM

2. Tool Requirements

3. Shared Ownership

4. Lowest Level

5. Nagios Monitor Method

Page 21:

Idea 1: MoM

Our first “breakthrough” was the idea that even though we needed a centralized view for all monitoring, that did not mean all monitoring had to be done by one monitoring tool.

We had to pick a “Manager of the Monitors” (MoM) to bring together the best-of-breed monitoring.

Page 22:

MoM - according to Gartner

Page 23:

Idea 2: Tool Requirements

• Open – not proprietary and closed
• Mainstream – wanted good native support and strong community
• Interface – to 3rd Party Monitoring
• Flexible – adapt to many types of monitoring
• Efficient – minimal footprint on production servers, not chatty on network
• Notification – granular control
• Reliable – good clean architecture
• Usability – GUI interface, reporting

Page 24:

Idea 3: Shared Ownership

Core team
• Operation of Monitoring Environment: backups, upgrades, & custom plug-ins
• Monitoring Experts
• Training

Monitoring leads in Development & Admin teams:
• Set up own monitors
• Keep own monitors current
• Adjust monitors
• If something is not monitored, it is not the core team's fault

Page 25:

Operations Owned Monitoring

[Diagram. Labels: Network, Foglight, System Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database, Email HelpDesk, Error]

Page 26:

Team Leads Own Monitoring

[Diagram. Labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]

Page 27:

How to Guides

Page 28:
Page 29:

How to Setup NRPE - HPUX

Page 30:

Idea 4: Lowest Level

Handle alerts at the lowest possible level in the organization

Only forward alerts if not handled at lower levels before they become critical
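
The two rules above can be read as a small routing decision. The following is a minimal Python sketch of that idea, not the configuration used at Nu Skin: the `Alert` fields, the threshold of 3 notifications, and the `operations` contact name are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    owner_team: str              # team that owns the monitor, e.g. "database"
    notifications_sent: int = 0  # notifications already sent for this problem
    acknowledged: bool = False   # True once the owning team has picked it up


def notify_targets(alert: Alert, escalate_after: int = 3,
                   escalation_contact: str = "operations") -> list[str]:
    """Return who should receive the next notification for this alert."""
    if alert.acknowledged:
        # Handled at the lowest level: nothing is forwarded.
        return []
    targets = [alert.owner_team]
    if alert.notifications_sent >= escalate_after:
        # Still unhandled after several notifications: also forward it upward.
        targets.append(escalation_contact)
    return targets
```

In Nagios itself this kind of rule is normally expressed with escalation object definitions (for example the `first_notification` directive of a `serviceescalation`), so the function above only illustrates the logic.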

Page 31:

Handle events at lowest level

[Diagram. Labels: Network, System Scripts, SAP, Asia, Europe, Web, Database, Operations]

Page 32:

Only forward unhandled alerts

[Diagram. Labels: Network, System Scripts, SAP, Asia, Europe, Web, Database]

Page 33:

Idea 5: Nagios Monitor Method

Choose the Nagios Monitoring Method
• Active Check from Nagios Server (normal)
• Active Check performed by remote client: NRPE, NSClient
• Passive Check – Listen to 3rd party monitors: NSCA
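
For the passive method, a third-party monitor hands its result to Nagios instead of being polled. As a hedged illustration: the `send_nsca` client reads service check results from standard input as tab-separated host name, service description, return code, and plugin output. The helper below only builds that line; the function name is hypothetical and is not taken from the slides.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host: str, service: str, code: int, output: str) -> str:
    """Build one tab-separated service check result line for send_nsca."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code}")
    # Tabs and newlines would break the field layout, so collapse them to spaces.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{code}\t{clean}\n"
```

The resulting line would typically be piped to `send_nsca -H <nagios-server> -c <config-file>`; the host and service names must match objects already defined on the Nagios server.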

Page 34:

Active Local Check

[Diagram. Labels: Nagios, HTTP or Ping, DB, DB Monitor, Web, Unix, Win]

Page 35:

Active Remote Check - UX

[Diagram. Labels: Nagios, CPU, RAM (NRPE), DB, DB Monitor, Web, Unix, Win]

Page 36:

Active Remote Check - Win

[Diagram. Labels: Nagios, CPU, RAM (NSClient), DB, DB Monitor, Web, Unix, Win]

Page 37:

Passive 3rd Party Alert

[Diagram. Labels: Nagios, 3rd Party Alert NSCA, 3rd Party Check DB, DB, DB Monitor, Web, Unix, Win]

Page 38:

Bonus Idea - Tune

• Tune the database
• Add RAM Drive

Page 39:

Tune the Database

Modify contents of the /etc/my.cnf [mysqld] section.

tmp_table_size=524288000
max_heap_table_size=524288000
table_cache=768
set-variable=max_connections=100
wait_timeout=7800
query_cache_size = 12582912
query_cache_limit=80000
thread_cache_size = 4
join_buffer_size = 128K

http://web3us.com Info on: MySQL Tuning, Nagios Tuning

Page 40:

RAM Drive

Create a RAM disk for Nagios temporary files

I created a ramdisk by adding the following entry to the /etc/fstab file:

none                  /mnt/ram               tmpfs   size=500M           0 0

Mount the disk using the following commands

# mkdir -p /mnt/ram; mount /mnt/ram

Verify the disk was mounted and created

# df -k

Modify the /usr/local/nagios/etc/nagios.cfg file with the following tuned parameters

temp_file=/mnt/ram/nagios.tmp
temp_path=/mnt/ram
status_file=/mnt/ram/status.dat
precached_object_file=/mnt/ram/objects.precache
object_cache_file=/mnt/ram/objects.cache
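
After editing nagios.cfg it can be useful to confirm that every tuned parameter really points at the RAM disk. The sketch below is one possible check, assuming the five parameter names and the /mnt/ram mount point listed above; the function name is hypothetical and the parsing only handles simple key=value lines.

```python
from __future__ import annotations

# The nagios.cfg parameters that the slide moves onto the RAM disk.
RAMDISK_KEYS = ("temp_file", "temp_path", "status_file",
                "precached_object_file", "object_cache_file")


def keys_not_on_ramdisk(cfg_text: str, mount: str = "/mnt/ram") -> list[str]:
    """Return the tuned keys that are missing or point outside the mount."""
    values = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    prefix = mount.rstrip("/") + "/"
    return [key for key in RAMDISK_KEYS
            if not (values.get(key, "") == mount
                    or values.get(key, "").startswith(prefix))]
```

An empty result means all five parameters resolve under the mount point; any names returned still need to be moved.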

Page 41:

Implementation Methodology

• Site Survey
• Inventory existing monitors
• Proof of concept
• Build new environment
• Migrate monitors from each platform to Nagios, one at a time
• Integrate OEM, and to send monitors to Nagios

Page 42:

Three Project Phases

• Deliver something useful in each phase
• Build a level at a time

Page 43:

Phase I

1. Set up a pilot of Nagios XI using Trial License.

2. Set up Foglight monitoring of JVM (Java Virtual Machine).

3. Purchase Nagios XI and Consulting Support

4. Bring in a consultant for two weeks to help set up the architecture and help us work with the system.

5. Documentation Web Site for Nagios learnings and “How to guides”

6. Define a set of standards and guidelines to follow to help aid an effective monitoring process.

7. Backups running on Production Nagios Server

8. Set up services which aren't being caught right now and move a few of the important services over to the new Nagios XI monitoring system.

9. Test Nagios plugins and server performance

Page 44:

Phase II

1. Migrate off of SiteScope 6 and shut it down

2. Migrate off of SiteScope 8 and shut it down

3. Decommission Foglight

4. Clean up the old monitoring server

5. Migrate the network team from old Nagios to core NagiosXI system

6. Set up standby NagiosXI system, cron to replicate weekly

7. Research missing alerts and add them to the new NagiosXI system

Page 45:

Phase III

1. Implement Global Monitoring
• Add monitors for existing international systems
• Add monitors using JMX to monitor Java servers
• Nagios Remote Plugin Executor (NRPE) to monitor remotely
• Remote Monitoring for Windows Servers (NSClient++)
• Implement notification and escalation of alerts
• Add monitors for critical business functions

Page 46:

Phase III continued…

2. Corporate Enhancements
• Request recurring down time enhancement from Ethan Galstad
• Automate refresh of Nagios XI standby system
• Build Network Map
• Retire Windows SiteScope
• Add monitors for phone systems
• Add monitors to data center (UPS, Temperature, Humidity)
• Integrate to SAP Tidal monitoring tool

Page 47:

Phase III continued…

3. Business
• Business review and approve SLA (using business terms)
• Monitor both the Business Functions and the individual point devices that provide the Business Function
• Follow the Sun with Eyes on Glass.
• Training:
  • How to set up alerts
  • How to receive alerts
  • How to report on performance graphs
• Create a new Dashboard for HelpDesk and International IT Staff

Page 48:

Inventory of Monitor Checks

Qty | Things we figured out how to do from Nagios | Solution
50 | Regular Expression found on Web Page Monitoring | HTTP Check
170 | HTTP Check - Up or down | HTTP Check
600 | Ping Host Up or down | Nagios Check alive
100 | PORT monitoring | Check TCP port #
10 | FTP checking | Nagios FTP plugin
8 | SMTP checking | Nagios SMTP plugin
5 | SNMP monitoring - no trap catching yet | Not Using
4 | Radius | Nagios plugin, difficult
16 | DNS monitoring | Nagios Check DNS
250 | Disk Space monitoring | NSClient, NRPE - Nagios Disk plugin
170 | CPU and Load Average monitoring | NSClient, NRPE - Custom Linux plugin
170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
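
The "regular expression found on web page" check in the inventory above is listed as solved with the stock HTTP check. To show only the logic behind such a check, here is a minimal Python sketch following the Nagios plugin convention of a status code plus a one-line message. The function names, the 10-second timeout, and the message wording are assumptions, not the plugin used at Nu Skin.

```python
from __future__ import annotations

import re
from urllib.request import urlopen

# Nagios plugin return codes used by this sketch.
OK, CRITICAL = 0, 2


def check_page_content(body: str, pattern: str) -> tuple[int, str]:
    """Return (status code, plugin output) for a regex content check."""
    if re.search(pattern, body):
        return OK, f"HTTP OK - pattern {pattern!r} found"
    return CRITICAL, f"HTTP CRITICAL - pattern {pattern!r} not found"


def check_url(url: str, pattern: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch the page and apply the content check; fetch errors are CRITICAL."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        return CRITICAL, f"HTTP CRITICAL - {exc}"
    return check_page_content(body, pattern)
```

A wrapper script would print the message and exit with the status code, which is how Nagios reads a plugin result.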

Page 49:

Inventory continued…

Qty | Things we figured out how to do from Nagios | Solution
170 | Memory Monitoring | NSClient, NRPE - Custom Linux plugin
80 | Service monitoring | NSClient, NRPE with bash shell script
30 | Transaction monitoring - page load times - performance data graphs | Custom using Selenium Scripts
30 | Website click through (webinject not working) | Custom using mechanize
10 | Log File monitor - parse for Errors | NRPE - script parse log files
6 | Java HEAP, Thread, Threadlock monitoring | Java Management Extensions (JMX)
8 | Apache thread and worker count monitors | Custom plugin Apache statistics
18 | ShopApp and SignupApp monitors | HTTP Check Custom app status page
5 | Email can send and receive | Custom Nagios plugin
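
The log file monitor in the inventory above is described only as an NRPE script that parses log files. A minimal sketch of what such a script might compute is shown below; the `ERROR` pattern, the warning and critical thresholds, and the output wording are assumptions for illustration.

```python
import re

# Nagios plugin return codes used by this sketch.
OK, WARNING, CRITICAL = 0, 1, 2


def check_log_errors(lines, pattern=r"\bERROR\b", warn=1, crit=5):
    """Count matching lines and map the count to a Nagios status."""
    regex = re.compile(pattern)
    count = sum(1 for line in lines if regex.search(line))
    if count >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # Text after "|" is performance data in the form label=value;warn;crit.
    message = f"LOG {label} - {count} matching lines | errors={count};{warn};{crit}"
    return status, message
```

This version rescans whatever lines it is given; a production script would usually remember the file offset between runs so that old errors are not reported again.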

Page 50:

Nagios XI Interface

Page 51:
Page 52:
Page 53:

Data Centers in 7 Countries

Page 54:
Page 55:

IT Operations

Goal: Quick Notification & Recovery from Outage

Type of Monitor: Notification of outages with details on which system is down, so we know who to contact

Solution: Migrate from SiteScope, OpenView to Nagios XI

Page 56:

IT Team Managers

Goal: Prevention of outage

Type of Monitor: Warnings about conditions before outages occur, allowing for corrective actions that will prevent likely outages

Solution: Migrate from SiteScope, OpenView to Nagios XI; integrate OEM, SAP and Scripts with Nagios

Page 57:

Summary

1. MoM ~ Manager of Managers: allow specialized tools

2. Tool Requirements, enough but not all

3. Ownership for implementation, shared

4. Handle alerts, lowest level in organization

5. Choose Nagios monitoring method

Page 58:

Tips, Tricks & Demos

Nagios XI Large Implementation

Day 3, 2:00 Track 3 (Nate Broderick)
• 3 Demos
• Performance challenges and solutions
• Integrating monitoring solutions: Oracle
• Migrating from BAC & Foglight
• Customization
• Graphing, and more.