Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Nagios Implementation Case: Eastman Kodak Company Eric Loyd Founder & CEO Bitnetix Incorporated [email protected] www.bitnetix.com 877.BITNETIX

Post on 26-Jun-2015

DESCRIPTION

Eric Loyd's case-study presentation, "Nagios Implementation Case: Eastman Kodak Company." The presentation was given during the Nagios World Conference North America, held September 25-28, 2012, in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

TRANSCRIPT

Page 1: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Nagios Implementation Case: Eastman Kodak Company

Eric Loyd

Founder & CEO

Bitnetix Incorporated

[email protected]

877.BITNETIX

Page 2: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

© 2012 Bitnetix Incorporated

About Eric Loyd and Bitnetix

Founder and CEO of Bitnetix Incorporated

VOIP services and IT/network consulting

25 Years in IT at places like

Eastman Kodak

Frontier Communications

Global Crossing

Bitnetix started its seventh year in July, 2012

2012 Digital Rochester GREAT Award Finalist in Communications Technology

Using Nagios to monitor our client equipment and VOIP platform; still in use at Kodak, where it has run since 2004

Page 3: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)

Page 4: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

History of kodak.com

Pre-2004

Machines located in Rochester, NY

Public Apache servers

Reverse proxy Apache servers

Application servers (ATG/Dynamo, Tomcat, etc)

Database boxes, Production Support, etc.

2004 – Moved ~80 machines from ROC -> ???

ROC <-> ??? Firewalls

Bandwidth requirements

Minimal user impact

Flipped the switch, went live

Page 5: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

History of kodak.com

Some of the things kodak.com did at the time

Consumer store and product information

B2B portal and wholesaler purchasing

“Picture Of The Day” (www.kodak.com/go/potd)

Warranty registration

Photo lab calibration strips

“Phone home” reports for printers, docks, cameras, etc

Software/firmware updates

Corporate press releases, bios, and regulatory information

Reverse proxy for internal information through secure channels

Dozens of sitelets for products and campaigns

Page 6: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Why Kodak Chose Nagios to Monitor kodak.com

Page 7: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Why Nagios?

No centralized corporate monitoring software

Nothing to compete with internally

Nothing to build on, either

Cost

No additional cost beyond existing human resources

Framework

Nagios worked with firewalls without needing agents

Leverage SSH, HTTP and other remote protocols

Custom checks and notifications (very important)

Page 8: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Initial Hurdles in the New Complex Server Environment

Page 9: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

kodak.com Network

Page 10: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Initial hurdles

Firewalls

Public load balancers on external Internet IPs

Public Apaches in Zone 1, Kodak network

Reverse proxy, app servers in Zone 2, semi-secure

Nagios machine in internal Zone 3, most secure

Complex “top” and “bottom” checks for web site

Is the site working from the user’s perspective (top)?

From the application side (bottom)?

How to separate apparent from actual failure
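One way to read the "top" and "bottom" idea is as a small decision table over two check results. The sketch below is only an illustration: the function name `classify_failure` and the four labels are hypothetical, and it assumes the standard Nagios plugin exit codes (0 = OK, anything else = a problem).

```sh
#!/bin/sh
# classify_failure TOP_STATUS BOTTOM_STATUS
#   TOP_STATUS:    exit code of a check made the way a user reaches the site
#   BOTTOM_STATUS: exit code of a check made directly against the application
# The labels printed here are hypothetical names, not from the original slides.
classify_failure() {
    top="$1"
    bottom="$2"
    if [ "$top" -eq 0 ] && [ "$bottom" -eq 0 ]; then
        echo "healthy"
    elif [ "$top" -ne 0 ] && [ "$bottom" -eq 0 ]; then
        # The application answers, but users cannot reach it:
        # suspect the load balancer, proxy, or firewall in between.
        echo "path-failure"
    elif [ "$top" -eq 0 ] && [ "$bottom" -ne 0 ]; then
        # Users are still being served, but this backend is failing.
        echo "hidden-backend-failure"
    else
        echo "application-failure"
    fi
}

classify_failure 2 0   # prints: path-failure
```

Comparing the two results is what lets an apparent failure (the path is broken) be told apart from an actual one (the application is down).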

Page 11: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Initial hurdles

No Internal Nagios Knowledge

It was a contractor who set up Nagios (me)

Contractors typically have a finite lifespan at Kodak

Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh…

Escalation and Paging

Screw it – let’s email everyone, every time and let Thunderbird sort it all out

Paging done via texting gateway email address

Which means email gateway failure = notification failure

Twitter API as backup / current primary notification
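The dependency on a single email gateway can be reduced by trying notification channels in order. The sketch below is a minimal illustration, not the original implementation: `notify_with_fallback`, `send_sms_gateway`, and `send_tweet` are hypothetical names, and the two senders are stubs standing in for the real gateway and Twitter calls.

```sh
#!/bin/sh
# notify_with_fallback MESSAGE CHANNEL...
# Tries each channel in order and stops at the first one that succeeds.
# Each CHANNEL is a command or shell function that takes the message as
# its only argument and returns 0 on success.
notify_with_fallback() {
    message="$1"
    shift
    for channel in "$@"; do
        if "$channel" "$message"; then
            echo "sent via $channel"
            return 0
        fi
    done
    echo "all channels failed" >&2
    return 1
}

# Hypothetical stand-ins for the real senders.
send_sms_gateway() { return 1; }          # simulates an email gateway outage
send_tweet()       { echo "tweet: $1"; }  # simulates a successful backup send

notify_with_fallback "CRITICAL: Dynamo on host-01" send_sms_gateway send_tweet
```

With the gateway stub failing, the message goes out through the second channel instead of being lost.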

Page 12: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

SSH to Remote Servers

Page 13: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

SSH to the rescue

One user, one key, infinite access

Software apps run as second user, with SSH auth

Additional robot accounts can be added at any time

Wrap existing checks in an SSH shell

Provides additional control, error handling, reporting

Allows all checks to submit results to SQL database

SQL Database Side Note – All custom scripts executed CLI Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed the contents to Oracle, unlocked it, and deleted the log file. A second cron job pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.
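The wrap-and-log pattern can be sketched in shell. The original used Perl and Oracle; this version only shows the shape of the idea. The names `wrap_check`, `acquire_lock`, and `release_lock`, the spool path, and the `|`-separated line format are all assumptions made for illustration.

```sh
#!/bin/sh
# Assumed locations; the real paths are not given in the slides.
SPOOL="${SPOOL:-/tmp/check_results.log}"
LOCK="${LOCK:-$SPOOL.lock}"

# mkdir is atomic, so a directory serves as a portable lock.
acquire_lock() {
    tries=0
    until mkdir "$LOCK" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && return 1   # give up rather than block forever
        sleep 1
    done
}
release_lock() { rmdir "$LOCK"; }

# wrap_check NAME COMMAND [ARGS...]
# Runs COMMAND (in production this might be: ssh robot@host /path/to/check),
# maps unexpected exit codes to UNKNOWN (3), appends one line to the spool
# file under the lock, and returns the normalized status.
wrap_check() {
    name="$1"
    shift
    output=$("$@" 2>&1)
    status=$?
    case "$status" in
        0|1|2|3) ;;
        *) status=3 ;;   # e.g., ssh exits 255 when the connection fails
    esac
    first_line=$(printf '%s\n' "$output" | head -n 1)
    if acquire_lock; then
        printf '%s|%s|%s|%s\n' "$(date +%s)" "$name" "$status" "$first_line" >> "$SPOOL"
        release_lock
    fi
    echo "$output"
    return "$status"
}

# Usage (not run here): wrap_check Dynamo ssh robot@host /path/to/check_dynamo
```

A separate periodic job can then take the same lock, load the spool file into the database, and truncate it, which matches the cron-based flow described in the side note.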

Page 14: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Managing Nagios Configuration Files

Page 15: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Configuration Management

SCCS

Solaris’s “poor man’s CVS”

Pre-installed, no additional cost, existing expertise

Current configuration is managed through SVN

Rsync – the workhorse to move config files

Configuration Repository and Push (CRaP) directory

Cfengine

Local versus remote execution

Post-install, ignore pid files, deploy/restart, etc.

Makefile – the “CLI” to the entire process
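A push step of this kind usually validates the configuration before copying it. The sketch below is a guess at the general shape, not the actual Kodak tooling: the directory, the rsync target, and the `push_config` and `run` helpers are hypothetical. It relies only on `nagios -v` (verify a configuration file) and standard rsync options, and it defaults to a dry run that prints the commands instead of executing them.

```sh
#!/bin/sh
# All paths and host names below are hypothetical placeholders.
REPO_DIR="${REPO_DIR:-/opt/crap/nagios}"       # staged copy of the configuration
TARGET="${TARGET:-nagios-host:/etc/nagios/}"   # rsync destination
DRY_RUN="${DRY_RUN:-1}"                        # 1 = only print the commands

# run COMMAND [ARGS...]: execute the command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

push_config() {
    # Refuse to push a configuration that Nagios itself rejects.
    run nagios -v "$REPO_DIR/nagios.cfg" || return 1
    # -a preserves permissions and timestamps; --delete removes files
    # that no longer exist in the repository copy.
    run rsync -a --delete "$REPO_DIR/" "$TARGET" || return 1
}

push_config
```

Wrapping a function like this in a Makefile target would give the single command-line entry point the slide describes.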

Page 16: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Common Event Handler

Page 17: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Common Event Handler

EKrestart – That Which Does

Setup

• Arguments
• Conversions
• do_soft/hard?
• do_something?
• do_restart

do_restart

• Lock, logs, SQL
• send_nagios
• SSH to remote
• Remote EKrestart
• Process args
• do_<service>
• send_nagios
• Unlock, log, SQL
• Terminate

do_<service>

• Locks (level 2)
• Instance mapping
• Port mapping
• App restart
• Email & log
• Exit

Page 18: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

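A Closer Look at EKrestart

    #!/bin/sh
    PATH=...

    # Invoked with -r on the remote machine: run the client-side half.
    # (In the full script, client_code must be defined above this line.)
    [ "$1" = "-r" ] && client_code

    # Positional arguments supplied when Nagios runs the event handler
    host="$1"
    service="$2"
    baseService=`echo "$service" | awk -F: '{print $1}'`   # text before the first ':'
    state="$3"
    type="$4"
    tries="$5"
    perfdata="$6"
    class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
    number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"

    case "$state" in
        OK)       do_fixit;;
        WARNING)  do_nothing;;
        UNKNOWN)  do_nothing;;
        CRITICAL) do_something;;
        *)        do_nothing;;
    esac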
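The slide leaves the derivation of `class` and `number` as placeholders. One possible way to extract them, assuming every machine name really follows the `x-y-CLASS-NNN.kodak.com` shape, is plain POSIX parameter expansion. The function names `parse_machine` and `base_service`, the sample host name, and the `Dynamo:8080` service name are illustrative assumptions.

```sh
#!/bin/sh
# parse_machine NAME -> prints "CLASS NUMBER"
# Assumes names shaped like x-y-CLASS-NNN.kodak.com, as in the slide's example.
parse_machine() {
    short=${1%%.*}        # strip the domain:               x-y-CLASS-NNN
    number=${short##*-}   # text after the last dash:       NNN
    rest=${short%-*}      # drop the trailing number:       x-y-CLASS
    class=${rest##*-}     # text after the new last dash:   CLASS
    echo "$class $number"
}

# base_service SERVICE -> text before the first colon, as the slide does with awk
base_service() {
    echo "$1" | awk -F: '{print $1}'
}

parse_machine "a-b-web-012.kodak.com"   # prints: web 012
base_service "Dynamo:8080"              # prints: Dynamo
```

Parameter expansion avoids spawning extra processes, which matters little here but keeps the handler fast when many events fire at once.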

Page 19: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

A Closer Look at EKrestart

    do_fixit() {
        case "$baseService" in
            Workers) do_restart;;
            *)       do_nothing;;
        esac
    }

    # $debug is expected to hold "true" or "false"
    do_nothing() {
        $debug && echo "$service is in $state state ($type) for $tries tries."
    }

    do_something() {
        case "$type" in
            SOFT) do_soft;;      # Take action before it's too late?
            HARD) do_restart;;   # Hard CRITICAL - Our last chance to take action
            *)    do_nothing;;
        esac
    }

    do_soft() {
        case "$tries" in
            3|4|5) do_restart;;  # Okay, let's restart it before it goes hard
            *)     do_nothing;;  # Don't restart yet
        esac
    }
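The decision logic on these two slides can be condensed into one side-effect-free function that is easy to try out. This is a sketch under the assumption that the soft-state rule is meant to match attempts 3, 4, or 5; the function name `decide` and its printed words are hypothetical.

```sh
#!/bin/sh
# decide STATE TYPE TRIES BASE_SERVICE -> prints "restart" or "nothing"
# Mirrors the slide's dispatch without actually restarting anything.
decide() {
    state="$1"; type="$2"; tries="$3"; base="$4"
    case "$state" in
        OK)
            # Mirrors do_fixit: only the Workers service acts on OK.
            case "$base" in
                Workers) echo restart ;;
                *)       echo nothing ;;
            esac ;;
        CRITICAL)
            case "$type" in
                HARD) echo restart ;;
                SOFT)
                    case "$tries" in
                        3|4|5) echo restart ;;   # act before the state goes hard
                        *)     echo nothing ;;
                    esac ;;
                *) echo nothing ;;
            esac ;;
        *) echo nothing ;;   # WARNING, UNKNOWN, and anything unexpected
    esac
}

for t in 1 2 3; do
    echo "CRITICAL SOFT attempt $t: $(decide CRITICAL SOFT "$t" Dynamo)"
done
```

Note that `case` alternatives are separated with `|`; a pattern written as `3,4,5` would only match that literal string.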

Page 20: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

A Closer Look at EKrestart

    do_restart() {
        # <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>
        ssh $machine <EKrestart> -r do_$service <parameters>
        # <tear down, unlock, close log, send_nagios, log to SQL, etc>
        exit
    }

    # On the client side, we use the same EKrestart script, but start at client_code()
    client_code() {
        host=`hostname`
        function="$2"
        service="$3"
        # (etc)
        eval $function
        exit
    }

    # Example function
    do_Dynamo() {
        # lock file processing
        # turn off new sessions, wean existing ones
        # /etc/init.d/restart_dynamo_$instance
        # tear down
        return
    }
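Because the client side runs whatever function name arrives over SSH, it can help to validate the name before calling it. The sketch below shows one cautious alternative to a bare `eval`; it is not the original code. The `dispatch` helper and the placeholder body of `do_Dynamo` are assumptions for illustration.

```sh
#!/bin/sh
do_Dynamo() { echo "restarting Dynamo instance $1"; }   # placeholder body

# dispatch FUNCTION [ARGS...]
# Calls FUNCTION only if its name is do_<service> built from letters,
# digits, and underscores, and it resolves to a known function or command.
dispatch() {
    fn="$1"
    shift
    case "$fn" in
        do_|do_*[!A-Za-z0-9_]*)
            echo "rejected: $fn" >&2
            return 2 ;;
        do_*) ;;
        *)
            echo "rejected: $fn" >&2
            return 2 ;;
    esac
    if ! command -v "$fn" >/dev/null 2>&1; then
        echo "unknown: $fn" >&2
        return 3
    fi
    "$fn" "$@"
}

dispatch do_Dynamo 2   # prints: restarting Dynamo instance 2
```

Calling the function by name keeps the arguments intact and avoids re-parsing the string as shell code.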

Page 21: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Integrating Nagios into Operational Procedures

Page 22: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Integration with Operations

Homebrew API

nchart, send_nagios, nlog – all portable to other installations of Nagios on other machines

Integrate with start/stop scripts

Lock files. Lots of lock files! TOO MANY lock files!!

The “Rippler”

Leverage EKrestart, cron, and send_nagios

Pager / Twitter and lots of private twitter feeds

Inter-group notifications

Predominately with procmail
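The slides do not show how send_nagios works. A common way for scripts to report back to Nagios is to write a `PROCESS_SERVICE_CHECK_RESULT` line to the external command file, so the sketch below assumes that mechanism. The function names, the sample host and service, and the default command file path (which is set by `command_file` in nagios.cfg) are assumptions.

```sh
#!/bin/sh
# format_passive_result TIMESTAMP HOST SERVICE RETURN_CODE OUTPUT
# Builds one external-command line for a passive service check result.
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Assumed default location of the command file (a named pipe).
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# submit_result HOST SERVICE RETURN_CODE OUTPUT
# Stamps the result with the current time and hands it to Nagios.
submit_result() {
    format_passive_result "$(date +%s)" "$@" >> "$CMD_FILE"
}

format_passive_result 1348588800 web-01 Dynamo 0 "OK - restarted by event handler"
```

Keeping the formatting separate from the write makes the helper easy to reuse from start/stop scripts and cron jobs on other machines.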

Page 23: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Predictive Failure Recovery and a Good Night’s Sleep

Page 24: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Predictive Failure Recovery

On ATG/Dynamo (and other) services

do_soft triggers do_restart on third failure

do_hard always triggers restart

Notifications on fourth failure

Escalation to pager only on fifth notification

Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications

Service check dependencies allow us to know whether it’s a bad application, server, or user experience

Twitter – follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!
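The staged response can be written out as a small timeline. This is only a model of the numbers on the slide: it assumes the service goes HARD on the fourth failed check, and the function names and printed labels are hypothetical. In a real installation these thresholds would live in the Nagios service and escalation definitions rather than in a script.

```sh
#!/bin/sh
# Thresholds taken from the slide; treating them as variables is an assumption.
RESTART_AT=3   # do_soft restarts the service on the third failed check
NOTIFY_AT=4    # the fourth failed check triggers the first notification
PAGE_AT=5      # the pager is only added from the fifth notification on

# action_for_failure N -> what happens on the Nth consecutive failed check
action_for_failure() {
    if [ "$1" -lt "$RESTART_AT" ]; then
        echo "wait"
    elif [ "$1" -lt "$NOTIFY_AT" ]; then
        echo "restart"
    elif [ "$1" -eq "$NOTIFY_AT" ]; then
        echo "restart+notify"
    else
        # Later notifications depend on the notification interval.
        echo "renotify"
    fi
}

# targets_for_notification K -> who receives the Kth notification
targets_for_notification() {
    if [ "$1" -ge "$PAGE_AT" ]; then
        echo "email pager"
    else
        echo "email"
    fi
}

for n in 1 2 3 4; do
    echo "failure $n: $(action_for_failure "$n")"
done
echo "notification 5 goes to: $(targets_for_notification 5)"
```

The gap between the first restart and the first page is what gives automatic recovery a chance to work before anyone is woken up.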

Page 25: Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak Company

Questions

Eric Loyd

Founder & CEO

Bitnetix Incorporated

[email protected]

877.BITNETIX