Nagios Conference 2011 - Dave Williams - Nagios In The Real World - The Datacentre

Nagios in the Real World - Dave Williams, Technical Architect


DESCRIPTION

Dave Williams's presentation on using Nagios in the datacenter. The presentation was given during the Nagios World Conference North America, held September 27-29, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

TRANSCRIPT

Page 1: Nagios Conference 2011 - Dave Williams - Nagios In The Real World - The Datacentre

Nagios in the Real World
Dave Williams, Technical Architect

2 ©Bull, 2011 Presentation Title

Agenda

- Introduction
  - General Background
  - System Monitoring Background

- Example Implementations of Nagios
  - UK Customer Examples

- Datacentre Monitoring with Nagios
  - What is a Datacentre?
  - Software & Hardware combinations
  - Vision

- Conclusions

Background

- UK based
  - Mainframe (IBM & Honeywell)
  - Unix (HP-UX, AIX, Solaris)
  - Network (CASE, 3COM, CISCO)

- Working for Bull
  - French Computer Manufacturer
  - Mainframes, Unix, HPC, Security, Managed Services

Background

- System Monitoring
  - OpenView
  - Netview
  - Open Master

- Open Source Monitoring
  - NetSaint on AIX
  - Nagios

Example Implementations

Crown Office Procurator Fiscal Service

- Responsible for the prosecution of crime in Scotland
  - Investigation of suspicious deaths
  - Complaints against the Police

- IT Locations in Glasgow & Edinburgh
  - Windows at every Court of Justice in Scotland
  - AIX / Oracle DB at Glasgow & Edinburgh

Crown Office Procurator Fiscal Service

- Already used Solarwinds for some network monitoring

- Strategy demanded AIX based monitoring & reporting
- In a competitive tender Nagios was selected
- Main success points were simplicity and ease of customisation
- Fitted within AIX based distance data replication already in use

Crown Office Procurator Fiscal Service

- 60+ Windows systems monitored for CPU, Disk Space etc

- 2 AIX servers monitored for CPU, Disk Space etc

- Two Oracle Instances monitored for performance and DBspace usage

- All alerts shown on monitor screen and, if necessary, sent as SMS text alerts
- Installed 2005, still working
- Provides ‘backstop’ to Solarwinds for capacity monitoring on the WAN & LAN.

Rother District Council

- “Working with the community to improve the overall well-being of the District”
- Responsible for Waste Collection, Housing, Planning & Building Control
- The District covers some 200 square miles and serves a population of around 90,000 inhabitants.

Rother District Council

- Monitoring 20+ Windows Servers for CPU, Disk Utilisation etc

- Monitoring numerous disparate Applications
- Reporting on Availability
- Monitoring Printer status
- Unexpected benefits

North Yorkshire County Council

- Internet Access system for 30,000 pupils

- Monitoring e-mail, internet access, IDS, AV, Webservers
- Reporting on Availability
- Monitoring Service Level Indicators
- Mix of application providers (Scalix, Plesk)
- Mix of appliance systems: Cisco, Panda, Radware, NetEnforcer, MyFilter

North Yorkshire County Council

- System Schematic

North Yorkshire County Council

- Uses NRPE to perform active checks on hosts

- Multi O/S support
  - Debian
  - RedHat

- Uses NSCA to accept check results from Windows
  - Via NagiosEventLog
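
The slides do not show how the Windows results are submitted, so the following is only a minimal sketch of the line format that the standard send_nsca client reads on standard input: host name, service description, return code and plugin output, separated by tabs. The function name and the host and service names in the usage note are hypothetical.

```python
# Sketch: build one passive check result line in the tab-separated form
# read by send_nsca (host, service, return code, plugin output).

VALID_RETURN_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def format_passive_result(host, service, return_code, output, delimiter="\t"):
    """Return a single newline-terminated passive check result line."""
    if return_code not in VALID_RETURN_CODES:
        raise ValueError(f"invalid Nagios return code: {return_code}")
    # Collapse embedded newlines so the result stays on one line.
    clean_output = " ".join(output.splitlines())
    return delimiter.join([host, service, str(return_code), clean_output]) + "\n"
```

For example, `format_passive_result("winhost01", "EventLog", 2, "Disk error in System log")` yields one line that could be piped to send_nsca; the receiving Nagios server must define a matching host and service that accept passive checks.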

North Yorkshire County Council

- E-mail
  - Scalix running on Redhat Cluster. Checking all processes, cluster state etc.

- PLESK Web server
  - Checking availability of web sites via test installation
  - Monitoring disk utilisation and processor utilisation

- AV systems
  - Monitoring availability
  - Checking on AV database

- Myfilter
  - Monitoring email filters running
  - Checking that sufficient email filters are available

North Yorkshire County Council

- E-mail
  - Nagios server runs external email loopback test every 20 minutes to confirm external reachability.

- PLESK Web server
  - Straightforward implementation of check_http

- NetBackup
  - Monitoring that backups have run
  - Checking that enough backup tapes are available

- Business Availability
  - Define which services constitute a business line
  - 07:00 check: tell support before the customers come on line
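
The presentation does not say how the business-line state is derived, so the sketch below assumes one simple rule: a business line takes the worst state of its member services. The severity ordering (OK < UNKNOWN < WARNING < CRITICAL), the function name and the service names in the usage note are all assumptions for illustration.

```python
# Sketch: derive a "business line" state from its member services by
# taking the worst state. The ordering below is an assumption.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios return codes

# Rank used only for comparing severities (higher means worse).
SEVERITY_RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def business_line_state(service_states):
    """Return the worst Nagios state among a mapping of service -> state."""
    if not service_states:
        # No member services reported: the state cannot be determined.
        return UNKNOWN
    return max(service_states.values(), key=lambda state: SEVERITY_RANK[state])
```

With hypothetical members, `business_line_state({"smtp": OK, "imap": WARNING, "ldap": OK})` returns `WARNING`, which an early-morning check could report to support before users log on.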

NYCC - Nagiosgraph

- Nagiosgraph
  - Uses process_performance_data
  - Example of Unix load average
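
Performance data is the part of a plugin's output after the `|` character, which Nagios passes on to graphing tools when process_performance_data is enabled. As a rough sketch of what such a tool has to do, the parser below extracts label, value and unit from each item. It is not nagiosgraph's own code, it ignores thresholds, and it does not handle quoted labels containing spaces; the sample line in the usage note is illustrative.

```python
import re

# Sketch: pull label, value and unit of measure out of the perfdata
# portion (after "|") of a plugin output line.
_PERF_ITEM = re.compile(
    r"(?P<label>[^=\s]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s\d]*)"
)


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} for each perfdata item found."""
    _, separator, perfdata = plugin_output.partition("|")
    if not separator:
        return {}  # plugin emitted no performance data
    metrics = {}
    for token in perfdata.split():
        match = _PERF_ITEM.match(token)
        if match:
            label = match.group("label").strip("'")
            metrics[label] = (float(match.group("value")), match.group("uom"))
    return metrics
```

Given a load-average style line such as `OK - load average: 0.25, 0.18, 0.12|load1=0.250;5.000;10.000;0; load5=0.180;4.000;6.000;0; load15=0.120;3.000;4.000;0;`, the function returns the three load values keyed by `load1`, `load5` and `load15`.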

NYCC – Nagios Monitoring

- Scalix Email System

NYCC

- Alerts sent via email to customers as well as support

- Backup notifications via SMS Text

- Use Nagios Looking Glass for Customer View

- nagiosgraph used to catch all service performance data
  - Debian & Redhat performance metrics
  - Network throughput from LAN switches
  - LDAP response time

Datacentre Monitoring with Nagios

What is a DataCentre ?

- A data center (or datacentre) is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls and security devices.

(Wikipedia)

How good is your DataCentre ?

- The TIA-942: Data Center Standards Overview describes the requirements for the data centre infrastructure. The simplest is a Tier 1 data centre, which is basically a server room, following basic guidelines for the installation of computer systems. The most stringent level is a Tier 4 data centre, which is designed to host mission critical computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric access control methods.

(Wikipedia)

What is a DataCentre ?

- Tier 1 Requirements
  - Single non-redundant distribution path serving the IT equipment
  - Non-redundant capacity components
  - Basic site infrastructure guaranteeing 99.671% availability

- Tier 2 Requirements
  - Fulfills all Tier 1 requirements
  - Redundant site infrastructure capacity components guaranteeing 99.741% availability

- Tier 3 Requirements
  - Fulfills all Tier 1 and Tier 2 requirements
  - Multiple independent distribution paths serving the IT equipment
  - All IT equipment must be dual-powered and fully compatible with the topology of a site's architecture
  - Concurrently maintainable site infrastructure guaranteeing 99.982% availability

- Tier 4 Requirements
  - Fulfills all Tier 1, Tier 2 and Tier 3 requirements
  - All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems
  - Fault-tolerant site infrastructure with electrical power storage and distribution facilities guaranteeing 99.995% availability

- ©Uptime Institute
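
To make the availability percentages more tangible, they can be converted into permitted downtime per year. The short sketch below does that arithmetic for the four figures quoted above, assuming a 365-day year; the function and constant names are illustrative.

```python
# Sketch: convert the tier availability percentages into hours of
# downtime per year (365-day year assumed).

HOURS_PER_YEAR = 365 * 24  # 8760

# Availability figures as quoted for each tier.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}


def annual_downtime_hours(availability_percent):
    """Return the downtime in hours per year implied by an availability %."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for tier, percent in sorted(TIER_AVAILABILITY.items()):
    hours = annual_downtime_hours(percent)
    print(f"Tier {tier}: {percent}% -> {hours:.2f} hours/year")
```

On these figures Tier 1 allows roughly 28.8 hours of downtime a year, while Tier 4 allows under half an hour.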

What is a Green DataCentre ?

- The most commonly used metric to determine the energy efficiency of a data centre is power usage effectiveness, or PUE. This simple ratio is the total power entering the data centre divided by the power used by the IT equipment.

- PUE = Total facility Power / IT Equipment Power

- Power used by support equipment, often referred to as overhead load, mainly consists of cooling systems, power delivery, and other facility infrastructure like lighting. The average data centre in the US has a PUE of 2.0, meaning that the facility uses one Watt of overhead power for every Watt delivered to IT equipment. State-of-the-art data centre energy efficiency is estimated to be roughly 1.2.
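
The PUE ratio above is simple enough to compute directly from two power readings. This is a minimal sketch; the function names and the 200 kW / 100 kW readings in the usage note are made up for illustration.

```python
# Sketch: PUE = total facility power / IT equipment power.


def power_usage_effectiveness(total_facility_kw, it_equipment_kw):
    """Return the PUE for the given power readings (same unit for both)."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


def overhead_per_it_watt(pue):
    """Return watts of overhead (cooling, distribution, lighting) per IT watt."""
    return pue - 1.0
```

For example, `power_usage_effectiveness(200.0, 100.0)` gives 2.0, and `overhead_per_it_watt(2.0)` gives 1.0, matching the statement that a PUE of 2.0 means one Watt of overhead for every Watt delivered to IT equipment.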

Bull Datacentre BC1 ?

- New datacentre build on an already existing site

- Design criteria PUE 1.6

- Easily expanded on demand

- Tier 3

Bull UK Datacentre BC1

- What do you get for £1.2m ?

Bull UK Datacentre BC1

- New Mains Incomer
  - Took feed from 11kV ring
  - Had to build own substation

- 1.2MW Generator
  - Required 8000 litre fuel tank
  - Switchgear to automatically start generator if mains incomer fails (10-45 seconds)

- 3 x Ambient CRAC Units
  - Cooling via external temperature differential
  - N+1 configuration
  - Hot Aisle Containment

- In-Line UPS
  - UPS only required to keep IT equipment running until generator fires up
  - Uses space in Cab rows, easily scalable according to load

Bull UK Datacentre BC1 - Monitoring

- Physical Environment
  - APC Netbotz Devices
    • Translate inputs from sensors
    • Humidity, Temperature, Dew Point
  - SEAL I/O Dry Contact
    • Voltage indicators
    • For CRAC, FM200, Generator, UPS

- Electrical Efficiency
  - PowerLogic ION software reads from power meters
  - Power meter on every Distribution Board
  - Real-time calculation of PUE

- Power Distribution
  - Every PDU strip (2 per Cab) monitored for power consumption & problems
  - A number of PDU strips also have remote control down to socket level

- Management Network
  - LAN infrastructure required to support the Datacentre
  - Servers required to support the datacentre
  - External alert mechanisms

Bull UK Datacentre BC1

- What does Netbotz look like ?

Bull UK Datacentre BC1

- What does SeaLevel look like ?

Bull UK Datacentre BC1

- What does ION look like ?

Bull UK Datacentre BC1

- What does a metered PDU look like ?

Bull UK Datacentre BC1

- What does a managed PDU look like ?

Bull UK Datacentre BC1

- Nagios Map

Bull UK Datacentre BC1

- Nagios Host Groups

Bull UK Datacentre BC1

- Do things go wrong - yes

Bull UK Datacentre BC1

- Do things go wrong - yes & no

Datacentre Monitoring Schematic

Nagios Products in use

- Nagios Core
  - NRPE
  - NSCA

- Nagios Looking Glass

- Nagvis

- EventDB

- SNMPTT

- Nagmap

- NDO

Other Open Source Products in use

- Nedi

- Arpwatch

- PSAD

- SMS-Client

- Bacula

- Confluence (Wiki)

- i-doit (ITIL CMDB)

- MRTG

- Routers2cgi

BC1 Datacentre Monitoring Elements

- Nagios Core
  - Normal install with direct polling of devices
  - Only looking at Datacentre

- Nagios Display System
  - Central reporting Nagios
  - Absorbs updates from other Nagios instances

- Information Display
  - Normal system with 5 heads

- Nagios Customer System
  - Running on an appliance connected to Customer network
  - Sends data via encrypted secured link to Display System

- Backup System
  - Use tape library
  - Hosts CMDB & Wiki

BC1 Datacentre Nagios Core

- Hardware Platform - Intel
  - O/S Centos 5
  - Xeon 2.8GHz, 8GB memory, 72GB RAID-1 disk

- Nagios 3.2.0
  - Built from source tarball

- Nagios Plugins 1.4.15-2
  - Installed from RPM

BC1 Datacentre Nagios Display System

- Hardware Platform - Intel
  - O/S Fedora Core 9
  - P4 2.8GHz, 2.5GB memory, 76GB RAID-1 disk
  - Nvidia dual monitor display card, DVI interfaces

- Nagios 3.0.6
  - Built from source tarball

- Nagios Plugins 1.4.13-9
  - Installed from RPM

BC1 Datacentre Normal Display System

- Hardware Platform - AMD
  - O/S Centos 5
  - Athlon 1.2GHz, 1.0GB memory, 3GB disk
  - Matrox G200 Quad Head

- Runs console displays – http/RDP/ssh

BC1 Datacentre Customer System

- Hardware Platform - Motion Tablet
  - O/S Ubuntu 10.04 LTS
  - Pentium M 1.5GHz, 0.5GB memory, 30GB disk
  - Touch Screen tablet system

- Nagios 3.2.3
  - Built from tarball

- Nagios Plugins 1.4.15
  - Built from tarball

- Nagios NSCA
  - Sends status (encrypted) to central reporting system

BC1 Datacentre Backup System

- Hardware Platform - Intel
  - O/S Centos 5
  - Xeon 3.06GHz, 2.0GB memory, 108GB disk

- Uses Bacula 5.0.3
  - Controls SDLT 20 slot tape library
  - Backs up all Datacentre Infrastructure
    • Windows
    • Centos
    • Ubuntu

Conclusions

- Strategic Overall Design
  - Know what you need to monitor
  - Know who needs to be told

- Expect to throw the first version away
  - Only when you have fully engineered the solution will you understand all of the issues
  - Keep a record of design decisions

- You will have to make it pretty for management
  - Accept that an attractive display will be required
  - Reporting will become key

- It must be reliable
  - Make backups
  - Consider clustering & recovery options

& Hints

Hints & Experience

- Separate Display systems from Monitoring systems
  - If you are tracking 10,000s of services you don’t want processor heavy graphics as well

- Escalation & Alerting take time
  - Firstly to get right with your organisation
  - Secondly to actually physically do!

- Suppliers go out of their way to make it difficult
  - Don’t give in: there is always a way to get Nagios involved
  - Screen scrape, email, telnet, RS232 are all possible

- SNMP is your friend
  - When in doubt use SNMP to help you out
  - SNMP V3 with AES cypher is suitably secure for most implementations
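
As an illustration of the screen-scrape route, the sketch below turns text scraped from a device status page into a Nagios-style result: a return code plus one line of plugin output with performance data. The `Temperature: <n> C` page format, the thresholds and the function name are all hypothetical; fetching the page (HTTP, telnet, serial) is left out.

```python
import re

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_scraped_temperature(page_text, warn=27.0, crit=32.0):
    """Return (return_code, plugin_output) for scraped status text.

    Assumes the device page contains a line like 'Temperature: 21.5 C'
    (a hypothetical format chosen for this sketch).
    """
    match = re.search(r"Temperature:\s*(-?\d+(?:\.\d+)?)\s*C", page_text)
    if match is None:
        return UNKNOWN, "TEMP UNKNOWN - no temperature found in page"
    value = float(match.group(1))
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    perfdata = f"temp={value};{warn};{crit}"
    return code, f"TEMP {STATE_LABELS[code]} - {value} C | {perfdata}"
```

A real plugin would print the output line and exit with the return code, so that Nagios (directly or via NRPE) can pick up the state.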
