Multi-Tenant Nagios Monitoring
Dave Williams, Technical Architect
October 14th 2014
© Bull, 2014




Agenda

– Background
– Multi-Tenant Monitoring
– Why Multi-Tenant
– Multi-Tenant Design
– Service Catalogue
– Futures & ‘Blue Sky thinking’
– Questions


Background

UK based
  – Mainframe (IBM & Honeywell)
  – Unix (HP-UX, AIX, Solaris)
  – Linux (Red Hat, SLES, Debian)
  – Network (CASE, 3Com, Cisco)

Working for Bull
  – French computer manufacturer
  – Mainframes, Unix, HPC, Security, Managed Services, Advisory Services


Background

System Monitoring
  – OpenView
  – NetView
  – Open Master

Open Source Monitoring
  – NetSaint on AIX
  – Nagios


Why Multi-Tenant?

Outsourcing Support & Monitoring

Multiple Customers
  – Different levels of security
  – Different hardware / software platforms

One Support Team
  – Only needs to know about real problems
  – Can be driven by support ticket, not Nagios

Required 365 x 24
  – Infrastructure must survive all outages without loss of service


Multi-Tenant Design

Each customer may have
  – 2-3000 hosts
  – 10-100 services per host
  – Real time monitoring

Customer profile
  – SLA reporting
  – Batch event completion
  – Different SLAs for each business process per customer
  – Different alerting & escalation methods per customer
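To get a feel for the check volume these figures imply, here is a rough back-of-envelope sketch. It reads "2-3000 hosts" as 2,000 to 3,000 and assumes every service is checked once every five minutes. Both the reading and the interval are assumptions for illustration, not figures from the talk.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_s: float = 300.0) -> float:
    """Average service checks per second if each service runs once per interval."""
    return hosts * services_per_host / interval_s


# Assumed bounds: 2,000 hosts x 10 services up to 3,000 hosts x 100 services.
low = checks_per_second(2000, 10)
high = checks_per_second(3000, 100)

print(f"{low:.1f} to {high:.1f} checks/s per customer")
```

Under these assumptions the upper bound is roughly fifteen times the lower one, which is one reason a per-customer profile matters when sizing the platform.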


Multi-Tenant Design

Hardware Platform – Central Support

Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – Inexpensive licensing

Shared Storage
  – NAS
    – Using QNAP appliances with underlying RAID-5 & hot spare protection
    – Network connection using dual interfaces bonded across multiple switches
    – Could have used FreeNAS

LAN Infrastructure
  – Dual connections to all hardware
  – SNMP managed switches


Hardware Platform – Basic Schematic


Multi-Tenant Design

Hardware Platform – Resilience

Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – If the primary node fails, the cluster will ‘spin up’ the image on the 2nd node
    – Same data / logs (shared storage)

LAN Infrastructure
  – Dual connections to all hardware
    – Bonded interfaces for NAS access – no data loss / access loss with failure
  – SNMP managed switches


Hardware Setup


Multi-Tenant Design

Hardware Platform – Recovery

Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – If the primary site fails, the image is spun up
    – Internet access fails over – using BGP

Shared Storage – replicated from Prime Site
  – NAS
    – Using QNAP appliances with underlying RAID-5 & hot spare protection
    – Using RTRR (Real Time Remote Replication) between sites
    – Network connection using dual interfaces bonded across multiple switches

LAN Infrastructure
  – Dual connections to all hardware
    – Bonded interfaces for NAS access – no data loss / access loss with failure
  – SNMP managed switches


Hardware Platform - Resilience


Hardware Platform – Customer Site

Using generic netbooks
  – Minimum requirement
    – 1 GB memory, Atom processor, Ethernet port
    – Running CentOS 6.4 64-bit operating system

Can use Raspberry Pi for small customers
  – 512 MB memory, ARM processor, Ethernet port
  – Running Raspbian operating system


Software Platform – Central Site

Nagios Core
  – Running latest 4.0.8
  – Using MK Livestatus for interfacing
  – Using Thruk for visualisation

Graylog2 / Elasticsearch
  – Store all logs & syslog in ‘Big Data’ repository using MongoDB

Asterisk PBX
  – Allow all alerting to use standard dial-up with speech synthesis + IVR

SMS-Client
  – Still using TAPI to SMS text contacts


Software Platform – Central Site (contd)

NRPE
  – Running 2.1.5

NSCA & NSCA-ng
  – Using NSCA for external communication
  – Using NSCA-ng for issuing remote commands

Postfix / Procmail
  – Used to generate emails but also to handle responses
  – Routes unsolicited alerting emails (HP Insight, Pingdom)

OTRS
  – Record alerts, track issues


Software Platform – Remote Site

Nagios Core
  – Running latest 4.0.8

NRPE
  – Running 2.14

NSCA
  – Using NSCA for external communication

OpenVPN
  – Communication via IPSec VPN


Customer Multi-Tenant


Multi Tenant Schematic


Service Catalogue

ITIL flavour
  – Really just services & their characteristics


Service Catalogue

Agreed list of servers / services
  – With importance levels
  – With alerting paths
  – With escalation paths
  – Recovery options

Feeds into Service Level Agreements and Operational Level Agreements
  – Basis of agreed reporting structures


Examples

Basic spreadsheet plus shell script
  – Usually easy to create
  – The shell script is different for each customer, based on an initial standard script

Chef or Puppet
  – Use Exported Resources
  – Nagios Cookbook – Nagios Conference 2012 presentation
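The spreadsheet-driven approach can be sketched as follows. This is a minimal illustration in Python rather than the shell script used in practice. The CSV column names, the sample rows, the customer-prefixed host names and the `generic-host` / `generic-service` templates are all assumptions made for the example.

```python
import csv
import io

# Hypothetical catalogue export; column names and rows are illustrative only.
CATALOGUE_CSV = """customer,host,address,service,check_command,contact_group
custa,server01,10.1.0.11,Disk,check_nrpe!check_disk,custa-oncall
custa,server01,10.1.0.11,Load,check_nrpe!check_load,custa-oncall
custb,server01,10.2.0.11,HTTP,check_http,custb-support
"""


def render_objects(csv_text: str) -> str:
    """Turn catalogue rows into Nagios host and service object definitions."""
    blocks = []
    seen_hosts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Prefix the customer so that two customers can both own a 'server01'.
        nagios_host = f"{row['customer']}-{row['host']}"
        if nagios_host not in seen_hosts:
            seen_hosts.add(nagios_host)
            blocks.append(
                "define host {\n"
                "    use        generic-host\n"  # assumed template name
                f"    host_name  {nagios_host}\n"
                f"    address    {row['address']}\n"
                "}\n"
            )
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"  # assumed template name
            f"    host_name            {nagios_host}\n"
            f"    service_description  {row['service']}\n"
            f"    check_command        {row['check_command']}\n"
            f"    contact_groups       {row['contact_group']}\n"
            "}\n"
        )
    return "\n".join(blocks)


config = render_objects(CATALOGUE_CSV)
print(config)
```

Keeping the alerting path (here, the contact group) in the same row as the service means the catalogue stays the single agreed source for both monitoring and escalation.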


Multi Tenant Issues

Naming conventions
  – Every customer has a server01
  – Customers' naming conventions are obscure
  – Customers have multiple physical locations or levels of security
    – This gives rise to Nagios names that differ from the actual names:
    – Custloc1-swfeltsw01
    – Custloc2-nwfeltsw01

Not so smart when a non-Nagios originated alert is received
  – ‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
  – The external alert processor has to perform table lookups before building the appropriate NSCA command, for example
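The table lookup described above might look something like this sketch. The alias table, the way the customer is identified, and the "HP Insight" service description are hypothetical. The output line follows the tab-separated host, service, return code and plugin output layout that `send_nsca` reads on standard input.

```python
# Hypothetical lookup table: (customer/location, customer's own host name) -> Nagios host name.
HOST_ALIASES = {
    ("custloc1", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("custloc2", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}

CRITICAL = 2  # Nagios return code for a CRITICAL state


def build_nsca_line(customer: str, external_host: str, service: str,
                    state: int, message: str) -> str:
    """Map an externally reported host to its Nagios name and format a passive result."""
    key = (customer.lower(), external_host.lower())
    nagios_host = HOST_ALIASES.get(key)
    if nagios_host is None:
        raise KeyError(f"no Nagios host known for {key}")
    # One passive check result per line, fields separated by tabs.
    return f"{nagios_host}\t{service}\t{state}\t{message}\n"


line = build_nsca_line("custloc1", "swfeltsw01", "HP Insight",
                       CRITICAL, "RAID battery backup failure")
print(line, end="")
```

Raising an error for unknown hosts, rather than guessing, keeps an alert for one customer's `swfeltsw01` from being filed against another customer's device.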


Futures & Blue Sky thinking

The Nagios visualisation is resource heavy
  – All customers want their own dashboard
  – All customers want a different screen layout

Why not move the visualisation into the cloud?
  – Use an Amazon EC2 image to access central Livestatus via HTTPS
  – Allow end user to authenticate
  – Customer portal allows ‘spin up’ & ‘spin down’ of images
    – Move billing to the customer
    – Scale horizontally for visualisation
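A per-customer dashboard would need to pull only that customer's data from the central Livestatus. The sketch below builds such a query in Livestatus Query Language and shows one way to send it over a local Unix socket. It assumes each customer's hosts sit in a host group named after the customer, and the socket path is whatever the site configures. The HTTPS front end mentioned above is not shown.

```python
import json
import socket


def build_query(customer_hostgroup: str) -> str:
    """Build a Livestatus query for one customer's services that are not OK."""
    lines = [
        "GET services",
        "Columns: host_name description state plugin_output",
        f"Filter: host_groups >= {customer_hostgroup}",  # '>=' on a list column means 'contains'
        "Filter: state != 0",
        "OutputFormat: json",
    ]
    # A Livestatus request is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path: str, query: str) -> list:
    """Send a query to a Livestatus Unix socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8"))


query = build_query("custa")  # 'custa' is a hypothetical host group name
print(query)
```

Filtering on the server side means a customer's dashboard image never receives another tenant's hosts, whatever screen layout it then applies.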


Load Sharing

– Using plugins like check_wmi_plus puts a strain on the monitoring system: a large number of queries that take wall clock time to complete and parse.
– Better to have ‘worker nodes’ via Merlin, Mod-Gearman or similar to perform these functions – a Raspberry Pi, for example.
– No great expense to add 2 or 3 Pis to customer site configurations; easy fall back if they fail – no unique locally stored data.
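The wall-clock point can be illustrated with a stand-in for slow checks. The sketch below is not how Merlin or Mod-Gearman are configured. It uses an assumed 50 ms sleep in place of a real WMI query and hypothetical host names, and simply compares running the checks serially with handing them to a small pool of workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_check(host: str) -> tuple:
    """Stand-in for a slow remote query; returns (host, Nagios return code)."""
    time.sleep(0.05)  # assumed latency, for illustration only
    return host, 0


hosts = [f"winhost{n:02d}" for n in range(1, 9)]  # hypothetical host names

# Run every check one after another on a single node.
start = time.perf_counter()
serial = [slow_check(h) for h in hosts]
serial_s = time.perf_counter() - start

# Hand the same checks to four workers.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = list(pool.map(slow_check, hosts))
pooled_s = time.perf_counter() - start

print(f"serial: {serial_s:.2f}s, 4 workers: {pooled_s:.2f}s")
```

Because the workers hold no state of their own, losing one only reduces throughput, which matches the "no unique locally stored data" point above.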


BPI Example


Dashboard Example


Questions?
