dave williams - nagios log server - practical experience

Nagios Log Server: Practical Experience

Dave Williams


| 31-07-2015 | Dave Williams | © Atos GB | Managed Services | TTS

Agenda

▶ Background

▶ Why choose Nagios Log Server

▶ Implementation

▶ Source Configuration

▶ Useful things to know

▶ Initial Dashboards

▶ Final Dashboards

▶ System Performance

▶ Conclusions



Background

▶ UK based

– Mainframe (IBM & Honeywell)

– Unix (HP-UX, AIX, Solaris)

– Linux (RedHat, SLES, Debian)

– Network (CASE, 3COM, CISCO)

▶ Working for Atos

– French Outsourcing Company

– Mainframes, Unix, HPC, Security, Managed Services, Advisory Services


Background

▶ System Monitoring

– OpenView

– Netview

– Open Master

▶ Open Source Monitoring

– NetSaint on AIX

– Nagios

– Nagios XI


Why choose Nagios Log Server?

▶ Needed a log server of some nature

▶ Already built an Elasticsearch & Logstash system (not using Kibana) by hand

▶ Used Splunk in a previous life to good effect

▶ Nagios Log Server was announced last year, after Ethan and others had taken note

▶ Seemed to be a ‘cost effective’ easy build option

▶ Included authentication & access control necessary for Managed Services environment.



Implementation

▶ Because we use CentOS, installed from source

– No great issues; the NTP requirement in the install script was overcome.

• Complete!

• 12 Aug 18:40:02 ntpdate[2930]: no server suitable for synchronization found

• ===================

• INSTALLATION ERROR!

• ===================

• Installation step failed - exiting.

• Check for error messages in the install log (install.log).

• If you require assistance in resolving the issue, please include install.log

• in your communications with Nagios Enterprises technical support.



Implementation

• The step that failed was: 'prereqs'

• # Set date/time because ssl certificates can be in the future... (fix for pypi and get-pip)

• # ntpdate -u pool.ntp.org

▶ Easily able to move data storage to a nominated filesystem



Implementation

▶ Connecting a new instance to the cluster :

– really is as simple as the manual describes

• install on new host

• connect to the web interface

• enter IP address / name of original cluster node

• enter Cluster ID of the original system

– Finish Installation.



Underlying Structure

[Diagram: Logstash on Server 1 through Server N pushes data into the Elasticsearch cluster, which is queried by Kibana.]


Source Configuration

▶ Creation of feeds straightforward.

– First syslog, using remote syslog to accept other systems' data

– Because SNMPTT logs to syslog, SNMP traps are recorded as well

– Could use Eventlog (NXLog) for Windows in future

▶ VMware logs – from the ESXi hosts, not the VMs:

– Add Input, udp {

type => 'esxilogs'

port => 1514

}

– Save and apply, adjust iptables if required

– Follow this VMware configuration guide to set up your ESXi hosts to log to udp://nagios.log.server.ip:1514

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007329

– Or read https://assets.nagios.com/downloads/nagios-log-server/docs/Sending-ESXi-Logs-To-Nagios-Log-Server.pdf
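Before touching the ESXi hosts it can help to confirm that the new UDP input is reachable by sending it a test datagram. The following is a minimal Python sketch, not part of the original deck: the function names, the `nls-test` program tag and the simplified syslog-like line (priority and host only, no timestamp) are assumptions for illustration.

```python
import socket


def build_syslog_line(hostname, program, message, priority=14):
    """Return a simplified syslog-like line as bytes (priority 14 = user.info)."""
    return "<{}>{} {}: {}".format(priority, hostname, program, message).encode("utf-8")


def send_test_syslog(server, port=1514, message="test message for the esxilogs input"):
    """Send one UDP datagram to the Logstash input and return the payload sent."""
    payload = build_syslog_line(socket.gethostname(), "nls-test", message)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (server, port))
    finally:
        sock.close()
    return payload
```

Calling `send_test_syslog("nagios.log.server.ip")` should produce one event of type esxilogs on the dashboard; if nothing arrives, iptables is the first thing to check, since UDP gives no delivery error.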



Source Configuration

For NetFlow use this:

Logstash has native NetFlow v5 and v9 codecs. It can't handle high volume (I'm guessing no more than a few hundred flows per second).

udp {
    host => "0.0.0.0"
    port => 2055
    codec => netflow { cache_ttl => 1 versions => [ 5, 9 ] }
    type => "netflow"
}

– Save and apply, adjust iptables if required
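If no router is exporting flows yet, the input can be exercised with a hand-built NetFlow v5 datagram. This is a sketch under assumptions: the function name and the sample values are illustrative, the field layout follows the commonly documented v5 format (24-byte header, 48-byte flow record), and most fields are left at zero.

```python
import socket
import struct
import time

HEADER = struct.Struct("!HHIIIIBBH")                  # 24-byte NetFlow v5 header
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")    # 48-byte flow record


def build_netflow_v5(src_ip, dst_ip, src_port, dst_port, packets, octets, sequence=0):
    """Build a NetFlow v5 datagram containing a single TCP flow record."""
    now = time.time()
    header = HEADER.pack(
        5, 1,                                  # version, record count
        0,                                     # sysUptime (ms)
        int(now), int((now % 1) * 1e9),        # export time: seconds, nanoseconds
        sequence,                              # flow sequence number
        0, 0, 0,                               # engine type, engine id, sampling
    )
    record = RECORD.pack(
        socket.inet_aton(src_ip), socket.inet_aton(dst_ip), socket.inet_aton("0.0.0.0"),
        0, 0,                                  # input / output interface index
        packets, octets,                       # packet and byte counters
        0, 0,                                  # sysUptime at first / last packet
        src_port, dst_port,
        0, 0, 6, 0,                            # pad, TCP flags, protocol (6 = TCP), ToS
        0, 0,                                  # source / destination AS
        0, 0, 0,                               # source mask, destination mask, pad
    )
    return header + record
```

The resulting bytes can be sent to port 2055 with an ordinary UDP socket (`sock.sendto(packet, (host, 2055))`), which is enough to see whether the codec decodes a flow.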



Source Configuration (Pi)

http://www.paluch.biz/blog/134-capturing-and-visualizing-sensor-data-using-the-elk-stack.html

▶ IoT (Internet of Things) simple solution:

– RasPi distance sensor:

– The Raspberry Pi sends its data regularly to Logstash as JSON over the TCP input. JSON is the simplest data format available on IoT platforms.

input {
    tcp {
        port => 9400
        codec => "json_lines"
    }
}
output {
    elasticsearch_http {
        host => "localhost"
        port => 9200
        index => "distance-%{+YYYY.MM.dd}"
    }
}
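The index setting above creates one index per day, named from the event timestamp. A small Python sketch of the naming rule, assuming (as I understand Logstash's date formatting) that the date is taken from the event's timestamp in UTC; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def daily_index_name(prefix, timestamp):
    """Mimic index => "<prefix>-%{+YYYY.MM.dd}" for a timezone-aware datetime."""
    utc_time = timestamp.astimezone(timezone.utc)
    return "{}-{}".format(prefix, utc_time.strftime("%Y.%m.%d"))


# A reading taken at 00:30 local time (UTC+1) still lands in the previous day's index.
local_reading = datetime(2015, 8, 1, 0, 30, tzinfo=timezone(timedelta(hours=1)))
index = daily_index_name("distance", local_reading)
```

Daily indices make it easy to expire old sensor data by deleting whole indices rather than individual documents.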


import socket
import json
import time
from distancemeter import get_distance, cleanup

# Logstash TCP/JSON host
JSON_PORT = 9400
JSON_HOST = '192.168.55.34'

if __name__ == '__main__':
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((JSON_HOST, JSON_PORT))
        while True:
            distance = get_distance()
            data = {'message': 'distance %.1f cm' % distance,
                    'distance': distance,
                    'hostname': socket.gethostname()}
            # json_lines expects one JSON document per newline-terminated line
            s.sendall((json.dumps(data) + '\n').encode('utf-8'))
            print("Received distance = %.1f cm" % distance)
            time.sleep(0.2)
    # interrupt
    except KeyboardInterrupt:
        print("Program interrupted")




Source Configuration (The Force Awakens)



Useful things to know

▶ How do I install Logstash plugins ?

– /usr/local/nagioslogserver/logstash/bin/plugin install logstash-codec-cef

– (Installs ArcSight logfile handler…)

▶ Check the latest upgrade documentation for how to pause shard allocation :

– https://assets.nagios.com/downloads/nagios-log-server/docs/Upgrade-Instructions-For-Nagios-Log-Server.pdf

– For large clusters this makes a real difference to how long a rolling update takes

▶ One of my favourite filters :

    if [severity_label] == "Notice" and [program] == "sudo" {
        drop {}
    }
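The effect of that filter is easy to reason about as a predicate over events. A minimal Python sketch of the same logic, with illustrative event dictionaries (the field values are made up, only the field names come from the filter):

```python
def should_drop(event):
    """True for events the filter would discard: sudo messages at Notice severity."""
    return event.get("severity_label") == "Notice" and event.get("program") == "sudo"


def apply_filter(events):
    """Return only the events that survive the drop filter."""
    return [event for event in events if not should_drop(event)]


sample = [
    {"program": "sudo", "severity_label": "Notice", "message": "session opened"},
    {"program": "sudo", "severity_label": "Alert", "message": "3 incorrect password attempts"},
    {"program": "sshd", "severity_label": "Notice", "message": "Accepted publickey"},
]
kept = apply_filter(sample)
```

Both conditions must match, so sudo messages at other severities and Notice messages from other programs are still indexed.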




▶ Get used to looking at curl -XGET 'http://localhost:9200/'

▶ Need the cluster state?

– # curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

{
  "cluster_name" : "80e9022e-f73f-429e-8927-xxxxxxxxxx",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 86,
  "active_shards" : 136,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 30
}
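The same health document can be turned into a Nagios-style result. This is a sketch only: mapping yellow to WARNING and red to CRITICAL is my own choice rather than anything from the deck, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def health_to_nagios(health):
    """Map a parsed _cluster/health document to (exit_code, summary)."""
    status = health.get("status", "unknown")
    code = {"green": OK, "yellow": WARNING, "red": CRITICAL}.get(status, UNKNOWN)
    summary = "cluster status {}: {} nodes, {} unassigned shards".format(
        status,
        health.get("number_of_nodes", "?"),
        health.get("unassigned_shards", "?"),
    )
    return code, summary


def fetch_health(base_url="http://localhost:9200"):
    """Fetch and parse the cluster health document from a local node."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A plugin wrapper would print the summary and exit with the code, for example `code, text = health_to_nagios(fetch_health())`.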




▶ Monitoring the Nagios Log Server

– Other presentations will cover this topic – see Eric Loyd, Track 1 @ 2:30 today

▶ But mainly query port 9200 locally (via NRPE) and then use check_procs for the appropriate processes.

▶ To uninstall manually :-

– Stop all of the relevant NLS processes (elasticsearch, logstash, and httpd) and remove the following directories:

– rm -rf /usr/local/nagioslogserver

– rm -rf /var/www/html/nagioslogserver

– You can now do a ./fullinstall




▶ If you run equipment that can only send syslog to port 514, Log Server can cope (privileged port access). NetApp is an example.

– There’s a document for this ! https://assets.nagios.com/downloads/nagios-log-server/docs/Listening-On-Privileged-Ports-With-Nagios-Log-Server.pdf

– You can change logstash to run as the root user.

– Open /etc/sysconfig/logstash and find the line: LS_USER=nagios

– Change this line to read LS_USER=root

– Restart the logstash service: # service logstash restart




▶ Alternative method of log shipping :-

– Formerly lumberjack, now logstash-forwarder (still uses the lumberjack protocol)

• Encrypted shipping of compressed logs

• Low impact compared to a full Logstash install

• Use self-signed certificates.

• Runs in EC2 micro instances

▶ CentOS 6

– wget http://packages.elasticsearch.org/logstashforwarder/centos/logstash-forwarder-0.3.1-1.x86_64.rpm

– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm

▶ CentOS 5

– wget http://download.elasticsearch.org/logstash-forwarder/packages/logstash-forwarder-0.3.1-1.x86_64.rpm

– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm




▶ Logstash plugins – over 180 at https://github.com/logstash-plugins

– Nice thing to know:

output {
    if [type] == "syslog"
        and [program] == "jenkins"
        and [job] == "Install on Cluster"
        and "_grokparsefailure" not in [tags]
    {
        nagios_nsca {
            host => "nagios.example.com"
            port => 5667
            send_nsca_config => "/etc/send_nsca.cfg"
            message_format => "%{job} %{repo}"
            nagios_host => "jenkins"
            nagios_service => "deployed %{repo}"
            nagios_status => "2"
        }
    } # if type=syslog, program=jenkins, job="Install on Cluster"
} # output
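The `%{job}` and `%{repo}` placeholders are filled in from fields of the event being output. A rough Python sketch of that substitution, for illustration only: the function name is hypothetical, nested field references are not handled, and leaving unknown references untouched is this sketch's behaviour.

```python
import re

FIELD_REF = re.compile(r"%\{([^}]+)\}")


def interpolate(template, event):
    """Replace %{field} references with values from the event dictionary."""
    def replace(match):
        name = match.group(1)
        # Unknown fields are left as written so the gap is visible in the result.
        return str(event[name]) if name in event else match.group(0)
    return FIELD_REF.sub(replace, template)


event = {"type": "syslog", "program": "jenkins", "job": "Install on Cluster", "repo": "webapp"}
service = interpolate("deployed %{repo}", event)
message = interpolate("%{job} %{repo}", event)
```

With the sample event, the passive check would be submitted for the service "deployed webapp" on host "jenkins".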



Initial Dashboards

▶ Apache dashboard :-


Hmm – what are the 404s?


Initial Dashboard



Initial Dashboards

▶ Zoom in by clicking on the 404 part of the Pie chart :-


Ah ! A good idea to find win40.jpg then.


Final Dashboards



Final Dashboards



Performance

▶ A good setting for controlling ES memory usage is the indices field data cache size. Limiting this cache makes sense because you rarely need to retrieve logs that are older than a few days. By default ES holds old indices in memory and never lets them go, so unless you have unlimited memory it makes sense to cap it.

▶ To limit the cache size, add the following value anywhere in your custom elasticsearch.yml configuration file. This setting, together with adjusting the Java heap size, should be enough to get started, but there are a few other things that might be worth checking.

▶ indices.fielddata.cache.size: 40%
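The percentage is relative to the Elasticsearch heap, not to total RAM, so it is worth working out what 40% means on a given node. A quick arithmetic sketch in Python; the 10 GiB heap is an illustrative figure, not one from the deck.

```python
GIB = 1024 ** 3


def fielddata_cache_bytes(heap_bytes, percent):
    """Size of the field data cache for a given heap size and percentage setting."""
    return heap_bytes * percent // 100


# Example: a node with a 10 GiB heap and indices.fielddata.cache.size: 40%
cache = fielddata_cache_bytes(10 * GIB, 40)
cache_gib = cache / GIB
```

Here the cache is capped at 4 GiB, leaving the remaining 6 GiB of heap for indexing, queries and everything else.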



Performance

▶ Another easy performance boost is disabling swap if it has been enabled. In most cloud environments and images swap is turned off, but it is always a setting worth checking.

▶ To bypass the OS swap setting you can tell ES to lock its memory by adding the following to your elasticsearch.yml configuration file.

• bootstrap.mlockall: true

– To check that this value has been configured properly you can run this command.

– curl http://localhost:9200/_nodes/process?pretty

– This may cause memory warnings when ES starts up (e.g. "unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).") but you should be able to ignore these warnings. If you are concerned, turn these limits off at the OS level.
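On a multi-node cluster it is tedious to read the `_nodes/process` output by eye. The sketch below lists nodes that did not manage to lock their memory; it assumes the response has the shape `{"nodes": {<id>: {"name": ..., "process": {"mlockall": ...}}}}`, which is how I recall the 1.x API reporting it, and the function name and sample data are hypothetical.

```python
def nodes_without_mlockall(nodes_info):
    """Return the names of nodes whose process section does not report mlockall: true."""
    missing = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        if not node.get("process", {}).get("mlockall", False):
            missing.append(node.get("name", node_id))
    return sorted(missing)


# Illustrative response fragment, trimmed to the fields the check uses.
sample_response = {
    "nodes": {
        "a1": {"name": "nls-node-1", "process": {"mlockall": True}},
        "b2": {"name": "nls-node-2", "process": {"mlockall": False}},
    }
}
unlocked = nodes_without_mlockall(sample_response)
```

Any node named in the result started without locked memory, which usually points back at the memlock limit mentioned in the warning.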

▶ CentOS /etc/sysctl.conf:

– fs.file-max = 16384

▶ CentOS /etc/security/limits.conf:

– * - nofile 16384
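After changing the limits it is worth confirming what a process actually receives. A small Python sketch using the standard `resource` module (Unix only); run it as the user that owns the Elasticsearch process, since limits are per user and session. The function name is illustrative.

```python
import resource


def limit_report():
    """Return the soft and hard limits for open files and locked memory."""
    report = {}
    for label, res in (("nofile", resource.RLIMIT_NOFILE),
                       ("memlock", resource.RLIMIT_MEMLOCK)):
        soft, hard = resource.getrlimit(res)
        report[label] = (soft, hard)
    return report


for name, (soft, hard) in limit_report().items():
    # resource.RLIM_INFINITY is reported when a limit is unlimited.
    print("{}: soft={} hard={}".format(name, soft, hard))
```

The memlock line is the one that matters for the mlockall warning; the nofile line should match the value set in limits.conf after a fresh login.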



Performance

▶ Rules of thumb :-

– Due to issues with JVM heap size, individual Elasticsearch nodes don't scale well beyond 64GB of RAM. After reaching 64GB of RAM (with 31GB allocated to the Java heap), you should scale horizontally rather than vertically.

– Elasticsearch has a lot of optimizations built around fast retrieval from disk, and a lot of knobs you can tweak to ensure that the most frequently searched indices live on SSD.

– With respect to the concern about high-volume indexing causing search performance problems: if this is a problem you can use index routing to help by ensuring that data is indexed on nodes with the fastest disk (say SSD in RAID 0), then moved to nodes with spinning disk. If your cluster is search-heavy you could also increase the number of replica shards, which requires more storage but decreases search time.
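The heap rule of thumb can be written down as a one-line calculation. This sketch assumes the commonly quoted guidance of giving the heap about half of the machine's RAM, capped at the 31GB figure mentioned above; the half-of-RAM part is an assumption, not something stated in the deck.

```python
GIB = 1024 ** 3
MAX_HEAP_BYTES = 31 * GIB   # cap taken from the rule of thumb above


def suggested_heap_bytes(ram_bytes):
    """Half of RAM for the JVM heap, never more than the 31 GiB cap."""
    return min(ram_bytes // 2, MAX_HEAP_BYTES)


for ram_gib in (8, 16, 64, 128):
    heap_gib = suggested_heap_bytes(ram_gib * GIB) / GIB
    print("{} GiB RAM -> {:.0f} GiB heap".format(ram_gib, heap_gib))
```

Past 64 GiB the suggested heap stops growing, which is the point at which adding nodes pays off more than adding memory.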



Conclusions

▶ Obvious ones first :

– You can’t run this on a RaspberryPi ! (Or maybe you can – ask me outside this presentation….)

– You need log sources that matter

– You need time to develop filters and alerts that make sense to your organisation.

▶ Anything can be a logfile

– You can point Logserver at any readable file and parse the content



Questions


Atos, the Atos logo, Atos Consulting, Atos Worldgrid, Worldline, BlueKiwi, Bull, Canopy the Open Cloud Company, Yunano, Zero Email, Zero Email Certified and The Zero Email Company are registered trademarks of the Atos group. July 2015. © 2015 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.


Thanks

For more information please contact:

T +33 1 98765432

M +44 (0) [email protected]
