Post on 15-Feb-2017

Nagios Log Server - Practical Experience

Dave Williams


| 31-07-2015 | Dave Williams | © Atos GB | Managed Services | TTS

Agenda

▶ Background

▶ Why choose Nagios Log Server

▶ Implementation

▶ Source Configuration

▶ Useful things to know

▶ Initial Dashboards

▶ Final Dashboards

▶ System Performance

▶ Conclusions


Background

▶ UK based

– Mainframe (IBM & Honeywell)

– Unix (HP-UX, AIX, Solaris)

– Linux (RedHat, SLES, Debian)

– Network (CASE, 3COM, CISCO)

▶ Working for Atos

– French Outsourcing Company

– Mainframes, Unix, HPC, Security, Managed Services, Advisory Services


Background

▶ System Monitoring

– OpenView

– Netview

– Open Master

▶ Open Source Monitoring

– NetSaint on AIX

– Nagios

– Nagios XI


Why choose Nagios Log Server?

▶ Needed a log server of some nature

▶ Already built an Elasticsearch & Logstash system (not using Kibana) by hand

▶ Used Splunk in a previous life to good effect

▶ Nagios Log Server was announced last year, after Ethan and others had taken note

▶ Seemed to be a ‘cost effective’ easy build option

▶ Included authentication & access control necessary for Managed Services environment.


Implementation

▶ Because we use CentOS, we installed from source

– No great issues; the NTP requirement in the install script was overcome.

• Complete!

• 12 Aug 18:40:02 ntpdate[2930]: no server suitable for synchronization found

• ===================

• INSTALLATION ERROR!

• ===================

• Installation step failed - exiting.

• Check for error messages in the install log (install.log).

• If you require assistance in resolving the issue, please include install.log

• in your communications with Nagios Enterprises technical support.


Implementation

• The step that failed was: 'prereqs'

• # Set date/time because ssl certificates can be in the future... (fix for pypi and get-pip)

• # ntpdate -u pool.ntp.org

▶ Easily able to move data storage to a nominated filesystem


Implementation

▶ Connecting a new instance to the cluster :

– really is as simple as the manual describes

• install on new host

• connect to the web interface

• enter IP address / name of original cluster node

• enter Cluster ID of the original system

– Finish Installation.


Underlying Structure

[Diagram: Server 1 … Server N, each running Logstash, push data into the Elasticsearch cluster, which is queried by Kibana.]


Source Configuration

▶ Creation of feeds straightforward.

– First syslog, using remote syslog to accept data from other systems

– Because SNMPTT logs SNMP traps to syslog, those are recorded as well

– Could use Eventlog (NXLog) for Windows in future

▶ VMware logs (from ESXi, not the VMs):

– Add Input:

    udp {
      type => 'esxilogs'
      port => 1514
    }

– Save and apply, adjust iptables if required

– Follow this VMware configuration guide to set up your ESXi hosts to log to udp://nagios.log.server.ip:1514

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007329

– Or read https://assets.nagios.com/downloads/nagios-log-server/docs/Sending-ESXi-Logs-To-Nagios-Log-Server.pdf
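Before touching the ESXi hosts, it can help to confirm that the new UDP input is reachable at all. A minimal Python sketch, assuming the input above listens on port 1514; the helper name `send_test_syslog` and the message text are made up for illustration, and the address has to be replaced with that of your own Log Server node:

```python
import socket


def send_test_syslog(host, port=1514, message="esxi-test: hello from test sender"):
    """Send one syslog-style line over UDP and return the bytes that were sent."""
    # <14> is the syslog priority: facility user (1) * 8 + severity info (6)
    payload = ("<14>" + message).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload


# Example (address is a placeholder):
# send_test_syslog("192.0.2.10")
```

If iptables is still blocking the port, the datagram is silently dropped, so the check is whether the test line shows up in the dashboard with type esxilogs.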


Source Configuration

For NetFlow use this:

Logstash has a native NetFlow codec for v5 and v9. It can't handle high volume (I'm guessing no more than a few hundred flows per second).

    udp {
      host => "0.0.0.0"
      port => 2055
      codec => netflow {
        cache_ttl => 1
        versions => [ 5, 9 ]
      }
      type => "netflow"
    }

– Save and apply, adjust iptables if required
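To test the NetFlow input without a real exporter, a hand-built v5 datagram is enough. The sketch below packs the standard NetFlow v5 layout (24-byte header plus one 48-byte flow record); the function name `build_netflow_v5` and all field values are illustrative assumptions, not something from the talk:

```python
import socket
import struct
import time

# NetFlow v5 layout: 24-byte header followed by 48-byte flow records.
HEADER = struct.Struct("!HHIIIIBBH")
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def build_netflow_v5(src_ip, dst_ip, src_port, dst_port, packets, octets):
    """Return one NetFlow v5 datagram containing a single TCP flow record."""
    now = time.time()
    header = HEADER.pack(
        5,                      # version
        1,                      # number of records in this datagram
        60000,                  # sysUptime in ms (arbitrary test value)
        int(now),               # unix_secs
        int((now % 1) * 1e9),   # unix_nsecs
        1,                      # flow sequence
        0, 0,                   # engine type, engine id
        0,                      # sampling interval
    )
    record = RECORD.pack(
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        socket.inet_aton("0.0.0.0"),  # next hop
        0, 0,                   # input / output interface index
        packets, octets,
        59000, 60000,           # sysUptime at first / last packet of the flow
        src_port, dst_port,
        0,                      # pad1
        0x18,                   # TCP flags (PSH + ACK)
        6,                      # IP protocol 6 = TCP
        0,                      # type of service
        0, 0,                   # source / destination AS
        24, 24,                 # source / destination prefix mask
        0,                      # pad2
    )
    return header + record


# Example: send one test flow to the input defined above (address is a placeholder).
# pkt = build_netflow_v5("10.0.0.1", "10.0.0.2", 12345, 80, 10, 1500)
# socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(pkt, ("192.0.2.10", 2055))
```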


Source Configuration (Pi)

Source: http://www.paluch.biz/blog/134-capturing-and-visualizing-sensor-data-using-the-elk-stack.html

▶ IoT (Internet of Things) simple solution:

– RasPi distance sensor:

– The Raspberry Pi sends its data regularly to Logstash through the TCP input as JSON. JSON is about the simplest data format available on IoT platforms.

    input {
      tcp {
        port => 9400
        codec => "json_lines"
      }
    }
    output {
      elasticsearch_http {
        host => "localhost"
        port => 9200
        index => "distance-%{+YYYY.MM.dd}"
      }
    }

– Python sender running on the Pi:

    import json
    import socket
    import time

    # Sensor helper module from the blog post linked above
    from distancemeter import get_distance, cleanup

    # Logstash TCP/JSON host
    JSON_PORT = 9400
    JSON_HOST = '192.168.55.34'

    if __name__ == '__main__':
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((JSON_HOST, JSON_PORT))
            while True:
                distance = get_distance()
                data = {'message': 'distance %.1f cm' % distance,
                        'distance': distance,
                        'hostname': socket.gethostname()}
                # json_lines codec: one JSON document per line
                s.sendall((json.dumps(data) + '\n').encode('utf-8'))
                print("Received distance = %.1f cm" % distance)
                time.sleep(0.2)
        except KeyboardInterrupt:
            # interrupt
            print("Program interrupted")
        finally:
            s.close()
            cleanup()  # release sensor resources


Source Configuration (The Force Awakens)


Useful things to know

▶ How do I install Logstash plugins?

– /usr/local/nagioslogserver/logstash/bin/plugin install logstash-codec-cef

– (Installs the ArcSight CEF codec)

▶ Check the latest upgrade documentation for how to pause shard allocation:

– https://assets.nagios.com/downloads/nagios-log-server/docs/Upgrade-Instructions-For-Nagios-Log-Server.pdf

– For large clusters this makes a real difference to how long a rolling update takes

▶ One of my favourite filters:

    if [severity_label] == "Notice" and [program] == "sudo" {
      drop {}
    }
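On pausing shard allocation during a rolling update: the upgrade document is the authority for your version, but as a rough sketch, Elasticsearch exposes this through the transient cluster setting `cluster.routing.allocation.enable`. The helper names below are hypothetical and the base URL assumes a node answering on localhost:9200:

```python
import json
from urllib.request import Request, urlopen


def allocation_payload(enable):
    """JSON body that turns shard allocation off ("none") or back on ("all")."""
    value = "all" if enable else "none"
    return json.dumps({"transient": {"cluster.routing.allocation.enable": value}})


def set_allocation(enable, base_url="http://localhost:9200"):
    """PUT the transient setting to the local Elasticsearch node."""
    req = Request(
        base_url + "/_cluster/settings",
        data=allocation_payload(enable).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Typical sequence around a node restart:
# set_allocation(False)   # before stopping the node
# ... upgrade and restart the node ...
# set_allocation(True)    # once the node has rejoined the cluster
```

With allocation paused, the cluster does not start copying shards around the moment a node goes away, which is where the time saving on large clusters comes from.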


Useful things to know

▶ Get used to looking at curl -XGET 'http://localhost:9200/'

▶ Need the cluster state?

– # curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

    {
      "cluster_name" : "80e9022e-f73f-429e-8927-xxxxxxxxxx",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 86,
      "active_shards" : 136,
      "relocating_shards" : 0,
      "initializing_shards" : 6,
      "unassigned_shards" : 30
    }
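The raw health output is easier to read at a glance as a single line. A small Python sketch that summarises the health document shown on this slide; `summarise_health` is a made-up helper, and the share of active shards is computed here as active / (active + initializing + unassigned), which is my own choice of formula:

```python
import json

# Health output as shown on the slide (cluster name already masked there).
SAMPLE = """
{
  "cluster_name": "80e9022e-f73f-429e-8927-xxxxxxxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 86,
  "active_shards": 136,
  "relocating_shards": 0,
  "initializing_shards": 6,
  "unassigned_shards": 30
}
"""


def summarise_health(health):
    """Return a one-line summary including the share of shards that are active."""
    total = (health["active_shards"]
             + health["initializing_shards"]
             + health["unassigned_shards"])
    pct = 100.0 * health["active_shards"] / total if total else 100.0
    return "%s: %d nodes, %d/%d shards active (%.1f%%)" % (
        health["status"], health["number_of_nodes"],
        health["active_shards"], total, pct)


print(summarise_health(json.loads(SAMPLE)))
# prints: yellow: 3 nodes, 136/172 shards active (79.1%)
```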


Useful things to know

▶ Monitoring the Nagios Log Server

– Other presentations will cover this topic: see Eric Loyd, Track 1 @ 2:30 today

▶ But mainly query :9200 locally (via NRPE) and then use check_procs for the appropriate processes.

▶ To uninstall manually:

– Stop all of the relevant NLS processes (elasticsearch, logstash, and httpd) and remove the following directories:

– rm -rf /usr/local/nagioslogserver

– rm -rf /var/www/html/nagioslogserver

– You can now do a ./fullinstall
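For the local :9200 check via NRPE mentioned on this slide, a plugin only has to translate the cluster status into a Nagios exit code. A minimal sketch, not the plugin used in the talk; the green/yellow/red to OK/WARNING/CRITICAL mapping and the function names are assumptions you may want to adjust:

```python
import json
from urllib.request import urlopen

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_TO_EXIT = {"green": OK, "yellow": WARNING, "red": CRITICAL}
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(health):
    """Map a cluster health document to (exit_code, plugin_output)."""
    status = health.get("status")
    code = STATUS_TO_EXIT.get(status, UNKNOWN)
    text = "%s - cluster status %s, %s unassigned shards" % (
        LABELS[code], status, health.get("unassigned_shards", "?"))
    return code, text


def main(url="http://localhost:9200/_cluster/health"):
    """Query the local node, print one line of plugin output, return the exit code."""
    try:
        with urlopen(url, timeout=5) as resp:
            health = json.loads(resp.read().decode("utf-8"))
    except Exception as exc:  # connection refused, timeout, invalid JSON
        print("CRITICAL - cannot query %s: %s" % (url, exc))
        return CRITICAL
    code, text = evaluate(health)
    print(text)
    return code


# To run as a plugin, end the file with:
#   import sys; sys.exit(main())
```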


Useful things to know

▶ If you run equipment that has to output syslog on port 514 then Log Server can cope (privileged port access). NetApp is an example.

– There’s a document for this: https://assets.nagios.com/downloads/nagios-log-server/docs/Listening-On-Privileged-Ports-With-Nagios-Log-Server.pdf

– You can change logstash to run as the root user.

– Open /etc/sysconfig/logstash and find the line: LS_USER=nagios

– Change this line to read LS_USER=root

– Restart the logstash service: # service logstash restart


Useful things to know

▶ Alternative method of log shipping :-

– Was lumberjack, now logstash-forwarder (still uses the lumberjack protocol)

• Encrypted shipping of compressed logs

• Low impact compared to a full Logstash install

• Use self signed certificates.

• Runs in EC2 micro instances

▶ CentOS 6

– wget http://packages.elasticsearch.org/logstashforwarder/centos/logstash-forwarder-0.3.1-1.x86_64.rpm

– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm

▶ CentOS 5

– wget http://download.elasticsearch.org/logstash-forwarder/packages/logstash-forwarder-0.3.1-1.x86_64.rpm

– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm


Useful things to know

▶ Logstash plugins – over 180 at https://github.com/logstash-plugins

– Nice thing to know:

    output {
      if [type] == "syslog"
         and [program] == "jenkins"
         and [job] == "Install on Cluster"
         and "_grokparsefailure" not in [tags]
      {
        nagios_nsca {
          host => "nagios.example.com"
          port => 5667
          send_nsca_config => "/etc/send_nsca.cfg"
          message_format => "%{job} %{repo}"
          nagios_host => "jenkins"
          nagios_service => "deployed %{repo}"
          nagios_status => "2"
        }
      } # if type=syslog, program=jenkins, job="Install on Cluster"
    } # output
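What ends up at Nagios is an ordinary passive check result. As an illustration only, the sketch below builds the tab-separated line (host, service, return code, plugin output) that `send_nsca` reads on standard input, using the values from the nagios_nsca block above; the helpers `render` and `nsca_line` and the sample event are hypothetical, and the exact behaviour of the Logstash plugin may differ:

```python
import re


def render(template, event):
    """Expand Logstash-style %{field} references from an event dict.

    Unknown fields are left as the literal %{field} text.
    """
    return re.sub(r"%\{(\w+)\}",
                  lambda m: str(event.get(m.group(1), m.group(0))),
                  template)


def nsca_line(event):
    """Build the tab-separated passive check line for one matching event."""
    host = "jenkins"                                  # nagios_host
    service = render("deployed %{repo}", event)       # nagios_service
    status = 2                                        # nagios_status, 2 = CRITICAL
    output = render("%{job} %{repo}", event)          # message_format
    return "%s\t%s\t%d\t%s\n" % (host, service, status, output)


event = {"type": "syslog", "program": "jenkins",
         "job": "Install on Cluster", "repo": "webapp"}
print(repr(nsca_line(event)))
```

Because the service name contains the repository, Nagios needs a matching passive service defined for each repository that can appear in the logs.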


Initial Dashboards

▶ Apache dashboard :-

Hmm, what are the 404s?


Initial Dashboards

▶ Zoom in by clicking on the 404 part of the pie chart:

Ah! A good idea to find win40.jpg then.


Final Dashboards


Performance

▶ A good way to control ES memory usage is to set the indices fielddata cache size. Limiting this cache makes sense because you rarely need to retrieve logs that are older than a few days. By default ES holds field data for old indices in memory and never lets it go, so unless you have unlimited memory it makes sense to limit the cache.

▶ To limit the cache size simply add the following value anywhere in your custom elasticsearch.yml configuration file. This setting and adjusting the Java heap memory size should be enough to get started but there are a few other things that might be worth checking.

▶ indices.fielddata.cache.size: 40%
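The percentage is taken from the JVM heap, so the absolute limit moves with the heap size. A quick back-of-envelope helper (the function name and the example heap sizes are illustrative, not from the talk):

```python
def fielddata_cache_gib(heap_gib, percent=40):
    """Upper bound of the fielddata cache, in GiB, for a given JVM heap size."""
    return heap_gib * percent / 100


# With a 31 GiB heap, 40% leaves at most 12.4 GiB for fielddata.
print(fielddata_cache_gib(31))
# With an 8 GiB heap the same setting allows 3.2 GiB.
print(fielddata_cache_gib(8))
```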


Performance

▶ Another idea worth looking at for an easy performance boost is disabling swap if it has been enabled. In most cloud environments and images swap is turned off, but it is always a setting worth checking.

▶ To bypass the OS swap setting you can configure a no-swap value in ES by adding the following to your elasticsearch.yml configuration file.

• bootstrap.mlockall: true

– To check that this value has been configured properly you can run this command.

– curl http://localhost:9200/_nodes/process?pretty

– This may cause memory warnings when ES starts up (e.g. "Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).") but you should be able to ignore these warnings. If you are concerned, turn these limits off at the OS level.

▶ CentOS /etc/sysctl.conf:

– fs.file-max = 16384

▶ CentOS /etc/security/limits.conf:

– * - nofile 16384
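On a multi-node cluster the `_nodes/process` output gets long, so it is easier to let a script pick out the nodes where `bootstrap.mlockall` did not take effect. A minimal sketch, assuming the response has the usual shape of a `nodes` map whose entries carry a `process.mlockall` flag; the helper name and the sample data are made up:

```python
def nodes_without_mlockall(nodes_process):
    """Return the names of nodes whose process section does not report mlockall: true."""
    missing = []
    for node_id, node in nodes_process.get("nodes", {}).items():
        if not node.get("process", {}).get("mlockall", False):
            # Fall back to the node id if the node has no name field.
            missing.append(node.get("name", node_id))
    return sorted(missing)


# Hypothetical response, trimmed to the fields that matter here.
sample = {
    "nodes": {
        "aaa": {"name": "nls-1", "process": {"mlockall": True}},
        "bbb": {"name": "nls-2", "process": {"mlockall": False}},
    }
}
print(nodes_without_mlockall(sample))
# prints: ['nls-2']
```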


Performance

▶ Rules of thumb :-

– Due to issues with JVM heap size, individual Elasticsearch nodes don't scale well beyond 64GB of RAM. After reaching 64GB of RAM (with 31GB allocated to the Java heap), you should scale horizontally rather than vertically.

– Elasticsearch has a lot of optimizations built around fast retrieval from disk, and a lot of knobs you can tweak to ensure that the most frequently searched indices live on SSD.

– With respect to the concern about high-volume indexing causing search performance problems: if this is a problem you can use index routing to help by ensuring that data is indexed on nodes with the fastest disk (say SSD in RAID 0), then moved to nodes with spinning disk. If your cluster is search-heavy you could also increase the number of replica shards, which requires more storage but decreases search time.
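The first rule of thumb can be written down as a one-liner. The sketch assumes the common guideline of giving the heap half of physical RAM (that part is my assumption, not stated on the slide) and applies the 31GB cap quoted above; the function name is hypothetical:

```python
def recommended_heap_gb(ram_gb, cap_gb=31):
    """Half of physical RAM for the JVM heap, capped at cap_gb."""
    return min(ram_gb / 2, cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

Past 64GB of RAM the heap figure stops growing, which is the point at which adding nodes pays off more than adding memory.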


Conclusions

▶ Obvious ones first :

– You can’t run this on a RaspberryPi ! (Or maybe you can – ask me outside this presentation….)

– You need log sources that matter

– You need time to develop filters and alerts that make sense to your organisation.

▶ Anything can be a logfile

– You can point Logserver at any readable file and parse the content


Questions


Atos, the Atos logo, Atos Consulting, Atos Worldgrid, Worldline, BlueKiwi, Bull, Canopy the Open Cloud Company, Yunano, Zero Email, Zero Email Certified and The Zero Email Company are registered trademarks of the Atos group. July 2015. © 2015 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.


Thanks

For more information please contact:

T +33 1 98765432

M +44 (0) [email protected]