
Supermicro Server Monitoring with SuperDoctor 5 and Nagios Using SNMP Protocol

Version 1.1b


Release: v 1.1b

Document release date: 11/15/2013

Copyright © 2013 Super Micro Computer, Inc.

All Rights Reserved.

Legal Notices

This software and documentation is the property of Super Micro Computer, Inc., and supplied only under a license. Any use or reproduction of this software is not allowed, except as expressly permitted by the terms of said license.

Information in this document is subject to change without notice.

Trademark Notice

All trademarks and copyrights referred to are the property of their respective owners.


Revision History

Date          Rev    Description
Jul-4-2011    1.0    1. Initial Document.
Sep-20-2012   1.1    1. Reorganize MIB structures.
Jul-12-2013   1.1a   1. Change product name to SuperDoctor 5 (SD5).
Nov-15-2013   1.1b   1. Changed default install folder of SD5.


Contents

1. Introduction
2. Prerequisites
    2.1 Installing Java Runtime Environment (JRE)
    2.2 Installing the check_snmp_health Plug-in
    2.3 SuperDoctor 5 (SD5)
    2.4 Setting Up SNMP Service in Linux
    2.5 Installing Smartctl Utility
3. Getting Started
    3.1 Defining the Hosts
    3.2 Defining a Command
    3.3 Defining the Services
    3.4 Validating the Nagios Configurations
    3.5 Restarting Nagios Service
    3.6 Connecting to the Nagios Web UI
4. Using check_snmp_health
    4.1 -h or --help
    4.2 -bc
    4.3 -cn
    4.4 -co
    4.5 -d
    4.6 -i
    4.7 -t
    4.8 -to
5. Appendix
    5.1 SD5 FAQ
    5.2 How to Reset Memory Error Status?
    5.3 Can I Disable the SD5 Web?
    5.4 Can I Disable the NRPE Protocol?
    5.5 No Health Information from SNMP Was Fetched
Contacting Supermicro


1. Introduction

This Nagios plug-in, named check_snmp_health, uses SNMP to communicate with SuperDoctor 5 and check the health of the following hardware components:

    Fan
    Processor temperature
    System temperature
    DDR3 temperature
    Power supply failure
    Voltage
    Chassis intrusion
    Physical disk failure
    Memory failure (Linux platform only) [1]
    Processor failure (Linux platform only) [2]
    RAID health (LSI MegaRAID 2108 and 2208 controllers only)

The results of executing the check_snmp_health plug-in are shown on the Nagios Web UI.

[1] The memory health check covers correctable ECC (CECC) and uncorrectable ECC (UECC) errors. Both kinds must be supported by the BIOS, and this function is currently available only on Linux platforms.

[2] The processor failure check must be supported by the BIOS and is currently available only on Linux platforms.


2. Prerequisites

2.1 Installing Java Runtime Environment (JRE)

The check_snmp_health plug-in is written in Java. To run the plug-in, install JRE 1.6 or above on your Nagios server and set the JAVA_HOME environment variable to the JRE installation path.

2.2 Installing the check_snmp_health Plug-in

1. Unzip the package file SSMServerPlugin-1.0-build.[xyz].zip to the /usr/local/nagios/libexec/SSMServerPlugin folder, assuming that Nagios is installed in /usr/local/nagios.

2. Use the command “chmod +x check_snmp_health.sh” to make the check_snmp_health plug-in executable.

3. Execute the check_snmp_health.sh program without providing any arguments. If the JRE and the plug-in are installed correctly, the following error message appears:

“Invalid options. Three options must be provided for -i (--ip).”
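
The installation steps above can be sanity-checked before the plug-in is wired into Nagios. The following is a minimal Python sketch and is not part of the Supermicro package. The helper name preflight is hypothetical, and the check for a launcher at bin/java under JAVA_HOME is an assumption about the JRE layout.

```python
import os

def preflight(plugin_path, java_home):
    """Return a list of problems found; an empty list means the basics look fine.

    Hypothetical helper: it only checks what sections 2.1 and 2.2 describe.
    """
    problems = []
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isfile(os.path.join(java_home, "bin", "java")):
        # Assumption: a JRE keeps its launcher at $JAVA_HOME/bin/java.
        problems.append("no java launcher under JAVA_HOME")
    if not os.path.isfile(plugin_path):
        problems.append("plug-in script not found")
    elif not os.access(plugin_path, os.X_OK):
        problems.append("plug-in script is not executable (run chmod +x)")
    return problems

if __name__ == "__main__":
    plugin = "/usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh"
    for problem in preflight(plugin, os.environ.get("JAVA_HOME")):
        print(problem)
```

An empty result only means that the files are in place. Running check_snmp_health.sh without arguments, as in step 3, remains the actual test.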

2.3 SuperDoctor 5 (SD5)

The check_snmp_health plug-in is designed to work with SuperDoctor 5, which implements an SNMP extension to support Supermicro MIBs (see “5.3 Supermicro MIB” in SuperDoctor 5 User's Guide for details).

To install SuperDoctor 5, refer to “Chapter 2 Setting Up SD5” in SuperDoctor 5 User's Guide. To quickly install SD5 on multiple servers, see “2.1.4 Tips for Deploying a Large Number of SD5s” in SuperDoctor 5 User's Guide.

2.4 Setting Up SNMP Service in Linux

To support SNMP, the NET-SNMP service must be installed and configured on your Linux system. For the installation and configuration of the NET-SNMP service, refer to “5.2 Setup SNMP Service in Linux” in SuperDoctor 5 User's Guide.


2.5 Installing Smartctl Utility

SD5 uses an open source program named smartctl to check the health of physical disks. To enable this function, you need to install the smartctl program manually. Download the program from:

http://sourceforge.net/apps/trac/smartmontools/wiki/Download

Many Linux distributions provide pre-compiled packages to simplify the installation of smartctl. For example, on CentOS 5.x, smartctl is part of the smartmontools package and can be installed with the yum command:

yum install smartmontools


3. Getting Started

3.1 Defining the Hosts

Define a host for each server running SD5. In this example, the host definition is written in the host1.cfg file.

define host {
    host_name    10.134.12.36
    alias        10.134.12.36
    address      10.134.12.36
    use          linux-server
}

Edit the $NAGIOS_HOME$/etc/nagios.cfg file to include host1.cfg.

3.2 Defining a Command

Define a command for check_snmp_health. In this example, the command is written in the commands.cfg file.

define command {
    command_name    check_snmp_health_all
    command_line    /usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh -i $HOSTADDRESS$ -t $ARG1$
}
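
To make the role of the $HOSTADDRESS$ and $ARG1$ macros concrete, here is a small Python sketch of how Nagios fills them in when a service refers to the command as check_snmp_health_all!f. The function name expand_command is hypothetical, and the sketch is a simplification: it ignores details such as escaped "!" characters that Nagios itself handles.

```python
def expand_command(command_line, host_address, check_command):
    """Expand $HOSTADDRESS$ and $ARGn$ for a check_command value.

    check_command has the form "command_name!arg1!arg2...".
    """
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded

COMMAND_LINE = (
    "/usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh"
    " -i $HOSTADDRESS$ -t $ARG1$"
)

print(expand_command(COMMAND_LINE, "10.134.12.36", "check_snmp_health_all!f"))
```

For the host 10.134.12.36, the printed command line ends in "-i 10.134.12.36 -t f", which is the command Nagios would run for a fan check.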


3.3 Defining the Services

Define the service to be checked by Nagios. In this example, the service is written in the host1.cfg file.

define service {
    use                    local-service
    service_description    check_snmp_health
    host_name              10.134.12.36
    check_command          check_snmp_health_all!a
}

You can also define a service that checks only a particular type of monitored item, e.g., fan, disk, or memory, by specifying the type argument:

    a: all (checks every monitored item)
    w: power
    f: fan
    c: current
    d: disk
    m: memory
    t: temperature
    v: voltage
    s: switch
    p: processor
    r: raid

For example, the service used to check fan status is shown below:

define service {
    use                    local-service
    host_name              10.134.12.36
    service_description    check fan status
    check_command          check_snmp_health_all!f
}

You can also check multiple types of monitored items by listing the letter of each type. A service used to check disk, fan, and voltage status is shown below.


define service {
    use                    local-service
    host_name              10.134.12.36
    service_description    check disk, fan, and voltage status
    check_command          check_snmp_health_all!dfv
}
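
The type argument is a string of one-letter codes, so a value such as dfv can be decoded letter by letter. The Python sketch below illustrates this using the table of type letters in this section. The names TYPE_NAMES and describe_types are hypothetical, and the plug-in's own validation may differ.

```python
TYPE_NAMES = {
    "a": "all", "w": "power", "f": "fan", "c": "current", "d": "disk",
    "m": "memory", "t": "temperature", "v": "voltage", "s": "switch",
    "p": "processor", "r": "raid",
}

def describe_types(type_arg):
    """Translate a type argument such as "dfv" into item names."""
    unknown = [letter for letter in type_arg if letter not in TYPE_NAMES]
    if unknown:
        raise ValueError(f"unknown type letter(s): {''.join(unknown)}")
    return [TYPE_NAMES[letter] for letter in type_arg]

print(describe_types("dfv"))
```

A check like this can catch a mistyped letter in a service definition before Nagios runs the plug-in.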

3.4 Validating the Nagios Configurations


3.5 Restarting Nagios Service

3.6 Connecting to the Nagios Web UI

The results are shown on the Nagios Web UI as below.


4. Using check_snmp_health

4.1 -h or --help

The -h or --help option shows the help menu, as shown below.

4.2 -bc

Use the -bc option to specify user-defined thresholds for memory and processor checking. The argument is a comma-separated list of thresholds, each in the following format:

[type][duration][fail count],...

[type]:
    m: correctable single-bit ECC errors.
    M: uncorrectable ECC errors.
    p: correctable processor failures.
    P: uncorrectable processor failures.

[duration]: A number followed by one of the following units:
    d: day
    h: hour
    m: minute
    s: second

[fail count]: The acceptable number of failures. A critical status is triggered only when the failure count is greater than this value.


Example: The following memory threshold allows up to four single-bit ECC errors per 1 GB of RAM within one day (24 hours) (i.e., m1d4) and no uncorrectable ECC errors within one hour (i.e., M1h0).

-bc m1d4,M1h0

Note: To reset the memory error status, refer to “5.2 How to Reset Memory Error Status?” for more information.

Example: The following processor threshold allows up to two correctable processor failures within 30 days (i.e., p30d2) and no uncorrectable processor failures within one hour (i.e., P1h0).

-bc p30d2,P1h0
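
The threshold format can be illustrated with a short parser. This is a Python sketch based only on the format and examples in this section, not the plug-in's actual implementation. The names Threshold, parse_bc, and is_critical are hypothetical, and the "per 1 GB of RAM" scaling mentioned for memory is not modeled.

```python
import re
from typing import NamedTuple

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
THRESHOLD = re.compile(r"([mMpP])(\d+)([dhms])(\d+)")

class Threshold(NamedTuple):
    kind: str            # m, M, p, or P
    window_seconds: int  # length of the check period
    allowed: int         # acceptable number of failures

def parse_bc(argument):
    """Parse a -bc argument such as "m1d4,M1h0" into Threshold tuples."""
    thresholds = []
    for item in argument.split(","):
        match = THRESHOLD.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"malformed threshold: {item!r}")
        kind, amount, unit, allowed = match.groups()
        thresholds.append(
            Threshold(kind, int(amount) * UNIT_SECONDS[unit], int(allowed))
        )
    return thresholds

def is_critical(threshold, failure_count):
    """Critical only when failures exceed the acceptable count."""
    return failure_count > threshold.allowed

for threshold in parse_bc("m1d4,M1h0"):
    print(threshold)
```

With m1d4, four errors in a day are still acceptable and a fifth is critical. With M1h0, a single uncorrectable error within the hour is already critical.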

4.3 -cn

Use the -cn option to specify user-defined thresholds for checking the number of processors, memory modules, and hard disks. The argument format is as follows:

[type][number],...

[type]:
    p: processor.
    m: memory.
    d: hard disk drives.

[number]: The expected number of processors, memory modules, or hard disks. The status is OK only when the specified number equals the number of processors, memory modules, or hard disks installed on the monitored system.

Example: The following arguments indicate that the monitored system has one processor, four memory DIMMs, and one hard disk drive.

-cn p1,m4,d1
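
The expected-count check can be sketched in the same way. The Python below is illustrative only: parse_cn and count_mismatches are hypothetical names, and the installed counts are supplied by hand here, whereas the plug-in reads them from SD5 over SNMP.

```python
def parse_cn(argument):
    """Parse a -cn argument such as "p1,m4,d1" into {type: expected count}."""
    names = {"p": "processor", "m": "memory", "d": "hard disk"}
    expected = {}
    for item in argument.split(","):
        item = item.strip()
        kind, number = item[:1], item[1:]
        if kind not in names or not number.isdigit():
            raise ValueError(f"malformed count: {item!r}")
        expected[names[kind]] = int(number)
    return expected

def count_mismatches(expected, installed):
    """Return the component names whose installed count differs."""
    return [name for name, count in expected.items()
            if installed.get(name) != count]

expected = parse_cn("p1,m4,d1")
# Hand-written sample: two of the four expected DIMMs are detected.
print(count_mismatches(expected, {"processor": 1, "memory": 2, "hard disk": 1}))
```

Any name in the returned list corresponds to a component count that would prevent an OK status.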

4.4 -co

Use the -co option to specify an SNMP community string.


4.5 -d

Use the -d option to show detailed information about the monitoring logic. This option is for debugging purposes only and should not be used in Nagios.

4.6 -i

Use the -i option to specify the host name or IP address to be checked.

4.7 -t

Use the -t option to specify the type of monitored items to be checked. The default value is “all”.

Use the "-t r" option as shown below to check the health status of a RAID controller, including the states of its components such as battery backup units, virtual drives, and hard disks.


The following figure indicates one virtual drive and one hard disk are alerted, and the

health status of the RAID controller is thus critical.

The following figure shows the RAID controller is critical due to the absent BBU.

4.8 -to

Use the -to option to specify the SNMP timeout value. The default value is 15 seconds. You may need to increase the timeout value if the check_snmp_health plug-in cannot retrieve all MIBs. For example, if the host to be checked has several hard disks, checking for physical disk failures may take longer than 15 seconds and cause the check_snmp_health plug-in to time out. To avoid this, specify a larger timeout value by using the -to option.
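
When the plug-in is called from a script instead of a Nagios command definition, the options in this chapter can be assembled into an argument list. The following Python sketch shows one way to do that. The function name build_argv is hypothetical, and passing the timeout as a plain number of seconds after -to is an assumption; check the plug-in's help output for the exact value format.

```python
def build_argv(host, types="a", community=None, timeout_seconds=None,
               script="/usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh"):
    """Assemble an argument list from the options described in this chapter."""
    argv = [script, "-i", host, "-t", types]
    if community is not None:
        argv += ["-co", community]
    if timeout_seconds is not None:
        # Assumption: -to takes a plain number of seconds.
        argv += ["-to", str(timeout_seconds)]
    return argv

# A disk check with a longer timeout for a host with many hard disks.
print(build_argv("10.134.12.36", types="d", timeout_seconds=60))
```

The resulting list can be passed to subprocess.run on the Nagios server. Keeping the arguments as a list avoids shell quoting problems with community strings.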


5. Appendix

5.1 SD5 FAQ

Q: I see some error messages in the [SD5 install folder]/wrapper.log file. Do you have a list of all error messages and solutions?

A: Here are the known error messages.

No.: A0001
Message: HealthInfo initialization error. com.supermicro.ssm.tmhealth.model.MotherboardModelNotExistException:
Root Cause: The SD5 is run on a non-Supermicro server. Health information is only available on Supermicro servers.
Solution: Install SD5 on Supermicro servers.

No.: A0002
Message: Unable to start JVM: No such file or directory
Root Cause: The SD5 cannot find the required Java Virtual Machine (JVM) located in the [SD5 install folder]/jre folder.
Solution: Reinstall the SD5.

5.2 How to Reset Memory Error Status?

Q: An uncorrectable ECC error was raised on a server and I have manually replaced the problematic memory module. However, the check_snmp_health plug-in still shows a critical status.

A: The memory error check is based on two conditions:

1. There are memory error logs in the BIOS event log.

2. The time at which a log entry was generated falls within the check period.

For example, suppose that you use the -bc M1d0 option (i.e., any uncorrectable ECC error occurring within one day causes a critical state) to check for memory errors. Once an uncorrectable ECC error has been found, the status remains critical for one day, even after the problematic memory has been manually replaced. To get an OK status immediately after repairing the memory, follow these steps:


1. Clear BIOS event logs from the BIOS setup menu.

2. Delete the file [SD5 install folder]/config/bioslogs.txt
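
The behavior described in this answer follows from a time-window check over logged errors. The Python sketch below models that logic under simplifying assumptions: memory_status is a hypothetical name, the error log is a plain list of timestamps, and the real plug-in works from the BIOS event log and the bioslogs.txt file.

```python
from datetime import datetime, timedelta

def memory_status(error_times, now, window, allowed):
    """Return "CRITICAL" if more than `allowed` logged errors fall in the window."""
    recent = [t for t in error_times if now - window <= t <= now]
    return "CRITICAL" if len(recent) > allowed else "OK"

now = datetime(2013, 11, 15, 12, 0)
error_log = [now - timedelta(hours=3)]  # one uncorrectable ECC error

# Equivalent of -bc M1d0: a one-day window with zero acceptable errors.
print(memory_status(error_log, now, timedelta(days=1), 0))
print(memory_status([], now, timedelta(days=1), 0))
print(memory_status(error_log, now + timedelta(days=2), timedelta(days=1), 0))
```

The first call reports a critical status because the logged error is still inside the window, even if the module has already been replaced. The second call models the state after the logs are cleared, and the third models waiting until the window has passed; both return OK.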

5.3 Can I Disable the SD5 Web?

Q: I only use the check_snmp_health plug-in to check the health of a host and do not use a browser to view the sensor readings via the SD5 Web. Can I disable it?

A: Yes, the SD5 Web can be disabled during installation. At the Setup SuperDoctor 5 Web step, select 2- No to disable the SD5 Web. See the figure below.

You can also manually disable the SD5 Web after installation. Use a text editor to open the [SD5 install folder]/plugins/builtin/web/plugin.cfg file, as shown below.


Change the enabled attribute from 1 to 0 and save the file. Exit the text editor and restart the SD5 to apply the setting. When the SD5 Web is disabled, TCP ports 8181 and 8444 are not used.

5.4 Can I Disable the NRPE Protocol?

Q: I only use the check_snmp_health plug-in to check the health of a host and do not use the NRPE protocol to communicate with the SD5. Can I disable NRPE protocol support?

A: The SD5 supports three NRPE connection modes:

    Mode A: Plain text with allowed IP (port 5333)
    Mode B: Anonymous SSL connection with allowed IP (port 5666)
    Mode C: SSL encryption with a public key infrastructure (port 5999)

Because the NRPE protocol is the default connection protocol provided by the SD5, it cannot be completely turned off. At least one connection mode must be specified. To modify the connection mode settings, refer to “3.2 SuperDoctor 5 Connection Modes” in SuperDoctor 5 User's Guide.

5.5 No Health Information from SNMP Was Fetched

Q: I execute the command check_snmp_health.sh -i [host_ip] and the result shows “No health information from SNMP was fetched.” What is the problem?

A: Usually this message indicates that the host to be checked does not support the Supermicro MIB. Possible reasons include:


    The operating system's built-in SNMP service (i.e., Net-SNMP) is not running.
    The SD5 is not running.
    The SD5 SNMP extension is not correctly installed.
    The SNMP port is blocked by a firewall.
    The default timeout value is not long enough for a health check.


Contacting Supermicro

Headquarters
Address: Super Micro Computer, Inc.
         980 Rock Ave.
         San Jose, CA 95131 U.S.A.
Tel: +1 (408) 503-8000
Fax: +1 (408) 503-8008
Email: [email protected] (General Information)
       [email protected] (Technical Support)
Web Site: www.supermicro.com

Europe
Address: Super Micro Computer B.V.
         Het Sterrenbeeld 28, 5215 ML
         's-Hertogenbosch, The Netherlands
Tel: +31 (0) 73-6400390
Fax: +31 (0) 73-6416525
Email: [email protected] (General Information)
       [email protected] (Technical Support)
       [email protected] (Customer Support)

Asia-Pacific
Address: Super Micro Computer, Inc.
         3F, No. 150, Jian 1st Rd.
         Zhonghe Dist., New Taipei City 23511
         Taiwan (R.O.C)
Tel: +886-(2) 8226-3990
Fax: +886-(2) 8226-3992
Web Site: www.supermicro.com.tw
Technical Support:
Email: [email protected]
Tel: +886-(2)-8226-3990
