Linux Clusters Institute: Monitoring
Kyle Hutson – System Administrator for Kansas State University kylehutson@ksu.edu
Why monitoring?
• How should we get notified?
• What should we monitor?
• How often should we monitor?
• Internal vs external
• Informational vs urgent
2 May 20, 2015
How should we get notified?
• Urgent:
  • Email or text
  • Define this carefully
• Not-so-urgent:
  • Web page updates
    • Especially helpful for historical data
  • Email (filtered)
  • End-user support requests
What should we monitor?
• External: Basic connectivity
• Internal:
  • The urgent
    • Power status
    • Scheduler/head node status
    • Cold-aisle temperatures
    • Storage system
Lots of little things
• Overall cluster health
  • Queue size
  • Overall network usage
  • Number of responding nodes
• Individual node health
  • Load average
  • Memory used
  • Network bandwidth
  • CPU usage
  • Temperature
• Storage
  • Capacity
  • Degraded status
  • Connectivity
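As a rough illustration of how a couple of the per-node items above (load average, storage capacity) could be sampled, here is a minimal Python sketch. It is not taken from any of the tools discussed in this talk; the function name `node_health` and the metric names in the returned dictionary are made up for this example, and it assumes a Unix-like node.

```python
import os
import shutil


def node_health(path="/"):
    """Sample a few per-node metrics (names are illustrative only)."""
    # os.getloadavg() is available on Unix-like systems only.
    load1, _load5, _load15 = os.getloadavg()
    usage = shutil.disk_usage(path)
    return {
        "load_1min": load1,
        # Load relative to the CPU count is easier to compare across nodes.
        "load_per_cpu": load1 / (os.cpu_count() or 1),
        "disk_used_pct": 100.0 * usage.used / usage.total,
    }


if __name__ == "__main__":
    for name, value in node_health().items():
        print(f"{name}: {value:.2f}")
```

In practice a monitoring agent would collect values like these on a schedule and ship them to a central server rather than printing them.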
Security
• Securing the cluster
• Security status updates
• Any failures
  • sudo reports
  • Network login failures (e.g. fail2ban)
  • crontab failures
  • Logfile errors (customize to fit)
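To make "network login failures" concrete, the sketch below counts failed SSH password attempts per source address in a list of log lines. It only illustrates the idea and is not how fail2ban is implemented; the sample log lines, hostnames, and addresses are invented, the regular expression assumes the common OpenSSH "Failed password for ... from ..." message, and the function names and threshold are hypothetical.

```python
import re
from collections import Counter

# Assumed OpenSSH message shape; adjust to match your own logs.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+)"
)


def failed_logins_by_source(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def repeat_offenders(lines, threshold=3):
    """Return sources with at least `threshold` failures, sorted."""
    counts = failed_logins_by_source(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Invented sample data using documentation-only IP ranges.
SAMPLE_LOG = [
    "May 20 10:01:02 head sshd[1234]: Failed password for root from 203.0.113.5 port 51234 ssh2",
    "May 20 10:01:05 head sshd[1234]: Failed password for invalid user admin from 203.0.113.5 port 51240 ssh2",
    "May 20 10:01:09 head sshd[1250]: Accepted publickey for user1 from 198.51.100.7 port 40022 ssh2",
    "May 20 10:01:11 head sshd[1260]: Failed password for root from 203.0.113.5 port 51251 ssh2",
    "May 20 10:02:00 head sshd[1300]: Failed password for user2 from 198.51.100.9 port 40100 ssh2",
]

if __name__ == "__main__":
    print(repeat_offenders(SAMPLE_LOG))
```

A report like this fits the "not-so-urgent" category: a daily summary email is usually enough, while the actual blocking is left to a tool such as fail2ban.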
How often?
• You will quickly get a feel for this
• Too much info is often worse than too little info
• The “urgent” – continually
• The “not-so-urgent” – anywhere from a few times per day to once per week
• There’s nothing wrong with trial and error
How to make it happen
• Nagios/NRPE (Nagios Remote Plugin Executor)
  • Generic executable that runs “plugins”
    • Plugins can monitor just about anything you can think of monitoring
  • Even works with Windows
  • Nagios (http://www.nagios.org/) is by far the most common monitoring system
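A plugin in this sense is just a program that prints one status line and returns an exit code: by the Nagios plugin convention, 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The following is a minimal sketch of a load-average plugin written in Python. The thresholds, function names, and message wording are assumptions chosen for illustration, not part of the talk or of any shipped plugin.

```python
import os

# Exit codes from the Nagios plugin conventions.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_load(load1, warn, crit):
    """Map a 1-minute load average to a (status, message) pair."""
    if load1 >= crit:
        return CRITICAL, f"LOAD CRITICAL - load1={load1:.2f}"
    if load1 >= warn:
        return WARNING, f"LOAD WARNING - load1={load1:.2f}"
    return OK, f"LOAD OK - load1={load1:.2f}"


def main(warn=4.0, crit=8.0):
    """Print one status line and return the plugin exit code."""
    try:
        load1 = os.getloadavg()[0]  # Unix only
    except OSError:
        print("LOAD UNKNOWN - load average unavailable")
        return UNKNOWN
    status, message = evaluate_load(load1, warn, crit)
    # Text after "|" is optional performance data: label=value;warn;crit
    print(f"{message} | load1={load1:.2f};{warn};{crit}")
    return status
```

To use this as an actual plugin, the script would end with `raise SystemExit(main())` so the status reaches Nagios or NRPE as the process exit code.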
How to make it happen
• Icinga (https://www.icinga.org/)
  • Can use NRPE
  • (New) version 2 has its own client
  • Uses database backend for history
  • Multi-threaded and multihomed
How to make it happen
• Ganglia (http://ganglia.sourceforge.net/) - for historical and resource monitoring
  • Ours are public
  • RRD files give historical data (a.k.a. “lots of pretty graphs”)
How to make it happen
• New alternative to Ganglia: Graphite (http://graphite.wikidot.com/)
  • Uses “whisper” instead of RRD (smaller files)
  • Scaling is better than Ganglia
  • Dynamic data points let you see exactly what you want (with some practice)
  • Still in beta
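For a sense of how data points get into Graphite, its Carbon daemon accepts a plaintext protocol: one line per data point in the form `metric.path value timestamp`, conventionally on TCP port 2003. The sketch below assumes a Carbon listener at `localhost:2003`; the metric path `cluster.node01.load1` and both function names are hypothetical.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timeout=5.0):
    """Send a single data point to a Carbon plaintext listener.

    The host and port are assumptions; change them for your site.
    """
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, `send_metric("cluster.node01.load1", 0.42)` would record one load sample. Because metric paths are created on first use, new data points can be added without reconfiguring the server.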