HEPSYSMAN
Monitoring Workshop
Introduction to the Day and Overview of Ganglia
Pete Gronbech
31st October 2007 Introduction & Ganglia Slide 2
Agenda
Wednesday 31st October 2007
10:00 Start / Coffee
10:30 - 11:00 Introduction & Ganglia Overview (Pete Gronbech)
11:00 - 12:30 MonAMI Interactive Workshop (Paul Millar)
12:30 - 13:30 Lunch
13:30 - 14:00 Intro to Nagios (A. Elwell)
14:00 - 14:30 GRID Service Monitoring Group (Ian Neilson)
14:30 - 15:00 Further Nagios Scripts (Chris Brew)
15:00 - 16:00 Live install at a site and workshop discussion
16:00 - 16:30 Other Monitoring Tools Discussion (Pakiti, gridmap, accounting cpu and storage, SAM, SAM admins page etc.)
16:30 AOB and wrap up
Why Monitoring
• Critical machines cannot be trusted. Your systems will fail. When they do, two things save you from downtime: redundancy and monitoring.
• Limited manpower at sites
• Ever-increasing cluster sizes
• Complex software with many failure modes
• Need to meet SLAs (95% uptime)
• PR and reporting
Many external monitoring sites
• Gstat - http://goc.grid.sinica.edu.tw/gstat/UKI.html
• Steve Lloyd's page - http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
• SAM - https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=ops&order=SiteName&funct=ShowSensorTests
Many more external monitoring sites
• Gridmap - http://gridmap.cern.ch/gm/
• Gridview - http://gridview.cern.ch/GRIDVIEW/
• Accounting - http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.html
Local Site Monitoring
• Unix tools
• Batch system tools
• What is really needed is something that gives a quick visual overview of the health of, and load on, your cluster: Ganglia
Ganglia
How does Ganglia work?
• Ganglia works through a small agent, gmond, running on each node or machine to be monitored. The same gmond package and configuration can be rolled out to many machines at once. Each gmond reports the state of its local node, and a machine running a master gmetad instance collects that state.
• The server uses RRDtool to store the data over time.
• The Ganglia framework can be extended to monitor many parameters.
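To make the data flow concrete: a gmond with a tcp_accept_channel hands an XML description of the cluster state to any client that connects, which is the kind of data a collector works from. The sketch below is not from the slides. It assumes the default port 8649, and the function names, the sample XML and its attribute values are hypothetical and heavily trimmed.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond sends on its TCP accept channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    # Return bytes so the XML parser can honour the declared encoding.
    return b"".join(chunks)


def parse_metrics(xml_data):
    """Return {host_name: {metric_name: value_string}} from a gmond dump."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        result[host.get("NAME")] = metrics
    return result


# Hypothetical, trimmed sample of what a gmond might return.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
  <CLUSTER NAME="LCG Workers">
    <HOST NAME="computerA.physics.ox.ac.uk" IP="192.0.2.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="4" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""
```

Calling parse_metrics(SAMPLE_XML) gives a per-host dictionary of metric values. Against a live node, parse_metrics(fetch_gmond_xml("somehost")) should behave the same way, provided that host's tcp_accept_channel is reachable.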
Setup
The software can be downloaded from http://ganglia.sourceforge.net/
Computers A, B and C: run gmond
Computer D: runs gmetad
Clients just have to run gmond, which is configured by /etc/gmond.conf.
The server that collects the data runs gmetad. It could also run gmond to monitor itself.
The web interface needs to run on a web server.
Computer D: gmetad, gmond, httpd
Client Setup
Computers A, B and C: run gmond
yum install ganglia-gmond
edit config file
service gmond start
chkconfig gmond on
/etc/gmond.conf extracts
cluster {
  name = "LCG Workers"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.95
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.95
  port = 8649
  bind = 239.2.11.95
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

udp_send_channel {
  port = 8649
  host = pplxconfig
}
Server Setup
yum install ganglia-gmond ganglia-gmetad ganglia-web
edit /etc/gmond.conf
edit /etc/gmetad.conf
Computer D: gmetad, gmond, httpd
Extracts from /etc/gmetad.conf
data_source "LCG Workers" computerA.physics.ox.ac.uk ComputerB.physics.ox.ac.uk computerC.physics.ox.ac.uk
data_source "LCG Servers" t2se01.physics.ox.ac.uk:8656 t2ce02.physics.ox.ac.uk:8656 gridlogger.physics.ox.ac.uk:8656
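To show how such lines are read, here is a minimal sketch of a data_source parser. It is not gmetad's own code. It assumes the usual layout (a quoted cluster name, an optional polling interval in seconds, then host[:port] entries) and the gmetad defaults of port 8649 and a 15 second interval. The function name parse_data_source is hypothetical.

```python
import shlex

DEFAULT_GMOND_PORT = 8649     # assumed when an entry carries no port
DEFAULT_POLL_INTERVAL = 15    # seconds, assumed when no interval is given


def parse_data_source(line):
    """Split one gmetad.conf data_source line into name, interval, hosts."""
    tokens = shlex.split(line)  # keeps the quoted cluster name together
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = DEFAULT_POLL_INTERVAL
    if rest[0].isdigit():       # optional polling interval before the hosts
        interval = int(rest[0])
        rest = rest[1:]
    hosts = []
    for entry in rest:
        host, sep, port = entry.partition(":")
        hosts.append((host, int(port) if sep else DEFAULT_GMOND_PORT))
    return name, interval, hosts
```

Applied to the two extracts above, the "LCG Workers" hosts come out on port 8649, while the "LCG Servers" hosts keep their explicit port 8656. A different port per cluster is one way to keep separate clusters apart on the same network.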
Aggregating sub clusters
Host level detail
Customizing
• Adding PBS Batch Queue data
PBS Queue Monitoring
• Originally based on RAL Tier 1 work
• Actually fairly complicated.
• See Chris Brew or me later for details.
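The RAL-based scripts are not shown in the slides, so the following is only a simplified sketch of the general idea, not the method actually used: count jobs per state from plain qstat output and publish the totals with Ganglia's gmetric command. It assumes the default qstat column layout (state in the second-to-last column). The metric names and function names are hypothetical.

```python
import subprocess
from collections import Counter


def count_job_states(qstat_output):
    """Count jobs per state letter (R, Q, ...) in plain `qstat` output."""
    counts = Counter()
    for line in qstat_output.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) >= 6:
            counts[fields[-2]] += 1  # state is the second-to-last column
    return counts


def gmetric_command(name, value, units="jobs"):
    """Build the gmetric call that publishes one integer metric."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", "uint32", "--units", units]


def publish_queue_metrics():
    """Run qstat and push running/queued job counts into Ganglia."""
    output = subprocess.run(["qstat"], capture_output=True, text=True,
                            check=True).stdout
    counts = count_job_states(output)
    # Hypothetical metric names; pick whatever suits your site.
    subprocess.run(gmetric_command("pbs_running_jobs", counts["R"]), check=True)
    subprocess.run(gmetric_command("pbs_queued_jobs", counts["Q"]), check=True)
```

One plausible way to use this is to call publish_queue_metrics() from cron on the batch server every minute or so. The new metrics then appear as extra graphs for that host.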
How is Ganglia different from Nagios?
• Ganglia is architecturally designed to perform efficiently in very large monitoring environments: each Ganglia gmond performs its service checks locally, reporting in at a regular interval to the gmetad. Nagios performs its service checks by polling each device across a network connection and waiting for a response (known as "active checks"), which can be more resource and bandwidth intensive.
• Nagios uses the results of its active checks to determine state by comparing the metrics it polls to thresholds. These state changes can in turn be used to generate notifications and customizable corrective actions. Ganglia, by contrast, has no built-in thresholds, and so does not generate events or notifications.
• The general rule of thumb has been: if you need to monitor a limited number of aspects of a large number of identical devices, use Ganglia; if you want to monitor lots of aspects of a smaller number of different devices, use Nagios. But those distinctions are blurring as Ganglia supports more and more devices, and as Nagios' scalability improves.
How is Ganglia different from Nagios? (continued)
• The problem with Ganglia and all the other external web pages we have been looking at is that you have to look at them!
• If all is well with your system you don’t want to have to look.
• This is where Nagios comes in. It can be set up to alert you when something goes wrong, or when a value passes a threshold.
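The threshold idea can be sketched in a few lines. This is an illustration rather than a real plugin: the exit codes 0 to 3 are the standard Nagios plugin return codes, but the function names, the metric name and the warn/crit values are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(value, warn, crit):
    """Map a metric value onto a Nagios state, higher values being worse."""
    if value is None:        # no data available, so the state is unknown
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_status(metric, value, state):
    """Build the one-line status text a plugin prints before exiting."""
    return "%s - %s = %s" % (LABELS[state], metric, value)
```

A plugin built on this would print format_status(...) and exit with the state as its return code, for example state = check_threshold(5.0, warn=4.0, crit=8.0) followed by sys.exit(state). Nagios turns changes in that state into notifications, so nobody has to watch a web page.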