ganglia-nagios-bridge-1.1.0/COPYING
# ganglia-nagios-bridge - transfer Ganglia XML to Nagios checkresults file
#
# Project page: http://danielpocock.com/ganglia-nagios-bridge
#
# Copyright (C) 2010 Daniel Pocock http://danielpocock.com
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
ganglia-nagios-bridge-1.1.0/README.txt
ganglia-nagios-bridge
Copyright (C) 2010 Daniel Pocock
http://danielpocock.com/ganglia-nagios-bridge
Installation
------------
Copy ganglia-nagios-bridge.py to a suitable location (e.g. /usr/local/bin)
Declare the services to be monitored in /etc/nagios3/*
- see the included nagios.cfg for some examples
Copy nagios-bridge.conf to a suitable location (default /etc/ganglia)
and amend it as required.
Add ganglia-nagios-bridge.py to crontab
- it can be run every 1 or 5 minutes typically
- it must run as the nagios user (to create files in the checkresult
spool directory)
e.g.
cat >> /etc/crontab << EOF
# poll metrics from Ganglia, update Nagios
* * * * * nagios /usr/local/bin/ganglia-nagios-bridge.py
EOF
Limitations and troubleshooting
-------------------------------
Nagios must know about all the hosts and services exported by
Ganglia. If it sees things in the checkresults file that it
doesn't know about, it logs warning messages about them
but it correctly processes all the things it does
recognise.
Make sure the hostnames and service names match exactly.
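One way to check the names before deploying is to expand the service_name
templates the same way the bridge does and compare the output with the
service_description values in the Nagios configuration. The following is a
minimal sketch, not part of the shipped script: the helper name
expand_service_name is hypothetical, it is written for Python 3, and the
metric names load_one, load_five and load_fifteen are assumed to be what
Ganglia reports.

```python
import re


def expand_service_name(pattern, template, metric_name):
    """Return the Nagios service name for metric_name, or None if no match.

    Hypothetical helper mirroring the matching rules of the bridge:
    the pattern is matched from the start of the metric name and
    backreferences in the template are expanded only when the pattern
    contains groups.
    """
    match = re.match(pattern, metric_name)
    if match is None:
        return None
    if match.groups():
        return match.expand(template)
    return template


if __name__ == '__main__':
    # Assumed metric names; compare the printed names with nagios.cfg
    for name in ('load_one', 'load_five', 'load_fifteen'):
        print(expand_service_name(r'load_(\w+)', r'Load avg \1 minute', name))
```

Each printed name has to appear, character for character, as a
service_description in the Nagios configuration.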
If the Ganglia XML is particularly large, you may want to buffer it
before parsing to avoid any risk of blocking the gmetad
while ganglia-nagios-bridge is working. Of course, this requires
that you have sufficient RAM (or disk space) to buffer the XML.
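Buffering in RAM could look roughly like the sketch below. It is an
illustration under assumptions, not the behaviour of the shipped script
(which parses directly from the socket): the function names read_all and
fetch_and_parse are hypothetical and the code targets Python 3.

```python
import socket
import xml.sax


def read_all(sock, buf_size=65536):
    """Read from sock until the peer closes the connection; return bytes."""
    chunks = []
    while True:
        chunk = sock.recv(buf_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)


def fetch_and_parse(host, port, handler):
    """Download the complete XML first, then parse it from memory.

    The connection is closed before parsing starts, so slow processing
    in the handler cannot hold up the gmetad.
    """
    sock = socket.create_connection((host, port))
    try:
        data = read_all(sock)
    finally:
        sock.close()
    xml.sax.parseString(data, handler)
```

A call such as fetch_and_parse(gmetad_host, gmetad_port, handler) would take
the place of parsing directly from the socket, where handler is any
xml.sax.ContentHandler.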
Nagios is very picky about the checkresult filename. mkstemp
is used to generate the filenames. Nagios expects them to be
exactly 7 characters, starting with a 'c'. Nagios will delete
the files after reading them.
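The length of the random part that mkstemp appends depends on the Python
version (6 characters in Python 2.7, 8 in Python 3), so the 7 character rule
only holds with mkstemp on Python 2. The sketch below shows one way to build
the name explicitly instead. It is an assumption-laden illustration for
Python 3, not what the shipped script does, and the function names
create_checkresult_file and mark_ready are hypothetical.

```python
import os
import random
import string


def create_checkresult_file(spool_dir):
    """Create a new file named 'c' plus 6 random characters in spool_dir.

    Returns (fd, path). O_EXCL makes the creation fail if the name is
    already taken, in which case a fresh name is tried.
    """
    alphabet = string.ascii_letters + string.digits
    while True:
        name = 'c' + ''.join(random.choice(alphabet) for _ in range(6))
        path = os.path.join(spool_dir, name)
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            continue
        return fd, path


def mark_ready(fd, path):
    """Close the result file and create the empty companion '.ok' file."""
    os.close(fd)
    open(path + '.ok', 'a').close()
```

The '.ok' file is created last, in the same way the script does in its done()
method, once the result file has been written completely.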
Monitor nagios.log while it is processing the first checkresult
file to make sure it goes smoothly. It will log errors if it
doesn't recognise a host or service name. You may want to automatically
detect such issues in the log so that if new hosts/services appear in future,
you will be immediately alerted to update the Nagios configuration.
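Such a log check could be sketched as follows. The wording of the warning in
UNKNOWN_PATTERN is an assumption and should be adjusted to whatever your
Nagios version actually writes to nagios.log; the function name
find_unknown_results is hypothetical and the code targets Python 3.

```python
import re

# ASSUMPTION: wording of the Nagios warning for unknown hosts/services.
# Check a real nagios.log and adapt the pattern before relying on it.
UNKNOWN_PATTERN = re.compile(
    r"Check result queue contained results for .*could not be found")


def find_unknown_results(lines):
    """Return the log lines that report an unknown host or service."""
    return [line.rstrip('\n') for line in lines if UNKNOWN_PATTERN.search(line)]
```

Running find_unknown_results over an open nagios.log from cron and mailing
any non-empty result is one simple way to be alerted.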
ganglia-nagios-bridge-1.1.0/deployment.fig
#FIG 3.2 Produced by xfig version 3.2.5b
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1125 2790 2835 2790 2835 3645 1125 3645 1125 2790
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
4725 2790 7740 2790 7740 3600 4725 3600 4725 2790
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2610 1530 4905 1530 4905 2565 2610 2565 2610 1530
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 3
1 1 1.00 60.00 120.00
2610 2025 1845 2025 1845 2790
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 3
1 1 1.00 60.00 120.00
4905 2025 6120 2025 6120 2790
4 0 0 50 -1 0 12 0.0000 4 180 645 1170 3015 Ganglia\001
4 0 0 50 -1 0 12 0.0000 4 180 1560 1170 3240 gmetad (port 8651)\001
4 0 0 50 -1 0 12 0.0000 4 180 1365 1170 3465 or gmond (8649)\001
4 0 0 50 -1 0 12 0.0000 4 180 570 4815 2970 Nagios\001
4 0 0 50 -1 0 12 0.0000 4 180 1785 4815 3195 Check result spool dir\001
4 0 0 50 -1 0 12 0.0000 4 180 2880 4815 3420 /var/lib/nagios3/spool/checkresults\001
4 0 0 50 -1 0 12 0.0000 4 180 1800 2700 1755 ganglia-nagios-bridge\001
4 0 0 50 -1 0 12 0.0000 4 180 1620 2700 1980 Polls Ganglia XML,\001
4 0 0 50 -1 0 12 0.0000 4 180 1440 2700 2205 writes single bulk\001
4 0 0 50 -1 0 12 0.0000 4 135 1320 2700 2430 checkresults file\001
ganglia-nagios-bridge-1.1.0/ganglia-nagios-bridge.py
#!/usr/bin/python
#
# ganglia-nagios-bridge - transfer Ganglia XML to Nagios checkresults file
#
# Project page: http://danielpocock.com/ganglia-nagios-bridge
#
# Copyright (C) 2010 Daniel Pocock http://danielpocock.com
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
############################################################################
import argparse
import logging
import os
import re
import socket
import tempfile
import time
import xml.sax

# wrapper class so that the SAX parser can process data from a network
# socket
class SocketInputSource:
    def __init__(self, socket):
        self.socket = socket

    def getByteStream(self):
        return self

    def read(self, buf_size):
        return self.socket.recv(buf_size)

# interprets metric values to generate Nagios passive notifications
class PassiveGenerator:
    def __init__(self, force_dmax, tmax_grace):
        self.force_dmax = force_dmax
        self.tmax_grace = tmax_grace
        # Nagios is quite fussy about the filename, it must be
        # a 7 character name starting with 'c'
        # (nagios_result_dir is a global set by the configuration file)
        tmp_file = tempfile.mkstemp(prefix='c', dir=nagios_result_dir)
        self.fh = tmp_file[0]
        self.cmd_file = tmp_file[1]
        os.write(self.fh, "### Active Check Result File ###\n")
        os.write(self.fh, "file_time=" + str(int(time.time())) + "\n")

    def done(self):
        os.close(self.fh)
        # the empty .ok file tells Nagios the result file is complete
        ok_filename = self.cmd_file + ".ok"
        ok_fh = open(ok_filename, 'a')
        ok_fh.close()

    def process(self, metric_def, service_name, host, metric_name, metric_value, metric_tn, metric_tmax, metric_dmax, last_seen):
        effective_dmax = metric_dmax
        if self.force_dmax > 0:
            effective_dmax = self.force_dmax
        effective_tmax = metric_tmax + self.tmax_grace
        # service states: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
        if effective_dmax > 0 and metric_tn > effective_dmax:
            service_state = 3
        elif metric_tn > effective_tmax:
            service_state = 3
        elif isinstance(metric_value, str):
            service_state = 0
        elif 'crit_below' in metric_def and metric_value < metric_def['crit_below']:
            service_state = 2
        elif 'warn_below' in metric_def and metric_value < metric_def['warn_below']:
            service_state = 1
        elif 'crit_above' in metric_def and metric_value > metric_def['crit_above']:
            service_state = 2
        elif 'warn_above' in metric_def and metric_value > metric_def['warn_above']:
            service_state = 1
        else:
            service_state = 0
        #cmd = "[" + str(int(time.time())) + "] PROCESS_SERVICE_CHECK_RESULT;" + host + ";" + service_name + ";" + str(service_state) + ";Value = " + str(metric_value)
        #os.write(self.fh, cmd + "\n")
        os.write(self.fh, "\n### Nagios Service Check Result ###\n")
        os.write(self.fh, "# Time: " + time.asctime() + "\n")
        os.write(self.fh, "host_name=" + host + "\n")
        os.write(self.fh, "service_description=" + service_name + "\n")
        os.write(self.fh, "check_type=0\n")
        os.write(self.fh, "check_options=0\n")
        os.write(self.fh, "scheduled_check=1\n")
        os.write(self.fh, "reschedule_check=1\n")
        os.write(self.fh, "latency=0.1\n")
        os.write(self.fh, "start_time=" + str(last_seen) + ".0\n")
        os.write(self.fh, "finish_time=" + str(last_seen) + ".0\n")
        os.write(self.fh, "early_timeout=0\n")
        os.write(self.fh, "exited_ok=1\n")
        os.write(self.fh, "return_code=" + str(service_state) + "\n")
        os.write(self.fh, "output=" + service_name + " " + str(metric_value) + "\\n\n")
        #os.write(self.fh, "\n")

# SAX event handler for parsing the Ganglia XML stream
class GangliaHandler(xml.sax.ContentHandler):
    def __init__(self, clusters_c, value_handler):
        self.clusters_c = clusters_c
        self.value_handler = value_handler
        self.clusters_cache = {}
        self.hosts_cache = {}
        self.metrics_cache = {}
        # nothing is selected until a CLUSTER and a HOST have matched
        self.hosts = None
        self.metrics = None

    def startElement(self, name, attrs):
        # METRIC is the most common element, it is handled first,
        # followed by HOST and CLUSTER

        # handle common elements that we ignore
        if name == "EXTRA_ELEMENT":
            return
        if name == "EXTRA_DATA":
            return

        # handle a METRIC element in the XML
        if name == "METRIC" and self.metrics is not None:
            metric_name = attrs['NAME']
            cache_key = (self.cluster_idx, self.host_idx, metric_name)
            if cache_key in self.metrics_cache:
                metric_info = self.metrics_cache[cache_key]
                self.metric_idx = metric_info[0]
                service_name = metric_info[1]
                self.metric = self.clusters_c[self.cluster_idx][1][self.host_idx][1][self.metric_idx][1]
                self.handle_metric(metric_name, service_name, attrs)
                return
            for idx, metric_def in enumerate(self.metrics):
                match_result = metric_def[0].match(metric_name)
                if match_result:
                    service_name_tmpl = metric_def[1]['service_name']
                    if len(match_result.groups()) > 0:
                        service_name = match_result.expand(service_name_tmpl)
                    else:
                        service_name = service_name_tmpl
                    self.metrics_cache[cache_key] = (idx, service_name)
                    self.metric = metric_def[1]
                    self.handle_metric(metric_name, service_name, attrs)
                    return

        # handle a HOST element in the XML
        if name == "HOST" and self.hosts is not None:
            self.metrics = None
            self.host_name = attrs['NAME']
            self.host_reported = long(attrs['REPORTED'])
            if strip_domains:
                self.host_name = self.host_name.partition('.')[0]
            cache_key = (self.cluster_idx, self.host_name)
            if cache_key in self.hosts_cache:
                self.host_idx = self.hosts_cache[cache_key]
                self.metrics = self.clusters_c[self.cluster_idx][1][self.host_idx][1]
                return
            for idx, host_def in enumerate(self.hosts):
                if host_def[0].match(self.host_name):
                    self.hosts_cache[cache_key] = idx
                    self.host_idx = idx
                    self.metrics = host_def[1]
                    return

        # handle a CLUSTER element in the XML
        if name == "CLUSTER":
            self.hosts = None
            # forget the metrics of the previous cluster's last host so
            # that an unmatched cluster is really ignored
            self.metrics = None
            self.cluster_name = attrs['NAME']
            self.cluster_localtime = long(attrs['LOCALTIME'])
            if self.cluster_name in self.clusters_cache:
                self.cluster_idx = self.clusters_cache[self.cluster_name]
                self.hosts = self.clusters_c[self.cluster_idx][1]
                return
            for idx, cluster_def in enumerate(self.clusters_c):
                if cluster_def[0].match(self.cluster_name):
                    self.clusters_cache[self.cluster_name] = idx
                    self.cluster_idx = idx
                    self.hosts = cluster_def[1]
                    return

    def handle_metric(self, metric_name, service_name, attrs):
        # extract the metric attributes
        metric_value_raw = attrs['VAL']
        metric_tn = int(attrs['TN'])
        metric_tmax = int(attrs['TMAX'])
        metric_dmax = int(attrs['DMAX'])
        metric_type = attrs['TYPE']
        # the metric_value has a dynamic type:
        if metric_type == 'string':
            metric_value = metric_value_raw
        elif metric_type == 'double' or metric_type == 'float':
            metric_value = float(metric_value_raw)
        else:
            metric_value = int(metric_value_raw)
        last_seen = self.cluster_localtime - metric_tn
        # call the handler to process the value:
        self.value_handler.process(self.metric, service_name, self.host_name, metric_name, metric_value, metric_tn, metric_tmax, metric_dmax, last_seen)

# main program code
if __name__ == '__main__':
    try:
        # parse command line
        parser = argparse.ArgumentParser(description='read Ganglia XML and generate Nagios check results file')
        parser.add_argument('config_file', nargs='?',
                            help='configuration file', default='/etc/ganglia/nagios-bridge.conf')
        args = parser.parse_args()

        # read the configuration file, setting some defaults first
        force_dmax = 0
        tmax_grace = 60
        execfile(args.config_file)

        # compile the regular expressions
        clusters_c = []
        for cluster_def in clusters:
            cluster_c = re.compile(cluster_def[0])
            hosts = []
            for host_def in cluster_def[1]:
                host_c = re.compile(host_def[0])
                metrics = []
                for metric_def in host_def[1]:
                    metric_c = re.compile(metric_def[0])
                    metrics.append((metric_c, metric_def[1]))
                hosts.append((host_c, metrics))
            clusters_c.append((cluster_c, hosts))

        # connect to the gmetad or gmond
        sock = socket.create_connection((gmetad_host, gmetad_port))

        # set up the SAX parser
        parser = xml.sax.make_parser()
        pg = PassiveGenerator(force_dmax, tmax_grace)
        parser.setContentHandler(GangliaHandler(clusters_c, pg))

        # run the main program loop
        parser.parse(SocketInputSource(sock))

        # write out for Nagios
        pg.done()

        # all done
        sock.close()
    except socket.error as e:
        logging.warn('Failed to connect to gmetad: %s', e.strerror)
ganglia-nagios-bridge-1.1.0/nagios-bridge.conf
# These are the details of the gmond or gmetad XML service
# The gmetad daemon listens on two ports, one of which provides
# immediate output like gmond, that is the port that
# must be specified here (default is 8651)
gmetad_host = '127.0.0.1'
gmetad_port = 8651
# If set to a value greater than 0, this overrides the DMAX attribute
# of all metrics on all hosts (0 keeps the DMAX reported by Ganglia).
# If DMAX > 0 and TN > DMAX, then the metric state is considered
# UNKNOWN and Nagios will potentially send an alert
force_dmax = 0
# Every collection group in gmond.conf defines a time_threshold
# This value appears as TMAX in the XML.
# The gmond process should normally send every metric again before
# the value timer TN > TMAX.
# If ganglia-nagios-bridge is polling a gmond collector
# then a very small tmax_grace period (perhaps 5 seconds) can be used.
# If ganglia-nagios-bridge is polling a gmetad server then
# tmax_grace should be set higher than the polling interval configured
# in gmetad.
tmax_grace = 30
# Ganglia XML typically contains FQDNs for all hosts, as it obtains
# the hostnames using reverse DNS lookups. Nagios, on the other hand,
# is often configured with just the hostname and no domain. Setting
# strip_domains = True will ensure that the domain part is stripped from
# the hostname before passing it to Nagios.
strip_domains = True
# This is the directory where Nagios expects to read checkresults
# submitted in batch
nagios_result_dir = '/var/lib/nagios3/spool/checkresults'
# This is where we select the metrics that we want to map from
# Ganglia to Nagios service names
# Any metric not matched in the configuration will be ignored and
# not passed to Nagios.
# It is permitted to use regular expressions for the cluster name,
# host name or metric name matching
# In the 'service_name' attribute of a metric, it is permitted to
# use positional placeholders to refer to groups matched in the
# metric name regex
clusters = [
('.*', # cluster name
[ # list of hosts to match in the cluster(s)
('.*', # host name
[ # list of metrics to match for the host(s)
(r'proc_total', # metric name
{ # attributes for this metric:
'service_name': r'Total processes',
#'crit_below': 0,
#'warn_below': 0,
'warn_above': 120,
'crit_above': 150
}
),
(
r'load_(\w+)', # Using wildcard to match all load avgs
{ 'service_name': r'Load avg \1 minute', 'warn_above': 5, 'crit_above': 10 })
]
)
]
),
('SomeCluster',
[
]
)
]
ganglia-nagios-bridge-1.1.0/nagios.cfg
# Typically these definitions are split into multiple configuration
# files within the /etc/nagios tree
# For convenience, they are all listed here in a single file
# to support the example configuration
# from /etc/nagios-plugins/config/dummy.cfg on Debian:
# return-unknown definition
define command {
command_name return-unknown
command_line /usr/lib/nagios/plugins/check_dummy 3
}
# adapted from /etc/nagios3/conf.d/hostgroups_nagios2.cfg on Debian
# (named all-servers to match the service definitions in this file):
define hostgroup {
hostgroup_name all-servers
alias All Servers
members *
}
# /etc/nagios3/conf.d/services_nagios2.cfg on Debian:
# this is checked passively by the ganglia-nagios-bridge utility
define service {
hostgroup_name all-servers
service_description Total processes
use generic-service
notification_interval 0
check_command return-unknown
active_checks_enabled 0
passive_checks_enabled 1
}
# this is checked passively by the ganglia-nagios-bridge utility
define service {
hostgroup_name all-servers
service_description Load avg one minute
use generic-service
notification_interval 0
check_command return-unknown
active_checks_enabled 0
passive_checks_enabled 1
}
# this is checked passively by the ganglia-nagios-bridge utility
define service {
hostgroup_name all-servers
service_description Load avg five minute
use generic-service
notification_interval 0
check_command return-unknown
active_checks_enabled 0
passive_checks_enabled 1
}
# this is checked passively by the ganglia-nagios-bridge utility
define service {
hostgroup_name all-servers
service_description Load avg fifteen minute
use generic-service
notification_interval 0
check_command return-unknown
active_checks_enabled 0
passive_checks_enabled 1
}