ganglia-nagios-bridge-1.1.0/COPYING

# ganglia-nagios-bridge - transfer Ganglia XML to Nagios checkresults file
#
# Project page: http://danielpocock.com/ganglia-nagios-bridge
#
# Copyright (C) 2010 Daniel Pocock http://danielpocock.com
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

ganglia-nagios-bridge-1.1.0/README.txt

ganglia-nagios-bridge

Copyright (C) 2010 Daniel Pocock

http://danielpocock.com/ganglia-nagios-bridge

Installation
------------

Copy ganglia-nagios-bridge.py to a suitable location
(e.g. /usr/local/bin)

Declare the services to be monitored in /etc/nagios3/*
 - see the included nagios.cfg for some examples

Copy nagios-bridge.conf to a suitable location
(default /etc/ganglia) and amend it as required.

Add ganglia-nagios-bridge.py to crontab
 - it can typically be run every 1 or 5 minutes
 - it must run as the nagios user (to create files in the
   checkresult spool directory)

e.g.

  cat >> /etc/crontab << EOF
  # poll metrics from Ganglia, update Nagios
  * * * * *   nagios   /usr/local/bin/ganglia-nagios-bridge.py
  EOF

Limitations and troubleshooting
-------------------------------

Nagios must know about all the hosts and services exported
by Ganglia.  If it sees things in the checkresults file that
it doesn't know about, it logs warning messages about them
but still correctly processes everything it does recognise.
Make sure the hostnames and service names match exactly.

If the Ganglia XML is particularly large, you may want to
buffer it before parsing to avoid any risk of blocking
the gmetad while ganglia-nagios-bridge is working.  Of course,
this requires that you have sufficient RAM (or disk space)
to buffer the XML.

Nagios is very picky about the checkresult filename.  mkstemp
is used to generate the filenames.  Nagios expects them to
be exactly 7 characters, starting with a 'c'.  Nagios will
delete the files after reading them.

Monitor nagios.log while it is processing the first checkresult
file to make sure it goes smoothly.  It will log errors if
it doesn't recognise a host or service name.  You may want to
detect such issues in the log automatically, so that if new
hosts/services appear in future you are alerted immediately
to update the Nagios configuration.
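The buffering suggestion above can be sketched roughly as follows.  This
is only an illustration, not how ganglia-nagios-bridge.py itself works:
the helper names read_all and parse_buffered and the MetricCounter
handler are hypothetical, and the whole document is assumed to fit in
RAM.  The idea is to drain the socket completely first, so that the
Ganglia daemon is released quickly, and to run the SAX parser on the
in-memory copy afterwards.

```python
import socket
import xml.sax


def read_all(sock, buf_size=65536):
    """Drain the socket into memory until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(buf_size)
        if not chunk:  # an empty read means the sender has finished
            break
        chunks.append(chunk)
    return b"".join(chunks)


class MetricCounter(xml.sax.ContentHandler):
    """Hypothetical stand-in for GangliaHandler: counts METRIC elements."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.metrics = 0

    def startElement(self, name, attrs):
        if name == "METRIC":
            self.metrics += 1


def parse_buffered(sock, handler):
    """Buffer the whole XML document, then parse the buffered copy."""
    data = read_all(sock)              # all network I/O finishes here
    xml.sax.parseString(data, handler)  # parsing no longer touches the socket
    return len(data)
```

In the real script the handler passed to parse_buffered would be the
GangliaHandler instance, and the socket would come from
socket.create_connection((gmetad_host, gmetad_port)).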
ganglia-nagios-bridge-1.1.0/deployment.fig

#FIG 3.2  Produced by xfig version 3.2.5b
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
	 1125 2790 2835 2790 2835 3645 1125 3645 1125 2790
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
	 4725 2790 7740 2790 7740 3600 4725 3600 4725 2790
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
	 2610 1530 4905 1530 4905 2565 2610 2565 2610 1530
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 3
	1 1 1.00 60.00 120.00
	 2610 2025 1845 2025 1845 2790
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 3
	1 1 1.00 60.00 120.00
	 4905 2025 6120 2025 6120 2790
4 0 0 50 -1 0 12 0.0000 4 180 645 1170 3015 Ganglia\001
4 0 0 50 -1 0 12 0.0000 4 180 1560 1170 3240 gmetad (port 8651)\001
4 0 0 50 -1 0 12 0.0000 4 180 1365 1170 3465 or gmond (8649)\001
4 0 0 50 -1 0 12 0.0000 4 180 570 4815 2970 Nagios\001
4 0 0 50 -1 0 12 0.0000 4 180 1785 4815 3195 Check result spool dir\001
4 0 0 50 -1 0 12 0.0000 4 180 2880 4815 3420 /var/lib/nagios3/spool/checkresults\001
4 0 0 50 -1 0 12 0.0000 4 180 1800 2700 1755 ganglia-nagios-bridge\001
4 0 0 50 -1 0 12 0.0000 4 180 1620 2700 1980 Polls Ganglia XML,\001
4 0 0 50 -1 0 12 0.0000 4 180 1440 2700 2205 writes single bulk\001
4 0 0 50 -1 0 12 0.0000 4 135 1320 2700 2430 checkresults file\001

ganglia-nagios-bridge-1.1.0/ganglia-nagios-bridge.py

#!/usr/bin/python
#
# ganglia-nagios-bridge - transfer Ganglia XML to Nagios checkresults file
#
# Project page: http://danielpocock.com/ganglia-nagios-bridge
#
# Copyright (C) 2010 Daniel Pocock http://danielpocock.com
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
#
############################################################################

import argparse
import logging
import os
import re
import socket
import tempfile
import time
import xml.sax

# wrapper class so that the SAX parser can process data from a network
# socket
class SocketInputSource:
    def __init__(self, socket):
        self.socket = socket

    def getByteStream(self):
        return self

    def read(self, buf_size):
        return self.socket.recv(buf_size)

# interprets metric values to generate Nagios passive notifications
class PassiveGenerator:
    def __init__(self, force_dmax, tmax_grace):
        self.force_dmax = force_dmax
        self.tmax_grace = tmax_grace
        # Nagios is quite fussy about the filename, it must be
        # a 7 character name starting with 'c'
        # (nagios_result_dir is a global set by the configuration file)
        tmp_file = tempfile.mkstemp(prefix='c', dir=nagios_result_dir)
        self.fh = tmp_file[0]
        self.cmd_file = tmp_file[1]
        os.write(self.fh, "### Active Check Result File ###\n")
        os.write(self.fh, "file_time=" + str(int(time.time())) + "\n")

    def done(self):
        os.close(self.fh)
        # an empty <name>.ok file tells Nagios the result file is complete
        ok_filename = self.cmd_file + ".ok"
        ok_fh = open(ok_filename, 'a')
        ok_fh.close()

    def process(self, metric_def, service_name, host, metric_name, metric_value, metric_tn, metric_tmax, metric_dmax, last_seen):
        # a non-zero force_dmax from the configuration overrides the
        # DMAX reported by Ganglia
        effective_dmax = metric_dmax
        if self.force_dmax > 0:
            effective_dmax = self.force_dmax
        effective_tmax = metric_tmax + self.tmax_grace
        # states: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN (stale)
        if effective_dmax > 0 and metric_tn > effective_dmax:
            service_state = 3
        elif metric_tn > effective_tmax:
            service_state = 3
        elif isinstance(metric_value, str):
            service_state = 0
        elif 'crit_below' in metric_def and metric_value < metric_def['crit_below']:
            service_state = 2
        elif 'warn_below' in metric_def and metric_value < metric_def['warn_below']:
            service_state = 1
        elif 'crit_above' in metric_def and metric_value > metric_def['crit_above']:
            service_state = 2
        elif 'warn_above' in metric_def and metric_value > metric_def['warn_above']:
            service_state = 1
        else:
            service_state = 0
        #cmd = "[" + str(int(time.time())) + "] PROCESS_SERVICE_CHECK_RESULT;" + host + ";" + service_name + ";" + str(service_state) + ";Value = " + str(metric_value)
        #os.write(self.fh, cmd + "\n")
        os.write(self.fh, "\n### Nagios Service Check Result ###\n")
        os.write(self.fh, "# Time: " + time.asctime() + "\n")
        os.write(self.fh, "host_name=" + host + "\n")
        os.write(self.fh, "service_description=" + service_name + "\n")
        os.write(self.fh, "check_type=0\n")
        os.write(self.fh, "check_options=0\n")
        os.write(self.fh, "scheduled_check=1\n")
        os.write(self.fh, "reschedule_check=1\n")
        os.write(self.fh, "latency=0.1\n")
        os.write(self.fh, "start_time=" + str(last_seen) + ".0\n")
        os.write(self.fh, "finish_time=" + str(last_seen) + ".0\n")
        os.write(self.fh, "early_timeout=0\n")
        os.write(self.fh, "exited_ok=1\n")
        os.write(self.fh, "return_code=" + str(service_state) + "\n")
        os.write(self.fh, "output=" + service_name + " " + str(metric_value) + "\\n\n")
        #os.write(self.fh, "\n")

# SAX event handler for parsing the Ganglia XML stream
class GangliaHandler(xml.sax.ContentHandler):
    def __init__(self, clusters_c, value_handler):
        self.clusters_c = clusters_c
        self.value_handler = value_handler
        self.clusters_cache = {}
        self.hosts_cache = {}
        self.metrics_cache = {}
        # nothing is matched until a CLUSTER and a HOST have been seen
        self.hosts = None
        self.metrics = None

    def startElement(self, name, attrs):

        # METRIC is the most common element, it is handled first,
        # followed by HOST and CLUSTER

        # handle common elements that we ignore
        if name == "EXTRA_ELEMENT":
            return
        if name == "EXTRA_DATA":
            return

        # handle a METRIC element in the XML
        if name == "METRIC" and self.metrics is not None:
            metric_name = attrs['NAME']
            cache_key = (self.cluster_idx, self.host_idx, metric_name)
            if cache_key in self.metrics_cache:
                metric_info = self.metrics_cache[cache_key]
                self.metric_idx = metric_info[0]
                service_name = metric_info[1]
                self.metric = self.clusters_c[self.cluster_idx][1][self.host_idx][1][self.metric_idx][1]
                self.handle_metric(metric_name, service_name, attrs)
                return
            for idx, metric_def in enumerate(self.metrics):
                match_result = metric_def[0].match(metric_name)
                if match_result:
                    service_name_tmpl = metric_def[1]['service_name']
                    if len(match_result.groups()) > 0:
                        service_name = match_result.expand(service_name_tmpl)
                    else:
                        service_name = service_name_tmpl
                    self.metrics_cache[cache_key] = (idx, service_name)
                    self.metric = metric_def[1]
                    self.handle_metric(metric_name, service_name, attrs)
                    return

        # handle a HOST element in the XML
        if name == "HOST" and self.hosts is not None:
            self.metrics = None
            self.host_name = attrs['NAME']
            self.host_reported = long(attrs['REPORTED'])
            if strip_domains:
                self.host_name = self.host_name.partition('.')[0]
            cache_key = (self.cluster_idx, self.host_name)
            if cache_key in self.hosts_cache:
                self.host_idx = self.hosts_cache[cache_key]
                self.metrics = self.clusters_c[self.cluster_idx][1][self.host_idx][1]
                return
            for idx, host_def in enumerate(self.hosts):
                if host_def[0].match(self.host_name):
                    self.hosts_cache[cache_key] = idx
                    self.host_idx = idx
                    self.metrics = host_def[1]
                    return

        # handle a CLUSTER element in the XML
        if name == "CLUSTER":
            self.hosts = None
            # also forget the previous host's metrics, otherwise METRIC
            # elements in an unmatched cluster would reuse stale state
            self.metrics = None
            self.cluster_name = attrs['NAME']
            self.cluster_localtime = long(attrs['LOCALTIME'])
            if self.cluster_name in self.clusters_cache:
                self.cluster_idx = self.clusters_cache[self.cluster_name]
                self.hosts = self.clusters_c[self.cluster_idx][1]
                return
            for idx, cluster_def in enumerate(self.clusters_c):
                if cluster_def[0].match(self.cluster_name):
                    self.clusters_cache[self.cluster_name] = idx
                    self.cluster_idx = idx
                    self.hosts = cluster_def[1]
                    return

    def handle_metric(self, metric_name, service_name, attrs):
        # extract the metric attributes
        metric_value_raw = attrs['VAL']
        metric_tn = int(attrs['TN'])
        metric_tmax = int(attrs['TMAX'])
        metric_dmax = int(attrs['DMAX'])
        metric_type = attrs['TYPE']
        # the metric_value has a dynamic type:
        if metric_type == 'string':
            metric_value = metric_value_raw
        elif metric_type == 'double' or metric_type == 'float':
            metric_value = float(metric_value_raw)
        else:
            metric_value = int(metric_value_raw)
        last_seen = self.cluster_localtime - metric_tn
        # call the handler to process the value:
        self.value_handler.process(self.metric, service_name, self.host_name, metric_name, metric_value, metric_tn, metric_tmax, metric_dmax, last_seen)

# main program code
if __name__ == '__main__':
    try:
        # parse command line
        parser = argparse.ArgumentParser(description='read Ganglia XML and generate Nagios check results file')
        parser.add_argument('config_file', nargs='?',
                            help='configuration file',
                            default='/etc/ganglia/nagios-bridge.conf')
        args = parser.parse_args()

        # read the configuration file, setting some defaults first
        force_dmax = 0
        tmax_grace = 60
        execfile(args.config_file)

        # compile the regular expressions
        clusters_c = []
        for cluster_def in clusters:
            cluster_c = re.compile(cluster_def[0])
            hosts = []
            for host_def in cluster_def[1]:
                host_c = re.compile(host_def[0])
                metrics = []
                for metric_def in host_def[1]:
                    metric_c = re.compile(metric_def[0])
                    metrics.append((metric_c, metric_def[1]))
                hosts.append((host_c, metrics))
            clusters_c.append((cluster_c, hosts))

        # connect to the gmetad or gmond
        sock = socket.create_connection((gmetad_host, gmetad_port))
        # set up the SAX parser
        parser = xml.sax.make_parser()
        pg = PassiveGenerator(force_dmax, tmax_grace)
        parser.setContentHandler(GangliaHandler(clusters_c, pg))
        # run the main program loop
        parser.parse(SocketInputSource(sock))

        # write out for Nagios
        pg.done()

        # all done
        sock.close()
    except socket.error as e:
        logging.warning('Failed to connect to gmetad: %s', e.strerror)

ganglia-nagios-bridge-1.1.0/nagios-bridge.conf

# These are the details of the gmond or gmetad XML service
# The gmetad daemon listens on
# two ports, one of which provides
# immediate output like gmond, that is the port that
# must be specified here (default is 8651)
gmetad_host = '127.0.0.1'
gmetad_port = 8651

# This overrides the DMAX attribute from all metrics in all hosts
# If DMAX > 0 and TN > DMAX, then a metric state is considered
# UNKNOWN and Nagios will potentially send an alert
force_dmax = 0

# Every collection group in gmond.conf defines a time_threshold
# This value appears as TMAX in the XML.
# The gmond process should normally send every metric again before
# the value timer TN > TMAX.
# If ganglia-nagios-bridge is polling a gmond collector
# then a very small tmax_grace period (perhaps 5 seconds) can be used.
# If ganglia-nagios-bridge is polling a gmetad server then
# tmax_grace should be set higher than the polling interval configured
# in gmetad.
tmax_grace = 30

# Ganglia XML typically contains FQDNs for all hosts, as it obtains
# the hostnames using reverse DNS lookups.  Nagios, on the other hand,
# is often configured with just the hostname and no domain.  Setting
# strip_domains = True will ensure that the domain part is stripped from
# the hostname before passing it to Nagios.
strip_domains = True

# This is the directory where Nagios expects to read checkresults
# submitted in batch
nagios_result_dir = '/var/lib/nagios3/spool/checkresults'

# This is where we select the metrics that we want to map from
# Ganglia to Nagios service names
# Any metric not matched in the configuration will be ignored and
# not passed to Nagios.
# It is permitted to use regular expressions for the cluster name,
# host name or metric name matching
# In the 'service_name' attribute of a metric, it is permitted to
# use positional placeholders to refer to groups matched in the
# metric name regex
clusters = [
    ('.*',    # cluster name
        [    # list of hosts to match in the cluster(s)
            ('.*',    # host name
                [    # list of metrics to match for the host(s)
                    (r'proc_total',    # metric name
                        {    # attributes for this metric:
                            'service_name': r'Total processes',
                            #'crit_below': 0,
                            #'warn_below': 0,
                            'warn_above': 120,
                            'crit_above': 150
                        }
                    ),
                    (
                        r'load_(\w+)',    # Using wildcard to match all load avgs
                        {
                            'service_name': r'Load avg \1 minute',
                            'warn_above': 5,
                            'crit_above': 10
                        })
                ]
            )
        ]
    ),
    ('SomeCluster',
        [
        ]
    )
]

ganglia-nagios-bridge-1.1.0/nagios.cfg

# Typically these definitions are split into multiple configuration
# files within the /etc/nagios tree
# For convenience, they are all listed here in a single file
# to support the example configuration

# from /etc/nagios-plugins/config/dummy.cfg on Debian:
# return-unknown definition
define command {
        command_name    return-unknown
        command_line    /usr/lib/nagios/plugins/check_dummy 3
}

# from /etc/nagios3/conf.d/hostgroups_nagios2.cfg on Debian:
define hostgroup {
        hostgroup_name  all
        alias           All Servers
        members         *
}

# /etc/nagios3/conf.d/services_nagios2.cfg on Debian:

# this is checked passively by the ganglia-nagios-bridge utility
define service {
        hostgroup_name                  all
        service_description             Total processes
        use                             generic-service
        notification_interval           0
        check_command                   return-unknown
        active_checks_enabled           0
        passive_checks_enabled          1
}

# this is checked passively by the ganglia-nagios-bridge utility
define service {
        hostgroup_name                  all
        service_description             Load avg one minute
        use                             generic-service
        notification_interval           0
        check_command                   return-unknown
        active_checks_enabled           0
        passive_checks_enabled          1
}

# this is checked passively by the ganglia-nagios-bridge utility
define service {
        hostgroup_name                  all
        service_description             Load avg five minute
        use                             generic-service
        notification_interval           0
        check_command                   return-unknown
        active_checks_enabled           0
        passive_checks_enabled          1
}

# this is checked passively by the ganglia-nagios-bridge utility
define service {
        hostgroup_name                  all
        service_description             Load avg fifteen minute
        use                             generic-service
        notification_interval           0
        check_command                   return-unknown
        active_checks_enabled           0
        passive_checks_enabled          1
}
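
The service_description values above have to match, character for
character, the names that the bridge derives from the metric names.  The
sketch below is a stand-alone illustration of that mapping for the
load average entry in the example nagios-bridge.conf.  The function
names service_name_for and state_for are hypothetical and do not exist
in ganglia-nagios-bridge.py; the pattern, the template and the
thresholds are copied from the example configuration, and the order of
the threshold checks follows the script.

```python
import re

# Values copied from the load average entry in the example nagios-bridge.conf
METRIC_PATTERN = re.compile(r'load_(\w+)')
SERVICE_NAME_TMPL = r'Load avg \1 minute'
THRESHOLDS = {'warn_above': 5, 'crit_above': 10}


def service_name_for(metric_name):
    """Expand the service name template from the groups of the metric regex."""
    match_result = METRIC_PATTERN.match(metric_name)
    if match_result is None:
        return None  # unmatched metrics are ignored, not sent to Nagios
    return match_result.expand(SERVICE_NAME_TMPL)


def state_for(value, thresholds):
    """Map a numeric value to a Nagios state: 0 OK, 1 WARNING, 2 CRITICAL."""
    if 'crit_below' in thresholds and value < thresholds['crit_below']:
        return 2
    if 'warn_below' in thresholds and value < thresholds['warn_below']:
        return 1
    if 'crit_above' in thresholds and value > thresholds['crit_above']:
        return 2
    if 'warn_above' in thresholds and value > thresholds['warn_above']:
        return 1
    return 0


if __name__ == '__main__':
    for name in ('load_one', 'load_five', 'load_fifteen'):
        print(service_name_for(name))
```

With the Ganglia metrics load_one, load_five and load_fifteen this
yields the three "Load avg ... minute" descriptions declared above.
Because the pattern is matched only at the start of the name, any other
metric beginning with load_ would also produce a service name, and
Nagios would log a warning for it unless a matching service is defined.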