pax_global_header00006660000000000000000000000064144730711140014514gustar00rootroot0000000000000052 comment=68b230ccb2d45737367fd351ef25e4d351a5c799 check_patroni-1.0.0/000077500000000000000000000000001447307111400143235ustar00rootroot00000000000000check_patroni-1.0.0/.flake8000066400000000000000000000003051447307111400154740ustar00rootroot00000000000000[flake8] doctests = True ignore = # line too long E501, # line break before binary operator (added by black) W503, exclude = .git, .mypy_cache, .tox, .venv, mypy_config = mypy.ini check_patroni-1.0.0/.github/000077500000000000000000000000001447307111400156635ustar00rootroot00000000000000check_patroni-1.0.0/.github/workflows/000077500000000000000000000000001447307111400177205ustar00rootroot00000000000000check_patroni-1.0.0/.github/workflows/lint.yml000066400000000000000000000005021447307111400214060ustar00rootroot00000000000000name: Lint on: [push, pull_request] jobs: lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-python@v2 - name: Install tox run: pip install tox - name: Lint (black & flake8) run: tox -e lint - name: Mypy run: tox -e mypy check_patroni-1.0.0/.github/workflows/publish.yml000066400000000000000000000011561447307111400221140ustar00rootroot00000000000000name: Publish on: push: tags: - 'v*' jobs: publish: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-python@v2 with: python-version: '3.10' - name: Install run: python -m pip install setuptools wheel twine - name: Build run: | python setup.py check python setup.py sdist bdist_wheel python -m twine check dist/* - name: Publish run: python -m twine upload dist/* env: TWINE_USERNAME: __token__ TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }} check_patroni-1.0.0/.github/workflows/tests.yml000066400000000000000000000006771447307111400216170ustar00rootroot00000000000000name: Tests on: [push, pull_request] jobs: tests: runs-on: ubuntu-latest strategy: matrix: include: - python: "3.7" - python: "3.11" steps: - uses: actions/checkout@v2 - name: Setup Python uses: actions/setup-python@v2 with: python-version: ${{ matrix.python }} - name: Install tox run: pip install tox - name: Test run: tox -e py check_patroni-1.0.0/.gitignore000066400000000000000000000002171447307111400163130ustar00rootroot00000000000000__pycache__/ check_patroni.egg-info tests/*.state_file tests/config.ini vagrant/.vagrant vagrant/*.state_file .*.swp .venv/ .tox/ dist/ build/ check_patroni-1.0.0/CHANGELOG.md000066400000000000000000000037201447307111400161360ustar00rootroot00000000000000# Change log ## Unreleased ### Added ### Fixed ### Misc ## check_patroni 1.0.0 - 2023-08-28 Check patroni is now tagged as Production/Stable. ### Added * Add `sync_standby` as a valid replica type for `cluster_has_replica`. (contributed by @mattpoel) * Add info and options (`--sync-warning` and `--sync-critical`) about sync replica to `cluster_has_replica`. * Add a new service `cluster_has_scheduled_action` to warn of any scheduled switchover or restart. * Add options to `node_is_replica` to check specifically for a synchronous (`--is-sync`) or asynchronous node (`--is-async`). * Add `standby-leader` as a valid leader type for `cluster_has_leader`. * Add a new service `node_is_leader` to check if a node is a leader (which includes standby leader nodes) ### Fixed * Fix the `node_is_alive` check. 
(#31) * Fix the `cluster_has_replica` and `cluster_node_count` checks to account for the new replica state `streaming` introduced in v3.0.4 (#28, reported by @log1-c) ### Misc * Create CHANGELOG.md * Add tests for the output of the scripts in addition to the return code * Documentation in CONTRIBUTING.md ## check_patroni 0.2.0 - 2023-03-20 ### Added * Add a `--save` option when state files are used * Modify `-e/--endpoints` to allow a comma separated list of endpoints (#21, reported by @lihnjo) * Use requests instead of urllib3 (with extensive help from @dlax) * Change the way logging is handled (with extensive help from @dlax) ### Fix * Reverse the test for `node_is_pending` * SSL handling ### Misc * Several doc Fix and Updates * Use spellcheck and isort * Remove tests for python 3.6 * Add python tests for python 3.11 ## check_patroni 0.1.1 - 2022-07-15 The initial release covers the following checks : * check a cluster for + configuration change + presence of a leader + presence of a replica + maintenance status * check a node for + liveness + pending restart status + primary status + replica status + tl change + patroni version check_patroni-1.0.0/CONTRIBUTING.md000066400000000000000000000063111447307111400165550ustar00rootroot00000000000000# Contributing to check_patroni Thanks for your interest in contributing to check_patroni. ## Clone Git Repository Installation from the git repository: ``` $ git clone https://github.com/dalibo/check_patroni.git $ cd check_patroni ``` Change the branch if necessary. ## Create Python Virtual Environment You need a dedicated environment, install dependencies and then check_patroni from the repo: ``` $ python3 -m venv .venv $ . .venv/bin/activate (.venv) $ pip3 install .[test] (.venv) $ pip3 install -r requirements-dev.txt (.venv) $ check_patroni ``` To quit this env and destroy it: ``` $ deactivate $ rm -r .venv ``` ## Development Environment A vagrant file is available to create a icinga / opm / grafana stack and install check_patroni. You can then add a server to the supervision and watch the graphs in grafana. It's in the `vagrant` directory. A vagrant file can be found in [this repository](https://github.com/ioguix/vagrant-patroni) to generate a patroni/etcd setup. The `README.md` can be geneated with `./docs/make_readme.sh`. ## Executing Tests Crafting repeatable tests using a live Patroni cluster can be intricate. To simplify the development process, interactions with Patroni's API are substituted with a mock function that yields an HTTP return code and a JSON object outlining the cluster's status. The JSON files containing this information are housed in the `./tests/json` directory. An important consideration is that there is a potential drawback: if the JSON data is incorrect or if modifications have been made to Patroni without corresponding updates to the tests documented here, the tests might still pass erroneously. The tests are executed automatically for each PR using the ci (see `.github/workflow/lint.yml` and `.github/workflow/tests.yml`). 
Running the tests manually: * Using patroni's nominal replica state of `streaming` (since v3.0.4): ```bash pytest ./tests ``` * Using patroni's nominal replica state of `running` (before v3.0.4): ```bash pytest --use-old-replica-state ./tests ``` * Using tox: ```bash tox -e lint # mypy + flake8 + black + isort ° codespell tox # pytests and "lint" tests for all supported version of python tox -e py # pytests and "lint" tests for the default version of python ``` Please note that when dealing with any service that checks the state of a node in patroni's `cluster` endpoint, the corresponding JSON test file must be added in `./tests/tools.py`. A bash script, `check_patroni.sh`, is provided to facilitate testing all services on a Patroni endpoint (`./vagrant/check_patroni.sh`). It requires one parameter: the endpoint URL that will be used as the argument for the `-e/--endpoints` option of `check_patroni`. This script essentially compiles a list of service calls and executes them sequentially in a bash script. It creates a state file in the directory from which you run the script. Here's an example usage: ```bash ./vagrant/check_patroni.sh http://10.20.30.51:8008 ``` ## Release Update the Changelog. The package is generated and uploaded to pypi when a `v*` tag is created (see `.github/workflow/publish.yml`). Alternatively, the release can be done manually with: ``` tox -e build tox -e upload ``` check_patroni-1.0.0/LICENSE000066400000000000000000000016261447307111400153350ustar00rootroot00000000000000PostgreSQL Licence Copyright (c) 2022, DALIBO Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies. IN NO EVENT SHALL DALIBO BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF DALIBO HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. DALIBO SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND DALIBO HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. check_patroni-1.0.0/MANIFEST.in000066400000000000000000000003001447307111400160520ustar00rootroot00000000000000include *.md include mypy.ini include pytest.ini include tox.ini include .flake8 include pyproject.toml recursive-include docs *.sh recursive-include tests *.json recursive-include tests *.py check_patroni-1.0.0/README.md000066400000000000000000000320601447307111400156030ustar00rootroot00000000000000# check_patroni A nagios plugin for patroni. ## Features - Check presence of leader, replicas, node counts. - Check each node for replication status. ``` Usage: check_patroni [OPTIONS] COMMAND [ARGS]... Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster. Options: --config FILE Read option defaults from the specified INI file [default: config.ini] -e, --endpoints TEXT Patroni API endpoint. Can be specified multiple times or as a list of comma separated addresses. The node services checks the status of one node, therefore if several addresses are specified they should point to different interfaces on the same node. 
The cluster services check the status of the cluster, therefore it's better to give a list of all Patroni node addresses. [default: http://127.0.0.1:8008] --cert_file PATH File with the client certificate. --key_file PATH File with the client key. --ca_file PATH The CA certificate. -v, --verbose Increase verbosity -v (info)/-vv (warning)/-vvv (debug) --version --timeout INTEGER Timeout in seconds for the API queries (0 to disable) [default: 2] --help Show this message and exit. Commands: cluster_config_has_changed Check if the hash of the configuration... cluster_has_leader Check if the cluster has a leader. cluster_has_replica Check if the cluster has healthy replicas... cluster_has_scheduled_action Check if the cluster has a scheduled... cluster_is_in_maintenance Check if the cluster is in maintenance... cluster_node_count Count the number of nodes in the cluster. node_is_alive Check if the node is alive ie patroni is... node_is_leader Check if the node is a leader node. node_is_pending_restart Check if the node is in pending restart... node_is_primary Check if the node is the primary with the... node_is_replica Check if the node is a running replica... node_patroni_version Check if the version is equal to the input node_tl_has_changed Check if the timeline has changed. ``` ## Install check_patroni is licensed under PostgreSQL license. ``` $ pip install git+https://github.com/dalibo/check_patroni.git ``` check_patroni works on python 3.6, we keep it that way because patroni also supports it and there are still lots of RH 7 variants around. That being said python 3.6 has been EOL for age and there is no support for it in the github CI. ## Support If you hit a bug or need help, open a [GitHub issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no commitment on response time for public free support. Thanks for you contribution ! ## Config file All global and service specific parameters can be specified via a config file has follows: ``` [options] endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008 cert_file = ./ssl/my-cert.pem key_file = ./ssl/my-key.pem ca_file = ./ssl/CA-cert.pem timeout = 0 [options.node_is_replica] lag=100 ``` ## Thresholds The format for the threshold parameters is `[@][start:][end]`. * `start:` may be omitted if `start == 0` * `~:` means that start is negative infinity * If `end` is omitted, infinity is assumed * To invert the match condition, prefix the range expression with `@`. A match is found when: `start <= VALUE <= end`. For example, the following command will raise: * a warning if there is less than 1 nodes, wich can be translated to outside of range [2;+INF[ * a critical if there are no nodes, wich can be translated to outside of range [1;+INF[ ``` check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1: ``` ## SSL Several options are available: * the server's CA certificate is not available or trusted by the client system: * `--ca_cert`: your certification chain `cat CA-certificate server-certificate > cabundle` * you have a client certificate for authenticating with Patroni's REST API: * `--cert_file`: your certificate or the concatenation of your certificate and private key * `--key_file`: your private key (optional) ## Cluster services ### cluster_config_has_changed ``` Usage: check_patroni cluster_config_has_changed [OPTIONS] Check if the hash of the configuration has changed. Note: either a hash or a state file must be provided for this service to work. 
Check: * `OK`: The hash didn't change * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state_file`) Perfdata: * `is_configuration_changed` is 1 if the configuration has changed Options: --hash TEXT A hash to compare with. -s, --state-file TEXT A state file to store the hash of the configuration. --save Set the current configuration hash as the reference for future calls. --help Show this message and exit. ``` ### cluster_has_leader ``` Usage: check_patroni cluster_has_leader [OPTIONS] Check if the cluster has a leader. This check applies to any kind of leaders including standby leaders. Check: * `OK`: if there is a leader node. * `CRITICAL`: otherwise Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise Options: --help Show this message and exit. ``` ### cluster_has_replica ``` Usage: check_patroni cluster_has_replica [OPTIONS] Check if the cluster has healthy replicas and/or if some are sync standbies A healthy replica: * is in running or streaming state (V3.0.4) * has a replica or sync_standby role * has a lag lower or equal to max_lag Check: * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold. and if the sync_replica count is compatible with the sync replica count threshold. * `WARNING` / `CRITICAL`: otherwise Perfdata: * healthy_replica & unhealthy_replica count * the number of sync_replica, they are included in the previous count * the lag of each replica labelled with "member name"_lag * a boolean to tell if the node is a sync stanbdy labelled with "member name"_sync Options: -w, --warning TEXT Warning threshold for the number of healthy replica nodes. -c, --critical TEXT Critical threshold for the number of healthy replica nodes. --sync-warning TEXT Warning threshold for the number of sync replica. --sync-critical TEXT Critical threshold for the number of sync replica. --max-lag TEXT maximum allowed lag --help Show this message and exit. ``` ### cluster_has_scheduled_action ``` Usage: check_patroni cluster_has_scheduled_action [OPTIONS] Check if the cluster has a scheduled action (switchover or restart) Check: * `OK`: If the cluster has no scheduled action * `CRITICAL`: otherwise. Perfdata: * `scheduled_actions` is 1 if the cluster has scheduled actions. * `scheduled_switchover` is 1 if the cluster has a scheduled switchover. * `scheduled_restart` counts the number of scheduled restart in the cluster. Options: --help Show this message and exit. ``` ### cluster_is_in_maintenance ``` Usage: check_patroni cluster_is_in_maintenance [OPTIONS] Check if the cluster is in maintenance mode or paused. Check: * `OK`: If the cluster is in maintenance mode. * `CRITICAL`: otherwise. Perfdata: * `is_in_maintenance` is 1 the cluster is in maintenance mode, 0 otherwise Options: --help Show this message and exit. ``` ### cluster_node_count ``` Usage: check_patroni cluster_node_count [OPTIONS] Count the number of nodes in the cluster. The state refers to the state of PostgreSQL. Possible values are: * initializing new cluster, initdb failed * running custom bootstrap script, custom bootstrap failed * starting, start failed * restarting, restart failed * running, streaming (for a replica V3.0.4) * stopping, stopped, stop failed * creating replica * crashed The role refers to the role of the server in the cluster. 
Possible values are: * master or leader (V3.0.0+) * replica * demoted * promoted * uninitialized Check: * Compares the number of nodes against the normal and healthy (running + streaming) nodes warning and critical thresholds. * `OK`: If they are not provided. Perfdata: * `members`: the member count. * `healthy_members`: the running and streaming member count. * all the roles of the nodes in the cluster with their count (start with "role_"). * all the statuses of the nodes in the cluster with their count (start with "state_"). Options: -w, --warning TEXT Warning threshold for the number of nodes. -c, --critical TEXT Critical threshold for the number of nodes. --healthy-warning TEXT Warning threshold for the number of healthy nodes (running + streaming). --healthy-critical TEXT Critical threshold for the number of healthy nodes (running + streaming). --help Show this message and exit. ``` ## Node services ### node_is_alive ``` Usage: check_patroni node_is_alive [OPTIONS] Check if the node is alive ie patroni is running. This is a liveness check as defined in Patroni's documentation. Check: * `OK`: If patroni is running. * `CRITICAL`: otherwise. Perfdata: * `is_running` is 1 if patroni is running, 0 otherwise Options: --help Show this message and exit. ``` ### node_is_pending_restart ``` Usage: check_patroni node_is_pending_restart [OPTIONS] Check if the node is in pending restart state. This situation can arise if the configuration has been modified but requiers a restart of PostgreSQL to take effect. Check: * `OK`: if the node has no pending restart tag. * `CRITICAL`: otherwise Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0 otherwise. Options: --help Show this message and exit. ``` ### node_is_leader ``` Usage: check_patroni node_is_leader [OPTIONS] Check if the node is a leader node. This check applies to any kind of leaders including standby leaders. To check explicitly for a standby leader use the `--is-standby-leader` option. Check: * `OK`: if the node is a leader. * `CRITICAL:` otherwise Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise. Options: --is-standby-leader Check for a standby leader --help Show this message and exit. ``` ### node_is_primary ``` Usage: check_patroni node_is_primary [OPTIONS] Check if the node is the primary with the leader lock. This service is not valid for a standby leader, because this kind of node is not a primary. Check: * `OK`: if the node is a primary with the leader lock. * `CRITICAL:` otherwise Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0 otherwise. Options: --help Show this message and exit. ``` ### node_is_replica ``` Usage: check_patroni node_is_replica [OPTIONS] Check if the node is a running replica with no noloadbalance tag. It is possible to check if the node is synchronous or asynchronous. If nothing is specified any kind of replica is accepted. When checking for a synchronous replica, it's not possible to specify a lag. Check: * `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold. * `CRITICAL`: otherwise Perfdata: `is_replica` is 1 if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold, 0 otherwise. Options: --max-lag TEXT maximum allowed lag --is-sync check if the replica is synchronous --is-async check if the replica is asynchronous --help Show this message and exit. 
``` ### node_patroni_version ``` Usage: check_patroni node_patroni_version [OPTIONS] Check if the version is equal to the input Check: * `OK`: The version is the same as the input `--patroni-version` * `CRITICAL`: otherwise. Perfdata: * `is_version_ok` is 1 if version is ok, 0 otherwise Options: --patroni-version TEXT Patroni version to compare to [required] --help Show this message and exit. ``` ### node_tl_has_changed ``` Usage: check_patroni node_tl_has_changed [OPTIONS] Check if the timeline has changed. Note: either a timeline or a state file must be provided for this service to work. Check: * `OK`: The timeline is the same as last time (`--state_file`) or the inputted timeline (`--timeline`) * `CRITICAL`: The tl is not the same. Perfdata: * `is_timeline_changed` is 1 if the tl has changed, 0 otherwise * the timeline Options: --timeline TEXT A timeline number to compare with. -s, --state-file TEXT A state file to store the last tl number into. --save Set the current timeline number as the reference for future calls. --help Show this message and exit. ``` check_patroni-1.0.0/check_patroni/000077500000000000000000000000001447307111400171345ustar00rootroot00000000000000check_patroni-1.0.0/check_patroni/__init__.py000066400000000000000000000001321447307111400212410ustar00rootroot00000000000000import logging __version__ = "1.0.0" _log: logging.Logger = logging.getLogger(__name__) check_patroni-1.0.0/check_patroni/__main__.py000066400000000000000000000000751447307111400212300ustar00rootroot00000000000000from .cli import main if __name__ == "__main__": main() check_patroni-1.0.0/check_patroni/cli.py000066400000000000000000000535341447307111400202670ustar00rootroot00000000000000import logging import re from configparser import ConfigParser from typing import List import click import nagiosplugin from . import __version__, _log from .cluster import ( ClusterConfigHasChanged, ClusterConfigHasChangedSummary, ClusterHasLeader, ClusterHasLeaderSummary, ClusterHasReplica, ClusterHasScheduledAction, ClusterIsInMaintenance, ClusterNodeCount, ) from .convert import size_to_byte from .node import ( NodeIsAlive, NodeIsAliveSummary, NodeIsLeader, NodeIsLeaderSummary, NodeIsPendingRestart, NodeIsPendingRestartSummary, NodeIsPrimary, NodeIsPrimarySummary, NodeIsReplica, NodeIsReplicaSummary, NodePatroniVersion, NodePatroniVersionSummary, NodeTLHasChanged, NodeTLHasChangedSummary, ) from .types import ConnectionInfo, Parameters DEFAULT_CFG = "config.ini" handler = logging.StreamHandler() handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s")) _log.addHandler(handler) def print_version(ctx: click.Context, param: str, value: str) -> None: if not value or ctx.resilient_parsing: return click.echo(f"Version {__version__}") ctx.exit() def configure(ctx: click.Context, param: str, filename: str) -> None: """Use a config file for the parameters stolen from https://jwodder.github.io/kbits/posts/click-config/ """ # FIXME should use click-configfile / click-config-file ? 
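# Read the INI file and copy every [options] / [options.<command>] section into
# click's default_map so the values become per-command option defaults; a comma
# separated "endpoints" entry is split into a list of addresses.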
cfg = ConfigParser() cfg.read(filename) ctx.default_map = {} for sect in cfg.sections(): command_path = sect.split(".") if command_path[0] != "options": continue defaults = ctx.default_map for cmdname in command_path[1:]: defaults = defaults.setdefault(cmdname, {}) defaults.update(cfg[sect]) try: # endpoints is an array of addresses separated by , if isinstance(defaults["endpoints"], str): defaults["endpoints"] = re.split(r"\s*,\s*", defaults["endpoints"]) except KeyError: pass @click.group() @click.option( "--config", type=click.Path(dir_okay=False), default=DEFAULT_CFG, callback=configure, is_eager=True, expose_value=False, help="Read option defaults from the specified INI file", show_default=True, ) @click.option( "-e", "--endpoints", "endpoints", type=str, multiple=True, default=["http://127.0.0.1:8008"], help=( "Patroni API endpoint. Can be specified multiple times or as a list " "of comma separated addresses. " "The node services checks the status of one node, therefore if " "several addresses are specified they should point to different " "interfaces on the same node. The cluster services check the " "status of the cluster, therefore it's better to give a list of " "all Patroni node addresses." ), show_default=True, ) @click.option( "--cert_file", "cert_file", type=click.Path(exists=True), default=None, help="File with the client certificate.", ) @click.option( "--key_file", "key_file", type=click.Path(exists=True), default=None, help="File with the client key.", ) @click.option( "--ca_file", "ca_file", type=click.Path(exists=True), default=None, help="The CA certificate.", ) @click.option( "-v", "--verbose", "verbose", count=True, default=0, help="Increase verbosity -v (info)/-vv (warning)/-vvv (debug)", show_default=False, ) @click.option( "--version", is_flag=True, callback=print_version, expose_value=False, is_eager=True ) @click.option( "--timeout", "timeout", default=2, type=int, help="Timeout in seconds for the API queries (0 to disable)", show_default=True, ) @click.pass_context @nagiosplugin.guarded def main( ctx: click.Context, endpoints: List[str], cert_file: str, key_file: str, ca_file: str, verbose: int, timeout: int, ) -> None: """Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.""" # FIXME Not all "is/has" services have the same return code for ok. Check if it's ok # We use this to pass parameters instead of ctx.parent.params because the # latter is typed as Optional[Context] and mypy complains with the following # error unless we test if ctx.parent is none which looked ugly. 
# # error: Item "None" of "Optional[Context]" has an attribute "params" [union-attr] # The config file allows endpoints to be specified as a comma separated list of endpoints # To avoid confusion, We allow the same in command line parameters tendpoints: List[str] = [] for e in endpoints: tendpoints += re.split(r"\s*,\s*", e) endpoints = tendpoints if verbose == 3: logging.getLogger("urllib3").addHandler(handler) logging.getLogger("urllib3").setLevel(logging.DEBUG) _log.setLevel(logging.DEBUG) connection_info: ConnectionInfo if cert_file is None and key_file is None: connection_info = ConnectionInfo(endpoints, None, ca_file) else: connection_info = ConnectionInfo(endpoints, (cert_file, key_file), ca_file) ctx.obj = Parameters( connection_info, timeout, verbose, ) @main.command(name="cluster_node_count") # required otherwise _ are converted to - @click.option( "-w", "--warning", "warning", type=str, help="Warning threshold for the number of nodes.", ) @click.option( "-c", "--critical", "critical", type=str, help="Critical threshold for the number of nodes.", ) @click.option( "--healthy-warning", "healthy_warning", type=str, help="Warning threshold for the number of healthy nodes (running + streaming).", ) @click.option( "--healthy-critical", "healthy_critical", type=str, help="Critical threshold for the number of healthy nodes (running + streaming).", ) @click.pass_context @nagiosplugin.guarded def cluster_node_count( ctx: click.Context, warning: str, critical: str, healthy_warning: str, healthy_critical: str, ) -> None: """Count the number of nodes in the cluster. \b The state refers to the state of PostgreSQL. Possible values are: * initializing new cluster, initdb failed * running custom bootstrap script, custom bootstrap failed * starting, start failed * restarting, restart failed * running, streaming (for a replica V3.0.4) * stopping, stopped, stop failed * creating replica * crashed \b The role refers to the role of the server in the cluster. Possible values are: * master or leader (V3.0.0+) * replica * demoted * promoted * uninitialized \b Check: * Compares the number of nodes against the normal and healthy (running + streaming) nodes warning and critical thresholds. * `OK`: If they are not provided. \b Perfdata: * `members`: the member count. * `healthy_members`: the running and streaming member count. * all the roles of the nodes in the cluster with their count (start with "role_"). * all the statuses of the nodes in the cluster with their count (start with "state_"). """ check = nagiosplugin.Check() check.add( ClusterNodeCount(ctx.obj.connection_info), nagiosplugin.ScalarContext( "members", warning, critical, ), nagiosplugin.ScalarContext( "healthy_members", healthy_warning, healthy_critical, ), nagiosplugin.ScalarContext("member_roles"), nagiosplugin.ScalarContext("member_statuses"), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="cluster_has_leader") @click.pass_context @nagiosplugin.guarded def cluster_has_leader(ctx: click.Context) -> None: """Check if the cluster has a leader. This check applies to any kind of leaders including standby leaders. \b Check: * `OK`: if there is a leader node. 
* `CRITICAL`: otherwise Perfdata: `has_leader` is 1 if there is a leader node, 0 otherwise """ check = nagiosplugin.Check() check.add( ClusterHasLeader(ctx.obj.connection_info), nagiosplugin.ScalarContext("has_leader", None, "@0:0"), ClusterHasLeaderSummary(), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="cluster_has_replica") @click.option( "-w", "--warning", "warning", type=str, help="Warning threshold for the number of healthy replica nodes.", ) @click.option( "-c", "--critical", "critical", type=str, help="Critical threshold for the number of healthy replica nodes.", ) @click.option( "--sync-warning", "sync_warning", type=str, help="Warning threshold for the number of sync replica.", ) @click.option( "--sync-critical", "sync_critical", type=str, help="Critical threshold for the number of sync replica.", ) @click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag") @click.pass_context @nagiosplugin.guarded def cluster_has_replica( ctx: click.Context, warning: str, critical: str, sync_warning: str, sync_critical: str, max_lag: str, ) -> None: """Check if the cluster has healthy replicas and/or if some are sync standbies \b A healthy replica: * is in running or streaming state (V3.0.4) * has a replica or sync_standby role * has a lag lower or equal to max_lag \b Check: * `OK`: if the healthy_replica count and their lag are compatible with the replica count threshold. and if the sync_replica count is compatible with the sync replica count threshold. * `WARNING` / `CRITICAL`: otherwise \b Perfdata: * healthy_replica & unhealthy_replica count * the number of sync_replica, they are included in the previous count * the lag of each replica labelled with "member name"_lag * a boolean to tell if the node is a sync stanbdy labelled with "member name"_sync """ tmax_lag = size_to_byte(max_lag) if max_lag is not None else None check = nagiosplugin.Check() check.add( ClusterHasReplica(ctx.obj.connection_info, tmax_lag), nagiosplugin.ScalarContext( "healthy_replica", warning, critical, ), nagiosplugin.ScalarContext( "sync_replica", sync_warning, sync_critical, ), nagiosplugin.ScalarContext("unhealthy_replica"), nagiosplugin.ScalarContext("replica_lag"), nagiosplugin.ScalarContext("replica_sync"), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="cluster_config_has_changed") @click.option("--hash", "config_hash", type=str, help="A hash to compare with.") @click.option( "-s", "--state-file", "state_file", type=str, help="A state file to store the hash of the configuration.", ) @click.option( "--save", "save_config", is_flag=True, default=False, help="Set the current configuration hash as the reference for future calls.", ) @click.pass_context @nagiosplugin.guarded def cluster_config_has_changed( ctx: click.Context, config_hash: str, state_file: str, save_config: bool ) -> None: """Check if the hash of the configuration has changed. Note: either a hash or a state file must be provided for this service to work. 
\b Check: * `OK`: The hash didn't change * `CRITICAL`: The hash of the configuration has changed compared to the input (`--hash`) or last time (`--state_file`) \b Perfdata: * `is_configuration_changed` is 1 if the configuration has changed """ # Note: hash cannot be in the perf data = not a number if (config_hash is None and state_file is None) or ( config_hash is not None and state_file is not None ): raise click.UsageError( "Either --hash or --state-file should be provided for this service", ctx ) old_config_hash = config_hash if state_file is not None: cookie = nagiosplugin.Cookie(state_file) cookie.open() old_config_hash = cookie.get("hash") cookie.close() check = nagiosplugin.Check() check.add( ClusterConfigHasChanged( ctx.obj.connection_info, old_config_hash, state_file, save_config ), nagiosplugin.ScalarContext("is_configuration_changed", None, "@1:1"), ClusterConfigHasChangedSummary(old_config_hash), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="cluster_is_in_maintenance") @click.pass_context @nagiosplugin.guarded def cluster_is_in_maintenance(ctx: click.Context) -> None: """Check if the cluster is in maintenance mode or paused. \b Check: * `OK`: If the cluster is in maintenance mode. * `CRITICAL`: otherwise. \b Perfdata: * `is_in_maintenance` is 1 the cluster is in maintenance mode, 0 otherwise """ check = nagiosplugin.Check() check.add( ClusterIsInMaintenance(ctx.obj.connection_info), nagiosplugin.ScalarContext("is_in_maintenance", None, "0:0"), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="cluster_has_scheduled_action") @click.pass_context @nagiosplugin.guarded def cluster_has_scheduled_action(ctx: click.Context) -> None: """Check if the cluster has a scheduled action (switchover or restart) \b Check: * `OK`: If the cluster has no scheduled action * `CRITICAL`: otherwise. \b Perfdata: * `scheduled_actions` is 1 if the cluster has scheduled actions. * `scheduled_switchover` is 1 if the cluster has a scheduled switchover. * `scheduled_restart` counts the number of scheduled restart in the cluster. """ check = nagiosplugin.Check() check.add( ClusterHasScheduledAction(ctx.obj.connection_info), nagiosplugin.ScalarContext("has_scheduled_actions", None, "0:0"), nagiosplugin.ScalarContext("scheduled_switchover"), nagiosplugin.ScalarContext("scheduled_restart"), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_primary") @click.pass_context @nagiosplugin.guarded def node_is_primary(ctx: click.Context) -> None: """Check if the node is the primary with the leader lock. This service is not valid for a standby leader, because this kind of node is not a primary. \b Check: * `OK`: if the node is a primary with the leader lock. * `CRITICAL:` otherwise Perfdata: `is_primary` is 1 if the node is a primary with the leader lock, 0 otherwise. """ check = nagiosplugin.Check() check.add( NodeIsPrimary(ctx.obj.connection_info), nagiosplugin.ScalarContext("is_primary", None, "@0:0"), NodeIsPrimarySummary(), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_leader") @click.option( "--is-standby-leader", "check_standby_leader", is_flag=True, default=False, help="Check for a standby leader", ) @click.pass_context @nagiosplugin.guarded def node_is_leader(ctx: click.Context, check_standby_leader: bool) -> None: """Check if the node is a leader node. This check applies to any kind of leaders including standby leaders. 
To check explicitly for a standby leader use the `--is-standby-leader` option. \b Check: * `OK`: if the node is a leader. * `CRITICAL:` otherwise Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise. """ check = nagiosplugin.Check() check.add( NodeIsLeader(ctx.obj.connection_info, check_standby_leader), nagiosplugin.ScalarContext("is_leader", None, "@0:0"), NodeIsLeaderSummary(check_standby_leader), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_replica") @click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag") @click.option( "--is-sync", "check_is_sync", is_flag=True, default=False, help="check if the replica is synchronous", ) @click.option( "--is-async", "check_is_async", is_flag=True, default=False, help="check if the replica is asynchronous", ) @click.pass_context @nagiosplugin.guarded def node_is_replica( ctx: click.Context, max_lag: str, check_is_sync: bool, check_is_async: bool ) -> None: """Check if the node is a running replica with no noloadbalance tag. It is possible to check if the node is synchronous or asynchronous. If nothing is specified any kind of replica is accepted. When checking for a synchronous replica, it's not possible to specify a lag. \b Check: * `OK`: if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold. * `CRITICAL`: otherwise Perfdata: `is_replica` is 1 if the node is a running replica with noloadbalance tag and the lag is under the maximum threshold, 0 otherwise. """ if check_is_sync and max_lag is not None: raise click.UsageError( "--is-sync and --max-lag cannot be provided at the same time for this service", ctx, ) if check_is_sync and check_is_async: raise click.UsageError( "--is-sync and --is-async cannot be provided at the same time for this service", ctx, ) check = nagiosplugin.Check() check.add( NodeIsReplica(ctx.obj.connection_info, max_lag, check_is_sync, check_is_async), nagiosplugin.ScalarContext("is_replica", None, "@0:0"), NodeIsReplicaSummary(max_lag, check_is_sync, check_is_async), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_pending_restart") @click.pass_context @nagiosplugin.guarded def node_is_pending_restart(ctx: click.Context) -> None: """Check if the node is in pending restart state. This situation can arise if the configuration has been modified but requiers a restart of PostgreSQL to take effect. \b Check: * `OK`: if the node has no pending restart tag. * `CRITICAL`: otherwise Perfdata: `is_pending_restart` is 1 if the node has pending restart tag, 0 otherwise. """ check = nagiosplugin.Check() check.add( NodeIsPendingRestart(ctx.obj.connection_info), nagiosplugin.ScalarContext("is_pending_restart", None, "0:0"), NodeIsPendingRestartSummary(), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_tl_has_changed") @click.option( "--timeline", "timeline", type=str, help="A timeline number to compare with." ) @click.option( "-s", "--state-file", "state_file", type=str, help="A state file to store the last tl number into.", ) @click.option( "--save", "save_tl", is_flag=True, default=False, help="Set the current timeline number as the reference for future calls.", ) @click.pass_context @nagiosplugin.guarded def node_tl_has_changed( ctx: click.Context, timeline: str, state_file: str, save_tl: bool ) -> None: """Check if the timeline has changed. Note: either a timeline or a state file must be provided for this service to work. 
\b Check: * `OK`: The timeline is the same as last time (`--state_file`) or the inputted timeline (`--timeline`) * `CRITICAL`: The tl is not the same. \b Perfdata: * `is_timeline_changed` is 1 if the tl has changed, 0 otherwise * the timeline """ if (timeline is None and state_file is None) or ( timeline is not None and state_file is not None ): raise click.UsageError( "Either --timeline or --state-file should be provided for this service", ctx ) old_timeline = timeline if state_file is not None: cookie = nagiosplugin.Cookie(state_file) cookie.open() old_timeline = cookie.get("timeline") cookie.close() check = nagiosplugin.Check() check.add( NodeTLHasChanged(ctx.obj.connection_info, old_timeline, state_file, save_tl), nagiosplugin.ScalarContext("is_timeline_changed", None, "@1:1"), nagiosplugin.ScalarContext("timeline"), NodeTLHasChangedSummary(old_timeline), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_patroni_version") @click.option( "--patroni-version", "patroni_version", type=str, help="Patroni version to compare to", required=True, ) @click.pass_context @nagiosplugin.guarded def node_patroni_version(ctx: click.Context, patroni_version: str) -> None: """Check if the version is equal to the input \b Check: * `OK`: The version is the same as the input `--patroni-version` * `CRITICAL`: otherwise. \b Perfdata: * `is_version_ok` is 1 if version is ok, 0 otherwise """ # TODO the version cannot be written in perfdata find something else ? check = nagiosplugin.Check() check.add( NodePatroniVersion(ctx.obj.connection_info, patroni_version), nagiosplugin.ScalarContext("is_version_ok", None, "@0:0"), nagiosplugin.ScalarContext("patroni_version"), NodePatroniVersionSummary(patroni_version), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_alive") @click.pass_context @nagiosplugin.guarded def node_is_alive(ctx: click.Context) -> None: """Check if the node is alive ie patroni is running. This is a liveness check as defined in Patroni's documentation. \b Check: * `OK`: If patroni is running. * `CRITICAL`: otherwise. \b Perfdata: * `is_running` is 1 if patroni is running, 0 otherwise """ check = nagiosplugin.Check() check.add( NodeIsAlive(ctx.obj.connection_info), nagiosplugin.ScalarContext("is_alive", None, "@0:0"), NodeIsAliveSummary(), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) check_patroni-1.0.0/check_patroni/cluster.py000066400000000000000000000201351447307111400211700ustar00rootroot00000000000000import hashlib import json from collections import Counter from typing import Iterable, Union import nagiosplugin from . 
import _log from .types import ConnectionInfo, PatroniResource, handle_unknown def replace_chars(text: str) -> str: return text.replace("'", "").replace(" ", "_") class ClusterNodeCount(PatroniResource): def probe(self: "ClusterNodeCount") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") role_counters: Counter[str] = Counter() roles = [] status_counters: Counter[str] = Counter() statuses = [] for member in item_dict["members"]: roles.append(replace_chars(member["role"])) statuses.append(replace_chars(member["state"])) role_counters.update(roles) status_counters.update(statuses) # The actual check: members, healthy_members yield nagiosplugin.Metric("members", len(item_dict["members"])) yield nagiosplugin.Metric( "healthy_members", status_counters["running"] + status_counters.get("streaming", 0), ) # The performance data : role for role in role_counters: yield nagiosplugin.Metric( f"role_{role}", role_counters[role], context="member_roles" ) # The performance data : statuses (except running) for state in status_counters: yield nagiosplugin.Metric( f"state_{state}", status_counters[state], context="member_statuses" ) class ClusterHasLeader(PatroniResource): def probe(self: "ClusterHasLeader") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") is_leader_found = False for member in item_dict["members"]: if ( member["role"] in ("leader", "standby_leader") and member["state"] == "running" ): is_leader_found = True break return [ nagiosplugin.Metric( "has_leader", 1 if is_leader_found else 0, ) ] class ClusterHasLeaderSummary(nagiosplugin.Summary): def ok(self: "ClusterHasLeaderSummary", results: nagiosplugin.Result) -> str: return "The cluster has a running leader." @handle_unknown def problem(self: "ClusterHasLeaderSummary", results: nagiosplugin.Result) -> str: return "The cluster has no running leader." class ClusterHasReplica(PatroniResource): def __init__( self: "ClusterHasReplica", connection_info: ConnectionInfo, max_lag: Union[int, None], ): super().__init__(connection_info) self.max_lag = max_lag def probe(self: "ClusterHasReplica") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") replicas = [] healthy_replica = 0 unhealthy_replica = 0 sync_replica = 0 for member in item_dict["members"]: # FIXME are there other acceptable states if member["role"] in ["replica", "sync_standby"]: # patroni 3.0.4 changed the standby state from running to streaming if ( member["state"] in ["running", "streaming"] and member["lag"] != "unknown" ): replicas.append( { "name": member["name"], "lag": member["lag"], "sync": 1 if member["role"] == "sync_standby" else 0, } ) if member["role"] == "sync_standby": sync_replica += 1 if self.max_lag is None or self.max_lag >= int(member["lag"]): healthy_replica += 1 continue unhealthy_replica += 1 # The actual check yield nagiosplugin.Metric("healthy_replica", healthy_replica) yield nagiosplugin.Metric("sync_replica", sync_replica) # The performance data : unhealthy replica count, replicas lag yield nagiosplugin.Metric("unhealthy_replica", unhealthy_replica) for replica in replicas: yield nagiosplugin.Metric( f"{replica['name']}_lag", replica["lag"], context="replica_lag" ) yield nagiosplugin.Metric( f"{replica['name']}_sync", replica["sync"], context="replica_sync" ) # FIXME is this needed ?? 
# class ClusterHasReplicaSummary(nagiosplugin.Summary): # def ok(self, results): # def problem(self, results): class ClusterConfigHasChanged(PatroniResource): def __init__( self: "ClusterConfigHasChanged", connection_info: ConnectionInfo, config_hash: str, # Always contains the old hash state_file: str, # Only used to update the hash in the state_file (when needed) save: bool = False, # Save the configuration ): super().__init__(connection_info) self.state_file = state_file self.config_hash = config_hash self.save = save def probe(self: "ClusterConfigHasChanged") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("config") new_hash = hashlib.md5(json.dumps(item_dict).encode()).hexdigest() _log.debug("save result: %(issave)s", {"issave": self.save}) old_hash = self.config_hash if self.state_file is not None and self.save: _log.debug( "saving new hash to state file / cookie %(state_file)s", {"state_file": self.state_file}, ) cookie = nagiosplugin.Cookie(self.state_file) cookie.open() cookie["hash"] = new_hash cookie.commit() cookie.close() _log.debug( "hash info: old hash %(old_hash)s, new hash %(new_hash)s", {"old_hash": old_hash, "new_hash": new_hash}, ) return [ nagiosplugin.Metric( "is_configuration_changed", 1 if new_hash != old_hash else 0, ) ] class ClusterConfigHasChangedSummary(nagiosplugin.Summary): def __init__(self: "ClusterConfigHasChangedSummary", config_hash: str) -> None: self.old_config_hash = config_hash # Note: It would be helpful to display the old / new hash here. Unfortunately, it's not a metric. # So we only have the old / expected one. def ok(self: "ClusterConfigHasChangedSummary", results: nagiosplugin.Result) -> str: return f"The hash of patroni's dynamic configuration has not changed ({self.old_config_hash})." @handle_unknown def problem( self: "ClusterConfigHasChangedSummary", results: nagiosplugin.Result ) -> str: return f"The hash of patroni's dynamic configuration has changed. The old hash was {self.old_config_hash}." class ClusterIsInMaintenance(PatroniResource): def probe(self: "ClusterIsInMaintenance") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") # The actual check return [ nagiosplugin.Metric( "is_in_maintenance", 1 if "pause" in item_dict and item_dict["pause"] else 0, ) ] class ClusterHasScheduledAction(PatroniResource): def probe(self: "ClusterIsInMaintenance") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") scheduled_switchover = 0 scheduled_restart = 0 if "scheduled_switchover" in item_dict: scheduled_switchover = 1 for member in item_dict["members"]: if "scheduled_restart" in member: scheduled_restart += 1 # The actual check yield nagiosplugin.Metric( "has_scheduled_actions", 1 if (scheduled_switchover + scheduled_restart) > 0 else 0, ) # The performance data : scheduled_switchover, scheduled action count yield nagiosplugin.Metric("scheduled_switchover", scheduled_switchover) yield nagiosplugin.Metric("scheduled_restart", scheduled_restart) check_patroni-1.0.0/check_patroni/convert.py000066400000000000000000000027231447307111400211720ustar00rootroot00000000000000import re from typing import Tuple, Union import click def size_to_byte(value: str) -> int: """Convert any size to Byte >>> size_to_byte('1TB') 1099511627776 >>> size_to_byte('5kB') 5120 >>> size_to_byte('.5kB') 512 >>> size_to_byte('.5 yoyo') Traceback (most recent call last): ... 
click.exceptions.BadParameter: Invalid unit for size f{value} """ convert = { "B": 1, "kB": 1024, "MB": 1024 * 1024, "GB": 1024 * 1024 * 1024, "TB": 1024 * 1024 * 1024 * 1024, } val, unit = strtod(value) if val is None: val = 1 if unit is None: # No unit, all good # we can round half bytes dont really make sense return round(val) else: try: multiplicateur = convert[unit] except KeyError: raise click.BadParameter("Invalid unit for size f{value}") # we can round half bytes dont really make sense return round(val * multiplicateur) DBL_RE = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?") def strtod(value: str) -> Tuple[Union[float, None], Union[str, None]]: """As most as possible close equivalent of strtod(3) function used by postgres to parse parameter values. >>> strtod(' A ') == (None, 'A') True """ value = str(value).strip() match = DBL_RE.match(value) if match: end = match.end() return float(value[:end]), value[end:] return None, value check_patroni-1.0.0/check_patroni/node.py000066400000000000000000000220361447307111400204360ustar00rootroot00000000000000from typing import Iterable import nagiosplugin from . import _log from .types import APIError, ConnectionInfo, PatroniResource, handle_unknown class NodeIsPrimary(PatroniResource): def probe(self: "NodeIsPrimary") -> Iterable[nagiosplugin.Metric]: try: self.rest_api("primary") except APIError: return [nagiosplugin.Metric("is_primary", 0)] return [nagiosplugin.Metric("is_primary", 1)] class NodeIsPrimarySummary(nagiosplugin.Summary): def ok(self: "NodeIsPrimarySummary", results: nagiosplugin.Result) -> str: return "This node is the primary with the leader lock." @handle_unknown def problem(self: "NodeIsPrimarySummary", results: nagiosplugin.Result) -> str: return "This node is not the primary with the leader lock." class NodeIsLeader(PatroniResource): def __init__( self: "NodeIsLeader", connection_info: ConnectionInfo, check_is_standby_leader: bool, ) -> None: super().__init__(connection_info) self.check_is_standby_leader = check_is_standby_leader def probe(self: "NodeIsLeader") -> Iterable[nagiosplugin.Metric]: apiname = "leader" if self.check_is_standby_leader: apiname = "standby-leader" try: self.rest_api(apiname) except APIError: return [nagiosplugin.Metric("is_leader", 0)] return [nagiosplugin.Metric("is_leader", 1)] class NodeIsLeaderSummary(nagiosplugin.Summary): def __init__( self: "NodeIsLeaderSummary", check_is_standby_leader: bool, ) -> None: if check_is_standby_leader: self.leader_kind = "standby leader" else: self.leader_kind = "leader" def ok(self: "NodeIsLeaderSummary", results: nagiosplugin.Result) -> str: return f"This node is a {self.leader_kind} node." @handle_unknown def problem(self: "NodeIsLeaderSummary", results: nagiosplugin.Result) -> str: return f"This node is not a {self.leader_kind} node." 
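# NodeIsReplica queries the "replica", "synchronous" or "asynchronous" endpoint
# (optionally with ?lag=<max_lag>) and yields is_replica=1 only when the call
# succeeds; NodeIsReplicaSummary builds the matching human-readable message.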
class NodeIsReplica(PatroniResource): def __init__( self: "NodeIsReplica", connection_info: ConnectionInfo, max_lag: str, check_is_sync: bool, check_is_async: bool, ) -> None: super().__init__(connection_info) self.max_lag = max_lag self.check_is_sync = check_is_sync self.check_is_async = check_is_async def probe(self: "NodeIsReplica") -> Iterable[nagiosplugin.Metric]: try: if self.check_is_sync: api_name = "synchronous" elif self.check_is_async: api_name = "asynchronous" else: api_name = "replica" if self.max_lag is None: self.rest_api(api_name) else: self.rest_api(f"{api_name}?lag={self.max_lag}") except APIError: return [nagiosplugin.Metric("is_replica", 0)] return [nagiosplugin.Metric("is_replica", 1)] class NodeIsReplicaSummary(nagiosplugin.Summary): def __init__( self: "NodeIsReplicaSummary", lag: str, check_is_sync: bool, check_is_async: bool, ) -> None: self.lag = lag if check_is_sync: self.replica_kind = "synchronous replica" elif check_is_async: self.replica_kind = "asynchronous replica" else: self.replica_kind = "replica" def ok(self: "NodeIsReplicaSummary", results: nagiosplugin.Result) -> str: if self.lag is None: return ( f"This node is a running {self.replica_kind} with no noloadbalance tag." ) return f"This node is a running {self.replica_kind} with no noloadbalance tag and the lag is under {self.lag}." @handle_unknown def problem(self: "NodeIsReplicaSummary", results: nagiosplugin.Result) -> str: if self.lag is None: return f"This node is not a running {self.replica_kind} with no noloadbalance tag." return f"This node is not a running {self.replica_kind} with no noloadbalance tag and a lag under {self.lag}." class NodeIsPendingRestart(PatroniResource): def probe(self: "NodeIsPendingRestart") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") is_pending_restart = item_dict.get("pending_restart", False) return [ nagiosplugin.Metric( "is_pending_restart", 1 if is_pending_restart else 0, ) ] class NodeIsPendingRestartSummary(nagiosplugin.Summary): def ok(self: "NodeIsPendingRestartSummary", results: nagiosplugin.Result) -> str: return "This node doesn't have the pending restart flag." @handle_unknown def problem( self: "NodeIsPendingRestartSummary", results: nagiosplugin.Result ) -> str: return "This node has the pending restart flag." 
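# NodeTLHasChanged compares the timeline reported by the "patroni" endpoint with
# the reference timeline (--timeline or the state file) and, with --save, writes
# the new value back to the state file for future calls.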
class NodeTLHasChanged(PatroniResource): def __init__( self: "NodeTLHasChanged", connection_info: ConnectionInfo, timeline: str, # Always contains the old timeline state_file: str, # Only used to update the timeline in the state_file (when needed) save: bool, # save timeline in state file ) -> None: super().__init__(connection_info) self.state_file = state_file self.timeline = timeline self.save = save def probe(self: "NodeTLHasChanged") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") new_tl = item_dict["timeline"] _log.debug("save result: %(issave)s", {"issave": self.save}) old_tl = self.timeline if self.state_file is not None and self.save: _log.debug( "saving new timeline to state file / cookie %(state_file)s", {"state_file": self.state_file}, ) cookie = nagiosplugin.Cookie(self.state_file) cookie.open() cookie["timeline"] = new_tl cookie.commit() cookie.close() _log.debug( "Tl data: old tl %(old_tl)s, new tl %(new_tl)s", {"old_tl": old_tl, "new_tl": new_tl}, ) # The actual check yield nagiosplugin.Metric( "is_timeline_changed", 1 if str(new_tl) != str(old_tl) else 0, ) # The performance data : the timeline number yield nagiosplugin.Metric("timeline", new_tl) class NodeTLHasChangedSummary(nagiosplugin.Summary): def __init__(self: "NodeTLHasChangedSummary", timeline: str) -> None: self.timeline = timeline def ok(self: "NodeTLHasChangedSummary", results: nagiosplugin.Result) -> str: return f"The timeline is still {self.timeline}." @handle_unknown def problem(self: "NodeTLHasChangedSummary", results: nagiosplugin.Result) -> str: return f"The expected timeline was {self.timeline} got {results['timeline'].metric}." class NodePatroniVersion(PatroniResource): def __init__( self: "NodePatroniVersion", connection_info: ConnectionInfo, patroni_version: str, ) -> None: super().__init__(connection_info) self.patroni_version = patroni_version def probe(self: "NodePatroniVersion") -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") version = item_dict["patroni"]["version"] _log.debug( "Version data: patroni version %(version)s input version %(patroni_version)s", {"version": version, "patroni_version": self.patroni_version}, ) # The actual check return [ nagiosplugin.Metric( "is_version_ok", 1 if version == self.patroni_version else 0, ) ] class NodePatroniVersionSummary(nagiosplugin.Summary): def __init__(self: "NodePatroniVersionSummary", patroni_version: str) -> None: self.patroni_version = patroni_version def ok(self: "NodePatroniVersionSummary", results: nagiosplugin.Result) -> str: return f"Patroni's version is {self.patroni_version}." @handle_unknown def problem(self: "NodePatroniVersionSummary", results: nagiosplugin.Result) -> str: # FIXME find a way to make the following work, check is perf data can be strings # return f"The expected patroni version was {self.patroni_version} got {results['patroni_version'].metric}." return f"Patroni's version is not {self.patroni_version}." class NodeIsAlive(PatroniResource): def probe(self: "NodeIsAlive") -> Iterable[nagiosplugin.Metric]: try: self.rest_api("liveness") except APIError: return [nagiosplugin.Metric("is_alive", 0)] return [nagiosplugin.Metric("is_alive", 1)] class NodeIsAliveSummary(nagiosplugin.Summary): def ok(self: "NodeIsAliveSummary", results: nagiosplugin.Result) -> str: return "This node is alive (patroni is running)." @handle_unknown def problem(self: "NodeIsAliveSummary", results: nagiosplugin.Result) -> str: return "This node is not alive (patroni is not running)." 
check_patroni-1.0.0/check_patroni/types.py000066400000000000000000000060731447307111400206600ustar00rootroot00000000000000from typing import Any, Callable, List, Optional, Tuple, Union from urllib.parse import urlparse import attr import nagiosplugin import requests from . import _log class APIError(requests.exceptions.RequestException): """This exception is raised when the REST API could be reached but returned an HTTP status code different from 200. """ @attr.s(auto_attribs=True, frozen=True, slots=True) class ConnectionInfo: endpoints: List[str] = ["http://127.0.0.1:8008"] cert: Optional[Union[str, Tuple[str, str]]] = None ca_cert: Optional[str] = None @attr.s(auto_attribs=True, frozen=True, slots=True) class Parameters: connection_info: ConnectionInfo timeout: int verbose: int @attr.s(auto_attribs=True, slots=True) class PatroniResource(nagiosplugin.Resource): conn_info: ConnectionInfo def rest_api(self: "PatroniResource", service: str) -> Any: """Try to connect to all the provided endpoints for the requested service""" for endpoint in self.conn_info.endpoints: cert: Optional[Union[Tuple[str, str], str]] = None verify: Optional[Union[str, bool]] = None if urlparse(endpoint).scheme == "https": if self.conn_info.cert is not None: # we can have: a key + a cert or a single file with key and cert. cert = self.conn_info.cert if self.conn_info.ca_cert is not None: verify = self.conn_info.ca_cert _log.debug( "Trying to connect to %(endpoint)s/%(service)s with cert: %(cert)s verify: %(verify)s", { "endpoint": endpoint, "service": service, "cert": cert, "verify": verify, }, ) try: r = requests.get(f"{endpoint}/{service}", verify=verify, cert=cert) except Exception as e: _log.debug(e) continue # The status code is already displayed by urllib3 _log.debug( "api call data: %(data)s", {"data": r.text if r.text else ""} ) if r.status_code != 200: raise APIError( f"Failed to connect to {endpoint}/{service} status code {r.status_code}" ) try: return r.json() except requests.exceptions.JSONDecodeError: return None raise nagiosplugin.CheckError("Connection failed for all provided endpoints") HandleUnknown = Callable[[nagiosplugin.Summary, nagiosplugin.Results], Any] def handle_unknown(func: HandleUnknown) -> HandleUnknown: """decorator to handle the unknown state in Summary.problem""" def wrapper(summary: nagiosplugin.Summary, results: nagiosplugin.Results) -> Any: if results.most_significant[0].state.code == 3: """get the appropriate message for all unknown errors""" return results.most_significant[0].hint return func(summary, results) return wrapper check_patroni-1.0.0/docs/000077500000000000000000000000001447307111400152535ustar00rootroot00000000000000check_patroni-1.0.0/docs/make_readme.sh000077500000000000000000000066671447307111400200570ustar00rootroot00000000000000#!/bin/bash if ! command -v check_patroni &>/dev/null; then echo "check_patroni must be installed to generate the documentation" exit 1 fi top_srcdir="$(readlink -m "$0/../..")" README="${top_srcdir}/README.md" function readme(){ echo "$1" >> $README } function helpme(){ readme readme '```' check_patroni $1 --help >> $README readme '```' readme } cat << '_EOF_' > $README # check_patroni A nagios plugin for patroni. ## Features - Check presence of leader, replicas, node counts. - Check each node for replication status. _EOF_ helpme cat << '_EOF_' >> $README ## Install check_patroni is licensed under PostgreSQL license.
``` $ pip install git+https://github.com/dalibo/check_patroni.git ``` check_patroni works on python 3.6; we keep it that way because patroni also supports it and there are still lots of RH 7 variants around. That being said, python 3.6 has been EOL for ages and there is no support for it in the GitHub CI. ## Support If you hit a bug or need help, open a [GitHub issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no commitment on response time for public free support. Thanks for your contribution! ## Config file All global and service specific parameters can be specified via a config file as follows: ``` [options] endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008 cert_file = ./ssl/my-cert.pem key_file = ./ssl/my-key.pem ca_file = ./ssl/CA-cert.pem timeout = 0 [options.node_is_replica] lag=100 ``` ## Thresholds The format for the threshold parameters is `[@][start:][end]`. * `start:` may be omitted if `start == 0` * `~:` means that start is negative infinity * If `end` is omitted, infinity is assumed * To invert the match condition, prefix the range expression with `@`. A match is found when: `start <= VALUE <= end`. For example, the following command will raise: * a warning if there are fewer than 2 replicas, which can be translated to outside of range [2;+INF[ * a critical if there are no replicas, which can be translated to outside of range [1;+INF[ ``` check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1: ``` ## SSL Several options are available: * the server's CA certificate is not available or trusted by the client system: * `--ca_cert`: your certificate chain `cat CA-certificate server-certificate > cabundle` * you have a client certificate for authenticating with Patroni's REST API: * `--cert_file`: your certificate or the concatenation of your certificate and private key * `--key_file`: your private key (optional) _EOF_ readme readme "## Cluster services" readme readme "### cluster_config_has_changed" helpme cluster_config_has_changed readme "### cluster_has_leader" helpme cluster_has_leader readme "### cluster_has_replica" helpme cluster_has_replica readme "### cluster_has_scheduled_action" helpme cluster_has_scheduled_action readme "### cluster_is_in_maintenance" helpme cluster_is_in_maintenance readme "### cluster_node_count" helpme cluster_node_count readme "## Node services" readme readme "### node_is_alive" helpme node_is_alive readme "### node_is_pending_restart" helpme node_is_pending_restart readme "### node_is_leader" helpme node_is_leader readme "### node_is_primary" helpme node_is_primary readme "### node_is_replica" helpme node_is_replica readme "### node_patroni_version" helpme node_patroni_version readme "### node_tl_has_changed" helpme node_tl_has_changed cat << _EOF_ >> $README _EOF_ check_patroni-1.0.0/mypy.ini000066400000000000000000000014121447307111400160200ustar00rootroot00000000000000[mypy] show_error_codes = true strict = true exclude = build/ [mypy-setup] ignore_errors = True [mypy-nagiosplugin.*] ignore_missing_imports = true [mypy-check_patroni.types] # no stubs for nagiosplugin => ignore: Class cannot subclass "Resource" (has type "Any") [misc] disallow_subclassing_any = false [mypy-check_patroni.node] # no stubs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc] disallow_subclassing_any = false [mypy-check_patroni.cluster] # no stubs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc] disallow_subclassing_any =
false [mypy-check_patroni.cli] # no stubs for nagiosplugin => ignore: Untyped decorator makes function "main" untyped [misc] disallow_untyped_decorators = false check_patroni-1.0.0/pyproject.toml000066400000000000000000000002041447307111400172330ustar00rootroot00000000000000[build-system] requires = ["setuptools", "setuptools-scm"] build-backend = "setuptools.build_meta" [tool.isort] profile = "black" check_patroni-1.0.0/pytest.ini000066400000000000000000000000451447307111400163530ustar00rootroot00000000000000[pytest] addopts = --doctest-modules check_patroni-1.0.0/requirements-dev.txt000066400000000000000000000001461447307111400203640ustar00rootroot00000000000000black codespell isort flake8 mypy==0.961 pytest pytest-mock types-requests setuptools tox twine wheel check_patroni-1.0.0/setup.py000066400000000000000000000030501447307111400160330ustar00rootroot00000000000000import pathlib from setuptools import find_packages, setup HERE = pathlib.Path(__file__).parent long_description = (HERE / "README.md").read_text() def get_version() -> str: fpath = HERE / "check_patroni" / "__init__.py" with fpath.open() as f: for line in f: if line.startswith("__version__"): return line.split('"')[1] raise Exception(f"version information not found in {fpath}") setup( name="check_patroni", version=get_version(), author="Dalibo", author_email="contact@dalibo.com", packages=find_packages(include=["check_patroni*"]), include_package_data=True, url="https://github.com/dalibo/check_patroni", license="PostgreSQL", description="Nagios plugin to check on patroni", long_description=long_description, long_description_content_type="text/markdown", classifiers=[ "Development Status :: 5 - Production/Stable", "Environment :: Console", "License :: OSI Approved :: PostgreSQL License", "Programming Language :: Python :: 3", "Topic :: System :: Monitoring", ], keywords="patroni nagios check", python_requires=">=3.6", install_requires=[ "attrs >= 17, !=21.1", "requests", "nagiosplugin >= 1.3.2", "click >= 8.0.1", ], extras_require={ "test": [ "pytest", "pytest-mock", ], }, entry_points={ "console_scripts": [ "check_patroni=check_patroni.cli:main", ], }, zip_safe=False, ) check_patroni-1.0.0/tests/000077500000000000000000000000001447307111400154655ustar00rootroot00000000000000check_patroni-1.0.0/tests/__init__.py000066400000000000000000000000001447307111400175640ustar00rootroot00000000000000check_patroni-1.0.0/tests/conftest.py000066400000000000000000000006371447307111400176720ustar00rootroot00000000000000def pytest_addoption(parser): """ Add CLI options to `pytest` to pass those options to the test cases. These options are used in `pytest_generate_tests`. 
""" parser.addoption("--use-old-replica-state", action="store_true", default=False) def pytest_generate_tests(metafunc): metafunc.parametrize( "use_old_replica_state", [metafunc.config.getoption("use_old_replica_state")] ) check_patroni-1.0.0/tests/json/000077500000000000000000000000001447307111400164365ustar00rootroot00000000000000check_patroni-1.0.0/tests/json/cluster_config_has_changed.json000066400000000000000000000006051447307111400246440ustar00rootroot00000000000000{ "loop_wait": 10, "master_start_timeout": 300, "postgresql": { "parameters": { "archive_command": "pgbackrest --stanza=main archive-push %p", "archive_mode": "on", "max_connections": 300, "restore_command": "pgbackrest --stanza=main archive-get %f \"%p\"" }, "use_pg_rewind": false, "use_slot": true }, "retry_timeout": 10, "ttl": 30 } check_patroni-1.0.0/tests/json/cluster_has_leader_ko.json000066400000000000000000000012511447307111400236510ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "replica", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "running", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "running", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_has_leader_ok.json000066400000000000000000000012541447307111400236540ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_has_leader_ok_standby_leader.json000066400000000000000000000012641447307111400267150ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "standby_leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_has_replica_ko.json000066400000000000000000000012621447307111400240360ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "stopped", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": "unknown" }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } 
check_patroni-1.0.0/tests/json/cluster_has_replica_ko_lag.json000066400000000000000000000012721447307111400246620ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 10241024 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 20000000 } ] } check_patroni-1.0.0/tests/json/cluster_has_replica_ok.json000066400000000000000000000012611447307111400240350ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_has_replica_ok_lag.json000066400000000000000000000012601447307111400246570ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 1024 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_has_scheduled_action_ko_restart.json000066400000000000000000000011411447307111400274540ustar00rootroot00000000000000{ "members": [ { "name": "p1", "role": "sync_standby", "state": "streaming", "api_url": "http://10.20.30.51:8008/patroni", "host": "10.20.30.51", "port": 5432, "timeline": 3, "scheduled_restart": { "schedule": "2023-10-08T11:30:00+00:00", "postmaster_start_time": "2023-08-21 08:08:33.415237+00:00" }, "lag": 0 }, { "name": "p2", "role": "leader", "state": "running", "api_url": "http://10.20.30.52:8008/patroni", "host": "10.20.30.52", "port": 5432, "timeline": 3 } ] } check_patroni-1.0.0/tests/json/cluster_has_scheduled_action_ko_switchover.json000066400000000000000000000010571447307111400301730ustar00rootroot00000000000000{ "members": [ { "name": "p1", "role": "sync_standby", "state": "streaming", "api_url": "http://10.20.30.51:8008/patroni", "host": "10.20.30.51", "port": 5432, "timeline": 3, "lag": 0 }, { "name": "p2", "role": "leader", "state": "running", "api_url": "http://10.20.30.52:8008/patroni", "host": "10.20.30.52", "port": 5432, "timeline": 3 } ], "scheduled_switchover": { "at": "2023-10-08T11:30:00+00:00", "from": "p1", "to": "p2" } } check_patroni-1.0.0/tests/json/cluster_has_scheduled_action_ok.json000066400000000000000000000012611447307111400257130ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", 
"role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_is_in_maintenance_ko.json000066400000000000000000000012751447307111400252330ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": true } check_patroni-1.0.0/tests/json/cluster_is_in_maintenance_ko_pause_false.json000066400000000000000000000012761447307111400276030ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": false } check_patroni-1.0.0/tests/json/cluster_is_in_maintenance_ok.json000066400000000000000000000012541447307111400252300ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_is_in_maintenance_ok_pause_false.json000066400000000000000000000012761447307111400276030ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": false } check_patroni-1.0.0/tests/json/cluster_node_count_critical.json000066400000000000000000000003461447307111400251040ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 } ] } check_patroni-1.0.0/tests/json/cluster_node_count_healthy_critical.json000066400000000000000000000012261447307111400266200ustar00rootroot00000000000000{ "members": [ { "name": "srv1", 
"role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "start failed", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "lag": "unknown" }, { "name": "srv3", "role": "replica", "state": "start failed", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "lag": "unknown" } ] } check_patroni-1.0.0/tests/json/cluster_node_count_healthy_warning.json000066400000000000000000000007111447307111400264710ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_node_count_ok.json000066400000000000000000000012541447307111400237220ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/cluster_node_count_warning.json000066400000000000000000000007111447307111400247530ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-1.0.0/tests/json/node_is_leader_ko.json000066400000000000000000000010701447307111400227540ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_leader_ko_standby_leader.json000066400000000000000000000007171447307111400260230ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2023-08-23 14:30:50.201691+00:00", "role": "standby_leader", "server_version": 140009, "xlog": { "received_location": 889192448, "replayed_location": 889192448, "replayed_timestamp": null, "paused": false }, "timeline": 1, "dcs_last_seen": 1692805971, "database_system_identifier": "7270495803765492571", "patroni": { "version": "3.1.0", "scope": "patroni-demo-sb" } } check_patroni-1.0.0/tests/json/node_is_leader_ok.json000066400000000000000000000010701447307111400227540ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, 
"cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_leader_ok_standby_leader.json000066400000000000000000000007171447307111400260230ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2023-08-23 14:30:50.201691+00:00", "role": "standby_leader", "server_version": 140009, "xlog": { "received_location": 889192448, "replayed_location": 889192448, "replayed_timestamp": null, "paused": false }, "timeline": 1, "dcs_last_seen": 1692805971, "database_system_identifier": "7270495803765492571", "patroni": { "version": "3.1.0", "scope": "patroni-demo-sb" } } check_patroni-1.0.0/tests/json/node_is_pending_restart_ko.json000066400000000000000000000011231447307111400247070ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "pending_restart": true, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_pending_restart_ok.json000066400000000000000000000010701447307111400247100ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_primary_ko.json000066400000000000000000000007011447307111400232030ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:57:51.693 UTC", "role": "replica", "server_version": 110012, "cluster_unlocked": false, "xlog": { "received_location": 1174407088, "replayed_location": 1174407088, "replayed_timestamp": null, "paused": false }, "timeline": 58, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_primary_ok.json000066400000000000000000000010701447307111400232030ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_replica_ko.json000066400000000000000000000010701447307111400231370ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": 
"2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_is_replica_ok.json000066400000000000000000000007011447307111400231370ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:57:51.693 UTC", "role": "replica", "server_version": 110012, "cluster_unlocked": false, "xlog": { "received_location": 1174407088, "replayed_location": 1174407088, "replayed_timestamp": null, "paused": false }, "timeline": 58, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_patroni_version.json000066400000000000000000000010701447307111400235550ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/json/node_tl_has_changed.json000066400000000000000000000010701447307111400232570ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "master", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-1.0.0/tests/test_api.py000066400000000000000000000013751447307111400176550ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_api_status_code_200( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_is_pending_restart_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_pending_restart"] ) assert result.exit_code == 0 def test_api_status_code_404( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "Fake test", 404) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_pending_restart"] ) assert result.exit_code == 3 check_patroni-1.0.0/tests/test_cluster_config_has_changed.py000066400000000000000000000126541447307111400244200ustar00rootroot00000000000000import nagiosplugin from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import here, my_mock def test_cluster_config_has_changed_ok_with_hash( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_config_has_changed", 200) result = runner.invoke( main, [ "-e", 
"https://10.20.199.3:8008", "cluster_config_has_changed", "--hash", "96b12d82571473d13e890b893734e731", ], ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (96b12d82571473d13e890b893734e731). | is_configuration_changed=0;;@1:1\n" ) def test_cluster_config_has_changed_ok_with_state_file( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() with open(here / "cluster_config_has_changed.state_file", "w") as f: f.write('{"hash": "96b12d82571473d13e890b893734e731"}') my_mock(mocker, "cluster_config_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_config_has_changed", "--state-file", str(here / "cluster_config_has_changed.state_file"), ], ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (96b12d82571473d13e890b893734e731). | is_configuration_changed=0;;@1:1\n" ) def test_cluster_config_has_changed_ko_with_hash( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_config_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_config_has_changed", "--hash", "96b12d82571473d13e890b8937ffffff", ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) def test_cluster_config_has_changed_ko_with_state_file_and_save( mocker: MockerFixture, use_old_replica_state: bool, ) -> None: runner = CliRunner() with open(here / "cluster_config_has_changed.state_file", "w") as f: f.write('{"hash": "96b12d82571473d13e890b8937ffffff"}') my_mock(mocker, "cluster_config_has_changed", 200) # test without saving the new hash result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_config_has_changed", "--state-file", str(here / "cluster_config_has_changed.state_file"), ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) cookie = nagiosplugin.Cookie(here / "cluster_config_has_changed.state_file") cookie.open() new_config_hash = cookie.get("hash") cookie.close() assert new_config_hash == "96b12d82571473d13e890b8937ffffff" # test when we save the hash result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_config_has_changed", "--state-file", str(here / "cluster_config_has_changed.state_file"), "--save", ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) cookie = nagiosplugin.Cookie(here / "cluster_config_has_changed.state_file") cookie.open() new_config_hash = cookie.get("hash") cookie.close() assert new_config_hash == "96b12d82571473d13e890b893734e731" def test_cluster_config_has_changed_params( mocker: MockerFixture, use_old_replica_state: bool ) -> None: # This one is placed last because it seems like the exceptions are not flushed from stderr for the next tests. 
runner = CliRunner() my_mock(mocker, "cluster_config_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_config_has_changed", "--hash", "640df9f0211c791723f18fc3ed9dbb95", "--state-file", str(here / "fake_file_name.state_file"), ], ) assert result.exit_code == 3 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n" ) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_config_has_changed"] ) assert result.exit_code == 3 assert ( result.stdout == "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n" ) check_patroni-1.0.0/tests/test_cluster_has_leader.py000066400000000000000000000027461447307111400227370ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_cluster_has_leader_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_leader_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_leader"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0\n" ) def test_cluster_has_leader_ok_standby_leader( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_leader_ok_standby_leader", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_leader"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0\n" ) def test_cluster_has_leader_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_leader_ko", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_leader"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASLEADER CRITICAL - The cluster has no running leader. 
| has_leader=0;;@0\n" ) check_patroni-1.0.0/tests/test_cluster_has_replica.py000066400000000000000000000117771447307111400231240ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock # TODO Lag threshold tests def test_cluster_has_replica_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ok", 200, use_old_replica_state) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_replica"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1 unhealthy_replica=0\n" ) def test_cluster_has_replica_ok_with_count_thresholds( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ok", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--warning", "@1", "--critical", "@0", ], ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1 unhealthy_replica=0\n" ) def test_cluster_has_replica_ok_with_sync_count_thresholds( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ok", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--sync-warning", "1:", ], ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv3_lag=0 srv3_sync=1 sync_replica=1;1: unhealthy_replica=0\n" ) def test_cluster_has_replica_ok_with_count_thresholds_lag( mocker: MockerFixture, use_old_replica_state: bool, ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ok_lag", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=1024 srv2_sync=0 srv3_lag=0 srv3_sync=0 sync_replica=0 unhealthy_replica=0\n" ) def test_cluster_has_replica_ko_with_count_thresholds( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ko", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--warning", "@1", "--critical", "@0", ], ) assert result.exit_code == 1 assert ( result.stdout == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv3_lag=0 srv3_sync=0 sync_replica=0 unhealthy_replica=1\n" ) def test_cluster_has_replica_ko_with_sync_count_thresholds( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ko", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--sync-warning", "2:", "--sync-critical", "1:", ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASREPLICA CRITICAL - sync_replica is 0 (outside range 1:) | healthy_replica=1 srv3_lag=0
srv3_sync=0 sync_replica=0;2:;1: unhealthy_replica=1\n" ) def test_cluster_has_replica_ko_with_count_thresholds_and_lag( mocker: MockerFixture, use_old_replica_state: bool, ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_replica_ko_lag", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv2_lag=10241024 srv2_sync=0 srv3_lag=20000000 srv3_sync=0 sync_replica=0 unhealthy_replica=2\n" ) check_patroni-1.0.0/tests/test_cluster_has_scheduled_action.py000066400000000000000000000034321447307111400247710ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_cluster_has_scheduled_action_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_scheduled_action_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_scheduled_action"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION OK - has_scheduled_actions is 0 | has_scheduled_actions=0;;0 scheduled_restart=0 scheduled_switchover=0\n" ) def test_cluster_has_scheduled_action_ko_switchover( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_scheduled_action_ko_switchover", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_scheduled_action"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=0 scheduled_switchover=1\n" ) def test_cluster_has_scheduled_action_ko_restart( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_has_scheduled_action_ko_restart", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_has_scheduled_action"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=1 scheduled_switchover=0\n" ) check_patroni-1.0.0/tests/test_cluster_is_in_maintenance.py000066400000000000000000000030651447307111400243060ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_cluster_is_in_maintenance_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_is_in_maintenance_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_is_in_maintenance"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n" ) def test_cluster_is_in_maintenance_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_is_in_maintenance_ko", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_is_in_maintenance"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERISINMAINTENANCE CRITICAL - is_in_maintenance is 1 (outside range 0:0) | 
is_in_maintenance=1;;0\n" ) def test_cluster_is_in_maintenance_ok_pause_false( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_is_in_maintenance_ok_pause_false", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_is_in_maintenance"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n" ) check_patroni-1.0.0/tests/test_cluster_node_count.py000066400000000000000000000124061447307111400227770ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_cluster_node_count_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_ok", 200, use_old_replica_state) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_node_count"] ) assert result.exit_code == 0 if use_old_replica_state: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n" ) def test_cluster_node_count_ok_with_thresholds( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_ok", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_node_count", "--warning", "@0:1", "--critical", "@2", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) assert result.exit_code == 0 if use_old_replica_state: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n" ) def test_cluster_node_count_healthy_warning( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_healthy_warning", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_node_count", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) assert result.exit_code == 1 if use_old_replica_state: assert ( result.output == "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=2\n" ) else: assert ( result.output == "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n" ) def test_cluster_node_count_healthy_critical( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_healthy_critical", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_node_count", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) assert result.exit_code == 2 assert ( result.output == "CLUSTERNODECOUNT CRITICAL - healthy_members is 1 (outside range @0:1) | healthy_members=1;@2;@1 members=3 role_leader=1 role_replica=2 state_running=1 state_start_failed=2\n" ) def 
test_cluster_node_count_warning( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_warning", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_node_count", "--warning", "@2", "--critical", "@0:1", ], ) assert result.exit_code == 1 if use_old_replica_state: assert ( result.stdout == "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=2\n" ) else: assert ( result.stdout == "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n" ) def test_cluster_node_count_critical( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "cluster_node_count_critical", 200, use_old_replica_state) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "cluster_node_count", "--warning", "@2", "--critical", "@0:1", ], ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERNODECOUNT CRITICAL - members is 1 (outside range @0:1) | healthy_members=1 members=1;@2;@1 role_leader=1 state_running=1\n" ) check_patroni-1.0.0/tests/test_node_is_alive.py000066400000000000000000000016601447307111400217010ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_is_alive_ok(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, None, 200) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_alive"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISALIVE OK - This node is alive (patroni is running). | is_alive=1;;@0\n" ) def test_node_is_alive_ko(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, None, 404) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_alive"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0\n" ) check_patroni-1.0.0/tests/test_node_is_leader.py000066400000000000000000000032431447307111400220340ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_is_leader_ok(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_leader_ok", 200) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_leader"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISLEADER OK - This node is a leader node. | is_leader=1;;@0\n" ) my_mock(mocker, "node_is_leader_ok_standby_leader", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_leader", "--is-standby-leader"], ) print(result.stdout) assert result.exit_code == 0 assert ( result.stdout == "NODEISLEADER OK - This node is a standby leader node. | is_leader=1;;@0\n" ) def test_node_is_leader_ko(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_leader_ko", 503) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_leader"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISLEADER CRITICAL - This node is not a leader node. 
| is_leader=0;;@0\n" ) my_mock(mocker, "node_is_leader_ko_standby_leader", 503) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_leader", "--is-standby-leader"], ) assert result.exit_code == 2 assert ( result.stdout == "NODEISLEADER CRITICAL - This node is not a standby leader node. | is_leader=0;;@0\n" ) check_patroni-1.0.0/tests/test_node_is_pending_restart.py000066400000000000000000000021231447307111400237640ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_is_pending_restart_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_is_pending_restart_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_pending_restart"] ) assert result.exit_code == 0 assert ( result.stdout == "NODEISPENDINGRESTART OK - This node doesn't have the pending restart flag. | is_pending_restart=0;;0\n" ) def test_node_is_pending_restart_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_is_pending_restart_ko", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_pending_restart"] ) assert result.exit_code == 2 assert ( result.stdout == "NODEISPENDINGRESTART CRITICAL - This node has the pending restart flag. | is_pending_restart=1;;0\n" ) check_patroni-1.0.0/tests/test_node_is_primary.py000066400000000000000000000017501447307111400222640ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_is_primary_ok(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_primary_ok", 200) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_primary"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISPRIMARY OK - This node is the primary with the leader lock. | is_primary=1;;@0\n" ) def test_node_is_primary_ko(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_primary_ko", 503) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_primary"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISPRIMARY CRITICAL - This node is not the primary with the leader lock. | is_primary=0;;@0\n" ) check_patroni-1.0.0/tests/test_node_is_replica.py000066400000000000000000000130671447307111400222240ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_is_replica_ok(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_replica_ok", 200) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_replica"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISREPLICA OK - This node is a running replica with no noloadbalance tag. | is_replica=1;;@0\n" ) def test_node_is_replica_ko(mocker: MockerFixture, use_old_replica_state: bool) -> None: runner = CliRunner() my_mock(mocker, "node_is_replica_ko", 503) result = runner.invoke(main, ["-e", "https://10.20.199.3:8008", "node_is_replica"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag. 
| is_replica=0;;@0\n" ) def test_node_is_replica_ko_lag( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 503) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_replica", "--max-lag", "100"] ) assert result.exit_code == 2 assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n" ) my_mock(mocker, "node_is_replica_ok", 503) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_is_replica", "--is-async", "--max-lag", "100", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n" ) def test_node_is_replica_sync_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_replica", "--is-sync"] ) assert result.exit_code == 0 assert ( result.stdout == "NODEISREPLICA OK - This node is a running synchronous replica with no noloadbalance tag. | is_replica=1;;@0\n" ) def test_node_is_replica_sync_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 503) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_replica", "--is-sync"] ) assert result.exit_code == 2 assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running synchronous replica with no noloadbalance tag. | is_replica=0;;@0\n" ) def test_node_is_replica_async_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 200) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_replica", "--is-async"] ) assert result.exit_code == 0 assert ( result.stdout == "NODEISREPLICA OK - This node is a running asynchronous replica with no noloadbalance tag. | is_replica=1;;@0\n" ) def test_node_is_replica_async_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 503) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_is_replica", "--is-async"] ) assert result.exit_code == 2 assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag. 
| is_replica=0;;@0\n" ) def test_node_is_replica_params( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_is_replica", "--is-async", "--is-sync", ], ) assert result.exit_code == 3 assert ( result.stdout == "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --is-async cannot be provided at the same time for this service\n" ) # We don't do the check ourselves, patroni does it and changes the return code my_mock(mocker, "node_is_replica_ok", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_is_replica", "--is-sync", "--max-lag", "1MB", ], ) assert result.exit_code == 3 assert ( result.stdout == "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --max-lag cannot be provided at the same time for this service\n" ) check_patroni-1.0.0/tests/test_node_patroni_version.py000066400000000000000000000023561447307111400233320ustar00rootroot00000000000000from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import my_mock def test_node_patroni_version_ok( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_patroni_version", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_patroni_version", "--patroni-version", "2.0.2", ], ) assert result.exit_code == 0 assert ( result.stdout == "NODEPATRONIVERSION OK - Patroni's version is 2.0.2. | is_version_ok=1;;@0\n" ) def test_node_patroni_version_ko( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_patroni_version", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_patroni_version", "--patroni-version", "1.0.0", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODEPATRONIVERSION CRITICAL - Patroni's version is not 1.0.0. | is_version_ok=0;;@0\n" ) check_patroni-1.0.0/tests/test_node_tl_has_changed.py000066400000000000000000000113021447307111400230230ustar00rootroot00000000000000import nagiosplugin from click.testing import CliRunner from pytest_mock import MockerFixture from check_patroni.cli import main from .tools import here, my_mock def test_node_tl_has_changed_ok_with_timeline( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_tl_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--timeline", "58", ], ) assert result.exit_code == 0 assert ( result.stdout == "NODETLHASCHANGED OK - The timeline is still 58. | is_timeline_changed=0;;@1:1 timeline=58\n" ) def test_node_tl_has_changed_ok_with_state_file( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() with open(here / "node_tl_has_changed.state_file", "w") as f: f.write('{"timeline": 58}') my_mock(mocker, "node_tl_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--state-file", str(here / "node_tl_has_changed.state_file"), ], ) assert result.exit_code == 0 assert ( result.stdout == "NODETLHASCHANGED OK - The timeline is still 58. 
| is_timeline_changed=0;;@1:1 timeline=58\n" ) def test_node_tl_has_changed_ko_with_timeline( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() my_mock(mocker, "node_tl_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--timeline", "700", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n" ) def test_node_tl_has_changed_ko_with_state_file_and_save( mocker: MockerFixture, use_old_replica_state: bool ) -> None: runner = CliRunner() with open(here / "node_tl_has_changed.state_file", "w") as f: f.write('{"timeline": 700}') my_mock(mocker, "node_tl_has_changed", 200) # test without saving the new tl result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--state-file", str(here / "node_tl_has_changed.state_file"), ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n" ) cookie = nagiosplugin.Cookie(here / "node_tl_has_changed.state_file") cookie.open() new_tl = cookie.get("timeline") cookie.close() assert new_tl == 700 # test when we save the hash result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--state-file", str(here / "node_tl_has_changed.state_file"), "--save", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n" ) cookie = nagiosplugin.Cookie(here / "node_tl_has_changed.state_file") cookie.open() new_tl = cookie.get("timeline") cookie.close() assert new_tl == 58 def test_node_tl_has_changed_params( mocker: MockerFixture, use_old_replica_state: bool ) -> None: # This one is placed last because it seems like the exceptions are not flushed from stderr for the next tests. 
runner = CliRunner() my_mock(mocker, "node_tl_has_changed", 200) result = runner.invoke( main, [ "-e", "https://10.20.199.3:8008", "node_tl_has_changed", "--timeline", "58", "--state-file", str(here / "fake_file_name.state_file"), ], ) assert result.exit_code == 3 assert ( result.stdout == "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n" ) result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "node_tl_has_changed"] ) assert result.exit_code == 3 assert ( result.stdout == "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n" ) check_patroni-1.0.0/tests/tools.py000066400000000000000000000026531447307111400172050ustar00rootroot00000000000000import json import pathlib from typing import Any from pytest_mock import MockerFixture from check_patroni.types import APIError, PatroniResource here = pathlib.Path(__file__).parent def getjson(name: str) -> Any: path = here / "json" / f"{name}.json" if not path.exists(): raise Exception(f"path does not exist : {path}") with path.open() as f: return json.load(f) def my_mock( mocker: MockerFixture, json_file: str, status: int, use_old_replica_state: bool = False, ) -> None: def mock_rest_api(self: PatroniResource, service: str) -> Any: if status != 200: raise APIError("Test en erreur pour status code 200") if json_file: if use_old_replica_state and ( json_file.startswith("cluster_has_replica") or json_file.startswith("cluster_node_count") ): return cluster_api_set_replica_running(getjson(json_file)) return getjson(json_file) return None mocker.resetall() mocker.patch("check_patroni.types.PatroniResource.rest_api", mock_rest_api) def cluster_api_set_replica_running(js: Any) -> Any: # starting from 3.0.4 the state of replicas is streaming instead of running for node in js["members"]: if node["role"] in ["replica", "sync_standby"]: if node["state"] == "streaming": node["state"] = "running" return js check_patroni-1.0.0/tox.ini000066400000000000000000000022141447307111400156350ustar00rootroot00000000000000[tox] # the versions specified here are overridden by github workflow envlist = lint, mypy, py{37,38,39,310,311} skip_missing_interpreters = True [testenv] deps = pytest pytest-mock commands = pytest {toxinidir}/check_patroni {toxinidir}/tests {posargs:-vv} [testenv:lint] skip_install = True deps = codespell black flake8 isort commands = codespell {toxinidir}/check_patroni {toxinidir}/tests black --check --diff {toxinidir}/check_patroni {toxinidir}/tests flake8 {toxinidir}/check_patroni {toxinidir}/tests isort --check --diff {toxinidir}/check_patroni {toxinidir}/tests [testenv:mypy] deps = mypy == 0.961 commands = # we need to install types-requests mypy --install-types --non-interactive {toxinidir}/check_patroni [testenv:build] deps = wheel setuptools twine allowlist_externals = rm commands = rm --verbose --recursive --force {toxinidir}/dist/ python -m build python -m twine check dist/* [testenv:upload] # requires a check_patroni section in ~/.pypirc skip_install = True deps = twine commands = python -m twine upload --repository check_patroni dist/* check_patroni-1.0.0/vagrant/000077500000000000000000000000001447307111400157655ustar00rootroot00000000000000check_patroni-1.0.0/vagrant/LICENSE000066400000000000000000000030001447307111400167630ustar00rootroot00000000000000BSD 3-Clause License Copyright (c) 2019, Jehan-Guillaume (ioguix) de Rorthais All rights reserved. 
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. check_patroni-1.0.0/vagrant/Makefile000066400000000000000000000006731447307111400174330ustar00rootroot00000000000000export VAGRANT_BOX_UPDATE_CHECK_DISABLE=1 export VAGRANT_CHECKPOINT_DISABLE=1 .PHONY: all prov validate all: prov prov: vagrant up --provision clean: vagrant destroy -f validate: @vagrant validate @if which shellcheck >/dev/null ;\ then shellcheck provision/* ;\ else echo "WARNING: shellcheck is not in PATH, not checking bash syntax" ;\ fi check_patroni-1.0.0/vagrant/README.md000066400000000000000000000031641447307111400172500ustar00rootroot00000000000000# Icinga ## Install Create the VM: ``` make ``` ## IcingaWeb Configure Icingaweb : ``` http://$IP/icingaweb2/setup ``` * Screen 1: Welcome Use the icinga token given a the end of the `icinga2-setup` provision, or: ``` sudo icingacli setup token show ``` Next * Screen 2: Modules Activate Monitor (already set) Next * Screen 3: Icinga Web 2 Next * Screen 4: Authentication Next * Screen 5: Database Resource Database Name: icingaweb_db Username: supervisor Password: th3Pass Charset: UTF8 Validate Next * Screen 6: Authentication Backend Next * Screen 7: Administration Fill the blanks Next * Screen 8: Application Configuration Next * Screen 9: Summary Next * Screen 10: Welcome ... again Next * Screen 11: Monitoring IDO Resource Database Name: icinga2 Username: supervisor Password: th3Pass Charset: UTF8 Validate Next * Screen 12: Command Transport Transaport name: icinga2 Transport Type: API Host: 127.0.0.1 Port: 5665 User: icinga_api Password: th3Pass Next * Screen 13: Monitoring Security Next * Screen 14: Summary Finish * Screen 15: Hopefuly success Login ## Add servers to icinga ``` # Connect to the vm vagrant ssh s1 # Create /etc/icinga2/conf.d/check_patroni.conf sudo /vagrant/provision/director.bash init cluster1 p1=10.20.89.54 p2=10.20.89.55 # Check and load conf sudo icinga2 daemon -C sudo systemctl restart icinga2.service ``` # Grafana Connect to: http://10.20.89.52:3000/login User / pass: admin/admin Import the dashboards for the grafana directory. They are created for cluster1, and servers p1, p2. 
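# Smoke test

Before (or after) wiring the checks into Icinga, the bundled `vagrant/check_patroni.sh` script can serve as a quick smoke test: it runs every cluster and node service of `check_patroni` against a single Patroni endpoint and prints each Nagios status line. The commands below are only a sketch: the endpoint reuses the `p1` address from the example above and the `:8008` port that `provision/director.bash` configures, so adjust it to your own cluster (the vagrant directory is rsynced to `/vagrant` inside the VM).

```
# from the host
vagrant ssh s1

# inside s1: run every check against one Patroni member
/vagrant/check_patroni.sh http://10.20.89.54:8008
```

A check that fails here will fail the same way once Icinga schedules it, which makes configuration problems easy to spot early.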
check_patroni-1.0.0/vagrant/Vagrantfile000066400000000000000000000040521447307111400201530ustar00rootroot00000000000000require 'ipaddr' #require 'yaml' ENV["LC_ALL"] = 'en_US.utf8' myBox = 'debian/buster64' myProvider = 'libvirt' pgver = 11 start_ip = '10.20.89.51' etcd_nodes = [] patroni_nodes = [] sup_nodes = ['s1'] # install check_patroni from the local repo (test) or from pip (official) cp_origin = 'test' # [test, official] Vagrant.configure(2) do |config| config.vm.provider myProvider next_ip = IPAddr.new(start_ip).succ host_ip = (IPAddr.new(start_ip) & "255.255.255.0").succ.to_s nodes_ips = {} ( patroni_nodes + etcd_nodes + sup_nodes ).each do |node| nodes_ips[node] = next_ip.to_s next_ip = next_ip.succ end # don't mind about insecure ssh key config.ssh.insert_key = false # https://vagrantcloud.com/search. config.vm.box = myBox # hardware and host settings config.vm.provider 'libvirt' do |lv| lv.cpus = 1 lv.memory = 512 lv.watchdog model: 'i6300esb' lv.default_prefix = 'patroni_' lv.qemu_use_session = false end # disable default share (NFS is not working directly in DEBIAN 10) config.vm.synced_folder ".", "/vagrant", type: "rsync" config.vm.synced_folder "/home/benoit/git/dalibo/check_patroni", "/check_patroni", type: "rsync" ## allow root@vm to ssh to ssh_login@network_1 #config.vm.synced_folder 'ssh', '/root/.ssh', type: 'rsync', # owner: 'root', group: 'root', # rsync__args: [ "--verbose", "--archive", "--delete", "--copy-links", "--no-perms" ] # system setup for sup nodes (sup_nodes).each do |node| config.vm.define node do |conf| conf.vm.network 'private_network', ip: nodes_ips[node] conf.vm.provision 'icinga2-setup', type: 'shell', path: 'provision/icinga2.bash', args: [ node ], preserve_order: true conf.vm.provision 'check_patroni', type: 'shell', path: 'provision/check_patroni.bash', args: [ cp_origin ], preserve_order: true end end end check_patroni-1.0.0/vagrant/check_patroni.sh000077500000000000000000000016141447307111400211370ustar00rootroot00000000000000#!/bin/bash if [[ -z "$1" ]]; then echo "usage: $0 PATRONI_END_POINT" exit 1 fi echo "-- Running patroni checks using endpoint $1" echo "-- Cluster checks" check_patroni -e "$1" cluster_config_has_changed --state-file cluster.sate_file --save check_patroni -e "$1" cluster_has_leader check_patroni -e "$1" cluster_has_replica check_patroni -e "$1" cluster_is_in_maintenance check_patroni -e "$1" cluster_has_scheduled_action check_patroni -e "$1" cluster_node_count echo "-- Node checks" check_patroni -e "$1" node_is_alive check_patroni -e "$1" node_is_pending_restart check_patroni -e "$1" node_is_primary check_patroni -e "$1" node_is_leader --is-standby-leader check_patroni -e "$1" node_is_replica check_patroni -e "$1" node_is_replica --is-sync check_patroni -e "$1" node_patroni_version --patroni-version 3.1.0 check_patroni -e "$1" node_tl_has_changed --state-file cluster.sate_file --save check_patroni-1.0.0/vagrant/grafana/000077500000000000000000000000001447307111400173645ustar00rootroot00000000000000check_patroni-1.0.0/vagrant/grafana/cluster_status_cluster1.json000066400000000000000000000471551447307111400252010ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_CLUSTER_NAME", "type": "constant", "label": "cluster_name", "value": "cluster1", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": 
"datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" }, { "type": "panel", "id": "timeseries", "name": "Time series", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640960519458, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 6, "w": 20, "x": 0, "y": 0 }, "id": 14, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" }, "tooltip": { "mode": "single" } }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_replica'\n ) \n AND m.label ilike '%lag%' \nGROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster replica lag", "type": "timeseries" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null } ] } }, "overrides": [] }, "gridPos": { "h": 2, "w": 4, "x": 20, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_leader'\n ) GROUP BY time, m.label 
ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster has primary", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 2, "w": 4, "x": 20, "y": 2 }, "id": 10, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_config_has_changed'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster config has changed", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 2, "w": 4, "x": 20, "y": 4 }, "id": 8, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_is_in_maintenance'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster is in maintenance", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "role_leader" }, "properties": [ { "id": "displayName", "value": "leader" } ] }, { "matcher": { "id": "byName", "options": "role_replica" }, "properties": [ { "id": "displayName", 
"value": "replicas" } ] }, { "matcher": { "id": "byName", "options": "state_running" }, "properties": [ { "id": "displayName", "value": "running" } ] } ] }, "gridPos": { "h": 5, "w": 12, "x": 0, "y": 6 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "center", "orientation": "vertical", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM public.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id \n FROM public.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_node_count'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster node count", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "healthy_replica" }, "properties": [ { "id": "displayName", "value": "healthy" } ] }, { "matcher": { "id": "byName", "options": "unhealthy_replica" }, "properties": [ { "id": "displayName", "value": "unhealthy" } ] } ] }, "gridPos": { "h": 5, "w": 12, "x": 12, "y": 6 }, "id": 6, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" }, "tooltip": { "mode": "single" } }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM public.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id \n FROM public.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_replica'\n )\n AND m.label IN('healthy_replica','unhealthy_replica') \n GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster has replica", "type": "timeseries" } ], "refresh": "", "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "cluster_name", "query": "${VAR_CLUSTER_NAME}", "skipUrlSync": false, "type": "constant", "current": { 
"value": "${VAR_CLUSTER_NAME}", "text": "${VAR_CLUSTER_NAME}", "selected": false }, "options": [ { "value": "${VAR_CLUSTER_NAME}", "text": "${VAR_CLUSTER_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": true, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Cluster status: cluster1", "uid": "4BullO0nk", "version": 10, "weekStart": "" }check_patroni-1.0.0/vagrant/grafana/node_status_p1.json000066400000000000000000000311571447307111400232160ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_NODE_NAME", "type": "constant", "label": "node_name", "value": "p1", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": "datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640961009033, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_primary" }, "properties": [ { "id": "displayName", "value": "Primaire" } ] }, { "matcher": { "id": "byName", "options": "is_replica" }, "properties": [ { "id": "displayName", "value": "Secondaire" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = 
'$node_name' AND s.service = 'check_patroni_node_is_primary'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_replica'\n ) GROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Node type", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_alive" }, "properties": [ { "id": "displayName", "value": "Node is alive" } ] }, { "matcher": { "id": "byName", "options": "is_pending_restart" }, "properties": [ { "id": "displayName", "value": "Node is pending restart" } ] }, { "matcher": { "id": "byName", "options": "timeline" }, "properties": [ { "id": "displayName", "value": "Current timeline" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_alive'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_tl_has_changed'\n )\nAND m.label = 'timeline'\nGROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", 
"params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_pending_restart'\n ) GROUP BY time, m.label ORDER BY time", "refId": "D", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Health stats", "type": "stat" } ], "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "node_name", "query": "${VAR_NODE_NAME}", "skipUrlSync": false, "type": "constant", "current": { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false }, "options": [ { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": false, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Node status: p1", "uid": "2LfUnFAnk", "version": 1, "weekStart": "" }check_patroni-1.0.0/vagrant/grafana/node_status_p2.json000066400000000000000000000311601447307111400232110ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_NODE_NAME", "type": "constant", "label": "node_name", "value": "p2", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": "datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640960994907, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { 
"matcher": { "id": "byName", "options": "is_primary" }, "properties": [ { "id": "displayName", "value": "Primaire" } ] }, { "matcher": { "id": "byName", "options": "is_replica" }, "properties": [ { "id": "displayName", "value": "Secondaire" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_primary'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_replica'\n ) GROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Node type", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_alive" }, "properties": [ { "id": "displayName", "value": "Node is alive" } ] }, { "matcher": { "id": "byName", "options": "is_pending_restart" }, "properties": [ { "id": "displayName", "value": "Node is pending restart" } ] }, { "matcher": { "id": "byName", "options": "timeline" }, "properties": [ { "id": "displayName", "value": "Current timeline" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = 
'$node_name' AND s.service = 'check_patroni_node_is_alive'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_tl_has_changed'\n )\nAND m.label = 'timeline'\nGROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_pending_restart'\n ) GROUP BY time, m.label ORDER BY time", "refId": "D", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Health stats", "type": "stat" } ], "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "node_name", "query": "${VAR_NODE_NAME}", "skipUrlSync": false, "type": "constant", "current": { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false }, "options": [ { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": false, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Node status: p2", "uid": "2LfUnFAnkr", "version": 1, "weekStart": "" }check_patroni-1.0.0/vagrant/provision/000077500000000000000000000000001447307111400200155ustar00rootroot00000000000000check_patroni-1.0.0/vagrant/provision/check_patroni.bash000077500000000000000000000012371447307111400234730ustar00rootroot00000000000000#!/usr/bin/env bash info (){ echo "$1" } ORIGIN=$1 set -o errexit set -o nounset set 
-o pipefail info "#=============================================================================" info "# check_patroni" info "#=============================================================================" DEBIAN_FRONTEND=noninteractive apt install -q -y git python3-pip pip3 install --upgrade pip case "$ORIGIN" in "test") cd /check_patroni pip3 install . ln -s /usr/local/bin/check_patroni /usr/lib/nagios/plugins/check_patroni ;; "official") pip3 install check_patroni ;; *) echo "Origin : [$ORIGIN] is not supported" exit 1 esac check_patroni --version check_patroni-1.0.0/vagrant/provision/director.bash000077500000000000000000000174451447307111400225050ustar00rootroot00000000000000#!/usr/bin/env bash info(){ echo "$1" } usage(){ echo "$0 ACTION CLUSTER_NAME [NODE..]" echo "" echo " ACTION: init | add" echo " CLUSTER: cluster name" echo " NODE: HOST=IP" echo " HOST: any name for icinga" echo " IP: the IP" } if [ "$#" -le "3" ]; then usage exit 1 fi ACTION="$1" shift CLUSTER="$1" shift NODES=( "$@" ) TARGET="/etc/icinga2/conf.d/check_patroni.conf" #set -o errexit set -o nounset set -o pipefail init(){ cat << '__EOF__' > "$TARGET" # =================================================================== # Check Commands # =================================================================== template CheckCommand "check_patroni" { command = [ PluginDir + "/check_patroni" ] arguments = { "--endpoints" = { value = "$endpoints$" order = -2 repeat_key = true } "--timeout" = { value = "$timeout$" order = -1 } } } object CheckCommand "check_patroni_node_is_alive" { import "check_patroni" arguments += { "node_is_alive" = { order = 1 } } } object CheckCommand "check_patroni_node_is_primary" { import "check_patroni" arguments += { "node_is_primary" = { order = 1 } } } object CheckCommand "check_patroni_node_is_replica" { import "check_patroni" arguments += { "node_is_replica" = { order = 1 } } } object CheckCommand "check_patroni_node_is_pending_restart" { import "check_patroni" arguments += { "node_is_pending_restart" = { order = 1 } } } object CheckCommand "check_patroni_node_patroni_version" { import "check_patroni" arguments += { "node_patroni_version" = { order = 1 } "--patroni-version" = { value = "$patroni_version$" order = 2 } } } object CheckCommand "check_patroni_node_tl_has_changed" { import "check_patroni" arguments += { "node_tl_has_changed" = { order = 1 } "--state-file" = { value = "/tmp/$state_file$" # a quick and dirty way for this poc order = 2 } } } # ------------------------------------------------------------------- object CheckCommand "check_patroni_cluster_has_leader" { import "check_patroni" arguments += { "cluster_has_leader" = { order = 1 } } } object CheckCommand "check_patroni_cluster_has_replica" { import "check_patroni" arguments += { "cluster_has_replica" = { order = 1 } "--warning" = { value = "$has_replica_warning$" order = 2 } "--critical" = { value = "$has_replica_critical$" order = 3 } } } object CheckCommand "check_patroni_cluster_config_has_changed" { import "check_patroni" arguments += { "cluster_config_has_changed" = { order = 1 } "--state-file" = { value = "/tmp/$state_file$" # a quick and dirty way for this poc order = 2 } } } object CheckCommand "check_patroni_cluster_is_in_maintenance" { import "check_patroni" arguments += { "cluster_is_in_maintenance" = { order = 1 } } } object CheckCommand "check_patroni_cluster_node_count" { import "check_patroni" arguments += { "cluster_node_count" = { order = 1 } "--warning" = { value = "$node_count_warning$" order = 2 } 
"--critical" = { value = "$node_count_critical$" order = 3 } "--running-warning" = { value = "$node_count_running_warning$" order = 4 } "--running-critical" = { value = "$node_count_running_critical$" order = 5 } } } # =================================================================== # Services # =================================================================== template Service "check_patroni" { max_check_attempts = 3 check_interval = 1m # we spam a little for the sake of testing retry_interval = 15 # we spam a little for the sake of testing enable_perfdata = true vars.timeout = 10 } apply Service "check_patroni_node_is_alive" { import "check_patroni" check_command = "check_patroni_node_is_alive" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_primary" { import "check_patroni" check_command = "check_patroni_node_is_primary" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_replica" { import "check_patroni" check_command = "check_patroni_node_is_replica" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_pending_restart" { import "check_patroni" check_command = "check_patroni_node_is_pending_restart" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_patroni_version" { import "check_patroni" check_command = "check_patroni_node_patroni_version" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_tl_has_changed" { import "check_patroni" vars.state_file = host.name + ".state" check_command = "check_patroni_node_tl_has_changed" assign where "patroni_servers" in host.groups } # ------------------------------------------------------------------- apply Service "check_patroni_cluster_has_leader" { import "check_patroni" check_command = "check_patroni_cluster_has_leader" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_has_replica" { import "check_patroni" check_command = "check_patroni_cluster_has_replica" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_config_has_changed" { import "check_patroni" vars.state_file = host.name + ".state" check_command = "check_patroni_cluster_config_has_changed" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_is_in_maintenance" { import "check_patroni" check_command = "check_patroni_cluster_is_in_maintenance" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_node_count" { import "check_patroni" check_command = "check_patroni_cluster_node_count" assign where "patroni_clusters" in host.groups } # =================================================================== # Hosts meta # =================================================================== object HostGroup "patroni_servers" { display_name = "patroni servers" } template Host "patroni_servers" { groups = [ "patroni_servers" ] check_command = "hostalive" vars.patroni_version = "2.1.2" } # ------------------------------------------------------------------- object HostGroup "patroni_clusters" { display_name = "patroni clusters" } template Host "patroni_clusters" { groups = [ "patroni_clusters" ] check_command = "dummy" } # =================================================================== # Hosts meta # =================================================================== __EOF__ } add_hosts(){ NODES=$@ for N in "${NODES[@]}"; do IP="${N##*=}" HOST="${N%=*}" cat << __EOF__ >> "$TARGET" object Host "$HOST" 
{ import "patroni_servers" display_name = "Server patroni $HOST" address = "$IP" vars.endpoints = [ "http://" + address + ":8008" ] } __EOF__ done } add_cluster(){ CLUSTER=$1 NODES=$2 NAME="" IPS=" " for N in "${NODES[@]}"; do IP="${N##*=}" HOST="${N%=*}" NAME="$NAME $HOST" IPS="$IPS\"http://${IP}:8008\", " done cat << __EOF__ >> "$TARGET" object Host "$CLUSTER" { import "patroni_clusters" display_name = "Cluster: $CLUSTER ($NAME )" vars.endpoints = [$IPS ] vars.has_replica_warning = "1:" vars.has_replica_critical = "1:" vars.node_count_warning = "2:" vars.node_count_critical = "1:" vars.node_count_running_warning = "2:" vars.node_count_running_critical = "1:" } __EOF__ } case "$ACTION" in "init") init add_hosts "$NODES" add_cluster "$CLUSTER" "$NODES" ;; "add") add_hosts "$NODES" add_cluster "$CLUSTER" "$NODES" ;; *) usage echo "error: invalid action" exit 1 esac check_patroni-1.0.0/vagrant/provision/icinga2.bash000077500000000000000000000270231447307111400221770ustar00rootroot00000000000000#!/usr/bin/env bash info (){ echo "$1" } #set -o errexit set -o nounset set -o pipefail NODENAME="$1" shift PG_ICINGA_USER_NAME="supervisor" PG_ICINGA_USER_PWD="th3Pass" PG_ICINGAWEB_USER_NAME="supervisor" PG_ICINGAWEB_USER_PWD="th3Pass" PG_DIRECTOR_USER_NAME="supervisor" PG_DIRECTOR_USER_PWD="th3Pass" PG_OPM_USER_NAME="opm" PG_OPM_USER_PWD="th3Pass" PG_GRAFANA_USER_NAME="supervisor" PG_GRAFANA_USER_PWD="th3Pass" set_hostname(){ info "#=============================================================================" info "# hostname and /etc/hosts setup" info "#=============================================================================" hostnamectl set-hostname "${NODENAME}" sed --in-place -e "s/\(127\.0\.0\.1\s*localhost$\)/\1 ${NODENAME}/" /etc/hosts } packages(){ info "#=============================================================================" info "# install required repos and packages" info "#=============================================================================" apt-get update || true apt-get -y install apt-transport-https wget gnupg software-properties-common DIST=$(awk -F"[)(]+" '/VERSION=/ {print $2}' /etc/os-release) echo "deb https://packages.icinga.com/debian icinga-${DIST} main" > "/etc/apt/sources.list.d/${DIST}-icinga.list" echo "deb-src https://packages.icinga.com/debian icinga-${DIST} main" >> "/etc/apt/sources.list.d/${DIST}-icinga.list" echo "deb https://packages.grafana.com/oss/deb stable main" > /etc/apt/sources.list.d/grafana.list echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list wget -q -O - https://packages.icinga.com/icinga.key | apt-key add - wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add - wget -q -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - apt-get update || true PACKAGES=( grafana icinga2 icinga2-ido-pgsql icingaweb2 icingaweb2-module-monitoring icingacli postgresql-client postgresql-14 php7.3-pgsql php7.3-imagick php7.3-intl nagios-plugins ) DEBIAN_FRONTEND=noninteractive apt install -q -y "${PACKAGES[@]}" systemctl --quiet --now enable postgresql@14 } icinga_setup(){ info "#=============================================================================" info "# Icinga setup" info "#=============================================================================" ## this part is already done by the standart icinga install with the user icinga2 ## and a random password, here we dont really care cat << __EOF__ | sudo -u postgres psql DROP ROLE IF 
EXISTS supervisor; DROP DATABASE IF EXISTS icinga2; CREATE ROLE ${PG_ICINGA_USER_NAME} WITH LOGIN SUPERUSER PASSWORD '${PG_ICINGA_USER_PWD}'; CREATE DATABASE icinga2; __EOF__ echo "*:*:*:${PG_ICINGA_USER_NAME}:${PG_ICINGA_USER_PWD}" > ~postgres/.pgpass chown postgres:postgres ~postgres/.pgpass chmod 600 ~postgres/.pgpass PGPASSFILE=~postgres/.pgpass psql -U $PG_ICINGA_USER_NAME -h 127.0.0.1 -d icinga2 -f /usr/share/icinga2-ido-pgsql/schema/pgsql.sql icingacli setup config directory --group icingaweb2 icingacli setup token create ## this part is already done by the standart icinga install with the user icinga2 cat << __EOF__ > /etc/icinga2/features-available/ido-pgsql.conf /** * The db_ido_pgsql library implements IDO functionality * for PostgreSQL. */ library "db_ido_pgsql" object IdoPgsqlConnection "ido-pgsql" { user = "${PG_ICINGA_USER_NAME}", password = "${PG_ICINGA_USER_PWD}", host = "localhost", database = "icinga2" } __EOF__ icinga2 feature enable ido-pgsql icinga2 feature enable command icinga2 feature enable perfdata #icinga2 node wizard icinga2 node setup --master --cn s1 --zone master systemctl restart icinga2.service } icinga_API(){ info "#=============================================================================" info "# Icinga API" info "#=============================================================================" icinga2 api setup cat <<__EOF__ >> /etc/icinga2/conf.d/api-users.conf object ApiUser "icingaapi" { password = "th3Pass" permissions = [ "*" ] } __EOF__ systemctl restart icinga2.service } icinga_web(){ info "#=============================================================================" info "# Icinga2 Web" info "#=============================================================================" if [ "$PG_ICINGA_USER_NAME" != "$PG_ICINGAWEB_USER_NAME" ]; then sudo -u postgres psql -c "CREATE ROLE ${PG_ICINGAWEB_USER_NAME} WITH LOGIN PASSWORD '${PG_ICINGAWEB_USER_PWD}';" fi sudo -u postgres psql -c "CREATE DATABASE icingaweb_db OWNER ${PG_ICINGAWEB_USER_NAME};" sed --in-place -e "s/;date\.timezone =/date.timezone = europe\/paris/" /etc/php/7.3/apache2/php.ini a2enconf icingaweb2 a2enmod rewrite a2dismod mpm_event a2enmod php7.3 systemctl restart apache2 } director(){ info "#=============================================================================" info "# Icinga director" info "#=============================================================================" # Create the database if [ "$PG_ICINGA_USER_NAME" != "$PG_DIRECTOR_USER_NAME" ]; then sudo -u postgres psql -c "CREATE ROLE ${PG_DIRECTOR_USER_NAME} WITH LOGIN PASSWORD '${PG_DIRECTOR_USER_PWD}';" fi sudo -u postgres psql -c "CREATE DATABASE director_db OWNER ${PG_DIRECTOR_USER_NAME};" sudo -iu postgres psql -d director_db -c "CREATE EXTENSION pgcrypto;" ## Prereq MODULE_NAME=incubator MODULE_VERSION=v0.11.0 MODULES_PATH="/usr/share/icingaweb2/modules" MODULE_PATH="${MODULES_PATH}/${MODULE_NAME}" RELEASES="https://github.com/Icinga/icingaweb2-module-${MODULE_NAME}/archive" mkdir "$MODULE_PATH" \ && wget -q $RELEASES/${MODULE_VERSION}.tar.gz -O - \ | tar xfz - -C "$MODULE_PATH" --strip-components 1 icingacli module enable "${MODULE_NAME}" ## Director MODULE_VERSION="1.8.1" ICINGAWEB_MODULEPATH="/usr/share/icingaweb2/modules" REPO_URL="https://github.com/icinga/icingaweb2-module-director" TARGET_DIR="${ICINGAWEB_MODULEPATH}/director" URL="${REPO_URL}/archive/v${MODULE_VERSION}.tar.gz" useradd -r -g icingaweb2 -d /var/lib/icingadirector -s /bin/false icingadirector install -d -o icingadirector -g icingaweb2 -m 
icinga_web(){
    info "#============================================================================="
    info "# Icinga2 Web"
    info "#============================================================================="

    if [ "$PG_ICINGA_USER_NAME" != "$PG_ICINGAWEB_USER_NAME" ]; then
        sudo -u postgres psql -c "CREATE ROLE ${PG_ICINGAWEB_USER_NAME} WITH LOGIN PASSWORD '${PG_ICINGAWEB_USER_PWD}';"
    fi
    sudo -u postgres psql -c "CREATE DATABASE icingaweb_db OWNER ${PG_ICINGAWEB_USER_NAME};"

    sed --in-place -e "s/;date\.timezone =/date.timezone = europe\/paris/" /etc/php/7.3/apache2/php.ini

    a2enconf icingaweb2
    a2enmod rewrite
    a2dismod mpm_event
    a2enmod php7.3
    systemctl restart apache2
}

director(){
    info "#============================================================================="
    info "# Icinga director"
    info "#============================================================================="

    # Create the database
    if [ "$PG_ICINGA_USER_NAME" != "$PG_DIRECTOR_USER_NAME" ]; then
        sudo -u postgres psql -c "CREATE ROLE ${PG_DIRECTOR_USER_NAME} WITH LOGIN PASSWORD '${PG_DIRECTOR_USER_PWD}';"
    fi
    sudo -u postgres psql -c "CREATE DATABASE director_db OWNER ${PG_DIRECTOR_USER_NAME};"
    sudo -iu postgres psql -d director_db -c "CREATE EXTENSION pgcrypto;"

    ## Prereq
    MODULE_NAME=incubator
    MODULE_VERSION=v0.11.0
    MODULES_PATH="/usr/share/icingaweb2/modules"
    MODULE_PATH="${MODULES_PATH}/${MODULE_NAME}"
    RELEASES="https://github.com/Icinga/icingaweb2-module-${MODULE_NAME}/archive"
    mkdir "$MODULE_PATH" \
        && wget -q $RELEASES/${MODULE_VERSION}.tar.gz -O - \
        | tar xfz - -C "$MODULE_PATH" --strip-components 1
    icingacli module enable "${MODULE_NAME}"

    ## Director
    MODULE_VERSION="1.8.1"
    ICINGAWEB_MODULEPATH="/usr/share/icingaweb2/modules"
    REPO_URL="https://github.com/icinga/icingaweb2-module-director"
    TARGET_DIR="${ICINGAWEB_MODULEPATH}/director"
    URL="${REPO_URL}/archive/v${MODULE_VERSION}.tar.gz"

    useradd -r -g icingaweb2 -d /var/lib/icingadirector -s /bin/false icingadirector
    install -d -o icingadirector -g icingaweb2 -m 0750 /var/lib/icingadirector
    install -d -m 0755 "${TARGET_DIR}"
    wget -q -O - "$URL" | tar xfz - -C "${TARGET_DIR}" --strip-components 1
    cp "${TARGET_DIR}/contrib/systemd/icinga-director.service" /etc/systemd/system/
    icingacli module enable director
    systemctl daemon-reload
    systemctl enable icinga-director.service
    systemctl start icinga-director.service

    # The permissions have to be like this to let icingaweb activate modules
    chown -R www-data:icingaweb2 /etc/icingaweb2
}

grafana(){
    info "#============================================================================="
    info "# Grafana"
    info "#============================================================================="

    if [ "$PG_ICINGA_USER_NAME" != "$PG_GRAFANA_USER_NAME" ]; then
        sudo -u postgres psql -c "CREATE ROLE ${PG_GRAFANA_USER_NAME} WITH LOGIN PASSWORD '${PG_GRAFANA_USER_PWD}';"
    fi
    sudo -u postgres psql -c "CREATE DATABASE grafana OWNER ${PG_GRAFANA_USER_NAME};"

    cat << __EOF__ > /etc/grafana/grafana.ini
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as one string using the url property.
# Either "mysql", "postgres" or "sqlite3", it's your choice
type = postgres
host = 127.0.0.1:5432
name = grafana
user = $PG_GRAFANA_USER_NAME
password = $PG_GRAFANA_USER_PWD
__EOF__

    systemctl --quiet --now enable grafana-server.service
}
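# Grafana only gets a database for its own metadata here; the graphs are
# expected to read the perfdata stored by OPM. Once opm() below has created
# the "grafana" role, a PostgreSQL datasource pointing at the "opm" database
# on 127.0.0.1:5432 can be added from the Grafana UI (default port 3000).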
opm(){
    info "#============================================================================="
    info "# OPM"
    info "#============================================================================="

    ## OPM Install
    DEBIAN_FRONTEND=noninteractive apt install -q -y postgresql-server-dev-10 libdbd-pg-perl git build-essential

    cd /usr/local/src || exit 1
    git clone https://github.com/OPMDG/opm-core.git
    git clone https://github.com/OPMDG/opm-wh_nagios.git
    cd /usr/local/src/opm-wh_nagios/pg/ || exit 1
    make install
    cd /usr/local/src/opm-core/pg/ || exit 1
    make install

    ## OPM db setup
    cat << __EOF__ | sudo -iu postgres psql
CREATE ROLE ${PG_OPM_USER_NAME} WITH LOGIN PASSWORD '${PG_OPM_USER_PWD}';
CREATE DATABASE opm OWNER ${PG_OPM_USER_NAME};
__EOF__

    cat << __EOF__ | sudo -iu postgres psql -d opm
CREATE EXTENSION opm_core;
CREATE EXTENSION wh_nagios CASCADE;
SELECT * FROM grant_dispatcher('wh_nagios', 'opm');
__EOF__

    ## OPM dispatcher
    cat << EOF > /etc/opm_dispatcher.conf
daemon=0
directory=/var/spool/icinga2/perfdata
frequency=5
db_connection_string=dbi:Pg:dbname=opm host=localhost
db_user=${PG_OPM_USER_NAME}
db_password=${PG_OPM_USER_PWD}
debug=0
syslog=1
hostname_filter = /^$/ # Empty hostname. Never happens
service_filter = /^$/ # Empty service
label_filter = /^$/ # Empty label
EOF

    cat <<'EOF' > /etc/systemd/system/opm_dispatcher.service
[Unit]
Description=dispatcher nagios, import perf files from icinga to opm

[Service]
User=nagios
Group=nagios
ExecStart=/usr/local/src/opm-wh_nagios/bin/nagios_dispatcher.pl -c /etc/opm_dispatcher.conf
# start right after boot
Type=simple
# restart on crash
Restart=always
# after 10s
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

    ## OPM planned task
    cat <<'EOF' > /etc/systemd/system/opm_dispatch_record.service
[Unit]
Description=Run wh_nagios.dispatch_record() on OPM database

[Service]
Type=oneshot
User=postgres
Group=postgres
SyslogIdentifier=opm_dispatch_record
ExecStart=/usr/bin/psql -U postgres -d opm -c "SELECT * FROM wh_nagios.dispatch_record()"
EOF

    cat <<'EOF' > /etc/systemd/system/opm_dispatch_record.timer
[Unit]
Description=Timer to run wh_nagios.dispatch_record() on OPM

[Timer]
OnBootSec=60s
OnUnitInactiveSec=1min

[Install]
WantedBy=timers.target
EOF

    systemctl daemon-reload
    systemctl enable opm_dispatcher
    systemctl start opm_dispatcher
    systemctl enable opm_dispatch_record.timer
    systemctl start opm_dispatch_record.timer

    ## To check once everything is setup (icingaweb is setup)
    # sudo journalctl -fu opm_dispatcher
    # sudo journalctl -ft opm_dispatch_record

    ## Grants for grafana
    sudo -iu postgres psql -c "CREATE ROLE grafana WITH LOGIN PASSWORD 'th3Pass'"
    cat <