pax_global_header00006660000000000000000000000064137733721550014527gustar00rootroot0000000000000052 comment=953b41aa9f00f57e3e815081e332d7833112f616 nohang-0.2.0/000077500000000000000000000000001377337215500130005ustar00rootroot00000000000000nohang-0.2.0/.gitignore000066400000000000000000000023441377337215500147730ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover .hypothesis/ .pytest_cache/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # pyenv .python-version # celery beat schedule file celerybeat-schedule # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ # Kate .kate-swp # deb /deb/package/ /deb/package.deb nohang-0.2.0/.travis.yml000066400000000000000000000015211377337215500151100ustar00rootroot00000000000000dist: bionic language: python sudo: required script: - sudo make install - sudo systemctl enable --now nohang.service - sudo systemctl stop nohang.service - sudo systemctl enable --now nohang-desktop.service - sudo systemctl stop nohang-desktop.service - oom-sort -h - oom-sort - nohang -h - nohang -v - nohang --check 
--config /usr/local/etc/nohang/nohang.conf - nohang --check --config /usr/local/etc/nohang/nohang-desktop.conf - nohang --check --config conf/nohang/test.conf - sudo nohang --config /usr/local/etc/nohang/nohang.conf --tasks - sudo nohang --config /usr/local/etc/nohang/nohang-desktop.conf --tasks - /bin/sleep 60 & - sudo bash -c "nohang --monitor --config conf/nohang/test.conf & tail /dev/zero & sleep 30 && pkill python3" - sudo cat /var/log/nohang/nohang.log - sudo make uninstall nohang-0.2.0/CHANGELOG.md000066400000000000000000000073571377337215500146250ustar00rootroot00000000000000# Changelog This changelog is outdated. It will be updated later. ## [Unreleased] - Added new CLI options: - -v, --version - -m, --memload - --monitor - --tasks - --check-config - Possible process crashes are fixed: - Fixed crash at startup due to `UnicodeDecodeError` on some systems - Handled `UnicodeDecodeError` if victim name consists of many unicode characters ([rfjakob/earlyoom#110](https://github.com/rfjakob/earlyoom/issues/110)) - Fixed process crash before performing corrective actions if Python 3.4 or lower are used to interpret nohang - Improve output: - Display `oom_score`, `oom_score_adj`, `Ancestry`, `EUID`, `State`, `VmSize`, `RssAnon`, `RssFile`, `RssShmem`, `CGroup_v1`, `CGroup_v2`, `Realpath`, `Cmdline` and `Lifetime` of the victim in corrective action reports - Added memory report interval - Added delta memory info (the rate of change of available memory) - Print statistics on corrective actions after each corrective action - Added ability to print a process table before each corrective action - Added the ability to log into a separate file - Improved GUI warnings: - Reduced the idle time of the daemon in the process of launching a notification - All notify-send calls are made using the `nohang_notify_helper` script, in which all timeouts are handled (not anymore: nohang_notify_helper has been removed) - Native python implementation of `env` search without running `ps` 
to notify all users if nohang started with UID=0. - Improved modifying badness via matching with regular expressions: - Added the ability to set many different `badness_adj` for processes depending on the matching `Name`, `CGroup_v1`, `CGroup_v2`, `cmdline`, `realpath`, `environ` and `EUID` with the specified regular expressions ([issue #11](https://github.com/hakavlad/nohang/issues/11)) - Fix: replace `re.fullmatch()` by `re.search()` - Reduced memory usage: - Reduced memory usage and startup time (using `sys.argv` instead of `argparse`) - Reduced memory usage with `mlockall()` using `MCL_ONFAULT` ([rfjakob/earlyoom#112](https://github.com/rfjakob/earlyoom/issues/112)) - Lock all memory by default using mlockall() - Added new tools: - `oom-sort` - `psi-top` - `psi2log` - Improve poll rate algorithm - Fixed Makefile for installation on CentOS 7 (remove gzip `-k` option). - Added `max_post_sigterm_victim_lifetime` option: send SIGKILL to the victim if it doesn't respond to SIGTERM for a certain time - Added `post_kill_exe` option (the ability to run any command after killing a victim) - Added `warning_exe` option (the ability to run any command instead of GUI low memory warnings) - Added `victim_cache_time` option - Improved victim search algorithm (do it ~30% faster) ([rfjakob/earlyoom#114](https://github.com/rfjakob/earlyoom/issues/114)) - Improved limiting `oom_score_adj`: now it can work with UID != 0 - Fixed conf parsing: use of `line.partition('=')` instead of `line.split('=')` - Removed self-defense options from the config, use systemd unit scheduling instead - Added the ability to send any signal instead of SIGTERM for processes with certain names - Added support for `PSI` - Recheck memory levels after finding a victim to prevent killing innocent victims in some cases ([issue #20](https://github.com/hakavlad/nohang/issues/20)) - Now one corrective action to one victim can be applied only once.
- Ignoring zram by default, checking for this has become optional. - Improved user input validation - Improved documentation - Handle signals (SIGTERM, SIGINT, SIGQUIT, SIGHUP), print total stat by corrective actions at exit. ## [0.1] - 2018-11-23 [unreleased]: https://github.com/hakavlad/nohang/compare/v0.1...HEAD [0.1]: https://github.com/hakavlad/nohang/releases/tag/v0.1 nohang-0.2.0/LICENSE000066400000000000000000000020571377337215500140110ustar00rootroot00000000000000MIT License Copyright (c) 2018 Alexey Avramov Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
nohang-0.2.0/Makefile000066400000000000000000000117261377337215500144470ustar00rootroot00000000000000DESTDIR ?= PREFIX ?= /usr/local SYSCONFDIR ?= /usr/local/etc SYSTEMDUNITDIR ?= /usr/local/lib/systemd/system BINDIR ?= $(PREFIX)/bin SBINDIR ?= $(PREFIX)/sbin DATADIR ?= $(PREFIX)/share DOCDIR ?= $(DATADIR)/doc/nohang MANDIR ?= $(DATADIR)/man PANDOC := $(shell command -v pandoc 2> /dev/null) all: @ echo "Use: make install, make install-openrc, make uninstall" update-manpages: ifdef PANDOC pandoc docs/nohang.manpage.md -s -t man > man/nohang.8 pandoc docs/oom-sort.manpage.md -s -t man > man/oom-sort.1 pandoc docs/psi2log.manpage.md -s -t man > man/psi2log.1 pandoc docs/psi-top.manpage.md -s -t man > man/psi-top.1 else @echo "pandoc is not installed, skipping manpages generation" endif base: install -p -d $(DESTDIR)$(SBINDIR) install -p -m0755 src/nohang $(DESTDIR)$(SBINDIR)/nohang install -p -d $(DESTDIR)$(BINDIR) install -p -m0755 src/oom-sort $(DESTDIR)$(BINDIR)/oom-sort install -p -m0755 src/psi-top $(DESTDIR)$(BINDIR)/psi-top install -p -m0755 src/psi2log $(DESTDIR)$(BINDIR)/psi2log install -p -d $(DESTDIR)$(SYSCONFDIR)/nohang sed "s|:TARGET_DATADIR:|$(DATADIR)|" \ conf/nohang/nohang.conf.in > nohang.conf sed "s|:TARGET_DATADIR:|$(DATADIR)|" \ conf/nohang/nohang-desktop.conf.in > nohang-desktop.conf install -p -m0644 nohang.conf $(DESTDIR)$(SYSCONFDIR)/nohang/nohang.conf install -p -m0644 nohang-desktop.conf $(DESTDIR)$(SYSCONFDIR)/nohang/nohang-desktop.conf install -p -d $(DESTDIR)$(DATADIR)/nohang install -p -m0644 nohang.conf $(DESTDIR)$(DATADIR)/nohang/nohang.conf install -p -m0644 nohang-desktop.conf $(DESTDIR)$(DATADIR)/nohang/nohang-desktop.conf -git describe --tags --long --dirty > version install -p -m0644 version $(DESTDIR)$(DATADIR)/nohang/version rm -fv nohang.conf rm -fv nohang-desktop.conf rm -fv version install -p -d $(DESTDIR)/etc/logrotate.d install -p -m0644 conf/logrotate.d/nohang $(DESTDIR)/etc/logrotate.d/nohang install -p -d 
$(DESTDIR)$(MANDIR)/man1 gzip -9cn man/oom-sort.1 > $(DESTDIR)$(MANDIR)/man1/oom-sort.1.gz gzip -9cn man/psi-top.1 > $(DESTDIR)$(MANDIR)/man1/psi-top.1.gz gzip -9cn man/psi2log.1 > $(DESTDIR)$(MANDIR)/man1/psi2log.1.gz install -p -d $(DESTDIR)$(MANDIR)/man8 sed "s|:SYSCONFDIR:|$(SYSCONFDIR)|g; s|:DATADIR:|$(DATADIR)|g" \ man/nohang.8 > nohang.8 gzip -9cn nohang.8 > $(DESTDIR)$(MANDIR)/man8/nohang.8.gz rm -fv nohang.8 install -p -d $(DESTDIR)$(DOCDIR) install -p -m0644 README.md $(DESTDIR)$(DOCDIR)/README.md install -p -m0644 CHANGELOG.md $(DESTDIR)$(DOCDIR)/CHANGELOG.md units: install -p -d $(DESTDIR)$(SYSTEMDUNITDIR) sed "s|:TARGET_SBINDIR:|$(SBINDIR)|; s|:TARGET_SYSCONFDIR:|$(SYSCONFDIR)|" \ systemd/nohang.service.in > nohang.service sed "s|:TARGET_SBINDIR:|$(SBINDIR)|; s|:TARGET_SYSCONFDIR:|$(SYSCONFDIR)|" \ systemd/nohang-desktop.service.in > nohang-desktop.service install -p -m0644 nohang.service $(DESTDIR)$(SYSTEMDUNITDIR)/nohang.service install -p -m0644 nohang-desktop.service $(DESTDIR)$(SYSTEMDUNITDIR)/nohang-desktop.service rm -fv nohang.service rm -fv nohang-desktop.service chcon: chcon -t systemd_unit_file_t $(DESTDIR)$(SYSTEMDUNITDIR)/nohang.service || : chcon -t systemd_unit_file_t $(DESTDIR)$(SYSTEMDUNITDIR)/nohang-desktop.service || : daemon-reload: systemctl daemon-reload || : build_deb: base units reinstall-deb: set -v deb/build.sh sudo apt install --reinstall ./deb/package.deb install: base units chcon daemon-reload # This is fine. 
install-openrc: base install -p -d $(DESTDIR)$(SYSCONFDIR)/init.d sed "s|:TARGET_SBINDIR:|$(SBINDIR)|; s|:TARGET_SYSCONFDIR:|$(SYSCONFDIR)|" \ openrc/nohang.in > openrc/nohang sed "s|:TARGET_SBINDIR:|$(SBINDIR)|; s|:TARGET_SYSCONFDIR:|$(SYSCONFDIR)|" \ openrc/nohang-desktop.in > openrc/nohang-desktop install -p -m0775 openrc/nohang $(DESTDIR)$(SYSCONFDIR)/init.d/nohang install -p -m0775 openrc/nohang-desktop $(DESTDIR)$(SYSCONFDIR)/init.d/nohang-desktop rm -fv openrc/nohang rm -fv openrc/nohang-desktop uninstall-base: rm -fv $(DESTDIR)$(SBINDIR)/nohang rm -fv $(DESTDIR)$(BINDIR)/oom-sort rm -fv $(DESTDIR)$(BINDIR)/psi-top rm -fv $(DESTDIR)$(BINDIR)/psi2log rm -fv $(DESTDIR)$(MANDIR)/man1/oom-sort.1.gz rm -fv $(DESTDIR)$(MANDIR)/man1/psi-top.1.gz rm -fv $(DESTDIR)$(MANDIR)/man1/psi2log.1.gz rm -fv $(DESTDIR)$(MANDIR)/man8/nohang.8.gz rm -fvr $(DESTDIR)/etc/logrotate.d/nohang rm -fvr $(DESTDIR)$(DOCDIR)/ rm -fvr $(DESTDIR)/var/log/nohang/ rm -fvr $(DESTDIR)$(DATADIR)/nohang/ rm -fvr $(DESTDIR)$(SYSCONFDIR)/nohang/ uninstall-units: systemctl stop nohang.service || : systemctl stop nohang-desktop.service || : systemctl disable nohang.service || : systemctl disable nohang-desktop.service || : rm -fv $(DESTDIR)$(SYSTEMDUNITDIR)/nohang.service rm -fv $(DESTDIR)$(SYSTEMDUNITDIR)/nohang-desktop.service uninstall-openrc: rc-service nohang-desktop stop || : rc-service nohang stop || : rm -fv $(DESTDIR)$(SYSCONFDIR)/init.d/nohang rm -fv $(DESTDIR)$(SYSCONFDIR)/init.d/nohang-desktop uninstall: uninstall-base uninstall-units daemon-reload uninstall-openrc nohang-0.2.0/README.md000066400000000000000000001427051377337215500142700ustar00rootroot00000000000000![pic](https://i.imgur.com/scXQ312.png) # nohang [![Build Status](https://travis-ci.org/hakavlad/nohang.svg?branch=master)](https://travis-ci.org/hakavlad/nohang) [![Total alerts](https://img.shields.io/lgtm/alerts/g/hakavlad/nohang.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/hakavlad/nohang/alerts/) [![Packaging 
status](https://repology.org/badge/tiny-repos/nohang.svg)](https://repology.org/project/nohang/versions) The `nohang` package provides a highly configurable daemon for Linux that can correctly prevent [out of memory](https://en.wikipedia.org/wiki/Out_of_memory) (OOM) conditions and keep the system responsive under low memory. The package also includes additional diagnostic tools (`oom-sort`, `psi2log`, `psi-top`). ## What is the problem? OOM conditions may cause [freezes](https://en.wikipedia.org/wiki/Hang_(computing)), [livelocks](https://en.wikipedia.org/wiki/Deadlock#Livelock), dropped [caches](https://en.wikipedia.org/wiki/Page_cache), and processes being killed (via [SIGKILL](https://en.wikipedia.org/wiki/Signal_(IPC)#SIGKILL)) instead of being terminated correctly (via [SIGTERM](https://en.wikipedia.org/wiki/Signal_(IPC)#SIGTERM) or another corrective action). Some applications may crash if it's impossible to allocate memory. Here are the statements of some users: > "How do I prevent Linux from freezing when out of memory? Today I (accidentally) ran some program on my Linux box that quickly used a lot of memory. My system froze, became unresponsive and thus I was unable to kill the offender. How can I prevent this in the future? Can't it at least keep a responsive core or something running?" — [serverfault](https://serverfault.com/questions/390623/how-do-i-prevent-linux-from-freezing-when-out-of-memory) > "With or without swap it still freezes before the OOM killer gets run automatically. This is really a kernel bug that should be fixed (i.e. run OOM killer earlier, before dropping all disk cache). Unfortunately kernel developers and a lot of other folk fail to see the problem. Common suggestions such as disable/enable swap, buy more RAM, run less processes, set limits etc. do not address the underlying problem that the kernel's low memory handling sucks camel's balls." 
— [serverfault](https://serverfault.com/questions/390623/how-do-i-prevent-linux-from-freezing-when-out-of-memory#comment417508_390625) > "The traditional Linux OOM killer works fine in some cases, but in others it kicks in too late, resulting in the system entering a [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock) for an indeterminate period." — [engineering.fb.com](https://engineering.fb.com/production-engineering/oomd/) Also look at these discussions: - Why are low memory conditions handled so badly? [[r/linux](https://www.reddit.com/r/linux/comments/56r4xj/why_are_low_memory_conditions_handled_so_badly/)] - Memory management "more effective" on Windows than Linux? (in preventing total system lockup) [[r/linux](https://www.reddit.com/r/linux/comments/aqd9mh/memory_management_more_effective_on_windows_than/)] - Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure [[original LKML post](https://lkml.org/lkml/2019/8/4/15) | [r/linux](https://www.reddit.com/r/linux/comments/cmg48b/lets_talk_about_the_elephant_in_the_room_the/) | [Hacker News](https://news.ycombinator.com/item?id=20620545) | [slashdot](https://linux.slashdot.org/story/19/08/06/1839206/linux-performs-poorly-in-low-ram--memory-pressure-situations-on-the-desktop) | [phoronix](https://www.phoronix.com/forums/forum/phoronix/general-discussion/1118164-yes-linux-does-bad-in-low-ram-memory-pressure-situations-on-the-desktop) | [opennet.ru](https://www.opennet.ru/opennews/art.shtml?num=51231) | [linux.org.ru](https://www.linux.org.ru/forum/talks/15151526)] ## Solution Use one of the userspace OOM killers: - [earlyoom](https://github.com/rfjakob/earlyoom): This is a simple, stable and tiny OOM prevention daemon written in C (the best choice for embedded and old servers). It has minimal dependencies and can work with old kernels. It is enabled by default on Fedora 32 Workstation (and F33 KDE). 
- [oomd](https://github.com/facebookincubator/oomd): This is a userspace OOM killer for linux systems written in C++ and developed by Facebook. This is the best choice for use in large data centers. It needs Linux 4.20+. - [low-memory-monitor](https://gitlab.freedesktop.org/hadess/low-memory-monitor/): There's a [project announcement](http://www.hadess.net/2019/08/low-memory-monitor-new-project.html). - [psi-monitor](https://github.com/endlessm/eos-boot-helper/tree/master/psi-monitor): It's used by default on [Endless OS](https://endlessos.com/). - `nohang`: nohang is earlyoom on steroids and has many useful features, see below. Maybe this is a good choice for modern desktops and servers if you need fine-tuning. It's used by default on [Garuda Linux](https://garudalinux.org/). Use these tools to improve responsiveness during heavy swapping: - [le9-patch](https://github.com/hakavlad/le9-patch): Protect active file pages to prevent thrashing and improve responsiveness under low-memory conditions. It's kernel-side solution that can fix OOM killer behavior. - [prelockd](https://github.com/hakavlad/prelockd): Lock executables and shared libraries in memory to improve system responsiveness under low-memory conditions. - [memavaild](https://github.com/hakavlad/memavaild): Keep amount of available memory by evicting memory of selected cgroups into swap space. - [uresourced](https://gitlab.freedesktop.org/benzea/uresourced): This daemon will give resource allocations to active graphical users. It's [enabled by default](https://fedoraproject.org/wiki/Changes/Reserve_resources_for_active_user_WS) on Fedora 33 Workstation. 
Of course, you can also [download more RAM](https://downloadmoreram.com/), tune [virtual memory](https://www.kernel.org/doc/Documentation/sysctl/vm.txt), use [zram](https://www.kernel.org/doc/Documentation/blockdev/zram.txt)/[zswap](https://www.kernel.org/doc/Documentation/vm/zswap.txt) and use [limits](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html) for cgroups. ## Features - Sending the SIGTERM signal is the default corrective action. If the victim does not respond to SIGTERM and available memory keeps dropping, it gets SIGKILL; - Customizing victim selection: impact on the badness of processes via matching their names, cgroups, exe realpaths, environs, cmdlines and euids with specified regular expressions; - Customizing corrective actions: if the name or control group of the victim matches a certain regex pattern, you can run any command instead of sending the SIGTERM signal (the default corrective action) to the victim. For example: - `systemctl restart foo`; - `kill -INT $PID` (you can override the signal sent to the victim, $PID will be replaced by the victim's PID). - GUI notifications: - Notification of corrective actions taken and displaying the name and PID of the victim; - Low memory warnings. - [zram](https://www.kernel.org/doc/Documentation/blockdev/zram.txt) support (`mem_used_total` as a trigger); - [PSI](https://lwn.net/Articles/759658/) ([pressure stall information](https://facebookmicrosites.github.io/psi/)) support; - Optional checking of kernel messages for OOM events; - Easy setup with configuration files ([nohang.conf](https://github.com/hakavlad/nohang/blob/master/conf/nohang/nohang.conf.in), [nohang-desktop.conf](https://github.com/hakavlad/nohang/blob/master/conf/nohang/nohang-desktop.conf.in)). ## Demo `nohang` prevents Out Of Memory with GUI notifications: - [https://youtu.be/ChTNu9m7uMU](https://youtu.be/ChTNu9m7uMU) – just an old demo without swap space. 
- [https://youtu.be/UCwZS5uNLu0](https://youtu.be/UCwZS5uNLu0) – running multiple fast memory hogs at the same time without swap space. - [https://youtu.be/PLVWgNrVNlc](https://youtu.be/PLVWgNrVNlc) – opening multiple chromium tabs with 2.3 GiB memory and 1.8 GiB swap space on zram. ## Requirements For basic usage: - `Linux` (>= 3.14, since `MemAvailable` appeared in `/proc/meminfo`) - `Python` (>= 3.3) To respond to `PSI` metrics (optional): - `Linux` (>= 4.20) with `CONFIG_PSI=y` To show GUI notifications (optional): - [notification server](https://wiki.archlinux.org/index.php/Desktop_notifications#Notification_servers) (most desktop environments use their own implementations) - `libnotify` (Arch Linux, Fedora, openSUSE) or `libnotify-bin` (Debian GNU/Linux, Ubuntu) - `sudo` if nohang is started with UID=0. ## Memory and CPU usage - VmRSS is about 10–14 MiB depending on the settings, about 10–11 MiB by default (with Python <= 3.8), about 16–17 MiB with Python 3.9. - CPU usage depends on the level of available memory and monitoring intensity. ## Warnings - the daemon runs with super-user privileges and has full access to all private memory of all processes and sensitive user data; - the daemon does not forbid you to shoot yourself in the foot: with some settings, unwanted killings of processes can occur; - the daemon is not a panacea: there are no universal settings that reliably protect against all types of threats. ## Known problems - The documentation is terrible. - The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache. However, in contrast to the buffer cache, it currently does not count toward MemAvailable (see [openzfs/zfs#10255](https://github.com/openzfs/zfs/issues/10255)). See also https://github.com/rfjakob/earlyoom/pull/191 and https://github.com/hakavlad/nohang/issues/89. 
- Linux kernels without `CONFIG_CGROUP_CPUACCT=y` ([linux-ck](https://wiki.archlinux.org/index.php/Linux-ck), for example) provide incorrect PSI metrics, see [issue](https://github.com/hakavlad/nohang/issues/25#issuecomment-643716504). ## nohang vs nohang-desktop `nohang` comes with two configs: `nohang.conf` and `nohang-desktop.conf`. `nohang` comes with two systemd service unit files: `nohang.service` and `nohang-desktop.service`. Choose one. - `nohang.conf` provides vanilla default settings without PSI checking enabled, without any badness correction and without GUI notifications enabled. - `nohang-desktop.conf` provides default settings optimized for desktop usage. ## How to install #### To install on Fedora: ```bash $ sudo dnf install nohang-desktop $ sudo systemctl enable --now nohang-desktop.service ``` #### To install on RHEL 7 and RHEL 8: nohang is available in [EPEL repos](https://fedoraproject.org/wiki/EPEL). ```bash $ sudo yum install nohang $ sudo systemctl enable nohang.service $ sudo systemctl start nohang.service ``` To enable PSI on RHEL 8, pass `psi=1` on the kernel boot command line. #### For Arch Linux there's an [AUR package](https://aur.archlinux.org/packages/nohang-git/) Use your favorite [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers). For example, ```bash $ yay -S nohang-git $ sudo systemctl enable --now nohang-desktop.service ``` #### To install on Ubuntu 20.04/20.10 To install from [PPA](https://launchpad.net/~oibaf/+archive/ubuntu/test/): ```bash $ sudo add-apt-repository ppa:oibaf/test $ sudo apt update $ sudo apt install nohang $ sudo systemctl enable --now nohang-desktop.service ``` #### To install on Debian and Ubuntu-based systems: The outdated and buggy nohang v0.1 release was packaged for [Debian 11](https://packages.debian.org/bullseye/source/nohang) and [Ubuntu 20.10](https://packages.ubuntu.com/source/groovy/nohang). It's easy to build a deb package with the latest git snapshot. 
Install build dependencies: ```bash $ sudo apt install make fakeroot ``` Clone the latest git snapshot and run the build script to build the package: ```bash $ git clone https://github.com/hakavlad/nohang.git && cd nohang $ deb/build.sh ``` Install the package: ```bash $ sudo apt install --reinstall ./deb/package.deb ``` Start and enable `nohang.service` or `nohang-desktop.service` after installing the package: ```bash $ sudo systemctl enable --now nohang-desktop.service ``` #### To install on Gentoo and derivatives (e.g. Funtoo): Add the [eph kit](https://git.sr.ht/~happy_shredder/eph_kit) overlay, for example using layman or as a local repository. Then update your repos: ```bash $ sudo layman -S # if added via layman $ sudo emerge --sync # local repo on Gentoo $ sudo ego sync # local repo on Funtoo ``` Install: ```bash $ sudo emerge -a nohang ``` Start the service: ```bash $ sudo rc-service nohang-desktop start ``` Optionally add to startup: ```bash $ sudo rc-update add nohang-desktop default ``` #### To install the latest version on any distro: ```bash $ git clone https://github.com/hakavlad/nohang.git && cd nohang $ sudo make install ``` Config files will be located in `/usr/local/etc/nohang/`. To enable and start unit without GUI notifications: ```bash $ sudo systemctl enable --now nohang.service ``` To enable and start unit with GUI notifications: ```bash $ sudo systemctl enable --now nohang-desktop.service ``` On systems with OpenRC: ```bash $ sudo make install-openrc ``` To uninstall: ```bash $ sudo make uninstall ``` ## Command line options ``` ./nohang -h usage: nohang [-h|--help] [-v|--version] [-m|--memload] [-c|--config CONFIG] [--check] [--monitor] [--tasks] optional arguments: -h, --help show this help message and exit -v, --version show version of installed package and exit -m, --memload consume memory until 40 MiB (MemAvailable + SwapFree) remain free, and terminate the process -c CONFIG, --config CONFIG path to the config file. 
This should only be used with one of the following options: --monitor, --tasks, --check --check check and show the configuration and exit. This should only be used with -c/--config CONFIG option --monitor start monitoring. This should only be used with -c/--config CONFIG option --tasks show tasks state and exit. This should only be used with -c/--config CONFIG option ``` ## How to configure The program can be configured by editing the config file. The configuration includes the following sections: 0. Checking kernel messages for OOM events; 1. Common zram settings; 2. Common PSI settings; 3. Poll rate; 4. Warnings and notifications; 5. Soft threshold; 6. Hard threshold; 7. Customize victim selection; 8. Customize soft corrective actions; 9. Misc settings; 10. Verbosity, debug, logging. Just read the description of the parameters and edit the values. Please restart the daemon to apply the changes. ## How to test nohang - The safest way is to run `nohang --memload`. This consumes memory, and the process will exit before OOM occurs. - Another way is to run `tail /dev/zero`. This consumes memory quickly and causes OOM at the end. If testing occurs while `nohang` is running, these processes should be terminated before OOM occurs. ## Tasks state Run `sudo nohang -c/--config CONFIG --tasks` to see the table of processes with their badness values, oom_scores, names, UIDs etc.
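
For a rough idea of where the `oom_score` column comes from, here is a minimal sketch that reads the value the kernel exposes in `/proc/<pid>/oom_score` for every process. It is not nohang's own implementation: the function name `read_oom_scores` and its `proc_root` parameter are hypothetical and exist only for this illustration.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return (pid, oom_score, name) tuples, highest oom_score first.

    Hypothetical helper for illustration only; not part of nohang.
    """
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as meminfo
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm"), errors="replace") as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited or its files are unreadable
        rows.append((int(entry), score, name))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Print the ten processes with the highest oom_score.
    for pid, score, name in read_oom_scores()[:10]:
        print("{:>7} {:>9} {}".format(pid, score, name))
```

The table printed by `nohang --tasks` contains many more columns than this sketch.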
Output example ``` Config: /etc/nohang/nohang.conf ################################################################################################################### # PID PPID badness oom_score oom_score_adj eUID S VmSize VmRSS VmSwap Name CGroup #------- ------- ------- --------- ------------- ---------- - ------ ----- ------ --------------- -------- # 336 1 1 1 0 0 S 85 25 0 systemd-journal /system.slice/systemd-journald.service # 383 1 0 0 -1000 0 S 46 5 0 systemd-udevd /system.slice/systemd-udevd.service # 526 2238 7 7 0 1000 S 840 96 0 kate /user.slice/user-1000.slice/session-7.scope # 650 1 3 3 0 1000 S 760 50 0 kate /user.slice/user-1000.slice/session-7.scope # 731 1 0 0 0 100 S 126 4 0 systemd-timesyn /system.slice/systemd-timesyncd.service # 756 1 0 0 0 105 S 181 3 0 rtkit-daemon /system.slice/rtkit-daemon.service # 759 1 0 0 0 0 S 277 7 0 accounts-daemon /system.slice/accounts-daemon.service # 761 1 0 0 0 0 S 244 3 0 rsyslogd /system.slice/rsyslog.service # 764 1 0 0 -900 108 S 45 5 0 dbus-daemon /system.slice/dbus.service # 805 1 0 0 0 0 S 46 5 0 systemd-logind /system.slice/systemd-logind.service # 806 1 0 0 0 0 S 35 3 0 irqbalance /system.slice/irqbalance.service # 813 1 0 0 0 0 S 29 3 0 cron /system.slice/cron.service # 814 1 11 11 0 0 S 176 160 0 memlockd /system.slice/memlockd.service # 815 1 0 0 -10 0 S 32 9 0 python3 /fork.slice/fork-bomb.slice/fork-bomb-killer.slice/fork-bomb-killer.service # 823 1 0 0 0 0 S 25 4 0 smartd /system.slice/smartd.service # 826 1 0 0 0 113 S 46 3 0 avahi-daemon /system.slice/avahi-daemon.service # 850 826 0 0 0 113 S 46 0 0 avahi-daemon /system.slice/avahi-daemon.service # 868 1 0 0 0 0 S 281 8 0 polkitd /system.slice/polkit.service # 903 1 1 1 0 0 S 4094 16 0 stunnel4 /system.slice/stunnel4.service # 940 1 0 0 -600 0 S 39 10 0 python3 /nohang.slice/nohang.service # 1014 1 0 0 0 13 S 22 2 0 obfs-local /system.slice/obfs-local.service # 1015 1 0 0 0 1000 S 36 4 0 ss-local /system.slice/ss-local.service # 1023 1 0 0 0 
116 S 33 2 0 dnscrypt-proxy /system.slice/dnscrypt-proxy.service # 1029 1 1 1 0 119 S 4236 16 0 privoxy /system.slice/privoxy.service # 1035 1 0 0 0 0 S 355 6 0 lightdm /system.slice/lightdm.service # 1066 1 0 0 0 0 S 45 7 0 wpa_supplicant /system.slice/wpa_supplicant.service # 1178 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty1.service # 1294 1 0 0 -1000 0 S 4 1 0 watchdog /system.slice/watchdog.service # 1632 1 1 1 0 1000 S 1391 22 0 pulseaudio /user.slice/user-1000.slice/session-2.scope # 1689 1632 0 0 0 1000 S 125 5 0 gconf-helper /user.slice/user-1000.slice/session-2.scope # 1711 1 0 0 0 0 S 367 8 0 udisksd /system.slice/udisks2.service # 1819 1 0 0 0 0 S 304 8 0 upowerd /system.slice/upower.service # 1879 1 0 0 0 1000 S 64 7 0 systemd /user.slice/user-1000.slice/user@1000.service/init.scope # 1880 1879 0 0 0 1000 S 229 2 0 (sd-pam) /user.slice/user-1000.slice/user@1000.service/init.scope # 1888 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty2.service # 1889 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty3.service # 1890 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty4.service # 1891 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty5.service # 1892 1 0 0 0 0 S 14 2 0 agetty /system.slice/system-getty.slice/getty@tty6.service # 1893 1035 14 14 0 0 R 623 208 0 Xorg /system.slice/lightdm.service # 1904 1 0 0 0 111 S 64 7 0 systemd /user.slice/user-111.slice/user@111.service/init.scope # 1905 1904 0 0 0 111 S 229 2 0 (sd-pam) /user.slice/user-111.slice/user@111.service/init.scope # 1916 1904 0 0 0 111 S 44 3 0 dbus-daemon /user.slice/user-111.slice/user@111.service/dbus.service # 1920 1 0 0 0 111 S 215 5 0 at-spi2-registr /user.slice/user-111.slice/session-c2.scope # 1922 1904 0 0 0 111 S 278 6 0 gvfsd /user.slice/user-111.slice/user@111.service/gvfs-daemon.service # 1935 1035 0 0 0 0 S 238 6 0 lightdm /user.slice/user-1000.slice/session-7.scope # 1942 1 0 0 0 
1000 S 210 9 0 gnome-keyring-d /user.slice/user-1000.slice/session-7.scope # 1944 1935 1 1 0 1000 S 411 21 0 mate-session /user.slice/user-1000.slice/session-7.scope # 1952 1879 0 0 0 1000 S 45 5 0 dbus-daemon /user.slice/user-1000.slice/user@1000.service/dbus.service # 1981 1944 0 0 0 1000 S 11 0 0 ssh-agent /user.slice/user-1000.slice/session-7.scope # 1984 1879 0 0 0 1000 S 278 6 0 gvfsd /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 1990 1879 0 0 0 1000 S 341 5 0 at-spi-bus-laun /user.slice/user-1000.slice/user@1000.service/at-spi-dbus-bus.service # 1995 1990 0 0 0 1000 S 44 4 0 dbus-daemon /user.slice/user-1000.slice/user@1000.service/at-spi-dbus-bus.service # 1997 1879 0 0 0 1000 S 215 5 0 at-spi2-registr /user.slice/user-1000.slice/user@1000.service/at-spi-dbus-bus.service # 2000 1879 0 0 0 1000 S 184 5 0 dconf-service /user.slice/user-1000.slice/user@1000.service/dbus.service # 2009 1944 2 2 0 1000 S 1308 35 0 mate-settings-d /user.slice/user-1000.slice/session-7.scope # 2013 1944 2 2 0 1000 S 436 32 0 marco /user.slice/user-1000.slice/session-7.scope # 2024 1944 4 4 0 1000 S 1258 55 0 caja /user.slice/user-1000.slice/session-7.scope # 2032 1 1 1 0 1000 S 333 18 0 msd-locate-poin /user.slice/user-1000.slice/session-7.scope # 2033 1879 0 0 0 1000 S 348 11 0 gvfs-udisks2-vo /user.slice/user-1000.slice/user@1000.service/gvfs-udisks2-volume-monitor.service # 2036 1944 1 1 0 1000 S 331 17 0 polkit-mate-aut /user.slice/user-1000.slice/session-7.scope # 2038 1944 5 5 0 1000 S 682 78 0 mate-panel /user.slice/user-1000.slice/session-7.scope # 2041 1944 2 2 0 1000 S 514 31 0 nm-applet /user.slice/user-1000.slice/session-7.scope # 2046 1944 1 1 0 1000 S 495 25 0 mate-power-mana /user.slice/user-1000.slice/session-7.scope # 2047 1944 2 2 0 1000 S 692 32 0 mate-volume-con /user.slice/user-1000.slice/session-7.scope # 2049 1944 3 3 0 1000 S 548 44 0 mate-screensave /user.slice/user-1000.slice/session-7.scope # 2059 1879 0 0 0 1000 S 263 5 0 
gvfs-goa-volume /user.slice/user-1000.slice/user@1000.service/gvfs-goa-volume-monitor.service # 2076 1879 0 0 0 1000 S 352 7 0 gvfsd-trash /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 2077 1879 0 0 0 1000 S 362 7 0 gvfs-afc-volume /user.slice/user-1000.slice/user@1000.service/gvfs-afc-volume-monitor.service # 2087 1879 0 0 0 1000 S 263 5 0 gvfs-mtp-volume /user.slice/user-1000.slice/user@1000.service/gvfs-mtp-volume-monitor.service # 2093 1879 0 0 0 1000 S 275 6 0 gvfs-gphoto2-vo /user.slice/user-1000.slice/user@1000.service/gvfs-gphoto2-volume-monitor.service # 2106 1879 3 3 0 1000 S 544 42 0 wnck-applet /user.slice/user-1000.slice/user@1000.service/dbus.service # 2108 1879 1 1 0 1000 S 396 21 0 notification-ar /user.slice/user-1000.slice/user@1000.service/dbus.service # 2112 1879 1 1 0 1000 S 499 25 0 mate-sensors-ap /user.slice/user-1000.slice/user@1000.service/dbus.service # 2113 1879 1 1 0 1000 S 390 21 0 mate-brightness /user.slice/user-1000.slice/user@1000.service/dbus.service # 2114 1879 1 1 0 1000 S 534 22 0 mate-multiload- /user.slice/user-1000.slice/user@1000.service/dbus.service # 2118 1879 2 2 0 1000 S 547 29 0 clock-applet /user.slice/user-1000.slice/user@1000.service/dbus.service # 2152 1879 1 1 0 1000 S 218 22 0 gvfsd-metadata /user.slice/user-1000.slice/user@1000.service/gvfs-metadata.service # 2206 1 3 3 0 110 S 106 48 0 tor /system.slice/system-tor.slice/tor@default.service # 2229 1 3 3 0 1000 S 999 42 0 kactivitymanage /user.slice/user-1000.slice/session-7.scope # 2238 1 0 0 0 1000 S 150 9 0 kdeinit5 /user.slice/user-1000.slice/session-7.scope # 2239 2238 3 3 0 1000 S 648 41 0 klauncher /user.slice/user-1000.slice/session-7.scope # 3959 1 1 1 0 0 S 615 18 0 NetworkManager /system.slice/NetworkManager.service # 3977 3959 0 0 0 0 S 20 4 0 dhclient /system.slice/NetworkManager.service # 5626 1879 0 0 0 1000 S 355 7 0 gvfsd-network /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 5637 1879 1 1 0 1000 S 623 
14 0 gvfsd-smb-brows /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 6296 1879 0 0 0 1000 S 435 7 0 gvfsd-dnssd /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 11129 1879 3 3 0 1000 S 597 42 0 kded5 /user.slice/user-1000.slice/user@1000.service/dbus.service # 11136 1879 2 2 0 1000 S 639 39 0 kuiserver5 /user.slice/user-1000.slice/user@1000.service/dbus.service # 11703 1879 3 3 0 1000 S 500 45 0 mate-system-mon /user.slice/user-1000.slice/user@1000.service/dbus.service # 16798 1879 0 0 0 1000 S 346 10 0 gvfsd-http /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service # 18133 1 3 3 0 1000 S 760 49 0 kate /user.slice/user-1000.slice/session-7.scope # 18144 2038 1 1 0 1000 S 301 23 0 lxterminal /user.slice/user-1000.slice/session-7.scope # 18147 18144 0 0 0 1000 S 14 2 0 gnome-pty-helpe /user.slice/user-1000.slice/session-7.scope # 18148 18144 1 1 0 1000 S 42 26 0 bash /user.slice/user-1000.slice/session-7.scope # 18242 2238 1 1 0 1000 S 194 14 0 file.so /user.slice/user-1000.slice/session-7.scope # 18246 18148 0 0 0 0 S 54 4 0 sudo /user.slice/user-1000.slice/session-7.scope # 19003 1 0 0 0 0 S 310 12 0 packagekitd /system.slice/packagekit.service # 26993 2038 91 91 0 1000 S 3935 1256 0 firefox-esr /user.slice/user-1000.slice/session-7.scope # 27275 26993 121 121 0 1000 S 3957 1684 0 Web Content /user.slice/user-1000.slice/session-7.scope # 30374 1 1 1 0 1000 S 167 14 0 VBoxXPCOMIPCD /user.slice/user-1000.slice/session-7.scope # 30380 1 2 2 0 1000 S 958 27 0 VBoxSVC /user.slice/user-1000.slice/session-7.scope # 30549 30380 86 86 0 1000 S 5332 1192 0 VirtualBox /user.slice/user-1000.slice/session-7.scope # 30875 1 1 1 0 1000 S 345 26 0 leafpad /user.slice/user-1000.slice/session-7.scope # 32689 1 7 7 0 1000 S 896 99 0 dolphin /user.slice/user-1000.slice/session-7.scope ################################################################################################################### Process with highest badness (found 
in 55 ms): PID: 27275, Name: Web Content, badness: 121 ```
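The last line of the table above names the process with the highest badness. As a minimal sketch of that selection step (not nohang's actual code), the snippet below parses table rows and picks the maximum. It assumes the column order seen in the rows above, where the first column is the PID and the third is the badness; `parse_task_row` and `highest_badness` are hypothetical helper names.

```python
def parse_task_row(row):
    """Parse one row of the process table into a small dict.

    Assumed layout: PID, PPID, badness, seven more single-token columns,
    then the process name (which may contain spaces) and the cgroup path.
    """
    fields = row.lstrip("# ").split(None, 10)
    # The cgroup path starts with "/", so split on the last " /".
    name, _, cgroup = fields[10].rpartition(" /")
    return {
        "pid": int(fields[0]),
        "badness": int(fields[2]),
        "name": name.strip(),
        "cgroup": "/" + cgroup,
    }


def highest_badness(rows):
    """Return the parsed row with the largest badness value."""
    return max((parse_task_row(r) for r in rows), key=lambda p: p["badness"])
```

Passing the `firefox-esr` and `Web Content` rows from the table to `highest_badness` selects PID 27275, matching the summary line.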
## Logging

To view the latest entries in the log (for systemd users):

```bash
$ sudo journalctl -eu nohang.service

#### or

$ sudo journalctl -eu nohang-desktop.service
```

You can also enable `separate_log` in the config to log to `/var/log/nohang/nohang.log`.

## oom-sort

`oom-sort` is an additional diagnostic tool that is installed with the `nohang` package. It sorts processes in descending order of their `oom_score` and also displays `oom_score_adj`, `Uid`, `Pid`, `Name`, `VmRSS`, `VmSwap` and optionally `cmdline`. Run `oom-sort --help` for more info.

Man page: [oom-sort.manpage.md](docs/oom-sort.manpage.md).

Usage:

```bash
$ oom-sort
```
Output example ``` oom_score oom_score_adj UID PID Name VmRSS VmSwap cmdline --------- ------------- ---- ----- --------------- ------- -------- ------- 23 0 0 964 Xorg 58 M 22 M /usr/libexec/Xorg -background none :0 vt01 -nolisten tcp -novtswitch -auth /var/run/lxdm/lxdm-:0.auth 13 0 1000 1365 pcmanfm 38 M 10 M pcmanfm --desktop --profile LXDE 10 0 1000 1408 dnfdragora-upda 9 M 27 M /usr/bin/python3 /bin/dnfdragora-updater 5 0 0 822 firewalld 0 M 19 M /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid 5 0 1000 1364 lxpanel 18 M 2 M lxpanel --profile LXDE 5 0 1000 1685 nm-applet 6 M 12 M nm-applet 5 0 1000 1862 lxterminal 16 M 2 M lxterminal 4 0 996 890 polkitd 8 M 6 M /usr/lib/polkit-1/polkitd --no-debug 4 0 1000 1703 pnmixer 6 M 11 M pnmixer 3 0 0 649 systemd-journal 10 M 1 M /usr/lib/systemd/systemd-journald 3 0 1000 1360 openbox 9 M 2 M openbox --config-file /home/user/.config/openbox/lxde-rc.xml 3 0 1000 1363 notification-da 3 M 10 M /usr/libexec/notification-daemon 2 0 1000 1744 clipit 5 M 3 M clipit 2 0 1000 2619 python3 9 M 0 M python3 /bin/oom-sort 1 0 0 809 rsyslogd 3 M 3 M /usr/sbin/rsyslogd -n 1 0 0 825 udisksd 2 M 2 M /usr/libexec/udisks2/udisksd 1 0 0 873 sssd_nss 4 M 1 M /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files 1 0 0 876 systemd-logind 2 M 2 M /usr/lib/systemd/systemd-logind 1 0 0 907 abrt-dump-journ 2 M 1 M /usr/bin/abrt-dump-journal-oops -fxtD 1 0 0 920 NetworkManager 3 M 2 M /usr/sbin/NetworkManager --no-daemon 1 0 1000 1115 systemd 4 M 1 M /usr/lib/systemd/systemd --user 1 0 1000 1118 (sd-pam) 0 M 5 M (sd-pam) 1 0 1000 1366 xscreensaver 5 M 0 M xscreensaver -no-splash 1 0 1000 1851 gvfsd-trash 3 M 1 M /usr/libexec/gvfsd-trash --spawner :1.6 /org/gtk/gvfs/exec_spaw/0 1 0 1000 1969 gvfsd-metadata 6 M 0 M /usr/libexec/gvfsd-metadata 1 0 1000 2262 bash 5 M 0 M bash 0 -1000 0 675 systemd-udevd 0 M 4 M /usr/lib/systemd/systemd-udevd 0 -1000 0 787 auditd 0 M 1 M /sbin/auditd 0 0 0 807 ModemManager 0 M 1 M /usr/sbin/ModemManager 0 0 0 
808 smartd 0 M 1 M /usr/sbin/smartd -n -q never 0 0 0 810 alsactl 0 M 0 M /usr/sbin/alsactl -s -n 19 -c -E ALSA_CONFIG_PATH=/etc/alsa/alsactl.conf --initfile=/lib/alsa/init/00main rdaemon 0 0 0 811 mcelog 0 M 0 M /usr/sbin/mcelog --ignorenodev --daemon --foreground 0 0 172 813 rtkit-daemon 0 M 0 M /usr/libexec/rtkit-daemon 0 0 0 814 VBoxService 0 M 1 M /usr/sbin/VBoxService -f 0 0 0 817 rngd 0 M 1 M /sbin/rngd -f 0 -900 81 818 dbus-daemon 3 M 0 M /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only 0 0 0 823 irqbalance 0 M 0 M /usr/sbin/irqbalance --foreground 0 0 70 824 avahi-daemon 0 M 0 M avahi-daemon: running [linux.local] 0 0 0 826 sssd 0 M 2 M /usr/sbin/sssd -i --logger=files 0 0 995 838 chronyd 1 M 0 M /usr/sbin/chronyd 0 0 0 849 gssproxy 0 M 1 M /usr/sbin/gssproxy -D 0 0 0 866 abrtd 0 M 2 M /usr/sbin/abrtd -d -s 0 0 70 870 avahi-daemon 0 M 0 M avahi-daemon: chroot helper 0 0 0 871 sssd_be 0 M 2 M /usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=files 0 0 0 875 accounts-daemon 0 M 1 M /usr/libexec/accounts-daemon 0 0 0 906 abrt-dump-journ 1 M 2 M /usr/bin/abrt-dump-journal-core -D -T -f -e 0 0 0 908 abrt-dump-journ 1 M 2 M /usr/bin/abrt-dump-journal-xorg -fxtD 0 0 0 950 crond 2 M 1 M /usr/sbin/crond -n 0 0 0 951 atd 0 M 0 M /usr/sbin/atd -f 0 0 0 953 lxdm-binary 0 M 0 M /usr/sbin/lxdm-binary 0 0 0 1060 dhclient 0 M 2 M /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-enp0s3.pid -lf /var/lib/NetworkManager/dhclient-939eab05-4796-3792-af24-9f76cf53ca7f-enp0s3.lease -cf /var/lib/NetworkManager/dhclient-enp0s3.conf enp0s3 0 0 0 1105 lxdm-session 0 M 1 M /usr/libexec/lxdm-session 0 0 1000 1123 pulseaudio 0 M 3 M /usr/bin/pulseaudio --daemonize=no 0 0 1000 1124 lxsession 1 M 2 M /usr/bin/lxsession -s LXDE -e LXDE 0 0 1000 1134 dbus-daemon 2 M 0 M /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only 0 0 1000 
1215 imsettings-daem 0 M 1 M /usr/libexec/imsettings-daemon 0 0 1000 1218 gvfsd 3 M 1 M /usr/libexec/gvfsd 0 0 1000 1223 gvfsd-fuse 0 M 1 M /usr/libexec/gvfsd-fuse /run/user/1000/gvfs -f -o big_writes 0 0 1000 1309 VBoxClient 0 M 0 M /usr/bin/VBoxClient --display 0 0 1000 1310 VBoxClient 0 M 0 M /usr/bin/VBoxClient --clipboard 0 0 1000 1311 VBoxClient 0 M 0 M /usr/bin/VBoxClient --draganddrop 0 0 1000 1312 VBoxClient 0 M 0 M /usr/bin/VBoxClient --display 0 0 1000 1313 VBoxClient 1 M 0 M /usr/bin/VBoxClient --clipboard 0 0 1000 1316 VBoxClient 0 M 0 M /usr/bin/VBoxClient --seamless 0 0 1000 1318 VBoxClient 0 M 0 M /usr/bin/VBoxClient --seamless 0 0 1000 1320 VBoxClient 0 M 0 M /usr/bin/VBoxClient --draganddrop 0 0 1000 1334 ssh-agent 0 M 0 M /usr/bin/ssh-agent /bin/sh -c exec -l bash -c "/usr/bin/startlxde" 0 0 1000 1362 lxpolkit 0 M 1 M lxpolkit 0 0 1000 1370 lxclipboard 0 M 1 M lxclipboard 0 0 1000 1373 ssh-agent 0 M 1 M /usr/bin/ssh-agent -s 0 0 1000 1485 agent 0 M 1 M /usr/libexec/geoclue-2.0/demos/agent 0 0 1000 1751 menu-cached 0 M 1 M /usr/libexec/menu-cache/menu-cached /run/user/1000/menu-cached-:0 0 0 1000 1780 at-spi-bus-laun 0 M 1 M /usr/libexec/at-spi-bus-launcher 0 0 1000 1786 dbus-daemon 1 M 0 M /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3 0 0 1000 1792 at-spi2-registr 1 M 1 M /usr/libexec/at-spi2-registryd --use-gnome-session 0 0 1000 1840 gvfs-udisks2-vo 0 M 2 M /usr/libexec/gvfs-udisks2-volume-monitor 0 0 1000 1863 gnome-pty-helpe 1 M 0 M gnome-pty-helper 0 0 1000 1864 bash 0 M 1 M bash 0 0 0 1899 sudo 0 M 1 M sudo -i 0 0 0 1901 bash 0 M 1 M -bash 0 0 0 1953 oomd_bin 0 M 0 M oomd_bin -f /sys/fs/cgroup/unified 0 -600 0 2562 python3 10 M 0 M python3 /usr/sbin/nohang --config /etc/nohang/nohang.conf ```
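The core idea behind the table above can be sketched in a few lines of Python. This is an illustration only, not the `oom-sort` implementation: it assumes the standard Linux procfs files `/proc/[pid]/oom_score` and `/proc/[pid]/comm`, and the function names are hypothetical.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return a list of (oom_score, pid, name) for readable processes."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited or is unreadable
        rows.append((score, int(entry), name))
    return rows


def sort_by_oom_score(rows):
    """Sort rows in descending order of oom_score."""
    return sorted(rows, key=lambda r: r[0], reverse=True)
```

On a Linux system, `sort_by_oom_score(read_oom_scores())[:10]` would give the ten processes the kernel OOM killer currently considers the most likely victims.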
Kthreads, zombies and Pid 1 will not be displayed.

## psi-top

psi-top is a script that prints the PSI metric values for every cgroup. It requires `Linux` >= 4.20 with `CONFIG_PSI=y`.

Man page: [psi-top.manpage.md](docs/psi-top.manpage.md).
Output example ``` $ psi-top cgroup2 mountpoint: /sys/fs/cgroup avg10 avg60 avg300 avg10 avg60 avg300 cgroup2 ----- ----- ------ ----- ----- ------ --------- some 0.00 0.21 1.56 | full 0.00 0.16 1.14 [SYSTEM_WIDE] some 0.00 0.21 1.56 | full 0.00 0.16 1.14 some 0.00 0.15 1.11 | full 0.00 0.12 0.89 /user.slice some 45.92 28.77 20.19 | full 45.05 28.17 19.56 /user.slice/user-1000.slice some 1.44 4.67 9.24 | full 1.44 4.65 9.20 /user.slice/user-1000.slice/user@1000.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/pulseaudio.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-daemon.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/dbus.socket some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-udisks2-volume-monitor.service some 0.25 1.97 4.05 | full 0.25 1.96 4.03 /user.slice/user-1000.slice/user@1000.service/xfce4-notifyd.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/init.scope some 0.00 0.66 1.99 | full 0.00 0.66 1.97 /user.slice/user-1000.slice/user@1000.service/gpg-agent.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-gphoto2-volume-monitor.service some 0.93 0.75 0.20 | full 0.93 0.75 0.20 /user.slice/user-1000.slice/user@1000.service/at-spi-dbus-bus.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-metadata.service some 0.00 2.44 6.78 | full 0.00 2.43 6.74 /user.slice/user-1000.slice/user@1000.service/dbus.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-mtp-volume-monitor.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /user.slice/user-1000.slice/user@1000.service/gvfs-afc-volume-monitor.service some 44.99 28.30 19.41 | full 44.10 27.70 18.79 
/user.slice/user-1000.slice/session-2.scope some 0.00 0.31 0.53 | full 0.00 0.31 0.53 /init.scope some 7.25 11.40 13.34 | full 7.23 11.32 13.24 /system.slice some 0.00 0.01 0.02 | full 0.00 0.01 0.02 /system.slice/systemd-udevd.service some 0.00 0.58 1.55 | full 0.00 0.58 1.55 /system.slice/cronie.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/sys-kernel-config.mount some 0.00 0.22 0.35 | full 0.00 0.22 0.35 /system.slice/polkit.service some 0.00 0.06 0.20 | full 0.00 0.06 0.20 /system.slice/rtkit-daemon.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/sys-kernel-debug.mount some 0.00 0.14 0.62 | full 0.00 0.14 0.62 /system.slice/accounts-daemon.service some 7.86 11.48 12.56 | full 7.84 11.42 12.51 /system.slice/lightdm.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/ModemManager.service some 0.00 1.82 5.47 | full 0.00 1.81 5.43 /system.slice/systemd-journald.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/dev-mqueue.mount some 0.00 1.64 4.07 | full 0.00 1.64 4.07 /system.slice/NetworkManager.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/tmp.mount some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/lvm2-lvmetad.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/dev-disk-by\x2duuid-5d7355c0\x2dc131\x2d40c5\x2d8541\x2d1e04ad7c8b8d.swap some 0.00 0.09 0.11 | full 0.00 0.09 0.11 /system.slice/upower.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/udisks2.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/dev-hugepages.mount some 0.00 0.27 0.49 | full 0.00 0.27 0.48 /system.slice/dbus.service some 0.00 0.00 0.00 | full 0.00 0.00 0.00 /system.slice/system-getty.slice some 0.00 0.12 0.20 | full 0.00 0.12 0.20 /system.slice/avahi-daemon.service some 0.00 0.18 0.30 | full 0.00 0.18 0.30 /system.slice/systemd-logind.service ```
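A rough sketch of how such a per-cgroup listing could be produced is shown below. It is not the `psi-top` source: it assumes that memory pressure is the metric of interest, that each cgroup2 directory exposes a `memory.pressure` file in the kernel's PSI text format (`some avg10=... avg60=... avg300=... total=...`), and the function names are hypothetical.

```python
import os


def parse_psi(text):
    """Parse PSI text into {'some': {'avg10': ...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # parts[0] is "some" or "full"; the rest are key=value pairs.
        result[parts[0]] = {
            key: float(value)
            for key, value in (p.split("=", 1) for p in parts[1:])
        }
    return result


def iter_cgroup_psi(mountpoint="/sys/fs/cgroup", filename="memory.pressure"):
    """Yield (cgroup_path, parsed_psi) for every cgroup that has the file."""
    for dirpath, _dirnames, filenames in os.walk(mountpoint):
        if filename not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, filename)) as f:
                psi = parse_psi(f.read())
        except OSError:
            continue  # the cgroup vanished or PSI is unavailable
        rel = os.path.relpath(dirpath, mountpoint)
        yield ("/" if rel == "." else "/" + rel), psi
```

Each yielded pair corresponds to one row of the table above: the `some` and `full` averages followed by the cgroup path.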
## psi2log

psi2log is a CLI tool that can check and log PSI metrics from a specified target. It requires `Linux` >= 4.20 with `CONFIG_PSI=y`.

Man page: [psi2log.manpage.md](docs/psi2log.manpage.md).
Output example ``` $ psi2log Starting psi2log target: SYSTEM_WIDE period: 2 ------------------------------------------------------------------------------------------------------------------ some cpu pressure || some memory pressure | full memory pressure || some io pressure | full io pressure ---------------------||----------------------|----------------------||----------------------|--------------------- avg10 avg60 avg300 || avg10 avg60 avg300 | avg10 avg60 avg300 || avg10 avg60 avg300 | avg10 avg60 avg300 ------ ------ ------ || ------ ------ ------ | ------ ------ ------ || ------ ------ ------ | ------ ------ ------ 0.13 0.26 0.08 || 3.36 10.31 3.47 | 2.68 7.69 2.56 || 20.24 26.90 8.60 | 18.80 23.16 7.33 0.11 0.25 0.08 || 2.75 9.97 3.45 | 2.20 7.44 2.54 || 18.38 26.34 8.61 | 17.21 22.73 7.35 0.09 0.25 0.07 || 2.25 9.65 3.43 | 1.80 7.20 2.52 || 15.05 25.48 8.55 | 14.09 21.99 7.30 0.07 0.24 0.07 || 1.84 9.33 3.40 | 1.47 6.96 2.51 || 13.05 24.78 8.52 | 12.26 21.40 7.28 ^C Peak values: avg10 avg60 avg300 ----------- ------ ------ ------ some cpu 0.13 0.26 0.08 ----------- ------ ------ ------ some memory 3.36 10.31 3.47 full memory 2.68 7.69 2.56 ----------- ------ ------ ------ some io 20.24 26.90 8.61 full io 18.80 23.16 7.35 $ psi2log -t /user.slice -l pm.log Starting psi2log target: /user.slice period: 2 log file: pm.log cgroup2 mountpoint: /sys/fs/cgroup ------------------------------------------------------------------------------------------------------------------ some cpu pressure || some memory pressure | full memory pressure || some io pressure | full io pressure ---------------------||----------------------|----------------------||----------------------|--------------------- avg10 avg60 avg300 || avg10 avg60 avg300 | avg10 avg60 avg300 || avg10 avg60 avg300 | avg10 avg60 avg300 ------ ------ ------ || ------ ------ ------ | ------ ------ ------ || ------ ------ ------ | ------ ------ ------ 28.32 11.97 3.03 || 0.00 1.05 1.65 | 0.00 0.85 1.33 || 0.55 
7.79 7.21 | 0.54 7.52 6.80 29.53 12.72 3.25 || 0.00 1.01 1.64 | 0.00 0.82 1.32 || 0.81 7.60 7.17 | 0.44 7.27 6.76 29.80 13.32 3.44 || 0.00 0.98 1.63 | 0.00 0.79 1.31 || 0.66 7.35 7.12 | 0.36 7.03 6.71 29.83 13.86 3.62 || 0.00 0.95 1.62 | 0.00 0.77 1.30 || 0.54 7.11 7.08 | 0.30 6.80 6.66 29.86 14.39 3.80 || 0.00 0.91 1.60 | 0.00 0.74 1.29 || 0.44 6.88 7.03 | 0.24 6.58 6.62 30.07 14.94 3.99 || 0.00 0.88 1.59 | 0.00 0.72 1.28 || 0.36 6.65 6.98 | 0.20 6.36 6.57 ^C Peak values: avg10 avg60 avg300 ----------- ------ ------ ------ some cpu 30.07 14.94 3.99 ----------- ------ ------ ------ some memory 0.00 1.05 1.65 full memory 0.00 0.85 1.33 ----------- ------ ------ ------ some io 0.81 7.79 7.21 full io 0.54 7.52 6.80 ```
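The "Peak values" summary printed on exit can be understood as a running per-metric maximum over all samples. The sketch below illustrates that bookkeeping under the assumption that each sample is a flat mapping from a metric label to a value; it is not taken from the `psi2log` source, and `update_peaks` is a hypothetical name.

```python
def update_peaks(peaks, sample):
    """Update the running maximum of every metric found in sample.

    peaks and sample map a metric label (for example "some io avg300")
    to a float. peaks is modified in place and also returned.
    """
    for key, value in sample.items():
        if value > peaks.get(key, float("-inf")):
            peaks[key] = value
    return peaks


# Example with the "some io" avg300 column of the first run above.
peaks = {}
for value in (8.60, 8.61, 8.55, 8.52):
    update_peaks(peaks, {"some io avg300": value})
```

After the loop, `peaks["some io avg300"]` holds 8.61, the same value reported in the first "Peak values" table.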
## Contribution

- Use cases, feature requests and any questions are [welcome](https://github.com/hakavlad/nohang/issues).
- Pull requests to the `dev` branch are welcome.

## Documentation

- [nohang.manpage.md](docs/nohang.manpage.md)
- [oom-sort.manpage.md](docs/oom-sort.manpage.md)
- [psi2log.manpage.md](docs/psi2log.manpage.md)
- [psi-top.manpage.md](docs/psi-top.manpage.md)
- [FAQ.ru.md](docs/FAQ.ru.md)
- [CHANGELOG.md](CHANGELOG.md)

## License

This project is licensed under the terms of the [MIT license](LICENSE).
nohang-0.2.0/conf/000077500000000000000000000000001377337215500137255ustar00rootroot00000000000000nohang-0.2.0/conf/logrotate.d/000077500000000000000000000000001377337215500161475ustar00rootroot00000000000000nohang-0.2.0/conf/logrotate.d/nohang000066400000000000000000000001621377337215500173430ustar00rootroot00000000000000/var/log/nohang/*.log { missingok copytruncate notifempty size 1M rotate 5 compress delaycompress } nohang-0.2.0/conf/nohang/000077500000000000000000000000001377337215500151775ustar00rootroot00000000000000nohang-0.2.0/conf/nohang/nohang-desktop.conf.in000066400000000000000000000377161377337215500214120ustar00rootroot00000000000000## This is the configuration file of the nohang daemon. ## The configuration includes the following sections: ## 0. Check kernel messages for OOM events ## 1. Common zram settings ## 2. Common PSI settings ## 3. Poll rate ## 4. Warnings and notifications ## 5. Soft (SIGTERM) threshold ## 6. Hard (SIGKILL) threshold ## 7. Customize victim selection: adjusting badness of processes ## 8. Customize soft corrective actions ## 9. Misc settings ## 10. Verbosity, debug, logging ## WARNING! ## - Lines starting with #, tabs and whitespace characters are comments. ## - Lines starting with @ contain optional parameters that may be repeated. ## - All values are case sensitive. ## - nohang doesn't forbid you to shoot yourself in the foot. Be careful! ## - Restart the daemon after editing the file to apply the new settings. 
## - You can find the file with default values here: :TARGET_DATADIR:/nohang/nohang.conf ## To find config keys descriptions see man(8) nohang ############################################################################### ## 0. Check kernel messages for OOM events # @check_kmsg ## Type: boolean ## Comment/uncomment to disable/enable checking kmsg for OOM events # @debug_kmsg ## Type: boolean ## Comment/uncomment to disable/enable debug checking kmsg ############################################################################### 1. Common zram settings Key: zram_checking_enabled Description: Type: boolean Valid values: True | False Default value: False zram_checking_enabled = False ############################################################################### 2. Common PSI settings Key: psi_checking_enabled Description: Type: boolean Valid values: True | False Default value: True psi_checking_enabled = True Key: psi_path Description: Type: string Valid values: any string Default value: /proc/pressure/memory psi_path = /proc/pressure/memory Key: psi_metrics Description: Type: string Valid values: some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300 Default value: full_avg10 psi_metrics = full_avg10 Key: psi_excess_duration Description: Type: float Valid values: >= 0 Default value: 30 psi_excess_duration = 30 Key: psi_post_action_delay Description: Type: float Valid values: >= 10 Default value: 15 psi_post_action_delay = 15 ############################################################################### 3. 
Poll rate Key: fill_rate_mem Description: Type: float Valid values: >= 100 Default value: 6000 fill_rate_mem = 6000 Key: fill_rate_swap Description: Type: float Valid values: >= 100 Default value: 2000 fill_rate_swap = 2000 Key: fill_rate_zram Description: Type: float Valid values: >= 100 Default value: 4000 fill_rate_zram = 4000 Key: max_sleep Description: Type: float Valid values: >= 0.01 and >= min_sleep Default value: 3 max_sleep = 3 Key: min_sleep Description: Type: float Valid values: >= 0.01 and <= max_sleep Default value: 0.1 min_sleep = 0.1 ############################################################################### 4. Warnings and notifications 4.1. GUI notifications after corrective actions Key: post_action_gui_notifications Description: Type: boolean Valid values: True | False Default value: True post_action_gui_notifications = True Key: hide_corrective_action_type Description: Type: boolean Valid values: True | False Default value: False hide_corrective_action_type = False 4.2. 
Low memory warnings Key: low_memory_warnings_enabled Description: Type: boolean Valid values: True | False Default value: True low_memory_warnings_enabled = True Key: warning_exe Description: Type: string Valid values: any string Default value: (empty string) warning_exe = Key: warning_threshold_min_mem Description: Type: float (with % or M) Valid values: from the range [0; 100] % Default value: 20 % warning_threshold_min_mem = 20 % Key: warning_threshold_min_swap Description: Type: float (with % or M) Valid values: [0; 100] % or >= 0 M Default value: 20 % warning_threshold_min_swap = 25 % Key: warning_threshold_max_zram Description: Type: float (with % or M) Valid values: from the range [0; 100] % Default value: 45 % warning_threshold_max_zram = 45 % Key: warning_threshold_max_psi Description: Type: float Valid values: from the range [0; 100] Default value: 10 warning_threshold_max_psi = 10 Key: min_post_warning_delay Description: Type: float Valid values: >= 1 Default value: 60 min_post_warning_delay = 60 Key: env_cache_time Description: Type: float Valid values: >= 0 Default value: 300 env_cache_time = 300 ############################################################################### 5. 
Soft threshold (thresholds for sending the SIGTERM signal or implementing other soft corrective action) Key: soft_threshold_min_mem Description: Type: float (with % or M) Valid values: from the range [0; 50] % Default value: 5 % soft_threshold_min_mem = 5 % Key: soft_threshold_min_swap Description: Type: float (with % or M) Valid values: [0; 100] % or >= 0 M Default value: 10 % soft_threshold_min_swap = 10 % Key: soft_threshold_max_zram Description: Type: float (with % or M) Valid values: from the range [10; 90] % Default value: 55 % soft_threshold_max_zram = 55 % Key: soft_threshold_max_psi Description: Type: float Valid values: from the range [5; 100] Default value: 40 soft_threshold_max_psi = 40 ############################################################################### 6. Hard threshold (thresholds for sending the SIGKILL signal) Key: hard_threshold_min_mem Description: Type: float (with % or M) Valid values: from the range [0; 50] % Default value: 2 % hard_threshold_min_mem = 2 % Key: hard_threshold_min_swap Description: Type: float (with % or M) Valid values: [0; 100] % or >= 0 M Default value: 4 % hard_threshold_min_swap = 4 % Key: hard_threshold_max_zram Description: Type: float (with % or M) Valid values: from the range [10; 90] % Default value: 60 % hard_threshold_max_zram = 60 % Key: hard_threshold_max_psi Description: Type: float Valid values: from the range [5; 100] Default value: 90 hard_threshold_max_psi = 90 ############################################################################### 7. Customize victim selection: adjusting badness of processes 7.1. Ignore positive oom_score_adj Key: ignore_positive_oom_score_adj Description: Type: boolean Valid values: True | False Default value: False ignore_positive_oom_score_adj = False 7.2.1. 
Matching process names with RE patterns change their badness Syntax: @BADNESS_ADJ_RE_NAME badness_adj /// RE_pattern New badness value will be += badness_adj It is possible to compare multiple patterns with different badness_adj values. Example: @BADNESS_ADJ_RE_NAME -500 /// ^sshd$ Prefer terminating Firefox tabs instead of terminating the entire browser. (In Chromium and Electron-based apps child processes get oom_score_adj=300 by default.) @BADNESS_ADJ_RE_NAME 200 /// ^(Web Content|Privileged Cont|file:// Content)$ @BADNESS_ADJ_RE_NAME -200 /// ^(dnf|yum|packagekitd)$ 7.2.2. Matching CGroup_v1-line with RE patterns @BADNESS_ADJ_RE_CGROUP_V1 -50 /// ^/system\.slice/ @BADNESS_ADJ_RE_CGROUP_V1 50 /// /foo\.service$ @BADNESS_ADJ_RE_CGROUP_V1 -50 /// ^/user\.slice/ 7.2.3. Matching CGroup_v2-line with RE patterns @BADNESS_ADJ_RE_CGROUP_V2 100 /// ^/workload 7.2.4. Matching eUIDs with RE patterns @BADNESS_ADJ_RE_UID -100 /// ^0$ 7.2.5. Matching /proc/[pid]/exe realpath with RE patterns Example: @BADNESS_ADJ_RE_REALPATH 20 /// ^/usr/bin/foo$ Protect X. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/libexec/Xorg|/usr/lib/xorg/Xorg|/usr/lib/Xorg|/usr/bin/X|/usr/bin/Xorg|/usr/bin/Xwayland|/usr/bin/weston|/usr/bin/sway)$ Protect GNOME. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/gnome-shell|/usr/bin/metacity|/usr/bin/mutter|/usr/lib/gnome-session/gnome-session-binary|/usr/libexec/gnome-session-binary|/usr/libexec/gnome-session-ctl)$ Protect KDE Plasma. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/plasma-desktop|/usr/bin/plasmashell|/usr/bin/plasma_session|/usr/bin/kwin|/usr/bin/kwin_x11|/usr/bin/kwin_wayland)$ @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/startplasma-wayland|/usr/lib/x86_64-linux-gnu/libexec/startplasma-waylandsession|/usr/bin/ksmserver)$ Protect Cinnamon. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/cinnamon|/usr/bin/muffin|/usr/bin/cinnamon-session|/usr/bin/cinnamon-launcher)$ Protect Xfce. 
@BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/xfwm4|/usr/bin/xfce4-session|/usr/bin/xfce4-panel|/usr/bin/xfdesktop)$ Protect Mate. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/marco|/usr/bin/mate-session|/usr/bin/caja|/usr/bin/mate-panel)$ Protect LXQt. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/lxqt-panel|/usr/bin/pcmanfm-qt|/usr/bin/lxqt-session)$ Protect Budgie Desktop. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/budgie-wm|/usr/bin/budgie-panel)$ Protect other. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/compiz|/usr/bin/openbox|/usr/bin/fluxbox|/usr/bin/awesome|/usr/bin/icewm|/usr/bin/enlightenment|/usr/bin/gala|/usr/bin/wingpanel|/usr/bin/i3)$ Protect display managers. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/sbin/gdm|/usr/sbin/gdm3|/usr/sbin/sddm|/usr/bin/sddm|/usr/lib/x86_64-linux-gnu/sddm/sddm-helper|/usr/bin/slim|/usr/sbin/lightdm|/usr/libexec/gdm-session-worker|/usr/libexec/gdm-wayland-session|/usr/lib/gdm3/gdm-wayland-session|/usr/lib/gdm3/gdm-session-worker)$ @BADNESS_ADJ_RE_REALPATH -200 /// ^/usr/lib/gdm3/ Protect systemd-logind. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/lib/systemd/systemd-logind|/usr/lib/systemd/systemd-logind)$ Protect `systemd --user`. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/lib/systemd/systemd|/usr/lib/systemd/systemd)$ Protect dbus. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/dbus-daemon|/usr/bin/dbus-run-session|/usr/bin/dbus-broker-launcher|/usr/bin/dbus-broker)$ Protect package managers and distro installers. @BADNESS_ADJ_RE_REALPATH -200 /// ^(/usr/bin/calamares|/usr/bin/dpkg|/usr/bin/pacman|/usr/bin/yay|/usr/bin/pamac|/usr/bin/pamac-daemon|/usr/bin/pamac-manager)$ Prefer stress. @BADNESS_ADJ_RE_REALPATH 900 /// ^(/usr/bin/stress|/usr/bin/stress-ng)$ 7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns @BADNESS_ADJ_RE_CWD 200 /// ^/home/ 7.2.7. Matching cmdlines with RE patterns WARNING: using this option can greatly slow down the search for a victim in conditions of heavily swapping. 
Prefer Chromium tabs and Electron-based apps @BADNESS_ADJ_RE_CMDLINE 200 /// --type=renderer Prefer Firefox tabs (Web Content and WebExtensions) @BADNESS_ADJ_RE_CMDLINE 100 /// -appomni @BADNESS_ADJ_RE_CMDLINE -200 /// ^/usr/lib/virtualbox 7.2.8. Matching environ with RE patterns WARNING: using this option can greatly slow down the search for a victim in conditions of heavy swapping. @BADNESS_ADJ_RE_ENVIRON 100 /// USER=user Note that you can also control badness via systemd units with OOMScoreAdjust, see www.freedesktop.org/software/systemd/man/systemd.exec.html#OOMScoreAdjust= ############################################################################### 8. Customize soft corrective actions Run the command instead of sending a signal as the soft corrective action if the victim's name or cgroup matches the regular expression. Syntax: KEY REGEXP SEPARATOR COMMAND @SOFT_ACTION_RE_NAME ^foo$ /// kill -USR1 $PID @SOFT_ACTION_RE_CGROUP_V1 ^/system\.slice/ /// systemctl restart $SERVICE @SOFT_ACTION_RE_CGROUP_V2 /foo\.service$ /// systemctl restart $SERVICE $PID will be replaced by process PID. $NAME will be replaced by process name. $SERVICE will be replaced by .service if it exists (otherwise it will be replaced by an empty line) ############################################################################### 9. 
Misc settings Key: max_soft_exit_time Description: Type: float Valid values: >= 0.1 Default value: 10 max_soft_exit_time = 10 Key: post_kill_exe Description: Type: string Valid values: any string Default value: (empty string) post_kill_exe = Key: min_badness Description: Type: integer Valid values: >= 1 Default value: 1 min_badness = 1 Key: post_soft_action_delay Description: Type: float Valid values: >= 0.1 Default value: 3 post_soft_action_delay = 3 Key: post_zombie_delay Description: Type: float Valid values: >= 0 Default value: 0.1 post_zombie_delay = 0.1 Key: victim_cache_time Description: Type: float Valid values: >= 0 Default value: 10 victim_cache_time = 10 Key: exe_timeout Description: Type: float Valid values: >= 0.1 Default value: 20 exe_timeout = 20 ############################################################################### 10. Verbosity, debug, logging Key: print_config_at_startup Description: Type: boolean Valid values: True | False Default value: False print_config_at_startup = False Key: print_mem_check_results Description: Type: boolean Valid values: True | False Default value: False print_mem_check_results = False Key: min_mem_report_interval Description: Type: float Valid values: >= 0 Default value: 60 min_mem_report_interval = 60 Key: print_proc_table Description: Type: boolean Valid values: True | False Default value: False print_proc_table = False Key: extra_table_info Description: WARNING: using "cmdline" or "environ" keys can greatly slow down the search for a victim in conditions of heavy swapping. 
Type: string Valid values: None, cgroup_v1, cgroup_v2, realpath, cwd, cmdline, environ Default value: None extra_table_info = None Key: print_victim_status Description: Type: boolean Valid values: True | False Default value: True print_victim_status = True Key: print_victim_cmdline Description: Type: boolean Valid values: True | False Default value: False print_victim_cmdline = False Key: max_victim_ancestry_depth Description: Type: integer Valid values: >= 1 Default value: 3 max_victim_ancestry_depth = 3 Key: print_statistics Description: Type: boolean Valid values: True | False Default value: True print_statistics = True Key: debug_psi Description: Type: boolean Valid values: True | False Default value: False debug_psi = False Key: debug_gui_notifications Description: Type: boolean Valid values: True | False Default value: False debug_gui_notifications = False Key: debug_sleep Description: Type: boolean Valid values: True | False Default value: False debug_sleep = False Key: debug_threading Description: Type: boolean Valid values: True | False Default value: False debug_threading = False Key: separate_log Description: Type: boolean Valid values: True | False Default value: False separate_log = False ############################################################################### Use cases, feature requests and any questions are welcome: https://github.com/hakavlad/nohang/issues nohang-0.2.0/conf/nohang/nohang.conf.in000066400000000000000000000254651377337215500177410ustar00rootroot00000000000000## This is the configuration file of the nohang daemon. ## The configuration includes the following sections: ## 0. Check kernel messages for OOM events ## 1. Common zram settings ## 2. Common PSI settings ## 3. Poll rate ## 4. Warnings and notifications ## 5. Soft (SIGTERM) threshold ## 6. Hard (SIGKILL) threshold ## 7. Customize victim selection: adjusting badness of processes ## 8. Customize soft corrective actions ## 9. Misc settings ## 10. 
Verbosity, debug, logging ## WARNING! ## - Lines starting with #, tabs and whitespace characters are comments. ## - Lines starting with @ contain optional parameters that may be repeated. ## - All values are case sensitive. ## - nohang doesn't forbid you to shoot yourself in the foot. Be careful! ## - Restart the daemon after editing the file to apply the new settings. ## - You can find the file with default values here: :TARGET_DATADIR:/nohang/nohang.conf ## To find config keys descriptions see man(8) nohang ############################################################################### ## 0. Check kernel messages for OOM events # @check_kmsg ## Type: boolean ## Comment/uncomment to disable/enable checking kmsg for OOM events # @debug_kmsg ## Type: boolean ## Comment/uncomment to disable/enable debug checking kmsg ############################################################################### ## 1. Common zram settings zram_checking_enabled = False ## Type: boolean, valid values: True | False ## Default value: False ############################################################################### ## 2. Common PSI settings psi_checking_enabled = False ## Type: boolean, valid values: True | False ## Default value: False psi_path = /proc/pressure/memory ## Type: string; valid values: any string ## Default value: /proc/pressure/memory psi_metrics = full_avg10 ## Type: string; valid values: some_avg10, some_avg60, some_avg300, ## full_avg10, full_avg60, full_avg300 ## Default value: full_avg10 psi_excess_duration = 30 ## Type: float; valid values: >= 0 ## Default value: 30 psi_post_action_delay = 15 ## Type: float; valid values: >= 10 ## Default value: 15 ############################################################################### ## 3. 
Poll rate fill_rate_mem = 6000 ## Type: float; valid values: >= 100 ## Default value: 6000 fill_rate_swap = 2000 ## Type: float; valid values: >= 100 ## Default value: 2000 fill_rate_zram = 4000 ## Type: float; valid values: >= 100 ## Default value: 4000 max_sleep = 3 ## Type: float; valid values: >= 0.01 and >= min_sleep ## Default value: 3 min_sleep = 0.1 ## Type: float; valid values: >= 0.01 and <= max_sleep ## Default value: 0.1 ############################################################################### ## 4. Warnings and notifications ## 4.1. GUI notifications after corrective actions post_action_gui_notifications = False ## Type: boolean; valid values: True | False ## Default value: False hide_corrective_action_type = False ## Type: boolean; valid values: True | False ## Default value: False ## 4.2. Low memory warnings low_memory_warnings_enabled = False ## Type: boolean; valid values: True | False ## Default value: False warning_exe = ## Type: string; valid values: any string ## Default value: (empty string) warning_threshold_min_mem = 20 % ## Type: float (with % or M); valid values: from the range [0; 100] % ## Default value: 20 % warning_threshold_min_swap = 25 % ## Type: float (with % or M); valid values: [0; 100] % or >= 0 M ## Default value: 20 % warning_threshold_max_zram = 45 % ## Type: float (with % or M); valid values: from the range [0; 100] % ## Default value: 45 % warning_threshold_max_psi = 10 ## Type: float; valid values: from the range [0; 100] ## Default value: 10 min_post_warning_delay = 60 ## Type: float; valid values: >= 1 ## Default value: 60 env_cache_time = 300 ## Type: float; valid values: >= 0 ## Default value: 300 ############################################################################### ## 5. 
Soft threshold (thresholds for sending the SIGTERM signal or ## implementing other soft corrective action) soft_threshold_min_mem = 5 % ## Type: float (with % or M); valid values: from the range [0; 50] % ## Default value: 5 % soft_threshold_min_swap = 10 % ## Type: float (with % or M); valid values: [0; 100] % or >= 0 M ## Default value: 10 % soft_threshold_max_zram = 55 % ## Type: float (with % or M); valid values: from the range [10; 90] % ## Default value: 55 % soft_threshold_max_psi = 40 ## Type: float; valid values: from the range [5; 100] ## Default value: 40 ############################################################################### ## 6. Hard threshold (thresholds for sending the SIGKILL signal) hard_threshold_min_mem = 2 % ## Type: float (with % or M); valid values: from the range [0; 50] % ## Default value: 2 % hard_threshold_min_swap = 4 % ## Type: float (with % or M); valid values: [0; 100] % or >= 0 M ## Default value: 4 % hard_threshold_max_zram = 60 % ## Type: float (with % or M); valid values: from the range [10; 90] % ## Default value: 60 % hard_threshold_max_psi = 90 ## Type: float; valid values: from the range [5; 100] ## Default value: 90 ############################################################################### ## 7. Customize victim selection: adjusting badness of processes ## 7.1. Ignore positive oom_score_adj ignore_positive_oom_score_adj = False ## Type: boolean; valid values: True | False ## Default value: False ## 7.2. Matching process properties with regular expressions to change their ## badness. ## Syntax: ## @BADNESS_ADJ_RE_PROPERTY badness_adj /// RE_pattern ## New badness value will be added to process's badness_adj ## It is possible to compare multiple patterns ## with different badness_adj values. ## 7.2.1. Matching process names with RE patterns to change their badness ## Example: # @BADNESS_ADJ_RE_NAME 200 /// ^Web Content$ ## 7.2.2. 
Matching CGroup_v1-line with RE patterns # @BADNESS_ADJ_RE_CGROUP_V1 50 /// /foo\.service$ # @BADNESS_ADJ_RE_CGROUP_V1 -50 /// ^/user\.slice/ ## 7.2.3. Matching CGroup_v2-line with RE patterns # @BADNESS_ADJ_RE_CGROUP_V2 100 /// ^/workload ## 7.2.4. Matching eUIDs with RE patterns # @BADNESS_ADJ_RE_UID -100 /// ^0$ ## 7.2.5. Matching /proc/[pid]/exe realpath with RE patterns ## Example: # @BADNESS_ADJ_RE_REALPATH 900 /// ^(/usr/bin/stress|/usr/bin/stress-ng)$ ## 7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns # @BADNESS_ADJ_RE_CWD 200 /// ^/home/ ## 7.2.7. Matching cmdlines with RE patterns ## WARNING: using this option can greatly slow down the search for a victim ## in conditions of intense swapping. ## Prefer Chromium tabs and Electron-based apps # @BADNESS_ADJ_RE_CMDLINE 200 /// --type=renderer ## Prefer Firefox tabs (Web Content and WebExtensions) # @BADNESS_ADJ_RE_CMDLINE 100 /// -appomni ## Avoid Virtualbox processes # @BADNESS_ADJ_RE_CMDLINE -200 /// ^/usr/lib/virtualbox ## 7.2.8. Matching environ with RE patterns ## WARNING: using this option can greatly slow down the search for a victim ## in conditions of heavy swapping. # @BADNESS_ADJ_RE_ENVIRON 100 /// USER=user # Note that you can control badness also via systemd units via # OOMScoreAdjust, see # www.freedesktop.org/software/systemd/man/systemd.exec.html#OOMScoreAdjust= ############################################################################### ## 8. Customize soft corrective actions ## Run the command instead of sending a signal at soft corrective action ## if the victim's name or cgroup matches the regular expression. ## Syntax: ## KEY REGEXP SEPARATOR COMMAND # @SOFT_ACTION_RE_NAME ^foo$ /// kill -USR1 $PID # @SOFT_ACTION_RE_CGROUP_V1 ^/system\.slice/ /// systemctl restart $SERVICE # @SOFT_ACTION_RE_CGROUP_V2 /foo\.service$ /// systemctl restart $SERVICE ## $PID will be replaced by process PID. ## $NAME will be replaced by process name.
## $SERVICE will be replaced by .service if it exists (otherwise it will be ## replaced by an empty line) ############################################################################### ## 9. Misc settings max_soft_exit_time = 10 ## Type: float; valid values: >= 0.1 ## Default value: 10 post_kill_exe = ## Type: string; valid values: any string ## Default value: (empty string) min_badness = 1 ## Type: integer; valid values: >= 1 ## Default value: 1 ## nohang will do nothing if the badness of all processes is below min_badness ## (actually it will spam to stdout/log) post_soft_action_delay = 3 ## Type: float; valid values: >= 0.1 ## Default value: 3 post_zombie_delay = 0.1 ## Type: float; valid values: >= 0 ## Default value: 0.1 victim_cache_time = 10 ## Type: float; valid values: >= 0 ## Default value: 10 exe_timeout = 20 ## Type: float; valid values: >= 0.1 ## Default value: 20 ############################################################################### ## 10. Verbosity, debug, logging print_config_at_startup = False ## Type: boolean; valid values: True | False ## Default value: False print_mem_check_results = False ## Type: boolean; valid values: True | False ## Default value: False min_mem_report_interval = 60 ## Type: float; valid values: >= 0 ## Default value: 60 print_proc_table = False ## Type: boolean; valid values: True | False ## Default value: False extra_table_info = None ## Type: string; valid values: None, cgroup_v1, cgroup_v2, realpath, cwd, ## cmdline, environ ## Default value: None ## WARNING: using "cmdline" or "environ" keys can greatly slow down the search ## for a victim in conditions of heavy swapping.
print_victim_status = True ## Type: boolean; valid values: True | False ## Default value: True print_victim_cmdline = False ## Type: boolean; valid values: True | False ## Default value: False max_victim_ancestry_depth = 3 ## Type: integer; valid values: >= 1 ## Default value: 3 print_statistics = True ## Type: boolean; valid values: True | False ## Default value: True debug_psi = False ## Type: boolean; valid values: True | False ## Default value: False debug_gui_notifications = False ## Type: boolean; valid values: True | False ## Default value: False debug_sleep = False ## Type: boolean; valid values: True | False ## Default value: False debug_threading = False ## Type: boolean; valid values: True | False ## Default value: False separate_log = False ## Type: boolean; valid values: True | False ## Default value: False ############################################################################### ## Use cases, feature requests and any questions are welcome: ## https://github.com/hakavlad/nohang/issues ## nohang-0.2.0/conf/nohang/test.conf000066400000000000000000000225371377337215500170360ustar00rootroot00000000000000## This is the configuration file of the nohang daemon. ## The configuration includes the following sections: ## 0. Check kernel messages for OOM events ## 1. Common zram settings ## 2. Common PSI settings ## 3. Poll rate ## 4. Warnings and notifications ## 5. Soft (SIGTERM) threshold ## 6. Hard (SIGKILL) threshold ## 7. Customize victim selection: adjusting badness of processes ## 8. Customize soft corrective actions ## 9. Misc settings ## 10. Verbosity, debug, logging ## WARNING! ## - Lines starting with #, tabs and whitespace characters are comments. ## - Lines starting with @ contain optional parameters that may be repeated. ## - All values are case sensitive. ## - nohang doesn't forbid you to shoot yourself in the foot. Be careful! ## - Restart the daemon after editing the file to apply the new settings. 
## - You can find the file with default values here: :TARGET_DATADIR:/nohang/nohang.conf ## To find config keys descriptions see man(8) nohang ############################################################################### ## 0. Check kernel messages for OOM events # @check_kmsg ## Type: boolean ## Comment/uncomment to disable/enable checking kmsg for OOM events # @debug_kmsg ## Type: boolean ## Comment/uncomment to disable/enable debug checking kmsg ############################################################################### 1. Common zram settings Key: zram_checking_enabled Description: Type: boolean Valid values: True and False Default value: False zram_checking_enabled = True ############################################################################### 2. Common PSI settings Description: Type: boolean Valid values: True and False psi_checking_enabled = True Description: Type: string Valid values: psi_path = /proc/pressure/memory Description: Type: string Valid values: psi_metrics = full_avg10 Description: Type: float Valid values: psi_excess_duration = 60 Description: Type: float Valid values: psi_post_action_delay = 60 ############################################################################### 3. Poll rate Description: Type: float Valid values: fill_rate_mem = 4000 Description: Type: float Valid values: fill_rate_swap = 1500 Description: Type: float Valid values: fill_rate_zram = 6000 Description: Type: float Valid values: max_sleep = 3 Description: Type: float Valid values: min_sleep = 0.1 ############################################################################### 4. Warnings and notifications 4.1. GUI notifications after corrective actions Description: Type: boolean Valid values: True and False post_action_gui_notifications = True Description: Type: boolean Valid values: True and False hide_corrective_action_type = False 4.2. 
Low memory warnings Description: Type: boolean Valid values: True and False low_memory_warnings_enabled = True Description: Type: string Valid values: warning_exe = Description: Type: float (+ % or M) Valid values: warning_threshold_min_mem = 20 % Description: Type: float (+ % or M) Valid values: warning_threshold_min_swap = 20 % Description: Type: float (+ % or M) Valid values: warning_threshold_max_zram = 50 % Description: Type: float Valid values: warning_threshold_max_psi = 100 Description: Type: float Valid values: min_post_warning_delay = 30 Description: Type: float Valid values: env_cache_time = 300 ############################################################################### 5. Soft threshold Description: Type: float (+ % or M) Valid values: soft_threshold_min_mem = 20 % Description: Type: float (+ % or M) Valid values: soft_threshold_min_swap = 20 % Description: Type: float (+ % or M) Valid values: soft_threshold_max_zram = 60 % Description: Type: float Valid values: soft_threshold_max_psi = 60 ############################################################################### 6. Hard threshold hard_threshold_min_mem = 2 % Description: Type: float (+ % or M) Valid values: hard_threshold_min_swap = 2 % Description: Type: float (+ % or M) Valid values: hard_threshold_max_zram = 65 % Description: Type: float Valid values: hard_threshold_max_psi = 90 ############################################################################### 7. Customize victim selection: adjusting badness of processes 7.1. Ignore positive oom_score_adj Description: Type: boolean Valid values: True and False ignore_positive_oom_score_adj = True 7.3.1. Matching process names with RE patterns to change their badness Syntax: @BADNESS_ADJ_RE_NAME badness_adj /// RE_pattern New badness value will be += badness_adj It is possible to compare multiple patterns with different badness_adj values. Example: @BADNESS_ADJ_RE_NAME -500 /// ^sshd$ 7.3.2.
Matching CGroup_v1-line with RE patterns @BADNESS_ADJ_RE_CGROUP_V1 -50 /// ^/system\.slice/ @BADNESS_ADJ_RE_CGROUP_V1 50 /// /foo\.service$ @BADNESS_ADJ_RE_CGROUP_V1 -50 /// ^/user\.slice/ 7.3.3. Matching CGroup_v2-line with RE patterns @BADNESS_ADJ_RE_CGROUP_V2 100 /// ^/workload 7.3.4. Matching eUIDs with RE patterns @BADNESS_ADJ_RE_UID -100 /// ^0$ 7.3.5. Matching realpath with RE patterns @BADNESS_ADJ_RE_REALPATH 20 /// ^/usr/bin/foo 7.3.5.1. Matching cwd with RE patterns @BADNESS_ADJ_RE_CWD 20 /// ^/home/ 7.3.6. Matching cmdlines with RE patterns @BADNESS_ADJ_RE_CMDLINE 2000 /// ^/bin/sleep Prefer chromium tabs and electron-based apps @BADNESS_ADJ_RE_CMDLINE 200 /// --type=renderer Prefer firefox tabs (Web Content and WebExtensions) @BADNESS_ADJ_RE_CMDLINE 100 /// -appomni @BADNESS_ADJ_RE_CMDLINE -200 /// ^/usr/lib/virtualbox 7.3.7. Matching environ with RE patterns @BADNESS_ADJ_RE_ENVIRON 100 /// USER=user Note that you can control badness also via systemd units via OOMScoreAdjust, see www.freedesktop.org/software/systemd/man/systemd.exec.html#OOMScoreAdjust= ############################################################################### 8. Customize soft corrective actions TODO: docs Syntax: KEY REGEXP SEPARATOR COMMAND @SOFT_ACTION_RE_NAME ^tail$ /// kill -SEGV $PID @SOFT_ACTION_RE_NAME ^foo$ /// kill -SEGV $PID @SOFT_ACTION_RE_NAME ^bash$ /// kill -9 $PID @SOFT_ACTION_RE_CGROUP_V1 ^/system\.slice/ /// systemctl restart $SERVICE @SOFT_ACTION_RE_CGROUP_V1 /foo\.service$ /// systemctl restart $SERVICE @SOFT_ACTION_RE_NAME ^tail$ /// kill -TERM $PID $PID will be replaced by process PID. $NAME will be replaced by process name. $SERVICE will be replaced by .service if it exists (otherwise it will be replaced by an empty line) ############################################################################### 9.
Misc settings Description: Type: float Valid values: max_soft_exit_time = 10 Description: Type: string Valid values: post_kill_exe = Description: Type: integer Valid values: min_badness = 10 Description: Type: float Valid values: post_soft_action_delay = 3 Description: Type: float Valid values: post_zombie_delay = 0.1 Description: Type: float Valid values: victim_cache_time = 10 Description: Type: float Valid values: exe_timeout = 20 ############################################################################### 10. Verbosity, debug, logging Description: Type: boolean Valid values: True and False print_config_at_startup = True Description: Type: boolean Valid values: True and False print_mem_check_results = True Description: Type: float Valid values: min_mem_report_interval = 0 Description: Type: boolean Valid values: True and False print_proc_table = True Description: Type: string Valid values: None cgroup_v1 cgroup_v2 realpath cwd cmdline environ extra_table_info = None Description: Type: boolean Valid values: True and False print_victim_status = True Description: Type: boolean Valid values: True and False print_victim_cmdline = True Description: Type: integer Valid values: max_victim_ancestry_depth = 99 Description: Type: boolean Valid values: True and False print_statistics = True Description: Type: boolean Valid values: True and False debug_psi = True Description: Type: boolean Valid values: True and False debug_gui_notifications = True Description: Type: boolean Valid values: True and False debug_sleep = True Description: Type: boolean Valid values: True and False debug_threading = True Description: Type: boolean Valid values: True and False separate_log = True ############################################################################### Use cases, feature requests and any questions are welcome: https://github.com/hakavlad/nohang/issues 
nohang-0.2.0/deb/000077500000000000000000000000001377337215500135325ustar00rootroot00000000000000nohang-0.2.0/deb/DEBIAN/000077500000000000000000000000001377337215500144545ustar00rootroot00000000000000nohang-0.2.0/deb/DEBIAN/conffiles000066400000000000000000000001201377337215500163400ustar00rootroot00000000000000/etc/nohang/nohang.conf /etc/nohang/nohang-desktop.conf /etc/logrotate.d/nohang nohang-0.2.0/deb/DEBIAN/control000066400000000000000000000010201377337215500160500ustar00rootroot00000000000000Package: nohang Version: 0.2.0 Section: admin Architecture: all Depends: python3 Suggests: libnotify-bin, sudo, logrotate Maintainer: Alexey Avramov Priority: optional Homepage: https://github.com/hakavlad/nohang Description: Sophisticated low memory handler nohang is a highly configurable daemon for Linux which is able to correctly prevent out of memory (OOM) and keep system responsiveness in low memory conditions. The package also includes additional diagnostic tools: oom-sort, psi2log, psi-top. nohang-0.2.0/deb/DEBIAN/postinst000077500000000000000000000000301377337215500162560ustar00rootroot00000000000000systemctl daemon-reload nohang-0.2.0/deb/build.sh000077500000000000000000000002751377337215500151740ustar00rootroot00000000000000#!/bin/sh -v make \ DESTDIR=deb/package \ PREFIX=/usr \ SYSCONFDIR=/etc \ SYSTEMDUNITDIR=/lib/systemd/system \ build_deb cd deb cp -r DEBIAN package/ fakeroot dpkg-deb --build package nohang-0.2.0/docs/000077500000000000000000000000001377337215500137305ustar00rootroot00000000000000nohang-0.2.0/docs/FAQ.ru.md000066400000000000000000000145261377337215500153160ustar00rootroot00000000000000 # FAQ for Russian speakers ### What are the main features of the daemon? - Explicit and flexible configuration through a config file. Everything that can be made configurable has been moved into the config wherever possible. As a result, the daemon cannot be started without a config. The user can also see the values of all config keys. Hidden parameters are kept to a minimum.
- Staged response to memory shortage. Three response thresholds can be configured: 1. A threshold for sending GUI low memory notifications (or for running an arbitrary command, for example sending an e-mail). 2. A threshold for sending the SIGTERM signal (in most cases the correction happens here). This is the main corrective action, after which most processes exit, gracefully where possible. 3. If the victim does not respond to SIGTERM, it receives SIGKILL when available memory decreases further or after a certain time has passed (the max_soft_exit_time config key). - Response to different kinds of triggers: 1. If swap space is present, the daemon responds to the amount of free swap space (SwapFree), provided that available memory is also below the configured level. If there is no swap space, the daemon responds to the amount of available memory (MemAvailable). 2. If swap space is present, the daemon can respond to PSI metrics being exceeded, if this is enabled in the config. A corrective action is taken if the available memory threshold and the threshold of the selected PSI metric are both exceeded at the same time for a given period (psi_excess_duration), but no sooner than psi_post_action_delay after the previous corrective action. 3. Response to the size of mem_used_total if zram devices are mounted. - Influence on victim selection for a corrective action by matching various process properties (name, exe realpath, euid, cgroup, etc.) against given regular expressions. This resembles the mechanism used in the kernel, but instead of setting oom_score_adj for individual PIDs you can set badness_adj for all processes that match certain criteria. - GUI notifications about corrective actions taken. - Customization of the corrective action. This feature is still immature. It includes: 1.
At the soft (SIGTERM) threshold, run a given command for processes with given properties when they become victims. 2. At the hard (SIGKILL) threshold, the post_kill_exe key can set an arbitrary command that is run after every hard corrective action. ### Why not trigger the kernel OOM killer? ### What is PSI and how does it help in handling memory shortage? ### How to check whether the kernel supports PSI? ### What is the zram_checking_enabled key for? ### How does the daemon prevent the killing of innocent processes? ### GUI notifications are not shown. What could be the cause? ### KDE Plasma does not keep the history of GUI notifications. How to fix this? ### How to use oom-sort? ### How to use psi-top? ### How to use psi2log? ### nohang vs earlyoom ### nohang vs oomd ### How to run and test nohang without installing it? ### What is wrong with ZFS? ### In which situations will the daemon not help? ### Why do the default PSI response settings suggest reacting to some avg10 rather than full avg10? ### The system hangs and the daemon does not help. What is the problem and what to do? ### How to test the daemon? How to create memory load? ### In which cases is it better not to enable PSI checking? ### nohang vs nohang-desktop: what is the difference? ### How does it work in general? ### How to get the list of PSI files available for monitoring? ### Is killing process groups supported? No, but support for this may be added in future releases. ### How to view the logs? nohang-0.2.0/docs/nohang.manpage.md000066400000000000000000000113121377337215500171310ustar00rootroot00000000000000% nohang(8) | Linux System Administrator's Manual # NAME nohang - A sophisticated low memory handler # SYNOPSIS **nohang** [**OPTION**]... # DESCRIPTION nohang is a highly configurable daemon for Linux which is able to correctly prevent out of memory (OOM) and keep system responsiveness in low memory conditions.
# REQUIREMENTS #### For basic usage: - Linux (>= 3.14, since MemAvailable appeared in /proc/meminfo) - Python (>= 3.3) #### To respond to PSI metrics (optional): - Linux (>= 4.20) with CONFIG_PSI=y #### To show GUI notifications (optional): - notification server (most of desktop environments use their own implementations) - libnotify (Arch Linux, Fedora, openSUSE) or libnotify-bin (Debian GNU/Linux, Ubuntu) - sudo if nohang started with UID=0. # COMMAND-LINE OPTIONS #### -h, --help show this help message and exit #### -v, --version show version of installed package and exit #### -m, --memload consume memory until 40 MiB (MemAvailable + SwapFree) remain free, and terminate the process #### -c CONFIG, --config CONFIG path to the config file. This should only be used with one of the following options: --monitor, --tasks, --check #### --check check and show the configuration and exit. This should only be used with -c/--config CONFIG option #### --monitor start monitoring. This should only be used with -c/--config CONFIG option #### --tasks show tasks state and exit. This should only be used with -c/--config CONFIG option # FILES #### :SYSCONFDIR:/nohang/nohang.conf path to vanilla nohang configuration file #### :SYSCONFDIR:/nohang/nohang-desktop.conf path to configuration file with settings optimized for desktop usage #### :DATADIR:/nohang/nohang.conf path to file with *default* nohang.conf values #### :DATADIR:/nohang/nohang-desktop.conf path to file with *default* nohang-desktop.conf values #### /var/log/nohang/nohang.log optional log file that stores entries if separate_log=True in the config #### /etc/logrotate.d/nohang logrotate config file that controls rotation in /var/log/nohang/ # nohang.conf vs nohang-desktop.conf - nohang.conf provides vanilla default settings without PSI checking enabled, without any badness correction and without GUI notifications enabled. - nohang-desktop.conf provides default settings optimized for desktop usage. 
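The kernel requirements listed above, and the MemAvailable + SwapFree sum that **--memload** watches, can be checked by hand. The following is only a minimal sketch, not nohang's own code: the function names `parse_meminfo`, `free_mib` and `check_requirements` are hypothetical, and the presence of `/proc/pressure/memory` is treated as a rough hint that PSI is available rather than as proof.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values.

    Values are returned as printed by the kernel (kB for sized fields).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_mib(fields):
    """Return MemAvailable + SwapFree in MiB."""
    return (fields["MemAvailable"] + fields["SwapFree"]) / 1024


def check_requirements(meminfo_path="/proc/meminfo",
                       psi_path="/proc/pressure/memory"):
    """Report which optional and mandatory kernel features look present."""
    with open(meminfo_path) as f:
        fields = parse_meminfo(f.read())
    return {
        # MemAvailable appeared in Linux 3.14
        "MemAvailable": "MemAvailable" in fields,
        # PSI needs Linux >= 4.20 with CONFIG_PSI=y (assumption: the file
        # exists only when PSI is usable)
        "PSI": os.path.exists(psi_path),
    }
```

Calling `check_requirements()` on a Linux system returns a small dict of booleans; the parsing functions take plain text, so they can also be tried on a saved copy of `/proc/meminfo`.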
# PROBLEMS The following problems can occur with out-of-tree kernels and modules: - The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache. However, in contrast to the buffer cache, it currently does not count toward MemAvailable [1]. See also [2] and [3]. - Linux kernels without CONFIG_CGROUP_CPUACCT=y (linux-ck, for example) provide incorrect PSI metrics, see this thread [4]. # HOW TO CONFIGURE The program can be configured by editing the config file. The configuration includes the following sections: - Memory levels to respond to as an OOM threat - Response to PSI memory metrics - The frequency of checking the level of available memory (and CPU usage) - The prevention of killing innocent victims - Impact on the badness of processes via matching their names, cmdlines and UIDs with regular expressions - The execution of a specific command or sending any signal instead of sending the SIGTERM signal - GUI notifications: - notifications of corrective actions taken - low memory warnings - Verbosity - Misc Just read the description of the parameters and edit the values. Restart the daemon to apply the changes. # CHECK CONFIG Check the config for errors: $ nohang --check --config /path/to/config # HOW TO TEST The safest way is to run **nohang --memload**. This causes memory consumption, and the process will exit before OOM occurs. Another way is to run **tail /dev/zero**. This causes fast memory consumption and ends in OOM. If testing occurs while nohang is running, these processes should be terminated before OOM occurs. # LOGGING To view the latest entries in the log (for systemd users): $ **sudo journalctl -eu nohang.service** or $ **sudo journalctl -eu nohang-desktop.service** You can also enable **separate_log** in the config to log to **/var/log/nohang/nohang.log**. # SIGNALS Sending SIGTERM, SIGINT, SIGQUIT or SIGHUP signals to the nohang process causes it to display corrective action stats and exit.
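The signal behaviour described in SIGNALS can be illustrated with a short sketch. This is an assumption-laden illustration, not nohang's implementation: the `stats` dict, `format_stats`, `on_exit_signal` and `install_handlers` are hypothetical names, and the real report format is not shown in this manual.

```python
import signal
import sys

# Hypothetical counters of corrective actions taken so far.
stats = {"SIGTERM": 0, "SIGKILL": 0}


def format_stats(counters):
    """Build a plain-text report of corrective action counters."""
    lines = ["Corrective action stats:"]
    for action, count in sorted(counters.items()):
        lines.append("  {}: {}".format(action, count))
    return "\n".join(lines)


def on_exit_signal(signum, frame):
    """Print the stats report and exit with status 0."""
    print(format_stats(stats))
    sys.exit(0)


def install_handlers():
    """Register the same handler for the four signals named above.

    Must be called from the main thread.
    """
    for sig in (signal.SIGTERM, signal.SIGINT,
                signal.SIGQUIT, signal.SIGHUP):
        signal.signal(sig, on_exit_signal)
```

A monitoring loop would call `install_handlers()` once at startup and update `stats` after each corrective action, so that any of the four signals produces a final report before the process exits.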
# REPORTING BUGS Please ask any questions and report bugs at . # AUTHOR Written by Alexey Avramov . # HOMEPAGE Homepage is . # SEE ALSO oom-sort(1), psi-top(1), psi2log(1) # NOTES 1. https://github.com/openzfs/zfs/issues/10255 2. https://github.com/rfjakob/earlyoom/pull/191#issuecomment-622314296 3. https://github.com/hakavlad/nohang/issues/89 4. https://github.com/hakavlad/nohang/issues/25#issuecomment-521390412 nohang-0.2.0/docs/oom-sort.manpage.md000066400000000000000000000015231377337215500174410ustar00rootroot00000000000000% oom-sort(1) | General Commands Manual # NAME oom-sort - sort processes by oom_score # SYNOPSIS **oom-sort** [**OPTION**]... # DESCRIPTION oom-sort is a script that sorts tasks by oom_score by default. oom-sort is part of the nohang package. # OPTIONS #### -h, --help show this help message and exit #### --num NUM, -n NUM max number of lines; default: 99999 #### --len LEN, -l LEN max cmdline length; default: 99999 #### --sort SORT, -s SORT sort by unit; available units: oom_score, oom_score_adj, UID, PID, Name, VmRSS, VmSwap, cmdline (optional); default unit: oom_score # REPORTING BUGS Please ask any questions and report bugs at . # AUTHOR Written by Alexey Avramov . # HOMEPAGE Homepage is . # SEE ALSO psi-top(1), psi2log(1), nohang(8) nohang-0.2.0/docs/psi-top.manpage.md000066400000000000000000000013271377337215500172570ustar00rootroot00000000000000% psi-top(1) | General Commands Manual # NAME psi-top - print the PSI metrics values for every cgroup. # SYNOPSIS **psi-top** [**OPTION**]... # DESCRIPTION psi-top is a script that prints the PSI metrics values for every cgroup. psi-top is part of the nohang package. # OPTIONS #### -h, --help show this help message and exit #### -m METRICS, --metrics METRICS metrics (memory, io or cpu) # EXAMPLES $ psi-top $ psi-top --metrics io $ psi-top -m cpu # REPORTING BUGS Please ask any questions and report bugs at . # AUTHOR Written by Alexey Avramov . # HOMEPAGE Homepage is .
# SEE ALSO oom-sort(1), psi2log(1), nohang(8) nohang-0.2.0/docs/psi2log.manpage.md000066400000000000000000000021211377337215500172340ustar00rootroot00000000000000% psi2log(1) | General Commands Manual # NAME psi2log \- PSI metrics monitor and logger # SYNOPSIS **psi2log** [**OPTION**]... # DESCRIPTION psi2log is a CLI tool that can check and log PSI metrics from a specified target. psi2log is part of the nohang package. # OPTIONS #### -h, --help show this help message and exit #### -t TARGET, --target TARGET target (cgroup_v2 or SYSTEM_WIDE) #### -i INTERVAL, --interval INTERVAL interval in sec #### -l LOG, --log LOG path to log file #### -m MODE, --mode MODE mode (1 or 2) #### -s SUPPRESS_OUTPUT, --suppress-output SUPPRESS_OUTPUT suppress output # EXAMPLES $ psi2log $ psi2log --mode 2 $ psi2log --target /user.slice --interval 1.5 --log psi.log # SIGNALS Sending SIGTERM, SIGINT, SIGQUIT or SIGHUP signals to the psi2log process causes it to display peak values and exit. # REPORTING BUGS Please ask any questions and report bugs at . # AUTHOR Written by Alexey Avramov . # HOMEPAGE Homepage is . # SEE ALSO oom-sort(1), psi-top(1), nohang(8) nohang-0.2.0/man/000077500000000000000000000000001377337215500135535ustar00rootroot00000000000000nohang-0.2.0/man/nohang.8000066400000000000000000000123331377337215500151200ustar00rootroot00000000000000.\" Automatically generated by Pandoc 1.17.2 .\" .TH "nohang" "8" "" "" "Linux System Administrator\[aq]s Manual" .hy .SH NAME .PP nohang \- A sophisticated low memory handler .SH SYNOPSIS .PP \f[B]nohang\f[] [\f[B]OPTION\f[]]... .SH DESCRIPTION .PP nohang is a highly configurable daemon for Linux which is able to correctly prevent out of memory (OOM) and keep system responsiveness in low memory conditions.
.SH REQUIREMENTS .SS For basic usage: .IP \[bu] 2 Linux (>= 3.14, since MemAvailable appeared in /proc/meminfo) .IP \[bu] 2 Python (>= 3.3) .SS To respond to PSI metrics (optional): .IP \[bu] 2 Linux (>= 4.20) with CONFIG_PSI=y .SS To show GUI notifications (optional): .IP \[bu] 2 notification server (most desktop environments use their own implementations) .IP \[bu] 2 libnotify (Arch Linux, Fedora, openSUSE) or libnotify\-bin (Debian GNU/Linux, Ubuntu) .IP \[bu] 2 sudo if nohang is started with UID=0. .SH COMMAND\-LINE OPTIONS .SS \-h, \-\-help .PP show this help message and exit .SS \-v, \-\-version .PP show version of installed package and exit .SS \-m, \-\-memload .PP consume memory until 40 MiB (MemAvailable + SwapFree) remain free, and terminate the process .SS \-c CONFIG, \-\-config CONFIG .PP path to the config file. This should only be used with one of the following options: \-\-monitor, \-\-tasks, \-\-check .SS \-\-check .PP check and show the configuration and exit. This should only be used with \-c/\-\-config CONFIG option .SS \-\-monitor .PP start monitoring. This should only be used with \-c/\-\-config CONFIG option .SS \-\-tasks .PP show tasks state and exit.
This should only be used with \-c/\-\-config CONFIG option .SH FILES .SS :SYSCONFDIR:/nohang/nohang.conf .PP path to vanilla nohang configuration file .SS :SYSCONFDIR:/nohang/nohang\-desktop.conf .PP path to configuration file with settings optimized for desktop usage .SS :DATADIR:/nohang/nohang.conf .PP path to file with \f[I]default\f[] nohang.conf values .SS :DATADIR:/nohang/nohang\-desktop.conf .PP path to file with \f[I]default\f[] nohang\-desktop.conf values .SS /var/log/nohang/nohang.log .PP optional log file that stores entries if separate_log=True in the config .SS /etc/logrotate.d/nohang .PP logrotate config file that controls rotation in /var/log/nohang/ .SH nohang.conf vs nohang\-desktop.conf .IP \[bu] 2 nohang.conf provides vanilla default settings without PSI checking enabled, without any badness correction and without GUI notifications enabled. .IP \[bu] 2 nohang\-desktop.conf provides default settings optimized for desktop usage. .SH PROBLEMS .PP The following problems can occur with out\-of\-tree kernels and modules: .IP \[bu] 2 The ZFS ARC cache is memory\-reclaimable, like the Linux buffer cache. However, in contrast to the buffer cache, it currently does not count toward MemAvailable [1]. See also [2] and [3]. .IP \[bu] 2 Linux kernels without CONFIG_CGROUP_CPUACCT=y (linux\-ck, for example) provide incorrect PSI metrics, see this thread [4]. .SH HOW TO CONFIGURE .PP The program can be configured by editing the config file.
The configuration includes the following sections: .IP \[bu] 2 Memory levels to respond to as an OOM threat .IP \[bu] 2 Response to PSI memory metrics .IP \[bu] 2 The frequency of checking the level of available memory (and CPU usage) .IP \[bu] 2 The prevention of killing innocent victims .IP \[bu] 2 Impact on the badness of processes via matching their names, cmdlines and UIDs with regular expressions .IP \[bu] 2 The execution of a specific command or sending any signal instead of sending the SIGTERM signal .IP \[bu] 2 GUI notifications: .RS 2 .IP \[bu] 2 notifications of corrective actions taken .IP \[bu] 2 low memory warnings .RE .IP \[bu] 2 Verbosity .IP \[bu] 2 Misc .PP Just read the description of the parameters and edit the values. Restart the daemon to apply the changes. .SH CHECK CONFIG .PP Check the config for errors: .PP $ nohang \-\-check \-\-config /path/to/config .SH HOW TO TEST .PP The safest way is to run \f[B]nohang \-\-memload\f[]. This causes memory consumption, and the process will exit before OOM occurs. Another way is to run \f[B]tail /dev/zero\f[]. This causes fast memory consumption and causes OOM at the end. If testing occurs while nohang is running, these processes should be terminated before OOM occurs. .SH LOGGING .PP To view the latest entries in the log (for systemd users): .PP $ \f[B]sudo journalctl \-eu nohang.service\f[] .PP or .PP $ \f[B]sudo journalctl \-eu nohang\-desktop.service\f[] .PP You can also enable \f[B]separate_log\f[] in the config to log to \f[B]/var/log/nohang/nohang.log\f[]. .SH SIGNALS .PP Sending a SIGTERM, SIGINT, SIGQUIT or SIGHUP signal to the nohang process causes it to display corrective action stats and exit. .SH REPORTING BUGS .PP Please ask any questions and report bugs at . .SH AUTHOR .PP Written by Alexey Avramov . .SH HOMEPAGE .PP Homepage is . .SH SEE ALSO .PP oom\-sort(1), psi\-top(1), psi2log(1) .SH NOTES .IP "1." 3 https://github.com/openzfs/zfs/issues/10255 .IP "2."
3 https://github.com/rfjakob/earlyoom/pull/191#issuecomment\-622314296 .IP "3." 3 https://github.com/hakavlad/nohang/issues/89 .IP "4." 3 https://github.com/hakavlad/nohang/issues/25#issuecomment\-521390412 nohang-0.2.0/man/oom-sort.1000066400000000000000000000017341377337215500154210ustar00rootroot00000000000000.\" Automatically generated by Pandoc 1.17.2 .\" .TH "oom\-sort" "1" "" "" "General Commands Manual" .hy .SH NAME .PP oom\-sort \- sort processes by oom_score .SH SYNOPSIS .PP \f[B]oom\-sort\f[] [\f[B]OPTION\f[]]... .SH DESCRIPTION .PP oom\-sort is a script that sorts tasks by oom_score by default. oom\-sort is part of the nohang package. .SH OPTIONS .SS \-h, \-\-help .PP show this help message and exit .SS \-\-num NUM, \-n NUM .PP max number of lines; default: 99999 .SS \-\-len LEN, \-l LEN .PP max cmdline length; default: 99999 .SS \-\-sort SORT, \-s SORT .PP sort by unit; available units: oom_score, oom_score_adj, UID, PID, Name, VmRSS, VmSwap, cmdline (optional); default unit: oom_score .SH REPORTING BUGS .PP Please ask any questions and report bugs at . .SH AUTHOR .PP Written by Alexey Avramov . .SH HOMEPAGE .PP Homepage is . .SH SEE ALSO .PP psi\-top(1), psi2log(1), nohang(8) nohang-0.2.0/man/psi-top.1000066400000000000000000000015471377337215500152370ustar00rootroot00000000000000.\" Automatically generated by Pandoc 1.17.2 .\" .TH "psi\-top" "1" "" "" "General Commands Manual" .hy .SH NAME .PP psi\-top \- print the PSI metrics values for every cgroup. .SH SYNOPSIS .PP \f[B]psi\-top\f[] [\f[B]OPTION\f[]]... .SH DESCRIPTION .PP psi\-top is a script that prints the PSI metrics values for every cgroup. psi\-top is part of the nohang package. .SH OPTIONS .SS \-h, \-\-help .PP show this help message and exit .SS \-m METRICS, \-\-metrics METRICS .PP metrics (memory, io or cpu) .SH EXAMPLES .PP $ psi\-top .PP $ psi\-top \-\-metrics io .PP $ psi\-top \-m cpu .SH REPORTING BUGS .PP Please ask any questions and report bugs at . .SH AUTHOR .PP Written by Alexey Avramov .
.SH HOMEPAGE .PP Homepage is . .SH SEE ALSO .PP oom\-sort(1), psi2log(1), nohang(8) nohang-0.2.0/man/psi2log.1000066400000000000000000000023701377337215500152160ustar00rootroot00000000000000.\" Automatically generated by Pandoc 1.17.2 .\" .TH "psi2log" "1" "" "" "General Commands Manual" .hy .SH NAME .PP psi2log \- PSI metrics monitor and logger .SH SYNOPSIS .PP \f[B]psi2log\f[] [\f[B]OPTION\f[]]... .SH DESCRIPTION .PP psi2log is a CLI tool that can check and log PSI metrics from a specified target. psi2log is part of the nohang package. .SH OPTIONS .SS \-h, \-\-help .PP show this help message and exit .SS \-t TARGET, \-\-target TARGET .PP target (cgroup_v2 or SYSTEM_WIDE) .SS \-i INTERVAL, \-\-interval INTERVAL .PP interval in sec .SS \-l LOG, \-\-log LOG .PP path to log file .SS \-m MODE, \-\-mode MODE .PP mode (1 or 2) .SS \-s SUPPRESS_OUTPUT, \-\-suppress\-output SUPPRESS_OUTPUT .PP suppress output .SH EXAMPLES .PP $ psi2log .PP $ psi2log \-\-mode 2 .PP $ psi2log \-\-target /user.slice \-\-interval 1.5 \-\-log psi.log .SH SIGNALS .PP Sending a SIGTERM, SIGINT, SIGQUIT or SIGHUP signal to the psi2log process causes it to display peak values and exit. .SH REPORTING BUGS .PP Please ask any questions and report bugs at . .SH AUTHOR .PP Written by Alexey Avramov . .SH HOMEPAGE .PP Homepage is .
.SH SEE ALSO .PP oom\-sort(1), psi\-top(1), nohang(8) nohang-0.2.0/openrc/000077500000000000000000000000001377337215500142665ustar00rootroot00000000000000nohang-0.2.0/openrc/nohang-desktop.in000077500000000000000000000004501377337215500175410ustar00rootroot00000000000000#!/sbin/openrc-run name="nohang-desktop daemon" description="Sophisticated low memory handler" command=:TARGET_SBINDIR:/nohang command_args="--monitor --config :TARGET_SYSCONFDIR:/nohang/nohang-desktop.conf" pidfile="/var/run/nohang-desktop" start_stop_daemon_args="--background --make-pidfile" nohang-0.2.0/openrc/nohang.in000077500000000000000000000004201377337215500160670ustar00rootroot00000000000000#!/sbin/openrc-run name="nohang daemon" description="Sophisticated low memory handler" command=:TARGET_SBINDIR:/nohang command_args="--monitor --config :TARGET_SYSCONFDIR:/nohang/nohang.conf" pidfile="/var/run/nohang" start_stop_daemon_args="--background --make-pidfile" nohang-0.2.0/src/000077500000000000000000000000001377337215500135675ustar00rootroot00000000000000nohang-0.2.0/src/nohang000077500000000000000000003614001377337215500147730ustar00rootroot00000000000000#!/usr/bin/env python3 """A sophisticated low memory handler.""" import os from ctypes import CDLL from time import sleep, monotonic from operator import itemgetter from sys import stdout, stderr, argv, exit from re import search from sre_constants import error as invalid_re from signal import signal, SIGKILL, SIGTERM, SIGINT, SIGQUIT, SIGHUP, SIGUSR1 def read_path(path): """ """ try: fd[path].seek(0) except ValueError: try: fd[path] = open(path, 'rb', buffering=0) except FileNotFoundError: return None except KeyError: try: fd[path] = open(path, 'rb', buffering=0) except FileNotFoundError: return None try: return fd[path].read(99999).decode() except OSError: fd[path].close() return None def missing_config_key(key): """ """ errprint('ERROR: invalid config: missing key "{}"'.format(key)) exit(1) def invalid_config_key_value(key): """ """ 
errprint('ERROR: invalid config: invalid "{}" value'.format(key)) exit(1) def check_permissions(): """ """ try: os.path.realpath('/proc/1/exe') except Exception as e: log('WARNING: missing CAP_SYS_PTRACE: {}'.format(e)) try: os.kill(1, 0) except Exception as e: log('WARNING: cannot send a signal: {}'.format(e)) try: rline1('/proc/1/oom_score') except Exception as e: errprint('ERROR: {}'.format(e)) exit(1) def memload(): """ """ from random import random r = str(random())[2:8] hi = 'Enter the numbers {} to confirm that you are not a robot: '.format(r) try: t0 = monotonic() inp = input(hi) except KeyboardInterrupt: errprint('KeyboardInterrupt\nExit') exit(1) try: os.setreuid(1, 1) except Exception: pass if inp != r: errprint('Captcha is not passed ("{}" != "{}")'.format(inp, r)) errprint('memload() is not for robots\nExit') exit(1) if monotonic() - t0 > 30: errprint('Captcha is not passed (timeout expired)') errprint('memload() is not for robots\nExit') exit(1) else: print('-' * 68) with open('/proc/meminfo') as f: mem_list = f.readlines() mem_list_names = [] for s in mem_list: mem_list_names.append(s.split(':')[0]) try: mem_available_index = mem_list_names.index('MemAvailable') except ValueError: errprint('Your Linux kernel is too old, Linux 3.14+ required\nExit') swap_free_index = mem_list_names.index('SwapFree') def check_mem_and_swap(): """find mem_available, swap_total, swap_free""" with open('/proc/meminfo') as f: for n, line in enumerate(f): if n == mem_available_index: mem_available = int(line.split(':')[1][:-4]) continue if n == swap_free_index: swap_free = int(line.split(':')[1][:-4]) break return mem_available, swap_free def print_mem(mem_available, swap_free): print('\033MMemAvailable: {} MiB, SwapFree: {} MiB ' ' '.format( round(mem_available / 1024), round(swap_free / 1024))) hi = 'Warning! The process will consume memory until 40 MiB of mem' \ 'ory\n(MemAvailable + SwapFree) remain free, and it will be t' \ 'erminated via SIGUSR1\nat the end. 
This may cause the system' \ ' to freeze and processes to terminate.\nDo you want to conti' \ 'nue? [No/Yes] ' try: inp = input(hi) except KeyboardInterrupt: errprint('KeyboardInterrupt\nExit') exit(1) if inp != 'Yes': print('Exit') exit() else: print('Memory consumption has started!\n') ex = [] z = monotonic() self_pid = os.getpid() while True: try: mem_available, swap_free = check_mem_and_swap() x = mem_available + swap_free if x <= 1024 * 40: # 40 MiB print_mem(mem_available, swap_free) print('Self terminating by SIGUSR1') os.kill(self_pid, SIGUSR1) else: ex.append(bytearray(1024 * 40)) # step size is 40 KiB u = monotonic() - z if u <= 0.01: continue z = monotonic() print_mem(mem_available, swap_free) except KeyboardInterrupt: errprint('KeyboardInterrupt') errprint('Self terminating by the SIGUSR1 signal') os.kill(self_pid, SIGUSR1) except MemoryError: errprint('MemoryError') errprint('Self terminating by the SIGUSR1 signal') os.kill(self_pid, SIGUSR1) def arcstats(): """ """ with open(arcstats_path, 'rb') as f: a_list = f.read().decode().split('\n') for n, line in enumerate(a_list): if n == c_min_index: c_min = int(line.rpartition(' ')[2]) / 1024 elif n == size_index: size = int(line.rpartition(' ')[2]) / 1024 elif n == arc_meta_used_index: arc_meta_used = int(line.rpartition(' ')[2]) / 1024 elif n == arc_meta_min_index: arc_meta_min = int(line.rpartition(' ')[2]) / 1024 else: continue c_rec = size - c_min if c_rec < 0: c_rec = 0 meta_rec = arc_meta_used - arc_meta_min if meta_rec < 0: meta_rec = 0 zfs_available = c_rec + meta_rec # return c_min, size, arc_meta_used, arc_meta_min, zfs_available return zfs_available def exe(cmd): """ execute cmd in subprocess.Popen() """ cmd_list = shlex.split(cmd) cmd_num_dict['cmd_num'] += 1 cmd_num = cmd_num_dict['cmd_num'] th_name = threading.current_thread().getName() log('Executing Command-{} {} with timeout {}s in {}'.format( cmd_num, cmd_list, exe_timeout, th_name, )) t3 = monotonic() try: with Popen(cmd_list) as proc: 
try: proc.wait(timeout=exe_timeout) exit_status = proc.poll() t4 = monotonic() log('Command-{} execution completed in {}s; exit status' ': {}'.format(cmd_num, round(t4 - t3, 3), exit_status)) except TimeoutExpired: proc.kill() log('Timeout expired for Command-{}'.format(cmd_num)) except Exception as e: log('Exception in {}: {}'.format(th_name, e)) def start_thread(func, *a, **k): """ run function in a new thread """ th = threading.Thread(target=func, args=a, kwargs=k, daemon=True) th_name = th.getName() if debug_threading: log('Starting {} from {}'.format( th_name, threading.current_thread().getName() )) try: t1 = monotonic() th.start() t2 = monotonic() if debug_threading: log('{} has started in {} ms, {} threads are ' 'currently alive'.format(th_name, round(( t2 - t1) * 1000, 1), threading.active_count())) except RuntimeError: log('RuntimeError: cannot start {}'.format(th_name)) return 1 def re_pid_environ(pid): """ read environ of 1 process returns tuple with USER, DBUS, DISPLAY like follow: ('user', 'DISPLAY=:0', 'DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus') returns None if these vars is not in /proc/[pid]/environ """ try: with open('/proc/' + pid + '/environ', 'rb') as f: env = f.read().decode('utf-8', 'ignore') except (FileNotFoundError, ProcessLookupError): return None if display_env in env and dbus_env in env and user_env in env: env_list = env.split('\x00') # iterating over a list of process environment variables for i in env_list: # exclude Display Manager's user if i.startswith('HOME=/var'): return None if i.startswith(user_env): user = i if user == 'USER=root': return None continue if i.startswith(display_env): if i[-2] == '.': # DISPLAY=:0.0 -> DISPLAY=:0 display = i[:-2] else: display = i if len(display) > 10: # skip DISPLAY >= :10 return None continue if i.startswith(dbus_env): dbus = i continue try: return user.partition('USER=')[2], display, dbus except UnboundLocalError: return None def root_notify_env(): """return set(user, display, 
dbus)""" unsorted_envs_list = [] # iterates over processes, find processes with suitable env for pid in os.listdir('/proc'): if is_alive(pid): one_env = re_pid_environ(pid) unsorted_envs_list.append(one_env) env = set(unsorted_envs_list) env.discard(None) # deduplicate dbus new_env = [] end = [] for i in env: key = i[0] + i[1] if key not in end: end.append(key) new_env.append(i) else: continue return new_env def pop(cmd): """ run cmd in subprocess.Popen() """ cmd_num_dict['cmd_num'] += 1 cmd_num = cmd_num_dict['cmd_num'] if swap_total == 0: wait_time = 15 else: wait_time = 30 th_name = threading.current_thread().getName() log('Executing Command-{} {} with timeout {}s in {}'.format( cmd_num, cmd, wait_time, th_name )) t3 = monotonic() try: with Popen(cmd) as proc: try: proc.wait(timeout=wait_time) err = proc.poll() t4 = monotonic() if debug_gui_notifications: log('Command-{} execution completed in {}s; exit status' ': {}'.format(cmd_num, round(t4 - t3, 3), err)) except TimeoutExpired: proc.kill() if debug_gui_notifications: log('Timeout expired for Command-{}'.format(cmd_num)) except Exception as e: log('Exception in {}: {}'.format(th_name, e)) def send_notification(title, body): """ """ if self_uid != 0: cmd = ['notify-send', '--icon=dialog-warning', title, body] pop(cmd) return None t1 = monotonic() if envd['t'] is None: list_with_envs = root_notify_env() envd['list_with_envs'] = list_with_envs envd['t'] = monotonic() cached_env = '' elif monotonic() - envd['t'] > env_cache_time: list_with_envs = root_notify_env() envd['list_with_envs'] = list_with_envs envd['t'] = monotonic() cached_env = '' else: list_with_envs = envd['list_with_envs'] cached_env = ' (cached)' t2 = monotonic() if debug_gui_notifications: log('Found env in {} ms{}'.format(round((t2 - t1) * 1000), cached_env)) log(' Title: {}'.format([title])) log(' Body: {}'.format([body])) log(' Env list: {}'.format(list_with_envs)) list_len = len(list_with_envs) # if somebody logged in with GUI if list_len > 0: 
# iterating over logged-in users for i in list_with_envs: username, display_env, dbus_env = i[0], i[1], i[2] display_tuple = display_env.partition('=') dbus_tuple = dbus_env.partition('=') display_value = display_tuple[2] dbus_value = dbus_tuple[2] cmd = [ 'sudo', '-u', username, 'env', 'DISPLAY=' + display_value, 'DBUS_SESSION_BUS_ADDRESS=' + dbus_value, 'notify-send', '--icon=dialog-warning', '--app-name=nohang', title, body ] start_thread(pop, cmd) else: if debug_gui_notifications: log('Nobody logged-in with GUI. Nothing to do.') def send_notify_warn(): """ Implement Low memory warnings """ log('Warning threshold exceeded') log_meminfo() if check_warning_exe: start_thread(exe, warning_exe) else: title = 'Low memory' shared = meminfo()['shared'] sh_percent = shared / mem_total if sh_percent > 0.6: body = 'Save your unsaved data!\nClear tmpfs! Shmem: {}%'.format( round(sh_percent * 100)) elif sh_percent > 0.3: body = 'Save your unsaved data!\nClose unused apps!\nClear ' \ 'tmpfs! Shmem: {}%'.format(round(sh_percent * 100)) else: body = 'Save your unsaved data!\nClose unused apps!' """" body = 'MemAvail: {}%\nSwapFree: {}%'.format( round(mem_available / mem_total * 100), round(swap_free / (swap_total + 0.1) * 100) ) """ start_thread(send_notification, title, body) def send_notify(threshold, name, pid): """ Notificate about OOM Preventing. threshold: key for notify_sig_dict name: str process name pid: str process pid """ title = 'System hang prevention' if hide_corrective_action_type: body = 'Corrective action applied' else: body = '{} [{}] {}'.format( notify_sig_dict[threshold], pid, name.replace( # symbol '&' can break notifications in some themes, # therefore it is replaced by '*' '&', '*' )) start_thread(send_notification, title, body) def send_notify_etc(pid, name, command): """ Notificate about OOM Preventing. 
command: str command that will be executed name: str process name pid: str process pid """ title = 'System hang prevention' if hide_corrective_action_type: body = 'Corrective action applied' else: body = 'Victim is [{}] {}\nExecute the command:\n' \ '{}'.format(pid, name.replace( '&', '*'), command.replace('&', '*')) start_thread(send_notification, title, body) def check_config(): """ """ log('\n0. Check kernel messages for OOM events') log(' @check_kmsg: <{}>'.format(check_kmsg)) log(' @debug_kmsg: <{}>'.format(debug_kmsg)) log('\n1. Common zram settings') log(' zram_checking_enabled: {}'.format(zram_checking_enabled)) log('\n2. Common PSI settings') log(' psi_checking_enabled: {}'.format(psi_checking_enabled)) log(' psi_path: {}'.format(psi_path)) log(' psi_metrics: {}'.format(psi_metrics)) log(' psi_excess_duration: {} sec'.format(psi_excess_duration)) log(' psi_post_action_delay: {} sec'.format(psi_post_action_delay)) log('\n3. Poll rate') log(' fill_rate_mem: {}'.format(fill_rate_mem)) log(' fill_rate_swap: {}'.format(fill_rate_swap)) log(' fill_rate_zram: {}'.format(fill_rate_zram)) log(' max_sleep: {} sec'.format(max_sleep)) log(' min_sleep: {} sec'.format(min_sleep)) log('\n4. 
Warnings and notifications') log(' post_action_gui_notifications: {}'.format( post_action_gui_notifications)) log(' hide_corrective_action_type: {}'.format( hide_corrective_action_type)) log(' low_memory_warnings_enabled: {}'.format( low_memory_warnings_enabled)) log(' warning_exe: {}'.format(warning_exe)) log(' warning_threshold_min_mem: {} MiB, {} %'.format(round( warning_threshold_min_mem_mb), round( warning_threshold_min_mem_percent, 1))) log(' warning_threshold_min_swap: {}'.format (warning_threshold_min_swap)) log(' warning_threshold_max_zram: {} MiB, {} %'.format(round( warning_threshold_max_zram_mb), round( warning_threshold_max_zram_percent, 1))) log(' warning_threshold_max_psi: {}'.format( warning_threshold_max_psi)) log(' min_post_warning_delay: {} sec'.format( min_post_warning_delay)) log(' env_cache_time: {}'.format(env_cache_time)) log('\n5. Soft threshold') log(' soft_threshold_min_mem: {} MiB, {} %'.format( round(soft_threshold_min_mem_mb), round( soft_threshold_min_mem_percent, 1))) log(' soft_threshold_min_swap: {}'.format(soft_threshold_min_swap)) log(' soft_threshold_max_zram: {} MiB, {} %'.format( round(soft_threshold_max_zram_mb), round( soft_threshold_max_zram_percent, 1))) log(' soft_threshold_max_psi: {}'.format(soft_threshold_max_psi)) log('\n6. Hard threshold') log(' hard_threshold_min_mem: {} MiB, {} %'.format( round(hard_threshold_min_mem_mb), round( hard_threshold_min_mem_percent, 1))) log(' hard_threshold_min_swap: {}'.format(hard_threshold_min_swap)) log(' hard_threshold_max_zram: {} MiB, {} %'.format( round(hard_threshold_max_zram_mb), round( hard_threshold_max_zram_percent, 1))) log(' hard_threshold_max_psi: {}'.format(hard_threshold_max_psi)) log('\n7. Customize victim selection: adjusting badness of processes') log('\n7.1. Ignore positive oom_score_adj') log(' ignore_positive_oom_score_adj: {}'.format( ignore_positive_oom_score_adj)) log('\n7.2. Adjusting badness of processes by matching with ' 'regular expressions') log('7.2.1. 
Matching process names with RE patterns') if len(badness_adj_re_name_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_name_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.2. Matching CGroup_v1-line with RE patterns') if len(badness_adj_re_cgroup_v1_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_cgroup_v1_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.3. Matching CGroup_v2-line with RE patterns') if len(badness_adj_re_cgroup_v2_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_cgroup_v2_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.4. Matching eUIDs with RE patterns') if len(badness_adj_re_uid_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_uid_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.5. Matching realpath with RE patterns') if len(badness_adj_re_realpath_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_realpath_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns') if len(badness_adj_re_cwd_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_cwd_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.7. Matching cmdlines with RE patterns') if len(badness_adj_re_cmdline_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_cmdline_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('7.2.8. Matching environ with RE patterns') if len(badness_adj_re_environ_list) > 0: log(' badness_adj: regexp:') for i in badness_adj_re_environ_list: log(' {:>12} {}'.format(i[0], i[1])) else: log(' (not set)') log('\n8. Customize soft corrective actions') if len(soft_actions_list) > 0: log(' Match by: regexp: command: ') for i in soft_actions_list: log(' {} {} {}'.format(i[0].ljust(10), i[1].ljust(12), i[2])) else: log(' (not set)') log('\n9. 
Misc') log(' max_soft_exit_time: {} sec'.format(max_soft_exit_time)) log(' post_kill_exe: {}'.format(post_kill_exe)) log(' min_badness: {}'.format(min_badness)) log(' post_soft_action_delay: {} sec'.format( post_soft_action_delay)) log(' post_zombie_delay: {} sec'.format(post_zombie_delay)) log(' victim_cache_time: {} sec'.format(victim_cache_time)) log(' exe_timeout: {} sec'.format(exe_timeout)) log('\n10. Verbosity') log(' print_config_at_startup: {}'.format(print_config_at_startup)) log(' print_mem_check_results: {}'.format(print_mem_check_results)) log(' min_mem_report_interval: {} sec'.format( min_mem_report_interval)) log(' print_proc_table: {}'.format(print_proc_table)) log(' extra_table_info: {}'.format(extra_table_info)) log(' print_victim_status: {}'.format(print_victim_status)) log(' print_victim_cmdline: {}'.format(print_victim_cmdline)) log(' max_victim_ancestry_depth: {}'.format(max_victim_ancestry_depth)) log(' print_statistics: {}'.format(print_statistics)) log(' debug_gui_notifications: {}'.format(debug_gui_notifications)) log(' debug_psi: {}'.format(debug_psi)) log(' debug_sleep: {}'.format(debug_sleep)) log(' debug_threading: {}'.format(debug_threading)) log(' separate_log: {}'.format(separate_log)) if check_config_flag: log('\nconfig is OK') exit() def get_swap_threshold_tuple(string, key): # re (Num %, True) or (Num KiB, False) """Returns KiB value if abs val was set in config, or tuple with %""" # return tuple with abs and bool: (abs %, True) or (abs MiB, False) if string.endswith('%'): value = string_to_float_convert_test(string[:-1]) if value is None or value < 0 or value > 100: invalid_config_key_value(key) return value, True elif string.endswith('M'): value = string_to_float_convert_test(string[:-1]) if value is None or value < 0: invalid_config_key_value(key) return value, False else: invalid_config_key_value(key) def find_cgroup_indexes(): """ Find cgroup-line positions in /proc/*/cgroup file. 
""" cgroup_v1_index = cgroup_v2_index = None with open('/proc/self/cgroup') as f: for index, line in enumerate(f): if ':name=' in line: cgroup_v1_index = index if line.startswith('0::'): cgroup_v2_index = index return cgroup_v1_index, cgroup_v2_index def pid_to_rss(pid): """ """ try: rss = int(rline1( '/proc/{}/statm'.format(pid)).split(' ')[1]) * SC_PAGESIZE except (IndexError, FileNotFoundError, ProcessLookupError): rss = None return rss def pid_to_vm_size(pid): """ """ try: vm_size = int(rline1( '/proc/{}/statm'.format(pid)).partition(' ')[0]) * SC_PAGESIZE except (IndexError, FileNotFoundError, ProcessLookupError): vm_size = None return vm_size def signal_handler(signum, frame): """ """ for i in sig_list: signal(i, signal_handler_inner) log('Got the {} signal '.format( sig_dict[signum])) if len(fd) > 0: for f in fd: fd[f].close() print_stat_dict() m = monotonic() - start_time user_time, system_time = os.times()[0:2] p_time = user_time + system_time p_percent = p_time / m * 100 log('Process time: {}s (average: {}%); exit.'.format( round(p_time, 2), round(p_percent, 2))) exit() def signal_handler_inner(signum, frame): """ """ log('Got the {} signal (ignored) '.format( sig_dict[signum])) def write(path, string): """ """ with open(path, 'w') as f: f.write(string) def valid_re(reg_exp): """Validate regular expression. 
""" try: search(reg_exp, '') except invalid_re: log('Invalid config: invalid regexp: {}'.format(reg_exp)) exit(1) def func_print_proc_table(): """ """ print_proc_table = True find_victim(print_proc_table) exit() def log(*msg): """ """ print(*msg) if separate_log: logging.info(*msg) def print_version(): """ """ if os.path.exists('/usr/local/share/nohang/version'): v = rline1('/usr/local/share/nohang/version') else: try: v = rline1('/usr/share/nohang/version') except FileNotFoundError: v = None if v is None: print('nohang unknown version') else: print('nohang ' + v) exit() def psi_file_mem_to_metrics(psi_path): """ """ with open(psi_path) as f: psi_list = f.readlines() some_list, full_list = psi_list[0].split(' '), psi_list[1].split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] full_avg10 = full_list[1].split('=')[1] full_avg60 = full_list[2].split('=')[1] full_avg300 = full_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300) def pid_to_cgroup_v1(pid): """ """ cgroup_v1 = '' try: with open('/proc/' + pid + '/cgroup') as f: for index, line in enumerate(f): if index == cgroup_v1_index: cgroup_v1 = '/' + line.partition('/')[2][:-1] return cgroup_v1 except (FileNotFoundError, ProcessLookupError): return '' def pid_to_cgroup_v2(pid): """ """ cgroup_v2 = '' try: with open('/proc/' + pid + '/cgroup') as f: for index, line in enumerate(f): if index == cgroup_v2_index: cgroup_v2 = line[3:-1] return cgroup_v2 except (FileNotFoundError, ProcessLookupError): return '' def pid_to_starttime(pid): """ handle FNF error! """ try: starttime = rline1('/proc/' + pid + '/stat').rpartition(')')[ 2].split(' ')[20] except UnicodeDecodeError: with open('/proc/' + pid + '/stat', 'rb') as f: starttime = f.read().decode('utf-8', 'ignore').rpartition( ')')[2].split(' ')[20] return float(starttime) / SC_CLK_TCK def pid_to_nssid(pid): """ handle FNF error! 
""" try: nssid = rline1('/proc/' + pid + '/stat').rpartition(')')[ 2].split(' ')[4] except UnicodeDecodeError: with open('/proc/' + pid + '/stat', 'rb') as f: nssid = f.read().decode('utf-8', 'ignore').rpartition( ')')[2].split(' ')[4] return nssid def get_victim_id(pid): """victim_id is starttime + pid""" try: return rline1('/proc/' + pid + '/stat').rpartition( ')')[2].split(' ')[20] + '_pid' + pid except (FileNotFoundError, ProcessLookupError): return '' def pid_to_state(pid): """ """ try: with open('/proc/' + pid + '/stat', 'rb') as f: return f.read(40).decode('utf-8', 'ignore').rpartition(')')[2][1] except (FileNotFoundError, ProcessLookupError): return '' except IndexError: with open('/proc/' + pid + '/stat', 'rb') as f: return f.read().decode('utf-8', 'ignore').rpartition(')')[2][1] def pid_to_name(pid): """ """ try: with open('/proc/{}/comm'.format(pid), 'rb', buffering=0) as f: return f.read().decode('utf-8', 'ignore')[:-1] except (FileNotFoundError, ProcessLookupError): return '' def pid_to_ppid(pid): """ """ try: with open('/proc/' + pid + '/status') as f: for n, line in enumerate(f): if n is ppid_index: return line.split('\t')[1].strip() except (FileNotFoundError, ProcessLookupError): return '' except UnicodeDecodeError: with open('/proc/' + pid + '/status', 'rb') as f: f_list = f.read().decode('utf-8', 'ignore').split('\n') for i in range(len(f_list)): if i is ppid_index: return f_list[i].split('\t')[1] def pid_to_ancestry(pid, max_victim_ancestry_depth=1): """ """ if max_victim_ancestry_depth == 1: ppid = pid_to_ppid(pid) pname = pid_to_name(ppid) return '\n PPID: {} ({})'.format(ppid, pname) if max_victim_ancestry_depth == 0: return '' anc_list = [] for i in range(max_victim_ancestry_depth): ppid = pid_to_ppid(pid) pname = pid_to_name(ppid) anc_list.append((ppid, pname)) if ppid == '1': break pid = ppid a = '' for i in anc_list: a = a + ' <= PID {} ({})'.format(i[0], i[1]) return '\n ancestry: ' + a[4:] def pid_to_cmdline(pid): """ Get process cmdline 
by pid. pid: str pid of required process returns string cmdline """ try: with open('/proc/' + pid + '/cmdline', 'rb') as f: return f.read().decode('utf-8', 'ignore').replace( '\x00', ' ').rstrip() except (FileNotFoundError, ProcessLookupError): return '' def pid_to_environ(pid): """ Get process environ by pid. pid: str pid of required process returns string environ """ try: with open('/proc/' + pid + '/environ', 'rb') as f: return f.read().decode('utf-8', 'ignore').replace( '\x00', ' ').rstrip() except (FileNotFoundError, ProcessLookupError): return '' def pid_to_realpath(pid): """ """ try: return os.path.realpath('/proc/{}/exe'.format(pid)) except (FileNotFoundError, ProcessLookupError, PermissionError): return '' def pid_to_cwd(pid): """ """ try: return os.path.realpath('/proc/{}/cwd'.format(pid)) except (FileNotFoundError, ProcessLookupError, PermissionError): return '' def pid_to_uid(pid): """return euid""" try: with open('/proc/{}/status'.format(pid), 'rb', buffering=0) as f: f_list = f.read().decode('utf-8', 'ignore').split('\n') return f_list[uid_index].split('\t')[2] except (FileNotFoundError, ProcessLookupError): return '' def pid_to_badness(pid, oom_score): """Find and modify badness (if it needs).""" oom_score_adj = None try: if oom_score is None: oom_score = pid_to_oom_score(pid) if oom_score == 0: return oom_score, oom_score badness = oom_score if ignore_positive_oom_score_adj: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj > 0: badness = badness - oom_score_adj if regex_matching: name = pid_to_name(pid) for re_tup in badness_adj_re_name_list: if search(re_tup[1], name) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_cgroup_v1: cgroup_v1 = pid_to_cgroup_v1(pid) for re_tup in badness_adj_re_cgroup_v1_list: if search(re_tup[1], cgroup_v1) is not None: badness_adj = 
int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_cgroup_v2: cgroup_v2 = pid_to_cgroup_v2(pid) for re_tup in badness_adj_re_cgroup_v2_list: if search(re_tup[1], cgroup_v2) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_realpath: realpath = pid_to_realpath(pid) for re_tup in badness_adj_re_realpath_list: if search(re_tup[1], realpath) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_cwd: cwd = pid_to_cwd(pid) for re_tup in badness_adj_re_cwd_list: if search(re_tup[1], cwd) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_cmdline: cmdline = pid_to_cmdline(pid) for re_tup in badness_adj_re_cmdline_list: if search(re_tup[1], cmdline) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_environ: environ = pid_to_environ(pid) for re_tup in badness_adj_re_environ_list: if search(re_tup[1], environ) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness += badness_adj else: if oom_score_adj is None: oom_score_adj = pid_to_oom_score_adj(pid) if oom_score_adj >= 0: badness += badness_adj if re_match_uid: uid = pid_to_uid(pid) for re_tup in badness_adj_re_uid_list: if search(re_tup[1], uid) is not None: badness_adj = int(re_tup[0]) if badness_adj <= 0: badness 
+= badness_adj
                    else:
                        if oom_score_adj is None:
                            oom_score_adj = pid_to_oom_score_adj(pid)
                        if oom_score_adj >= 0:
                            badness += badness_adj

        if badness < 0:
            badness = 0

        return badness, oom_score

    except (FileNotFoundError, ProcessLookupError):
        return None, None


def pid_to_status(pid):
    """
    Parse /proc/<pid>/status.
    Return (name, state, ppid, uid, vm_size, vm_rss, vm_swap),
    or None if the process is gone or a value cannot be parsed.
    """
    try:
        with open('/proc/{}/status'.format(pid), 'rb', buffering=0) as f:
            f_list = f.read().decode('utf-8', 'ignore').split('\n')

        # Indices are plain integers, so compare them with ==, not "is"
        for i in range(len(f_list)):
            if i == 0:
                name = f_list[i].split('\t')[1]
            if i == state_index:
                state = f_list[i].split('\t')[1][0]
            if i == ppid_index:
                ppid = f_list[i].split('\t')[1]
            if i == uid_index:
                uid = f_list[i].split('\t')[2]
            if i == vm_size_index:
                vm_size = kib_to_mib(
                    int(f_list[i].split('\t')[1][:-3]))
            if i == vm_rss_index:
                vm_rss = kib_to_mib(int(f_list[i].split('\t')[1][:-3]))
            if i == vm_swap_index:
                vm_swap = kib_to_mib(int(f_list[i].split('\t')[1][:-3]))

        return name, state, ppid, uid, vm_size, vm_rss, vm_swap

    except (FileNotFoundError, ProcessLookupError, ValueError):
        return None


def uptime():
    """Return system uptime in seconds (first field of /proc/uptime)."""
    return float(rline1('/proc/uptime').split(' ')[0])


def errprint(*text):
    """Print to stderr and, if separate_log is enabled, to the log file."""
    print(*text, file=stderr, flush=True)
    try:
        if separate_log:
            logging.info(*text)
    except NameError:
        # separate_log is not defined until the config has been parsed
        pass


def mlockall():
    """Lock process memory; retry without MCL_ONFAULT if that fails."""
    MCL_CURRENT = 1
    MCL_FUTURE = 2
    MCL_ONFAULT = 4

    libc = CDLL(None, use_errno=True)
    result = libc.mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)

    if result != 0:
        result = libc.mlockall(MCL_CURRENT | MCL_FUTURE)
        if result != 0:
            # NOTE: result is the mlockall() return value, not errno
            log('WARNING: cannot lock process memory: [Errno {}]'.format(
                result))


def update_stat_dict(key):
    """Increment the counter of the corrective action named by key."""
    if key is not None:
        if key not in stat_dict:
            stat_dict.update({key: 1})
        else:
            new_value = stat_dict[key] + 1
            stat_dict.update({key: new_value})


def print_stat_dict():
    """Log statistics on corrective actions if print_statistics is set."""
    if print_statistics:
        lsd = len(stat_dict)
        if lsd == 0:
            log('No corrective actions applied in the last {}'.format(
                format_time(monotonic() - start_time)))
        else:
            stats_msg = 'What happened in the last {}:'.format(
                format_time(monotonic() - start_time))
            for i in stat_dict:
                stats_msg += '\n {}: {}'.format(i, stat_dict[i])
            log(stats_msg)


def find_psi_metrics_value(psi_path, psi_metrics):
    """
    Read the PSI file at psi_path and return the requested metric
    (some/full avg10, avg60 or avg300) as a float, or None on failure.
    """
    psi_data = read_path(psi_path)

    if psi_data is None:
        return None

    try:
        if psi_metrics == 'some_avg10':
            return float(psi_data.split('\n')[0].split(' ')[1].split('=')[1])
        if psi_metrics == 'some_avg60':
            return float(psi_data.split('\n')[0].split(' ')[2].split('=')[1])
        if psi_metrics == 'some_avg300':
            return float(psi_data.split('\n')[0].split(' ')[3].split('=')[1])
        if psi_metrics == 'full_avg10':
            return float(psi_data.split('\n')[1].split(' ')[1].split('=')[1])
        if psi_metrics == 'full_avg60':
            return float(psi_data.split('\n')[1].split(' ')[2].split('=')[1])
        if psi_metrics == 'full_avg300':
            return float(psi_data.split('\n')[1].split(' ')[3].split('=')[1])
    except Exception as e:
        if debug_psi:
            log('Invalid psi_path: {}'.format(e))
        return None


def check_mem_and_swap():
    """Return (MemAvailable, SwapTotal, SwapFree) in KiB."""
    fd['mi'].seek(0)

    m_list = fd['mi'].read().decode().split(' kB\n')

    ma = int(m_list[mem_available_index].split(':')[1])
    st = int(m_list[swap_total_index].split(':')[1])
    sf = int(m_list[swap_free_index].split(':')[1])

    if ZFS:
        ma += arcstats()

    return ma, st, sf


def meminfo():
    """Return a dict of memory and swap values (KiB) from /proc/meminfo."""
    fd['mi'].seek(0)

    m_list = fd['mi'].read().decode().split(' kB\n')

    mem_available = int(m_list[mem_available_index].split(':')[1])
    mem_free = int(m_list[mem_free_index].split(':')[1])
    swap_total = int(m_list[swap_total_index].split(':')[1])
    swap_free = int(m_list[swap_free_index].split(':')[1])
    buffers = int(m_list[buffers_index].split(':')[1])
    cached = int(m_list[cached_index].split(':')[1])
    sreclaimable = int(m_list[sreclaimable_index].split(':')[1])
    shmem = int(m_list[shmem_info_index].split(':')[1])

    # Apply the ZFS ARC adjustment before 'available' is stored;
    # previously it was added after the assignment and had no effect.
    if ZFS:
        mem_available += arcstats()

    md = dict()

    md['total'] = mem_total
    md['used'] = mem_total - mem_free - buffers - cached - sreclaimable
    md['free'] = mem_free
    md['available'] = mem_available
    md['shared'] = shmem
    md['buffers'] = buffers
    md['cache'] = cached + sreclaimable
    md['swap_total'] = swap_total
    md['swap_used'] = swap_total - swap_free
    md['swap_free'] = swap_free

    return md


def check_zram():
    """Find MemUsedZram (mem_used_total)."""
    if os.path.exists('/sys/block/zram0/mem_limit'):

        summa = 0

        if os.path.exists('/sys/block/zram0/mm_stat'):

            for dev in os.listdir('/sys/block'):
                try:
                    with open('/sys/block/{}/mm_stat'.format(
                            dev), 'rb', buffering=0) as f:
                        summa += int(f.read().decode().split()[2])
                except FileNotFoundError:
                    continue

            return summa / 1024

        else:

            for dev in os.listdir('/sys/block'):
                try:
                    with open('/sys/block/{}/mem_used_total'.format(
                            dev), 'rb', buffering=0) as f:
                        summa += int(f.read())
                except FileNotFoundError:
                    continue

            return summa / 1024

    else:
        return 0


def format_time(t):
    """Format a duration in seconds as a human-readable string."""
    total_s = int(t)

    if total_s < 60:
        return '{}s'.format(round(t, 1))

    if total_s < 3600:
        total_m = total_s // 60
        mod_s = total_s % 60
        return '{}min {}s'.format(total_m, mod_s)

    if total_s < 86400:
        total_m = total_s // 60
        mod_s = total_s % 60
        total_h = total_m // 60
        mod_m = total_m % 60
        return '{}h {}min {}s'.format(total_h, mod_m, mod_s)

    total_m = total_s // 60
    mod_s = total_s % 60
    total_h = total_m // 60
    mod_m = total_m % 60
    total_d = total_h // 24
    mod_h = total_h % 24
    return '{}d {}h {}min {}s'.format(total_d, mod_h, mod_m, mod_s)


def string_to_float_convert_test(string):
    """Try to interpret string values as floats."""
    try:
        return float(string)
    except ValueError:
        return None


def string_to_int_convert_test(string):
    """Try to interpret string values as integers."""
    try:
        return int(string)
    except ValueError:
        return None


def conf_parse_string(param):
    """
    Get string parameters from the config dict.
    param: config_dict key
    returns config_dict[param].strip()
    """
    if param in config_dict:
        return config_dict[param].strip()
    else:
        missing_config_key(param)


def conf_parse_bool(param):
    """
    Get bool parameters from the config_dict.
    param: config_dict key
    returns bool
    """
    if param in config_dict:
        param_str = config_dict[param]
        if param_str == 'True':
            return True
        elif param_str == 'False':
            return False
        else:
            invalid_config_key_value(param)
    else:
        missing_config_key(param)


def rline1(path):
    """Read 1st line from the path."""
    try:
        with open(path) as f:
            for line in f:
                return line.rstrip()
    except UnicodeDecodeError:
        with open(path, 'rb') as f:
            return f.read(999).decode(
                'utf-8', 'ignore').partition('\n')[0]


def kib_to_mib(num):
    """Convert KiB values to MiB values."""
    return round(num / 1024.0)


def percent(num):
    """Interpret num as a percentage."""
    return round(num * 100, 1)


def just_percent_mem(num):
    """Convert num to percent and right-justify to width 4."""
    return str(round(num * 100, 1)).rjust(4, ' ')


def just_percent_swap(num):
    """Convert num to percent and right-justify to width 5."""
    return str(round(num * 100, 1)).rjust(5, ' ')


def human(num, lenth):
    """Convert KiB values to MiB values with right alignment."""
    return str(round(num / 1024)).rjust(lenth, ' ')


def is_alive(pid):
    """Return True if the process exists and has non-zero RSS."""
    try:
        with open('/proc/{}/statm'.format(pid), 'rb', buffering=0) as f:
            rss = f.read().decode().split(' ')[1]
        if rss != '0':
            return True
        return False
    except (FileNotFoundError, ProcessLookupError, NotADirectoryError,
            PermissionError):
        return False


def alive_pid_list():
    """Return PIDs with non-zero RSS, excluding init and nohang itself."""
    pid_list = []

    for pid in os.listdir('/proc'):
        if not pid[0].isdecimal():
            continue
        if is_alive(pid):
            pid_list.append(pid)

    pid_list.remove(self_pid)

    if '1' in pid_list:
        pid_list.remove('1')

    return pid_list


def pid_to_oom_score(pid):
    """Return oom_score of the process, or 0 if it cannot be read."""
    try:
        with open('/proc/{}/oom_score'.format(pid), 'rb', buffering=0) as f:
            return int(f.read())
    except (FileNotFoundError, ProcessLookupError, NotADirectoryError):
        return 0


def pid_to_oom_score_adj(pid):
    """Return oom_score_adj of the process, or 0 if it cannot be read."""
    try:
        with open(
                '/proc/{}/oom_score_adj'.format(pid), 'rb', buffering=0) as f:
            return int(f.read())
    except (FileNotFoundError, ProcessLookupError, NotADirectoryError):
        return 0


def badness_pid_list():
    """Return a list of (pid, badness) for tasks with non-zero oom_score."""
    pid_b_list = []
    for pid in os.listdir('/proc'):
        o = pid_to_oom_score(pid)
if o >= 1: if not pid[0].isdecimal(): continue if pid == self_pid or pid == '1': continue b = pid_to_badness(pid, o)[0] # log('PID: {}, oom_score: {}, badness: {}, Name: {}'.format( # pid, o, b, pid_to_name(pid))) pid_b_list.append((pid, b)) return pid_b_list def fast_find_victim(): """ """ ft1 = monotonic() pid_badness_list = badness_pid_list() real_proc_num = len(pid_badness_list) if real_proc_num == 0: log('Found {} tasks with non-zero oom_score (except init and self) ' 'in {}ms'.format(real_proc_num, round((monotonic() - ft1) * 1000))) return None log('Found {} tasks with non-zero oom_score (except init and self) ' 'in {}ms'.format(real_proc_num, round((monotonic() - ft1) * 1000))) # Make list of (pid, badness) tuples, sorted by 'badness' values pid_badness_list_sorted = sorted( pid_badness_list, key=itemgetter(1), reverse=True) m0 = monotonic() top_n = 15 if real_proc_num < top_n: top_n = real_proc_num log('TOP-{} tasks by badness:'.format(top_n)) log(' Name PID badness') log(' --------------- ------- -------') for pid_badness in pid_badness_list_sorted[0:top_n]: p = pid_badness[0] b = str(pid_badness[1]) n = pid_to_name(p) log(' {} {} {}'.format(n.ljust(15), p.rjust(7), b.rjust(7))) pid = pid_badness_list_sorted[0][0] victim_id = get_victim_id(pid) # Get maximum 'badness' value victim_badness = pid_badness_list_sorted[0][1] victim_name = pid_to_name(pid) log('TOP printed in {}ms; process with highest badness:\n PID: {}, na' 'me: {}, badness: {}'.format( round((monotonic() - m0) * 1000), pid, victim_name, victim_badness )) return pid, victim_badness, victim_name, victim_id def find_victim(_print_proc_table): """ Find the process with highest badness and its badness adjustment Return pid and badness """ if not _print_proc_table: return fast_find_victim() ft1 = monotonic() pid_list = alive_pid_list() pid_badness_list = [] if _print_proc_table: if extra_table_info == 'None': extra_table_title = '' elif extra_table_info == 'cgroup_v1': extra_table_title = 
'CGroup_v1' elif extra_table_info == 'cgroup_v2': extra_table_title = 'CGroup_v2' elif extra_table_info == 'cmdline': extra_table_title = 'cmdline' elif extra_table_info == 'environ': extra_table_title = 'environ' elif extra_table_info == 'realpath': extra_table_title = 'realpath' elif extra_table_info == 'cwd': extra_table_title = 'cwd' else: extra_table_title = '' hr = '#' * 107 log('Tasks state (memory values in mebibytes):') log(hr) log('# PID PPID badness oom_score oom_score_adj e' 'UID S VmSize VmRSS VmSwap Name {}'.format( extra_table_title)) log('#------- ------- ------- --------- ------------- -------' '--- - ------ ----- ------ ---------------') for pid in pid_list: badness = pid_to_badness(pid, None)[0] if badness is None: continue if _print_proc_table: try: oom_score = pid_to_oom_score(pid) oom_score_adj = pid_to_oom_score_adj(pid) except FileNotFoundError: continue if pid_to_status(pid) is None: continue else: (name, state, ppid, uid, vm_size, vm_rss, vm_swap) = pid_to_status(pid) if extra_table_info == 'None': extra_table_line = '' elif extra_table_info == 'cgroup_v1': extra_table_line = pid_to_cgroup_v1(pid) elif extra_table_info == 'cgroup_v2': extra_table_line = pid_to_cgroup_v2(pid) elif extra_table_info == 'cmdline': extra_table_line = pid_to_cmdline(pid) elif extra_table_info == 'environ': extra_table_line = pid_to_environ(pid) elif extra_table_info == 'realpath': extra_table_line = pid_to_realpath(pid) elif extra_table_info == 'cwd': extra_table_line = pid_to_cwd(pid) else: extra_table_line = '' log('#{} {} {} {} {} {} {} {} {} {} {} {}'.format( pid.rjust(7), ppid.rjust(7), str(badness).rjust(7), str(oom_score).rjust(9), str(oom_score_adj).rjust(13), uid.rjust(10), state, str(vm_size).rjust(6), str(vm_rss).rjust(5), str(vm_swap).rjust(6), name.ljust(15), extra_table_line)) pid_badness_list.append((pid, badness)) real_proc_num = len(pid_badness_list) # Make list of (pid, badness) tuples, sorted by 'badness' values # print(pid_badness_list) 
pid_tuple_list = sorted( pid_badness_list, key=itemgetter(1), reverse=True )[0] pid = pid_tuple_list[0] victim_id = get_victim_id(pid) # Get maximum 'badness' value victim_badness = pid_tuple_list[1] victim_name = pid_to_name(pid) if _print_proc_table: log(hr) log('Found {} tasks with non-zero VmRSS (except init and self)'.format( real_proc_num)) log('Process with highest badness (found in {}ms):\n PID: {}, Na' 'me: {}, badness: {}'.format( round((monotonic() - ft1) * 1000), pid, victim_name, victim_badness)) return pid, victim_badness, victim_name, victim_id def find_victim_info(pid, victim_badness, name): """ """ status0 = monotonic() try: with open('/proc/{}/status'.format(pid), 'rb', buffering=0) as f: f_list = f.read().decode('utf-8', 'ignore').split('\n') for i in range(len(f_list)): if i is state_index: state = f_list[i].split('\t')[1].rstrip() if i is uid_index: uid = f_list[i].split('\t')[2] if i is vm_size_index: vm_size = kib_to_mib( int(f_list[i].split('\t')[1][:-3])) if i is vm_rss_index: vm_rss = kib_to_mib(int(f_list[i].split('\t')[1][:-3])) if detailed_rss: if i is anon_index: anon_rss = kib_to_mib( int(f_list[i].split('\t')[1][:-3])) if i is file_index: file_rss = kib_to_mib( int(f_list[i].split('\t')[1][:-3])) if i is shmem_index: shmem_rss = kib_to_mib( int(f_list[i].split('\t')[1][:-3])) if i is vm_swap_index: vm_swap = kib_to_mib( int(f_list[i].split('\t')[1][:-3])) if print_victim_cmdline: cmdline = pid_to_cmdline(pid) oom_score = pid_to_oom_score(pid) oom_score_adj = pid_to_oom_score_adj(pid) except (IndexError, ValueError): x = 'Selected process died before corrective action' log(x) update_stat_dict(x) print_stat_dict() return None try: realpath = pid_to_realpath(pid) cwd = pid_to_cwd(pid) nssid = pid_to_nssid(pid) victim_lifetime = format_time(uptime() - pid_to_starttime(pid)) victim_cgroup_v1 = pid_to_cgroup_v1(pid) victim_cgroup_v2 = pid_to_cgroup_v2(pid) except FileNotFoundError: x = 'Selected process died before corrective action' 
log(x) update_stat_dict(x) print_stat_dict() return None ancestry = pid_to_ancestry(pid, max_victim_ancestry_depth) if not print_victim_cmdline: cmdline = '' c1 = '' else: c1 = '\n cmdline: ' if detailed_rss: detailed_rss_info = ' (Anon: {}, File: {}, Shmem: {})'.format( anon_rss, file_rss, shmem_rss) else: detailed_rss_info = '' victim_info = 'Victim status (found in {}ms):' \ '\n PID: {}, name: {}, state: {}, EUID: {}, ' \ 'SID: {} ({}), lifetime: {}' \ '\n badness: {}, oom_score: {}, oom_score_adj: {}' \ '\n Vm, MiB: Size: {}, RSS: {}{}, Swap: {}' \ '\n cgroup_v1: {}' \ '\n cgroup_v2: {}' \ '{}{}{}' \ '\n exe realpath: {}' \ '\n cwd realpath: {}'.format( round((monotonic() - status0) * 1000), pid, name, state, uid, nssid, pid_to_name(nssid), victim_lifetime, victim_badness, oom_score, oom_score_adj, vm_size, vm_rss, detailed_rss_info, vm_swap, victim_cgroup_v1, victim_cgroup_v2, ancestry, c1, cmdline, realpath, cwd) return victim_info def check_mem_swap_ex(): """ Check: is mem and swap threshold exceeded? 
Return: None, (SIGTERM, meminfo), (SIGKILL, meminfo) """ mem_available, swap_total, swap_free = check_mem_and_swap() # if hard_threshold_min_swap is set in percent if swap_kill_is_percent: hard_threshold_min_swap_kb = swap_total * \ hard_threshold_min_swap_percent / 100.0 else: hard_threshold_min_swap_kb = swap_kb_dict['hard_threshold_min_swap_kb'] if swap_term_is_percent: soft_threshold_min_swap_kb = swap_total * \ soft_threshold_min_swap_percent / 100.0 else: soft_threshold_min_swap_kb = swap_kb_dict['soft_threshold_min_swap_kb'] if swap_warn_is_percent: warning_threshold_min_swap_kb = swap_total * \ warning_threshold_min_swap_percent / 100.0 else: warning_threshold_min_swap_kb = swap_kb_dict[ 'warning_threshold_min_swap_kb'] if swap_total > hard_threshold_min_swap_kb: swap_sigkill_pc = percent( hard_threshold_min_swap_kb / (swap_total + 0.1)) else: swap_sigkill_pc = '-' if swap_total > soft_threshold_min_swap_kb: swap_sigterm_pc = percent( soft_threshold_min_swap_kb / (swap_total + 0.1)) else: swap_sigterm_pc = '-' if (mem_available <= hard_threshold_min_mem_kb and swap_free <= hard_threshold_min_swap_kb): mem_info = 'Memory status that requires corrective actions:\n Mem' \ 'Available [{} MiB, {} %] <= hard_threshold_min_mem [{} MiB' \ ', {} %]\n SwapFree [{} MiB, {} %] <= hard_threshold_m' \ 'in_swap [{} MiB, {} %]'.format( kib_to_mib(mem_available), percent(mem_available / mem_total), kib_to_mib(hard_threshold_min_mem_kb), round(hard_threshold_min_mem_percent, 1), kib_to_mib(swap_free), percent(swap_free / (swap_total + 0.1)), kib_to_mib(hard_threshold_min_swap_kb), swap_sigkill_pc) return (SIGKILL, mem_info, mem_available, hard_threshold_min_swap_kb, soft_threshold_min_swap_kb, swap_free, swap_total) if (mem_available <= soft_threshold_min_mem_kb and swap_free <= soft_threshold_min_swap_kb): mem_info = 'Memory status that requires corrective actions:\n M' \ 'emAvailable [{} MiB, {} %] <= soft_threshold_min_mem [{} MiB,' \ ' {} %]\n SwapFree [{} MiB, {} %] <= 
soft_threshold_min_swa' \ 'p [{} MiB, {} %]'.format( kib_to_mib(mem_available), percent(mem_available / mem_total), kib_to_mib(soft_threshold_min_mem_kb), round(soft_threshold_min_mem_percent, 1), kib_to_mib(swap_free), percent(swap_free / (swap_total + 0.1)), kib_to_mib(soft_threshold_min_swap_kb), swap_sigterm_pc) return (SIGTERM, mem_info, mem_available, hard_threshold_min_swap_kb, soft_threshold_min_swap_kb, swap_free, swap_total) if low_memory_warnings_enabled: if (mem_available <= warning_threshold_min_mem_kb and swap_free <= warning_threshold_min_swap_kb + 0.1): return ('WARN', None, mem_available, hard_threshold_min_swap_kb, soft_threshold_min_swap_kb, swap_free, swap_total) return (None, None, mem_available, hard_threshold_min_swap_kb, soft_threshold_min_swap_kb, swap_free, swap_total) def check_zram_ex(): """ """ mem_used_zram = check_zram() ma_hard_threshold_exceded = bool( mem_available <= hard_threshold_min_mem_kb) ma_soft_threshold_exceded = bool( mem_available <= soft_threshold_min_mem_kb) ma_warning_threshold_exceded = bool( mem_available <= warning_threshold_min_mem_kb) if (mem_used_zram >= hard_threshold_max_zram_kb and ma_hard_threshold_exceded): mem_info = 'Memory status that requires corrective actions:\n MemAv' \ 'ailable [{} MiB, {} %] <= hard_threshold_min_mem [{} MiB' \ ', {} %]\n MemUsedZram [{} MiB, {} %] >= hard_threshold_' \ 'max_zram [{} MiB, {} %]'.format( kib_to_mib(mem_available), percent(mem_available / mem_total), kib_to_mib(hard_threshold_min_mem_kb), round(hard_threshold_min_mem_percent, 1), kib_to_mib(mem_used_zram), percent(mem_used_zram / mem_total), kib_to_mib(hard_threshold_max_zram_kb), percent(hard_threshold_max_zram_kb / mem_total)) return SIGKILL, mem_info, mem_used_zram if (mem_used_zram >= soft_threshold_max_zram_kb and ma_soft_threshold_exceded): mem_info = 'Memory status that requires corrective actions:\n MemA' \ 'vailable [{} MiB, {} %] <= soft_threshold_min_mem [{} M' \ 'iB, {} %]\n MemUsedZram [{} MiB, {} %] >= 
soft_thresho' \
            'ld_max_zram [{} MiB, {} %]'.format(
                kib_to_mib(mem_available),
                percent(mem_available / mem_total),
                kib_to_mib(soft_threshold_min_mem_kb),
                round(soft_threshold_min_mem_percent, 1),
                kib_to_mib(mem_used_zram),
                percent(mem_used_zram / mem_total),
                kib_to_mib(soft_threshold_max_zram_kb),
                percent(soft_threshold_max_zram_kb / mem_total))

        return SIGTERM, mem_info, mem_used_zram

    if low_memory_warnings_enabled:
        if (mem_used_zram >= warning_threshold_max_zram_kb and
                ma_warning_threshold_exceded):
            return 'WARN', None, mem_used_zram

    return None, None, mem_used_zram


def check_psi_ex(psi_kill_exceeded_timer, psi_term_exceeded_timer, x0,
                 mem_available):
    """
    Check whether the PSI thresholds are exceeded.
    Return (SIGKILL, SIGTERM, 'WARN' or None; mem_info or None;
    psi_kill_exceeded_timer; psi_term_exceeded_timer; x0)
    """
    ma_hard_threshold_exceded = bool(
        mem_available <= hard_threshold_min_mem_kb)
    ma_soft_threshold_exceded = bool(
        mem_available <= soft_threshold_min_mem_kb)
    ma_warning_threshold_exceded = bool(
        mem_available <= warning_threshold_min_mem_kb)

    if not (ma_warning_threshold_exceded or ma_soft_threshold_exceded or
            ma_hard_threshold_exceded) or swap_total == 0:

        if debug_psi:
            log('Do not measure the value of PSI, since none of the thresho'
                'lds of available memory is exceeded')

        return (None, None,
                psi_kill_exceeded_timer, psi_term_exceeded_timer,
                x0)

    delta0 = monotonic() - x0
    x0 = monotonic()

    psi_avg_value = find_psi_metrics_value(psi_path, psi_metrics)

    if debug_psi:
        log('-------------------------------------------------------------'
            '-----------')
        log('PSI {} value in {}: {}'.format(
            psi_metrics, psi_path, psi_avg_value))

    if psi_avg_value is None:
        return (None, None, -0.0001, -0.0001, x0)

    psi_post_action_delay_timer = monotonic() - last_action_dict['t']

    psi_post_action_delay_exceeded = bool(
        psi_post_action_delay_timer >= psi_post_action_delay)

    if psi_avg_value >= hard_threshold_max_psi:
        sigkill_psi_exceeded = True
        if ma_hard_threshold_exceded:
            if psi_kill_exceeded_timer < 0:
                psi_kill_exceeded_timer = 0
            else:
                psi_kill_exceeded_timer += delta0
        else:
            psi_kill_exceeded_timer = -0.0001
    else:
        sigkill_psi_exceeded = False
psi_kill_exceeded_timer = -0.0001 if debug_psi: log('psi_post_action_delay_timer: {}, psi_post_action_delay_exceed' 'ed: {}'.format( round(psi_post_action_delay_timer, 1), psi_post_action_delay_exceeded)) log('mem_avail_hard_threshold_exceded: {}, hard_threshold_psi_exce' 'eded: {}, hard_psi_excess_duration: {}'.format( ma_hard_threshold_exceded, sigkill_psi_exceeded, round(psi_kill_exceeded_timer, 1))) if (sigkill_psi_exceeded and psi_kill_exceeded_timer >= psi_excess_duration and psi_post_action_delay_exceeded and ma_hard_threshold_exceded): mem_info = 'Memory status that requires corrective actions:\n MemAv' \ 'ailable [{} MiB, {} %] <= hard_threshold_min_mem [{} MiB' \ ', {} %]\n Current PSI metric value ({}) >= hard_thresho' \ 'ld_max_psi ({})\n PSI metric value exceeded psi_excess_' \ 'duration ({}s) for {}s'.format( kib_to_mib(mem_available), percent(mem_available / mem_total), kib_to_mib(hard_threshold_min_mem_kb), round(hard_threshold_min_mem_percent, 1), psi_avg_value, hard_threshold_max_psi, psi_excess_duration, round(psi_kill_exceeded_timer, 1)) return (SIGKILL, mem_info, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0) if psi_avg_value >= soft_threshold_max_psi: sigterm_psi_exceeded = True if ma_soft_threshold_exceded: if psi_term_exceeded_timer < 0: psi_term_exceeded_timer = 0 else: psi_term_exceeded_timer += delta0 else: psi_term_exceeded_timer = -0.0001 else: sigterm_psi_exceeded = False psi_term_exceeded_timer = -0.0001 if debug_psi: log('mem_avail_soft_threshold_exceded: {}, soft_threshold_psi_exce' 'eded: {}, soft_psi_excess_duration: {}'.format( ma_soft_threshold_exceded, sigterm_psi_exceeded, round(psi_term_exceeded_timer, 1))) if (sigterm_psi_exceeded and psi_term_exceeded_timer >= psi_excess_duration and psi_post_action_delay_exceeded and ma_soft_threshold_exceded): mem_info = 'Memory status that requires corrective actions:\n MemA' \ 'vailable [{} MiB, {} %] <= soft_threshold_min_mem [{} M' \ 'iB, {} %]\n Current PSI metric value ({}) 
>= soft_thre' \ 'shold_max_psi ({})\n PSI metric value exceeded psi_exc' \ 'ess_duration ({}s) for {}s'.format( kib_to_mib(mem_available), percent(mem_available / mem_total), kib_to_mib(soft_threshold_min_mem_kb), round(soft_threshold_min_mem_percent, 1), psi_avg_value, soft_threshold_max_psi, psi_excess_duration, round(psi_term_exceeded_timer, 1)) return (SIGTERM, mem_info, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0) if low_memory_warnings_enabled: if (psi_avg_value >= warning_threshold_max_psi and ma_warning_threshold_exceded): return ('WARN', None, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0) return (None, None, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0) def is_victim_alive(victim_id): """ We do not have a reliable sign of the end of the release of memory: https://github.com/rfjakob/earlyoom/issues/128#issuecomment-507023717 """ _, pid = victim_id.split('_pid') new_victim_id = get_victim_id(pid) if victim_id != new_victim_id: return 0 if is_alive(pid): return 1 state = pid_to_state(pid) if state == 'R': return 2 if state == 'Z': return 3 if state == 'X' or state == '': return 0 return 0 def log_meminfo(): """ """ mid = meminfo() log('Memory info, MiB:') log(' total={}, used={}, free={}, available={}, shared={}, buffers' '={}, cache={},'.format( round(mem_total / 1024), round(mid['used'] / 1024), round(mid['free'] / 1024), round(mid['available'] / 1024), round(mid['shared'] / 1024), round(mid['buffers'] / 1024), round(mid['cache'] / 1024) )) log(' swap_total={}, swap_used={}, swap_free={}'.format( round(mid['swap_total'] / 1024), round(mid['swap_used'] / 1024), round(mid['swap_free'] / 1024) )) if PSI_KERNEL_OK: mp = psi_file_mem_to_metrics('/proc/pressure/memory') log('Memory pressure (system-wide):') log(' some avg10={} avg60={} avg300={}'.format( mp[0], mp[1], mp[2] )) log(' full avg10={} avg60={} avg300={}'.format( mp[3], mp[4], mp[5] )) def is_post_oom_delay_exceeded(): """ """ oom_t = oom_dict['t'] if oom_t is not None: 
        post_oom_t = monotonic() - oom_t
        if post_oom_t < post_oom_delay:
            log('Time since OOM: {}s; post OOM delay ({}s) is not exceed'
                'ed'.format(round(post_oom_t, 3), post_oom_delay))

            if debug_sleep:
                log('Sleep {}s'.format(over_sleep))

            sleep(over_sleep)
            return False
        else:
            return True
    else:
        return True


def implement_corrective_action(
        threshold,
        mem_info_list,
        psi_kill_exceeded_timer,
        psi_term_exceeded_timer,
        x0,
        psi_threshold,
        zram_threshold,
        zram_info,
        psi_info):

    log(separator_in)

    debug_corrective_action = True

    post_oom_delay_exceeded = is_post_oom_delay_exceeded()
    if not post_oom_delay_exceeded:
        log(separator_out)
        return None

    time0 = monotonic()

    # Drop cached victims that are already dead or zombies
    nu = []

    for victim_id in v_dict:
        iva = is_victim_alive(victim_id)
        if iva == 0 or iva == 3:
            nu.append(victim_id)

    for i in nu:
        if debug_corrective_action:
            log('Remove {} from v_dict'.format(i))
        v_dict.pop(i)

    x = False
    cache_list = []

    for victim_id in v_dict:
        tx = v_dict[victim_id]['time']
        ddt = monotonic() - tx
        if ddt < victim_cache_time:

            if debug_corrective_action:
                log('victim_cache_time is not exceeded for {} ({} <'
                    ' {})'.format(victim_id, round(ddt, 3),
                                  victim_cache_time))

            x = True
            cache_list.append((victim_id, ddt))
            break

    if x:
        e = sorted(cache_list, key=itemgetter(1), reverse=False)
        cached_victim_id = e[0][0]

    for i in mem_info_list:
        log(i)

    if x:
        victim_id = cached_victim_id
        pid = victim_id.partition('_pid')[2]
        victim_badness = pid_to_badness(pid, None)[0]
        name = v_dict[victim_id]['name']
        log('New victim is cached victim {} ({})'.format(pid, name))
    else:
        s1 = set(os.listdir('/proc'))
        fff = find_victim(print_proc_table)
        # sleep(0.1)
        s2 = set(os.listdir('/proc'))
        dset = s1 - s2

        if len(dset) > 0:
            log('During the search for the victim, the following processes'
                ' died: {}'.format(dset))
            sleep(over_sleep)
            log(separator_out)
            return None

        if fff is None:

            if debug_sleep:
                log('Sleep {}s'.format(over_sleep))

            sleep(over_sleep)
            log(separator_out)
            return None

        pid, victim_badness, name, victim_id = fff

    post_oom_delay_exceeded = is_post_oom_delay_exceeded()
    if not post_oom_delay_exceeded:
        log(separator_out)
        return None

    log('Recheck memory levels...')

    (masf_threshold, masf_info, mem_available, _, _, swap_free, _
     ) = check_mem_swap_ex()

    if zram_checking_enabled:
        zram_threshold, zram_info, _ = check_zram_ex()

    if CHECK_PSI:
        (psi_threshold, psi_info, psi_kill_exceeded_timer,
         psi_term_exceeded_timer, x0) = check_psi_ex(
            psi_kill_exceeded_timer, psi_term_exceeded_timer, x0,
            mem_available)

    if (masf_threshold is SIGKILL or zram_threshold is SIGKILL or
            psi_threshold is SIGKILL):

        new_threshold = SIGKILL
        mem_info_list = []

        if masf_threshold is SIGKILL or masf_threshold is SIGTERM:
            mem_info_list.append(masf_info)

        if zram_threshold is SIGKILL or zram_threshold is SIGTERM:
            mem_info_list.append(zram_info)

        if psi_threshold is SIGKILL or psi_threshold is SIGTERM:
            mem_info_list.append(psi_info)

    elif (masf_threshold is SIGTERM or zram_threshold is SIGTERM or
          psi_threshold is SIGTERM):

        new_threshold = SIGTERM
        mem_info_list = []

        if masf_threshold is SIGKILL or masf_threshold is SIGTERM:
            mem_info_list.append(masf_info)

        if zram_threshold is SIGKILL or zram_threshold is SIGTERM:
            mem_info_list.append(zram_info)

        if psi_threshold is SIGKILL or psi_threshold is SIGTERM:
            mem_info_list.append(psi_info)

    else:
        log('Thresholds are not exceeded now')
        log(separator_out)
        return None

    for i in mem_info_list:
        log(i)

    if new_threshold is None or new_threshold == 'WARN':
        log('Thresholds are not exceeded now')
        log(separator_out)
        return None

    threshold = new_threshold

    vwd = None  # Victim Will Die

    if threshold is SIGTERM:
        if victim_id in v_dict:
            dt = monotonic() - v_dict[victim_id]['time']
            if dt > max_soft_exit_time:
                log('max_soft_exit_time (value={}s) is exceeded for the victim'
                    ', hard corrective action will be applied'.format(
                        max_soft_exit_time))
                threshold = SIGKILL
            else:
                log('max_soft_exit_time is not exceeded ('
                    '{} < {}) for the victim'.format(round(
                        dt, 1), max_soft_exit_time))

                if debug_sleep:
                    log('Sleep {}s'.format(over_sleep))
sleep(over_sleep) log(separator_out) return None if victim_badness >= min_badness: if print_victim_status: victim_info = find_victim_info(pid, victim_badness, name) if victim_info is not None: log(victim_info) else: sleep(over_sleep) log(separator_out) return None log_meminfo() if (threshold is SIGKILL and post_kill_exe != '') or ( soft_actions and threshold is SIGTERM): name = pid_to_name(pid) cgroup_v1 = pid_to_cgroup_v1(pid) cgroup_v2 = pid_to_cgroup_v2(pid) if cgroup_v1 != '': cgroup_v1_tail = cgroup_v1.rpartition('/')[2] if cgroup_v1_tail.endswith('.service'): service = cgroup_v1_tail else: service = '' elif cgroup_v2 != '': cgroup_v2_tail = cgroup_v2.rpartition('/')[2] if cgroup_v2_tail.endswith('.service'): service = cgroup_v2_tail else: service = '' else: service = '' soft_match = False if soft_actions and threshold is SIGTERM: for i in soft_actions_list: unit = i[0] if unit == 'name': u = name elif unit == 'cgroup_v1': u = cgroup_v1 else: u = cgroup_v2 regexp = i[1] command = i[2] if search(regexp, u) is not None: log("Regexp '{}' matches with {} '{}'".format( regexp, unit, u)) soft_match = True break start_action = monotonic() post_oom_delay_exceeded = is_post_oom_delay_exceeded() if not post_oom_delay_exceeded: log(separator_out) return None if soft_match: cmd = command.replace('$PID', pid).replace('$NAME', pid_to_name( pid)).replace('$SERVICE', service) preventing_oom_message = 'Implementing a corrective action:\n ' \ 'Executing the command: {}'.format(cmd) log(preventing_oom_message) err = start_thread(exe, cmd) if err == 1: key = 'Cannot execute the command in the new thread' update_stat_dict(key) log(key) else: update_stat_dict('Executing the command "{}"'.format(command)) response_time = monotonic() - time0 log('Total response time: {}ms'.format(round( response_time * 1000))) print_stat_dict() else: preventing_oom_message = 'Implementing a corrective action:\n ' \ 'Sending {} to the victim'.format( sig_dict[threshold]) log(preventing_oom_message) 
            try:
                os.kill(int(pid), threshold)

                update_stat_dict(
                    '[ OK ] Sending {} to {}'.format(sig_dict[threshold], name)
                )

                response_time = monotonic() - time0

                send_result = 'OK; total response time: {}ms'.format(
                    round(response_time * 1000))

                log(send_result)

                if threshold is SIGKILL:
                    vwd = True

                print_stat_dict()

            except ProcessLookupError:
                vwd = True
                key = 'Selected process died before corrective action'
                update_stat_dict(key)
                print_stat_dict()
                log(key)

            except PermissionError:
                vwd = False
                key = 'Cannot send a signal: PermissionError'
                log(key)
                update_stat_dict(key)
                print_stat_dict()
                # Sleep for the configured delay, matching the logged value
                log('Sleep {}s'.format(post_soft_action_delay))
                sleep(post_soft_action_delay)

        if not vwd:
            if victim_id not in v_dict:
                v_dict[victim_id] = dict()
                v_dict[victim_id]['time'] = monotonic()
                v_dict[victim_id]['name'] = name
            else:
                pass

        last_action_dict['t'] = kill_timestamp = monotonic()

        kill_timestamp = start_action

        while True:
            sleep(0.01)
            d = monotonic() - kill_timestamp
            iva = is_victim_alive(victim_id)

            if iva == 0:
                log('The victim died in {}s'.format(round(d, 3)))
                if victim_id in v_dict:
                    v_dict.pop(victim_id)
                break

            elif iva == 1:
                if vwd and d > sensitivity_test_time + 10:
                    log('The victim doesn\'t respond to corrective action'
                        ' in {}s'.format(round(d, 3)))
                    break
                if not vwd and d > sensitivity_test_time:
                    log('The victim doesn\'t respond to corrective action'
                        ' in {}s'.format(round(d, 3)))
                    break

            elif iva == 2:
                pass

            else:
                log('The victim became a zombie in {}s'.format(round(d, 3)))
                if victim_id in v_dict:
                    v_dict.pop(victim_id)
                sleep(post_zombie_delay)
                break

        mem_available, _, swap_free = check_mem_and_swap()
        ma_mib = int(mem_available) / 1024.0
        sf_mib = int(swap_free) / 1024.0
        log('Memory status after implementing a corrective act'
            'ion:\n MemAvailable'
            ': {} MiB, SwapFree: {} MiB'.format(
                round(ma_mib, 1), round(sf_mib, 1)))

        if threshold is SIGKILL and post_kill_exe != '':
            log('Executing post_kill_exe: {}'.format(post_kill_exe))
            cmd = post_kill_exe.replace('$PID', pid).replace(
                '$NAME', name).replace('$SERVICE', service)
            start_thread(exe, cmd)

        if post_action_gui_notifications:
            if soft_match:
                send_notify_etc(pid, name, cmd)
            else:
                send_notify(threshold, name, pid)

    else:

        response_time = monotonic() - time0

        victim_badness_is_too_small = 'victim (PID: {}, Name: {}) badness ' \
            '({}) < min_badness ({}); nothing to do; response tim' \
            'e: {}ms'.format(
                pid, name,
                victim_badness,
                min_badness,
                round(response_time * 1000))

        log(victim_badness_is_too_small)

        # update stat_dict
        key = 'victim badness < min_badness'
        update_stat_dict(key)
        print_stat_dict()

    if vwd is None:
        if debug_sleep:
            log('Sleep {}s'.format(over_sleep))
        sleep(over_sleep)

    log(separator_out)

    return None


def sleep_after_check_mem():
    """Choose the sleep time based on fill rates and available memory."""

    if stable_sleep:

        if debug_sleep:
            log('Sleep {}s'.format(min_sleep))
        stdout.flush()
        sleep(min_sleep)
        return None

    if hard_threshold_min_mem_kb < soft_threshold_min_mem_kb:
        mem_point = mem_available - soft_threshold_min_mem_kb
    else:
        mem_point = mem_available - hard_threshold_min_mem_kb

    if hard_threshold_min_swap_kb < soft_threshold_min_swap_kb:
        swap_point = swap_free - soft_threshold_min_swap_kb
    else:
        swap_point = swap_free - hard_threshold_min_swap_kb

    if swap_point < 0:
        swap_point = 0

    if mem_point < 0:
        mem_point = 0

    t_mem = mem_point / fill_rate_mem
    t_swap = swap_point / fill_rate_swap

    if zram_checking_enabled:
        t_zram = (mem_total * 0.8 - mem_used_zram) / fill_rate_zram
        if t_zram < 0:
            t_zram = 0
        t_mem_zram = t_mem + t_zram
        z = ', t_zram={}'.format(round(t_zram, 2))
    else:
        z = ''

    t_mem_swap = t_mem + t_swap

    if zram_checking_enabled:
        if t_mem_swap <= t_mem_zram:
            t = t_mem_swap
        else:
            t = t_mem_zram
    else:
        t = t_mem_swap

    if t > max_sleep:
        t = max_sleep
    elif t < min_sleep:
        t = min_sleep
    else:
        pass

    if debug_sleep:
        log('Sleep {}s (t_mem={}, t_swap={}{})'.format(round(t, 2), round(
            t_mem, 2), round(t_swap, 2), z))

    stdout.flush()
    sleep(t)


def calculate_percent(arg_key):
    """
    Parse the config dict and calculate mem_min_KEY_percent.
arg_key: str key for config_dict returns int mem_min_percent or NoneType if got some error """ if arg_key in config_dict: mem_min = config_dict[arg_key] if mem_min.endswith('%'): # truncate percents, so we have a number mem_min_percent = mem_min[:-1].strip() # then 'float test' mem_min_percent = string_to_float_convert_test(mem_min_percent) if mem_min_percent is None: invalid_config_key_value(arg_key) # soft_threshold_min_mem_percent is clean and valid float # percentage. Can translate into Kb mem_min_kb = mem_min_percent / 100 * mem_total mem_min_mb = round(mem_min_kb / 1024) elif mem_min.endswith('M'): mem_min_mb = string_to_float_convert_test(mem_min[:-1].strip()) if mem_min_mb is None: invalid_config_key_value(arg_key) mem_min_kb = mem_min_mb * 1024 mem_min_percent = mem_min_kb / mem_total * 100 else: invalid_config_key_value(arg_key) else: missing_config_key(arg_key) if (arg_key == 'soft_threshold_min_mem' or arg_key == 'hard_threshold_min_mem'): if mem_min_kb > mem_total * 0.5 or mem_min_kb < 0: invalid_config_key_value(arg_key) if (arg_key == 'soft_threshold_max_zram' or arg_key == 'hard_threshold_max_zram'): if mem_min_kb > mem_total * 0.9 or mem_min_kb < mem_total * 0.1: invalid_config_key_value(arg_key) if (arg_key == 'warning_threshold_min_mem' or arg_key == 'warning_threshold_max_zram'): if mem_min_kb > mem_total or mem_min_kb < 0: invalid_config_key_value(arg_key) return mem_min_kb, mem_min_mb, mem_min_percent def is_kmsg_ok(): """ """ m_test0 = monotonic() test_string = 'nohang: clock calibration: {}\n'.format(m_test0) try: write(kmsg_path, test_string) except Exception as e: log(e) return None try: with open(kmsg_path): pass except Exception as e: log(e) return None return m_test0, test_string def check_kmsg_fn(): """ Checking kernel messages for OOM events. 
    It catches lines like the following:
    3,591,248095206,-;Out of memory: Kill process 1061 (tail) score 917 or sacr
    ifice child
    3,1699,7208057400,-;Out of memory: Killed process 9277 (tail) total-vm:7837
    24kB, anon-rss:566796kB, file-rss:1508kB, shmem-rss:0kB, UID:1000
    3,728,398101038,-;Memory cgroup out of memory: Killed process 2497 (tail) t
    otal-vm:6468944kB, anon-rss:3013188kB, file-rss:1676kB, shmem-rss:0kB, UID
    : 1000 pgtables:12300kB oom_score_adj:0
    4,2928853,54035939870,-;tail invoked oom-killer: gfp_mask=0x100cca(GFP_HIGH
    USER_MOVABLE), order=0, oom_score_adj=0
    6,2928994,54035940084,-;Tasks in /user.slice/user-1000.slice/session-235.sc
    ope are going to be killed due to memory.oom.group set
    6,2929028,54036119432,-;oom_reaper: reaped process 15512 (tail), now anon-r
    ss:0kB, file-rss:0kB, shmem-rss:0kB
    """
    broken_pipe = False
    start_t0 = monotonic()
    counter = 0

    with open(kmsg_path) as f:

        # Phase 1: skip old messages until our own test string is found
        while True:
            d = monotonic() - start_t0

            if d > kmsg_start_timeout:
                log('kmsg: cannot start in {}s'.format(
                    kmsg_start_timeout))
                return None

            try:
                s = f.readline()
                counter += 1
                if broken_pipe:
                    log('kmsg: BrokenPipeError occurred, some messages lost')
                    broken_pipe = False
            except BrokenPipeError:
                broken_pipe = True
                if debug_kmsg:
                    log('debug kmsg: BrokenPipeError, sleep {}s'.format(
                        sleep_at_broken_pipe))
                sleep(sleep_at_broken_pipe)
                continue

            if test_string in s:
                if debug_kmsg:
                    log('debug kmsg: {}'.format(s[:-1]))
                kmsg_mono0 = float(s.split(',')[2]) / 1000000
                kmsg_t_delta = kmsg_mono0 - m_test0
                if debug_kmsg:
                    log('debug kmsg: kmsg_t_delta (kmsg_mono - test_mono)'
                        ': {}'.format(round(kmsg_t_delta, 3)))
                log('Checking kmsg for OOM events has started in {}s;'
                    ' {} lines parsed'.format(round(d, 3), counter))
                break

        # Phase 2: watch new messages for OOM events
        while True:
            try:
                s = f.readline()
                if broken_pipe:
                    log('kmsg: BrokenPipeError occurred, some messages lost')
                    broken_pipe = False
            except BrokenPipeError:
                if debug_kmsg:
                    log('debug kmsg: BrokenPipeError, sleep {}s'.format(
                        sleep_at_broken_pipe))
                broken_pipe = True
                sleep(sleep_at_broken_pipe)
continue if (s[:2] == '3,' and 'ut of memory: Kill' in s): if debug_kmsg: log('debug kmsg: {}'.format(s[:-1])) k_mono = float((s.split(',')[2])) / 1000000 krm = k_mono - kmsg_t_delta oom_dict['t'] = krm last_action_dict['t'] = krm kmsg = s[:-1].rpartition(';')[2] log('kmsg: {}'.format(kmsg)) update_stat_dict('kmsg: Out of memory: Kill') continue if s[:2] == '6,': if ';oom_reaper: reaped process ' in s: if debug_kmsg: log('debug kmsg: {}'.format(s[:-1])) k_mono = float((s.split(',')[2])) / 1000000 krm = k_mono - kmsg_t_delta last_action_dict['t'] = krm kmsg = s[:-1].rpartition(';')[2] log('kmsg: {}'.format(kmsg)) update_stat_dict('kmsg: oom_reaper: reaped process') continue if 'killed due to memory.oom.group set' in s: if debug_kmsg: log('debug kmsg: {}'.format(s[:-1])) k_mono = float((s.split(',')[2])) / 1000000 krm = k_mono - kmsg_t_delta oom_dict['t'] = krm last_action_dict['t'] = krm kmsg = s[:-1].rpartition(';')[2] log('kmsg: {}'.format(kmsg)) update_stat_dict('killed due to memory.oom.group set') continue if (s[:2] == '4,' and ' invoked oom-killer: ' in s): if debug_kmsg: log('debug kmsg: {}'.format(s[:-1])) k_mono = float((s.split(',')[2])) / 1000000 krm = k_mono - kmsg_t_delta oom_dict['t'] = krm last_action_dict['t'] = krm kmsg = s[:-1].rpartition(';')[2] log('kmsg: {}'.format(kmsg)) update_stat_dict('kmsg: invoked oom-killer') print_stat_dict() if post_action_gui_notifications: send_notification( 'kmsg: Out of memory!', 'invoked oom-killer') continue ############################################################################### # {victim_id : {'time': timestamp, 'name': name} v_dict = dict() start_time = monotonic() help_mess = """usage: nohang [-h|--help] [-v|--version] [-m|--memload] [-c|--config CONFIG] [--check] [--monitor] [--tasks] optional arguments: -h, --help show this help message and exit -v, --version show version of installed package and exit -m, --memload consume memory until 40 MiB (MemAvailable + SwapFree) remain free, and terminate the 
                        process
  -c CONFIG, --config CONFIG
                        path to the config file. This should only be used
                        with one of the following options:
                        --monitor, --tasks, --check
  --check               check and show the configuration and exit. This should
                        only be used with -c/--config CONFIG option
  --monitor             start monitoring. This should only be used with
                        -c/--config CONFIG option
  --tasks               show tasks state and exit. This should only be used
                        with -c/--config CONFIG option"""


SC_CLK_TCK = os.sysconf(os.sysconf_names['SC_CLK_TCK'])

SC_PAGESIZE = os.sysconf(os.sysconf_names['SC_PAGESIZE'])

sig_list = [SIGTERM, SIGINT, SIGQUIT, SIGHUP]

sig_dict = {
    SIGKILL: 'SIGKILL',
    SIGINT: 'SIGINT',
    SIGQUIT: 'SIGQUIT',
    SIGHUP: 'SIGHUP',
    SIGTERM: 'SIGTERM'
}

self_pid = str(os.getpid())

self_uid = os.geteuid()

root = bool(self_uid == 0)

last_action_dict = dict()

last_action_dict['t'] = monotonic()

# will store corrective actions stat
stat_dict = dict()

separate_log = False  # will be overwritten after parse config

cgroup_v1_index, cgroup_v2_index = find_cgroup_indexes()

pid_list = alive_pid_list()

print_proc_table_flag = False

check_config_flag = False

a = argv[1:]
la = len(a)

if la == 0:
    errprint('ERROR: invalid input: missing CLI options\n')
    errprint(help_mess)
    exit(1)

if la == 1:
    if a[0] == '-h' or a[0] == '--help':
        print(help_mess)
        exit()
    if a[0] == '-v' or a[0] == '--version':
        print_version()
    if a[0] == '-m' or a[0] == '--memload':
        memload()
    errprint('ERROR: invalid input\n')
    errprint(help_mess)
    exit(1)

if la == 2:
    errprint('ERROR: invalid input\n')
    errprint(help_mess)
    exit(1)

if la == 3:
    if '-c' in a or '--config' in a:
        if '--monitor' in a or '--check' in a or '--tasks' in a:
            try:
                aaa = a.index('-c')
            except ValueError:
                pass
            try:
                aaa = a.index('--config')
            except ValueError:
                pass
            try:
                config = a[aaa + 1]
            except IndexError:
                errprint('ERROR: invalid input\n')
                errprint(help_mess)
                exit(1)
            # The value after -c/--config must be a path, not another option
            if (config == '--check' or config == '--monitor' or
                    config == '--tasks'):
                errprint('ERROR: invalid input\n')
                errprint(help_mess)
                exit(1)
            if
                '--check' in a:
                check_config_flag = True
            if '--tasks' in a:
                print_proc_table_flag = True
        else:
            errprint('ERROR: invalid input\n')
            errprint(help_mess)
            exit(1)
    else:
        errprint('ERROR: invalid input\n')
        errprint(help_mess)
        exit(1)

if la > 3:
    errprint('ERROR: invalid CLI input: too many options\n')
    errprint(help_mess)
    exit(1)


# find mem_total
# find positions of SwapFree and SwapTotal in /proc/meminfo

with open('/proc/meminfo') as f:
    mem_list = f.readlines()

mem_list_names = []

for s in mem_list:
    mem_list_names.append(s.split(':')[0])

try:
    mem_available_index = mem_list_names.index('MemAvailable')
except ValueError:
    errprint('ERROR: your Linux kernel is too old, Linux 3.14+ required')
    # Without MemAvailable the index stays undefined, so stop here
    exit(1)

mem_free_index = mem_list_names.index('MemFree')
swap_total_index = mem_list_names.index('SwapTotal')
swap_free_index = mem_list_names.index('SwapFree')
buffers_index = mem_list_names.index('Buffers')
cached_index = mem_list_names.index('Cached')
sreclaimable_index = mem_list_names.index('SReclaimable')
shmem_info_index = mem_list_names.index('Shmem')

mem_total = int(mem_list[0].split(':')[1][:-4])

# Get names from /proc/*/status to be able to get VmRSS and VmSwap values

with open('/proc/self/status') as file:
    status_list = file.readlines()

status_names = []
for s in status_list:
    status_names.append(s.split(':')[0])

ppid_index = status_names.index('PPid')
vm_size_index = status_names.index('VmSize')
vm_rss_index = status_names.index('VmRSS')
vm_swap_index = status_names.index('VmSwap')
uid_index = status_names.index('Uid')
state_index = status_names.index('State')

try:
    anon_index = status_names.index('RssAnon')
    file_index = status_names.index('RssFile')
    shmem_index = status_names.index('RssShmem')
    detailed_rss = True
    # print(detailed_rss, 'detailed_rss')
except ValueError:
    detailed_rss = False
    # print('It is not Linux 4.5+')

config = os.path.abspath(config)

print('Starting nohang with config {}'.format(config))

separator_in = '>>=== STARTING implement_corrective_action() ====>>'
separator_out = '<<=== FINISHING implement_corrective_action() ===<<' ############################################################################### fd = dict() # parsing the config with obtaining the parameters dictionary # conf_parameters_dict # conf_restart_dict # dictionary with config options config_dict = dict() badness_adj_re_name_list = [] badness_adj_re_cmdline_list = [] badness_adj_re_environ_list = [] badness_adj_re_uid_list = [] badness_adj_re_cgroup_v1_list = [] badness_adj_re_cgroup_v2_list = [] badness_adj_re_realpath_list = [] badness_adj_re_cwd_list = [] soft_actions_list = [] # separator for optional parameters (that starts with @) opt_separator = '///' check_kmsg = False debug_kmsg = False # stupid conf parsing, it needs refactoring try: with open(config) as f: for line in f: a = line.startswith('#') b = line.startswith('\n') c = line.startswith('\t') d = line.startswith(' ') etc = line.startswith('@SOFT_ACTION_RE_NAME') etc2 = line.startswith('@SOFT_ACTION_RE_CGROUP_V1') etc2_2 = line.startswith('@SOFT_ACTION_RE_CGROUP_V2') if line.startswith('@check_kmsg'): line = line.rstrip() if line == '@check_kmsg': check_kmsg = True if line.startswith('@debug_kmsg'): line = line.rstrip() if line == '@debug_kmsg': debug_kmsg = True if (not a and not b and not c and not d and not etc and not etc2 and not etc2_2): a = line.partition('=') key = a[0].strip() value = a[2].strip() if key not in config_dict: config_dict[key] = value else: errprint('ERROR: config key duplication: {}'.format(key)) exit(1) if etc: a = line.partition('@SOFT_ACTION_RE_NAME')[ 2].partition(opt_separator) a1 = 'name' a2 = a[0].strip() valid_re(a2) a3 = a[2].strip() zzz = (a1, a2, a3) soft_actions_list.append(zzz) if etc2: a = line.partition('@SOFT_ACTION_RE_CGROUP_V1')[ 2].partition(opt_separator) a1 = 'cgroup_v1' a2 = a[0].strip() valid_re(a2) a3 = a[2].strip() zzz = (a1, a2, a3) soft_actions_list.append(zzz) if etc2_2: a = line.partition('@SOFT_ACTION_RE_CGROUP_V2')[ 
2].partition(opt_separator) a1 = 'cgroup_v2' a2 = a[0].strip() valid_re(a2) a3 = a[2].strip() zzz = (a1, a2, a3) soft_actions_list.append(zzz) if line.startswith('@BADNESS_ADJ_RE_NAME'): a = line.partition('@BADNESS_ADJ_RE_NAME')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_name_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_CMDLINE'): a = line.partition('@BADNESS_ADJ_RE_CMDLINE')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_cmdline_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_UID'): a = line.partition('@BADNESS_ADJ_RE_UID')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_uid_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_CGROUP_V1'): a = line.partition('@BADNESS_ADJ_RE_CGROUP_V1')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_cgroup_v1_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_CGROUP_V2'): a = line.partition('@BADNESS_ADJ_RE_CGROUP_V2')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_cgroup_v2_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_REALPATH'): a = line.partition('@BADNESS_ADJ_RE_REALPATH')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_realpath_list.append((badness_adj, reg_exp)) if line.startswith('@BADNESS_ADJ_RE_CWD'): a = line.partition('@BADNESS_ADJ_RE_CWD')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_cwd_list.append((badness_adj, 
reg_exp)) if line.startswith('@BADNESS_ADJ_RE_ENVIRON'): a = line.partition('@BADNESS_ADJ_RE_ENVIRON')[2].strip( ' \n').partition(opt_separator) badness_adj = a[0].strip(' ') reg_exp = a[2].strip(' ') valid_re(reg_exp) badness_adj_re_environ_list.append((badness_adj, reg_exp)) except (PermissionError, UnicodeDecodeError, IsADirectoryError, IndexError, FileNotFoundError) as e: errprint(e) errprint('Invalid config. Exit.') exit(1) if badness_adj_re_name_list == []: regex_matching = False else: regex_matching = True if badness_adj_re_cmdline_list == []: re_match_cmdline = False else: re_match_cmdline = True if badness_adj_re_uid_list == []: re_match_uid = False else: re_match_uid = True if badness_adj_re_environ_list == []: re_match_environ = False else: re_match_environ = True if badness_adj_re_realpath_list == []: re_match_realpath = False else: re_match_realpath = True if badness_adj_re_cwd_list == []: re_match_cwd = False else: re_match_cwd = True if badness_adj_re_cgroup_v1_list == []: re_match_cgroup_v1 = False else: re_match_cgroup_v1 = True if badness_adj_re_cgroup_v2_list == []: re_match_cgroup_v2 = False else: re_match_cgroup_v2 = True if soft_actions_list == []: soft_actions = False else: soft_actions = True ############################################################################### # extracting parameters from the dictionary # check for all necessary parameters # validation of all parameters separate_log = conf_parse_bool('separate_log') if separate_log: import logging log_dir = '/var/log/nohang' logfile = log_dir + '/nohang.log' try: os.mkdir(log_dir) except FileExistsError: pass except PermissionError: errprint('ERROR: cannot create {}'.format(log_dir)) if root: exit(1) try: os.chmod(log_dir, mode=0o750) except FileNotFoundError: errprint('ERROR: file not found: {}'.format(log_dir)) if root: exit(1) except PermissionError: errprint('ERROR: permission denied: {}'.format(log_dir)) if root: exit(1) try: logging.basicConfig( filename=logfile, 
level=logging.INFO, format="%(asctime)s: %(message)s") except FileNotFoundError: errprint('ERROR: file not found: {}'.format(logfile)) if root: exit(1) except PermissionError: errprint('ERROR: permission denied: {}'.format(logfile)) if root: exit(1) if separate_log: logging.info('Starting nohang with the config: {}'.format(config)) debug_psi = conf_parse_bool('debug_psi') print_statistics = conf_parse_bool('print_statistics') print_proc_table = conf_parse_bool('print_proc_table') print_victim_status = conf_parse_bool('print_victim_status') print_victim_cmdline = conf_parse_bool('print_victim_cmdline') print_config_at_startup = conf_parse_bool('print_config_at_startup') print_mem_check_results = conf_parse_bool('print_mem_check_results') debug_sleep = conf_parse_bool('debug_sleep') hide_corrective_action_type = conf_parse_bool('hide_corrective_action_type') low_memory_warnings_enabled = conf_parse_bool('low_memory_warnings_enabled') post_action_gui_notifications = conf_parse_bool( 'post_action_gui_notifications') debug_threading = conf_parse_bool('debug_threading') psi_checking_enabled = conf_parse_bool('psi_checking_enabled') try: psi_file_mem_to_metrics('/proc/pressure/memory') PSI_KERNEL_OK = True except Exception as e: PSI_KERNEL_OK = False if psi_checking_enabled: log('WARNING: PSI metrics are not provided by the kernel: {}'.format( e)) CHECK_PSI = bool(PSI_KERNEL_OK and psi_checking_enabled) zram_checking_enabled = conf_parse_bool('zram_checking_enabled') debug_gui_notifications = conf_parse_bool('debug_gui_notifications') ignore_positive_oom_score_adj = conf_parse_bool( 'ignore_positive_oom_score_adj') (soft_threshold_min_mem_kb, soft_threshold_min_mem_mb, soft_threshold_min_mem_percent) = calculate_percent('soft_threshold_min_mem') (hard_threshold_min_mem_kb, hard_threshold_min_mem_mb, hard_threshold_min_mem_percent) = calculate_percent('hard_threshold_min_mem') (soft_threshold_max_zram_kb, soft_threshold_max_zram_mb, soft_threshold_max_zram_percent) = 
calculate_percent( 'soft_threshold_max_zram') (hard_threshold_max_zram_kb, hard_threshold_max_zram_mb, hard_threshold_max_zram_percent) = calculate_percent( 'hard_threshold_max_zram') (warning_threshold_min_mem_kb, warning_threshold_min_mem_mb, warning_threshold_min_mem_percent) = calculate_percent( 'warning_threshold_min_mem') (warning_threshold_max_zram_kb, warning_threshold_max_zram_mb, warning_threshold_max_zram_percent) = calculate_percent( 'warning_threshold_max_zram') if 'post_zombie_delay' in config_dict: post_zombie_delay = string_to_float_convert_test( config_dict['post_zombie_delay']) if post_zombie_delay is None or post_zombie_delay < 0: invalid_config_key_value('post_zombie_delay') else: missing_config_key('post_zombie_delay') if 'victim_cache_time' in config_dict: victim_cache_time = string_to_float_convert_test( config_dict['victim_cache_time']) if victim_cache_time is None or victim_cache_time < 0: invalid_config_key_value('victim_cache_time') else: missing_config_key('victim_cache_time') if 'env_cache_time' in config_dict: env_cache_time = string_to_float_convert_test( config_dict['env_cache_time']) if env_cache_time is None or env_cache_time < 0: invalid_config_key_value('env_cache_time') else: missing_config_key('env_cache_time') if 'exe_timeout' in config_dict: exe_timeout = string_to_float_convert_test(config_dict['exe_timeout']) if exe_timeout is None or exe_timeout < 0.1: invalid_config_key_value('exe_timeout') else: missing_config_key('exe_timeout') if 'fill_rate_mem' in config_dict: fill_rate_mem = string_to_float_convert_test(config_dict['fill_rate_mem']) if fill_rate_mem is None or fill_rate_mem < 100: invalid_config_key_value('fill_rate_mem') else: missing_config_key('fill_rate_mem') if 'fill_rate_swap' in config_dict: fill_rate_swap = string_to_float_convert_test( config_dict['fill_rate_swap']) if fill_rate_swap is None or fill_rate_swap < 100: invalid_config_key_value('fill_rate_swap') else: missing_config_key('fill_rate_swap') if 
'fill_rate_zram' in config_dict: fill_rate_zram = string_to_float_convert_test( config_dict['fill_rate_zram']) if fill_rate_zram is None or fill_rate_zram < 100: invalid_config_key_value('fill_rate_zram') else: missing_config_key('fill_rate_zram') if 'soft_threshold_min_swap' in config_dict: soft_threshold_min_swap = config_dict['soft_threshold_min_swap'] else: errprint('soft_threshold_min_swap not in config\nExit') exit(1) if 'hard_threshold_min_swap' in config_dict: hard_threshold_min_swap = config_dict['hard_threshold_min_swap'] else: missing_config_key('hard_threshold_min_swap') if 'post_soft_action_delay' in config_dict: post_soft_action_delay = string_to_float_convert_test( config_dict['post_soft_action_delay']) if post_soft_action_delay is None or post_soft_action_delay < 0.1: invalid_config_key_value('post_soft_action_delay') else: missing_config_key('post_soft_action_delay') if 'psi_post_action_delay' in config_dict: psi_post_action_delay = string_to_float_convert_test( config_dict['psi_post_action_delay']) if psi_post_action_delay is None or psi_post_action_delay < 10: invalid_config_key_value('psi_post_action_delay') else: missing_config_key('psi_post_action_delay') if 'hard_threshold_max_psi' in config_dict: hard_threshold_max_psi = string_to_float_convert_test( config_dict['hard_threshold_max_psi']) if (hard_threshold_max_psi is None or hard_threshold_max_psi < 1 or hard_threshold_max_psi > 100): invalid_config_key_value('hard_threshold_max_psi') else: missing_config_key('hard_threshold_max_psi') if 'soft_threshold_max_psi' in config_dict: soft_threshold_max_psi = string_to_float_convert_test( config_dict['soft_threshold_max_psi']) if (soft_threshold_max_psi is None or soft_threshold_max_psi < 1 or soft_threshold_max_psi > 100): invalid_config_key_value('soft_threshold_max_psi') else: missing_config_key('soft_threshold_max_psi') if 'warning_threshold_max_psi' in config_dict: warning_threshold_max_psi = string_to_float_convert_test( 
        config_dict['warning_threshold_max_psi'])
    if (warning_threshold_max_psi is None or warning_threshold_max_psi < 1 or
            warning_threshold_max_psi > 100):
        invalid_config_key_value('warning_threshold_max_psi')
else:
    missing_config_key('warning_threshold_max_psi')


if 'min_badness' in config_dict:
    min_badness = string_to_int_convert_test(config_dict['min_badness'])
    if min_badness is None or min_badness < 1:
        invalid_config_key_value('min_badness')
else:
    missing_config_key('min_badness')


if 'min_post_warning_delay' in config_dict:
    min_post_warning_delay = string_to_float_convert_test(
        config_dict['min_post_warning_delay'])
    if min_post_warning_delay is None or min_post_warning_delay < 1:
        invalid_config_key_value('min_post_warning_delay')
else:
    missing_config_key('min_post_warning_delay')


if 'warning_threshold_min_swap' in config_dict:
    warning_threshold_min_swap = config_dict['warning_threshold_min_swap']
else:
    missing_config_key('warning_threshold_min_swap')


if 'max_victim_ancestry_depth' in config_dict:
    max_victim_ancestry_depth = string_to_int_convert_test(
        config_dict['max_victim_ancestry_depth'])
    # Check the value that was just parsed, not min_badness
    if max_victim_ancestry_depth is None:
        errprint('Invalid max_victim_ancestry_depth value, not integer\nExit')
        exit(1)
    if max_victim_ancestry_depth < 1:
        errprint('Invalid max_victim_ancestry_depth value\nExit')
        exit(1)
else:
    missing_config_key('max_victim_ancestry_depth')


if 'max_soft_exit_time' in config_dict:
    max_soft_exit_time = string_to_float_convert_test(
        config_dict['max_soft_exit_time'])
    if max_soft_exit_time is None or max_soft_exit_time < 0.1:
        invalid_config_key_value('max_soft_exit_time')
else:
    missing_config_key('max_soft_exit_time')


if 'post_kill_exe' in config_dict:
    post_kill_exe = config_dict['post_kill_exe']
else:
    missing_config_key('post_kill_exe')


if 'psi_path' in config_dict:
    psi_path = config_dict['psi_path']
    if CHECK_PSI:
        try:
            psi_file_mem_to_metrics(psi_path)
        except Exception as e:
            errprint('WARNING: invalid psi_path "{}": {}'.format(psi_path, e))
else:
missing_config_key('psi_path') if 'psi_metrics' in config_dict: psi_metrics = config_dict['psi_metrics'] valid_metrics = { 'some_avg10', 'some_avg60', 'some_avg300', 'full_avg10', 'full_avg60', 'full_avg300'} if psi_metrics not in valid_metrics: invalid_config_key_value('psi_metrics') else: missing_config_key('psi_metrics') if 'warning_exe' in config_dict: warning_exe = config_dict['warning_exe'] check_warning_exe = bool(warning_exe != '') else: missing_config_key('warning_exe') if 'extra_table_info' in config_dict: extra_table_info = config_dict['extra_table_info'] valid_eti = {'None', 'cwd', 'realpath', 'cgroup_v1', 'cgroup_v2', 'cmdline', 'environ'} if extra_table_info not in valid_eti: invalid_config_key_value('extra_table_info') else: missing_config_key('extra_table_info') if 'min_mem_report_interval' in config_dict: min_mem_report_interval = string_to_float_convert_test( config_dict['min_mem_report_interval']) if min_mem_report_interval is None or min_mem_report_interval < 0: invalid_config_key_value('min_mem_report_interval') else: missing_config_key('min_mem_report_interval') if 'psi_excess_duration' in config_dict: psi_excess_duration = string_to_float_convert_test( config_dict['psi_excess_duration']) if psi_excess_duration is None or psi_excess_duration < 0: invalid_config_key_value('psi_excess_duration') else: missing_config_key('psi_excess_duration') if 'max_sleep' in config_dict: max_sleep = string_to_float_convert_test( config_dict['max_sleep']) if max_sleep is None or max_sleep < 0.01: invalid_config_key_value('max_sleep') else: missing_config_key('max_sleep') if 'min_sleep' in config_dict: min_sleep = string_to_float_convert_test( config_dict['min_sleep']) if min_sleep is None or min_sleep < 0.01 or min_sleep > max_sleep: invalid_config_key_value('min_sleep') else: missing_config_key('min_sleep') over_sleep = min_sleep sensitivity_test_time = over_sleep / 4 stable_sleep = bool(max_sleep == min_sleep) if print_proc_table_flag: check_permissions() 
func_print_proc_table() if (check_kmsg or low_memory_warnings_enabled or post_action_gui_notifications or check_warning_exe or soft_actions or post_kill_exe != ''): import threading import shlex from subprocess import Popen, TimeoutExpired # Get KiB levels if it's possible. soft_threshold_min_swap_tuple = get_swap_threshold_tuple( soft_threshold_min_swap, 'soft_threshold_min_swap') hard_threshold_min_swap_tuple = get_swap_threshold_tuple( hard_threshold_min_swap, 'hard_threshold_min_swap') warning_threshold_min_swap_tuple = get_swap_threshold_tuple( warning_threshold_min_swap, 'warning_threshold_min_swap') swap_kb_dict = dict() swap_term_is_percent = soft_threshold_min_swap_tuple[1] if swap_term_is_percent: soft_threshold_min_swap_percent = soft_threshold_min_swap_tuple[0] else: soft_threshold_min_swap_kb = soft_threshold_min_swap_tuple[0] swap_kb_dict['soft_threshold_min_swap_kb'] = soft_threshold_min_swap_kb swap_kill_is_percent = hard_threshold_min_swap_tuple[1] if swap_kill_is_percent: hard_threshold_min_swap_percent = hard_threshold_min_swap_tuple[0] else: hard_threshold_min_swap_kb = hard_threshold_min_swap_tuple[0] swap_kb_dict['hard_threshold_min_swap_kb'] = hard_threshold_min_swap_kb swap_warn_is_percent = warning_threshold_min_swap_tuple[1] if swap_warn_is_percent: warning_threshold_min_swap_percent = warning_threshold_min_swap_tuple[0] else: warning_threshold_min_swap_kb = warning_threshold_min_swap_tuple[0] swap_kb_dict[ 'warning_threshold_min_swap_kb'] = warning_threshold_min_swap_kb if print_config_at_startup or check_config_flag: check_config() # for calculating the column width when printing mem and zram mem_len = len(str(round(mem_total / 1024.0))) if post_action_gui_notifications: notify_sig_dict = {SIGKILL: 'Killing', SIGTERM: 'Terminating'} # convert rates from MiB/s to KiB/s fill_rate_mem = fill_rate_mem * 1024 fill_rate_swap = fill_rate_swap * 1024 fill_rate_zram = fill_rate_zram * 1024 warn_time_now = 0 warn_time_delta = 1000 # ? 
warn_timer = 0 mlockall() check_permissions() psi_avg_string = '' # will be overwritten if PSI monitoring enabled mem_used_zram = 0 if print_mem_check_results: # to find delta mem wt2 = 0 new_mem = 0 # init mem report interval report0 = 0 # handle signals for i in sig_list: signal(i, signal_handler) x0 = monotonic() delta0 = 0 threshold = None mem_info = None psi_kill_exceeded_timer = psi_term_exceeded_timer = -0.0001 psi_threshold = zram_threshold = zram_info = psi_info = None log('Monitoring has started!') stdout.flush() display_env = 'DISPLAY=' dbus_env = 'DBUS_SESSION_BUS_ADDRESS=' user_env = 'USER=' envd = dict() envd['list_with_envs'] = envd['t'] = None cmd_num_dict = dict() cmd_num_dict['cmd_num'] = 0 fd['mi'] = open('/proc/meminfo', 'rb', buffering=0) arcstats_path = '/proc/spl/kstat/zfs/arcstats' # arcstats_path = './arcstats' ZFS = os.path.exists(arcstats_path) kmsg_path = '/dev/kmsg' post_oom_delay = 1.0 kmsg_start_timeout = 10.0 sleep_at_broken_pipe = 0.1 oom_dict = dict() oom_dict['t'] = None if check_kmsg: kmsg_test = is_kmsg_ok() if kmsg_test is not None: m_test0, test_string = kmsg_test start_thread(check_kmsg_fn) if ZFS: log('WARNING: ZFS found. 
Available memory will not be calculated ' 'correctly (issue#89)') try: # find indexes with open(arcstats_path, 'rb') as f: a_list = f.read().decode().split('\n') for n, line in enumerate(a_list): if line.startswith('c_min '): c_min_index = n elif line.startswith('size '): size_index = n elif line.startswith('arc_meta_used '): arc_meta_used_index = n elif line.startswith('arc_meta_min '): arc_meta_min_index = n else: continue except Exception as e: log(e) ZFS = False while True: (masf_threshold, masf_info, mem_available, hard_threshold_min_swap_kb, soft_threshold_min_swap_kb, swap_free, swap_total) = check_mem_swap_ex() if zram_checking_enabled: zram_threshold, zram_info, mem_used_zram = check_zram_ex() if CHECK_PSI: (psi_threshold, psi_info, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0) = check_psi_ex( psi_kill_exceeded_timer, psi_term_exceeded_timer, x0, mem_available) if print_mem_check_results: if CHECK_PSI: psi_avg_value = find_psi_metrics_value(psi_path, psi_metrics) psi_post_action_delay_exceeded = bool( monotonic() >= psi_post_action_delay) if print_mem_check_results: psi_avg_string = 'PSI: {} | '.format( str(psi_avg_value).rjust(6)) wt1 = monotonic() delta = (mem_available + swap_free) - new_mem t_cycle = wt1 - wt2 report_delta = wt1 - report0 if report_delta >= min_mem_report_interval: mem_report = True new_mem = mem_available + swap_free report0 = wt1 else: mem_report = False wt2 = monotonic() if mem_report: speed = delta / 1024.0 / report_delta speed_info = ' | dMem: {} M/s'.format( str(round(speed)).rjust(5)) # Calculate 'swap-column' width swap_len = len(str(round(swap_total / 1024.0))) # Output available mem sizes if swap_total == 0 and mem_used_zram == 0: log('{}MemAvail: {} M, {} %{}'.format( psi_avg_string, human(mem_available, mem_len), just_percent_mem(mem_available / mem_total), speed_info)) elif swap_total > 0 and mem_used_zram == 0: log('{}MemAvail: {} M, {} % | SwapFree: {} M, {} %{}'.format( psi_avg_string, human(mem_available, 
mem_len), just_percent_mem(mem_available / mem_total), human(swap_free, swap_len), just_percent_swap(swap_free / (swap_total + 0.1)), speed_info)) else: log('{}MemAvail: {} M, {} % | SwapFree: {} M, {} % | Mem' 'UsedZram: {} M, {} %{}'.format( psi_avg_string, human(mem_available, mem_len), just_percent_mem(mem_available / mem_total), human(swap_free, swap_len), just_percent_swap(swap_free / (swap_total + 0.1)), human(mem_used_zram, mem_len), just_percent_mem(mem_used_zram / mem_total), speed_info)) if (masf_threshold == SIGKILL or zram_threshold == SIGKILL or psi_threshold == SIGKILL): threshold = SIGKILL mem_info_list = [] if masf_info is not None: mem_info_list.append(masf_info) if zram_info is not None: mem_info_list.append(zram_info) if psi_info is not None: mem_info_list.append(psi_info) implement_corrective_action( threshold, mem_info_list, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0, psi_threshold, zram_threshold, zram_info, psi_info) continue if (masf_threshold == SIGTERM or zram_threshold == SIGTERM or psi_threshold == SIGTERM): threshold = SIGTERM mem_info_list = [] if masf_info is not None: mem_info_list.append(masf_info) if zram_info is not None: mem_info_list.append(zram_info) if psi_info is not None: mem_info_list.append(psi_info) implement_corrective_action( threshold, mem_info_list, psi_kill_exceeded_timer, psi_term_exceeded_timer, x0, psi_threshold, zram_threshold, zram_info, psi_info) continue if low_memory_warnings_enabled: if (masf_threshold == 'WARN' or zram_threshold == 'WARN' or psi_threshold == 'WARN'): warn_time_delta = monotonic() - warn_time_now warn_time_now = monotonic() warn_timer += warn_time_delta if warn_timer > min_post_warning_delay: send_notify_warn() warn_timer = 0 sleep_after_check_mem() nohang-0.2.0/src/oom-sort000077500000000000000000000124771377337215500153070ustar00rootroot00000000000000#!/usr/bin/env python3 """ sort processes by oom_score """ from operator import itemgetter from os import listdir from argparse 
import ArgumentParser def pid_to_oom_score(pid): with open('/proc/{}/oom_score'.format(pid), 'rb', buffering=0) as f: return int(f.read()) def pid_to_oom_score_adj(pid): with open('/proc/{}/oom_score_adj'.format(pid), 'rb', buffering=0) as f: return int(f.read()) def pid_to_cmdline(pid): with open('/proc/{}/cmdline'.format(pid), 'rb', buffering=0) as f: return f.read().decode('utf-8', 'ignore').replace( '\x00', ' ').rstrip() def pid_to_status_units(pid): with open('/proc/{}/status'.format(pid), 'rb', buffering=0) as f: f_list = f.read().decode('utf-8', 'ignore').split('\n') for i in range(len(f_list)): if i == 1: name = f_list[0].split('\t')[1] if i == uid_index: uid = f_list[i].split('\t')[2] if i == vm_rss_index: vm_rss = f_list[i].split('\t')[1][:-3] if i == vm_swap_index: vm_swap = f_list[i].split('\t')[1][:-3] return name, uid, vm_rss, vm_swap def get_max_pid_len(): with open('/proc/sys/kernel/pid_max') as f: for line in f: return len(line.strip()) sort_dict = { 'PID': 0, 'oom_score': 1, 'oom_score_adj': 2, 'cmdline': 3, 'Name': 4, 'UID': 5, 'VmRSS': 6, 'VmSwap': 7 } parser = ArgumentParser() parser.add_argument( '--num', '-n', help="""max number of lines; default: 99999""", default=99999, type=int ) parser.add_argument( '--len', '-l', help="""max cmdline length; default: 99999""", default=99999, type=int ) parser.add_argument( '--sort', '-s', help="""sort by unit; default: oom_score""", default='oom_score', type=str ) args = parser.parse_args() display_cmdline = args.len num_lines = args.num sort_by = args.sort if sort_by not in sort_dict: print('Invalid -s/--sort value. 
Valid values are:\nPID\noom_scor' 'e [default value]\noom_score_adj\nUID\nName\ncmdline\nVmR' 'SS\nVmSwap') exit(1) # find VmRSS, VmSwap and UID positions in /proc/*/status for further # searching positions of UID, VmRSS and VmSwap in each process with open('/proc/self/status') as file: status_list = file.readlines() status_names = [] for s in status_list: status_names.append(s.split(':')[0]) uid_index = status_names.index('Uid') vm_rss_index = status_names.index('VmRSS') vm_swap_index = status_names.index('VmSwap') # get sorted list with pid, oom_score, oom_score_adj, cmdline # get status units: name, uid, rss, swap oom_list = [] for pid in listdir('/proc'): # skip non-numeric entries and PID 1 if pid.isdigit() is False or pid == '1': continue try: oom_score = pid_to_oom_score(pid) oom_score_adj = pid_to_oom_score_adj(pid) cmdline = pid_to_cmdline(pid) if cmdline == '': continue name, uid, vm_rss, vm_swap = pid_to_status_units(pid) except FileNotFoundError: continue except ProcessLookupError: continue except Exception as e: print(e) exit(1) oom_list.append(( int(pid), int(oom_score), int(oom_score_adj), cmdline, name, int(uid), int(vm_rss), int(vm_swap))) # list sorted by oom_score oom_list_sorted = sorted( oom_list, key=itemgetter(int(sort_dict[sort_by])), reverse=True) # find width of columns max_pid_len = get_max_pid_len() max_uid_len = len(str(sorted( oom_list, key=itemgetter(5), reverse=True)[0][5])) max_vm_rss_len = len(str(round( sorted(oom_list, key=itemgetter(6), reverse=True)[0][6] / 1024))) if max_vm_rss_len < 5: max_vm_rss_len = 5 # print output if display_cmdline == 0: print( 'oom_score oom_score_adj{}UID{}PID Name {}VmRSS VmSwap'.format( ' ' * (max_uid_len - 2), ' ' * (max_pid_len - 2), ' ' * max_vm_rss_len ) ) print( '--------- ------------- {} {} --------------- {}-- --------'.format( '-' * max_uid_len, '-' * max_pid_len, '-' * max_vm_rss_len ) ) else: print( 'oom_score oom_score_adj{}UID{}PID Name {}VmRSS VmSwa' 'p cmdline'.format( ' ' * 
(max_uid_len - 2), ' ' * (max_pid_len - 2), ' ' * max_vm_rss_len ) ) print( '--------- ------------- {} {} --------------- {}-- ------' '-- -------'.format( '-' * max_uid_len, '-' * max_pid_len, '-' * max_vm_rss_len ) ) # print processes stats sorted by sort_dict[sort_by] for i in oom_list_sorted[:int(num_lines)]: pid = i[0] oom_score = i[1] oom_score_adj = i[2] cmdline = i[3] name = i[4] uid = i[5] vm_rss = i[6] vm_swap = i[7] print( '{} {} {} {} {} {} M {} M {}'.format( str(oom_score).rjust(9), str(oom_score_adj).rjust(13), str(uid).rjust(max_uid_len), str(pid).rjust(max_pid_len), name.ljust(15), str(round(vm_rss / 1024.0)).rjust(max_vm_rss_len, ' '), str(round(vm_swap / 1024.0)).rjust(6, ' '), cmdline[:display_cmdline] ) ) nohang-0.2.0/src/psi-top000077500000000000000000000113721377337215500151140ustar00rootroot00000000000000#!/usr/bin/env python3 import os from argparse import ArgumentParser def psi_path_to_metrics(psi_path): """ """ with open(psi_path) as f: psi_list = f.readlines() some_list, full_list = psi_list[0].split(' '), psi_list[1].split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] full_avg10 = full_list[1].split('=')[1] full_avg60 = full_list[2].split('=')[1] full_avg300 = full_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300) def psi_path_to_metrics_cpu(psi_path): """ """ with open(psi_path) as f: psi_list = f.readlines() some_list = psi_list[0].rstrip().split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300) def cgroup2_root(): """ """ with open(mounts) as f: for line in f: if cgroup2_separator in line: return line.partition(cgroup2_separator)[0].partition(' ')[2] def get_psi_mem_files(cgroup2_path, met): """ """ path_list = [] for root, dirs, files in os.walk(cgroup2_path): for file in files: 
path = os.path.join(root, file) if path.endswith('/{}.pressure'.format(met)): path_list.append(path) return path_list def psi_path_to_cgroup2(path): """ """ if path.endswith('/cpu.pressure'): return path.partition(cgroup2_mountpoint)[ 2].partition('/cpu.pressure')[0] if path.endswith('/io.pressure'): return path.partition(cgroup2_mountpoint)[ 2].partition('/io.pressure')[0] if path.endswith('/memory.pressure'): return path.partition(cgroup2_mountpoint)[ 2].partition('/memory.pressure')[0] parser = ArgumentParser() parser.add_argument( '-m', '--metrics', help="""metrics (memory, io or cpu)""", default='memory', type=str ) args = parser.parse_args() met = args.metrics if not (met == 'memory' or met == 'io' or met == 'cpu'): print('ERROR: invalid metrics:', met) exit(1) psi_path = '/proc/pressure/{}'.format(met) mounts = '/proc/mounts' cgroup2_separator = ' cgroup2 rw,' cgroup2_mountpoint = cgroup2_root() if cgroup2_mountpoint is None: print('ERROR: cgroup_v2 hierarchy is not mounted') exit(1) try: psi_path_to_metrics('/proc/pressure/memory') except Exception as e: print('ERROR: {}'.format(e)) print('PSI metrics are not provided by the kernel. 
Exit.') exit(1) if cgroup2_mountpoint is not None: path_list = get_psi_mem_files(cgroup2_mountpoint, met) head_mem_io = '''PSI metrics: {} cgroup_v2 mountpoint: {} =====================|======================| some | full | -------------------- | -------------------- | avg10 avg60 avg300 | avg10 avg60 avg300 | cgroup_v2 ------ ------ ------ | ------ ------ ------ | -----------'''.format( met, cgroup2_mountpoint) head_cpu = '''PSI metrics: {} cgroup_v2 mountpoint: {} =====================| some | -------------------- | avg10 avg60 avg300 | cgroup_v2 ------ ------ ------ | -----------'''.format( met, cgroup2_mountpoint) if met == 'cpu': print(head_cpu) else: print(head_mem_io) if met == 'cpu': some_avg10, some_avg60, some_avg300 = psi_path_to_metrics_cpu(psi_path) print('{} {} {} | {}'.format( some_avg10.rjust(6), some_avg60.rjust(6), some_avg300.rjust(6), 'SYSTEM_WIDE')) else: (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300 ) = psi_path_to_metrics(psi_path) print('{} {} {} | {} {} {} | {}'.format( some_avg10.rjust(6), some_avg60.rjust(6), some_avg300.rjust(6), full_avg10.rjust(6), full_avg60.rjust(6), full_avg300.rjust(6), 'SYSTEM_WIDE')) for psi_path in path_list: if met == 'cpu': some_avg10, some_avg60, some_avg300 = psi_path_to_metrics_cpu(psi_path) print('{} {} {} | {}'.format( some_avg10.rjust(6), some_avg60.rjust(6), some_avg300.rjust(6), psi_path_to_cgroup2(psi_path))) else: (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300) = psi_path_to_metrics(psi_path) print('{} {} {} | {} {} {} | {}'.format( some_avg10.rjust(6), some_avg60.rjust(6), some_avg300.rjust(6), full_avg10.rjust(6), full_avg60.rjust(6), full_avg300.rjust(6), psi_path_to_cgroup2(psi_path))) nohang-0.2.0/src/psi2log000077500000000000000000000426551377337215500151100ustar00rootroot00000000000000#!/usr/bin/env python3 """psi2log - PSI metrics monitor and logger""" from time import sleep, monotonic from 
ctypes import CDLL from sys import stdout, exit from argparse import ArgumentParser from signal import signal, SIGTERM, SIGINT, SIGQUIT, SIGHUP def read_path(path): """ """ try: fd[path].seek(0) except ValueError: try: fd[path] = open(path, 'rb', buffering=0) except FileNotFoundError as e: log(e) return None except KeyError: try: fd[path] = open(path, 'rb', buffering=0) except FileNotFoundError as e: log(e) return None try: return fd[path].read(99999).decode() except OSError as e: log(e) fd[path].close() return None def form1(num): """ """ s = str(num).split('.') return '{}.{:0<2}'.format(s[0], s[1]) def form2(num): """ """ s = str(round(num, 1)).split('.') return '{}.{:0<1}'.format(s[0], s[1]) def signal_handler(signum, frame): """ """ def signal_handler_inner(signum, frame): pass for i in sig_list: signal(i, signal_handler_inner) if len(fd) > 0: for f in fd: fd[f].close() if signum == SIGINT: print('') lpd = len(peaks_dict) if lpd == 15: # mode 1 log('=================================') log('Peak values: avg10 avg60 avg300') log('----------- ------ ------ ------') log('some cpu {:>6} {:>6} {:>6}'.format( form1(peaks_dict['c_some_avg10']), form1(peaks_dict['c_some_avg60']), form1(peaks_dict['c_some_avg300']), )) log('----------- ------ ------ ------') log('some io {:>6} {:>6} {:>6}'.format( form1(peaks_dict['i_some_avg10']), form1(peaks_dict['i_some_avg60']), form1(peaks_dict['i_some_avg300']), )) log('full io {:>6} {:>6} {:>6}'.format( form1(peaks_dict['i_full_avg10']), form1(peaks_dict['i_full_avg60']), form1(peaks_dict['i_full_avg300']), )) log('----------- ------ ------ ------') log('some memory {:>6} {:>6} {:>6}'.format( form1(peaks_dict['m_some_avg10']), form1(peaks_dict['m_some_avg60']), form1(peaks_dict['m_some_avg300']), )) log('full memory {:>6} {:>6} {:>6}'.format( form1(peaks_dict['m_full_avg10']), form1(peaks_dict['m_full_avg60']), form1(peaks_dict['m_full_avg300']), )) if lpd == 5: # mode 2 log('----- | ----- ----- | ----- ----- | --------') 
log('{:>5} | {:>5} {:>5} | {:>5} {:>5} | peaks'.format( form2(peaks_dict['avg_cs']), form2(peaks_dict['avg_is']), form2(peaks_dict['avg_if']), form2(peaks_dict['avg_ms']), form2(peaks_dict['avg_mf']) )) if separate_log: logging.info('') exit() def cgroup2_root(): """Return the cgroup_v2 mountpoint found in /proc/mounts, or None.""" with open(mounts) as f: for line in f: if cgroup2_separator in line: return line.partition(cgroup2_separator)[0].partition(' ')[2] def mlockall(): """Lock process memory with mlockall(2); retry without MCL_ONFAULT.""" from ctypes import get_errno MCL_CURRENT = 1 MCL_FUTURE = 2 MCL_ONFAULT = 4 libc = CDLL('libc.so.6', use_errno=True) result = libc.mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) if result != 0: result = libc.mlockall(MCL_CURRENT | MCL_FUTURE) if result != 0: log('WARNING: cannot lock all memory: [Errno {}]'.format( get_errno())) else: log('All memory locked with MCL_CURRENT | MCL_FUTURE') else: log('All memory locked with MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT') def psi_file_mem_to_metrics0(psi_path): """Parse the some/full avg values of a PSI file; raise on failure.""" with open(psi_path) as f: psi_list = f.readlines() some_list, full_list = psi_list[0].split(' '), psi_list[1].split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] full_avg10 = full_list[1].split('=')[1] full_avg60 = full_list[2].split('=')[1] full_avg300 = full_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300) def psi_file_mem_to_metrics(psi_path): """Return the some/full avg values of a PSI file, or None on failure.""" foo = read_path(psi_path) if foo is None: return None try: psi_list = foo.split('\n') some_list, full_list = psi_list[0].split(' '), psi_list[1].split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] full_avg10 = full_list[1].split('=')[1] full_avg60 = full_list[2].split('=')[1] full_avg300 = full_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300, full_avg10, full_avg60, full_avg300) except Exception as e: log('{}'.format(e)) return None def psi_file_cpu_to_metrics(psi_path): """Return the 'some' avg values of a PSI file, or None on failure.""" foo = 
read_path(psi_path) if foo is None: return None try: psi_list = foo.split('\n') some_list = psi_list[0].split(' ') some_avg10 = some_list[1].split('=')[1] some_avg60 = some_list[2].split('=')[1] some_avg300 = some_list[3].split('=')[1] return (some_avg10, some_avg60, some_avg300) except Exception as e: log('{}'.format(e)) return None def psi_file_mem_to_total(psi_path): """ """ foo = read_path(psi_path) if foo is None: return None try: psi_list = foo.split('\n') some_list, full_list = psi_list[0].split(' '), psi_list[1].split(' ') some_total = some_list[4].split('=')[1] full_total = full_list[4].split('=')[1] return int(some_total), int(full_total) except Exception as e: log('{}'.format(e)) return None def psi_file_cpu_to_total(psi_path): """ """ foo = read_path(psi_path) if foo is None: return None try: psi_list = foo.split('\n') some_list = psi_list[0].split(' ') some_total = some_list[4].split('=')[1] return int(some_total) except Exception as e: log('{}'.format(e)) return None def print_head_1(): """ """ log('====================================================================' '==============================================') log(' cpu || io ' '|| memory') log('==================== || =========================================== ' '|| ===========================================') log(' some || some | full ' '|| some | full') log('-------------------- || -------------------- | -------------------- ' '|| -------------------- | --------------------') log(' avg10 avg60 avg300 || avg10 avg60 avg300 | avg10 avg60 avg300 ' '|| avg10 avg60 avg300 | avg10 avg60 avg300') log('------ ------ ------ || ------ ------ ------ | ------ ------ ------ ' '|| ------ ------ ------ | ------ ------ ------') def print_head_2(): """ """ log('======|=============|=============|') log(' cpu | io | memory |') log('----- | ----------- | ----------- |') log(' some | some full | some full | interval') log('----- | ----- ----- | ----- ----- | --------') def log(*msg): """ """ if not 
SUPPRESS_OUTPUT: print(*msg) if separate_log: logging.info(*msg) def log_head(*msg): """ """ print(*msg) if separate_log: logging.info(*msg) ############################################################################## parser = ArgumentParser() parser.add_argument( '-t', '--target', help="""target (cgroup_v2 or SYSTEM_WIDE)""", default='SYSTEM_WIDE', type=str ) parser.add_argument( '-i', '--interval', help="""interval in sec""", default=2, type=float ) parser.add_argument( '-l', '--log', help="""path to log file""", default=None, type=str ) parser.add_argument( '-m', '--mode', help="""mode (1 or 2)""", default='1', type=str ) parser.add_argument( '-s', '--suppress-output', help="""suppress output""", default='False', type=str ) args = parser.parse_args() target = args.target mode = args.mode interval = args.interval log_file = args.log suppress_output = args.suppress_output if target != 'SYSTEM_WIDE': target = '/' + target.strip('/') ############################################################################## if log_file is None: separate_log = False else: separate_log = True import logging sig_list = [SIGTERM, SIGINT, SIGQUIT, SIGHUP] for i in sig_list: signal(i, signal_handler) if separate_log: try: logging.basicConfig( filename=log_file, level=logging.INFO, format="%(asctime)s: %(message)s") except Exception as e: print(e) exit(1) if suppress_output == 'False': SUPPRESS_OUTPUT = False elif suppress_output == 'True': SUPPRESS_OUTPUT = True else: log_head('error: argument -s/--suppress-output: valid values are ' 'False and True') exit(1) if log_file is not None: logstring = 'log file: {}, '.format(log_file) else: logstring = 'log file is not set, ' if interval < 1: log_head('error: argument -i/--interval: the value must be greater than or' ' equal to 1') exit(1) if not (mode == '1' or mode == '2'): log_head('ERROR: invalid mode. Valid values are 1 and 2. 
Exit.') exit(1) try: psi_file_mem_to_metrics0('/proc/pressure/memory') except Exception as e: log_head('ERROR: {}'.format(e)) log_head('PSI metrics are not provided by the kernel. Exit.') exit(1) log_head('Starting psi2log, target: {}, mode: {}, interval: {} sec, {}suppress' ' output: {}'.format( target, mode, round(interval, 3), logstring, suppress_output)) fd = dict() if target == 'SYSTEM_WIDE': system_wide = True source_dir = '/proc/pressure' cpu_file = '/proc/pressure/cpu' io_file = '/proc/pressure/io' memory_file = '/proc/pressure/memory' log_head('PSI source dir: /proc/pressure/, source files: cpu, io, memory') else: system_wide = False mounts = '/proc/mounts' cgroup2_separator = ' cgroup2 rw,' cgroup2_mountpoint = cgroup2_root() if cgroup2_mountpoint is None: log('ERROR: unified cgroup hierarchy is not mounted, exit') exit(1) source_dir = cgroup2_mountpoint + target cpu_file = source_dir + '/cpu.pressure' io_file = source_dir + '/io.pressure' memory_file = source_dir + '/memory.pressure' log_head('PSI source dir: {}{}/, source files: cpu.pressure, io.pressure,' ' memory.pressure'.format(cgroup2_mountpoint, target)) abnormal_interval = 1.01 * interval peaks_dict = dict() mlockall() if mode == '2': print_head_2() try: total_cs0 = psi_file_cpu_to_total(cpu_file) total_is0, total_if0 = psi_file_mem_to_total(io_file) total_ms0, total_mf0 = psi_file_mem_to_total(memory_file) monotonic0 = monotonic() stdout.flush() sleep(interval) except TypeError: stdout.flush() sleep(interval) while True: try: total_cs1 = psi_file_cpu_to_total(cpu_file) total_is1, total_if1 = psi_file_mem_to_total(io_file) total_ms1, total_mf1 = psi_file_mem_to_total(memory_file) monotonic1 = monotonic() dm = monotonic1 - monotonic0 if dm > abnormal_interval and dm - interval > 0.05: log('WARNING: abnormal interval ({} sec), metrics may be ' 'reported incorrectly'.format(round(dm, 3))) monotonic0 = monotonic1 except TypeError: stdout.flush() sleep(interval) continue dtotal_cs = total_cs1 - 
total_cs0 avg_cs = dtotal_cs / dm / 10000 if 'avg_cs' not in peaks_dict or peaks_dict['avg_cs'] < avg_cs: peaks_dict['avg_cs'] = avg_cs total_cs0 = total_cs1 dtotal_is = total_is1 - total_is0 avg_is = dtotal_is / dm / 10000 if 'avg_is' not in peaks_dict or peaks_dict['avg_is'] < avg_is: peaks_dict['avg_is'] = avg_is total_is0 = total_is1 dtotal_if = total_if1 - total_if0 avg_if = dtotal_if / dm / 10000 if 'avg_if' not in peaks_dict or peaks_dict['avg_if'] < avg_if: peaks_dict['avg_if'] = avg_if total_if0 = total_if1 dtotal_ms = total_ms1 - total_ms0 avg_ms = dtotal_ms / dm / 10000 if 'avg_ms' not in peaks_dict or peaks_dict['avg_ms'] < avg_ms: peaks_dict['avg_ms'] = avg_ms total_ms0 = total_ms1 dtotal_mf = total_mf1 - total_mf0 avg_mf = dtotal_mf / dm / 10000 if 'avg_mf' not in peaks_dict or peaks_dict['avg_mf'] < avg_mf: peaks_dict['avg_mf'] = avg_mf total_mf0 = total_mf1 log('{:>5} | {:>5} {:>5} | {:>5} {:>5} | {}'.format( round(avg_cs, 1), round(avg_is, 1), round(avg_if, 1), round(avg_ms, 1), round(avg_mf, 1), round(dm, 3) )) stdout.flush() sleep(interval) print_head_1() while True: try: (c_some_avg10, c_some_avg60, c_some_avg300 ) = psi_file_cpu_to_metrics(cpu_file) (i_some_avg10, i_some_avg60, i_some_avg300, i_full_avg10, i_full_avg60, i_full_avg300 ) = psi_file_mem_to_metrics(io_file) (m_some_avg10, m_some_avg60, m_some_avg300, m_full_avg10, m_full_avg60, m_full_avg300 ) = psi_file_mem_to_metrics(memory_file) except TypeError: stdout.flush() sleep(interval) continue log('{:>6} {:>6} {:>6} || {:>6} {:>6} {:>6} | {:>6} {:>6} {:>6} || {:>6}' ' {:>6} {:>6} | {:>6} {:>6} {:>6}'.format( c_some_avg10, c_some_avg60, c_some_avg300, i_some_avg10, i_some_avg60, i_some_avg300, i_full_avg10, i_full_avg60, i_full_avg300, m_some_avg10, m_some_avg60, m_some_avg300, m_full_avg10, m_full_avg60, m_full_avg300 )) c_some_avg10 = float(c_some_avg10) if ('c_some_avg10' not in peaks_dict or peaks_dict['c_some_avg10'] < c_some_avg10): peaks_dict['c_some_avg10'] = c_some_avg10 
c_some_avg60 = float(c_some_avg60) if ('c_some_avg60' not in peaks_dict or peaks_dict['c_some_avg60'] < c_some_avg60): peaks_dict['c_some_avg60'] = c_some_avg60 c_some_avg300 = float(c_some_avg300) if ('c_some_avg300' not in peaks_dict or peaks_dict['c_some_avg300'] < c_some_avg300): peaks_dict['c_some_avg300'] = c_some_avg300 ####################################################################### i_some_avg10 = float(i_some_avg10) if ('i_some_avg10' not in peaks_dict or peaks_dict['i_some_avg10'] < i_some_avg10): peaks_dict['i_some_avg10'] = i_some_avg10 i_some_avg60 = float(i_some_avg60) if ('i_some_avg60' not in peaks_dict or peaks_dict['i_some_avg60'] < i_some_avg60): peaks_dict['i_some_avg60'] = i_some_avg60 i_some_avg300 = float(i_some_avg300) if ('i_some_avg300' not in peaks_dict or peaks_dict['i_some_avg300'] < i_some_avg300): peaks_dict['i_some_avg300'] = i_some_avg300 i_full_avg10 = float(i_full_avg10) if ('i_full_avg10' not in peaks_dict or peaks_dict['i_full_avg10'] < i_full_avg10): peaks_dict['i_full_avg10'] = i_full_avg10 i_full_avg60 = float(i_full_avg60) if ('i_full_avg60' not in peaks_dict or peaks_dict['i_full_avg60'] < i_full_avg60): peaks_dict['i_full_avg60'] = i_full_avg60 i_full_avg300 = float(i_full_avg300) if ('i_full_avg300' not in peaks_dict or peaks_dict['i_full_avg300'] < i_full_avg300): peaks_dict['i_full_avg300'] = i_full_avg300 ####################################################################### m_some_avg10 = float(m_some_avg10) if ('m_some_avg10' not in peaks_dict or peaks_dict['m_some_avg10'] < m_some_avg10): peaks_dict['m_some_avg10'] = m_some_avg10 m_some_avg60 = float(m_some_avg60) if ('m_some_avg60' not in peaks_dict or peaks_dict['m_some_avg60'] < m_some_avg60): peaks_dict['m_some_avg60'] = m_some_avg60 m_some_avg300 = float(m_some_avg300) if ('m_some_avg300' not in peaks_dict or peaks_dict['m_some_avg300'] < m_some_avg300): peaks_dict['m_some_avg300'] = m_some_avg300 m_full_avg10 = float(m_full_avg10) if ('m_full_avg10' 
not in peaks_dict or peaks_dict['m_full_avg10'] < m_full_avg10): peaks_dict['m_full_avg10'] = m_full_avg10 m_full_avg60 = float(m_full_avg60) if ('m_full_avg60' not in peaks_dict or peaks_dict['m_full_avg60'] < m_full_avg60): peaks_dict['m_full_avg60'] = m_full_avg60 m_full_avg300 = float(m_full_avg300) if ('m_full_avg300' not in peaks_dict or peaks_dict['m_full_avg300'] < m_full_avg300): peaks_dict['m_full_avg300'] = m_full_avg300 stdout.flush() sleep(interval) nohang-0.2.0/systemd/000077500000000000000000000000001377337215500144705ustar00rootroot00000000000000nohang-0.2.0/systemd/nohang-desktop.service.in000066400000000000000000000033231377337215500214010ustar00rootroot00000000000000[Unit] Description=Sophisticated low memory handler Documentation=man:nohang(8) https://github.com/hakavlad/nohang Conflicts=nohang.service After=sysinit.target [Service] ExecStart=:TARGET_SBINDIR:/nohang --monitor --config :TARGET_SYSCONFDIR:/nohang/nohang-desktop.conf Slice=hostcritical.slice SyslogIdentifier=nohang-desktop KillMode=mixed Restart=always RestartSec=0 CPUSchedulingResetOnFork=true RestrictRealtime=yes TasksMax=25 MemoryMax=100M MemorySwapMax=100M UMask=0027 ProtectSystem=strict ReadWritePaths=/var/log InaccessiblePaths=/home /root ProtectKernelTunables=true ProtectKernelModules=true ProtectControlGroups=true ProtectHostname=true MemoryDenyWriteExecute=yes RestrictNamespaces=yes LockPersonality=yes PrivateTmp=true DeviceAllow=/dev/kmsg rw DevicePolicy=closed # Capabilities whitelist: # CAP_KILL is required to send signals # CAP_IPC_LOCK is required to mlockall() # CAP_SYS_PTRACE is required to check /proc/[pid]/exe realpathes # CAP_DAC_READ_SEARCH is required to read /proc/[pid]/environ files # CAP_DAC_OVERRIDE fixes #94 # CAP_DAC_READ_SEARCH CAP_AUDIT_WRITE CAP_SETUID CAP_SETGID CAP_SYS_RESOURCE # are required to send GUI notifications # CAP_SYSLOG is required to check /dev/kmsg for OOM events CapabilityBoundingSet=CAP_KILL CAP_IPC_LOCK CAP_SYS_PTRACE \ 
CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_AUDIT_WRITE CAP_SETUID CAP_SETGID \ CAP_SYS_RESOURCE CAP_SYSLOG # `PrivateNetwork=true` breaks GUI notifications on oldstable distros # (Debian 8, CentOS 7, Linux Mint 18). On modern distros you can set # PrivateNetwork=true for security reasons. #PrivateNetwork=true # Set realtime CPU scheduling policy if you want #CPUSchedulingPolicy=rr #CPUSchedulingPriority=1 [Install] WantedBy=multi-user.target nohang-0.2.0/systemd/nohang.service.in000066400000000000000000000033131377337215500177310ustar00rootroot00000000000000[Unit] Description=Sophisticated low memory handler Documentation=man:nohang(8) https://github.com/hakavlad/nohang Conflicts=nohang-desktop.service After=sysinit.target [Service] ExecStart=:TARGET_SBINDIR:/nohang --monitor --config :TARGET_SYSCONFDIR:/nohang/nohang.conf Slice=hostcritical.slice SyslogIdentifier=nohang KillMode=mixed Restart=always RestartSec=0 CPUSchedulingResetOnFork=true RestrictRealtime=yes TasksMax=25 MemoryMax=100M MemorySwapMax=100M UMask=0027 ProtectSystem=strict ReadWritePaths=/var/log InaccessiblePaths=/home /root ProtectKernelTunables=true ProtectKernelModules=true ProtectControlGroups=true ProtectHostname=true MemoryDenyWriteExecute=yes RestrictNamespaces=yes LockPersonality=yes PrivateTmp=true DeviceAllow=/dev/kmsg rw DevicePolicy=closed # Capabilities whitelist: # CAP_KILL is required to send signals # CAP_IPC_LOCK is required to mlockall() # CAP_SYS_PTRACE is required to check /proc/[pid]/exe realpathes # CAP_DAC_READ_SEARCH is required to read /proc/[pid]/environ files # CAP_DAC_OVERRIDE fixes #94 # CAP_DAC_READ_SEARCH CAP_AUDIT_WRITE CAP_SETUID CAP_SETGID CAP_SYS_RESOURCE # are required to send GUI notifications # CAP_SYSLOG is required to check /dev/kmsg for OOM events CapabilityBoundingSet=CAP_KILL CAP_IPC_LOCK CAP_SYS_PTRACE \ CAP_DAC_READ_SEARCH CAP_DAC_OVERRIDE CAP_AUDIT_WRITE CAP_SETUID CAP_SETGID \ CAP_SYS_RESOURCE CAP_SYSLOG # `PrivateNetwork=true` breaks GUI notifications 
on oldstable distros # (Debian 8, CentOS 7, Linux Mint 18). On modern distros you can set # PrivateNetwork=true for security reasons. #PrivateNetwork=true # Set realtime CPU scheduling policy if you want #CPUSchedulingPolicy=rr #CPUSchedulingPriority=1 [Install] WantedBy=multi-user.target