urlwatch-2.17/.gitignore

__pycache__
.idea
build

urlwatch-2.17/.travis.yml

language: python
python:
  - "3.4"
  - "3.5"
  - "3.6"
install:
  - python setup.py install_dependencies
script: nosetests -v

urlwatch-2.17/CHANGELOG.md

# Changelog

All notable changes to this project will be documented in this file.

The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

## [2.17] -- 2019-04-12

### Added
- XPath/CSS: Support for excluding elements (#333, by Chenfeng Bao)
- Add support for using external `diff_tool` on Windows (#373, by Chenfeng Bao)
- Document how to use Amazon Simple E-Mail Service "SES" (by mborsetti)
- Compare data with multiple old versions (`compared_versions`, #328, by Chenfeng Bao)

### Fixed
- YAML: Fix deprecation warnings (#367, by Florent Aide)
- Updated manpage with new options: Authentication, filter tests (Fixes #351)
- Text formatter: Do not emit empty lines for `line_length=0` (Fixes #357)

### Changed
- SMTP configuration fix: Only use smtp.user config if it's a non-empty value

## [2.16] -- 2019-01-27

### Added
- XPath: Handle `/text()` selector (#282)
- Document how to specify cookies to README.md (#264)
- Text Reporter: `minimal` config option to only print a summary (PR#304, fixes #147)
- README.md: Document how to watch Github releases via XPath (#266)
- Support for parsing XML/RSS with XPath (Fixes #281)
- Allow explicit setting of `encoding` for URL jobs (PR#313, contributes to #306)
- Slack Channel Reporter (PR#309)
- ANSI color output on the
Windows console via `colorama` (PR#296, closes #295)
- Support for using CSS selectors via the `cssselect` module (PR#321, closes #273)
- `ignore_http_error_codes` is now an option for URL jobs (PR#325, fixes #203)
- `job_defaults` in the config for globally specifying settings (PR#345, closes #253)
- Optional `timeout` (in seconds) for URL jobs to specify socket timeout (PR#348, closes #340)

### Removed
- Support for JSON storage (dead code that was never used in production; PR#336)

### Changed
- `HtmlReporter` now also highlights links for browser jobs (PR#303)
- Allow `--features` and `--edit-*` to run without `urls.yaml` (PR#301)
- When a previous run had errors, do not use conditional GETs (PR#313, fixes #292)
- Explicitly specify JSON pretty print `separators` for consistency (PR#343)
- Use data-driven unit tests/fixtures for easier unit test maintenance (PR#344)

### Fixed
- Fix migration issues with case-insensitive filesystems (#223)
- Correctly reset retry counter when job is added or unchanged (PR#291, PR#314)
- Fix a `FutureWarning` on Python 3.7 with regard to regular expressions (PR#299)
- If the filter list is empty, do not process the filter list (PR#308)
- Fix parsing/sanity-checking of `urls.yaml` after editing (PR#317, fixes #316)
- Fix Python 3.3 compatibility by depending on `enum34` there (PR#311)
- Allow running unit tests on Windows (PR#318)
- Fix migration issues introduced by PR#180 and #256 (PR#323, fixes #267)

## [2.15] -- 2018-10-23

### Added
- Support for Mailgun regions (by Daniel Peukert, PR#280)
- CLI: Allow multiple occurrences of 'filter' when adding jobs (PR#278)

### Changed
- Fixed incorrect name for chat_id config in the default config (by Robin B, PR#276)

## [2.14] -- 2018-08-30

### Added
- Filter to pretty-print JSON data: `format-json` (by Niko Böckerman, PR#250)
- List active Telegram chats using `--telegram-chats` (with fixes by Georg Pichler, PR#270)
- Support for HTTP `ETag` header in URL jobs and `If-None-Match` (by Karol
Babioch, PR#256)
- Support for filtering HTML using XPath expressions, with `lxml` (PR#274, Fixes #226)
- Added `install_dependencies` to `setup.py` commands for easy installation of dependencies
- Added `ignore_connection_errors` per-job configuration option (by Karol Babioch, PR#261)

### Changed
- Improved code (HTTP status codes, by Karol Babioch PR#258)
- Improved documentation for setting up Telegram chat bots
- Allow multiple chats for Telegram reporting (by Georg Pichler, PR#271)

## [2.13] -- 2018-06-03

### Added
- Support for specifying a `diff_tool` (e.g. `wdiff`) for each job (Fixes #243)
- Support for testing filters via `--test-filter JOB` (Fixes #237)

### Changed
- Moved ChangeLog file to CHANGELOG.md, using the Keep a Changelog format.
- Force version check in `setup.py`, to exclude Python 2 (Fixes #244)
- Remove default parameter from internal `html2text` module (Fixes #239)
- Better error/exception reporting in `--verbose` mode (Fixes #164)

### Removed
- Old ChangeLog entries

## [2.12] -- 2018-06-01

### Fixed
- Bugfix: Do not 'forget' old data if an exception occurs (Fixes #242)

## [2.11] -- 2018-05-19

### Fixed
- Retry: Make sure `tries` is initialized to zero on load (Fixes #241)

### Changed
- html2text: Make sure the bs4 method strips HTML tags (by Louis Sautier)

## [2.10] -- 2018-05-17

### Added
- Browser: Add support for browser jobs using `requests-html` (Fixes #215)
- Retry: Add support for optional retry count in job list (by cmichi, fixes #235)
- HTTP: Add support for specifying optional headers (by Tero Mononen)

### Changed
- File editing: Fix issue when `$EDITOR` contains spaces (Fixes #220)
- ChangeLog: Add versions to recent ChangeLog entries (Fixes #235)

## [2.9] -- 2018-03-24

### Added
- E-Mail: Add support for `--smtp-login` and document GMail SMTP usage
- Pushover: Device and sound attribute (by Tobias Haupenthal)

### Changed
- XDG: Move cache file to `XDG_CACHE_DIR` (by Maxime Werlen)
- Migration: Unconditionally migrate urlwatch 1.x
cache dirs (Fixes #206)

### Fixed
- Cleanups: Fix out-of-date debug message, use https (by Jakub Wilk)

## [2.8] -- 2018-01-28

### Changed
- Documentation: Mention `appdirs` (by e-dschungel)

### Fixed
- SMTP: Fix handling of missing `user` field (by e-dschungel)
- Manpage: Fix documentation of XDG environment variables (by Jelle van der Waa)
- Unit tests: Fix imports for out-of-source-tree tests (by Maxime Werlen)

## [2.7] -- 2017-11-08

### Added
- Filtering: `style` (by gvandenbroucke), `tag` (by cmichi)
- New reporter: Telegram support (by gvandenbroucke)
- Paths: Add `XDG_CONFIG_DIR` support (by Jelle van der Waa)

### Changed
- ElementsByAttribute: look for matching tag in handle_endtag (by Gaetan Leurent)
- HTTP: Option to avoid 304 responses, `Content-Type` header (by Vinicius Massuchetto)
- html2text: Configuration options (by Vinicius Massuchetto)

### Fixed
- Issue #127: Fix error reporting
- E-Mail: Fix encodings (by Seokjin Han), Allow `user` parameter for SMTP (by Jay Sitter)

## [2.6] -- 2016-12-04

### Added
- New filters: `sha1sum`, `hexdump`, `element-by-class`
- New reporters: pushbullet (by R0nd); mailgun (by lechuckcaptain)

### Changed
- Improved filters: `BeautifulSoup` support for `html2txt` (by lechuckcaptain)
- Improved handlers: HTTP Proxy (by lechuckcaptain); support for `file://` URIs
- CI Integration: Build configuration for Travis CI (by lechuckcaptain)
- Consistency: Feature list is now sorted by name

### Fixed
- Issue #108: Fix creation of example files on first startup
- Issue #118: Fix match filters for missing keys
- Small fixes by: Jakub Wilk, Marc Urben, Adam Dobrawy and Louis Sautier

Older ChangeLog entries can be found in the
[old ChangeLog file](https://github.com/thp/urlwatch/blob/2.12/ChangeLog),
or with `git show 2.12:ChangeLog` on the command line.

urlwatch-2.17/COPYING

Copyright (c) 2008-2019 Thomas Perl <m@thp.io>
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
urlwatch-2.17/MANIFEST.in

include CHANGELOG.md COPYING README.md
recursive-include share *
recursive-include test/data *

urlwatch-2.17/README.md

[![Build Status](https://travis-ci.org/thp/urlwatch.svg)](https://travis-ci.org/thp/urlwatch)
[![Packaging status](https://repology.org/badge/tiny-repos/urlwatch.svg)](https://repology.org/metapackage/urlwatch/versions)
[![PyPI version](https://badge.fury.io/py/urlwatch.svg)](https://badge.fury.io/py/urlwatch)

```
 _   _ _ __| |_      ____ _| |_ ___| |__   |___ \
| | | | '__| \ \ /\ / / _` | __/ __| '_ \   __) |
| |_| | |  | |\ V  V / (_| | || (__| | | |  / __/
 \__,_|_|  |_| \_/\_/ \__,_|\__\___|_| |_| |_____|

... monitors webpages for you
```

urlwatch is intended to help you watch changes in webpages and get notified
(via e-mail, in your terminal or through various third party services) of any
changes. The change notification will include the URL that has changed and
a unified diff of what has changed.
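To illustrate what such a notification contains: the unified diff urlwatch reports is the same format Python's standard `difflib` produces for two page snapshots. This is illustration only, not urlwatch's internal code, and the snapshot text is made up:

```python
# Illustration only -- not urlwatch's internal code. Python's standard
# difflib produces the same kind of unified diff that urlwatch includes
# in its change notifications (the page snapshots here are made up).
import difflib

old_snapshot = ['Current version: 2.16\n']
new_snapshot = ['Current version: 2.17\n']

diff = ''.join(difflib.unified_diff(old_snapshot, new_snapshot,
                                    fromfile='old', tofile='new'))
print(diff)
```

Removed lines are prefixed with `-` and added lines with `+`, which is exactly the shape of the diffs urlwatch sends in its reports.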
DEPENDENCIES
------------

urlwatch 2 requires:

* Python 3.3 or newer
* [PyYAML](http://pyyaml.org/)
* [minidb](https://thp.io/2010/minidb/)
* [requests](http://python-requests.org/)
* [keyring](https://github.com/jaraco/keyring/)
* [appdirs](https://github.com/ActiveState/appdirs)
* [lxml](https://lxml.de)
* [cssselect](https://cssselect.readthedocs.io)
* [enum34](https://pypi.org/project/enum34/) (Python 3.3 only)

The dependencies can be installed with (add `--user` to install to `$HOME`):

`python3 -m pip install pyyaml minidb requests keyring appdirs lxml cssselect`

Optional dependencies (install via `python3 -m pip install <packagename>`):

* Pushover reporter: [chump](https://github.com/karanlyons/chump/)
* Pushbullet reporter: [pushbullet.py](https://github.com/randomchars/pushbullet.py)
* Stdout reporter with color on Windows: [colorama](https://github.com/tartley/colorama)
* "browser" job kind: [requests-html](https://html.python-requests.org)
* Unit testing: [pycodestyle](http://pycodestyle.pycqa.org/en/latest/)

QUICK START
-----------

1. Start `urlwatch` to migrate your old data or start fresh
2. Use `urlwatch --edit` to customize your job list (this will create/edit `urls.yaml`)
3. Use `urlwatch --edit-config` if you want to set up e-mail sending
4. Use `urlwatch --edit-hooks` if you want to write custom subclasses
5. Add `urlwatch` to your crontab (`crontab -e`) to monitor webpages periodically

The checking interval is defined by how often you run `urlwatch`. You can use
e.g. [crontab.guru](https://crontab.guru) to figure out the schedule expression
for the checking interval; we recommend not more often than 30 minutes (this
would be `*/30 * * * *`). If you have never used cron before, check out the
[crontab command help](https://www.computerhope.com/unix/ucrontab.htm).

On Windows, `cron` is not installed by default.
Use the [Windows Task Scheduler](https://en.wikipedia.org/wiki/Windows_Task_Scheduler)
instead, or see [this StackOverflow question](https://stackoverflow.com/q/132971/1047040)
for alternatives.

TIPS AND TRICKS
---------------

Quickly adding new URLs to the job list from the command line:

```urlwatch --add url=http://example.org,name=Example```

You can pick only a given HTML element with the built-in filter; for example,
to extract ```<div id="something">...</div>``` from a page, you can use the
following in your urls.yaml:

```yaml
url: http://example.org/
filter: element-by-id:something
```

Also, you can chain filters, so you can run html2text on the result:

```yaml
url: http://example.net/
filter: element-by-id:something,html2text
```

The example urls.yaml file also demonstrates the use of built-in filters; here,
three filters are used: html2text, line-grep and whitespace removal to get just
a certain info field from a webpage:

```yaml
url: https://thp.io/2008/urlwatch/
filter: html2text,grep:Current.*version,strip
```

For most cases, this means that you can specify a filter chain in your
urls.yaml page without requiring a custom hook where previously you would have
needed to write custom filtering code in Python.

If you are using the `grep` filter, you can grep for a comma (`,`) by using
`\054` (`:` does not need to be escaped separately and can be used as-is). For
example, to convert HTML to text, then grep for `a,b:`, and then strip
whitespace, use this:

```yaml
url: https://example.org/
filter: html2text,grep:a\054b:,strip
```

If you want to extract only the body tag you can use this filter:

```yaml
url: https://thp.io/2008/urlwatch/
filter: element-by-tag:body
```

You can also specify an external `diff`-style tool (a tool that takes two
filenames (old, new) as parameters and prints the difference of the files to
its standard output), for example to use GNU `wdiff` to get word-based
differences instead of line-based differences:

```yaml
url: https://example.com/
diff_tool: wdiff
```

Note that `diff_tool` specifies an external command-line tool, so that tool
must be installed separately (e.g. `apt install wdiff` on Debian or
`brew install wdiff` on macOS). Coloring is supported for `wdiff`-style
output, but potentially not for other diff tools.
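Because `diff_tool` is just a command line, you can also point it at your own script. The sketch below is a hypothetical stand-alone tool (the script name `worddiff.py` and the `[-...-]`/`{+...+}` markers are made up, loosely imitating `wdiff`) that follows the required convention: it accepts two filenames and writes the difference to standard output:

```python
#!/usr/bin/env python3
# Hypothetical custom diff_tool for urlwatch (script name "worddiff.py" is
# made up). urlwatch invokes `diff_tool old_file new_file` and reports the
# tool's stdout, so any script with that calling convention works.
import difflib
import sys


def word_diff(old_text, new_text):
    # Compare word-by-word (loosely imitating GNU wdiff), marking
    # deletions as [-...-] and insertions as {+...+}.
    matcher = difflib.SequenceMatcher(a=old_text.split(), b=new_text.split())
    out = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ('delete', 'replace'):
            out.append('[-%s-]' % ' '.join(matcher.a[a1:a2]))
        if op in ('insert', 'replace'):
            out.append('{+%s+}' % ' '.join(matcher.b[b1:b2]))
        if op == 'equal':
            out.extend(matcher.a[a1:a2])
    return ' '.join(out)


if __name__ == '__main__' and len(sys.argv) >= 3:
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        print(word_diff(f_old.read(), f_new.read()))
```

You could then reference it in a job with something like `diff_tool: python3 /path/to/worddiff.py` (assuming the script is reachable at that path).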
To filter based on an
[XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) expression, you can
use the `xpath` filter like so (see Microsoft's
[XPath Examples](https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx)
page for some other examples):

```yaml
url: https://example.net/
filter: xpath:/body
```

This filters only the `<body>` element of the HTML document, stripping out
everything else.

To filter based on a
[CSS selector](https://www.w3.org/TR/2011/REC-css3-selectors-20110929/), you
can use the `css` filter like so:

```yaml
url: https://example.net/
filter: css:body
```

Some limitations and extensions exist as explained in
[cssselect's documentation](https://cssselect.readthedocs.io/en/latest/#supported-selectors).

In some cases, it might be useful to ignore (temporary) network errors to
avoid notifications being sent. While there is a `display.error` config option
(defaulting to `True`) to control reporting of errors globally, to ignore
network errors for specific jobs only, you can use the
`ignore_connection_errors` key in the job list configuration file:

```yaml
url: https://example.com/
ignore_connection_errors: true
```

Similarly, you might want to ignore some (temporary) HTTP errors on the
server side:

```yaml
url: https://example.com/
ignore_http_error_codes: 408, 429, 500, 502, 503, 504
```

or ignore all HTTP errors if you like:

```yaml
url: https://example.com/
ignore_http_error_codes: 4xx, 5xx
```

For web pages with misconfigured HTTP headers or rare encodings, it may be
useful to explicitly specify an encoding from Python's
[Standard Encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).

```yaml
url: https://example.com/
encoding: utf-8
```

By default, url jobs time out after 60 seconds. If you want a different
timeout period, use the `timeout` key to specify it in number of seconds, or
set it to 0 to never time out.
```yaml
url: https://example.com/
timeout: 300
```

If you want to change some settings for all your jobs, edit the `job_defaults`
section in your config file:

```yaml
...
job_defaults:
  all:
    diff_tool: wdiff
  url:
    ignore_connection_errors: true
```

The above config file sets all jobs to use wdiff as diff tool, and all "url"
jobs to ignore connection errors.

PUSHOVER
--------

You can configure urlwatch to send real-time notifications about changes via
[Pushover](https://pushover.net/). To enable this, ensure you have the `chump`
Python package installed (see DEPENDENCIES). Then edit your config
(`urlwatch --edit-config`) and enable pushover. You will also need to add to
the config your Pushover user key and a unique app key (generated by
registering urlwatch as an application on your
[Pushover account](https://pushover.net/apps/build)).

PUSHBULLET
----------

Pushbullet notifications are configured similarly to Pushover (see above).
You'll need to add to the config your Pushbullet Access Token, which you can
generate at https://www.pushbullet.com/#settings

TELEGRAM
--------

Telegram notifications are configured using the Telegram Bot API. For this,
you'll need a Bot API token and a chat id (see https://core.telegram.org/bots).
Sample configuration:

```yaml
telegram:
  bot_token: '999999999:3tOhy2CuZE0pTaCtszRfKpnagOG8IQbP5gf' # your bot api token
  chat_id: '88888888' # the chat id where the messages should be sent
  enabled: true
```

To set up Telegram, from your Telegram app, chat up BotFather (New Message,
Search, "BotFather"), then say `/newbot` and follow the instructions.
Eventually it will tell you the bot token (in the form seen above,
`<number>:<random string>`); add this to your config file.

You can then click on the link of your bot, which will send the message
`/start`. At this point, you can use the command `urlwatch --telegram-chats`
to list the private chats the bot is involved with. This is the chat ID that
you need to put into the config file as `chat_id`.
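Under the hood, `--telegram-chats` queries the Bot API's `getUpdates` endpoint and picks out the private chats it sees. A rough sketch of that lookup follows; the function names are illustrative, not part of urlwatch's API, and `fetch_updates()` performs a live HTTP call that needs your real bot token:

```python
# Rough sketch of what `urlwatch --telegram-chats` does internally; function
# names are illustrative, not urlwatch API. fetch_updates() makes a live HTTP
# call to the Telegram Bot API and needs your real bot token.
import json
import urllib.request


def extract_private_chats(updates):
    """Map chat ID -> display name for all private chats in a getUpdates response."""
    chats = {}
    for update in updates.get('result', []):
        chat = update.get('message', {}).get('chat', {})
        if chat.get('type') == 'private':
            name = ' '.join(filter(None, (chat.get('first_name'),
                                          chat.get('last_name'))))
            chats[str(chat['id'])] = name
    return chats


def fetch_updates(bot_token):
    url = 'https://api.telegram.org/bot{}/getUpdates'.format(bot_token)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Example (requires that you have already messaged your bot):
# chats = extract_private_chats(fetch_updates('YOUR-BOT-TOKEN'))
```

Remember that `getUpdates` only returns recent messages, so say hello to your bot first or the chat list will be empty.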
You may add multiple chat IDs as a YAML list:

```yaml
telegram:
  bot_token: '999999999:3tOhy2CuZE0pTaCtszRfKpnagOG8IQbP5gf' # your bot api token
  chat_id:
    - '11111111'
    - '22222222'
  enabled: true
```

Don't forget to also enable the reporter.

SLACK
-----

Slack notifications are configured using "Slack Incoming Webhooks". Here is a
sample configuration:

```yaml
slack:
  webhook_url: 'https://hooks.slack.com/services/T50TXXXXXU/BDVYYYYYYY/PWTqwyFM7CcCfGnNzdyDYZ'
  enabled: true
```

To set up Slack, from your Slack team, create a new app and activate "Incoming
Webhooks" on a channel; you'll get a webhook URL, which you copy into the
configuration as seen above. You can use the command `urlwatch --test-slack`
to test if the Slack integration works.

BROWSER
-------

If the webpage you are trying to watch runs client-side JavaScript to render
the page, [Requests-HTML](http://html.python-requests.org) can now be used to
render the page in a headless Chromium instance first and then use the HTML
of the resulting page. Use the `browser` kind in the configuration and the
`navigate` key to set the URL to retrieve. Note that the normal `url` job
keys are not supported for the `browser` job types at the moment, for example:

```yaml
kind: browser
name: "A Page With JavaScript"
navigate: http://example.org/
```

E-MAIL VIA GMAIL SMTP
---------------------

You need to configure your GMail account to allow for "less secure"
(password-based) apps to log in:

1. Go to https://myaccount.google.com/
2. Click on "Sign-in & security"
3.
Scroll all the way down to "Allow less secure apps" and enable it

Now, start the configuration editor: `urlwatch --edit-config`

These are the keys you need to configure (see #158):

- `report/email/enabled`: `true`
- `report/email/from`: `your.username@gmail.com` (edit accordingly)
- `report/email/method`: `smtp`
- `report/email/smtp/host`: `smtp.gmail.com`
- `report/email/smtp/keyring`: `true`
- `report/email/smtp/port`: `587`
- `report/email/smtp/starttls`: `true`
- `report/email/to`: The e-mail address you want to send reports to

The password itself is not stored in the config file, but in your keychain.
To store the password, run: `urlwatch --smtp-login` and enter your password.

E-MAIL VIA AMAZON SIMPLE EMAIL SERVICE (SES)
--------------------------------------------

Start the configuration editor: `urlwatch --edit-config`

These are the keys you need to configure:

- `report/email/enabled`: `true`
- `report/email/from`: `you@verified_domain.com` (edit accordingly)
- `report/email/method`: `smtp`
- `report/email/smtp/host`: `email-smtp.us-west-2.amazonaws.com` (edit accordingly)
- `report/email/smtp/user`: `ABCDEFGHIJ1234567890` (edit accordingly)
- `report/email/smtp/keyring`: `true`
- `report/email/smtp/port`: `587` (25 or 465 also work)
- `report/email/smtp/starttls`: `true`
- `report/email/to`: The e-mail address you want to send reports to

The password is not stored in the config file, but in your keychain. To store
the password, run: `urlwatch --smtp-login` and enter your password.

TESTING FILTERS
---------------

While creating your filter pipeline, you might want to preview what the
filtered output looks like.
You can do so by first configuring your job and then running urlwatch with the
`--test-filter` command, passing in the index (from `--list`) or the
URL/location of the job to be tested:

```
urlwatch --test-filter 1   # Test the first job in the list
urlwatch --test-filter https://example.net/   # Test the job with the given URL
```

The output of this command will be the filtered plaintext of the job; this is
the output that will (in a real urlwatch run) be the input to the diff
algorithm.

SENDING COOKIES
---------------

It is possible to add cookies to HTTP requests for pages that need them; the
YAML syntax for this is:

```yaml
url: http://example.com/
cookies:
    Key: ValueForKey
    OtherKey: OtherValue
```

WATCHING GITHUB RELEASES
------------------------

This is an example of how to watch the GitHub "releases" page for a given
project for the latest release version, to be notified of new releases:

```yaml
url: "https://github.com/thp/urlwatch/releases/latest"
filter:
  - xpath: '(//div[contains(@class,"release-timeline-tags")]//h4)[1]/a'
  - html2text: re
```

USING XPATH AND CSS FILTERS WITH XML AND EXCLUSIONS
---------------------------------------------------

By default, XPath and CSS filters are set up for HTML documents. However, it
is possible to use them for XML documents as well (these examples parse an
RSS feed and filter only the titles and publication dates):

```yaml
url: 'https://heronebag.com/blog/index.xml'
filter:
  - xpath:
      path: '//item/title/text()|//item/pubDate/text()'
      method: xml
```

```yaml
url: 'https://heronebag.com/blog/index.xml'
filter:
  - css:
      selector: 'item > title, item > pubDate'
      method: xml
  - html2text: re
```

Another useful option with XPath and CSS filters is `exclude`. Elements
selected by this `exclude` expression are removed from the final result.
For example, the following job will not have any `<a>` tag in its results:

```yaml
url: https://example.org/
filter:
  - css:
      selector: 'body'
      exclude: 'a'
```

COMPARE WITH SEVERAL LATEST SNAPSHOTS
-------------------------------------

If a webpage frequently changes between several known stable states, it may
be desirable to have changes reported only if the webpage changes into a new
unknown state. You can use `compared_versions` to do this.

```yaml
url: https://example.com/
compared_versions: 3
```

In this example, changes are only reported if the webpage becomes different
from the latest three distinct states. The differences are shown relative to
the closest match.

MIGRATION FROM URLWATCH 1.x
---------------------------

Migration from urlwatch 1.x should be automatic on first start. Here is a
quick rundown of changes in 2.0:

* URLs are stored in a YAML file now, with direct support for specifying
  names for jobs, different job kinds, directly applying filters, selecting
  the HTTP request method, specifying POST data as dictionary and much more
* The cache directory has been replaced with a SQLite 3 database file
  "cache.db" in minidb format, storing all change history (use `--gc-cache`
  to remove old changes if you don't need them anymore) for further analysis
* The hooks mechanism has been replaced with support for creating new job
  kinds by subclassing, new filters (also by subclassing) as well as new
  reporters (pieces of code that put the results somewhere, for example the
  default installation contains the "stdout" reporter that writes to the
  console and the "email" reporter that can send HTML and text e-mails)
* A configuration file - urlwatch.yaml - has been added for specifying user
  preferences instead of having to supply everything via the command line

CONTACT
-------

Website: https://thp.io/2008/urlwatch/
E-Mail: m@thp.io
urlwatch-2.17/lib/urlwatch/__init__.py

"""urlwatch monitors webpages for you

urlwatch is intended to help you watch changes in webpages and get notified
(via e-mail, in your terminal or through various third party services) of any
changes. The change notification will include the URL that has changed and
a unified diff of what has changed.
"""

pkgname = 'urlwatch'

__copyright__ = 'Copyright 2008-2019 Thomas Perl'
__author__ = 'Thomas Perl <m@thp.io>'
__license__ = 'BSD'
__url__ = 'https://thp.io/2008/urlwatch/'
__version__ = '2.17'
__user_agent__ = '%s/%s (+https://thp.io/2008/urlwatch/info.html)' % (pkgname, __version__)

urlwatch-2.17/lib/urlwatch/command.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl <m@thp.io>
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
# # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import imp import logging import os import shutil import sys import requests from .filters import FilterBase from .handler import JobState from .jobs import JobBase, UrlJob from .reporters import ReporterBase from .util import atomic_rename, edit_file from .mailer import set_password, have_password logger = logging.getLogger(__name__) class UrlwatchCommand: def __init__(self, urlwatcher): self.urlwatcher = urlwatcher self.urlwatch_config = urlwatcher.urlwatch_config def edit_hooks(self): fn_base, fn_ext = os.path.splitext(self.urlwatch_config.hooks) hooks_edit = fn_base + '.edit' + fn_ext try: if os.path.exists(self.urlwatch_config.hooks): shutil.copy(self.urlwatch_config.hooks, hooks_edit) elif self.urlwatch_config.hooks_py_example is not None and os.path.exists( self.urlwatch_config.hooks_py_example): shutil.copy(self.urlwatch_config.hooks_py_example, hooks_edit) edit_file(hooks_edit) imp.load_source('hooks', hooks_edit) atomic_rename(hooks_edit, self.urlwatch_config.hooks) print('Saving edit changes in', self.urlwatch_config.hooks) except SystemExit: raise except Exception as e: print('Parsing failed:') print('======') print(e) print('======') print('') print('The file', self.urlwatch_config.hooks, 'was NOT updated.') print('Your changes have been 
saved in', hooks_edit)
            return 1

        return 0

    def show_features(self):
        print()
        print('Supported jobs:\n')
        print(JobBase.job_documentation())
        print('Supported filters:\n')
        print(FilterBase.filter_documentation())
        print()
        print('Supported reporters:\n')
        print(ReporterBase.reporter_documentation())
        print()
        return 0

    def list_urls(self):
        for idx, job in enumerate(self.urlwatcher.jobs):
            if self.urlwatch_config.verbose:
                print('%d: %s' % (idx + 1, repr(job)))
            else:
                pretty_name = job.pretty_name()
                location = job.get_location()
                if pretty_name != location:
                    print('%d: %s (%s)' % (idx + 1, pretty_name, location))
                else:
                    print('%d: %s' % (idx + 1, pretty_name))
        return 0

    def _find_job(self, query):
        try:
            index = int(query)
            if index <= 0:
                return None
            try:
                return self.urlwatcher.jobs[index - 1]
            except IndexError:
                return None
        except ValueError:
            return next((job for job in self.urlwatcher.jobs if job.get_location() == query), None)

    def test_filter(self):
        job = self._find_job(self.urlwatch_config.test_filter)
        # Check for a missing job before dereferencing it; calling
        # with_defaults() on None would raise an AttributeError
        if job is None:
            print('Not found: %r' % (self.urlwatch_config.test_filter,))
            return 1
        job = job.with_defaults(self.urlwatcher.config_storage.config)

        if isinstance(job, UrlJob):
            # Force re-retrieval of job, as we're testing filters
            job.ignore_cached = True

        job_state = JobState(self.urlwatcher.cache_storage, job)
        job_state.process()
        if job_state.exception is not None:
            raise job_state.exception
        print(job_state.new_data)
        # We do not save the job state or job on purpose here, since we are possibly modifying the job
        # (ignore_cached) and we do not want to store the newly-retrieved data yet (filter testing)
        return 0

    def modify_urls(self):
        save = True
        if self.urlwatch_config.delete is not None:
            job = self._find_job(self.urlwatch_config.delete)
            if job is not None:
                self.urlwatcher.jobs.remove(job)
                print('Removed %r' % (job,))
            else:
                print('Not found: %r' % (self.urlwatch_config.delete,))
                save = False

        if self.urlwatch_config.add is not None:
            # Allow multiple specifications of filter=, so that
multiple filters can be specified on the CLI items = [item.split('=', 1) for item in self.urlwatch_config.add.split(',')] filters = [v for k, v in items if k == 'filter'] items = [(k, v) for k, v in items if k != 'filter'] d = {k: v for k, v in items} if filters: d['filter'] = ','.join(filters) job = JobBase.unserialize(d) print('Adding %r' % (job,)) self.urlwatcher.jobs.append(job) if save: self.urlwatcher.urls_storage.save(self.urlwatcher.jobs) return 0 def handle_actions(self): if self.urlwatch_config.features: sys.exit(self.show_features()) if self.urlwatch_config.gc_cache: self.urlwatcher.cache_storage.gc([job.get_guid() for job in self.urlwatcher.jobs]) sys.exit(0) if self.urlwatch_config.edit: sys.exit(self.urlwatcher.urls_storage.edit(self.urlwatch_config.urls_yaml_example)) if self.urlwatch_config.edit_hooks: sys.exit(self.edit_hooks()) if self.urlwatch_config.test_filter: sys.exit(self.test_filter()) if self.urlwatch_config.list: sys.exit(self.list_urls()) if self.urlwatch_config.add is not None or self.urlwatch_config.delete is not None: sys.exit(self.modify_urls()) def check_edit_config(self): if self.urlwatch_config.edit_config: sys.exit(self.urlwatcher.config_storage.edit()) def check_telegram_chats(self): if self.urlwatch_config.telegram_chats: config = self.urlwatcher.config_storage.config['report'].get('telegram', None) if not config: print('You need to configure telegram in your config first (see README.md)') sys.exit(1) bot_token = config.get('bot_token', None) if not bot_token: print('You need to set up your bot token first (see README.md)') sys.exit(1) info = requests.get('https://api.telegram.org/bot{}/getMe'.format(bot_token)).json() chats = {} for chat_info in requests.get('https://api.telegram.org/bot{}/getUpdates'.format(bot_token)).json()['result']: chat = chat_info['message']['chat'] if chat['type'] == 'private': chats[str(chat['id'])] = ' '.join((chat['first_name'], chat['last_name'])) if 'last_name' in chat else chat['first_name'] if 
not chats: print('No chats found. Say hello to your bot at https://t.me/{}'.format(info['result']['username'])) sys.exit(1) headers = ('Chat ID', 'Name') maxchat = max(len(headers[0]), max((len(k) for k, v in chats.items()), default=0)) maxname = max(len(headers[1]), max((len(v) for k, v in chats.items()), default=0)) fmt = '%-' + str(maxchat) + 's %s' print(fmt % headers) print(fmt % ('-' * maxchat, '-' * maxname)) for k, v in sorted(chats.items(), key=lambda kv: kv[1]): print(fmt % (k, v)) print('\nChat up your bot here: https://t.me/{}'.format(info['result']['username'])) sys.exit(0) def check_test_slack(self): if self.urlwatch_config.test_slack: config = self.urlwatcher.config_storage.config['report'].get('slack', None) if not config: print('You need to configure slack in your config first (see README.md)') sys.exit(1) webhook_url = config.get('webhook_url', None) if not webhook_url: print('You need to set up your slack webhook_url first (see README.md)') sys.exit(1) info = requests.post(webhook_url, json={"text": "Test message from urlwatch, your configuration is working"}) if info.status_code == requests.codes.ok: print('Successfully sent message to Slack') sys.exit(0) else: print('Error while submitting message to Slack:{0}'.format(info.text)) sys.exit(1) def check_smtp_login(self): if self.urlwatch_config.smtp_login: config = self.urlwatcher.config_storage.config['report']['email'] smtp_config = config['smtp'] success = True if not config['enabled']: print('Please enable e-mail reporting in the config first.') success = False if config['method'] != 'smtp': print('Please set the method to SMTP for the e-mail reporter.') success = False if not smtp_config['keyring']: print('Keyring authentication must be enabled for SMTP.') success = False smtp_hostname = smtp_config['host'] if not smtp_hostname: print('Please configure the SMTP hostname in the config first.') success = False smtp_username = smtp_config.get('user', None) or config['from'] if not 
smtp_username: print('Please configure the SMTP user in the config first.') success = False if not success: sys.exit(1) if have_password(smtp_hostname, smtp_username): message = 'Password for %s / %s already set, update? [y/N] ' % (smtp_username, smtp_hostname) if input(message).lower() != 'y': print('Password unchanged.') sys.exit(0) if success: set_password(smtp_hostname, smtp_username) # TODO: Actually verify that the login to the server works sys.exit(0) def run(self): self.check_edit_config() self.check_smtp_login() self.check_telegram_chats() self.check_test_slack() self.handle_actions() self.urlwatcher.run_jobs() self.urlwatcher.close() urlwatch-2.17/lib/urlwatch/config.py000066400000000000000000000126361345412734700175340ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import argparse
import logging
import os

import urlwatch

from .migration import migrate_cache, migrate_urls

logger = logging.getLogger(__name__)


class BaseConfig(object):

    def __init__(self, pkgname, urlwatch_dir, config, urls, cache, hooks, verbose):
        self.pkgname = pkgname
        self.urlwatch_dir = urlwatch_dir
        self.config = config
        self.urls = urls
        self.cache = cache
        self.hooks = hooks
        self.verbose = verbose


class CommandConfig(BaseConfig):

    def __init__(self, pkgname, urlwatch_dir, bindir, prefix, config, urls, hooks, cache, verbose):
        super().__init__(pkgname, urlwatch_dir, config, urls, cache, hooks, verbose)
        self.bindir = bindir
        self.prefix = prefix
        self.migrate_cache = migrate_cache
        self.migrate_urls = migrate_urls

        if self.bindir == 'bin':
            # Installed system-wide
            self.examples_dir = os.path.join(prefix, 'share', self.pkgname, 'examples')
        else:
            # Assume we are not yet installed
            self.examples_dir = os.path.join(prefix, bindir, 'share', self.pkgname, 'examples')

        self.urls_yaml_example = os.path.join(self.examples_dir, 'urls.yaml.example')
        self.hooks_py_example = os.path.join(self.examples_dir, 'hooks.py.example')

        self.parse_args()

    def parse_args(self):
        parser = argparse.ArgumentParser(description=urlwatch.__doc__,
                                         formatter_class=argparse.RawDescriptionHelpFormatter)
        parser.add_argument('--version', action='version', version='%(prog)s {}'.format(urlwatch.__version__))
        parser.add_argument('-v', '--verbose', action='store_true', help='show debug output')

        group = parser.add_argument_group('files and directories')
        group.add_argument('--urls', metavar='FILE', help='read job list (URLs) from FILE', default=self.urls)
        group.add_argument('--config', metavar='FILE', help='read configuration from FILE', default=self.config)
        group.add_argument('--hooks', metavar='FILE', help='use FILE as hooks.py module', default=self.hooks)
        group.add_argument('--cache', metavar='FILE', help='use FILE as cache database', default=self.cache)

        group = parser.add_argument_group('Authentication')
        group.add_argument('--smtp-login', action='store_true', help='Enter password for SMTP (store in keyring)')
        group.add_argument('--telegram-chats', action='store_true', help='List telegram chats the bot is joined to')
        group.add_argument('--test-slack', action='store_true', help='Send a test notification to Slack')

        group = parser.add_argument_group('job list management')
        group.add_argument('--list', action='store_true', help='list jobs')
        group.add_argument('--add', metavar='JOB', help='add job (key1=value1,key2=value2,...)')
        group.add_argument('--delete', metavar='JOB', help='delete job by location or index')
        group.add_argument('--test-filter', metavar='JOB', help='test filter output of job by location or index')

        group = parser.add_argument_group('interactive commands ($EDITOR/$VISUAL)')
        group.add_argument('--edit', action='store_true', help='edit URL/job list')
        group.add_argument('--edit-config', action='store_true', help='edit configuration file')
        group.add_argument('--edit-hooks', action='store_true', help='edit hooks script')

        group = parser.add_argument_group('miscellaneous')
        group.add_argument('--features', action='store_true', help='list supported jobs/filters/reporters')
        group.add_argument('--gc-cache', action='store_true', help='remove old cache entries')

        args = parser.parse_args()

        for arg in vars(args):
            argval = getattr(args, arg)
            setattr(self, arg, argval)
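The `parse_args` method above ends by copying every parsed option onto the config object itself via `setattr`, so the rest of the code can read e.g. `self.urlwatch_config.verbose` instead of carrying the argparse `Namespace` around. A minimal standalone sketch of that pattern (the class and option names here are illustrative, not part of urlwatch):

```python
import argparse


class Config:
    """Holds CLI options as plain attributes, mirroring the setattr loop."""

    def parse(self, argv):
        parser = argparse.ArgumentParser()
        parser.add_argument('--verbose', action='store_true')
        parser.add_argument('--cache', default='cache.db')
        args = parser.parse_args(argv)
        # Copy every parsed option onto the config object, so callers can
        # use config.verbose instead of going through the Namespace
        for arg in vars(args):
            setattr(self, arg, getattr(args, arg))


config = Config()
config.parse(['--verbose'])
# config.verbose is now True, config.cache keeps its default 'cache.db'
```

Defaults passed to `add_argument` (as urlwatch does with `default=self.urls` etc.) survive this copy unchanged when the flag is absent on the command line.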
urlwatch-2.17/lib/urlwatch/filters.py000066400000000000000000000417201345412734700177330ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import re import logging import itertools import os import imp import html.parser import hashlib import json from enum import Enum from lxml import etree from .util import TrackSubClasses logger = logging.getLogger(__name__) class FilterBase(object, metaclass=TrackSubClasses): __subclasses__ = {} __anonymous_subclasses__ = [] def __init__(self, job, state): self.job = job self.state = state def _no_subfilters(self, subfilter): if subfilter is not None: raise ValueError('No subfilters supported for {}'.format(self.__kind__)) @classmethod def filter_documentation(cls): result = [] for sc in TrackSubClasses.sorted_by_kind(cls): result.extend(( ' * %s - %s' % (sc.__kind__, sc.__doc__), )) return '\n'.join(result) @classmethod def auto_process(cls, state, data): filters = itertools.chain((filtercls for _, filtercls in sorted(cls.__subclasses__.items(), key=lambda k_v: k_v[0])), cls.__anonymous_subclasses__) for filtercls in filters: filter_instance = filtercls(state.job, state) if filter_instance.match(): logger.info('Auto-applying filter %r to %s', filter_instance, state.job.get_location()) data = filter_instance.filter(data) return data @classmethod def process(cls, filter_kind, subfilter, state, data): logger.info('Applying filter %r, subfilter %r to %s', filter_kind, subfilter, state.job.get_location()) filtercls = cls.__subclasses__.get(filter_kind, None) if filtercls is None: raise ValueError('Unknown filter kind: %s:%s' % (filter_kind, subfilter)) return filtercls(state.job, state).filter(data, subfilter) def match(self): return False def filter(self, data, subfilter=None): raise NotImplementedError() class AutoMatchFilter(FilterBase): """Automatically matches subclass filters with a given location""" MATCH = None def match(self): if self.MATCH is None: return False d = self.job.to_dict() result = all(d.get(k, None) == v for k, v in self.MATCH.items()) logger.debug('Matching %r with %r result: %r', self, self.job, result) return result class 
RegexMatchFilter(FilterBase): """Same as AutoMatchFilter but matching is done with regexes""" MATCH = None def match(self): if self.MATCH is None: return False d = self.job.to_dict() # It's a match if we have at least one key/value pair that matches, # and no key/value pairs that do not match matches = [v.match(d[k]) for k, v in self.MATCH.items() if k in d] result = len(matches) > 0 and all(matches) logger.debug('Matching %r with %r result: %r', self, self.job, result) return result class LegacyHooksPyFilter(FilterBase): FILENAME = os.path.expanduser('~/.urlwatch/lib/hooks.py') def __init__(self, job, state): super().__init__(job, state) self.hooks = None if os.path.exists(self.FILENAME): try: self.hooks = imp.load_source('legacy_hooks', self.FILENAME) except Exception as e: logger.error('Could not load legacy hooks file: %s', e) def match(self): return self.hooks is not None def filter(self, data, subfilter=None): try: result = self.hooks.filter(self.job.get_location(), data) if result is None: result = data return result except Exception as e: logger.warn('Could not apply legacy hooks filter: %s', e) return data class Html2TextFilter(FilterBase): """Convert HTML to plaintext""" __kind__ = 'html2text' def filter(self, data, subfilter=None): if subfilter is None: method = 're' options = {} elif isinstance(subfilter, dict): method = subfilter.pop('method') options = subfilter elif isinstance(subfilter, str): method = subfilter options = {} from .html2txt import html2text return html2text(data, method=method, options=options) class Ical2TextFilter(FilterBase): """Convert iCalendar to plaintext""" __kind__ = 'ical2text' def filter(self, data, subfilter=None): self._no_subfilters(subfilter) from .ical2txt import ical2text return ical2text(data) class JsonFormatFilter(FilterBase): """Convert to formatted json""" __kind__ = 'format-json' def filter(self, data, subfilter=None): indentation = 4 if subfilter is not None: indentation = int(subfilter) parsed_json = 
json.loads(data) return json.dumps(parsed_json, sort_keys=True, indent=indentation, separators=(',', ': ')) class GrepFilter(FilterBase): """Filter only lines matching a regular expression""" __kind__ = 'grep' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('The grep filter needs a regular expression') return '\n'.join(line for line in data.splitlines() if re.search(subfilter, line) is not None) class InverseGrepFilter(FilterBase): """Filter which removes lines matching a regular expression""" __kind__ = 'grepi' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('The inverse grep filter needs a regular expression') return '\n'.join(line for line in data.splitlines() if re.search(subfilter, line) is None) class StripFilter(FilterBase): """Strip leading and trailing whitespace""" __kind__ = 'strip' def filter(self, data, subfilter=None): self._no_subfilters(subfilter) return data.strip() class FilterBy(Enum): ATTRIBUTE = 1 TAG = 2 class ElementsBy(html.parser.HTMLParser): def __init__(self, filter_by, name, value=None): super().__init__() self._filter_by = filter_by if self._filter_by == FilterBy.ATTRIBUTE: self._attributes = {name: value} else: self._name = name self._result = [] self._inside = False self._elts = [] def get_html(self): return ''.join(self._result) def handle_starttag(self, tag, attrs): ad = dict(attrs) if self._filter_by == FilterBy.ATTRIBUTE and all(ad.get(k, None) == v for k, v in self._attributes.items()): self._inside = True elif self._filter_by == FilterBy.TAG and tag == self._name: self._inside = True if self._inside: self._result.append('<%s%s%s>' % (tag, ' ' if attrs else '', ' '.join('%s="%s"' % (k, v) for k, v in attrs))) self._elts.append(tag) def handle_endtag(self, tag): if self._inside: self._result.append('' % (tag,)) if tag in self._elts: t = self._elts.pop() while t != tag and self._elts: t = self._elts.pop() if not self._elts: self._inside = False def handle_data(self, 
data): if self._inside: self._result.append(data) class GetElementById(FilterBase): """Get an HTML element by its ID""" __kind__ = 'element-by-id' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('Need an element ID for filtering') element_by_id = ElementsBy(FilterBy.ATTRIBUTE, 'id', subfilter) element_by_id.feed(data) return element_by_id.get_html() class GetElementByClass(FilterBase): """Get all HTML elements by class""" __kind__ = 'element-by-class' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('Need an element class for filtering') element_by_class = ElementsBy(FilterBy.ATTRIBUTE, 'class', subfilter) element_by_class.feed(data) return element_by_class.get_html() class GetElementByStyle(FilterBase): """Get all HTML elements by style""" __kind__ = 'element-by-style' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('Need an element style for filtering') element_by_style = ElementsBy(FilterBy.ATTRIBUTE, 'style', subfilter) element_by_style.feed(data) return element_by_style.get_html() class GetElementByTag(FilterBase): """Get an HTML element by its tag""" __kind__ = 'element-by-tag' def filter(self, data, subfilter=None): if subfilter is None: raise ValueError('Need a tag for filtering') element_by_tag = ElementsBy(FilterBy.TAG, subfilter) element_by_tag.feed(data) return element_by_tag.get_html() class Sha1Filter(FilterBase): """Calculate the SHA-1 checksum of the content""" __kind__ = 'sha1sum' def filter(self, data, subfilter=None): self._no_subfilters(subfilter) sha = hashlib.sha1() sha.update(data.encode('utf-8', 'ignore')) return sha.hexdigest() class HexdumpFilter(FilterBase): """Convert binary data to hex dump format""" __kind__ = 'hexdump' def filter(self, data, subfilter=None): self._no_subfilters(subfilter) data = bytearray(data.encode('utf-8', 'ignore')) blocks = [data[i * 16:(i + 1) * 16] for i in range(int((len(data) + (16 - 1)) / 16))] return '\n'.join('%s 
%s' % (' '.join('%02x' % c for c in block), ''.join((chr(c) if (c > 31 and c < 127) else '.') for c in block)) for block in blocks) class LxmlParser: EXPR_NAMES = {'css': 'a CSS selector', 'xpath': 'an XPath expression'} def __init__(self, filter_kind, subfilter, expr_key): self.filter_kind = filter_kind self.expression, self.method, self.exclude = self.parse_subfilter( filter_kind, subfilter, expr_key, self.EXPR_NAMES[filter_kind]) self.parser = (etree.HTMLParser if self.method == 'html' else etree.XMLParser)() self.data = '' @staticmethod def parse_subfilter(filter_kind, subfilter, expr_key, expr_name): if subfilter is None: raise ValueError('Need %s for filtering' % (expr_name,)) if isinstance(subfilter, str): expression = subfilter method = 'html' exclude = None elif isinstance(subfilter, dict): if expr_key not in subfilter: raise ValueError('Need %s for filtering' % (expr_name,)) expression = subfilter[expr_key] method = subfilter.get('method', 'html') exclude = subfilter.get('exclude') if method not in ('html', 'xml'): raise ValueError('%s method must be "html" or "xml", got %r' % (filter_kind, method)) else: raise ValueError('%s subfilter must be a string or dict' % (filter_kind,)) return expression, method, exclude def feed(self, data): self.data += data def _to_string(self, element): # Handle "/text()" selector, which returns lxml.etree._ElementUnicodeResult (Issue #282) if isinstance(element, str): return element return etree.tostring(element, pretty_print=True, method=self.method, encoding='unicode', with_tail=False) @staticmethod def _remove_element(element): parent = element.getparent() if parent is None: # Do not exclude root element return if isinstance(element, etree._ElementUnicodeResult): if element.is_tail: parent.tail = None elif element.is_text: parent.text = None elif element.is_attribute: del parent.attrib[element.attrname] else: previous = element.getprevious() if element.tail is not None: if previous is not None: previous.tail = 
previous.tail + element.tail if previous.tail else element.tail else: parent.text = parent.text + element.tail if parent.text else element.tail parent.remove(element) @classmethod def _reevaluate(cls, element): if cls._orphaned(element): return None if isinstance(element, etree._ElementUnicodeResult): parent = element.getparent() if parent is None: return element if element.is_tail: return parent.tail elif element.is_text: return parent.text elif element.is_attribute: return parent.attrib.get(element.attrname) else: return element @staticmethod def _orphaned(element): if isinstance(element, etree._ElementUnicodeResult): parent = element.getparent() if ((element.is_tail and parent.tail is None) or (element.is_text and parent.text is None) or (element.is_attribute and parent.attrib.get(element.attrname) is None)): return True else: element = parent try: tree = element.getroottree() path = tree.getpath(element) return element is not tree.xpath(path)[0] except (ValueError, IndexError): return True def _get_filtered_elements(self): try: root = etree.fromstring(self.data, self.parser) except ValueError: # Strip XML declaration, for example: '' # for https://heronebag.com/blog/index.xml, an error happens, as we get a # a (Unicode) string, but the XML contains its own "encoding" declaration self.data = re.sub(r'^<[?]xml[^>]*[?]>', '', self.data) # Retry parsing with XML declaration removed (Fixes #281) root = etree.fromstring(self.data, self.parser) if root is None: return [] excluded_elems = None if self.filter_kind == 'css': selected_elems = root.cssselect(self.expression) excluded_elems = root.cssselect(self.exclude) if self.exclude else None elif self.filter_kind == 'xpath': selected_elems = root.xpath(self.expression) excluded_elems = root.xpath(self.exclude) if self.exclude else None if excluded_elems is not None: for el in excluded_elems: self._remove_element(el) return [el for el in map(self._reevaluate, selected_elems) if el is not None] def 
get_filtered_data(self): return '\n'.join(self._to_string(element) for element in self._get_filtered_elements()) class CssFilter(FilterBase): """Filter XML/HTML using CSS selectors""" __kind__ = 'css' def filter(self, data, subfilter=None): lxml_parser = LxmlParser('css', subfilter, 'selector') lxml_parser.feed(data) return lxml_parser.get_filtered_data() class XPathFilter(FilterBase): """Filter XML/HTML using XPath expressions""" __kind__ = 'xpath' def filter(self, data, subfilter=None): lxml_parser = LxmlParser('xpath', subfilter, 'path') lxml_parser.feed(data) return lxml_parser.get_filtered_data() urlwatch-2.17/lib/urlwatch/handler.py000066400000000000000000000144021345412734700176750ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import datetime import logging import time import traceback from .filters import FilterBase from .jobs import NotModifiedError from .reporters import ReporterBase logger = logging.getLogger(__name__) class JobState(object): def __init__(self, cache_storage, job): self.cache_storage = cache_storage self.job = job self.verb = None self.old_data = None self.new_data = None self.history_data = {} self.timestamp = None self.exception = None self.traceback = None self.tries = 0 self.etag = None self.error_ignored = False def load(self): guid = self.job.get_guid() self.old_data, self.timestamp, self.tries, self.etag = self.cache_storage.load(self.job, guid) if self.tries is None: self.tries = 0 if self.job.compared_versions and self.job.compared_versions > 1: self.history_data = self.cache_storage.get_history_data(guid, self.job.compared_versions) def save(self): if self.new_data is None and self.exception is not None: # If no new data has been retrieved due to an exception, use the old job data self.new_data = self.old_data self.cache_storage.save(self.job, self.job.get_guid(), self.new_data, time.time(), self.tries, self.etag) def process(self): logger.info('Processing: %s', self.job) try: try: self.load() data = self.job.retrieve(self) # Apply automatic filters first data = FilterBase.auto_process(self, data) # Apply any specified filters filter_list = self.job.filter if filter_list: if isinstance(filter_list, list): for item in filter_list: key = next(iter(item)) 
filter_kind, subfilter = key, item[key] data = FilterBase.process(filter_kind, subfilter, self, data) elif isinstance(filter_list, str): for filter_kind in filter_list.split(','): if ':' in filter_kind: filter_kind, subfilter = filter_kind.split(':', 1) else: subfilter = None data = FilterBase.process(filter_kind, subfilter, self, data) self.new_data = data except Exception as e: # job has a chance to format and ignore its error self.exception = e self.traceback = self.job.format_error(e, traceback.format_exc()) self.error_ignored = self.job.ignore_error(e) if not (self.error_ignored or isinstance(e, NotModifiedError)): self.tries += 1 logger.debug('Increasing number of tries to %i for %s', self.tries, self.job) except Exception as e: # job failed its chance to handle error self.exception = e self.traceback = traceback.format_exc() self.error_ignored = False if not isinstance(e, NotModifiedError): self.tries += 1 logger.debug('Increasing number of tries to %i for %s', self.tries, self.job) return self class Report(object): def __init__(self, urlwatch_config): self.config = urlwatch_config.config_storage.config self.job_states = [] self.start = datetime.datetime.now() def _result(self, verb, job_state): if job_state.exception is not None: # TODO: Once we require Python >= 3.5, we can just pass in job_state.exception as "exc_info" parameter exc_info = (type(job_state.exception), job_state.exception, job_state.exception.__traceback__) logger.debug('Got exception while processing %r', job_state.job, exc_info=exc_info) job_state.verb = verb self.job_states.append(job_state) def new(self, job_state): self._result('new', job_state) def changed(self, job_state): self._result('changed', job_state) def unchanged(self, job_state): self._result('unchanged', job_state) def error(self, job_state): self._result('error', job_state) def get_filtered_job_states(self, job_states): for job_state in job_states: if not any(job_state.verb == verb and not self.config['display'][verb] for 
verb in ('unchanged', 'new', 'error')): yield job_state def finish(self): end = datetime.datetime.now() duration = (end - self.start) ReporterBase.submit_all(self, self.job_states, duration) urlwatch-2.17/lib/urlwatch/html2txt.py000066400000000000000000000115161345412734700200510ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
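`Report.get_filtered_job_states` above suppresses job states whose verb is disabled in the report's `display` configuration, while `changed` states always pass through. A simplified, self-contained sketch of that predicate, using plain dicts instead of the real `JobState` and config objects:

```python
def filtered_job_states(job_states, display):
    """Yield only job states whose verb is enabled in the display config.

    'unchanged', 'new' and 'error' can each be switched off; any other
    verb (notably 'changed') is always reported, mirroring the any(...)
    test in Report.get_filtered_job_states.
    """
    for job_state in job_states:
        if not any(job_state['verb'] == verb and not display[verb]
                   for verb in ('unchanged', 'new', 'error')):
            yield job_state


states = [{'verb': 'new'}, {'verb': 'unchanged'}, {'verb': 'changed'}]
display = {'unchanged': False, 'new': True, 'error': True}
visible = [s['verb'] for s in filtered_job_states(states, display)]
# visible == ['new', 'changed']: the 'unchanged' state is suppressed
```

Keeping this as a generator means reporters only ever iterate the states they will actually render.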
import re
import os
import subprocess
import logging

logger = logging.getLogger(__name__)


def html2text(data, method, options):
    """
    Convert a string consisting of HTML to plain text
    for easy difference checking.

    Method may be one of:

     'lynx'        - Use "lynx -dump" for conversion
                     options: see "lynx -help" output for options that work with "-dump"
     'html2text'   - Use "html2text -nobs" for conversion
                     options: https://linux.die.net/man/1/html2text
     'bs4'         - Use the Beautiful Soup library to prettify the HTML
                     options: "parser" only; bs4 supports "lxml", "html5lib", and "html.parser"
                     http://beautiful-soup-4.readthedocs.io/en/latest/#specifying-the-parser-to-use
     're'          - A simple regex-based HTML tag stripper
     'pyhtml2text' - Use the Python module "html2text"
                     options: https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options
    """
    if method == 're':
        stripped_tags = re.sub(r'<[^>]*>', '', data)
        d = '\n'.join(l.rstrip() for l in stripped_tags.splitlines() if l.strip() != '')
        return d

    if method == 'pyhtml2text':
        import html2text
        parser = html2text.HTML2Text()
        for k, v in options.items():
            setattr(parser, k.lower(), v)
        d = parser.handle(data)
        return d

    if method == 'bs4':
        from bs4 import BeautifulSoup
        parser = options.pop('parser', 'html.parser')
        soup = BeautifulSoup(data, parser)
        d = soup.get_text(strip=True)
        return d

    if method == 'lynx':
        cmd = ['lynx', '-nonumbers', '-dump', '-stdin', '-assume_charset UTF-8', '-display_charset UTF-8']
    elif method == 'html2text':
        cmd = ['html2text', '-nobs', '-utf8']
    else:
        raise ValueError('Unknown html2text method: %r' % (method,))

    stdout_encoding = 'utf-8'
    for k, v in options.items():
        cmd.append('-%s %s' % (k, v) if v is True else '-%s' % k)
    logger.debug('Command: %r, stdout encoding: %s', cmd, stdout_encoding)

    env = {}
    env.update(os.environ)
    env['LANG'] = 'en_US.utf-8'
    env['LC_ALL'] = 'en_US.utf-8'

    html2text = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, env=env)
    stdout, stderr = html2text.communicate(data.encode('utf-8'))
    stdout = stdout.decode(stdout_encoding)

    if method == 'lynx':
        # Lynx translates relative links in the mode we use it to:
        # file://localhost/tmp/[RANDOM STRING]/[RELATIVE LINK]

        # Recent versions of lynx (seen in 2.8.8pre1-1) do not include the
        # "localhost" in the file:// URLs; see Debian bug 732112
        stdout = re.sub(r'file://%s/[^/]*/' % (os.environ.get('TMPDIR', '/tmp'),), '', stdout)

        # Use the following regular expression to remove the unnecessary
        # parts, so that [RANDOM STRING] (changing on each call) does not
        # expose itself as a change on the website (it's a Lynx-related thing).
        # Thanks to Evert Meulie for pointing that out
        stdout = re.sub(r'file://localhost%s/[^/]*/' % (os.environ.get('TMPDIR', '/tmp'),), '', stdout)

        # Also remove file names like L9816-5928TMP.html
        stdout = re.sub(r'L\d+-\d+TMP.html', '', stdout)

    return stdout.strip()

urlwatch-2.17/lib/urlwatch/ical2txt.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


def ical2text(ical_string):
    import vobject
    result = []
    if isinstance(ical_string, str):
        parsedCal = vobject.readOne(ical_string)
    else:
        try:
            parsedCal = vobject.readOne(ical_string)
        except Exception as e:
            parsedCal = vobject.readOne(ical_string.decode('utf-8', 'ignore'))

    for event in parsedCal.getChildren():
        if event.name == 'VEVENT':
            if hasattr(event, 'dtstart'):
                start = event.dtstart.value.strftime('%F %H:%M')
            else:
                start = 'unknown start date'

            if hasattr(event, 'dtend'):
                end = event.dtend.value.strftime('%F %H:%M')
            else:
                end = start

            if start == end:
                date_str = start
            else:
                date_str = '%s -- %s' % (start, end)

            result.append('%s: %s' % (date_str, event.summary.value))

    return '\n'.join(result)

urlwatch-2.17/lib/urlwatch/jobs.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2.
#    Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import email.utils
import hashlib
import logging
import os
import re
import subprocess

import requests

import urlwatch

from requests.packages.urllib3.exceptions import InsecureRequestWarning

from .util import TrackSubClasses

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

logger = logging.getLogger(__name__)


class ShellError(Exception):
    """Exception for shell commands with non-zero exit code"""

    def __init__(self, result):
        Exception.__init__(self)
        self.result = result

    def __str__(self):
        return '%s: Exit status %d' % (self.__class__.__name__, self.result)


class NotModifiedError(Exception):
    """Exception raised on HTTP 304 responses"""
    ...
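For illustration, the shell-command pattern behind `ShellError` above and `ShellJob.retrieve()` below can be distilled into a self-contained sketch. The `run_command` helper is hypothetical (not part of the urlwatch API); it mirrors how a non-zero exit status is turned into an exception instead of being treated as job output:

```python
import subprocess


class ShellError(Exception):
    """Raised when a watched command exits with a non-zero status."""

    def __init__(self, result):
        super().__init__()
        self.result = result

    def __str__(self):
        return '%s: Exit status %d' % (self.__class__.__name__, self.result)


def run_command(command):
    # Minimal sketch of the ShellJob.retrieve() pattern: capture stdout,
    # raise ShellError if the command fails
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    stdout_data, _ = process.communicate()
    result = process.wait()
    if result != 0:
        raise ShellError(result)
    return stdout_data.decode('utf-8')
```

Because the command runs with `shell=True`, a job's `command` string can use pipes and redirection, just as in a `shell` job's YAML definition.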
class JobBase(object, metaclass=TrackSubClasses):
    __subclasses__ = {}

    __required__ = ()
    __optional__ = ()

    def __init__(self, **kwargs):
        # Set optional keys to None
        for k in self.__optional__:
            if k not in kwargs:
                setattr(self, k, None)

        # Fail if any required keys are not provided
        for k in self.__required__:
            if k not in kwargs:
                raise ValueError('Required field %s missing: %r' % (k, kwargs))

        for k, v in list(kwargs.items()):
            setattr(self, k, v)

    @classmethod
    def job_documentation(cls):
        result = []
        for sc in TrackSubClasses.sorted_by_kind(cls):
            result.extend((
                '  * %s - %s' % (sc.__kind__, sc.__doc__),
                '    Required keys: %s' % (', '.join(sc.__required__),),
                '    Optional keys: %s' % (', '.join(sc.__optional__),),
                '',
            ))
        return '\n'.join(result)

    def get_location(self):
        raise NotImplementedError()

    def pretty_name(self):
        raise NotImplementedError()

    def serialize(self):
        d = {'kind': self.__kind__}
        d.update(self.to_dict())
        return d

    @classmethod
    def unserialize(cls, data):
        if 'kind' not in data:
            # Try to auto-detect the kind of job based on the available keys
            kinds = [subclass.__kind__ for subclass in list(cls.__subclasses__.values())
                     if all(required in data for required in subclass.__required__)
                     and not any(key not in subclass.__required__
                                 and key not in subclass.__optional__ for key in data)]

            if len(kinds) == 1:
                kind = kinds[0]
            elif len(kinds) == 0:
                raise ValueError('Kind is not specified, and no job matches: %r' % (data,))
            else:
                raise ValueError('Multiple kinds of jobs match %r: %r' % (data, kinds))
        else:
            kind = data['kind']

        return cls.__subclasses__[kind].from_dict(data)

    def to_dict(self):
        return {k: getattr(self, k) for keys in (self.__required__, self.__optional__)
                for k in keys if getattr(self, k) is not None}

    @classmethod
    def from_dict(cls, data):
        return cls(**{k: v for k, v in list(data.items())
                      if k in cls.__required__ or k in cls.__optional__})

    def __repr__(self):
        return '<%s %s>' % (self.__kind__, ' '.join('%s=%r' % (k, v)
                                                    for k, v in list(self.to_dict().items())))

    def _set_defaults(self, defaults):
        if isinstance(defaults, dict):
            for key, value in defaults.items():
                if key in self.__optional__ and getattr(self, key) is None:
                    setattr(self, key, value)

    def with_defaults(self, config):
        new_job = JobBase.unserialize(self.serialize())
        cfg = config.get('job_defaults')
        if isinstance(cfg, dict):
            new_job._set_defaults(cfg.get(self.__kind__))
            new_job._set_defaults(cfg.get('all'))
        return new_job

    def get_guid(self):
        location = self.get_location()
        sha_hash = hashlib.new('sha1')
        sha_hash.update(location.encode('utf-8'))
        return sha_hash.hexdigest()

    def retrieve(self, job_state):
        raise NotImplementedError()

    def format_error(self, exception, tb):
        return tb

    def ignore_error(self, exception):
        return False


class Job(JobBase):
    __required__ = ()
    __optional__ = ('name', 'filter', 'max_tries', 'diff_tool', 'compared_versions')

    # Determine if a hyperlink ("a" tag) is used in HtmlReporter
    LOCATION_IS_URL = False

    def pretty_name(self):
        return self.name if self.name else self.get_location()


class ShellJob(Job):
    """Run a shell command and get its standard output"""

    __kind__ = 'shell'

    __required__ = ('command',)
    __optional__ = ()

    def get_location(self):
        return self.command

    def retrieve(self, job_state):
        process = subprocess.Popen(self.command, stdout=subprocess.PIPE, shell=True)
        stdout_data, stderr_data = process.communicate()
        result = process.wait()
        if result != 0:
            raise ShellError(result)

        return stdout_data.decode('utf-8')


class UrlJob(Job):
    """Retrieve a URL from a web server"""

    __kind__ = 'url'

    __required__ = ('url',)
    __optional__ = ('cookies', 'data', 'method', 'ssl_no_verify', 'ignore_cached',
                    'http_proxy', 'https_proxy', 'headers', 'ignore_connection_errors',
                    'ignore_http_error_codes', 'encoding', 'timeout')

    LOCATION_IS_URL = True

    CHARSET_RE = re.compile('text/(html|plain); charset=([^;]*)')

    def get_location(self):
        return self.url

    def retrieve(self, job_state):
        headers = {
            'User-agent': urlwatch.__user_agent__,
        }

        proxies = {
            'http': os.getenv('HTTP_PROXY'),
            'https': os.getenv('HTTPS_PROXY'),
        }

        if job_state.etag is not None:
            headers['If-None-Match'] = job_state.etag

        if job_state.timestamp is not None:
            headers['If-Modified-Since'] = email.utils.formatdate(job_state.timestamp)

        if self.ignore_cached or job_state.tries > 0:
            headers['If-None-Match'] = None
            headers['If-Modified-Since'] = email.utils.formatdate(0)
            headers['Cache-Control'] = 'max-age=172800'
            headers['Expires'] = email.utils.formatdate()

        if self.method is None:
            self.method = "GET"

        if self.data is not None:
            self.method = "POST"
            headers['Content-type'] = 'application/x-www-form-urlencoded'
            logger.info('Sending POST request to %s', self.url)

        if self.http_proxy is not None:
            proxies['http'] = self.http_proxy

        if self.https_proxy is not None:
            proxies['https'] = self.https_proxy

        file_scheme = 'file://'
        if self.url.startswith(file_scheme):
            logger.info('Using local filesystem (%s URI scheme)', file_scheme)
            with open(self.url[len(file_scheme):], 'rt') as fp:
                return fp.read()

        if self.headers:
            self.add_custom_headers(headers)

        if self.timeout is None:
            # Default timeout
            timeout = 60
        elif self.timeout == 0:
            # Never timeout
            timeout = None
        else:
            timeout = self.timeout

        response = requests.request(url=self.url,
                                    data=self.data,
                                    headers=headers,
                                    method=self.method,
                                    verify=(not self.ssl_no_verify),
                                    cookies=self.cookies,
                                    proxies=proxies,
                                    timeout=timeout)

        response.raise_for_status()
        if response.status_code == requests.codes.not_modified:
            raise NotModifiedError()

        # Save ETag from response into job_state, which will be saved in cache
        job_state.etag = response.headers.get('ETag')

        # If we can't find the encoding in the headers, requests gets all
        # old-RFC-y and assumes ISO-8859-1 instead of UTF-8. Use the old
        # urlwatch behavior and try UTF-8 decoding first.
        content_type = response.headers.get('Content-type', '')
        content_type_match = self.CHARSET_RE.match(content_type)
        if not content_type_match and not self.encoding:
            try:
                try:
                    try:
                        return response.content.decode('utf-8')
                    except UnicodeDecodeError:
                        return response.content.decode('latin1')
                except UnicodeDecodeError:
                    return response.content.decode('utf-8', 'ignore')
            except LookupError:
                # If this is an invalid encoding, decode as ascii (Debian bug 731931)
                return response.content.decode('ascii', 'ignore')

        if self.encoding:
            response.encoding = self.encoding

        return response.text

    def add_custom_headers(self, headers):
        """
        Adds custom request headers from the job list (URLs) to the
        pre-filled dictionary `headers`. Pre-filled values of conflicting
        header keys (case-insensitive) are overwritten by the custom value.
        """
        headers_to_remove = [x for x in headers if x.lower() in [y.lower() for y in self.headers]]
        for header in headers_to_remove:
            headers.pop(header, None)
        headers.update(self.headers)

    def format_error(self, exception, tb):
        if isinstance(exception, requests.exceptions.RequestException):
            # Instead of a full traceback, just show the HTTP error
            return str(exception)
        return tb

    def ignore_error(self, exception):
        if isinstance(exception, requests.exceptions.ConnectionError) and self.ignore_connection_errors:
            return True
        elif isinstance(exception, requests.exceptions.HTTPError):
            status_code = exception.response.status_code
            ignored_codes = []
            if isinstance(self.ignore_http_error_codes, int) and self.ignore_http_error_codes == status_code:
                return True
            elif isinstance(self.ignore_http_error_codes, str):
                ignored_codes = [s.strip().lower() for s in self.ignore_http_error_codes.split(',')]
            elif isinstance(self.ignore_http_error_codes, list):
                ignored_codes = [str(s).strip().lower() for s in self.ignore_http_error_codes]
            return str(status_code) in ignored_codes or '%sxx' % (status_code // 100) in ignored_codes
        return False


class BrowserJob(Job):
    """Retrieve a URL, emulating a real web browser"""

    __kind__ = 'browser'

    __required__ = ('navigate',)

    LOCATION_IS_URL = True

    def get_location(self):
        return self.navigate

    def retrieve(self, job_state):
        from requests_html import HTMLSession
        session = HTMLSession()
        response = session.get(self.navigate)
        return response.html.html

urlwatch-2.17/lib/urlwatch/mailer.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
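The `ignore_http_error_codes` matching implemented in `UrlJob.ignore_error()` above accepts a single int, a comma-separated string, or a list, and supports class patterns such as `4xx` that match a whole family of status codes. As a standalone sketch of just that matching rule (the `is_ignored_status` helper is hypothetical, not part of the urlwatch API):

```python
def is_ignored_status(ignore_http_error_codes, status_code):
    # Sketch of UrlJob.ignore_error()'s status-code matching: the option may
    # be an int, a comma-separated string, or a list; entries like "4xx"
    # match an entire class of status codes.
    if isinstance(ignore_http_error_codes, int):
        return ignore_http_error_codes == status_code
    if isinstance(ignore_http_error_codes, str):
        ignored = [s.strip().lower() for s in ignore_http_error_codes.split(',')]
    elif isinstance(ignore_http_error_codes, list):
        ignored = [str(s).strip().lower() for s in ignore_http_error_codes]
    else:
        return False
    return str(status_code) in ignored or '%sxx' % (status_code // 100) in ignored
```

So a job configured with `ignore_http_error_codes: "404, 5xx"` swallows a 404 and any 5xx response while still reporting, say, a 403.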
import smtplib
import getpass
import subprocess
import logging

try:
    import keyring
except ImportError:
    keyring = None

import email.mime.multipart
import email.mime.text
import email.utils

logger = logging.getLogger(__name__)


class Mailer(object):
    def send(self, msg):
        raise NotImplementedError

    def msg_plain(self, from_email, to_email, subject, body):
        msg = email.mime.text.MIMEText(body, 'plain', 'utf-8')
        msg['Subject'] = subject
        msg['From'] = from_email
        msg['To'] = to_email
        msg['Date'] = email.utils.formatdate()
        return msg

    def msg_html(self, from_email, to_email, subject, body_text, body_html):
        msg = email.mime.multipart.MIMEMultipart('alternative')
        msg['Subject'] = subject
        msg['From'] = from_email
        msg['To'] = to_email
        msg['Date'] = email.utils.formatdate()

        msg.attach(email.mime.text.MIMEText(body_text, 'plain', 'utf-8'))
        msg.attach(email.mime.text.MIMEText(body_html, 'html', 'utf-8'))

        return msg


class SMTPMailer(Mailer):
    def __init__(self, smtp_user, smtp_server, smtp_port, tls, auth):
        self.smtp_server = smtp_server
        self.smtp_user = smtp_user
        self.smtp_port = smtp_port
        self.tls = tls
        self.auth = auth

    def send(self, msg):
        s = smtplib.SMTP(self.smtp_server, self.smtp_port)
        s.ehlo()

        if self.tls:
            s.starttls()

        if self.auth and keyring is not None:
            passwd = keyring.get_password(self.smtp_server, self.smtp_user)
            if passwd is None:
                raise ValueError('No password available in keyring for {}, {}'.format(self.smtp_server, self.smtp_user))
            s.login(self.smtp_user, passwd)

        s.sendmail(msg['From'], [msg['To']], msg.as_string())
        s.quit()


class SendmailMailer(Mailer):
    def __init__(self, sendmail_path):
        self.sendmail_path = sendmail_path

    def send(self, msg):
        p = subprocess.Popen([self.sendmail_path, '-t', '-oi'],
                             stdin=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
        result = p.communicate(msg.as_string())
        if p.returncode:
            logger.error('Sendmail failed with {result}'.format(result=result))


def have_password(smtp_server, from_email):
    return keyring.get_password(smtp_server, from_email) is not None


def set_password(smtp_server, from_email):
    '''Set the keyring password for the mail connection. Interactive.'''
    if keyring is None:
        raise ImportError('keyring module missing - service unsupported')

    password = getpass.getpass(prompt='Enter password for {} using {}: '.format(from_email, smtp_server))
    keyring.set_password(smtp_server, from_email, password)

urlwatch-2.17/lib/urlwatch/main.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import imp
import logging
import os

from .handler import Report
from .worker import run_jobs

logger = logging.getLogger(__name__)


class Urlwatch(object):
    def __init__(self, urlwatch_config, config_storage, cache_storage, urls_storage):
        self.urlwatch_config = urlwatch_config

        logger.info('Using %s as URLs file', self.urlwatch_config.urls)
        logger.info('Using %s for hooks', self.urlwatch_config.hooks)
        logger.info('Using %s as cache database', self.urlwatch_config.cache)

        self.config_storage = config_storage
        self.cache_storage = cache_storage
        self.urls_storage = urls_storage

        self.report = Report(self)
        self.jobs = None

        self.check_directories()

        if hasattr(self.urlwatch_config, 'migrate_urls'):
            self.urlwatch_config.migrate_urls(self)

        if not self.urlwatch_config.edit_hooks:
            self.load_hooks()

        if not self.urlwatch_config.edit:
            self.load_jobs()

        if hasattr(self.urlwatch_config, 'migrate_cache'):
            self.urlwatch_config.migrate_cache(self)

    def check_directories(self):
        if not os.path.isdir(self.urlwatch_config.urlwatch_dir):
            os.makedirs(self.urlwatch_config.urlwatch_dir)
        if not os.path.exists(self.urlwatch_config.config):
            self.config_storage.write_default_config(self.urlwatch_config.config)
            print("""
A default config has been written to {config_yaml}.
Use "{pkgname} --edit-config" to customize it.
""".format(config_yaml=self.urlwatch_config.config, pkgname=self.urlwatch_config.pkgname))

    def load_hooks(self):
        if os.path.exists(self.urlwatch_config.hooks):
            imp.load_source('hooks', self.urlwatch_config.hooks)

    def load_jobs(self):
        if os.path.isfile(self.urlwatch_config.urls):
            jobs = self.urls_storage.load_secure()
            logger.info('Found {0} jobs'.format(len(jobs)))
        else:
            logger.warning('No jobs file found')
            jobs = []

        self.jobs = jobs

    def run_jobs(self):
        run_jobs(self)

    def close(self):
        self.report.finish()
        self.cache_storage.close()

urlwatch-2.17/lib/urlwatch/migration.py

# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import logging
import os.path
import sys

from .util import atomic_rename
from .storage import UrlsYaml, UrlsTxt, CacheDirStorage

logger = logging.getLogger(__name__)


def migrate_urls(urlwatcher):
    # Migrate urlwatch 1.x URLs to urlwatch 2.x
    urlwatch_config = urlwatcher.urlwatch_config

    pkgname = urlwatch_config.pkgname
    urls = urlwatch_config.urls
    urls_txt = os.path.join(urlwatch_config.urlwatch_dir, 'urls.txt')
    edit = urlwatch_config.edit
    add = urlwatch_config.add
    features = urlwatch_config.features
    edit_hooks = urlwatch_config.edit_hooks
    edit_config = urlwatch_config.edit_config
    gc_cache = urlwatch_config.gc_cache

    if os.path.isfile(urls_txt) and not os.path.isfile(urls):
        print("""
Migrating URLs: {urls_txt} -> {urls_yaml}
Use "{pkgname} --edit" to customize it.
""".format(urls_txt=urls_txt, urls_yaml=urls, pkgname=pkgname))
        UrlsYaml(urls).save(UrlsTxt(urls_txt).load_secure())
        atomic_rename(urls_txt, urls_txt + '.migrated')

    if not any([os.path.isfile(urls), edit, add, features, edit_hooks, edit_config, gc_cache]):
        print("""
You need to create {urls_yaml} in order to use {pkgname}.
Use "{pkgname} --edit" to open the file with your editor.
""".format(urls_yaml=urls, pkgname=pkgname))
        sys.exit(1)


def migrate_cache(urlwatcher):
    # Migrate urlwatch 1.x cache to urlwatch 2.x
    urlwatch_config = urlwatcher.urlwatch_config

    cache = urlwatch_config.cache
    cache_dir = os.path.join(urlwatch_config.urlwatch_dir, 'cache')

    # On Windows and macOS with case-insensitive filesystems, we have to
    # check if "cache.db" exists in the folder, and in this case, avoid
    # migration (Issue #223)
    if os.path.isdir(cache_dir) and not os.path.isfile(os.path.join(cache_dir, 'cache.db')):
        print("""
Migrating cache: {cache_dir} -> {cache_db}
""".format(cache_dir=cache_dir, cache_db=cache))
        old_cache_storage = CacheDirStorage(cache_dir)
        urlwatcher.cache_storage.restore(old_cache_storage.backup())
        urlwatcher.cache_storage.gc([job.get_guid() for job in urlwatcher.jobs])
        atomic_rename(cache_dir, cache_dir + '.migrated')

urlwatch-2.17/lib/urlwatch/reporters.py

#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import difflib
import tempfile
import subprocess
import re
import shlex
import email.utils
import itertools
import logging
import os
import sys
import time
import cgi
import functools

import requests

import urlwatch

from .mailer import SMTPMailer
from .mailer import SendmailMailer
from .util import TrackSubClasses

try:
    import chump
except ImportError:
    chump = None

try:
    from pushbullet import Pushbullet
except ImportError:
    Pushbullet = None

logger = logging.getLogger(__name__)

# Regular expressions that match the added/removed markers of GNU wdiff output
WDIFF_ADDED_RE = r'[{][+].*?[+][}]'
WDIFF_REMOVED_RE = r'[\[][-].*?[-][]]'


class ReporterBase(object, metaclass=TrackSubClasses):
    __subclasses__ = {}

    def __init__(self, report, config, job_states, duration):
        self.report = report
        self.config = config
        self.job_states = job_states
        self.duration = duration

    def convert(self, othercls):
        if hasattr(othercls, '__kind__'):
            config = self.report.config['report'][othercls.__kind__]
        else:
            config = {}

        return othercls(self.report, config, self.job_states, self.duration)

    @classmethod
    def reporter_documentation(cls):
        result = []
        for sc in TrackSubClasses.sorted_by_kind(cls):
            result.extend((
                '  * %s - %s' % (sc.__kind__, sc.__doc__),
            ))
        return '\n'.join(result)

    @classmethod
    def submit_all(cls, report, job_states, duration):
        any_enabled = False
        for name, subclass in cls.__subclasses__.items():
            cfg = report.config['report'].get(name, {'enabled': False})
            if cfg['enabled']:
                any_enabled = True
                logger.info('Submitting with %s (%r)', name, subclass)
                subclass(report, cfg, job_states, duration).submit()

        if not any_enabled:
            logger.warning('No reporters enabled.')

    def submit(self):
        raise NotImplementedError()

    def unified_diff(self, job_state):
        if job_state.job.diff_tool is not None:
            with tempfile.TemporaryDirectory() as tmpdir:
                old_file_path = os.path.join(tmpdir, 'old_file')
                new_file_path = os.path.join(tmpdir, 'new_file')
                with open(old_file_path, 'w+b') as old_file, open(new_file_path, 'w+b') as new_file:
                    old_file.write(job_state.old_data.encode('utf-8'))
                    new_file.write(job_state.new_data.encode('utf-8'))
                cmdline = shlex.split(job_state.job.diff_tool) + [old_file_path, new_file_path]
                proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE)
                stdout, _ = proc.communicate()
                # Diff tools return 0 for "nothing changed" or 1 for
                # "files differ", anything else is an error
                if proc.returncode in (0, 1):
                    return stdout.decode('utf-8')
                else:
                    raise subprocess.CalledProcessError(proc.returncode, cmdline)

        timestamp_old = email.utils.formatdate(job_state.timestamp, localtime=1)
        timestamp_new = email.utils.formatdate(time.time(), localtime=1)
        return ''.join(difflib.unified_diff([l + '\n' for l in job_state.old_data.splitlines()],
                                            [l + '\n' for l in job_state.new_data.splitlines()],
                                            '@', '@', timestamp_old, timestamp_new))


class SafeHtml(object):
    def __init__(self, s):
        self.s = s

    def __str__(self):
        return self.s

    def format(self, *args, **kwargs):
        return str(self).format(*(cgi.escape(str(arg)) for arg in args),
                                **{k: cgi.escape(str(v)) for k, v in kwargs.items()})


class HtmlReporter(ReporterBase):
    def submit(self):
        yield from (str(part) for part in self._parts())

    def _parts(self):
        cfg = self.report.config['report']['html']

        yield SafeHtml("""
        urlwatch
        """)

        for job_state in self.report.get_filtered_job_states(self.job_states):
            job = job_state.job

            if job.LOCATION_IS_URL:
                title = '{pretty_name}'
            elif job.pretty_name() != job.get_location():
                title = '{pretty_name}'
            else:
                title = '{location}'

            title = '\n{verb}: ' + title + '\n'

            yield SafeHtml(title).format(verb=job_state.verb,
                                         location=job.get_location(),
                                         pretty_name=job.pretty_name())

            content = self._format_content(job_state, cfg['diff'])
            if content is not None:
                yield content

            yield SafeHtml('')

        yield SafeHtml("""
        {pkgname} {version}, {copyright}
        Website: {url}
        watched {count} URLs in {duration} seconds
        """).format(pkgname=urlwatch.pkgname,
                    version=urlwatch.__version__,
                    copyright=urlwatch.__copyright__,
                    url=urlwatch.__url__,
                    count=len(self.job_states),
                    duration=self.duration.seconds)

    def _diff_to_html(self, unified_diff):
        for line in unified_diff.splitlines():
            if line.startswith('+'):
                yield SafeHtml('{line}').format(line=line)
            elif line.startswith('-'):
                yield SafeHtml('{line}').format(line=line)
            else:
                yield SafeHtml('{line}').format(line=line)

    def _format_content(self, job_state, difftype):
        if job_state.verb == 'error':
            return SafeHtml('\n{error}\n').format(error=job_state.traceback.strip())

        if job_state.verb == 'unchanged':
            return SafeHtml('\n{old_data}\n').format(old_data=job_state.old_data)

        if job_state.old_data in (None, job_state.new_data):
            return SafeHtml('...')

        if difftype == 'table':
            timestamp_old = email.utils.formatdate(job_state.timestamp, localtime=1)
            timestamp_new = email.utils.formatdate(time.time(), localtime=1)
            html_diff = difflib.HtmlDiff()
            return SafeHtml(html_diff.make_table(job_state.old_data.splitlines(1),
                                                 job_state.new_data.splitlines(1),
                                                 timestamp_old, timestamp_new, True, 3))
        elif difftype == 'unified':
            return ''.join((
                '',
                '\n'.join(self._diff_to_html(self.unified_diff(job_state))),
                '',
            ))
        else:
            raise ValueError('Diff style not supported: %r' % (difftype,))


class TextReporter(ReporterBase):
    def submit(self):
        cfg = self.report.config['report']['text']
        line_length = cfg['line_length']
        show_details = cfg['details']
        show_footer = cfg['footer']

        if cfg['minimal']:
            for job_state in self.report.get_filtered_job_states(self.job_states):
                pretty_name = job_state.job.pretty_name()
                location = job_state.job.get_location()
                if pretty_name != location:
                    location = '%s (%s)' % (pretty_name, location)
                yield ': '.join((job_state.verb.upper(), location))
            return

        summary = []
        details = []
        for job_state in self.report.get_filtered_job_states(self.job_states):
            summary_part, details_part = self._format_output(job_state, line_length)
            summary.extend(summary_part)
            details.extend(details_part)

        if summary:
            sep = (line_length * '=') or None
            yield from (part for part in itertools.chain(
                (sep,),
                ('%02d. %s' % (idx + 1, line) for idx, line in enumerate(summary)),
                (sep, ''),
            ) if part is not None)

        if show_details:
            yield from details

        if summary and show_footer:
            yield from ('-- ',
                        '%s %s, %s' % (urlwatch.pkgname, urlwatch.__version__, urlwatch.__copyright__),
                        'Website: %s' % (urlwatch.__url__,),
                        'watched %d URLs in %d seconds' % (len(self.job_states), self.duration.seconds))

    def _format_content(self, job_state):
        if job_state.verb == 'error':
            return job_state.traceback.strip()

        if job_state.verb == 'unchanged':
            return job_state.old_data

        if job_state.old_data in (None, job_state.new_data):
            return None

        return self.unified_diff(job_state)

    def _format_output(self, job_state, line_length):
        summary_part = []
        details_part = []

        pretty_name = job_state.job.pretty_name()
        location = job_state.job.get_location()
        if pretty_name != location:
            location = '%s (%s)' % (pretty_name, location)

        pretty_summary = ': '.join((job_state.verb.upper(), pretty_name))
        summary = ': '.join((job_state.verb.upper(), location))
        content = self._format_content(job_state)

        summary_part.append(pretty_summary)

        sep = (line_length * '-') or None
        details_part.extend((sep, summary, sep))
        if content is not None:
            details_part.extend((content, sep))
        details_part.extend(('', '') if sep else ('',))
        details_part = [part for part in details_part if part is not None]

        return summary_part, details_part


class StdoutReporter(TextReporter):
    """Print summary on stdout (the console)"""

    __kind__ = 'stdout'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._has_color = sys.stdout.isatty() and self.config.get('color', False)

    def _incolor(self, color_id, s):
        if self._has_color:
            return '\033[9%dm%s\033[0m' % (color_id, s)
        return s

    def _red(self, s):
        return self._incolor(1, s)

    def _green(self, s):
        return self._incolor(2, s)

    def _yellow(self, s):
        return self._incolor(3, s)

    def _blue(self, s):
        return self._incolor(4, s)

    def _get_print(self):
        if sys.platform == 'win32' and self._has_color:
            from colorama import AnsiToWin32
            return functools.partial(print, file=AnsiToWin32(sys.stdout).stream)
        return print

    def submit(self):
        print = self._get_print()

        cfg = self.report.config['report']['text']
        line_length = cfg['line_length']
        separators = (line_length * '=', line_length * '-', '-- ') if line_length else ()
        body = '\n'.join(super().submit())

        for line in body.splitlines():
            # Basic colorization for wdiff-style differences
            line = re.sub(WDIFF_ADDED_RE, lambda x: self._green(x.group(0)), line)
            line = re.sub(WDIFF_REMOVED_RE, lambda x: self._red(x.group(0)), line)

            # FIXME: This isn't ideal, but works for now...
            if line in separators:
                print(line)
            elif line.startswith('+'):
                print(self._green(line))
            elif line.startswith('-'):
                print(self._red(line))
            elif any(line.startswith(prefix) for prefix in ('NEW:', 'CHANGED:', 'UNCHANGED:', 'ERROR:')):
                first, second = line.split(' ', 1)
                if line.startswith('ERROR:'):
                    print(first, self._red(second))
                else:
                    print(first, self._blue(second))
            else:
                print(line)


class EMailReporter(TextReporter):
    """Send summary via e-mail / SMTP"""

    __kind__ = 'email'

    def submit(self):
        filtered_job_states = list(self.report.get_filtered_job_states(self.job_states))

        subject_args = {
            'count': len(filtered_job_states),
            'jobs': ', '.join(job_state.job.pretty_name() for job_state in filtered_job_states),
        }
        subject = self.config['subject'].format(**subject_args)

        body_text = '\n'.join(super().submit())

        if not body_text:
            logger.debug('Not sending e-mail (no changes)')
            return

        if self.config['method'] == "smtp":
            smtp_user = self.config['smtp'].get('user', None) or self.config['from']
            mailer = SMTPMailer(smtp_user,
                                self.config['smtp']['host'],
                                self.config['smtp']['port'],
                                self.config['smtp']['starttls'],
                                self.config['smtp']['keyring'])
        elif self.config['method'] == "sendmail":
            mailer = SendmailMailer(self.config['sendmail']['path'])
        else:
            logger.error('Invalid entry for method {method}'.format(method=self.config['method']))

        if self.config['html']:
            body_html = '\n'.join(self.convert(HtmlReporter).submit())

            msg = mailer.msg_html(self.config['from'], self.config['to'], subject, body_text, body_html)
        else:
            msg = mailer.msg_plain(self.config['from'], self.config['to'], subject, body_text)

        mailer.send(msg)


class WebServiceReporter(TextReporter):
    MAX_LENGTH = 1024

    def web_service_get(self):
        raise NotImplementedError

    def web_service_submit(self, service, title, body):
        raise NotImplementedError

    def submit(self):
        body_text = '\n'.join(super().submit())

        if not body_text:
            logger.debug('Not sending %s (no changes)', self.__kind__)
            return

        if len(body_text) > self.MAX_LENGTH:
            body_text =
body_text[:self.MAX_LENGTH] try: service = self.web_service_get() except Exception as e: logger.error('Failed to load or connect to %s - are the dependencies installed and configured?', self.__kind__, exc_info=True) return self.web_service_submit(service, 'Website Change Detected', body_text) class PushoverReport(WebServiceReporter): """Send summary via pushover.net""" __kind__ = 'pushover' def web_service_get(self): app = chump.Application(self.config['app']) return app.get_user(self.config['user']) def web_service_submit(self, service, title, body): sound = self.config['sound'] device = self.config['device'] msg = service.create_message(title=title, message=body, html=True, sound=sound, device=device) msg.send() class PushbulletReport(WebServiceReporter): """Send summary via pushbullet.com""" __kind__ = 'pushbullet' def web_service_get(self): return Pushbullet(self.config['api_key']) def web_service_submit(self, service, title, body): service.push_note(title, body) class MailGunReporter(TextReporter): """Custom email reporter that uses Mailgun""" __kind__ = 'mailgun' def submit(self): region = self.config.get('region', '') domain = self.config['domain'] api_key = self.config['api_key'] from_name = self.config['from_name'] from_mail = self.config['from_mail'] to = self.config['to'] if region == 'us': region = '' if region != '': region = ".{0}".format(region) filtered_job_states = list(self.report.get_filtered_job_states(self.job_states)) subject_args = { 'count': len(filtered_job_states), 'jobs': ', '.join(job_state.job.pretty_name() for job_state in filtered_job_states), } subject = self.config['subject'].format(**subject_args) body_text = '\n'.join(super().submit()) body_html = '\n'.join(self.convert(HtmlReporter).submit()) if not body_text: logger.debug('Not calling Mailgun API (no changes)') return logger.debug("Sending Mailgun request for domain:'{0}'".format(domain)) result = requests.post( "https://api{0}.mailgun.net/v3/{1}/messages".format(region, 
domain), auth=("api", api_key), data={"from": "{0} <{1}>".format(from_name, from_mail), "to": to, "subject": subject, "text": body_text, "html": body_html}) try: json_res = result.json() if (result.status_code == requests.codes.ok): logger.info("Mailgun response: id '{0}'. {1}".format(json_res['id'], json_res['message'])) else: logger.error("Mailgun error: {0}".format(json_res['message'])) except ValueError: logger.error( "Failed to parse Mailgun response. HTTP status code: {0}, content: {1}".format(result.status_code, result.content)) return result class TelegramReporter(TextReporter): """Custom Telegram reporter""" MAX_LENGTH = 4096 __kind__ = 'telegram' def submit(self): bot_token = self.config['bot_token'] chat_ids = self.config['chat_id'] chat_ids = [chat_ids] if isinstance(chat_ids, str) else chat_ids text = '\n'.join(super().submit()) if not text: logger.debug('Not calling telegram API (no changes)') return result = None for chunk in self.chunkstring(text, self.MAX_LENGTH): for chat_id in chat_ids: res = self.submitToTelegram(bot_token, chat_id, chunk) if res.status_code != requests.codes.ok or res is None: result = res return result def submitToTelegram(self, bot_token, chat_id, text): logger.debug("Sending telegram request to chat id:'{0}'".format(chat_id)) result = requests.post( "https://api.telegram.org/bot{0}/sendMessage".format(bot_token), data={"chat_id": chat_id, "text": text, "disable_web_page_preview": "true"}) try: json_res = result.json() if (result.status_code == requests.codes.ok): logger.info("Telegram response: ok '{0}'. {1}".format(json_res['ok'], json_res['result'])) else: logger.error("Telegram error: {0}".format(json_res['description'])) except ValueError: logger.error( "Failed to parse telegram response. 
HTTP status code: {0}, content: {1}".format(result.status_code, result.content)) return result def chunkstring(self, string, length): return (string[0 + i:length + i] for i in range(0, len(string), length)) class SlackReporter(TextReporter): """Custom Slack reporter""" MAX_LENGTH = 4096 __kind__ = 'slack' def submit(self): webhook_url = self.config['webhook_url'] text = '\n'.join(super().submit()) if not text: logger.debug('Not calling slack API (no changes)') return result = None for chunk in self.chunkstring(text, self.MAX_LENGTH): res = self.submit_to_slack(webhook_url, chunk) if res.status_code != requests.codes.ok or res is None: result = res return result def submit_to_slack(self, webhook_url, text): logger.debug("Sending slack request with text:{0}".format(text)) post_data = {"text": text} result = requests.post(webhook_url, json=post_data) try: if result.status_code == requests.codes.ok: logger.info("Slack response: ok") else: logger.error("Slack error: {0}".format(result.text)) except ValueError: logger.error( "Failed to parse slack response. HTTP status code: {0}, content: {1}".format(result.status_code, result.content)) return result def chunkstring(self, string, length): return (string[0 + i:length + i] for i in range(0, len(string), length)) urlwatch-2.17/lib/urlwatch/storage.py000066400000000000000000000363571345412734700177410ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. 
Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import os import stat import copy import platform from abc import ABCMeta, abstractmethod import shutil import yaml import minidb import logging from .util import atomic_rename, edit_file from .jobs import JobBase, UrlJob, ShellJob logger = logging.getLogger(__name__) DEFAULT_CONFIG = { 'display': { 'new': True, 'error': True, 'unchanged': False, }, 'report': { 'text': { 'line_length': 75, 'details': True, 'footer': True, 'minimal': False, }, 'html': { 'diff': 'unified', # "unified" or "table" }, 'stdout': { 'enabled': True, 'color': True, }, 'email': { 'enabled': False, 'html': False, 'to': '', 'from': '', 'subject': '{count} changes: {jobs}', 'method': 'smtp', 'smtp': { 'host': 'localhost', 'user': '', 'port': 25, 'starttls': True, 'keyring': True, }, 'sendmail': { 'path': 'sendmail', } }, 'pushover': { 'enabled': False, 'app': '', 'device': '', 'sound': 'spacealarm', 'user': '', }, 'pushbullet': { 'enabled': False, 'api_key': '', }, 'telegram': { 'enabled': False, 'bot_token': '', 'chat_id': '', }, 'slack': { 'enabled': False, 'webhook_url': '', }, 'mailgun': { 'enabled': False, 'region': 'us', 'api_key': '', 'domain': '', 'from_mail': '', 'from_name': '', 'to': '', 'subject': '{count} changes: {jobs}' }, }, 'job_defaults': { 'all': {}, 'shell': {}, 'url': {}, 'browser': {} } } def merge(source, destination): # http://stackoverflow.com/a/20666342 for key, value in source.items(): if isinstance(value, dict): # get node or create one node = destination.setdefault(key, {}) merge(value, node) else: destination[key] = value return destination def get_current_user(): try: return os.getlogin() except OSError: # If there is no controlling terminal, because urlwatch is launched by # cron, or by a systemd.service for example, os.getlogin() fails with: # OSError: [Errno 25] Inappropriate ioctl for device import pwd return pwd.getpwuid(os.getuid()).pw_name class BaseStorage(metaclass=ABCMeta): @abstractmethod def load(self, *args): ... 
@abstractmethod def save(self, *args): ... class BaseFileStorage(BaseStorage, metaclass=ABCMeta): def __init__(self, filename): self.filename = filename class BaseTextualFileStorage(BaseFileStorage, metaclass=ABCMeta): def __init__(self, filename): super().__init__(filename) self.config = {} self.load() @classmethod @abstractmethod def parse(cls, *args): ... def edit(self, example_file=None): fn_base, fn_ext = os.path.splitext(self.filename) file_edit = fn_base + '.edit' + fn_ext if os.path.exists(self.filename): shutil.copy(self.filename, file_edit) elif example_file is not None and os.path.exists(example_file): shutil.copy(example_file, file_edit) while True: try: edit_file(file_edit) # Check if we can still parse it if self.parse is not None: self.parse(file_edit) break # stop if no exception on parser except SystemExit: raise except Exception as e: print('Parsing failed:') print('======') print(e) print('======') print('') print('The file', file_edit, 'was NOT updated.') user_input = input("Do you want to retry the same edit? (y/n)") if user_input.lower()[0] == 'y': continue print('Your changes have been saved in', file_edit) return 1 atomic_rename(file_edit, self.filename) print('Saving edit changes in', self.filename) return 0 @classmethod def write_default_config(cls, filename): config_storage = cls(None) config_storage.filename = filename config_storage.save() class UrlsBaseFileStorage(BaseTextualFileStorage, metaclass=ABCMeta): def __init__(self, filename): self.filename = filename def shelljob_security_checks(self): if platform.system() == 'Windows': return [] shelljob_errors = [] current_uid = os.getuid() dirname = os.path.dirname(self.filename) or '.' 
dir_st = os.stat(dirname) if (dir_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0: shelljob_errors.append('%s is group/world-writable' % dirname) if dir_st.st_uid != current_uid: shelljob_errors.append('%s not owned by %s' % (dirname, get_current_user())) file_st = os.stat(self.filename) if (file_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0: shelljob_errors.append('%s is group/world-writable' % self.filename) if file_st.st_uid != current_uid: shelljob_errors.append('%s not owned by %s' % (self.filename, get_current_user())) return shelljob_errors def load_secure(self): jobs = self.load() # Security checks for shell jobs - only execute if the current UID # is the same as the file/directory owner and only owner can write shelljob_errors = self.shelljob_security_checks() if shelljob_errors and any(isinstance(job, ShellJob) for job in jobs): print(('Removing shell jobs, because %s' % (' and '.join(shelljob_errors),))) jobs = [job for job in jobs if not isinstance(job, ShellJob)] return jobs class BaseTxtFileStorage(BaseTextualFileStorage, metaclass=ABCMeta): @classmethod def parse(cls, *args): filename = args[0] if filename is not None and os.path.exists(filename): with open(filename) as fp: for line in fp: line = line.strip() if not line or line.startswith('#'): continue if line.startswith('|'): yield ShellJob(command=line[1:]) else: args = line.split(None, 2) if len(args) == 1: yield UrlJob(url=args[0]) elif len(args) == 2: yield UrlJob(url=args[0], post=args[1]) else: raise ValueError('Unsupported line format: %r' % (line,)) class BaseYamlFileStorage(BaseTextualFileStorage, metaclass=ABCMeta): @classmethod def parse(cls, *args): filename = args[0] if filename is not None and os.path.exists(filename): with open(filename) as fp: return yaml.load(fp, Loader=yaml.SafeLoader) class YamlConfigStorage(BaseYamlFileStorage): def load(self, *args): self.config = merge(self.parse(self.filename) or {}, copy.deepcopy(DEFAULT_CONFIG)) def save(self, *args): with 
open(self.filename, 'w') as fp: yaml.dump(self.config, fp, default_flow_style=False) class UrlsYaml(BaseYamlFileStorage, UrlsBaseFileStorage): @classmethod def parse(cls, *args): filename = args[0] if filename is not None and os.path.exists(filename): with open(filename) as fp: return [JobBase.unserialize(job) for job in yaml.load_all(fp, Loader=yaml.SafeLoader) if job is not None] def save(self, *args): jobs = args[0] print('Saving updated list to %r' % self.filename) with open(self.filename, 'w') as fp: yaml.dump_all([job.serialize() for job in jobs], fp, default_flow_style=False) def load(self, *args): with open(self.filename) as fp: return [JobBase.unserialize(job) for job in yaml.load_all(fp, Loader=yaml.SafeLoader) if job is not None] class UrlsTxt(BaseTxtFileStorage, UrlsBaseFileStorage): def load(self): return list(self.parse(self.filename)) def save(self, jobs): print(jobs) raise NotImplementedError() class CacheStorage(BaseFileStorage, metaclass=ABCMeta): @abstractmethod def close(self): ... @abstractmethod def get_guids(self): ... @abstractmethod def load(self, job, guid): ... @abstractmethod def save(self, job, guid, data, timestamp, tries, etag=None): ... @abstractmethod def delete(self, guid): ... @abstractmethod def clean(self, guid): ... 
def backup(self): for guid in self.get_guids(): data, timestamp, tries, etag = self.load(None, guid) yield guid, data, timestamp, tries, etag def restore(self, entries): for guid, data, timestamp, tries, etag in entries: self.save(None, guid, data, timestamp, tries, etag) def gc(self, known_guids): for guid in set(self.get_guids()) - set(known_guids): print('Removing: {guid}'.format(guid=guid)) self.delete(guid) for guid in known_guids: count = self.clean(guid) if count > 0: print('Removed {count} old versions of {guid}'.format(count=count, guid=guid)) class CacheDirStorage(CacheStorage): def __init__(self, filename): super().__init__(filename) if not os.path.exists(filename): os.makedirs(filename) def close(self): # No need to close return 0 def _get_filename(self, guid): return os.path.join(self.filename, guid) def get_guids(self): return os.listdir(self.filename) def load(self, job, guid): filename = self._get_filename(guid) if not os.path.exists(filename): return None, None, None, None try: with open(filename) as fp: data = fp.read() except UnicodeDecodeError: with open(filename, 'rb') as fp: data = fp.read().decode('utf-8', 'ignore') timestamp = os.stat(filename)[stat.ST_MTIME] return data, timestamp, None, None def save(self, job, guid, data, timestamp, etag=None): # Timestamp and ETag are always ignored filename = self._get_filename(guid) with open(filename, 'w+') as fp: fp.write(data) def delete(self, guid): filename = self._get_filename(guid) if os.path.exists(filename): os.unlink(filename) def clean(self, guid): # We only store the latest version, no need to clean return 0 class CacheEntry(minidb.Model): guid = str timestamp = int data = str tries = int etag = str class CacheMiniDBStorage(CacheStorage): def __init__(self, filename): super().__init__(filename) dirname = os.path.dirname(filename) if dirname and not os.path.isdir(dirname): os.makedirs(dirname) self.db = minidb.Store(self.filename, debug=True) self.db.register(CacheEntry) def close(self): 
self.db.close() self.db = None def get_guids(self): return (guid for guid, in CacheEntry.query(self.db, minidb.Function('distinct', CacheEntry.c.guid))) def load(self, job, guid): for data, timestamp, tries, etag in CacheEntry.query(self.db, CacheEntry.c.data // CacheEntry.c.timestamp // CacheEntry.c.tries // CacheEntry.c.etag, order_by=minidb.columns(CacheEntry.c.timestamp.desc, CacheEntry.c.tries.desc), where=CacheEntry.c.guid == guid, limit=1): return data, timestamp, tries, etag return None, None, 0, None def get_history_data(self, guid, count=1): history = {} if count < 1: return history for data, timestamp in CacheEntry.query(self.db, CacheEntry.c.data // CacheEntry.c.timestamp, order_by=minidb.columns(CacheEntry.c.timestamp.desc, CacheEntry.c.tries.desc), where=(CacheEntry.c.guid == guid) & ((CacheEntry.c.tries == 0) | (CacheEntry.c.tries == None))): # noqa if data not in history: history[data] = timestamp if len(history) >= count: break return history def save(self, job, guid, data, timestamp, tries, etag=None): self.db.save(CacheEntry(guid=guid, timestamp=timestamp, data=data, tries=tries, etag=etag)) self.db.commit() def delete(self, guid): CacheEntry.delete_where(self.db, CacheEntry.c.guid == guid) self.db.commit() def clean(self, guid): keep_id = next((CacheEntry.query(self.db, CacheEntry.c.id, where=CacheEntry.c.guid == guid, order_by=CacheEntry.c.timestamp.desc, limit=1)), (None,))[0] if keep_id is not None: result = CacheEntry.delete_where(self.db, (CacheEntry.c.guid == guid) & (CacheEntry.c.id != keep_id)) self.db.commit() return result return 0 urlwatch-2.17/lib/urlwatch/util.py000066400000000000000000000075001345412734700172360ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch (https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import logging import os import platform import subprocess import shlex logger = logging.getLogger(__name__) class TrackSubClasses(type): """A metaclass that stores subclass name-to-class mappings in the base class""" @staticmethod def sorted_by_kind(cls): return [item for _, item in sorted((it.__kind__, it) for it in cls.__subclasses__.values())] def __init__(cls, name, bases, namespace): for base in bases: if base == object: continue for attr in ('__required__', '__optional__'): if not hasattr(base, attr): continue inherited = getattr(base, attr, ()) new_value = tuple(namespace.get(attr, ())) + tuple(inherited) namespace[attr] = new_value setattr(cls, attr, new_value) for base in bases: if base == object: continue if hasattr(cls, '__kind__'): subclasses = getattr(base, '__subclasses__', None) if subclasses is not None: logger.info('Registering %r as %s', cls, cls.__kind__) subclasses[cls.__kind__] = cls break else: anonymous_subclasses = getattr(base, '__anonymous_subclasses__', None) if anonymous_subclasses is not None: logger.info('Registering %r', cls) anonymous_subclasses.append(cls) break super().__init__(name, bases, namespace) def atomic_rename(old_filename, new_filename): if platform.system() == 'Windows' and os.path.exists(new_filename): new_old_filename = new_filename + '.bak' if os.path.exists(new_old_filename): os.remove(new_old_filename) os.rename(new_filename, new_old_filename) os.rename(old_filename, new_filename) if os.path.exists(new_old_filename): os.remove(new_old_filename) else: os.rename(old_filename, new_filename) def edit_file(filename): editor = os.environ.get('EDITOR', None) if not editor: editor = os.environ.get('VISUAL', None) if not editor: raise SystemExit('Please set $VISUAL or $EDITOR.') subprocess.check_call(shlex.split(editor) + [filename]) urlwatch-2.17/lib/urlwatch/worker.py000066400000000000000000000110531345412734700175700ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # This file is part of urlwatch 
(https://thp.io/2008/urlwatch/). # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR # IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. # IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, # INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT # NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF # THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import concurrent.futures import logging import difflib import requests from .handler import JobState from .jobs import NotModifiedError logger = logging.getLogger(__name__) MAX_WORKERS = 10 def run_parallel(func, items): executor = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) for future in concurrent.futures.as_completed(executor.submit(func, item) for item in items): exception = future.exception() if exception is not None: raise exception yield future.result() def run_jobs(urlwatcher): cache_storage = urlwatcher.cache_storage jobs = [job.with_defaults(urlwatcher.config_storage.config) for job in urlwatcher.jobs] report = urlwatcher.report logger.debug('Processing %d jobs', len(jobs)) for job_state in run_parallel(lambda job_state: job_state.process(), (JobState(cache_storage, job) for job in jobs)): logger.debug('Job finished: %s', job_state.job) if not job_state.job.max_tries: max_tries = 0 else: max_tries = job_state.job.max_tries logger.debug('Using max_tries of %i for %s', max_tries, job_state.job) if job_state.exception is not None: if job_state.error_ignored: logger.info('Error while executing job %s ignored due to job config', job_state.job) elif isinstance(job_state.exception, NotModifiedError): logger.info('Job %s has not changed (HTTP 304)', job_state.job) report.unchanged(job_state) if job_state.tries > 0: job_state.tries = 0 job_state.save() elif job_state.tries < max_tries: logger.debug('This was try %i of %i for job %s', job_state.tries, max_tries, job_state.job) job_state.save() elif job_state.tries >= max_tries: logger.debug('We are now at %i tries ', job_state.tries) job_state.save() report.error(job_state) elif job_state.old_data is not None: matched_history_time = job_state.history_data.get(job_state.new_data) if matched_history_time: job_state.timestamp = matched_history_time if matched_history_time or job_state.new_data == job_state.old_data: report.unchanged(job_state) if job_state.tries > 0: job_state.tries = 0 
job_state.save() else: close_matches = difflib.get_close_matches(job_state.new_data, job_state.history_data, n=1) if close_matches: job_state.old_data = close_matches[0] job_state.timestamp = job_state.history_data[close_matches[0]] report.changed(job_state) job_state.tries = 0 job_state.save() else: report.new(job_state) job_state.tries = 0 job_state.save() urlwatch-2.17/setup.cfg000066400000000000000000000000441345412734700151250ustar00rootroot00000000000000[pycodestyle] max-line-length = 120 urlwatch-2.17/setup.py000066400000000000000000000033061345412734700150220ustar00rootroot00000000000000#!/usr/bin/env python3 from setuptools import setup from distutils import cmd import os import re import sys main_py = open(os.path.join('lib', 'urlwatch', '__init__.py')).read() m = dict(re.findall("\n__([a-z]+)__ = '([^']+)'", main_py)) docs = re.findall('"""(.*?)"""', main_py, re.DOTALL) if sys.version_info < (3, 3): sys.exit('urlwatch requires Python 3.3 or newer') m['name'] = 'urlwatch' m['author'], m['author_email'] = re.match(r'(.*) <(.*)>', m['author']).groups() m['description'], m['long_description'] = docs[0].strip().split('\n\n', 1) m['install_requires'] = ['minidb', 'PyYAML', 'requests', 'keyring', 'pycodestyle', 'appdirs', 'lxml', 'cssselect'] if sys.version_info < (3, 4): m['install_requires'].extend(['enum34']) if sys.platform == 'win32': m['install_requires'].extend(['colorama']) m['scripts'] = ['urlwatch'] m['package_dir'] = {'': 'lib'} m['packages'] = ['urlwatch'] m['python_requires'] = '>3.3.0' m['data_files'] = [ ('share/man/man1', ['share/man/man1/urlwatch.1']), ('share/urlwatch/examples', [ 'share/urlwatch/examples/hooks.py.example', 'share/urlwatch/examples/urls.yaml.example', ]), ] class InstallDependencies(cmd.Command): """Install dependencies only""" description = 'Only install required packages using pip' user_options = [] def initialize_options(self): ... def finalize_options(self): ... 
def run(self): global m try: from pip._internal import main except ImportError: from pip import main main(['install', '--upgrade'] + m['install_requires']) m['cmdclass'] = {'install_dependencies': InstallDependencies} del m['copyright'] setup(**m) urlwatch-2.17/share/000077500000000000000000000000001345412734700144105ustar00rootroot00000000000000urlwatch-2.17/share/man/000077500000000000000000000000001345412734700151635ustar00rootroot00000000000000urlwatch-2.17/share/man/man1/000077500000000000000000000000001345412734700160175ustar00rootroot00000000000000urlwatch-2.17/share/man/man1/urlwatch.1000066400000000000000000000040231345412734700177310ustar00rootroot00000000000000.TH URLWATCH "1" "January 2019" "urlwatch 2.16" "User Commands" .SH NAME urlwatch \- monitors webpages for you .SH SYNOPSIS .B urlwatch [options] .SH DESCRIPTION urlwatch is intended to help you watch changes in webpages and get notified (via e\-mail, in your terminal or through various third party services) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed. .SS "optional arguments:" .TP \fB\-h\fR, \fB\-\-help\fR show this help message and exit .TP \fB\-\-version\fR show program's version number and exit .TP \fB\-v\fR, \fB\-\-verbose\fR show debug output .SS "files and directories:" .TP \fB\-\-urls\fR FILE read job list (URLs) from FILE .TP \fB\-\-config\fR FILE read configuration from FILE .TP \fB\-\-hooks\fR FILE use FILE as hooks.py module .TP \fB\-\-cache\fR FILE use FILE as cache database .SS "Authentication:" .TP \fB\-\-smtp\-login\fR Enter password for SMTP (store in keyring) .TP \fB\-\-telegram\-chats\fR List telegram chats the bot is joined to .TP \fB\-\-test\-slack\fR Send a test notification to Slack .SS "job list management:" .TP \fB\-\-list\fR list jobs .TP \fB\-\-add\fR JOB add job (key1=value1,key2=value2,...) 
.TP \fB\-\-delete\fR JOB delete job by location or index .TP \fB\-\-test\-filter\fR JOB test filter output of job by location or index .SS "interactive commands ($EDITOR/$VISUAL):" .TP \fB\-\-edit\fR edit URL/job list .TP \fB\-\-edit\-config\fR edit configuration file .TP \fB\-\-edit\-hooks\fR edit hooks script .SS "miscellaneous:" .TP \fB\-\-features\fR list supported jobs/filters/reporters .TP \fB\-\-gc\-cache\fR remove old cache entries .SH "FILES" .TP .B $XDG_CONFIG_HOME/urlwatch/urls.yaml A list of URLs, commands and other jobs to watch .TP .B $XDG_CONFIG_HOME/urlwatch/hooks.py A Python module that can implement new job types, filters and reporters .TP .B $XDG_CACHE_HOME/urlwatch/cache.db A SQLite 3 database that contains the state history of jobs (for diffing) .SH AUTHOR Thomas Perl .SH WEBSITE https://thp.io/2008/urlwatch/ urlwatch-2.17/share/urlwatch/000077500000000000000000000000001345412734700162415ustar00rootroot00000000000000urlwatch-2.17/share/urlwatch/examples/000077500000000000000000000000001345412734700200575ustar00rootroot00000000000000urlwatch-2.17/share/urlwatch/examples/hooks.py.example000066400000000000000000000076561345412734700232240ustar00rootroot00000000000000# # Example hooks file for urlwatch # # Copyright (c) 2008-2019 Thomas Perl # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # 3. The name of the author may not be used to endorse or promote products # derived from this software without specific prior written permission. 
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#

import re

from urlwatch import filters
from urlwatch import jobs
from urlwatch import reporters


#class CustomLoginJob(jobs.UrlJob):
#    """Custom login for my webpage"""
#
#    __kind__ = 'custom-login'
#    __required__ = ('username', 'password')
#
#    def retrieve(self, job_state):
#        return 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password)


#class CaseFilter(filters.FilterBase):
#    """Custom filter for changing case, needs to be selected manually"""
#
#    __kind__ = 'case'
#
#    def filter(self, data, subfilter=None):
#        # The subfilter is specified using a colon, for example the "case"
#        # filter here can be specified as "case:upper" and "case:lower"
#
#        if subfilter is None:
#            subfilter = 'upper'
#
#        if subfilter == 'upper':
#            return data.upper()
#        elif subfilter == 'lower':
#            return data.lower()
#        else:
#            raise ValueError('Unknown case subfilter: %r' % (subfilter,))


#class IndentFilter(filters.FilterBase):
#    """Custom filter for indenting, needs to be selected manually"""
#
#    __kind__ = 'indent'
#
#    def filter(self, data, subfilter=None):
#        # The subfilter here is a number of characters to indent
#
#        if subfilter is None:
#            indent = 8
#        else:
#            indent = int(subfilter)
#
#        return '\n'.join((' '*indent) + line for line in data.splitlines())


class CustomMatchUrlFilter(filters.AutoMatchFilter):
    # The AutoMatchFilter will apply automatically to all filters
    # that have the given properties set
    MATCH = {'url': 'http://example.org/'}

    def filter(self, data):
        return data.replace('foo', 'bar')


class CustomRegexMatchUrlFilter(filters.RegexMatchFilter):
    # Similar to AutoMatchFilter
    MATCH = {'url': re.compile('http://example.org/.*')}

    def filter(self, data):
        return data.replace('foo', 'bar')


class CustomTextFileReporter(reporters.TextReporter):
    """Custom reporter that writes the text-only report to a file"""

    __kind__ = 'custom_file'

    def submit(self):
        with open(self.config['filename'], 'w') as fp:
            fp.write('\n'.join(super().submit()))


class CustomHtmlFileReporter(reporters.HtmlReporter):
    """Custom reporter that writes the HTML report to a file"""

    __kind__ = 'custom_html'

    def submit(self):
        with open(self.config['filename'], 'w') as fp:
            fp.write('\n'.join(super().submit()))
urlwatch-2.17/share/urlwatch/examples/urls.yaml.example
# This is an example urls.yaml file for urlwatch

# A basic URL job just needs a URL
name: "urlwatch webpage"
url: "https://thp.io/2008/urlwatch/"
# You can use a pre-supplied filter for this, here we apply two:
# the html2text filter that converts the HTML to plaintext and
# the grep filter that filters lines based on a regular expression
filter: html2text,grep:Current.*version,strip
---
# Built-in job kind "shell" needs a command specified
name: "Home Listing"
command: "ls -al ~"
#---
#name: "Login to some webpage (custom job)"
#url: "http://example.org/"
# This job kind is defined in hooks.py, so you need to enable it
#kind: custom-login
# Additional parameters for the custom-login job kind can be specified here
#username: "myuser"
#password: "secret"
# Filters can be specified here, separated by comma (these are also from hooks.py)
#filter: case:upper,indent:5
---
# If you want to use spaces in URLs, you have to URL-encode them (e.g. %20)
url: "http://example.org/With%20Spaces/"
---
# POST requests are done by providing a post parameter
url: "http://example.com/search.cgi"
data: "button=Search&q=something&category=4"
---
# You can use a custom HTTP method, this might be useful for cache invalidation
url: "http://example.com/foo"
method: "PURGE"
---
# You can do POST requests by providing data parameter.
# POST data can be a URL-encoded string (see last example) or a dict.
url: "http://example.com/search.cgi"
data:
  button: Search
  q: something
  category: 4
urlwatch-2.17/test/data/filter_tests.yaml
# <test_name>:
#   filter: <filter>
#   data: |
#     Input data as block scalar (string).
#     Use the literal style (starts with "|") for better readability.
#     Use a chomping indicator (-/+) to control trailing newlines.
#     Ref:
#       https://yaml.org/spec/1.2/spec.html#id2795688
#       https://yaml.org/spec/1.2/spec.html#id2794534
#   expected_result: |
#     <expected result>

element_by_tag:
  filter: element-by-tag:body
  data: |
    foo
  expected_result: |-
    foo

element_by_tag_nested:
  filter: element-by-tag:div
  data: |
    foo
    bar
  expected_result: |-
    foo
    bar

element_by_id:
  filter: element-by-id:bar
  data: |
    asdf bar
    asdf bar hoho
  expected_result: |-
    asdf bar hoho

element_by_class:
  filter: element-by-class:foo
  data: |
    foo
    bar
  expected_result: |-
    foo

xpath_elements:
  filter: xpath://div | //*[@id="bar"]
  data: |
    foo
    bar
  expected_result: |
    foo
    bar

xpath_text:
  filter: xpath://div[1]/text() | //div[2]/@id
  data: |
    foo
    bar
  expected_result: |-
    foo
    bar

xpath_exclude:
  filter:
    xpath:
      path: //div
      exclude: //*[@class='excl'] | //*/@class
  data: |
    you don't want to see me
    finterrupt!ointerrupt!o
    bar
  expected_result: |
    foo
    bar

css:
  filter: css:div
  data: |
    foo
    bar
  expected_result: |
    foo
    bar

css_exclude:
  filter:
    css:
      selector: div
      exclude: '.excl, #bar'
  data: |
    you don't want to see me
    finterrupt!ointerrupt!o
    bar
  expected_result: |
    foo

grep:
  filter: grep:blue
  data: |
    The rose is red;
    the violet's blue.
    Sugar is sweet,
    and so are you.
  expected_result: |-
    the violet's blue.

grep_with_comma:
  filter: grep:\054
  data: |
    The rose is red;
    the violet's blue.
    Sugar is sweet,
    and so are you.
  expected_result: |-
    Sugar is sweet,

json_format:
  filter: format-json
  data: |
    {"field1": {"f1.1": "value"},"field2": "value"}
  expected_result: |-
    {
        "field1": {
            "f1.1": "value"
        },
        "field2": "value"
    }

json_format_subfilter:
  filter: format-json:2
  data: |
    {"field1": {"f1.1": "value"},"field2": "value"}
  expected_result: |-
    {
      "field1": {
        "f1.1": "value"
      },
      "field2": "value"
    }

sha1:
  filter: sha1sum
  data: 1234567890abcdefg
  expected_result: 8417680c09644df743d7cea1366fbe13a31b2d5e

hexdump:
  filter: hexdump
  data: |
    Hello world!
    你好,世界!
  expected_result: |-
    48 65 6c 6c 6f 20 77 6f 72 6c 64 21 0a e4 bd a0  Hello world!....
    e5 a5 bd ef bc 8c e4 b8 96 e7 95 8c ef bc 81 0a  ................
urlwatch-2.17/test/data/invalid-url.yaml
name: "invalid url"
url: "https://invalid"
max_tries: 2
---
urlwatch-2.17/test/data/urls.txt

# This is an example urls.txt file for urlwatch
# Empty lines and lines starting with "#" are ignored

http://www.dubclub-vienna.com/
http://www.openpandora.org/developers.php
#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/info.html
#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/blatter.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/index.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/uebung.html
http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming
http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming/labor
http://ti.tuwien.ac.at/rts/teaching/courses/betriebssysteme
#http://www.complang.tuwien.ac.at/anton/lvas/effiziente-programme.html
#http://www.complang.tuwien.ac.at/anton/lvas/effizienz-aufgabe08/
http://www.kukuk.at/ical/events
http://guckes.net/cal/

# You can use the pipe character to "watch" the output of shell commands
|ls -al ~

# If you want to use spaces in URLs, you have to URL-encode them (e.g. %20)
http://example.org/With%20Spaces/

# You can do POST requests by writing the POST data behind the URL,
# separated by a single space character. POST data is URL-encoded.
http://example.com/search.cgi button=Search&q=something&category=4
urlwatch-2.17/test/data/urlwatch.yaml
display:
  error: true
  new: true
  unchanged: false
report:
  email:
    enabled: false
    from: ''
    html: false
    method: smtp
    sendmail:
      path: sendmail
    smtp:
      host: localhost
      keyring: true
      port: 25
      starttls: true
    subject: '{count} changes: {jobs}'
    to: ''
  html:
    diff: unified
  pushover:
    app: ''
    device: ''
    enabled: false
    sound: 'spacealarm'
    user: ''
  stdout:
    color: true
    enabled: true
  text:
    details: true
    footer: true
    line_length: 75
job_defaults:
  all: {}
  shell: {}
  url: {}
  browser: {}
urlwatch-2.17/test/test_filters.py
import os
import logging

import yaml

from urlwatch.filters import FilterBase

from nose.tools import eq_

logger = logging.getLogger(__name__)


def test_filters():
    def check_filter(test_name):
        filter = filter_tests[test_name]['filter']
        data = filter_tests[test_name]['data']
        expected_result = filter_tests[test_name]['expected_result']

        if isinstance(filter, dict):
            key = next(iter(filter))
            kind, subfilter = key, filter[key]
        elif isinstance(filter, str):
            if ',' in filter:
                raise ValueError('Only single filter allowed in this test')
            elif ':' in filter:
                kind, subfilter = filter.split(':', 1)
            else:
                kind = filter
                subfilter = None

        logger.info('filter kind: %s, subfilter: %s', kind, subfilter)
        filtercls = FilterBase.__subclasses__.get(kind)
        if filtercls is None:
            raise ValueError('Unknown filter kind: %s:%s' % (kind, subfilter))
        result = filtercls(None, None).filter(data, subfilter)
        logger.debug('Expected result:\n%s', expected_result)
        logger.debug('Actual result:\n%s', result)
        eq_(result, expected_result)

    with open(os.path.join(os.path.dirname(__file__), 'data/filter_tests.yaml'), 'r', encoding='utf8') as fp:
        filter_tests = yaml.safe_load(fp)
    for test_name in filter_tests:
        yield check_filter, test_name
urlwatch-2.17/test/test_handler.py
import sys
from glob import glob

import pycodestyle as pycodestyle

from urlwatch.jobs import UrlJob, JobBase, ShellJob
from urlwatch.storage import UrlsYaml, UrlsTxt

from nose.tools import raises, with_setup

import tempfile
import os
import imp

from urlwatch import storage
from urlwatch.config import BaseConfig
from urlwatch.storage import YamlConfigStorage, CacheMiniDBStorage
from urlwatch.main import Urlwatch


def test_required_classattrs_in_subclasses():
    for kind, subclass in JobBase.__subclasses__.items():
        assert hasattr(subclass, '__kind__')
        assert hasattr(subclass, '__required__')
        assert hasattr(subclass, '__optional__')


def test_save_load_jobs():
    jobs = [
        UrlJob(name='news', url='http://news.orf.at/'),
        ShellJob(name='list homedir', command='ls ~'),
        ShellJob(name='list proc', command='ls /proc'),
    ]

    # tempfile.NamedTemporaryFile() doesn't work on Windows
    # because the returned file object cannot be opened again
    fd, name = tempfile.mkstemp()
    UrlsYaml(name).save(jobs)
    jobs2 = UrlsYaml(name).load()
    os.chmod(name, 0o777)
    jobs3 = UrlsYaml(name).load_secure()
    os.close(fd)
    os.remove(name)

    assert len(jobs2) == len(jobs)
    # Assert that the shell jobs have been removed due to secure loading
    if sys.platform != 'win32':
        assert len(jobs3) == 1


def test_load_config_yaml():
    config_file = os.path.join(os.path.dirname(__file__), 'data', 'urlwatch.yaml')
    if os.path.exists(config_file):
        config = YamlConfigStorage(config_file)
        assert config is not None
        assert config.config is not None
        assert config.config == storage.DEFAULT_CONFIG


def test_load_urls_txt():
    urls_txt = os.path.join(os.path.dirname(__file__), 'data', 'urls.txt')
    if os.path.exists(urls_txt):
        assert len(UrlsTxt(urls_txt).load_secure()) > 0


def test_load_urls_yaml():
    urls_yaml = 'share/urlwatch/examples/urls.yaml.example'
    if os.path.exists(urls_yaml):
        assert len(UrlsYaml(urls_yaml).load_secure()) > 0


def test_load_hooks_py():
    hooks_py = 'share/urlwatch/examples/hooks.py.example'
    if os.path.exists(hooks_py):
        imp.load_source('hooks', hooks_py)


def test_pep8_conformance():
    """Test that we conform to PEP-8."""
    style = pycodestyle.StyleGuide(ignore=['E501', 'E402', 'W503'])
    py_files = [y for x in os.walk(os.path.abspath('.')) for y in glob(os.path.join(x[0], '*.py'))]
    py_files.append(os.path.abspath('urlwatch'))
    result = style.check_files(py_files)
    assert result.total_errors == 0, "Found #{0} code style errors".format(result.total_errors)


class TestConfig(BaseConfig):
    def __init__(self, config, urls, cache, hooks, verbose):
        (prefix, bindir) = os.path.split(os.path.dirname(os.path.abspath(sys.argv[0])))
        super().__init__('urlwatch', os.path.dirname(__file__), config, urls, cache, hooks, verbose)
        self.edit = False
        self.edit_hooks = False


def teardown_func():
    "tear down test fixtures"
    cache = os.path.join(os.path.dirname(__file__), 'data', 'cache.db')
    if os.path.exists(cache):
        os.remove(cache)


@with_setup(teardown=teardown_func)
def test_run_watcher():
    urls = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'share', 'urlwatch', 'examples', 'urls.yaml.example')
    config = os.path.join(os.path.dirname(__file__), 'data', 'urlwatch.yaml')
    cache = os.path.join(os.path.dirname(__file__), 'data', 'cache.db')
    hooks = ''

    config_storage = YamlConfigStorage(config)
    urls_storage = UrlsYaml(urls)
    cache_storage = CacheMiniDBStorage(cache)
    try:
        urlwatch_config = TestConfig(config, urls, cache, hooks, True)
        urlwatcher = Urlwatch(urlwatch_config, config_storage, cache_storage, urls_storage)
        urlwatcher.run_jobs()
    finally:
        cache_storage.close()


def test_unserialize_shell_job_without_kind():
    job = JobBase.unserialize({
        'name': 'hoho',
        'command': 'ls',
    })
    assert isinstance(job, ShellJob)


@raises(ValueError)
def test_unserialize_with_unknown_key():
    JobBase.unserialize({
        'unknown_key': 123,
        'name': 'hoho',
    })


def prepare_retry_test():
    urls = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'test', 'data', 'invalid-url.yaml')
    config = os.path.join(os.path.dirname(__file__), 'data', 'urlwatch.yaml')
    cache = os.path.join(os.path.dirname(__file__), 'data', 'cache.db')
    hooks = ''

    config_storage = YamlConfigStorage(config)
    cache_storage = CacheMiniDBStorage(cache)
    urls_storage = UrlsYaml(urls)

    urlwatch_config = TestConfig(config, urls, cache, hooks, True)
    urlwatcher = Urlwatch(urlwatch_config, config_storage, cache_storage, urls_storage)

    return urlwatcher, cache_storage


@with_setup(teardown=teardown_func)
def test_number_of_tries_in_cache_is_increased():
    urlwatcher, cache_storage = prepare_retry_test()
    try:
        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())
        assert tries == 0

        urlwatcher.run_jobs()
        urlwatcher.run_jobs()

        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())

        assert tries == 2
        assert urlwatcher.report.job_states[-1].verb == 'error'
    finally:
        cache_storage.close()


@with_setup(teardown=teardown_func)
def test_report_error_when_out_of_tries():
    urlwatcher, cache_storage = prepare_retry_test()
    try:
        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())
        assert tries == 0

        urlwatcher.run_jobs()
        urlwatcher.run_jobs()

        report = urlwatcher.report
        assert report.job_states[-1].verb == 'error'
    finally:
        cache_storage.close()


@with_setup(teardown=teardown_func)
def test_reset_tries_to_zero_when_successful():
    urlwatcher, cache_storage = prepare_retry_test()
    try:
        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())
        assert tries == 0

        urlwatcher.run_jobs()

        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())
        assert tries == 1

        # use an url that definitely exists
        job = urlwatcher.jobs[0]
        job.url = 'file://' + os.path.join(os.path.dirname(__file__), 'data', 'urlwatch.yaml')

        urlwatcher.run_jobs()

        job = urlwatcher.jobs[0]
        old_data, timestamp, tries, etag = cache_storage.load(job, job.get_guid())
        assert tries == 0
    finally:
        cache_storage.close()
urlwatch-2.17/urlwatch
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# This file is part of urlwatch (https://thp.io/2008/urlwatch/).
# Copyright (c) 2008-2019 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# File and folder paths
import logging
import os.path
import signal
import socket
import sys

from appdirs import AppDirs

pkgname = 'urlwatch'

urlwatch_dir = os.path.expanduser(os.path.join('~', '.' + pkgname))
urlwatch_cache_dir = AppDirs(pkgname).user_cache_dir

if not os.path.exists(urlwatch_dir):
    urlwatch_dir = AppDirs(pkgname).user_config_dir

# Check if we are installed in the system already
(prefix, bindir) = os.path.split(os.path.dirname(os.path.abspath(sys.argv[0])))

if bindir != 'bin':
    sys.path.insert(0, os.path.join(prefix, bindir, 'lib'))

from urlwatch.command import UrlwatchCommand
from urlwatch.config import CommandConfig
from urlwatch.main import Urlwatch
from urlwatch.storage import YamlConfigStorage, CacheMiniDBStorage, UrlsYaml

# One minute (=60 seconds) timeout for each request to avoid hanging
socket.setdefaulttimeout(60)

# Ignore SIGPIPE for stdout (see https://github.com/thp/urlwatch/issues/77)
try:
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)
except AttributeError:
    # Windows does not have signal.SIGPIPE
    ...
logger = logging.getLogger(pkgname)

CONFIG_FILE = 'urlwatch.yaml'
URLS_FILE = 'urls.yaml'
CACHE_FILE = 'cache.db'
HOOKS_FILE = 'hooks.py'


def setup_logger(verbose):
    if verbose:
        root_logger = logging.getLogger('')
        console = logging.StreamHandler()
        console.setFormatter(logging.Formatter('%(asctime)s %(module)s %(levelname)s: %(message)s'))
        root_logger.addHandler(console)
        root_logger.setLevel(logging.DEBUG)
        root_logger.info('turning on verbose logging mode')


if __name__ == '__main__':
    config_file = os.path.join(urlwatch_dir, CONFIG_FILE)
    urls_file = os.path.join(urlwatch_dir, URLS_FILE)
    hooks_file = os.path.join(urlwatch_dir, HOOKS_FILE)

    new_cache_file = os.path.join(urlwatch_cache_dir, CACHE_FILE)
    old_cache_file = os.path.join(urlwatch_dir, CACHE_FILE)
    cache_file = new_cache_file
    if os.path.exists(old_cache_file) and not os.path.exists(new_cache_file):
        cache_file = old_cache_file

    command_config = CommandConfig(pkgname, urlwatch_dir, bindir, prefix, config_file, urls_file, hooks_file, cache_file, False)
    setup_logger(command_config.verbose)

    # setup storage API
    config_storage = YamlConfigStorage(command_config.config)
    cache_storage = CacheMiniDBStorage(command_config.cache)
    urls_storage = UrlsYaml(command_config.urls)

    # setup urlwatcher
    urlwatch = Urlwatch(command_config, config_storage, cache_storage, urls_storage)
    urlwatch_command = UrlwatchCommand(urlwatch)

    # run urlwatcher
    urlwatch_command.run()
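The hooks example above registers each filter under a `__kind__` name and selects behavior with an optional colon-separated subfilter (e.g. `case:upper`), the same `kind:subfilter` convention that `test_filters.py` parses with `filter.split(':', 1)`. A minimal standalone sketch of that dispatch pattern (the registry and `apply_filter` helper here are illustrative, not urlwatch's actual `FilterBase` API):

```python
# Standalone sketch of the "kind:subfilter" dispatch convention used by
# urlwatch's filters; FILTERS/register/apply_filter are illustrative names.

FILTERS = {}


def register(cls):
    """Register a filter class under its __kind__ name."""
    FILTERS[cls.__kind__] = cls
    return cls


@register
class CaseFilter:
    """Change case; selected as 'case', 'case:upper' or 'case:lower'."""
    __kind__ = 'case'

    def filter(self, data, subfilter=None):
        if subfilter is None:
            subfilter = 'upper'
        if subfilter == 'upper':
            return data.upper()
        elif subfilter == 'lower':
            return data.lower()
        raise ValueError('Unknown case subfilter: %r' % (subfilter,))


def apply_filter(spec, data):
    # Split "kind:subfilter" the same way test_filters.py does;
    # a spec without a colon means "no subfilter".
    kind, _, subfilter = spec.partition(':')
    return FILTERS[kind]().filter(data, subfilter or None)


print(apply_filter('case:lower', 'Hello'))  # hello
print(apply_filter('case', 'Hello'))        # HELLO
```

Under this convention, adding a new filter only requires defining a class with a unique `__kind__`; the job list then refers to it by that name, optionally followed by `:subfilter`.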