urlwatch-1.15/COPYING:

Copyright (c) 2008-2011 Thomas Perl
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. The name of the author may not be used to endorse or promote products
   derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
urlwatch-1.15/ChangeLog:

2008-03-04  Thomas Perl
    * Initial Version

2008-03-17  Thomas Perl
    * Release version 1.0

2008-03-20  Lukas Vana
    * Add support for error handling missing URLs
    * Notify users when NEW sites appear
    * Option "display_errors" can be set in watch.py

2008-03-22  Thomas Perl
    * Release version 1.1

2008-05-09  Lukas Upton
    * Fix problem with Mac OS X 10.5.2 and Ubuntu 8.04

2008-05-10  Thomas Perl
    * Release version 1.2

2008-05-15  Craig Hoffman
    * Add support for sending a User-Agent header

2008-05-16  Thomas Perl
    * Release version 1.3

2008-11-14  Thomas Perl
    + Add example for using HTML Tidy (needs python-utidylib)
    + Add example for using the ical2txt module (needs python-vobject)
    + Add ical2txt.py module for converting ics to plaintext
    * More comments in hooks.py for better user documentation
    * Release version 1.4

2008-11-18  Thomas Perl
    * Support for installing into the system
    * Use ~/.urlwatch/ for config, cache and hooks
    * Apply BSD license
    * Add setup.py (and remove makefile)
    * Command-line options
    * Verbose logging mode
    * Example urls.txt and hooks.py
    * Update README
    * Add manpage (urlwatch.1)
    * Release version 1.5

2008-12-23  Thomas Perl
    * Use hashlib in Python 2.5 and above for SHA-1 generation
    * Release version 1.6

2009-01-03  Thomas Perl
    * Add urlwatch.html2txt module to convert/format HTML to plaintext
    * Add example of using html2txt in the example hooks file
    * The html-to-plaintext feature has been suggested by Evert Meulie
    * Release version 1.7

2009-01-05  Thomas Perl
    * Fix a problem with relative links in Lynx' "-dump" mode

2009-01-07  Thomas Perl
    * Fix another problem with file-relative links in html2text w/ Lynx

2009-01-12  Thomas Perl
    * Describe ical2txt and html2txt with examples in manpage

2009-01-15  Thomas Perl
    * Add TODO list

2009-01-20  Thomas Perl
    * Set the socket timeout to one minute to avoid hangs

2009-07-27  Thomas Perl
    * Catch and handle IOErrors from FTP timeouts

2009-08-01  Thomas Perl
    * Add error handling for socket timeouts (HTTP mode)

2009-08-10  Thomas Perl
    * Handle httplib errors (Debian bug 529740)
      (Thanks to Bastian Kleineidam and Franck Joncourt)
    * urlwatch 1.8 released

2009-09-29  Thomas Perl
    * Support for shell pipe (|) in urls.txt
    * Support for If-Modified-Since header + HTTP 304
    * Show previous/current timestamp in diff output
    * Remove TODO list
    * urlwatch 1.9 released

2010-05-10  Thomas Perl
    * Get encoding from headers and convert to UTF-8
      (suggested by Ján Ondrej)
    * urlwatch 1.10 released

2010-07-30  Thomas Perl
    * Detect non-zero shell command exit codes and raise an error
    * urlwatch 1.11 released

2011-02-10  Thomas Perl
    * Allow None as return value for filters (if a filter returns
      None, interpret it as "don't filter")
    * Update website URL, contact info and copyright years
    * urlwatch 1.12 released

2011-08-22  Thomas Perl
    * Support for POST requests (suggested by Sébastien Fricker)
    * Use concurrent.futures for parallel execution (needs Python 3.2
      or "futures" from PyPI for older Python versions, including 2.x)
    * Various code changes to enhance compatibility with Python 3
    * Add convert-to-python3.sh script to convert the codebase into
      Python 3 format using the "2to3" utility included with Python
    * urlwatch 1.13 released

2011-11-15  Thomas Perl
    * Fix an encoding issue related to the html2txt module (thanks to
      Thomas Dziedzic for reporting this issue and testing the patch)
    * urlwatch 1.14 released

2012-08-30  Thomas Perl
    * Merge changes from Slavko related to UTF-8 and html2txt,
      this has been tested on Debian-based systems
    * urlwatch 1.15 released

urlwatch-1.15/PKG-INFO:

Metadata-Version: 1.0
Name: urlwatch
Version: 1.15
Summary: Watch web pages and arbitrary URLs for changes
Home-page: http://thp.io/2008/urlwatch/
Author: Thomas Perl
Author-email: m@thp.io
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
urlwatch-1.15/README:

URLWATCH README
===============

ABOUT
-----
This is a simple URL watcher, designed to send you diffs of webpages as
they change. Ideal for watching web pages of university courses, so you
always know when lecture dates have changed or new tasks are online :)

DEPENDENCIES
------------
This package requires the "concurrent.futures" module as included in
Python 3.2. For Python versions < 3.2, you can install it using:

    pip install futures

or download and install it manually from its project page at

    http://code.google.com/p/pythonfutures/

QUICK START
-----------

 1. Start "urlwatch"
 2. Edit and rename the examples in ~/.urlwatch/
 3. Add "urlwatch" to your crontab (crontab -e)
 4. Receive change notifications via e-mail
 5. Customize your hooks in ~/.urlwatch/lib/

FREQUENTLY ASKED QUESTIONS
--------------------------

Q: How do I add/remove URLs?
A: Edit ~/.urlwatch/urls.txt

Q: A page changes some content on every reload. How do I prevent
   urlwatch from always displaying these changes?
A: Edit ~/.urlwatch/lib/hooks.py and implement your filters there.
   Examples are included in the urlwatch source distribution.

Q: How do I configure urlwatch as a cron job?
A: Use "crontab -e" to add the command "urlwatch" to your crontab.
   Make sure stdout of your cronjobs is mailed to you, so you also
   get the notifications.

Q: Is there an easy way to show changes of .ics files?
A: Indeed there is. See the example hooks.py file.

Q: What about badly-formed HTML (long lines, etc..)?
A: Use python-utidylib. See the example hooks.py file.

Q: Is there a way to make the output more human-readable?
Q: Is there a way to turn it into a diff of parsed HTML perhaps?
A: Of course. See the example hooks.py file -> use html2txt.html2text(data)

Q: Why do I get an error with URLs with spaces in them?
A: Please make sure to URL-encode the URLs properly. Use %20 for spaces.

Q: The website I want to watch requires a POST request. How do I send one?
A: Add the POST data in the same line, separated by a single space.
   The format in urls.txt is:

   http://example.org/script.cgi value=5&q=search&button=Go

CONTACT
-------

Website: http://thp.io/2008/urlwatch/
E-Mail: m@thp.io
Jabber/XMPP: thp@jabber.org

urlwatch-1.15/convert-to-python3.sh:

#!/bin/sh
# Convert urlwatch sources to Python 3.x compatible format

SOURCES="urlwatch lib/urlwatch/*.py examples/hooks.py.example setup.py"

2to3 -w $SOURCES

urlwatch-1.15/examples/hooks.py.example:

#
# Example hooks file for urlwatch
#
# Copyright (c) 2008-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# You can decide which filter you want to apply using the "url"
# parameter and you can use the "re" module to search for the
# content that you want to filter, so the noise is removed.

# Needed for regular expression substitutions
import re

# Additional modules installed with urlwatch
from urlwatch import ical2txt
from urlwatch import html2txt


def filter(url, data):
    if url == 'http://www.inso.tuwien.ac.at/lectures/usability/':
        return re.sub('.*TYPO3SEARCH_end.*', '', data)
    elif url == 'https://www.auto.tuwien.ac.at/courses/viewDetails/11/':
        return re.sub('', '', data)
    elif url == 'http://grenzlandvagab.gr.funpic.de/events/':
        return re.sub('', '', data)
    elif url == 'http://www.mv-eberau.at/terminliste.php':
        return data.replace('<br>', '\n')
    elif 'iuner.lukas-krispel.at' in url:
        # Remove always-changing entries from FTP server listing
        return re.sub('drwx.*usage', '', re.sub('drwx.*logs', '', data))
    elif url.startswith('http://ti.tuwien.ac.at/rts/teaching/courses/'):
        # example of using the "tidy" module for cleaning up bad HTML
        import tidy
        mlr = re.compile('magicCalendarHeader.*magicCalendarBottom', re.S)
        data = str(tidy.parseString(data, output_xhtml=1, indent=0, tidy_mark=0))
        return re.sub(mlr, '', data)
    elif url == 'http://www.poleros.at/calender.htm':
        # remove style changes, because we only want to see content changes
        return re.sub('style="[^"]*"', '', data)
    elif url == 'http://www.ads.tuwien.ac.at/teaching/LVA/186170.html':
        return re.sub('Saved in parser cache with key .* and timestamp .* --', '',
                re.sub('Served by aragon in .* secs\.', '',
                re.sub('This page has been accessed .* times\.', '', data)))
    elif url.endswith('.ics') or url == 'http://www.kukuk.at/ical/events':
        # example of generating a summary for icalendar files
        # append "data" to the converted ical data, so you get
        # all minor changes to the ICS that are not included
        # in the ical2text summary (remove this if you want)
        return ical2txt.ical2text(data).encode('utf-8') + '\n\n' + data
    elif url == 'http://www.oho.at/programm/programm.php3':
        # example of converting HTML to plaintext for very
        # ugly HTML code that cannot be deciphered when just
        # diffing the HTML source (or if the user is just not
        # used to HTML, use this for every web page)
        #
        # You need to install "lynx" for this to work or use
        # "html2text" as method (needs "html2text") or use
        # "re" (does not need anything, but only strips tags
        # using a regular expression and does no formatting)
        return html2txt.html2text(data, method='lynx')

    # The next line is optional - if the filter function returns
    # None (or no value at all), the input data will be taken as
    # the result -> None as return value means "don't filter".
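    # Hypothetical addition (not part of the original example): thanks to
    # the None convention above, no catch-all branch is needed, but one
    # more site-specific filter could be sketched like this -- left
    # commented out, since the URL and pattern here are made-up
    # assumptions:
    #
    # elif url == 'http://example.net/news':
    #     # collapse whitespace runs so reflowed HTML is not reported
    #     return re.sub(r'\s+', ' ', data)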
    return data

urlwatch-1.15/examples/urls.txt.example:

# This is an example urls.txt file for urlwatch
#
# Empty lines and lines starting with "#" are ignored

http://www.dubclub-vienna.com/
http://www.openpandora.org/developers.php

#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/info.html
#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/blatter.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/index.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/uebung.html

http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming
http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming/labor
http://ti.tuwien.ac.at/rts/teaching/courses/betriebssysteme

#http://www.complang.tuwien.ac.at/anton/lvas/effiziente-programme.html
#http://www.complang.tuwien.ac.at/anton/lvas/effizienz-aufgabe08/

http://www.kukuk.at/ical/events
http://guckes.net/cal/

# You can use the pipe character to "watch" the output of shell commands
|ls -al ~

# If you want to use spaces in URLs, you have to URL-encode them (e.g. %20)
http://example.org/With%20Spaces/

# You can do POST requests by writing the POST data behind the URL,
# separated by a single space character. POST data is URL-encoded.
http://example.com/search.cgi button=Search&q=something&category=4

urlwatch-1.15/lib/urlwatch/__init__.py:

urlwatch-1.15/lib/urlwatch/handler.py:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# urlwatch is a minimalistic URL watcher written in Python
#
# Copyright (c) 2008-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#

try:
    # Available in Python 2.5 and above and preferred if available
    import hashlib
    have_hashlib = True
except ImportError:
    # "sha" is deprecated since Python 2.5 (throws a warning in Python 2.6)
    # Thanks to Frank Palvölgyi for reporting the warning in Python 2.6
    import sha
    have_hashlib = False

import subprocess
import email.utils
import urllib2
import os
import stat
import sys
import re


class JobBase(object):
    def __init__(self, location):
        self.location = location

    def __str__(self):
        return self.location

    def get_guid(self):
        if have_hashlib:
            sha_hash = hashlib.new('sha1')
            location = self.location
            if isinstance(location, unicode):
                location = location.encode('utf-8')
            sha_hash.update(location)
            return sha_hash.hexdigest()
        else:
            return sha.new(self.location).hexdigest()

    def retrieve(self, timestamp=None, filter_func=None, headers=None, log=None):
        raise Exception('Not implemented')


class ShellError(Exception):
    """Exception for shell commands with non-zero exit code"""

    def __init__(self, result):
        Exception.__init__(self)
        self.result = result

    def __str__(self):
        return '%s: Exit status %d' % (self.__class__.__name__, self.result)


def use_filter(filter_func, url, input):
    """Apply a filter function to input from an URL"""
    output = filter_func(url, input)

    if output is None:
        # If the filter does not return a value, it is
        # assumed that the input does not need filtering.
        # In this case, we simply return the input.
        return input

    return output


class ShellJob(JobBase):
    def retrieve(self, timestamp=None, filter_func=None, headers=None, log=None):
        process = subprocess.Popen(self.location, \
                stdout=subprocess.PIPE, \
                shell=True)
        stdout_data, stderr_data = process.communicate()
        result = process.wait()
        if result != 0:
            raise ShellError(result)

        return use_filter(filter_func, self.location, stdout_data)


class UrlJob(JobBase):
    CHARSET_RE = re.compile('text/(html|plain); charset=(.*)')

    def retrieve(self, timestamp=None, filter_func=None, headers=None, log=None):
        headers = dict(headers)
        if timestamp is not None:
            timestamp = email.utils.formatdate(timestamp)
            headers['If-Modified-Since'] = timestamp

        if ' ' in self.location:
            self.location, post_data = self.location.split(' ', 1)
            log.info('Sending POST request to %s', self.location)
        else:
            post_data = None

        request = urllib2.Request(self.location, post_data, headers)
        response = urllib2.urlopen(request)
        headers = response.info()
        content = response.read()
        encoding = 'utf-8'

        # Determine content type via HTTP headers
        content_type = headers.get('Content-type', '')
        content_type_match = self.CHARSET_RE.match(content_type)
        if content_type_match:
            encoding = content_type_match.group(2)

        # Convert from specified encoding to unicode
        if not isinstance(content, unicode):
            content = content.decode(encoding, 'ignore')

        return use_filter(filter_func, self.location, content)


def parse_urls_txt(urls_txt):
    jobs = []

    # Security checks for shell jobs - only execute if the current UID
    # is the same as the file/directory owner and only owner can write
    allow_shelljobs = True
    shelljob_errors = []
    current_uid = os.getuid()

    dirname = os.path.dirname(urls_txt)
    dir_st = os.stat(dirname)
    if (dir_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0:
        shelljob_errors.append('%s is group/world-writable' % dirname)
        allow_shelljobs = False
    if dir_st.st_uid != current_uid:
        shelljob_errors.append('%s not owned by %s' % (dirname, os.getlogin()))
        allow_shelljobs = False

    file_st = os.stat(urls_txt)
    if (file_st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)) != 0:
        shelljob_errors.append('%s is group/world-writable' % urls_txt)
        allow_shelljobs = False
    if file_st.st_uid != current_uid:
        shelljob_errors.append('%s not owned by %s' % (urls_txt, os.getlogin()))
        allow_shelljobs = False

    for line in open(urls_txt).read().splitlines():
        if line.strip().startswith('#') or line.strip() == '':
            continue

        if line.startswith('|'):
            if allow_shelljobs:
                jobs.append(ShellJob(line[1:]))
            else:
                print >>sys.stderr, '\n  SECURITY WARNING - Cannot run shell jobs:\n'
                for error in shelljob_errors:
                    print >>sys.stderr, '   ', error
                print >>sys.stderr, '\n  Please remove shell jobs or fix these problems.\n'
                sys.exit(1)
        else:
            jobs.append(UrlJob(line))

    return jobs

urlwatch-1.15/lib/urlwatch/html2txt.py:

#!/usr/bin/python
# Convert HTML data to plaintext using Lynx, html2text or a regex
# Requirements: Either lynx (default) or html2text or simply Python (for regex)
#
# This file is part of urlwatch
#
# Copyright (c) 2009-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import re

def html2text(data, method='lynx', utf8=False):
    """
    Convert a string consisting of HTML to plain text
    for easy difference checking.

    Method may be one of:
     'lynx' (default) - Use "lynx -dump" for conversion
     'html2text'      - Use "html2text -nobs" for conversion
     're'             - A simple regex-based HTML tag stripper

    If utf8 is True, the data will be handled as utf-8 by Lynx and
    html2text (if possible). It seems like only the Debian-provided
    version of html2text has support for the "-utf8" command line flag,
    so this might not work on non-Debian systems.

    Dependencies: apt-get install lynx html2text
    """
    if isinstance(data, unicode):
        data = data.encode('utf-8')

    if method == 're':
        stripped_tags = re.sub(r'<[^>]*>', '', data)
        d = '\n'.join((l.rstrip() for l in stripped_tags.splitlines() if l.strip() != ''))
        return d

    if method == 'lynx':
        cmd = ['lynx', '-dump', '-stdin']
        if utf8:
            cmd.append('-assume_charset=UTF-8')
    elif method == 'html2text':
        cmd = ['html2text', '-nobs']
        if utf8:
            cmd.append('-utf8')
    else:
        return data

    import subprocess
    html2text = subprocess.Popen(cmd, stdin=subprocess.PIPE, \
            stdout=subprocess.PIPE)
    (stdout, stderr) = html2text.communicate(data)

    if method == 'lynx':
        # Lynx translates relative links in the mode we use it to:
        # file://localhost/tmp/[RANDOM STRING]/[RELATIVE LINK]
        # Use the following regular expression to remove the unnecessary
        # parts, so that [RANDOM STRING] (changing on each call) does not
        # expose itself as change on the website (it's a Lynx-related thing)
        # Thanks to Evert Meulie for pointing that out
        stdout = re.sub(r'file://localhost/tmp/[^/]*/', '', stdout)
        # Also remove file names like L9816-5928TMP.html
        stdout = re.sub(r'L\d+-\d+TMP.html', '', stdout)

    return stdout

if __name__ == '__main__':
    import sys
    if len(sys.argv) == 2:
        print html2text(open(sys.argv[1]).read())
    else:
        print 'Usage: %s document.html' % (sys.argv[0])
        sys.exit(1)

urlwatch-1.15/lib/urlwatch/ical2txt.py:

#!/usr/bin/python
# Convert iCalendar data to plaintext (very basic, don't rely on it :)
# Requirements: python-vobject (http://vobject.skyhouseconsulting.com/)
#
# This file is part of urlwatch
#
# Copyright (c) 2008-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
def ical2text(ical_string):
    import vobject
    result = []
    if isinstance(ical_string, unicode):
        parsedCal = vobject.readOne(ical_string)
    else:
        try:
            parsedCal = vobject.readOne(ical_string)
        except:
            parsedCal = vobject.readOne(ical_string.decode('utf-8', 'ignore'))

    for event in parsedCal.getChildren():
        if event.name == 'VEVENT':
            if hasattr(event, 'dtstart'):
                start = event.dtstart.value.strftime('%F %H:%M')
            else:
                start = 'unknown start date'

            if hasattr(event, 'dtend'):
                end = event.dtend.value.strftime('%F %H:%M')
            else:
                end = start

            if start == end:
                date_str = start
            else:
                date_str = '%s -- %s' % (start, end)

            result.append('%s: %s' % (date_str, event.summary.value))

    return '\n'.join(result)

if __name__ == '__main__':
    import sys
    if len(sys.argv) == 2:
        print ical2text(open(sys.argv[1]).read())
    else:
        print 'Usage: %s icalendarfile.ics' % (sys.argv[0])
        sys.exit(1)

urlwatch-1.15/setup.py:

#!/usr/bin/python
# Generic setup.py file (for urlwatch)
#
# Copyright (c) 2008-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from distutils.core import setup

import os
import os.path
import glob
import imp

# name of our package
package = 'urlwatch'

# name of the main script
script = 'urlwatch'

# get program info from urlwatch module
s = imp.load_source('s', script)
# remove compiled file created by imp.load_source
os.unlink(script+'c')

# s.__author__ has the format "Author Name <email>"
author = s.__author__[:s.__author__.index('<')-1]
author_email = s.__author__[s.__author__.index('<')+1:s.__author__.rindex('>')]

setup(
  name = s.pkgname,
  description = s.__doc__,
  version = s.__version__,
  author = author,
  author_email = author_email,
  url = s.__homepage__,
  scripts = [script],
  package_dir = {'': 'lib'},
  packages = [s.pkgname],
  data_files = [
      # Example files
      (os.path.join('share', package, 'examples'),
       glob.glob(os.path.join('examples', '*'))),
      # Manual page
      (os.path.join('share', 'man', 'man1'),
       ['urlwatch.1']),
  ],
)

urlwatch-1.15/urlwatch:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# urlwatch is a minimalistic URL watcher written in Python
#
# Copyright (c) 2008-2011 Thomas Perl
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#

"""Watch web pages and arbitrary URLs for changes"""

pkgname = 'urlwatch'

__author__ = 'Thomas Perl <m@thp.io>'
__copyright__ = 'Copyright 2008-2011 Thomas Perl'
__license__ = 'BSD'
__homepage__ = 'http://thp.io/2008/urlwatch/'
__version__ = '1.15'

user_agent = '%s/%s (+http://thp.io/2008/urlwatch/info.html)' % (pkgname, __version__)

# Configuration section
display_errors = False
line_length = 75


# File and folder paths
import sys
import os.path

urlwatch_dir = os.path.expanduser(os.path.join('~', '.'+pkgname))
urls_txt = os.path.join(urlwatch_dir, 'urls.txt')
cache_dir = os.path.join(urlwatch_dir, 'cache')
scripts_dir = os.path.join(urlwatch_dir, 'lib')
hooks_py = os.path.join(scripts_dir, 'hooks.py')

# Check if we are installed in the system already
(prefix, bindir) = os.path.split(os.path.dirname(os.path.abspath(sys.argv[0])))

if bindir == 'bin':
    # Assume we are installed in system
    examples_dir = os.path.join(prefix, 'share', pkgname, 'examples')
else:
    # Assume we are not yet installed
    examples_dir = os.path.join(prefix, bindir, 'examples')
    sys.path.append(os.path.join(prefix, bindir, 'lib'))

urls_txt_example = os.path.join(examples_dir, 'urls.txt.example')
hooks_py_example = os.path.join(examples_dir, 'hooks.py.example')

# Code section

import shutil
import os
import stat
import urllib2
import httplib
import email.utils
import time
import socket
import difflib
import datetime
import optparse
import logging
import imp

# Python 3.2 includes "concurrent.futures", for older versions,
# use "pip install futures" or http://code.google.com/p/pythonfutures/
import concurrent.futures

from urlwatch import handler

# One minute (=60 seconds) timeout for each request to avoid hangs
socket.setdefaulttimeout(60)

log = logging.getLogger(pkgname)
log.setLevel(logging.DEBUG)

class NullHandler(logging.Handler):
    def emit(self, record):
        pass

log.addHandler(NullHandler())

ERROR_MESSAGE_URLS_TXT = """
Error: You need to create a urls.txt file first.
Place it in %s
An example is available in %s
"""

ERROR_MESSAGE_HOOKS_PY = """
You can also create %s
An example is available in %s
"""

MAX_WORKERS = 10


def foutput(type, url, content=None, summary=None, c='*', n=line_length):
    """Format output messages

    Returns a snippet of a specific message type (i.e. 'changed') for
    a specific URL and an optional (possibly multi-line) content.

    The parameter "summary" (if specified) should be a list variable
    that gets one item appended for the summary of the changes.

    The return value is a list of strings (one item per line).
    """
    summary_txt = ': '.join((type.upper(), str(url)))

    if summary is not None:
        if content is None:
            summary.append(summary_txt)
        else:
            summary.append('%s (%d bytes)' % (summary_txt, len(str(content))))

    result = [c*n, summary_txt]
    if content is not None:
        result += [c*n, str(content)]
    result += [c*n, '', '']

    return result


if __name__ == '__main__':
    start = datetime.datetime.now()

    # Option parser
    parser = optparse.OptionParser(usage='%%prog [options]\n\n%s' % __doc__.strip(),
            version=pkgname+' '+__version__)
    parser.add_option('-v', '--verbose', action='store_true', dest='verbose',
            help='Show debug/log output')
    parser.add_option('', '--urls', dest='urls', metavar='FILE',
            help='Read URLs from the specified file')
    parser.add_option('', '--hooks', dest='hooks', metavar='FILE',
            help='Use specified file as hooks.py module')
    parser.add_option('-e', '--display-errors', action='store_true',
            dest='display_errors', help='Include HTTP errors (404, etc..) in the output')

    parser.set_defaults(verbose=False, display_errors=False)

    (options, args) = parser.parse_args(sys.argv)

    if options.verbose:
        # Enable logging to the console
        console = logging.StreamHandler()
        console.setLevel(logging.DEBUG)
        formatter = logging.Formatter('%(asctime)s %(levelname)s: %(message)s')
        console.setFormatter(formatter)
        log.addHandler(console)
        log.info('turning on verbose logging mode')

    if options.display_errors:
        log.info('turning display of errors ON')
        display_errors = True

    if options.urls:
        if os.path.isfile(options.urls):
            urls_txt = options.urls
            log.info('using %s as urls.txt' % options.urls)
        else:
            log.error('%s is not a file' % options.urls)
            print 'Error: %s is not a file' % options.urls
            sys.exit(1)

    if options.hooks:
        if os.path.isfile(options.hooks):
            hooks_py = options.hooks
            log.info('using %s as hooks.py' % options.hooks)
        else:
            log.error('%s is not a file' % options.hooks)
            print 'Error: %s is not a file' % options.hooks
            sys.exit(1)

    # Create all needed folders
    for needed_dir in (urlwatch_dir, cache_dir, scripts_dir):
        if not os.path.isdir(needed_dir):
            os.makedirs(needed_dir)

    # Check for required files
    if not os.path.isfile(urls_txt):
        log.warning('not a file: %s' % urls_txt)
        urls_txt_fn = os.path.join(os.path.dirname(urls_txt),
                os.path.basename(urls_txt_example))
        hooks_py_fn = os.path.join(os.path.dirname(hooks_py),
                os.path.basename(hooks_py_example))
        print ERROR_MESSAGE_URLS_TXT % (urls_txt, urls_txt_fn)
        if not options.hooks:
            print ERROR_MESSAGE_HOOKS_PY % (hooks_py, hooks_py_fn)
        if os.path.exists(urls_txt_example) and not os.path.exists(urls_txt_fn):
            shutil.copy(urls_txt_example, urls_txt_fn)
        if not options.hooks and os.path.exists(hooks_py_example) and not os.path.exists(hooks_py_fn):
            shutil.copy(hooks_py_example, hooks_py_fn)
        sys.exit(1)

    headers = {
        'User-agent': user_agent,
    }

    summary = []
    details = []
    count = 0

    filter_func = lambda x, y: y

    if os.path.exists(hooks_py):
        log.info('using hooks.py from %s' % hooks_py)
        hooks = imp.load_source('hooks', hooks_py)
        if hasattr(hooks, 'filter'):
            log.info('found and enabled filter function from hooks.py')
            filter_func = hooks.filter
        else:
            log.warning('hooks.py has no filter function - ignoring')
    else:
        log.info('not using hooks.py (file not found)')

    def process_job(job):
        log.info('now processing: %s', job.location)
        filename = os.path.join(cache_dir, job.get_guid())
        timestamp = None

        if os.path.exists(filename):
            timestamp = os.stat(filename)[stat.ST_MTIME]

        data = job.retrieve(timestamp, filter_func, headers, log)
        return filename, timestamp, data

    jobs = handler.parse_urls_txt(urls_txt)
    log.info('processing %d jobs', len(jobs))
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)
    future_to_job = dict((executor.submit(process_job, job), job)
            for job in jobs)
    for future in concurrent.futures.as_completed(future_to_job):
        job = future_to_job[future]

        log.info('job finished: %s' % job.location)

        try:
            exception = future.exception()
            if exception is not None:
                raise exception

            filename, timestamp, data = future.result()

            if os.path.exists(filename):
                log.info('%s exists - creating unified diff' % filename)
                old_data = open(filename).read()

                if (not isinstance(old_data, unicode) and
                        isinstance(data, unicode)):
                    # Fix for Python 2's unicode/str woes
                    data = data.encode('utf-8')

                timestamp_old = email.utils.formatdate(timestamp, localtime=1)
                timestamp_new = email.utils.formatdate(time.time(), localtime=1)
                diff = ''.join(difflib.unified_diff(
                    old_data.splitlines(1),
                    data.splitlines(1),
                    '@', '@',
                    timestamp_old, timestamp_new))
                if len(diff) > 0:
                    log.info('%s has changed - adding diff' % job)
                    details += foutput('changed', job, diff, summary)
                else:
                    log.info('%s has not changed' % job)
            else:
                log.info('%s does not exist - is considered "new"' % filename)
                details += foutput('new', job, None, summary)
            log.info('writing current content of %s to %s' % (job, filename))
            try:
                open(filename, 'w').write(data)
            except UnicodeEncodeError:
                # Happens in Python 2 when data contains non-ascii characters
                open(filename, 'w').write(data.encode('utf-8'))
        except urllib2.HTTPError, error:
            if error.code == 304:
                log.info('%s has not changed (HTTP 304)' % job)
            else:
                log.error('got HTTPError while loading url: %s' % error)
                if display_errors:
                    details += foutput('error', job, error, summary)
        except handler.ShellError, error:
            log.error('Shell returned %d' % error.result)
            if display_errors:
                details += foutput('error', job, error, summary)
        except urllib2.URLError, error:
            log.error('got URLError while loading url: %s' % error)
            if display_errors:
                details += foutput('error', job, error, summary)
        except IOError, error:
            log.error('got IOError while loading url: %s' % error)
            if display_errors:
                details += foutput('error', job, error, summary)
        except socket.timeout, error:
            log.error('got timeout while loading url: %s' % error)
            if display_errors:
                details += foutput('error', job, error, summary)
        except httplib.error, error:
            # This is to work around a bug in urllib2, see
            # http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=529740
            log.error('got httplib error while loading url: %s' % error)
            if display_errors:
                details += foutput('error', job,
                        (repr(error) + '\n' + str(error)).strip(), summary)
        count += 1

    end = datetime.datetime.now()

    # Output everything
    if len(summary) > 1:
        log.info('printing summary with %d items' % len(summary))
        print '-'*line_length
        print 'summary: %d changes' % (len(summary),)
        print ''
        for id, line in enumerate(summary):
            print '%02d. %s' % (id+1, line)
        print '-'*line_length
        print '\n\n\n'
    else:
        log.info('summary is too short - not printing')

    if len(details) > 1:
        log.info('printing details with %d items' % len(details))
        print '\n'.join(details)
        print '-- '
        print '%s %s, %s' % (pkgname, __version__, __copyright__)
        print 'Website: %s' % (__homepage__,)
        print 'watched %d URLs in %d seconds\n' % (count, (end-start).seconds)
    else:
        log.info('no details collected - not printing')

urlwatch-1.15/urlwatch.1

.TH URLWATCH "1" "August 2012" "urlwatch 1.15" "User Commands"
.SH NAME
urlwatch \- Watch web pages and arbitrary URLs for changes
.SH SYNOPSIS
.B urlwatch
[\fIoptions\fR]
.SH DESCRIPTION
urlwatch watches a list of URLs for changes and prints out unified
diffs of the changes. You can filter always-changing parts of websites
by providing a "hooks.py" script.
.SH OPTIONS
.TP
\fB\-\-version\fR
show program's version number and exit
.TP
\fB\-h\fR, \fB\-\-help\fR
show the help message and exit
.TP
\fB\-v\fR, \fB\-\-verbose\fR
Show debug/log output
.TP
\fB\-\-urls\fR=\fIFILE\fR
Read URLs from the specified file
.TP
\fB\-\-hooks\fR=\fIFILE\fR
Use specified file as hooks.py module
.TP
\fB\-e\fR, \fB\-\-display\-errors\fR
Include HTTP errors (404, etc..) in the output
.SH ADVANCED FEATURES
urlwatch includes some advanced features that you have to activate by
creating a hooks.py file that specifies for which URLs to use a
specific feature. You can also use the hooks.py file to filter
trivially-varying elements of a web page.
.SS ICALENDAR FILE PARSING
This module allows you to parse .ics files that are in iCalendar
format and provide a very simplified text-based format for the diffs.
Use it like this in your hooks.py file:

    from urlwatch import ical2txt

    def filter(url, data):
        if url.endswith('.ics'):
            return ical2txt.ical2text(data).encode('utf-8') + data
        # ...you can add more hooks here...
.SS HTML TO TEXT CONVERSION
There are three methods of converting HTML to text in the current
version of urlwatch: "lynx" (default), "html2text" and "re". The
former two use command-line utilities of the same name to convert
HTML to text, and the last one uses a simple regex-based tag stripping
method (needs no extra tools). Here is an example of using it in your
hooks.py file:

    from urlwatch import html2txt

    def filter(url, data):
        if url.endswith('.html') or url.endswith('.htm'):
            return html2txt.html2text(data, method='lynx')
        # ...you can add more hooks here...

.SH "FILES"
.TP
.B ~/.urlwatch/urls.txt
A list of HTTP/FTP URLs to watch (one URL per line)
.TP
.B ~/.urlwatch/lib/hooks.py
A Python module that can be used to filter contents
.TP
.B ~/.urlwatch/cache/
The state of web pages is saved in this folder
.SH AUTHOR
Thomas Perl
.SH WEBSITE
http://thp.io/2008/urlwatch/
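The ADVANCED FEATURES section above describes hooks.py as a way to filter out trivially-varying elements of a page before diffing. As a minimal sketch of such a filter: the `filter(url, data)` signature is the one used by the manpage's own examples, but the URL and the "Last updated" pattern below are made-up placeholders, not part of urlwatch.

```python
import re

def filter(url, data):
    # Hypothetical example: a page at example.com carries a timestamp
    # that changes on every fetch; normalize it so it never shows up
    # as a diff. Pattern and URL are illustrative only.
    if url.startswith('http://example.com/'):
        return re.sub(r'Last updated: .*', 'Last updated: <stripped>', data)
    # Unhandled URLs pass through unchanged
    return data
```

Because urlwatch hashes the *filtered* content, stripping the timestamp here means the page only triggers a notification when something other than the timestamp changes.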