urlwatch - a tool for monitoring webpages for updates

This script is intended to help you watch URLs and get notified (via email or in your terminal) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed.
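As a rough illustration of the kind of unified diff such a notification contains, the following sketch uses Python's standard difflib module. The function name make_notification, the "CHANGED:" prefix and the sample URL and page contents are made up for this example; this is not urlwatch's actual code.

```python
import difflib


def make_notification(url, old_text, new_text):
    """Return a plaintext change notice for url, or None if nothing changed."""
    if old_text == new_text:
        return None
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
    # First line names the URL, the rest is the unified diff.
    return "CHANGED: %s\n%s" % (url, "\n".join(diff))


if __name__ == "__main__":
    old = "Version 1.13\nDownload here\n"
    new = "Version 1.14\nDownload here\n"
    print(make_notification("http://example.org/", old, new))
```

Lines that were removed are prefixed with "-" and lines that were added with "+", so a version bump on a download page shows up as one "-" line and one "+" line.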

The script supports the use of a filtering hook function to strip trivially-varying elements of a webpage.
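A minimal sketch of what such a hook might look like. The function name filter_page, its (url, data) signature, the example URL and the regular expression are assumptions made for illustration; check the example hooks shipped with urlwatch for the exact interface it expects.

```python
import re

# Hypothetical URL used only for this example.
NOISY_URL = "http://example.org/news"


def filter_page(url, data):
    """Strip parts of a page that change on every request.

    Returns the filtered text, or None to leave the page unfiltered.
    """
    if url == NOISY_URL:
        # Drop a "Page generated at ..." line that changes on every request.
        return re.sub(r"(?m)^Page generated at .*$\n?", "", data)
    return None
```

With a hook like this, a timestamp line no longer triggers a notification, while real content changes on the same page still do.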

Basic features

  • Simple configuration (text file, one URL per line)
  • Easily hackable (clean Python implementation)
  • Can run as a cronjob and mail changes to you
  • Always outputs only plaintext - no HTML mails :)
  • Supports removing noise (always-changing website parts)
  • Example hooks to filter content in Python
  • Uses If-Modified-Since header to save bandwidth (new in 1.9)
  • Convert non-UTF8 web pages to UTF-8 for mail (new in 1.10)
  • Handle non-zero shell exit codes as error (new in 1.11)
  • Support for concurrent/parallel downloads (new in 1.13)
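
The If-Modified-Since item above refers to a conditional HTTP request: the client sends the time of its last download, and the server may answer 304 Not Modified without a body. A minimal sketch using Python 3's urllib follows; the function names and the way the timestamp is passed in are assumptions, not urlwatch's implementation.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_request(url, last_fetch_time=None):
    """Create a GET request, conditional if we know when we last fetched."""
    request = urllib.request.Request(url)
    if last_fetch_time is not None:
        # HTTP dates are always expressed in GMT.
        request.add_header("If-Modified-Since",
                           formatdate(last_fetch_time, usegmt=True))
    return request


def fetch_if_modified(url, last_fetch_time=None):
    """Return the page body, or None if the server reports no change."""
    try:
        with urllib.request.urlopen(build_request(url, last_fetch_time)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: nothing to download
            return None
        raise
```

Here last_fetch_time is a Unix timestamp (seconds since the epoch), for example the modification time of a locally cached copy of the page.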

Current version: 1.14

Updated 2011-11-15: urlwatch 1.14 fixes a Unicode decoding issue related to the html2txt module.

2011-12-08: If you are experiencing problems with the concurrent page updates, try setting the number of threads to 1. This might make the updates go slower, but according to at least one user, it is more stable this way. YMMV.

urlwatch 1.13 adds support for watching websites that only work with HTTP POST requests. You can add the POST data in URL-encoded form after the website URL in the urls.txt file, separated by a single space. This release also adds support for Python 3.x by providing an appropriate converter script.
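
For illustration, here is an entry with POST data and a sketch of how such a line could be split. The URL, the form fields and the parse_job helper are invented for this example; only the "URL, single space, URL-encoded data" layout comes from the description above.

```python
from urllib.parse import parse_qs


def parse_job(line):
    """Split a urls.txt line into (url, post_data); post_data may be None."""
    url, separator, post_data = line.strip().partition(" ")
    return url, (post_data if separator else None)


# A hypothetical urls.txt entry: URL, one space, URL-encoded POST data.
line = "http://example.org/search q=urlwatch&lang=en"
url, post_data = parse_job(line)
print(url)                  # http://example.org/search
print(parse_qs(post_data))  # {'q': ['urlwatch'], 'lang': ['en']}
```

Because the separator is a single space, any spaces inside the POST data itself have to be URL-encoded (as "+" or "%20").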

For Python versions earlier than 3.2, this release now depends on the "futures" package from PyPI (this module is included in the 3.2 standard library). The usage of futures should reduce the total time needed to watch several URLs, because network requests are sent in parallel, which usually leads to better bandwidth usage.
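
A minimal sketch of the idea, using concurrent.futures from the Python 3 standard library. The check_all function and its fetch and max_workers parameters are assumptions for illustration; urlwatch's own job handling may differ.

```python
import concurrent.futures
import urllib.request


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def check_all(urls, fetch=fetch, max_workers=4):
    """Fetch all URLs in parallel; map each URL to its body or its error."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # one failing URL must not stop the rest
                results[url] = error
    return results
```

Passing max_workers=1 makes the downloads run one after another, which corresponds to the "number of threads set to 1" workaround mentioned above.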

Python compatibility

urlwatch is compatible with Python 2.x (2.5 and newer) and with Python 3.x. For Python 3, you have to use the included converter script, which will convert the source code to be compatible with Python 3 (by using the 2to3 tool included in Python).

Download

Official Debian package (by Franck Joncourt)

Package information: http://packages.debian.org/urlwatch

If you have sid repositories enabled, you can install urlwatch via:

    apt-get install urlwatch

Source tarball

You can download the source tarball of urlwatch here:

urlwatch-1.14.tar.gz (2011-11-15)

Old releases

It's not recommended to run an older version than the current one.

Python Package Index

urlwatch is also indexed in the Python Package Index as "urlwatch".

Advanced features

  • Clean up "bad" HTML (long lines, etc.) with python-utidylib
  • Convert iCalendar files (*.ics) to plaintext using ical2text
  • Convert HTML to plaintext using lynx, html2text or a regex
  • Watch output of shell commands (new in 1.9)
  • Filters may return None to avoid filtering (new in 1.12)
  • Support for using HTTP POST requests (new in 1.13)

3rd party patches / Contributions

License

urlwatch is released under the terms of the BSD license.

Code repository

The Git repository of urlwatch now has a more permanent home over at repo.or.cz/w/urlwatch.git.

To check out the code using git, use this command:

    git clone git://repo.or.cz/urlwatch.git

How do I..

..watch only an element on a website?

If you are lucky, the element has an "id" attribute (but other attributes work just fine as well) that you can use with the BeautifulSoup library to extract that part of the HTML document:

    from BeautifulSoup import BeautifulSoup

    # "data" holds the downloaded HTML of the page
    soup = BeautifulSoup(data)
    # keep only the element with the given id
    data = str(soup.find(id='tisiDocumentBody'))

..watch a remote Git repository for new tags?

urlwatch supports running commands and checking their output using the Pipe symbol (|). You can use this in combination with the Git command to watch a remote repo for new tags (in urls.txt):

    |git ls-remote --tags http://github.com/gpodder/gpodder.git
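
As a sketch of how such a pipe entry could be handled, the following runs the command through the shell and treats a non-zero exit code as an error. The run_command_job helper is hypothetical and not part of urlwatch.

```python
import subprocess


def run_command_job(line):
    """Run a urls.txt entry of the form '|command' and return its output."""
    if not line.startswith("|"):
        raise ValueError("not a command entry: %r" % line)
    command = line[1:].strip()
    result = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        # A failed command should be reported, not compared as page content.
        raise RuntimeError("command exited with code %d: %s"
                           % (result.returncode, result.stderr.strip()))
    return result.stdout
```

Calling run_command_job with the git ls-remote entry shown above would return the list of tags as text (this needs git and network access); comparing that text between two runs reveals new tags.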

As an alternative, Thomas Dziedzic has written a tool in Ruby specifically for this. It is called tagurit and can be found at https://github.com/gostrc/tagurit.

Information about the User-Agent

Since version 1.3, urlwatch sends a better User-Agent string. More information about this User-Agent string can be found on this page.

Thomas Perl (m at thp io), jabber: thp@jabber.org