urlwatch - a tool for monitoring webpages for updates
This script is intended to help you watch URLs and get notified (via email or in your terminal) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed.
The script supports the use of a filtering hook function to strip trivially-varying elements of a webpage.
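As a rough illustration of how a filtering hook and the unified diff fit together, here is a minimal sketch. It is not urlwatch's actual code: the function names strip_noise and make_diff, the timestamp pattern, and the example URL are all assumptions made up for this example.

```python
import difflib
import re


def strip_noise(url, data):
    """Hypothetical filter: remove a clock-style timestamp that changes on every request."""
    return re.sub(r'\d{2}:\d{2}:\d{2}', '', data)


def make_diff(url, old, new):
    """Return a unified diff of two filtered page snapshots, or '' if nothing changed."""
    old_lines = strip_noise(url, old).splitlines()
    new_lines = strip_noise(url, new).splitlines()
    diff = difflib.unified_diff(old_lines, new_lines,
                                fromfile=url + ' (old)',
                                tofile=url + ' (new)',
                                lineterm='')
    return '\n'.join(diff)


url = 'http://example.com/'  # placeholder URL
old = 'Generated at 10:00:00\nPrice: 10 EUR'
noise_only = 'Generated at 10:05:31\nPrice: 10 EUR'
real_change = 'Generated at 10:05:31\nPrice: 12 EUR'

# Only the timestamp differs, so the filtered snapshots are identical.
print(repr(make_diff(url, old, noise_only)))
# The price line changed, so it shows up as a -/+ pair in the diff.
print(make_diff(url, old, real_change))
```

Without the filter, the first comparison would report a change on every run, which is exactly the kind of noise a hook is meant to remove.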
Basic features
- Simple configuration (text file, one URL per line)
- Easily hackable (clean Python implementation)
- Can run as a cronjob and mail changes to you
- Always outputs only plaintext - no HTML mails :)
- Supports removing noise (always-changing website parts)
- Example hooks to filter content in Python
- Uses If-Modified-Since header to save bandwidth (new in 1.9)
- Converts non-UTF-8 web pages to UTF-8 for mail (new in 1.10)
- Handles non-zero shell exit codes as errors (new in 1.11)
- Support for concurrent/parallel downloads (new in 1.13)
- Support for handling UTF-8 in Lynx and html2text (new in 1.15)
Current version: 1.15
2012-08-30: urlwatch 1.15 adds support for optional UTF-8 handling in the html2text function of the "html2txt" helper module. Patch contributed by Slavko.
2011-11-15: urlwatch 1.14 fixes a unicode decoding issue related to the html2txt module.
2011-12-08: If you are experiencing problems with the concurrent page updates, try setting the number of threads to 1. This might make the updates go slower, but according to at least one user, it is more stable this way. YMMV.
urlwatch 1.13 adds support for watching websites that only work with HTTP POST requests. You can add the POST data in URL-encoded form after the website URL in the urls.txt file, separated by a single space. This release also adds support for Python 3.x by providing an appropriate converter script.
For Python versions earlier than 3.2, this release now depends on the "futures" package from PyPI (this module is included in the 3.2 standard library). The usage of futures should reduce the total time needed to watch several URLs, because network requests are sent in parallel, which usually leads to better bandwidth usage.
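To give an idea of what parallel requests with futures look like, here is a minimal sketch using a thread pool. It is an illustration only, not urlwatch's implementation: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real download so the example runs without network access. Setting max_workers to 1 corresponds to the single-thread fallback mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a thread pool and return {url: result}.

    An exception raised by fetch() is stored as the result for that URL
    instead of aborting the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything first so the requests can overlap.
        futures = [(url, executor.submit(fetch, url)) for url in urls]
        for url, future in futures:
            try:
                results[url] = future.result()
            except Exception as error:
                results[url] = error
    return results


def fake_fetch(url):
    """Stand-in for a real download (assumption for this example)."""
    if 'broken' in url:
        raise IOError('could not fetch ' + url)
    return 'content of ' + url


urls = ['http://example.com/a',
        'http://example.com/b',
        'http://example.com/broken']

# One worker means one download at a time; raise the number to run in parallel.
results = fetch_all(urls, fake_fetch, max_workers=1)
for url in urls:
    print(url, '->', results[url])
```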
Python compatibility
urlwatch is compatible with Python 2.x (2.5 and newer) and with Python 3.x. For Python 3, you have to use the included converter script, which will convert the source code to be compatible with Python 3 (by using the 2to3 tool included in Python).
Download
Official Debian package (by Franck Joncourt)
Package information: http://packages.debian.org/urlwatch
If you have sid repositories enabled, you can install urlwatch via:
apt-get install urlwatch
Source tarball
You can download the source tarball of urlwatch here:
Old releases
Running a version older than the current one is not recommended.
- urlwatch-1.14.tar.gz (2011-11-15)
- urlwatch-1.13.tar.gz (2011-08-22)
- urlwatch-1.12.tar.gz (2011-02-10)
- urlwatch-1.11.tar.gz (2010-07-30)
- urlwatch-1.10.tar.gz (2010-05-10)
- urlwatch-1.9.tar.gz (2009-09-29)
- urlwatch-1.8.tar.gz (2009-08-10)
- urlwatch-1.7.tar.gz (2009-01-03)
- urlwatch-1.6.tar.gz (2008-12-23)
- urlwatch-1.5.tar.gz (2008-11-18)
- urlwatch-1.4.tar.gz (2008-11-14)
- urlwatch-1.3.tar.gz (2008-05-16)
- urlwatch-1.2.tar.gz (2008-05-10)
- urlwatch-1.1.tar.gz (2008-03-22)
- urlwatch-1.0.tar.gz (2008-03-17)
Python Package Index
urlwatch is also indexed in the Python Package Index as "urlwatch":
Advanced features
- Clean up "bad" HTML (long lines, etc.) with python-utidylib
- Convert iCalendar files (*.ics) to plaintext using ical2text
- Convert HTML to plaintext using lynx, html2text or a regex
- Watch output of shell commands (new in 1.9)
- Filters may return None to avoid filtering (new in 1.12)
- Support for using HTTP POST requests (new in 1.13)
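For the HTTP POST support listed above, the POST data goes after the URL in urls.txt, URL-encoded and separated by a single space. The following sketch shows one way to produce such a line with the standard library. The helper name urls_txt_line, the URL, and the form fields are made up for illustration.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def urls_txt_line(url, fields):
    """Build a urls.txt line: the URL, one space, then URL-encoded POST data."""
    return url + ' ' + urlencode(fields)


# Placeholder URL and made-up form fields, for illustration only.
line = urls_txt_line('http://example.com/search',
                     [('query', 'urlwatch 1.13'), ('lang', 'en')])
print(line)
```

Because urlencode turns spaces inside values into "+", the only literal space left in the line is the separator between the URL and the POST data.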
3rd party patches / Contributions
- Michael Düll has integrated XMPP support into urlwatch, so you can get notified via Jabber instead of email: urlwatch_xmpp_1.9.patch
License
urlwatch is released under the terms of the BSD license
Code repository
The Git repository of urlwatch now has a more permanent home over at repo.or.cz/w/urlwatch.git.
To check out the code using git, use this command:
git clone git://repo.or.cz/urlwatch.git
How do I..
..watch only an element on a website?
If you are lucky, the element has an "id" attribute (but other attributes work just fine as well) that you can use with the BeautifulSoup library to extract that part of the HTML document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(data)
data = str(soup.find(id='tisiDocumentBody'))
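If BeautifulSoup is not available, something similar can be approximated with the standard library. The sketch below is a swapped-in technique, not part of urlwatch: it assumes Python 3 and reasonably well-formed HTML (every non-void tag is closed), and the names ElementTextById and text_of_element are invented for this example. Unlike the BeautifulSoup snippet, it returns only the text inside the element, not its markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class ElementTextById(HTMLParser):
    """Collect the text found inside the element with the given id."""

    def __init__(self, element_id):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.element_id = element_id
        self.depth = 0      # 0 means "outside the wanted element"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get('id') == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_of_element(html, element_id):
    parser = ElementTextById(element_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()


page = ('<html><body><div id="menu">Menu</div>'
        '<div id="tisiDocumentBody"><p>Hello <b>world</b></p></div>'
        '<p>Footer</p></body></html>')
print(text_of_element(page, 'tisiDocumentBody'))
```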
..watch a remote Git repository for new tags?
urlwatch supports running commands and checking their output using the pipe symbol (|). Combine this with git ls-remote to watch a remote repository for new tags (in urls.txt):
|git ls-remote --tags http://github.com/gpodder/gpodder.git
As an alternative, Thomas Dziedzic has written a Ruby tool specifically for this. It is called tagurit and can be found at https://github.com/gostrc/tagurit.
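The raw output of git ls-remote --tags contains object IDs and, for annotated tags, a second "^{}" entry per tag. If you only care about the tag names, a filter along the following lines could reduce the output before it is compared. This is a sketch under assumptions: the function name tag_names is hypothetical, and the object IDs and tag names in the sample are made up.

```python
def tag_names(ls_remote_output):
    """Reduce `git ls-remote --tags` output to one sorted tag name per line.

    Each output line has the form "<object id><TAB>refs/tags/<name>".
    Annotated tags appear a second time with a "^{}" suffix, which is
    dropped here so every tag is listed once.
    """
    prefix = 'refs/tags/'
    names = set()
    for line in ls_remote_output.splitlines():
        parts = line.split('\t')
        if len(parts) != 2 or not parts[1].startswith(prefix):
            continue  # skip anything that is not a tag ref
        name = parts[1][len(prefix):]
        if name.endswith('^{}'):
            name = name[:-len('^{}')]
        names.add(name)
    return '\n'.join(sorted(names))


# Made-up sample output, for illustration only.
sample = ('1111111111111111111111111111111111111111\trefs/tags/v1.1\n'
          '2222222222222222222222222222222222222222\trefs/tags/v1.0\n'
          '3333333333333333333333333333333333333333\trefs/tags/v1.0^{}\n')
print(tag_names(sample))
```

With a filter like this, the diff shows a single added line when a new tag appears.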
..get colored diff output on the console?
You can use colordiff to convert normal diffs to colored ones. Just pipe the urlwatch output into colordiff, and you have colored urlwatch output:
urlwatch | colordiff
