Introduction
Quick Start
1. Run urlwatch once to migrate your old data or start fresh.
2. Use urlwatch --edit to customize jobs and filters (urls.yaml).
3. Use urlwatch --edit-config to customize settings and reporters (urlwatch.yaml).
4. Add urlwatch to your crontab (crontab -e) to monitor webpages periodically.
The checking interval is defined by how often you run urlwatch. You
can use e.g. crontab.guru to figure out the
schedule expression for the checking interval; we recommend running it no
more often than every 30 minutes (that would be */30 * * * *). If you have
never used cron before, check out the crontab command
help.
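For example, a crontab entry that runs urlwatch every 30 minutes could look like the following (the path to the urlwatch executable is an assumption; check yours with "which urlwatch"):

```
# m    h  dom mon dow  command
*/30   *  *   *   *    /usr/local/bin/urlwatch
```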
On Windows, cron is not installed by default. Use the Windows Task
Scheduler
instead, or see this StackOverflow
question for
alternatives.
How it works
Every time you run urlwatch(1), it:
- retrieves the output of each job and filters it
- compares it with the version retrieved the previous time (“diffing”)
- if it finds any differences, it invokes enabled reporters (e.g. text reporter, e-mail reporter, …) to notify you of the changes
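The “diffing” step can be illustrated with Python’s standard difflib module. This is an illustrative sketch of the comparison, not urlwatch’s actual code; the snapshot contents are made up:

```python
import difflib

# Previously stored snapshot vs. newly retrieved (and filtered) output.
old = "price: 10 EUR\nin stock: yes\n"
new = "price: 12 EUR\nin stock: yes\n"

# Produce a unified diff of the two snapshots.
diff = list(difflib.unified_diff(
    old.splitlines(), new.splitlines(),
    fromfile="old", tofile="new", lineterm=""))

# Only if there are differences would reporters be invoked.
if diff:
    print("\n".join(diff))
```

If the two snapshots are identical, the diff is empty and nothing is reported.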
Jobs and Filters
Each website or shell command to be monitored constitutes a “job”.
The instructions for each such job are contained in a config file in the YAML
format. If you have more than one job, you separate them with a line
containing only ---.
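For example, a minimal urls.yaml with two jobs separated by --- could look like this (both URLs are placeholders):

```yaml
name: "Example website"
url: "https://example.com/"
---
name: "Example news page"
url: "https://news.example.com/"
```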
You can edit the job and filter configuration file using:
urlwatch --edit
If you get an error, set your $EDITOR (or $VISUAL) environment
variable in your shell, for example:
export EDITOR=/bin/nano
While you can edit the YAML file manually, using --edit will
do sanity checks before activating the new configuration file.
Kinds of Jobs
Each job must have exactly one of the following keys, which also defines the kind of job:
- url: retrieves what is served by the web server (HTTP GET by default)
- navigate: uses a headless browser to load web pages requiring JavaScript
- command: runs a shell command
Each job can have an optional name key to define a user-visible name for the job.
You can then use optional keys to fine-tune each job’s parameters.
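As a sketch, one job of each kind could be defined like this (the URLs and the command are placeholders):

```yaml
name: "Plain HTTP job"
url: "https://example.com/"
---
name: "JavaScript-heavy page"
navigate: "https://example.com/app"
---
name: "Local command"
command: "uptime"
```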
Filters
You may use the filter key to select one or more Filters to apply to
the data after it is retrieved, for example to:
- select HTML: css, xpath, element-by-class, element-by-id, element-by-style, element-by-tag
- make HTML more readable: html2text, beautify
- make PDFs readable: pdf2text
- make JSON more readable: format-json
- make iCal more readable: ical2text
- make binary readable: hexdump
- just detect changes: sha1sum
- edit text: grep, grepi, strip, sort, striplines
These filters can be chained. As an example, after retrieving an HTML
document by using the url key, you can extract a selection with the
xpath filter, convert this to text with html2text, use grep to
extract only lines matching a specific regular expression, and then sort
them:
name: "Sample urlwatch job definition"
url: "https://example.dummy/"
https_proxy: "http://dummy.proxy/"
max_tries: 2
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      inline_links: false
      ignore_links: true
      ignore_images: true
      pad_tables: false
      single_line_break: true
  - grep: "lines I care about"
  - sort:
---
Reporters
urlwatch can be configured to do something with its report besides (or in addition to) the default of displaying it on the console.
Reporters are configured in the global configuration file:
urlwatch --edit-config
Examples of reporters:
- email (using SMTP)
- email using mailgun
- slack
- discord
- pushbullet
- telegram
- matrix
- pushover
- stdout
- xmpp
- shell
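As a sketch, enabling the stdout and e-mail reporters in urlwatch.yaml could look like the following; the SMTP host and addresses are placeholders, and the exact keys may differ between urlwatch versions, so consult the reporter documentation for the full set of options:

```yaml
report:
  stdout:
    enabled: true
    color: true
  email:
    enabled: true
    from: "urlwatch@example.com"
    to: "you@example.com"
    method: smtp
    smtp:
      host: "smtp.example.com"
      port: 587
      starttls: true
```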