NAME
urlwatch-filters - Filtering output and diff data of urlwatch jobs

SYNOPSIS
urlwatch --edit

DESCRIPTION
Each job can have two filter stages configured, with one or more filters processed after each other:
While creating your filter pipeline, you might want to preview what the filtered output looks like. You can do so by first configuring your job and then running urlwatch with the --test-filter command, passing in the index (from --list) or the URL/location of the job to be tested:

    urlwatch --test-filter 1                     # Test the first job in the list
    urlwatch --test-filter https://example.net/  # Test the job with the given URL

The output of this command is the filtered plaintext of the job. In a real urlwatch run, this output is the input to the diff algorithm.

The filter is only applied to new content; the old content was already filtered when it was retrieved. This means that changes to a filter are not visible when reporting unchanged contents (see configuration_display for details), and the diff output will be between (old content with the filter at the time the old content was retrieved) and (new content with the current filter).

Once urlwatch has collected at least 2 historic snapshots of a job (two different states of a webpage), you can use the command-line option --test-diff-filter to test your diff_filter settings; this will use historic data cached locally.

BUILT-IN FILTERS
The list of built-in filters can be retrieved using:

    urlwatch --features

At the moment, the following filters are built-in:

PICKING OUT ELEMENTS FROM A WEBPAGE
You can pick out a single HTML element with a built-in filter. For example, to extract <div id="something">...</div> from a page, you can use the following in your urls.yaml:

    url: http://example.org/idtest.html
    filter:
      - element-by-id: something

You can also chain filters, so you can run html2text on the result:

    url: http://example.net/id2text.html
    filter:
      - element-by-id: something
      - html2text

CHAINING MULTIPLE FILTERS
The example urls.yaml file also demonstrates the use of built-in filters. Here 3 filters are used (html2text, line-grep and whitespace removal) to get just a certain info field from a webpage:

    url: https://example.net/version.html
    filter:
      - html2text
      - grep: "Current.*version"
      - strip

EXTRACTING ONLY THE <BODY> TAG OF A PAGE
If you want to extract only the body tag, you can use this filter:

    url: https://example.org/bodytag.html
    filter:
      - element-by-tag: body

FILTERING BASED ON AN XPATH EXPRESSION
To filter based on an XPath <https://www.w3.org/TR/1999/REC-xpath-19991116/> expression, you can use the xpath filter like so:

    url: https://example.net/xpath.html
    filter:
      - xpath: /html/body/marquee

This filters only the <marquee> elements directly below the <body> element, which in turn must be below the <html> element of the document, stripping out everything else.

See Microsoft’s XPath Examples <https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx> page for some other examples. You can also find the XPath of an <html> node in the Chromium/Google Chrome developer tools by right-clicking on the node and selecting copy XPath.

FILTERING BASED ON CSS SELECTORS
To filter based on a CSS selector <https://www.w3.org/TR/2011/REC-css3-selectors-20110929/>, you can use the css filter like so:

    url: https://example.net/css.html
    filter:
      - css: ul#groceries > li.unchecked

This would filter only <li class="unchecked"> tags directly below <ul id="groceries"> elements.
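
To make the "directly below" behaviour of the selector above concrete, here is a minimal Python sketch. It is not how urlwatch implements the css filter: it only uses the standard library's xml.etree.ElementTree with a roughly equivalent XPath, it needs well-formed markup, and it compares the class attribute exactly. The sample page and the function name select_unchecked are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """
<html><body>
  <ul id="groceries">
    <li class="unchecked">milk</li>
    <li class="checked">bread</li>
    <li class="unchecked">eggs</li>
    <ol><li class="unchecked">nested, not a direct child</li></ol>
  </ul>
  <ul id="hardware"><li class="unchecked">nails</li></ul>
</body></html>
"""


def select_unchecked(page: str) -> list[str]:
    """Return the text of <li class="unchecked"> directly below <ul id="groceries">."""
    root = ET.fromstring(page)
    # "/" between ul and li means "direct child", like ">" in the CSS selector.
    items = root.findall(".//ul[@id='groceries']/li[@class='unchecked']")
    return [item.text for item in items]


print(select_unchecked(PAGE))
```

Only the two direct children with class "unchecked" are selected; the nested item and the item in the other list are left out.
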
Some limitations and extensions exist as explained in cssselect’s documentation <https://cssselect.readthedocs.io/en/latest/#supported-selectors>.

USING XPATH AND CSS FILTERS WITH XML AND EXCLUSIONS
By default, XPath and CSS filters are set up for HTML documents. However, it is possible to use them for XML documents as well (these examples parse an RSS feed and filter only the titles and publication dates):

    url: https://example.com/blog/xpath-index.rss
    filter:
      - xpath:
          path: '//item/title/text()|//item/pubDate/text()'
          method: xml

    url: http://example.com/blog/css-index.rss
    filter:
      - css:
          selector: 'item > title, item > pubDate'
          method: xml
      - html2text: re

To match an element in an XML namespace <https://www.w3.org/TR/xml-names/>, use a namespace prefix before the tag name. Use a : to separate the namespace prefix and the tag name in an XPath expression, and use a | in a CSS selector.

    url: https://example.net/feed/xpath-namespace.xml
    filter:
      - xpath:
          path: '//item/media:keywords/text()'
          method: xml
          namespaces:
            media: http://search.yahoo.com/mrss/

    url: http://example.org/feed/css-namespace.xml
    filter:
      - css:
          selector: 'item > media|keywords'
          method: xml
          namespaces:
            media: http://search.yahoo.com/mrss/
      - html2text

Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass the namespace entirely.

Another useful option with XPath and CSS filters is exclude. Elements selected by this exclude expression are removed from the final result. For example, the following job will not have any <a> tag in its results:

    url: https://example.org/css-exclude.html
    filter:
      - css:
          selector: body
          exclude: a

LIMITING THE RETURNED ITEMS FROM A CSS SELECTOR OR XPATH
If you only want to return a subset of the items returned by a CSS selector or XPath filter, you can use two additional subfilters:
For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:

    url: https://example.net/css-skip-maxitems.html
    filter:
      - css:
          selector: div.cpu
          skip: 1
          maxitems: 2

Dealing with duplicated results
If you get multiple results on one page, but you only expected one (e.g. because the page contains both a mobile and desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use maxitems: 1 to only return the first item.

FILTERING PDF DOCUMENTS
To monitor the text of a PDF file, you can use the pdf2text filter. It requires the installation of the pdftotext <https://github.com/jalan/pdftotext/blob/master/README.md#pdftotext> library and any of its OS-specific dependencies <https://github.com/jalan/pdftotext/blob/master/README.md#os-dependencies>.

This filter must be the first filter in a chain of filters, since it consumes binary data and outputs text data.

    url: https://example.net/pdf-test.pdf
    filter:
      - pdf2text
      - strip

If the PDF file is password protected, you can specify its password:

    url: https://example.net/pdf-test-password.pdf
    filter:
      - pdf2text:
          password: urlwatchsecret
      - strip

DEALING WITH CSV INPUT
The csv2text filter can be used to turn CSV data into a prettier textual representation. This is done by supplying a format_string, which is a Python format string <https://docs.python.org/3/library/string.html#format-string-syntax>.

If the CSV has a header, the format string should use the header names lowercased. For example, let's say we have a CSV file containing data like this:

    Name;Company
    Smith;Initech
    Doe;Initech

A possible format string for the above CSV (note the lowercase keys):

    Mr {name} works at {company}

If there is no header row, you will need to use the numeric array notation:

    Mr {0} works at {1}

You can force the use of numeric indices with the flag ignore_header.
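
The format-string substitution described above can be sketched in plain Python. This is an illustration only, not urlwatch's csv2text code: the function name csv_to_text, its parameters and the hard-coded ";" delimiter are assumptions made for this example, and ignore_header is modelled here as "skip the header row, then use numeric indices".

```python
import csv
import io


def csv_to_text(data: str, format_string: str, has_header: bool = True,
                ignore_header: bool = False, delimiter: str = ";") -> str:
    """Format every CSV data row with a Python format string (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(data), delimiter=delimiter))
    header = None
    if has_header:
        # Header names are lowercased, so "Name" becomes the key "name".
        header = [name.lower() for name in rows[0]]
        rows = rows[1:]
    lines = []
    for row in rows:
        if header is not None and not ignore_header:
            # Named fields: {name}, {company}
            lines.append(format_string.format(**dict(zip(header, row))))
        else:
            # Numeric fields: {0}, {1}
            lines.append(format_string.format(*row))
    return "\n".join(lines)


DATA = "Name;Company\nSmith;Initech\nDoe;Initech\n"
print(csv_to_text(DATA, "Mr {name} works at {company}"))
print(csv_to_text(DATA, "Mr {0} works at {1}", ignore_header=True))
```

Both calls produce the same two lines of text; only the way the columns are addressed in the format string differs.
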
The key has_header can be used to force whether the first line is treated as a header or not; otherwise csv.Sniffer <https://docs.python.org/3/library/csv.html#csv.Sniffer> will be used.

SORTING OF WEBPAGE CONTENT
Sometimes a web page can have the same data between comparisons, but it appears in random order. If that happens, you can choose to sort before the comparison.

    url: https://example.net/sorting.txt
    filter:
      - sort

The sort filter takes an optional separator parameter that defines the item separator (by default sorting is line-based), for example to sort text paragraphs (text separated by an empty line):

    url: http://example.org/paragraphs.txt
    filter:
      - sort:
          separator: "\n\n"

This can be combined with a boolean reverse option, which is useful for sorting and reversing with the same separator (using % as separator, this would turn 3%2%4%1 into 4%3%2%1):

    url: http://example.org/sort-reverse-percent.txt
    filter:
      - sort:
          separator: '%'
          reverse: true

REVERSING OF LINES OR SEPARATED ITEMS
To reverse the order of items without sorting, the reverse filter can be used. By default it reverses lines:

    url: http://example.com/reverse-lines.txt
    filter:
      - reverse

This behavior can be changed by using an optional separator string argument (e.g. items separated by a pipe (|) symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):

    url: http://example.net/reverse-separator.txt
    filter:
      - reverse: '|'

Alternatively, the filter can be specified more verbosely with a dict.
In this example "\n\n" is used to separate paragraphs (items that are separated by an empty line):

    url: http://example.org/reverse-paragraphs.txt
    filter:
      - reverse:
          separator: "\n\n"

WATCHING GITHUB RELEASES AND GITLAB TAGS
This example shows how to watch the GitHub “releases” page of a given project for the latest release version, to be notified of new releases:

    url: https://github.com/tulir/gomuks/releases
    filter:
      - xpath: '(//div[contains(@class,"d-flex flex-column flex-md-row my-5 flex-justify-center")]//h1//a)[1]'
      - html2text: re
      - strip

This is the corresponding version for GitHub tags:

    url: https://github.com/thp/urlwatch/tags
    filter:
      - xpath: (//div[contains(@class,"commit js-details-container Details")]//h4//a)[1]
      - html2text
      - strip

and for GitLab tags:

    url: https://gitlab.com/chinstrap/gammastep/-/tags
    filter:
      - xpath: (//a[contains(@class,"item-title ref-name")])[1]
      - html2text

Alternatively, jq can be used for filtering:

    url: https://api.github.com/repos/voxpupuli/puppet-rundeck/tags
    filter:
      - jq: '.[0].name'

REMOVE OR REPLACE TEXT USING REGULAR EXPRESSIONS
Just like with Python’s re.sub function, it is possible to apply a regular expression and either remove or replace the matched text. The following example applies the filter 3 times:
All features are described in Python’s re.sub <https://docs.python.org/3/library/re.html#re.sub> documentation (the pattern and repl values are passed to this function as-is, with the value of repl defaulting to the empty string).

    url: https://example.com/regex-substitute.html
    filter:
      - re.sub: '\s*href="[^"]*"'
      - re.sub:
          pattern: '<h1>'
          repl: 'HEADING 1: '
      - re.sub:
          pattern: '</([^>]*)>'
          repl: '<END OF TAG \1>'

If you want to enable certain flags (e.g. re.MULTILINE) in the call, you can insert an "inline flag" into the pattern, as documented under flags in re.compile <https://docs.python.org/3/library/re.html#re.compile>. Here are some examples:
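
As a rough illustration of what inline flags do, the following Python sketch calls re.sub directly, outside of urlwatch. The sample strings and variable names are made up for this example; only re.sub and the inline flags (?i) (re.IGNORECASE) and (?s) (re.DOTALL) come from Python's re module.

```python
import re

# (?i) makes the match case-insensitive: both <BR> and <br/> are replaced.
breaks = re.sub(r"(?i)<br\s*/?>", "\n", "a<BR>b<br/>c")

# (?s) lets "." also match newlines, so a comment spanning lines is removed.
commented = "x<!-- multi\nline -->y"
no_comment = re.sub(r"(?s)<!--.*?-->", "", commented)

# Without (?s), "." stops at the newline and nothing is replaced.
unchanged = re.sub(r"<!--.*?-->", "", commented)

print(repr(breaks))
print(repr(no_comment))
print(repr(unchanged))
```

In a urls.yaml job, the same patterns (including the leading inline flag) would go into the pattern value of a re.sub filter.
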
This allows you, for example, to remove all leading spaces (only space character and tab):

    url: http://example.com/leading-spaces.txt
    filter:
      - re.sub: '(?m)^[ \t]*'

USING A SHELL SCRIPT AS A FILTER
While the built-in filters are powerful for processing markup such as HTML and XML, in some cases you might already know how you would filter your content using a shell command or shell script. The shellpipe filter allows you to start a shell and run custom commands to filter the content.

The text data to be filtered will be written to the standard input (stdin) of the shell process and the filter output will be taken from the shell's standard output (stdout).

For example, if you want to use the grep tool with the case-insensitive matching option (-i) and print only the matching part of the line (-o), you can specify this as a shellpipe filter:

    url: https://example.net/shellpipe-grep.txt
    filter:
      - shellpipe: "grep -i -o 'price: <span>.*</span>'"

This feature also allows you to use sed(1), awk(1) and perl(1) one-liners for text processing (of course, any text tool that works in a shell can be used).
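
The stdin/stdout contract described above can be sketched in Python. This is only an approximation for experimenting with commands outside of urlwatch, not urlwatch's own code: the helper name run_shellpipe and the sample input are hypothetical, and the example assumes a POSIX shell with grep available.

```python
import subprocess


def run_shellpipe(command: str, data: str) -> str:
    """Run `command` in a shell, write `data` to its stdin and return its stdout."""
    result = subprocess.run(
        command,
        shell=True,           # start a shell, as a shell-based filter would
        input=data,           # text to be filtered goes to stdin
        capture_output=True,  # filtered text is read from stdout
        text=True,
        check=True,           # raise if the command exits with a non-zero status
    )
    return result.stdout


# Hypothetical input text; a real job would supply the downloaded content.
page = "Widget\nPRICE: 9.99 today\nGadget\n"
print(run_shellpipe("grep -i -o 'price: [0-9.]*'", page), end="")
```

Note that grep exits with a non-zero status when nothing matches, which this sketch turns into an exception because of check=True.
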
For example, this awk(1) one-liner prepends the line number to each line:

    url: https://example.net/shellpipe-awk-oneliner.txt
    filter:
      - shellpipe: awk '{ print FNR " " $0 }'

You can also use a multi-line command for a more sophisticated shell script (| in YAML denotes the start of a text block):

    url: https://example.org/shellpipe-multiline.txt
    filter:
      - shellpipe: |
          FILENAME=`mktemp`
          # Copy the input to a temporary file, then pipe through awk
          tee $FILENAME | awk '/The numbers for (.*) are:/,/The next draw is on (.*)./'
          # Analyze the input file in some other way
          echo "Input lines: $(wc -l $FILENAME | awk '{ print $1 }')"
          rm -f $FILENAME

Within the shellpipe script, two environment variables will be set for further customization (this can be useful if you have an external shell script file that is used as filter for multiple jobs, but needs to treat each job in a slightly different way):

CONVERTING TEXT IN IMAGES TO PLAINTEXT
The ocr filter uses the Tesseract OCR engine <https://github.com/tesseract-ocr> to convert text in images to plain text. It requires two Python modules to be installed: pytesseract <https://github.com/madmaze/pytesseract> and Pillow <https://python-pillow.org>. Any file format supported by Pillow (PIL) is supported.

This filter must be the first filter in a chain of filters, since it consumes binary data and outputs text data.

    url: https://example.net/ocr-test.png
    filter:
      - ocr:
          timeout: 5
          language: eng
      - strip

The subfilters timeout and language are optional:

FILTERING JSON RESPONSE DATA USING JQ SELECTORS
The jq filter uses the Python bindings for jq <https://stedolan.github.io/jq/>, a lightweight JSON processor. Use of this filter requires the optional jq Python module <https://github.com/mwilliamson/jq.py> to be installed.

    url: https://example.net/jobs.json
    filter:
      - jq:
          query: '.[].title'

The subfilter query is optional:
Queries support aggregations, selections, and built-in operators such as length. For more information on the operations permitted, see the jq Manual <https://stedolan.github.io/jq/manual/>.

FILES
$XDG_CONFIG_HOME/urlwatch/urls.yaml

SEE ALSO
urlwatch(1), urlwatch-intro(5), urlwatch-jobs(5)

COPYRIGHT
2022 Thomas Perl