Awasu WebScrape2 plugin

This plugin lets you scrape web pages and monitor the results as a channel in your Awasu.

This can be used for web sites that don't offer an RSS feed. You just specify what parts of the web page you're interested in, and Awasu will regularly check the page, extract the information you want, and synthesize a feed that you can monitor in your Awasu.

Drop us a line if you're interested in finding out more.

Creating a new channel

Additional help is available here.

Configuring the channel

The plugin accepts the following parameters:

Config file: The file that contains the instructions for how to scrape the web page. How to set up this file is explained below.
Download URL: If the scrape config can be used for multiple web pages, enter the URL of the page you want to scrape here.
Download user name: Set this if the web page requires authentication.
Download password: Set this if the web page requires authentication.
Script timeout: Maximum amount of time to allow this script to run (seconds).

Writing a scrape configuration

A web scrape is configured in a separate INI file. A blank template can be found here.

The first thing to consider is the URL of the web page to be downloaded. If the scrape config will only ever be used for a single web page, this can be set in the config file:

Download URL = http://...

However, if the scrape config will be used for multiple web pages, the user must configure the URL in the Channel Wizard each time.

Extracting feed-level content

The following feed-level elements can be extracted from a web page:

These all work in the same way:

Identifying the HTML to extract feed items from

For extracting feed items, it is possible to first isolate a section of the HTML to work with, using a regular expression, e.g.

Snip pattern = <body>.*</body>

If no snip pattern is configured, the entire web page will be used.

Note that the snip pattern only applies when extracting item-level values; the entire web page is always used when extracting feed-level values.
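The effect of the snip pattern can be sketched with Python's re module (an illustration of the behaviour described above, not the plugin's actual code):

```python
import re

# A minimal page; in practice this would be the downloaded HTML.
html = """<html><head><title>Site name</title></head>
<body><div class="item">First</div></body></html>"""

# Apply the snip pattern; re.DOTALL lets "." match newlines,
# which a multi-line page requires.
match = re.search(r"<body>.*</body>", html, re.DOTALL)

# If no snip pattern matches, the entire page is used instead.
section = match.group(0) if match else html
```

Everything outside the matched section (here, the page header) is ignored when extracting item-level values.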

You can then specify a regular expression that isolates each block of HTML that corresponds to a single feed item, e.g.

Item pattern = <div class="item">.*?</div>

If you want to extract multiple feed items, you must configure an item pattern; otherwise the plugin will only scan the HTML once and so will only ever generate a single feed item.
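Each non-overlapping match of the item pattern becomes one feed item. In Python terms (a sketch of the behaviour, not the plugin's code):

```python
import re

# The (possibly snipped) section of the page being scraped.
section = ('<div class="item">First</div>'
           '<div class="item">Second</div>')

# Every match of the item pattern yields one feed item;
# the lazy ".*?" stops each match at the nearest </div>.
items = re.findall(r'<div class="item">.*?</div>', section, re.DOTALL)
```

Here two fragments are found, so two feed items would be generated.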

Once you have isolated the HTML for each feed item, you can then extract the following item-level values from that HTML fragment:

These work in the same way as the feed-level values described above.

Special placeholders for item IDs

It is possible to generate unique item IDs, based on the item content. Item ID templates recognize the following special placeholders:

If these appear in an item ID template, they will be replaced with the MD5 of the item's title/description/link. This will cause a new item ID to be generated if these values change, thus causing Awasu to treat the item as a new one.
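As a sketch of the idea, hashing the item's title with MD5 yields an ID that changes exactly when the title changes (the exact placeholder names are those listed above; this is an illustration of the mechanism, not the plugin's code):

```python
import hashlib

title = "Example item title"

# Hash the title; the 32-character hex digest changes
# whenever the title text changes, so Awasu sees a new item.
item_id = hashlib.md5(title.encode("utf-8")).hexdigest()
```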

Downloading and scraping linked-to URLs

This plugin offers a special feature whereby instead of generating descriptions from the main HTML page, it can follow the linked-to URL and scrape that page instead. This is very useful for things like forums and other web pages that consist mainly of links to other pages that have the content of interest. The item titles and links can be generated from the main HTML page, then each link followed and the item descriptions generated from each of those pages.

To enable this feature, add the following to the scrape configuration:

Get item description from link = 1

Other scrape options

Max. items: Limits the number of feed items extracted from the web page.
Force HTML charset: Forces the HTML to be interpreted using the specified character set.
Translate relative links: Causes relative links to be converted to absolute links, relative to the URL of the web page being scraped.
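The last option corresponds to standard URL resolution; for instance, in Python (the page URL here is hypothetical):

```python
from urllib.parse import urljoin

page_url = "http://example.com/news/index.html"  # hypothetical page URL

# A relative link scraped from the page is resolved
# against the URL of the page it came from.
link = urljoin(page_url, "item1.html")
```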

An example

As an example, we will write a scrape configuration to generate a feed from this page.

Extracting the feed title

First, we want to extract the contents of the <h1> header, and use that as the feed title:

Feed title pattern = <h1>(.*?)</h1>

This regex looks for the opening <h1>, the closing </h1>, then grabs everything between the two.

We use .*? instead of .* to make sure the regex match stops at the first </h1>. In this particular case, it's not important since there's only one <h1> heading, but if there were another one, it would be important to make sure that the regex did not greedily match too much.
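The difference is easy to see in Python with a page that has two <h1> headings:

```python
import re

html = "<h1>First</h1><p>text</p><h1>Second</h1>"

# Lazy: stop at the first closing tag.
lazy = re.search(r"<h1>(.*?)</h1>", html).group(1)

# Greedy: run on to the *last* closing tag.
greedy = re.search(r"<h1>(.*)</h1>", html).group(1)
```

The lazy pattern captures just the first heading; the greedy one swallows everything up to the last </h1>.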

Isolating the feed items

While it's not necessary for such a small page, we define a snip pattern that isolates the part of the web page we want to work with (the HTML table):

Snip pattern = <table(.*?)</table>

We then define an item pattern, that locates the HTML corresponding to each feed item we want to generate (each table row):

Item pattern = <tr>(.*?)</tr>

Extracting each feed item

The plugin now iterates over each HTML fragment matched by the item pattern and extracts a feed item from each one. The titles and descriptions are extracted in the normal way:

Item title pattern = <td class="title">(.*?)</td>
Item description pattern = <td class="description">(.*?)</td>
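Applied to one table row, the two patterns pick out the title and description like this (a Python sketch of the behaviour, with made-up row content):

```python
import re

# One HTML fragment matched by the item pattern (made-up content).
row = ('<tr><td class="title">An item</td>'
       '<td class="description">What it is about</td></tr>')

title = re.search(r'<td class="title">(.*?)</td>', row).group(1)
description = re.search(r'<td class="description">(.*?)</td>', row).group(1)
```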

The item link is more interesting. A regex is defined to extract values from the HTML:

Item link pattern = <td class="url">\s*(.*?)\s*</td>

A template is then used to generate the item link, using the extracted values:

Item link template = {1}

In other words, the item link pattern is used to extract values from the HTML (item1.html, item2.html, etc.), and these values are then inserted into the item link template to generate the feed item links.
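Pulling the pieces of this example together, the complete scrape configuration reads (assembled from the settings shown above; the Download URL is whatever page is being scraped, and any INI section headers the template file uses are omitted here):

```
Download URL = http://...
Feed title pattern = <h1>(.*?)</h1>
Snip pattern = <table(.*?)</table>
Item pattern = <tr>(.*?)</tr>
Item title pattern = <td class="title">(.*?)</td>
Item description pattern = <td class="description">(.*?)</td>
Item link pattern = <td class="url">\s*(.*?)\s*</td>
Item link template = {1}
```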

Debugging your scrape configurations

Writing regexes can be tricky, so to help you write yours, you can turn on logging for the plugin.

Create a file called WebScrape2.ini in the same directory as WebScrape2.exe that looks like this:

Log file = ...
Format snips = 1

The plugin will now log what it's doing in the log file you specified.