[WEB4LIB] Re: Controlling Bloglines crawler?
Michael McDonnell
michael at winterstorm.ca
Wed Nov 10 17:44:24 EST 2004
Thomas Dowling wrote:
> Thanks to the people who responded with hints (more polite than "RTF
> Standard!") about the <ttl> and <skipHours> elements in RSS. They do
> exactly what I want (respectively, tell an RSS reader not to pull
> updates more often than X minutes and tell a reader not to pull feeds
> during certain hours of the day).
>
> Unfortunately, Bloglines disregards both elements; on inspection, I see
> my copy of RSSReader does also. I'll experiment with the 304 HTTP status.
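For reference, both elements live inside <channel> in an RSS 2.0 feed; a
rough sketch with made-up values would look something like:

<channel>
  <title>Example Feed</title>
  <!-- ask readers to wait at least 60 minutes between fetches -->
  <ttl>60</ttl>
  <!-- ask readers not to fetch during these hours (GMT) -->
  <skipHours>
    <hour>1</hour>
    <hour>2</hour>
    <hour>3</hour>
  </skipHours>
  ...
</channel>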
To implement the 304 trick, or something similar, it might be useful to
use Apache's URL rewriting rules. A simple way would be like this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*some_string_in_the_rss_bots_agent_name.*$ [NC]
RewriteRule /my/file.rss /var/www/rssbotbuster.asis [L]
The file /var/www/rssbotbuster.asis would have a single line (you would
also need to enable mod_asis and the ".asis" content handler):
Status: 304 Botbuster
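Enabling the handler is just a couple of directives in httpd.conf (the
module path may differ on your build):

LoadModule asis_module modules/mod_asis.so
AddHandler send-as-is asis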
Now, you could also use the rewrite rules so that the bot can only get
in at certain times. For example, you could add:
RewriteEngine On
RewriteCond %{TIME_HOUR}%{TIME_MIN} >0100
RewriteCond %{TIME_HOUR}%{TIME_MIN} <0600
RewriteCond %{HTTP_USER_AGENT} ^.*some_string_in_the_rss_bots_agent_name.*$ [NC]
RewriteRule /my/file.rss /var/www/rssbotbuster.asis [L]
Or instead of using an 'asis' file you could redirect to a CGI or PHP or
JSP page that does the same and perhaps tarpits the bot by introducing
a sleep cycle.
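A rough sketch of such a tarpit as a plain CGI script (Python here just
for illustration; the file name and the 30 second delay are arbitrary):

#!/usr/bin/env python
# rssbotbuster.cgi -- stall the bot, then tell it nothing has changed
import sys, time

time.sleep(30)    # tarpit: hold the bot's connection open for a while
sys.stdout.write("Status: 304 Not Modified\r\n\r\n")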
Another approach would be to redirect all requests for RSS files to a CGI
that checks for the offending user agent. If it finds the bad bot, it
tarpits it; if not, it serves the correct file (or redirects to an
alternative URL for the same content). This might be done with Apache's
"Action" directive, or with simple redirects and aliases, or with rewrite
rules.
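A sketch of that combined CGI, again in Python with made-up names (the
feed path and the user agent substring are placeholders):

#!/usr/bin/env python
# rssgate.cgi -- serve the feed normally, but tarpit the offending bot
import os, sys, time

FEED_FILE = "/var/www/my/file.rss"   # hypothetical path to the real feed
BAD_AGENT = "some_string_in_the_rss_bots_agent_name"

agent = os.environ.get("HTTP_USER_AGENT", "").lower()

if BAD_AGENT in agent:
    # offending bot: stall it, then claim nothing has changed
    time.sleep(30)
    sys.stdout.write("Status: 304 Not Modified\r\n\r\n")
else:
    # everyone else gets the real feed
    sys.stdout.write("Content-Type: application/rss+xml\r\n\r\n")
    sys.stdout.write(open(FEED_FILE).read())

Something like "ScriptAlias /my/file.rss /var/www/cgi-bin/rssgate.cgi"
would be one way to hang it off the feed's URL.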
--
Michael McDonnell, GCIA
Winterstorm Solutions, Inc.
michael at winterstorm.ca