[WEB4LIB] Re: Controlling Bloglines crawler?
Michael McDonnell
michael at winterstorm.ca
Wed Nov 10 17:44:24 EST 2004
Thomas Dowling wrote:
> Thanks to the people who responded with hints (more polite than "RTF
> Standard!") about the <ttl> and <skipHours> elements in RSS. They do
> exactly what I want (respectively, tell an RSS reader not to pull
> updates more often than X minutes and tell a reader not to pull feeds
> during certain hours of the day).
>
> Unfortunately, Bloglines disregards both elements; on inspection, I see
> my copy of RSSReader does also. I'll experiment with the 304 HTTP status.
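For reference, both elements live inside <channel> in an RSS 2.0 feed; a
rough sketch with made-up values would look something like:

<channel>
  <title>Example Feed</title>
  <!-- ask readers to wait at least 60 minutes between fetches -->
  <ttl>60</ttl>
  <!-- ask readers not to fetch during these hours (GMT) -->
  <skipHours>
    <hour>1</hour>
    <hour>2</hour>
    <hour>3</hour>
  </skipHours>
  ...
</channel>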
To implement the 304 trick, or something similar, it might be useful to
use Apache's URL rewriting rules. A simple way would be like this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*some_string_in_the_rss_bots_agent_name.*$ [NC]
RewriteRule /my/file.rss /var/www/rssbotbuster.asis [L]
The file /var/www/rssbotbuster.asis would have a single line (you would
also need to enable mod_asis and the ".asis" content handler):
Status: 304 Botbuster
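Enabling the handler is just a couple of directives in httpd.conf (the
module path may differ on your build):

LoadModule asis_module modules/mod_asis.so
AddHandler send-as-is asis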
Now, you could also use the rewrite rules so that the bot can only get
in at certain times. For example, you could add:
RewriteEngine On
RewriteCond %{TIME_HOUR}%{TIME_MIN} >0100
RewriteCond %{TIME_HOUR}%{TIME_MIN} <0600
RewriteCond %{HTTP_USER_AGENT} ^.*some_string_in_the_rss_bots_agent_name.*$ [NC]
RewriteRule /my/file.rss /var/www/rssbotbuster.asis [L]
Or instead of using an 'asis' file you could redirect to a CGI or PHP or
JSP page that does the same and perhaps tarpits the bot by introducing
a sleep cycle.
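A rough sketch of such a tarpit as a plain CGI script (Python here just
for illustration; the file name and the 30 second delay are arbitrary):

#!/usr/bin/env python
# rssbotbuster.cgi -- stall the bot, then tell it nothing has changed
import sys, time

time.sleep(30)    # tarpit: hold the bot's connection open for a while
sys.stdout.write("Status: 304 Not Modified\r\n\r\n")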
Another approach would be to redirect all requests for RSS files to a CGI
that checks for the offending user agent. If it finds the bad bot, it
tarpits it; if not, it serves the correct file (or redirects to an
alternative URL for the same content). This might be done with Apache's
"Action" directive, or with simple redirects and aliases, or with rewrite
rules.
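A sketch of that combined CGI, again in Python with made-up names (the
feed path and the user agent substring are placeholders):

#!/usr/bin/env python
# rssgate.cgi -- serve the feed normally, but tarpit the offending bot
import os, sys, time

FEED_FILE = "/var/www/my/file.rss"   # hypothetical path to the real feed
BAD_AGENT = "some_string_in_the_rss_bots_agent_name"

agent = os.environ.get("HTTP_USER_AGENT", "").lower()

if BAD_AGENT in agent:
    # offending bot: stall it, then claim nothing has changed
    time.sleep(30)
    sys.stdout.write("Status: 304 Not Modified\r\n\r\n")
else:
    # everyone else gets the real feed
    sys.stdout.write("Content-Type: application/rss+xml\r\n\r\n")
    sys.stdout.write(open(FEED_FILE).read())

Something like "ScriptAlias /my/file.rss /var/www/cgi-bin/rssgate.cgi"
would be one way to hang it off the feed's URL.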
--
Michael McDonnell, GCIA
Winterstorm Solutions, Inc.
michael at winterstorm.ca