[Web4lib] link checkers
Kozlowski, Brendon
bkozlowski at sals.edu
Wed Aug 19 13:13:08 EDT 2009
To understand how link checkers work (or, conversely, why they don't), you have to understand a little bit about HTTP response codes. A link checker will typically expect to receive a 404, 500, or 3xx response code when there is an error; a 200 response means everything is OK. Unfortunately, many sites, including some subscription database web services, redirect broken pages to an HTML page that visibly tells the user there is a problem with the page they were trying to view. The unfortunate part is that this page is delivered to the browser with a 200 response code, so the link checker doesn't think anything is wrong.
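As a rough illustration of that logic, here's a minimal Python sketch (using the third-party requests library; the URL is just a placeholder, not a real database link). It shows why a "soft 404", an error page delivered with a 200, sails right past a checker:

    import requests

    def check_link(url):
        """Classify a URL the way a simple link checker would: by status code only."""
        try:
            # allow_redirects=False so 3xx responses stay visible to the checker
            resp = requests.get(url, timeout=10, allow_redirects=False)
        except requests.RequestException as exc:
            return ("unreachable", str(exc))
        if resp.status_code in (404, 500) or 300 <= resp.status_code < 400:
            return ("broken or redirected", resp.status_code)
        # A 200 counts as OK, even if the body is really an error page ("soft 404")
        return ("ok", resp.status_code)

    print(check_link("http://example.com/some-database-record"))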
Some other services block or deny spiders and robots from visiting their pages entirely. The link checker will (or might) report those as broken links because it was unable to access them, even though visiting the same pages yourself in a browser works just fine.
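One quick way to confirm that a "broken" link is really just a blocked robot (a sketch only; the User-Agent strings and URL are illustrative) is to request the same page with a crawler-style User-Agent and a browser-style one and compare the status codes:

    import requests

    url = "http://example.com/subscription-database"
    for agent in ("Xenu Link Sleuth", "Mozilla/5.0 (Windows NT 10.0)"):
        resp = requests.head(url, headers={"User-Agent": agent}, timeout=10)
        # A 403 for the crawler but a 200 for the "browser" points to robot blocking
        print(agent, "->", resp.status_code)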
As for your webmaster having to spend over 2 hours for the spider to run, and then not being able to generate a report: be very careful with any pages that create recursive or endless links, such as a calendar (which would let a crawler visit dates up until the year 9999 AD, and every single day in between). WordPress is also very redundant. Xenu's Link Sleuth can exclude certain URL paths from the crawl, and you can check those links manually rather than waiting hours for the crawl to finish.
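If you roll your own exclusion list, the idea is just to skip any URL matching a known-bad pattern before it gets queued. A minimal sketch (the patterns below are examples for a calendar and WordPress, not your site's actual paths):

    import re

    EXCLUDE_PATTERNS = [
        re.compile(r"/calendar/"),       # endless next-day / next-month links
        re.compile(r"[?&]replytocom="),  # redundant WordPress comment-reply URLs
    ]

    def should_crawl(url):
        """Return False for URLs that would trap or bloat the crawl."""
        return not any(p.search(url) for p in EXCLUDE_PATTERNS)

    print(should_crawl("http://example.com/calendar/9999-12-31"))  # False
    print(should_crawl("http://example.com/about-us"))             # True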
I maintain a rather small website with approximately 450 public-facing pages. It takes Xenu about 7 minutes to run its crawl and generate a report (Intel P4 at 200 MHz, 3 GB RAM, high-speed cable connection), and I haven't toyed with the thread count to get the best possible efficiency out of it, as I'm happy with its current speed.
Now, I realize you asked about a different tool, and since Xenu has worked (almost) marvelously for me I haven't tried any other applications, so I can't offer much further advice. In the meantime, if you have the patience, perhaps some of these thoughts will help you get better results from it while you look for a product better suited to your needs. (Also, don't let it follow links beyond your own domain. It might still do so once in a while, but you don't need to index the entire Internet.)
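Keeping the crawl inside your own domain amounts to a hostname check before following a page's links (the hostname here is a placeholder): external links get verified once, but only your own pages are crawled further.

    from urllib.parse import urlparse

    HOME_HOST = "www.example.org"

    def follow_links_from(url):
        """Only pages on our own host should have their links extracted and queued."""
        return urlparse(url).hostname == HOME_HOST

    print(follow_links_from("http://www.example.org/services"))  # True
    print(follow_links_from("http://othersite.example/page"))    # False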
The only issue I have with Xenu is that I can no longer verify links on our staff back-end website, since the authentication is done server-side, no cookie is stored, and I don't want to create a rule that temporarily allows a backdoor for the Xenu bot.
Good luck on your search.
Brendon Kozlowski
Web Administrator
Saratoga Springs Public Library
49 Henry Street
Saratoga Springs, NY 12866
(518) 584-7860 x217