[Web4lib] Link Checking Services

Micah Stevens micah at raincross-tech.com
Tue Jan 30 13:09:25 EST 2007


Any link checker should be checking the HTTP headers returned by the 
server. In this case, if your server is not returning proper response 
codes, the problem is not with the link checker; the server is broken. 
Fix the server, and you'll solve this problem.

Granted, depending on the situation, this may not be easy. If there's 
any consistency to the issues that arise, you can use some tools to scan 
the site for specific word groupings. If the site is public and has been 
indexed by Google, you can even use Google to do this via the site: 
operator, although I'm not sure I would trust the results to be 
comprehensive.
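A rough sketch of the word-grouping scan in Python. The phrase list here 
is just an example; you'd tune it to whatever error text your own pages 
actually use:

```python
import re

# Example phrases that signal a "soft 404" -- a page that returns
# a 200 status but is really an error page. Adjust for your site.
SOFT_404_PHRASES = [
    r"page\s+not\s+found",
    r"has\s+moved",
    r"domain\s+(name\s+)?is\s+for\s+sale",
]

def looks_like_soft_404(html):
    """Return True if the page body contains any telltale phrase."""
    text = html.lower()
    return any(re.search(p, text) for p in SOFT_404_PHRASES)
```

You'd run each fetched page body through this and flag the hits for a 
human to review.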

I'm not familiar enough with link scanners to say for sure, but it makes 
sense that some would have this feature. I usually do this task by hand, 
since I have greater control that way. Well, not literally by hand in a 
browser; I use page- and header-parsing tools that walk through the site 
automatically.
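The core of such a walk-through tool is just extracting links from each 
page and resolving them against the page's URL. A minimal sketch with 
the Python standard library (the crawl loop itself is left as a comment, 
since the details depend on your site):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags on one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

# To walk a site: fetch each page (e.g. with urllib.request.urlopen),
# record the HTTP status code, feed the body to LinkExtractor, and
# queue any same-host links you haven't visited yet.
```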

Another method is to look at the server logs. If there are generic "Page 
not Found" pages in the system, the logs will tell you which requests 
landed on those pages and which page referred to them, and you can get 
the broken link from the referrer that way.
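For example, with an Apache-style "combined" log you can pull out every 
404 along with the referring page. The regex below assumes the standard 
combined field layout, so check it against your own log format:

```python
import re

# Matches the request path, status code, and referrer in an
# NCSA/Apache "combined" log line. Field layout varies by server.
LOG_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)"'
)

def broken_links(lines):
    """Yield (broken_url, referring_page) pairs for 404 responses."""
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("status") == "404" and m.group("referrer") != "-":
            yield m.group("path"), m.group("referrer")
```

Each referring page it reports is one of your own pages carrying a 
broken link.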

-Micah

On 01/30/2007 08:33 AM, Thomas Dowling wrote:
>> -----Original Message-----
>> From: web4lib-bounces at webjunction.org
>> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Valerie Reid
>> Sent: Tuesday, January 30, 2007 8:24 AM
>> To: web4lib at webjunction.org
>> Subject: [Web4lib] Link Checking Services
>>
>> We currently subscribe to a link-checking service which checks all the
>> links
>> on our library's web site and sends me a report on a weekly basis...
>>     
>
>
> Without knowing any of the products that have been mentioned so far, I
> will point out one thing to look for.  When I've played with link
> checkers in the past, they were all pretty reliable about finding
> "bad" links - that is, links that when requested return something other
> than a 200 ("OK") status code.  Unfortunately, there are a lot of
> sloppy/clueless/too-clever-by-half webmasters out there, and I have
> not seen link checkers that cope well with these situations.
>
>   - Getting a 200 status page that actually says either "Page Not
>     Found" or "Page has moved <a href='foo.html'>here</a>".  With that
>     200 status, this won't be reported as a broken link.
>
>   - Getting a 200 from a page that uses Javascript to redirect you to
>     an updated URL.  Again, this won't show up in the report, and the
>     real URL may not be checked.
>
>   - Getting a 200 page that says "This domain name is for sale - want
>     some herbal supplements?"
>
>   - Getting a 302 (temporary redirect) to a page that says "Page Not
>     Found".
>
>   - Getting a 302 to a page that says "You must enable cookies to
>     appreciate our wonderful site"
>
>   - Getting a 302 from http://site/real.page to
>     http://site/real.page?session=temporary_id.
>
>   - Getting a 500 ("server error"), 401 ("unauthorized") or 403
>     ("forbidden") because some misguided bit of browser sniffing balks
>     at talking to your link checker.
>
>
> Is there anything out there that addresses any of these issues in any
> way?
>
>   

