Homebrew Link Checker

Thomas Dowling tdowling at ohiolink.edu
Mon Sep 16 17:33:01 EDT 1996


I thought I posted this last week, but I never saw it come back from the
list, and it isn't in the archive.  Oh, well.

I got started on this when I looked at some of the link checkers suggested
here; some of them are commercial products or shareware, and it just seemed
that people were trying to make a couple of bucks doing something I already
had the tools to do.  A little playing around came up with the following.

This is absolutely not guaranteed to do anything--I normally don't show my
perl scripts to the outside world, as I'm sure they drive Real Programmers
into giggle fits--but it seems to work in our environment.  It only wants a
Unix command line, Perl, and an up-to-date version of Lynx (it uses "lynx
-head" which I don't remember being in earlier versions).  It can check a
local file ("linkcheck.pl [filename]") or, with a little help, a given URL
("lynx -source [URL] | linkcheck.pl").  If you're a Perl ace, feel free to
tinker.
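
For anyone who does tinker: the script's own header comments admit that it
only sees one link per line.  One possible way around that is sketched
here, assuming Perl 5 and double-quoted href attributes.  The name
extract_links is made up for illustration and is not part of the script
below.

```perl
# Sketch only: collect every double-quoted http link on a line.
# extract_links is a hypothetical helper, not part of linkcheck.pl.
sub extract_links {
    my ($line) = @_;
    my @links;
    # /g walks the whole line; /i also accepts HREF= in upper case.
    while ($line =~ /href\s*=\s*"(http:[^"]*)"/gi) {
        push(@links, $1);
    }
    return @links;
}

# Example: two links on one line of HTML.
my $html = '<A HREF="http://a.example/">A</A> <a href="http://b.example/x.html">B</a>';
foreach my $link (extract_links($html)) {
    print "$link\n";
}
```

Each URL returned could then be handed to lynx in turn, the same way the
script handles its single link.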

If you've never made a Unix script, save this message to a file, delete
anything before the line that reads "#!/usr/local/bin/perl" and anything
that may have gotten appended after the final line of "#" characters.  FTP
or otherwise get this to your Unix machine with the filename "linkcheck.pl"
and give the command "chmod +x linkcheck.pl".  Presto change-o, you've got
a script.

Thomas Dowling                    \ tdowling at ohiolink.edu
Asst. Director, Client/Server Apps.\    614/728-3600 x326
OhioLINK                            \    FAX 614/728-3610

//////Delete everything up to and including this line//////
#!/usr/local/bin/perl

####################################################################
#linkcheck.pl - a very simple http link checker
#Thomas Dowling, tdowling at ohiolink.edu, 9/11/96
#No guarantees, no promises, if you're caught the General
#and I will disavow all knowledge of your mission.

#usage: linkcheck.pl [filename]  e.g.
#       linkcheck.pl mongo-list-o-links.html
#       linkcheck.pl *.htm
#       lynx -source [URL] | linkcheck.pl

#This script pulls http URLs out of documents, uses
#lynx to pull an HTTP HEAD command across the net,
#fishes out the three digit status code, and provides
#some arguably comprehensible comments.

#Apologies to people with more than one link in a line of
#text, as this will only see the first one.  Also, if the server
#doesn't return a valid HEAD, the script keeps quiet about it.
#Whadda want for free?

#This script may be redistributed or edited for
#non-commercial use only.

#For information on HTTP status codes, see
#<URL:http://www.ics.uci.edu/pub/ietf/http/rfc1945.html#Status-Codes>
####################################################################

#This is the command for lynx to grab the HTTP head and
#dump out the results.  If lynx is in a different directory,
#you will need to edit this line.
$lynx_cmd = "/usr/local/bin/lynx -dump -head";

while (<>) {

  if (/href\s*=\s*\"(http:[^\"]*)\"/i) {  #Find lines with links (any case)
    $link = $1;                       #Keep only the URL between the quotes

                                      #Run the link through lynx and grab
                                      #the line with the HTTP code.  The
                                      #single quotes keep the shell from
                                      #tripping over & and ? in the URL.
    open(LYNX, "$lynx_cmd '$link' 2>/dev/null | grep \"^HTTP\" |") ;
      while (<LYNX>) {
        if (/HTTP.* (2\d\d) .*/) {
          print "     $link seems to work correctly (Code $1).\n";
        } elsif (/HTTP.* (3\d\d) .*/) {
          print "---> $link is being redirected (Code $1):\n";
          print "        Check to see if it has moved.\n";
        } elsif (/HTTP.* 403 .*/) {
          print "---> $link is being\n";
          print "        forbidden to this computer (Code 403):\n";
          print "        Check to see if access has been restricted.\n";
        } elsif (/HTTP.* 404 .*/) {
          print "---> $link was not found (Code 404):\n";
          print "        Check to make sure this document still exists.\n";
        } elsif (/HTTP.* (\d\d\d) .*/) {
          print "---> $link is returning Code $1:\n        Please check\n";
        } else {
          print "---> There may be a problem with $link\n";
          print "        Please check\n";
        }
      }
    close(LYNX);                      #Release the pipe before the next link
  }

}
####################################################################
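
The status-code tests in the loop above can also be tried out without
touching the network.  A rough sketch, assuming Perl 5; classify_status is
a hypothetical name, and the verdicts are abbreviated from the script's
messages.  (Like anything else after the final line of "#" characters,
this is not part of the script.)

```perl
# Sketch only: map one HTTP status line to a short verdict.
# classify_status is a hypothetical helper, not part of linkcheck.pl.
sub classify_status {
    my ($status_line) = @_;
    # Pull out the three-digit code, e.g. "HTTP/1.0 404 Not Found" -> 404.
    return "unknown" unless $status_line =~ /^HTTP\S*\s+(\d\d\d)\b/;
    my $code = $1;
    return "ok"         if $code =~ /^2/;
    return "redirected" if $code =~ /^3/;
    return "forbidden"  if $code == 403;
    return "not found"  if $code == 404;
    return "check ($code)";
}

# Try it on a few canned status lines.
foreach my $line ("HTTP/1.0 200 OK", "HTTP/1.0 302 Found",
                  "HTTP/1.0 404 Not Found", "HTTP/1.0 500 Server Error") {
    print "$line => ", classify_status($line), "\n";
}
```

Keeping the classification in one sub makes it easy to add codes later
without touching the lynx plumbing.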

