Homebrew Link Checker
Thomas Dowling
tdowling at ohiolink.edu
Mon Sep 16 17:33:01 EDT 1996
I thought I posted this last week, but I never saw it come back from the
list, and it isn't in the archive. Oh, well.
I got started on this when I looked at some of the link checkers suggested
here; some of them are commercial products or shareware, and it just seemed
that people were trying to make a couple of bucks doing something I already
had the tools to do. A little playing around came up with the following.
This is absolutely not guaranteed to do anything--I normally don't show my
Perl scripts to the outside world, as I'm sure they drive Real Programmers
into giggle fits--but it seems to work in our environment. It only wants a
Unix command line, Perl, and an up-to-date version of Lynx (it uses "lynx
-head", which I don't remember being in earlier versions). It can check a
local file ("linkcheck.pl [filename]") or, with a little help, a given URL
("lynx -source [URL] | linkcheck.pl"). If you're a Perl ace, feel free to
tinker.
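For example, a run against a local file looks something like this (the
filename and URLs here are made up; the messages are the ones the script
below prints):

% linkcheck.pl mypage.html
 http://www.example.edu/staff.html seems to work correctly (Code 200).
---> http://www.example.edu/oldnews.html was not found (Code 404):
 Check to make sure this document still exists.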
If you've never made a Unix script, save this message to a file, delete
anything before the line that reads "#!/usr/local/bin/perl" and anything
that may have gotten appended after the final line of "#" characters. FTP
or otherwise get this to your Unix machine with the filename "linkcheck.pl"
and give the command "chmod +x linkcheck.pl". Presto change-o, you've got
a script.
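In other words, once the file is on your Unix machine, the whole setup
boils down to something like this (assuming perl really does live at
/usr/local/bin/perl; if it doesn't, edit the first line of the script or
just run it as "perl linkcheck.pl mypage.html"):

chmod +x linkcheck.pl
./linkcheck.pl mypage.html

(The "./" is only needed if the current directory isn't in your path.)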
Thomas Dowling \ tdowling at ohiolink.edu
Asst. Director, Client/Server Apps.\ 614/728-3600 x326
OhioLINK \ FAX 614/728-3610
//////Delete everything up to and including this line//////
#!/usr/local/bin/perl
####################################################################
#linkcheck.pl - a very simple http link checker
#Thomas Dowling, tdowling at ohiolink.edu, 9/11/96
#No guarantees, no promises; if you're caught, the General
#and I will disavow all knowledge of your mission.
#usage: linkcheck.pl [filename] e.g.
# linkcheck.pl mongo-list-o-links.html
# linkcheck.pl *.htm
# lynx -source [URL] | linkcheck.pl
#This script pulls http URLs out of documents, uses
#lynx to pull an HTTP HEAD command across the net,
#fishes out the three digit status code, and provides
#some arguably comprehensible comments.
#Apologies to people with more than one link in a line of
#text, as this will only see one of them (a possible
#workaround is sketched in a P.S. after the script). Also,
#if the server doesn't return a valid HEAD, the script
#keeps quiet about it.
#Whadda want for free?
#This script may be redistributed or edited for
#non-commercial use only.
#For information on HTTP status codes, see
#<URL:http://www.ics.uci.edu/pub/ietf/http/rfc1945.html#Status-Codes>
####################################################################
#This is the command for lynx to grab the HTTP head and
#dump out the results. If lynx is in a different directory,
#you will need to edit this line.
$lynx_cmd = "/usr/local/bin/lynx -dump -head";
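#For reference, the first line lynx hands back for a working
#link usually looks something like "HTTP/1.0 200 OK" (the
#exact wording varies from server to server); the grep in the
#open() below keeps just that HTTP status line, and the
#regexes that follow pick the three-digit code out of it.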
while (<>) {
    if (/href=\"http:/) {              #Find lines with links
        $link = $_;
        chop($link);
        $link =~ s/.*(http.*)\".*/$1/; #Trim the line down to just the link
        #Run the link through lynx and grab
        #the line with the HTTP code
        open(LYNX, "$lynx_cmd $link 2>/dev/null | grep \"^HTTP\" |");
        while (<LYNX>) {
            if (/HTTP.* (2\d\d) .*/) {
                print " $link seems to work correctly (Code $1).\n";
            } elsif (/HTTP.* (3\d\d) .*/) {
                print "---> $link is being redirected (Code $1):\n";
                print " Check to see if it has moved.\n";
            } elsif (/HTTP.* 403 .*/) {
                print "---> $link is being \n";
                print " forbidden to this computer (Code 403):\n";
                print " Check to see if access has been restricted.\n";
            } elsif (/HTTP.* 404 .*/) {
                print "---> $link was not found (Code 404):\n";
                print " Check to make sure this document still exists.\n";
            } elsif (/HTTP.* (\d\d\d) .*/) {
                print "---> $link is returning Code $1:\n Please check\n";
            } else {
                print "---> There may be a problem with $link\n";
                print " Please check\n";
            }
        }
        close(LYNX);  #Close the lynx pipe before moving on to the next link
    }
}
####################################################################
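P.S. If your pages tend to pack more than one link onto a physical line,
one possible variation (just a sketch, not something I've tested against
the world) is to collect every http URL on the line with a list-context
match and then run each one through the same lynx check the script applies
to $link, e.g.:

@links = /href=\"(http:[^\"]*)\"/g;
foreach $link (@links) {
    #...same open(LYNX)/while(<LYNX>) checking as above...
}

in place of the single substitution on $link.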