webliographies (for hardcopy or research) -- http://astl-10.stanford.edu/eta.cgi

James Salsman jps at astl-10.stanford.edu
Thu Jan 2 03:26:57 EST 1997


Have you ever wanted to print a Web page, but your browser
couldn't make proper footnotes with the URLs?  Well, do not despair;
just pop the URL of the page you want to print into
http://astl-10.stanford.edu/eta.cgi and print the resulting page, which
will show all the <A>nchors, along with their associated URLs
(recursively, if you click further).
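
For anyone who wants to see the core idea in isolation, here is a minimal sketch of pulling the anchors out of a page and resolving each one against the page's own URL. It is not the attached script: it assumes the HTML::LinkExtor and URI modules are installed (the attached source uses HTML::Parse and URI::URL instead), and the sub name anchor_urls and the example.org addresses are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # link extractor from the HTML-Parser distribution
use URI;               # URL handling, used here to resolve relative links

# Hypothetical helper: return the absolute URL of every <A HREF=...>
# found in $html, resolved against the page address $base.
sub anchor_urls {
    my ($html, $base) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Skip images, frames and so on; keep only anchors with an HREF.
        return unless $tag eq 'a' && defined $attr{href};
        push @urls, URI->new_abs($attr{href}, $base)->as_string;
    });
    $parser->parse($html);
    $parser->eof;
    return @urls;
}

my $page = '<p><a href="faq.html">FAQ</a> and <a href="/about/">About</a>'
         . ' <img src="x.gif"></p>';
print "$_\n" for anchor_urls($page, 'http://www.example.org/docs/index.html');
```

With the sample page above, the relative link faq.html is resolved under /docs/ and the link /about/ is resolved from the server root; the image is ignored.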

It doesn't handle images unless they are given as absolute URLs in the
document, and it doesn't handle image maps at all.  Beyond that, the
only problem is that you will probably want to turn off link underlining
when you print webliographies, unless your printer shows _underscores_
distinctly in underlined text (the underline and the underscores
sometimes obscure each other).

Should you want to make changes of any sort, go ahead.  The source is
attached.  It requires Perl 5.002 or later, along with the CGI.pm
module and the libwww-perl library.
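
If you are not sure whether your Perl has what the script needs, something like the following sketch may help. It simply tries to load each module named in the script's "use" lines and reports the result. The helper name module_available is hypothetical, and the sketch does not check the Perl version itself.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded here.
sub module_available {
    my ($module) = @_;
    # Refuse anything that is not a plain Module::Name before using eval.
    return 0 unless $module =~ /^\w+(?:::\w+)*$/;
    return eval("require $module; 1") ? 1 : 0;
}

# The modules listed in the "use" lines of the attached source.
for my $module (qw(CGI LWP::Simple URI::URL HTML::Element HTML::Parse)) {
    printf "%-14s %s\n", $module,
        module_available($module) ? 'found' : 'MISSING';
}
```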

Sincerely,
:James Salsman (formerly of Netscape; with Bovik Research since 1990)

--- file eta.pm ---
#!/usr/local/bin/perl5

use CGI qw( cgi form html2 );       # include CGI.pm
use LWP::Simple;                    #  " libwww-perl.pm base
use URI::URL;                       #  "  " URL package
use HTML::Element;                  #  "  " element manipulation
use HTML::Parse;                    #  "  " HTML parser

$frm = new CGI;  # establish new Common Gateway Interface with httpd

print            # form
    $frm->header,
    ($base = $frm->param('target'))             # test for submission
    ? $frm->start_html("Anchors from $base")    # then include it in title
    : $frm->start_html('Extract the Anchors') , # else generic title
    $frm->h1('Extract The Anchors'), "\n",
    $frm->start_form(-method=>qw(GET)), "Enter URL target: \n ",
    $frm->textfield('target',"http://",50), " \n ",
    $frm->submit('Scan'), "\n",
    $frm->end_form, $frm->p;

if ($base) { # then we have a submission

    print $frm->h2('<A HREF="'.$base.'">'.$base.'</A>');  # entitle the list

    $doc = get($base);               # retrieve the HTML via HTTP
    $doc = '' unless defined $doc;   # get() returns undef if the fetch fails
    $syn = parse_html($doc);         # parse the HTML
#   print $frm->h3($syn->title);     # probably not this easy
#   print $doc;                      # for debugging, display it

    print "\n<DL>";                    # begin an indented definition list
    $thisScript = $frm->script_name(); # for prefixing subqueries

    for (@{ $syn->extract_links(qw(a)) }) { # find "A"nchors
	($link, $linkelem) = @$_;    # iterate over all the "A" elements
	$anch = $linkelem->as_HTML;  # get the <A HREF...>...</A> string

#	$anch =~ s/^[^>]*>//;        # remove the <A HREF...> part
#	$anch =~ s/<.*\$//;          # remove the </A> part
	$href = (url($link)->abs(url($base)))->as_string; # make absolute from base
	$locn = url($link)->as_string;               # use relative link to search
	$rplc = $thisScript . '?target=' . $href;    # and absolute link to replace
	$anch =~ s/\Q$locn\E/$rplc/i;                # match the link literally so the anchor calls this script

	print
	    "\n <DT>", $anch,               # show the anchor text
	    "  <DD><A HREF=", '"',          # and the hot, 
	    $href, '">', $href, "</A>";     # absolute URL HREF for that anchor 
    }
    print "\n</DL>\n";                      # done with the list
}

print $frm->hr, $frm->end_html, "\n";       # done with the output

# :jps 31 Dec 96
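
The link-rewriting step near the end of the loop can also be tried on its own. The sketch below is an illustration under stated assumptions, not part of the attached source: the sub names escape_query_value and retarget_anchor are hypothetical, and so are the sample URLs. It treats the link as a literal string when searching, since links often contain characters such as "?" and "." that are special in a Perl pattern, and it percent-encodes the target before putting it in the query string, a step the attached source does not take.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except letters, digits
# and the characters - . _ ~ so the value is safe inside a query string.
sub escape_query_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Hypothetical helper: replace the first literal occurrence of $old_link
# in $anchor_html with a link that calls $script on the absolute URL.
sub retarget_anchor {
    my ($anchor_html, $old_link, $absolute_url, $script) = @_;
    my $replacement = $script . '?target=' . escape_query_value($absolute_url);
    $anchor_html =~ s/\Q$old_link\E/$replacement/i;   # \Q..\E: literal match
    return $anchor_html;
}

print retarget_anchor('<a href="a.cgi?x=1">next</a>', 'a.cgi?x=1',
                      'http://www.example.org/a.cgi?x=1', '/eta.cgi'), "\n";
```

Without the \Q...\E quoting, the "?" in the sample link would be read as a pattern quantifier and the substitution would not find the link at all.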

 

