[WEB4LIB] Best/cheapest Tool for Counting URLs on a Web Page

Paul F. Schaffner pfs at umich.edu
Thu Oct 10 17:44:15 EDT 2002


> But what's the best/cheapest way to run through a number of (static,
> hand-created) Web pages to count the # of URLs on each page--and even
> better yet, generate a list of these URLs?  

By URLs I assume you mean URLs used as ordinary links--i.e. as the value of HREF 
attributes of <A>; including other kinds of URLs just means looking for 
other patterns.

I'm probably missing something here, but assuming they're your pages, that
you can get to the source files, that they live close together in the file
system, and that they are formatted with some consistency, why won't almost
any good text editor or grep-like text utility running under the appropriate
OS do the job as cheaply and easily as anything else? I regularly extract,
de-duplicate, count, and check hundreds of thousands of such elements by
such means.

If they were on a Wintel machine, for example, you could use an editor
like TextPad to search in files "*.html" in binary form for strings
matching a pattern like this:

  <a [^<]*href="[^#][^>]+>

Likewise, if all you want is http links, just search for the pattern

  ="http:[^"]+"

The former would retrieve a list of files and links
like this (sampling some local files here):

cheat.html: <A HREF="http://www.hti.umich.edu/t/tei">
eebofaq1.html: <a href="http://wwwlib.umi.com/eebo/featured">
eebofaq1.html: <a href="http://wwwlib.umi.com/eebo/">
eebofaq1.html: <a href="./samples.html">
instruct2.html: <A HREF="http://www.hti.umich.edu/t/tei/">
instruct2.html: <a href="./rnums.html">
notes.html: <a href="http://www.lib.umich.edu/eebo/docs/dox/noteflag.html">


The latter one like this:

cheat.html: ="http://www.hti.umich.edu/t/tei"
eebofaq1.html: ="http://wwwlib.umi.com/eebo/featured"
eebofaq1.html: ="http://wwwlib.umi.com/eebo/"
instruct2.html: ="http://www.hti.umich.edu/t/tei/"
instruct2.html: ="http://lcweb.loc.gov/marc/languages/"
instruct2.html: ="http://lcweb.loc.gov/standards/iso639-2/langhome.html"
notes.html: ="http://www.lib.umich.edu/eebo/docs/dox/noteflag.html"
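
For anyone without such an editor, the same two searches can be sketched
in a few lines of Python. This is only a rough sketch, not a description
of what TextPad does internally: the function names are made up for
illustration, and the case-insensitive flag and the latin-1 decoding are
assumptions of the sketch.

```python
import re
from pathlib import Path

# The two patterns quoted in the text. IGNORECASE is an assumption of this
# sketch, so that <A HREF=...> and <a href=...> are both found.
ANCHOR = re.compile(r'<a [^<]*href="[^#][^>]+>', re.IGNORECASE)
HTTP = re.compile(r'="http:[^"]+"')


def find_links(text, pattern=ANCHOR):
    """Return every substring of text that matches pattern."""
    return pattern.findall(text)


def list_links(directory=".", pattern=ANCHOR):
    """Print one 'file: match' line per hit in each *.html file."""
    for path in sorted(Path(directory).glob("*.html")):
        # latin-1 never fails to decode; adjust if the pages use another encoding
        text = path.read_text(encoding="latin-1")
        for match in find_links(text, pattern):
            print(f"{path.name}: {match}")
```

Calling list_links() in the directory that holds the pages prints one
"file: match" line per link, much like the listings above; passing
pattern=HTTP runs the second search instead.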

Most such editors can be set to count instead of display:

  Searching for: ="http:[^"]+"
  C:\ets\eebo\dox\cheat.html: 1
  C:\ets\eebo\dox\eebofaq1.html: 2
  C:\ets\eebo\dox\instruct.html: 2
  C:\ets\eebo\dox\instruct2.html: 3
  C:\ets\eebo\dox\notes.html: 1
  Found 9 occurrence(s) in 5 file(s)
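
A per-file count of the same kind can be sketched in Python as well. Again
this is only an illustration under stated assumptions: the function names
are hypothetical, the files are assumed to end in .html and sit in one
directory, and files with no matches are simply left out of the report.

```python
import re
from pathlib import Path

# The http-only pattern from the text. As written it matches "http:" links
# only, so https: or ftp: URLs would need their own pattern.
HTTP = re.compile(r'="http:[^"]+"')


def count_matches(text, pattern=HTTP):
    """Count non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


def count_in_files(directory=".", pattern=HTTP):
    """Map each *.html file name to its match count, skipping zero counts."""
    counts = {}
    for path in sorted(Path(directory).glob("*.html")):
        n = count_matches(path.read_text(encoding="latin-1"), pattern)
        if n:
            counts[path.name] = n
    return counts


def report(counts):
    """Print per-file counts followed by a summary line."""
    for name, n in counts.items():
        print(f"{name}: {n}")
    total = sum(counts.values())
    print(f"Found {total} occurrence(s) in {len(counts)} file(s)")
```

Something like report(count_in_files(r"C:\ets\eebo\dox")) would then print
one line per file plus a total, in roughly the shape shown above.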

But then, as I said, this is so obvious I'm probably missing something.

--------------------------------------------------------------------
Paul Schaffner | pfs at umich.edu | http://www-personal.umich.edu/~pfs/
--------------------------------------------------------------------