Hiding draft pages from browsers, search engines

Chuck Bearden cbearden at sparc.hpl.lib.tx.us
Mon Mar 23 17:21:32 EST 1998


On Sun, 22 Mar 1998 morganj at iupui.edu wrote:

> I sometimes have draft pages in a web directory.  I have assumed that if
> there are no links to these pages, others would have to know the name of
> the file to get to them.  One exception seems to be if there is no index
> or home file in the directory, the browser returns a list of files in the
> directory.  However, if there is an index or home file is there any way to
> force a browser to bypass it and list the files in the directory?

As others have noted, you can configure Apache and NCSA (and, I would
assume, other full-featured web servers) to prevent anyone from
retrieving a directory listing when no default page is present.  Steve
Thomas mentioned the .htaccess file as a place to make these settings.
I would add that in Apache/NCSA you can also use the access.conf file
(or any .conf file in Apache) for this purpose.
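For example, in Apache 1.x a minimal per-directory setting might look
like the following sketch (the directory path is just a placeholder,
and your AllowOverride settings must permit Options in .htaccess):

    # In .htaccess, or inside a <Directory> block in access.conf/httpd.conf:
    # turn off automatic directory listings when no index file is present
    Options -Indexes

The same line inside a <Directory /usr/local/etc/httpd/htdocs/drafts>
... </Directory> block in a .conf file covers that directory without
needing a .htaccess file at all.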

> Secondly, can these "hidden" files be indexed by search engines?  I know
> that same-site search engines, like Swish, index all files in the
> directory that are not specially marked, and that a Harvest gatherer can
> be pointed to index directories on the server as the gatherer (rather than
> use html links).  However can search engines be set to ignore the index
> and home files and index all files in a directory on a remote web server?

If they can't retrieve the file, they can't index it.  And if there
are no links to it, either elsewhere on your site or in the outside
world, and no one can retrieve directory listings, then a robot can't
retrieve it.  Unless folks are making links to these files from other
sites, this method would likely be more secure than a robots.txt file,
which a search engine may choose to ignore.  A robots.txt file would
also alert outsiders to the existence of any files explicitly named in
it (i.e. not just covered by a *), since anyone can read it.
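Just to illustrate the exposure: a robots.txt file that names a draft
explicitly would look something like this (the filename here is purely
hypothetical), and anyone can fetch /robots.txt with a browser:

    # /robots.txt at the server root; advisory only, robots may ignore it
    User-agent: *
    Disallow: /drafts/budget-draft.html

A broader entry such as "Disallow: /drafts/" reveals less, but it still
advertises that the directory exists.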

> This came to mind when I contemplated having a local email directory, and
> began to think about how to make it available to individuals within the
> library but not to email spammers' search engines.

Apache and NCSA (and, I would expect, other full-featured servers) will
let you either deny access to directories based on the IP or host
address of the requestor, or restrict access to folks with a valid
username/password pair.  These methods are not what you would call
"strong authentication", but they are probably adequate for most
general purposes.
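As a rough sketch in Apache 1.x syntax (the hostnames, paths, and realm
name are placeholders, not anything specific to your site), the two
approaches look like this:

    # Restrict by the requestor's host or IP address (mod_access):
    <Directory /usr/local/etc/httpd/htdocs/staff>
        order deny,allow
        deny from all
        allow from .yourlibrary.org 192.168.
    </Directory>

    # Or, instead, require a valid username/password (basic authentication):
    <Directory /usr/local/etc/httpd/htdocs/staff>
        AuthType Basic
        AuthName "Library staff pages"
        AuthUserFile /usr/local/etc/httpd/conf/.htpasswd
        require valid-user
    </Directory>

The same directives can go in a .htaccess file in the directory itself
if your AllowOverride settings permit it; the password file is created
and maintained with the htpasswd utility that ships with the server.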

These methods would solve the problem not only with search engine 
indexing agents, but with other folks you don't want retrieving these 
pages. 

Chuck Bearden
Network Services Librarian
Houston Public Library
Houston, TX  77002
713/247-2264 (voice)
713/247-1182 (fax)
cbearden at hpl.lib.tx.us
