Hiding draft pages from browsers, search engines

Liz Best lbest at brain.royalroads.ca
Sun Mar 22 14:42:49 EST 1998


> Date: Sun, 22 Mar 1998 09:45:46 -0500 (EST)
> From: <morganj at iupui.edu>
> Subject: Hiding draft pages from browsers, search engines
>
> I sometimes have draft pages in a web directory.  I have assumed that
> if there are no links to these pages, others would have to know the
> name of the file to get to them.
This is true.
> One exception seems to be if there is no index or home file in the
> directory, the browser returns a list of files in the directory.
This is also true, if the web server allows directory browsing. Your
webmaster sets this on a site basis.
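For example, on an Apache server this is controlled by the Options directive. This is just a sketch, and the directory path here is made up; check with your webmaster for the real configuration:

----begin sample-------
# In httpd.conf (or a per-directory .htaccess file, if allowed):
<Directory "/usr/local/www/drafts">
    # With "Indexes" present, a request for the bare directory returns
    # a file listing when no index page exists; "-Indexes" turns that off.
    Options -Indexes
</Directory>
----end sample-----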
> However, if there is an index or home file is there any way to force a
> browser to bypass it and list the files in the directory?
Not that I know of, but the webmaster could choose not to set a default home
page for the site; in that case a request for a directory would always
return a listing. You could also use a web site mapping tool to list a
site's files. I am looking into two at the moment: Wisebot
(http://www.tetranetsoftware.com/wisebot.htm) and PowerMapper
(http://www.electrum.co.uk/mapper/). Wisebot offers educational discounts.

> Secondly, can these "hidden" files be indexed by search engines?
Yes.
> I know that same-site search engines, like Swish, index all files in
> the directory that are not specially marked, and that a Harvest
> gatherer can be pointed to index directories on the server as the
> gatherer (rather than use html links).  However can search engines be
> set to ignore the index and home files and index all files in a
> directory on a remote web server?
Yes. In the root directory of your web server you need a robots.txt file,
though not all robots pay attention to it. Again, this is done on a
site basis by the webmaster. Here are some samples for robots.txt:
----begin sample-------
# robots.txt for http://www.royalroads.ca/
# This example indicates that no robots should visit this site further:
# go away
User-agent: *
Disallow: /

# This example indicates that no robots should visit any URL starting
# with "/cyberworld/map/" or "/tmp/":
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear

# This example indicates that no robots should visit any URL starting
# with "/cyberworld/map/", except the robot called "cybermapper":
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
----end sample-----
You can find more information on using robots.txt at
http://info.webcrawler.com/mak/projects/robots/robots.html

I think you would want yours to read something like this:

# This example indicates that no robots should visit the index pages
# in the "/e-mail/" directory, while leaving the other files crawlable:
User-agent: *
Disallow: /e-mail/index.html
Disallow: /e-mail/default.htm
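As a sanity check, a robot that honors these rules would skip the index
files but still fetch everything else in the directory. A quick sketch
using Python's standard urllib.robotparser (the staff.html name is
hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The rules from the sample above: block the index pages, allow the rest.
rules = """\
User-agent: *
Disallow: /e-mail/index.html
Disallow: /e-mail/default.htm
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# "*" stands for any well-behaved robot.
print(rp.can_fetch("*", "/e-mail/index.html"))  # False: disallowed
print(rp.can_fetch("*", "/e-mail/staff.html"))  # True: still crawlable
```

Remember that this only keeps out robots that choose to obey robots.txt;
it is not an access control mechanism.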

A second thing you can do is add a meta tag to the page itself:

     <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

but not all robots pay attention to this.

> This came to mind when I contemplated having a local email directory,
> and began to think about how to make it available to individuals
> within the library but not to email spammers' search engines.
>
> Jim Morgan
> morganj at iupui.edu
>
Regards,
Liz
________________________________________________
Elizabeth L. Best, Systems Analyst
Royal Roads University, Learning Resource Centre
2005 Sooke Road Victoria, BC  V9B 5Y2

Phone: (250) 391-2663 Fax:   (250) 391-2594
mailto:liz.best at royalroads.ca http://www.royalroads.ca/

