[WEB4LIB] Re: Making the Invisible Web more Visible

Richard Wiggins rich at richardwiggins.com
Fri Jun 7 23:34:09 EDT 2002


Some folks at Stanford proposed in 1997 or so a scheme for Web sites in
general to talk mano-a-mano to spiders, feeding them metadata and page
summaries directly, instead of hoping the spider can infer it all by
crawling and skimming HTML.  Great idea, never happened.

There is no reason why database-driven Web content can't be served to
spiders either along the lines of the Stanford idea, or as pseudo-static
HTML.  We did that years ago for the course catalog at Michigan State. 
Works great.  Yum, yum, great spider food.
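
To make the pseudo-static HTML idea concrete, here is a minimal sketch in Python. It is not the code we ran at Michigan State. The `courses` table, its columns, the sample rows, and the function names are all invented for illustration. The point is only that each database record becomes one plain HTML file that a spider can crawl like any other page.

```python
import html
import sqlite3
import tempfile
from pathlib import Path

# Plain HTML shell for one record; a spider sees an ordinary static page.
PAGE_TEMPLATE = """<html>
<head><title>{code}: {title}</title></head>
<body>
<h1>{code}: {title}</h1>
<p>{description}</p>
</body>
</html>
"""


def render_course_page(code, title, description):
    """Return the static HTML for a single database record."""
    return PAGE_TEMPLATE.format(
        code=html.escape(code),
        title=html.escape(title),
        description=html.escape(description),
    )


def export_course_pages(conn, out_dir):
    """Write one .html file per row of the (hypothetical) courses table."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    query = "SELECT code, title, description FROM courses ORDER BY code"
    for code, title, description in conn.execute(query):
        path = out_dir / (code.replace(" ", "_") + ".html")
        path.write_text(
            render_course_page(code, title, description), encoding="utf-8"
        )
        written.append(path)
    return written


def demo_connection():
    """Build a throwaway in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE courses (code TEXT, title TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO courses VALUES (?, ?, ?)",
        [
            ("PHY 101", "Introductory Physics", "Mechanics and motion."),
            ("CSE 101", "Introductory Programming", "Problem solving."),
        ],
    )
    return conn


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for page in export_course_pages(demo_connection(), tmp):
            print(page.name)
```

Run something like this on a schedule and point the spider at the output directory. The search results can then link either to the generated pages or back to the live database record.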

As an example, please visit:

http://search.msu.edu/specialindexes/index.html?q=&i=Courses

Search on any university course topic you find interesting.  For instance:

http://search.msu.edu/specialindexes/index.html?q=physics&i=Courses

You're searching an AltaVista index for content that resides in a database. 
No invisible Web any more.

This was fairly trivial to set up, and, as I say, we've been doing it for
years.  (A clever insight by my former colleague Dennis Boone was to build
pointer pages with null text for the HREFs, so the spider follows the links
but finds nothing on the pointer page itself to index.)
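
A pointer page of that kind might look roughly like what this Python sketch generates. This is my reconstruction of the idea, not Dennis's actual page. The function name and the URLs are hypothetical. Each link has an href but empty anchor text, so the page hands the spider a list of URLs to fetch while contributing no words of its own to the index.

```python
import html


def build_pointer_page(urls):
    """Return HTML whose only content is links with empty anchor text."""
    links = "\n".join(
        f'<a href="{html.escape(u, quote=True)}"></a>' for u in urls
    )
    return f"<html>\n<body>\n{links}\n</body>\n</html>\n"


# Hypothetical URLs of generated record pages.
page = build_pointer_page(["courses/PHY_101.html", "courses/CSE_101.html"])
print(page)
```

Whether a given spider follows links with empty anchor text is up to that spider, so test this against the crawler you actually use.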

The AltaVista Intranet product can also talk directly via ODBC to databases
to index content.

The whole "invisible Web" stuff is, in my opinion, way overblown.   All that
keeps database-driven content from being indexed by spiders is:

-- Lack of understanding of how easy this is to fix.

-- Lack of initiative by content providers to take trivial steps to fix it.

-- The fact that some content (e.g., human genome data) isn't suited to
textual indexes and OUGHT to be invisible to them.

Metadata and trust issues are, as Avi says, important, but I'm not sure how
they're peculiar to this discussion.  We "trust" the spider to index static
HTML and we could "trust" each other in the database realm as well.

/rich


Avi Rappoport wrote:

> 
> The place to start is the Open Archives Initiative, 
> <http://www.openarchives.org>.
> 
> The major search engines tend to be iffy about metadata because they 
> can't trust many of their sources -- too much spamming and scamming. 
> Cliff Lynch has written some good stuff about trust and credibility, 
> but it's not a trivial problem.
> 
> Avi
> 
> At 3:34 PM -0700 6/7/02, Hanan Cohen wrote:
> >Shalom
> >
> >I have an idea that I thought WEB4LIB would be the best place to tell
> >about it and see if it has something in it. I am not a librarian so
> >please excuse me for using the wrong terms to express the wrong ideas
> >(or vice versa), excuse me if WEB4LIB is the wrong place for this kind
> >of message and excuse me if what I suggest has already been done.
> >
> >The problem
> >
> >We all know that a lot (if not most) of the information available on the
> >Internet is invisible to indexing robots. They know how to index
> >information presented as HTML  and only recently Google was able to show
> >us content stored in DOC, RTF, PPT and PDF files. What's missing? Databases.
> >
> >What we have today are manually collected database directories. The
> >databases are collected manually because there is no automatic way to
> >index their content or their meta-data.
> >
> >Search robots cannot index information stored in databases because each
> >database has its own query syntax. Search robots are only able to index
> >the HTML pages leading to those databases.
> >
> >It would be very good if there was an agreed upon standard for
> >"exposing" ALL the information to indexing robots, but we know it's very
> >hard.
> >
> >The solution
> >
> >What I suggest is something simpler. Creating a standard for making the
> >METADATA on the databases available for automatic indexing.
> >
> >Publishers would publish an XML file with a standard structure
> >describing what's in their database.
> >
> >Indexing robots would find the standard XML file and index it in a
> >special index. Google (or any other search facility) would have a
> >"databases" tab on its interface and users would be able to search for
> >databases containing the information they need.
> >
> >I am not sure of what standardizing body should take it as their mission
> >to develop such a standard but I think it's essential.
> >
> >Thanks for listening.
> >
> >Hanan Cohen - http://www.info.org.il/english/
> >***Love and Peace***
> 
> 
> -- 
> Complete Guide to Search Engines for Web Sites and Intranets
>     <http://www.searchtools.com>

____________________________________________________
Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at richardwiggins.com       www.richardwiggins.com     


