Making the Invisible Web more Visible

Hanan Cohen hanan_cohen at fastmail.fm
Fri Jun 7 19:33:57 EDT 2002


Shalom

I have an idea, and I thought WEB4LIB would be the best place to share 
it and see whether it has any merit. I am not a librarian, so 
please excuse me for using the wrong terms to express the wrong ideas 
(or vice versa), excuse me if WEB4LIB is the wrong place for this kind 
of message, and excuse me if what I suggest has already been done.

The problem

We all know that a lot (if not most) of the information available on the 
Internet is invisible to indexing robots. Robots know how to index 
information presented as HTML, and only recently has Google been able to 
show us content stored in DOC, RTF, PPT and PDF files. What's missing? 
Databases.

What we have today are manually collected database directories. The 
databases are collected manually because there is no automatic way to 
index their content or their meta-data.

Search robots cannot index information stored in databases because each 
database has its own query syntax. Search robots are only able to index 
the HTML pages leading to those databases.

It would be very good if there were an agreed-upon standard for 
"exposing" ALL the information to indexing robots, but we know that is 
very hard to achieve.

The solution

What I suggest is something simpler: creating a standard for making the 
METADATA about the databases available for automatic indexing.

Publishers would publish an XML file with a standard structure 
describing what's in their database.
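
As a rough illustration only, a publisher might generate such a file 
along the following lines. No such standard exists yet, so every element 
name here (database, title, url, recordCount, subjects, subject) and the 
example.org address are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET


def build_descriptor(title, url, subjects, record_count):
    """Build a HYPOTHETICAL database descriptor as an XML string.

    The element names are invented for illustration; they do not come
    from any existing standard.
    """
    root = ET.Element("database")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "url").text = url
    ET.SubElement(root, "recordCount").text = str(record_count)
    subjects_el = ET.SubElement(root, "subjects")
    for subject in subjects:
        ET.SubElement(subjects_el, "subject").text = subject
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Example data is fictional.
descriptor_xml = build_descriptor(
    title="Example Plant Catalogue",
    url="http://example.org/plants",
    subjects=["botany", "horticulture"],
    record_count=12000,
)
print(descriptor_xml)
```

The point of the sketch is that the file describes the database as a 
whole (what it is about, how big it is, where to find it) rather than 
exposing the individual records.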

Indexing robots would find the standard XML file and index it in a 
special index. Google (or any other search facility) would have a 
"databases" tab on its interface and users would be able to search for 
databases containing the information they need.
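
To make the robot side concrete, here is a minimal sketch of how a 
"databases" index could be built from such files and then searched. It 
assumes the same made-up element names (database, title, url, subject), 
uses fictional sample data, and uses a simple word-to-URL index; a real 
search engine would of course do far more.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fictional descriptor files, as a robot might have fetched them.
# The XML structure is an assumption, not an existing standard.
SAMPLE_DESCRIPTORS = [
    "<database><title>Example Plant Catalogue</title>"
    "<url>http://example.org/plants</url>"
    "<subjects><subject>botany</subject>"
    "<subject>horticulture</subject></subjects></database>",
    "<database><title>Example Newspaper Archive</title>"
    "<url>http://example.org/news</url>"
    "<subjects><subject>history</subject>"
    "<subject>news</subject></subjects></database>",
]


def index_descriptors(descriptors):
    """Map each lowercase word in a title or subject to database URLs."""
    index = defaultdict(set)
    for xml_text in descriptors:
        root = ET.fromstring(xml_text)
        url = root.findtext("url")
        for element in root.iter():
            if element.tag in ("title", "subject") and element.text:
                for word in element.text.lower().split():
                    index[word].add(url)
    return index


def search(index, query):
    """Return sorted URLs of databases matching ALL words in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


database_index = index_descriptors(SAMPLE_DESCRIPTORS)
print(search(database_index, "botany"))
```

A user searching the "databases" tab would get back pointers to whole 
databases, and would then query each database through its own interface.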

I am not sure which standards body should take it on as its mission 
to develop such a standard, but I think it's essential.

Thanks for listening.

Hanan Cohen - http://www.info.org.il/english/
***Love and Peace***



