Making the Invisible Web more Visible
Hanan Cohen
hanan_cohen at fastmail.fm
Fri Jun 7 19:33:57 EDT 2002
Shalom
I have an idea, and I thought WEB4LIB would be the best place to share it
and see whether there is something to it. I am not a librarian, so
please excuse me if I use the wrong terms to express the wrong ideas
(or vice versa), excuse me if WEB4LIB is the wrong place for this kind
of message, and excuse me if what I suggest has already been done.
The problem
We all know that a lot (if not most) of the information available on the
Internet is invisible to indexing robots. They know how to index
information presented as HTML, and only recently has Google been able to
show us content stored in DOC, RTF, PPT, and PDF files. What's missing?
Databases.
What we have today are manually compiled database directories. They are
compiled by hand because there is no automatic way to index the
databases' content or their metadata.
Search robots cannot index information stored in databases because each
database has its own query syntax. Search robots can only index the
HTML pages leading to those databases.
It would be very good if there were an agreed-upon standard for
"exposing" ALL the information to indexing robots, but we know that
would be very hard.
The solution
What I suggest is something simpler: creating a standard for making the
METADATA about the databases available for automatic indexing.
Publishers would publish an XML file with a standard structure
describing what's in their database.
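
To make the idea concrete, here is a rough sketch of what such a file
might look like. Every element name, and the idea of a fixed file name,
is my own invention for illustration; deciding the real structure would
be the job of whatever body takes up the standard.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical structure; all element names are invented -->
    <database-description>
      <title>19th Century Ship Manifests</title>
      <publisher>Example Maritime Archive</publisher>
      <description>Searchable passenger and cargo manifests,
        1800-1899.</description>
      <subjects>
        <subject>maritime history</subject>
        <subject>genealogy</subject>
      </subjects>
      <record-count>125000</record-count>
      <search-url>http://www.example.org/manifests/search</search-url>
      <language>en</language>
    </database-description>

The point is that the file describes the database as a whole (its
subject, scope, and search entry point), not the individual records
inside it.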
Indexing robots would find the standard XML file and index it in a
special index. Google (or any other search facility) would have a
"databases" tab on its interface, and users would be able to search for
databases containing the information they need.
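
For illustration, here is a minimal sketch in Python of how a robot
might collect these files, assuming (my assumption, not a standard)
that each site serves the file from a well-known path, the way
robots.txt works, and using the invented element names from the example
above.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed location, analogous to robots.txt; not an existing standard.
    WELL_KNOWN_PATH = "/database-description.xml"

    def fetch_description(site):
        """Fetch and parse a site's database description file, if any."""
        url = site.rstrip("/") + WELL_KNOWN_PATH
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return ET.parse(response).getroot()
        except (OSError, ET.ParseError):
            return None  # no description file, or it is malformed

    def index_site(site, database_index):
        """Add one site's database metadata to the 'databases' index."""
        root = fetch_description(site)
        if root is None:
            return
        database_index.append({
            "site": site,
            "title": root.findtext("title", default=""),
            "description": root.findtext("description", default=""),
            "subjects": [s.text for s in root.iter("subject")],
            "search_url": root.findtext("search-url", default=""),
        })

    # Example: crawl a list of sites and build the special index.
    database_index = []
    for site in ["http://www.example.org", "http://archive.example.net"]:
        index_site(site, database_index)

A user's query against that special index would then return pointers to
each database's own search page, rather than to individual records.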
I am not sure which standards body should take it as their mission to
develop such a standard, but I think it's essential.
Thanks for listening.
Hanan Cohen - http://www.info.org.il/english/
***Love and Peace***