Organizing Web Information

marc at ckm.ucsf.edu
Tue Jul 16 11:46:19 EDT 1996


George Porter wrote:
|The task of separating the wheat from the chaff appears to be nontrivial 
|and I don't have anything to offer on that count at the moment.
 
This is a collection development problem that does not lend itself easily
to an automated solution.
 
Wilfred [Bill] Drew in response to George Porter:
|Why are we only supposed to index so-called "Scholarly" information?  There 
|are many other sources besides those from "scholars" that need to be indexed 
|and are of value.
 
|The condemnation of personal home pages is troubling at best and certainly 
|smells like intellectual elitism.  
 
As an indexing problem, choosing a relatively coherent subset of web info
is a nice simplifying assumption.  The theory is that if one can successfully
index that coherent subset, those lessons and techniques can be propagated
upward in scope toward the web as a whole.

This is an active debate, especially here in SF, where our brand spanking
new main library has popular press aplenty (travel books, self-help books,
Newsweek, Time, etc.) but has engaged in a herbicidal weeding policy designed
not to eliminate fluff, but to cram a city-mandated growing collection into a
space that was (erroneously?) designed too small.  What is kept and made room
for, compared to what is tossed, makes an interesting comparison.
 
|> And that the application of metatags by document authors is the beginning of
|> the solution to the document indexing/retrieval problem.
|I agree with this statement.  Librarians should be creating the indexing and
|classification schemes while authors could pick a particular classification 
|from such a scheme.  There is no longer time for librarians to fully catalog 
|every new electronic document that comes along.
 
Sorry, but there is no free lunch.  And I do not believe that every web object
*deserves* to be classified in depth.  Do web objects have abstract rights,
such as a right to life or equal access to indexing?  Librarians' hands will be
full cataloging every new electronic document from the set of document classes
deemed worthy of analysis in any case.  So don't count on accessing EVERY or
even MOST web documents through a legitimate LCSH search.
 
The high intellectual value added by indexing and cataloging is not work
that can be left to amateurs or even to programs.  Someone at the WWW conf in
Paris in May asked, "How many people fill out the metadata box in MS Word
for each doc they create?"  It probably cannot even be left to <META> tags
and HTTP MIME header fields if it is to function in the manner to which
librarians would like to become accustomed.
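 
To make the existing mechanism concrete (a sketch; the tag values here are
illustrative, and only HTTP-EQUIV tags are meant to be echoed as headers),
an author can put

  <!-- illustrative values, not prescribed ones -->
  <META HTTP-EQUIV="Expires" CONTENT="Tue, 31 Dec 1996 23:59:59 GMT">
  <META NAME="keywords" CONTENT="cataloging, metadata, indexing">

in the <HEAD> of an HTML doc; a cooperating server translates the HTTP-EQUIV
tag into the response header

  Expires: Tue, 31 Dec 1996 23:59:59 GMT

while the NAME-style tag carries author-supplied index terms--precisely the
part one cannot count on authors to fill out well.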
 
> Karen Schneider is right to assert that indexing articles is the key to
> meaningful retrieval.  The question is how to make it work.
 
I've been working on an Internet-Draft for an extension to HTTP that adds
a native method (META) for accessing metainformation associated with an 
object.  Currently, HTTP allows metainformation to be expressed through 
the MIME headers, translated from the <META> tag (if we're talking about an 
HTML doc).  My proposal allows for content negotiation among different 
metadata content types as they develop.  If anyone is interested, check out:
 
<URL:http://www.ckm.ucsf.edu/marc/work/meta/draft-salomon-http-meta-00.txt>
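 
As a rough sketch of the idea (illustrative only--this is not the draft's
literal syntax, and the metadata content types named here are placeholders),
a client could request a catalog record for an object and negotiate its
format much as GET negotiates document formats:

  META /pub/article.html HTTP/1.0
  Accept: application/x-usmarc, text/x-dublin-core

  HTTP/1.0 200 OK
  Content-Type: application/x-usmarc

  [a MARC record describing /pub/article.html would follow here]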
 
|The present journals cannot fill that need because of the amount of new 
|information coming out every day.
 
I'm not sure what you mean by this, but if you mean that the nature of
journal periodicity changes from quarterly issues to a trickle of articles as
they pass peer review, then it's time to get used to it--the rules have just
changed.  Or more specifically, the constraints of the paper world, with its
high capital costs of printing and distribution, are no more.  If you mean 
the trickling out of totally new journals, then that should keep the colldev 
folks hopping.
 
-marc


