Organizing Web Information

Wed Jul 17 12:21:34 EDT 1996

Howard Pasternack wrote:

[stuff cut]
> Web pages are not books.  They are most akin to individual documents found
> in manuscript collections.

I agree that web pages are not books, but there are also problems with the 
'documents in a manuscript collection' analogy -- for example, web pages might 
contain images, sounds, video clips etc., each of which may be considered a valuable 
resource in its own right, may physically reside on an entirely different server, 
may have different author(s), etc.

Networked resources are often only limited analogues of traditional library stock, 
and to an extent require new paradigms and practices; compare traditional linear 
narratives with hypertext, for example.

>  For the most part we do not index the individual
> manuscript pages because we can't.  We're lucky if we can index at the
> collection level.  So, maybe librarians should focus on identifying useful
> sites and leave the indexing of the pages to automatic schemes like
> Alta Vista.

We (eLib ANR Projects [1]) have been having some discussion about the problem of 
Internet cataloguing granularity recently on the COUSNS [2] list, because as Howard 
points out, it's impractical in many systems to index individual pages, embedded 
images etc. within a site.

The discussions started when I mentioned our project's wish to create a relational 
structure for our metadata database. This would allow the various components of a 
web site to be catalogued to a level of detail that was appropriate to their 
usefulness, and for them to be linked to a more comprehensive 'site' level record.

This would hopefully allow us to:

* Offer end-users more customisable search options
* Provide better information about the relationship between a resource
  returned by a search and its overall site
* Reduce redundancy and allow significant resources within a site's
  hierarchy to be catalogued more quickly without having to duplicate all the
  general 'site-level' information
* Allow file and page level resources to be described more accurately,
  increasing precision of retrieval
* Enable output in some of the more popular 'would-be' standards for metadata
  that are emerging at the moment [3], allowing interoperability with other
  search engines
* Integrate vocabulary resources into the relational structure (in our case,
  Getty AHIP's Art & Architecture Thesaurus)
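
As a rough illustration of the relational structure described above, here is a
minimal sketch using Python's built-in sqlite3 module. The table names, column
names and example.org URLs are all hypothetical, chosen for illustration only;
this is not the project's actual schema. General information is held once in a
'site' record, and each component worth cataloguing gets its own 'resource'
record that links back to it.

```python
import sqlite3

# In-memory database; every table, column and URL here is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    site_id     INTEGER PRIMARY KEY,
    url         TEXT NOT NULL,
    title       TEXT,
    publisher   TEXT,
    description TEXT
);
CREATE TABLE resource (
    resource_id INTEGER PRIMARY KEY,
    site_id     INTEGER NOT NULL REFERENCES site(site_id),
    url         TEXT NOT NULL,
    kind        TEXT,          -- e.g. 'page', 'image', 'sound'
    title       TEXT,
    description TEXT           -- filled in only where it is worth the effort
);
""")

# Site-level information is recorded once.
conn.execute(
    "INSERT INTO site (site_id, url, title, publisher, description) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "http://www.example.org/", "Example Gallery", "Example Institute",
     "Hypothetical art and design site"),
)

# Component records carry only what is specific to them.
conn.executemany(
    "INSERT INTO resource (site_id, url, kind, title, description) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "http://www.example.org/posters/1.gif", "image",
         "Exhibition poster", "Scanned poster image"),
        (1, "http://www.example.org/essays/type.html", "page",
         "Essay on typography", None),
    ],
)


def search_resources(conn, term):
    """Return matching resources together with their parent site's details."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT r.title, r.kind, s.title, s.publisher "
        "FROM resource r JOIN site s ON s.site_id = r.site_id "
        "WHERE r.title LIKE ? OR r.description LIKE ? "
        "ORDER BY r.resource_id",
        (pattern, pattern),
    ).fetchall()
```

A call such as search_resources(conn, "poster") returns the image record along
with the title and publisher of the site it belongs to, without that site-level
information ever having been duplicated in the resource record.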

The theory is that, by implementing a more complex system that offers enhanced
functionality, we are investing in streamlining the most time-consuming
and expensive part of our work -- the evaluation and cataloguing of Internet 
resources. It is in this painstaking work that the 'quality' that 
differentiates us from the likes of Yahoo! and Alta Vista lies.

I'd be interested to find out how many other library initiatives are using or 
planning to use either relational or object-orientated data structures.

The wider debate in this thread, however, seems to be about the role for librarians 
and manual indexing in the world of exponential information growth and 
'vacuum-cleaners' like Alta Vista; below is my 2p's worth:

1. Information authors/providers *must* provide some level of resource description; 
whether this is a simple list of key attributes or a complex subject-specific record 
shouldn't really matter. Dublin Core/Warwick Framework [4] seems very promising, as 
it allows anything from very simple descriptions using HTML <meta> tags through 
to more complex subject-specific schemes, by specifying a 'container architecture 
for metadata.'

2. 'Vacuum cleaners' should use the author-generated descriptions, possibly in 
conjunction with full text retrieval and/or summarizer techniques, to produce a 
'rough but comprehensive' distributed database.

3. Subject-specialists performing librarian-type roles could then search through 
these big messy databases, check and evaluate the resources, and edit the catalogue 
record where necessary and worthwhile.
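
To make points 1 and 2 a little more concrete, here is a minimal sketch of what
a 'vacuum cleaner' might do with a single page, using only Python's standard
html.parser module. The 'DC.title'-style <meta> names, the rough_record
function and the record layout are assumptions for illustration, not part of
any published specification or existing harvester. Author-supplied <meta>
descriptions are collected where present, and a crude summary taken from the
page text stands in for full text retrieval or summarizer techniques.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs and the visible page text."""

    def __init__(self):
        super().__init__()
        self.meta = {}   # lower-cased meta name -> list of content values
        self.text = []   # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.meta.setdefault(name.lower(), []).append(content)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def rough_record(html, summary_words=5):
    """Build a 'rough but comprehensive' record for one page (hypothetical layout)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text).split()
    return {
        "fields": parser.meta,
        # Note where the description came from, so that a human cataloguer
        # can decide which records most need checking.
        "source": "author" if parser.meta else "full-text",
        "summary": " ".join(words[:summary_words]),
    }


described = rough_record(
    '<html><head><meta name="DC.title" content="Exhibition poster">'
    '<meta name="DC.creator" content="A. Author"></head>'
    "<body><p>Scanned poster.</p></body></html>"
)
undescribed = rough_record(
    "<html><body><p>No description here, just some page text.</p></body></html>"
)
```

The 'source' value is the hook for point 3: records built from page text alone
are the obvious candidates for a subject specialist to check and edit first.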

Basically, I see the Internet librarian's role shifting from 'primarily cataloguing' 
to 'primarily evaluating with a bit of cataloguing quality control'.

Tony

[1] Electronic Libraries Programme Access to Network Resources Projects.
[2] Committee of UK Subject-based Network Services; mailing list archive available 
at http://www.mailbase.ac.uk/lists-a-e/cousns/
[3] For example IAFA/WHOIS++ Templates, Harvest SOIF, Dublin Core/Warwick Framework, 
Z39.50 BIB-1, MARC.
[4] http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593

p.s. I don't normally write footnotes on e-mails -- I'm new to this list and don't 
know how familiar people will be with this stuff!
-- 
== Tony Gill ================================= ADAM Project Leader ==
 Surrey Institute of Art & Design * Farnham * Surrey * GU9 7DS * UK
      Tel: +44 (0)1252 722441 x2427 * Fax: +44 (0)1252 712925
== tony at adam.ac.uk ============================= http://adam.ac.uk ==