[LONG] A Note on Internet Search Engines and Metadata

Fri Jan 15 10:29:48 EST 1999

Here is a longish note I prepared to help guide a
couple of organizations' activities about this. 
The META tag mythology is distracting since organizations
believe this will help people beat a path to their
door. This is wrong. Metadata is useful, but it
needs a context to make it so. Comments and criticisms 
are welcomed as always.

-tk

-------------------------------------------------------

A NOTE ON INTERNET SEARCH ENGINES AND METADATA

(DRAFT: December 11, 1998)

Terry Kuny
email: terry.kuny at xist.com

INTRODUCTION

The use of metadata by Internet search engine developers 
has been a recurring thread in numerous metadata and 
library-related lists. At first glance, it seems apparent 
that having the Internet search engine services use metadata 
would be critical to uptake and successful deployment of 
metadata schemas. As it turns out, this assumption is probably 
wrong, but it is one that is frequently made. 
The usual refrain that stimulates this discussion is:

1. An organization or company has the following objective: 
"We want to make our information more accessible on 
the Internet."

2. The above organization or company then addresses 
this objective by wanting to use metadata, in the 
hope that it will be picked up by Internet search 
engines and that users will be able to access this 
metadata in some meaningful manner.

Therefore the question put to the metadata 
community has been: "What efforts are being made 
to have search engine developers integrate metadata 
support into their products?"

This note provides an update on the role of metadata, 
the future of Internet search engines, and the state 
of Internet information retrieval.

WHAT DO SEARCH ENGINES DO WITH METADATA?

None of the Internet search engines accessible on the 
web (e.g. Yahoo, AltaVista, Excite, Infoseek) 
will actually read Dublin Core (DC) or any other metadata and do 
meaningful things with this information. For example, 
none provide structured metadata searches or appear to 
use the metadata to rank metadata-enabled resources differently. 
Some search engines use a non-standardized description 
metadata tag to provide a text string describing a resource 
in results listings. This is a weak mechanism for helping to
determine search relevance and is not proven to be effective
for scanning, since the tag is used infrequently and/or poorly.
There are no search engines that specifically exploit 
the advantages of structured search using any metadata scheme.
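
To make the distinction concrete, here is a minimal sketch of what "reading" 
metadata from a page would involve. It collects META name/content pairs from 
an HTML document, including the description tag and Dublin Core elements. 
The sample page, the "DC."-prefixed naming convention, and the helper names 
(MetaTagExtractor, extract_meta) are assumptions made for illustration; this 
is not how any particular search engine works.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}  # lowercased tag name -> list of content strings

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        content = attr.get("content")
        # Skip tags without both a name and a content value
        # (for example http-equiv tags).
        if name and content is not None:
            self.meta.setdefault(name.lower(), []).append(content)


def extract_meta(html):
    """Return a dict of META tag values found in the given HTML text."""
    parser = MetaTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page carrying a description tag and three DC elements.
SAMPLE_PAGE = """
<html><head>
<title>Annual Report 1998</title>
<meta name="description" content="Annual report of a hypothetical library.">
<meta name="DC.Title" content="Annual Report 1998">
<meta name="DC.Creator" content="Example Library">
<meta name="DC.Date" content="1998-12-11">
</head><body><p>Full text of the report.</p></body></html>
"""

if __name__ == "__main__":
    for name, values in sorted(extract_meta(SAMPLE_PAGE).items()):
        print(name, "=", "; ".join(values))
```

Extracting the fields is the easy part. The point of the section above is 
that the public search services do nothing structured with them once 
extracted.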

WHAT IS THE FUTURE OF METADATA SUPPORT IN INTERNET SEARCH ENGINES?

Will this situation change? The answer is not encouraging - 
at least from the perspective of those who believe that metadata 
support in the Internet search services is key to the success 
of metadata. All Internet search engines are currently under 
considerable scrutiny and criticism for the manner in which 
they provide their services. Some of the criticisms leveled 
against search engines include:

· Deceptive claims about coverage.
· Lack of information about how the index is constructed.
· Commercialization of results sets.
· Increased hit counts and poor relevance ranking.
· Clumsy and non-standard search syntaxes.

Numerous individuals within different metadata communities 
and within other interested communities have been in regular 
communications with the major Internet search engine developers 
(e.g. AltaVista, Excite, Yahoo). It is notable that none of 
the major Internet search engine developers are participating 
in metadata standards work. These companies have indicated 
that they have no intention of providing metadata support 
or functionality at this time, because of the following:

1. There is no clear business case for adding metadata support. 
The search engine companies' commercial interest is in keeping 
people at their sites for as long as possible to support their 
commercial clients and advertisers. It is not in their business 
plan to build a public domain search service that provides effective 
results. The driving force behind the web right now, and this 
is true of search engines, is primarily commerce and marketing. 
Dog chow vendors and other companies have an interest in being 
seen in the first couple of screens of a results set. Dog chow 
vendors have the money and a commercial interest. Their interests 
are congruent with the search engine provider's interest in 
getting money to improve access to company information. The 
commercial interests of search engine companies are to support 
a paying audience, not the free riders. Internet search engines 
are not public goods.

2. There is a justifiable fear that metadata will be used 
inappropriately by content providers to manipulate their position 
in search results, i.e. index spamming or "metajacking." There are a 
number of techniques for doing this kind of spam and there 
are no simple technical solutions to guard against this kind 
of activity. It should be noted that there are already a number 
of legal cases involving the misuse of "metadata" and index 
spamming. 

3. The quality and consistency of managed collections of metadata 
(provided by organizations with a clear interest in reliable 
metadata, such as libraries, museums, government agencies, 
publishers, etc.) is always likely to be higher than harvested 
metadata of unknown provenance. More on this below.

4. There is also reluctance on the part of developers of Intranet 
systems to implement metadata in their systems since the business 
case has not been made for them either. A general consensus has 
emerged that these vendors will remain uninterested until buyers 
put significant pressure on them to develop metadata-enabled 
search engines and provide meaningful metadata support tools. 
It is these vendors that the metadata standards community is 
particularly interested in getting on-board with metadata 
development. 

It should also be noted that many Internet search engines are 
moving toward a "portal" strategy. At the heart of this strategy 
is not the continuance of a generalized search index, but rather
a set of niche-oriented, focused, classified and categorized 
information services, usually targeted at supporting 
e-commerce imperatives.

According to an article in Science magazine in early 1998, 
none of the current search systems cover more than, at best, 
a third of publicly available web resources. It is also 
interesting to note that the most popular service, Yahoo, 
is also the one with the smallest number of resources - a 
very tiny fraction of the others.  Why is this so? Yahoo's 
use of a classification system provides them with a retrieval 
tool that is viewed by users as better than keywords in full-text 
databases. It is Yahoo's success and model that other Internet 
search services are trying to emulate. 

The reality that Internet search engines are not the primary 
sites for metadata activity or exploitation has been recognized 
widely in the metadata community. In response, in Australia, Europe, 
and the U.S., various organizations and "portals" have developed 
their own metadata-enabled services. Examples include the various 
EC/UK subject gateway services, the Internet Scout project, 
work at OCLC, Australia's Government Locator Service (AGLS) 
and BEP (Business Entry Point) service, and so forth. 
The IFLA Metadata page has a list of many of these projects 
<URL: http://www.ifla.org/II/metadata.htm>.

CHANGE IN PHILOSOPHY OF SEARCHING

What is apparent to researchers in distributed indexing is 
that the current strategy of search engines - that is, to 
indiscriminately harvest whatever they can find and then do 
selective indexing on those contents - is an unsustainable 
architecture for retrieval in a billion document universe. 
Full-text searching on heterogeneous contents is intrinsically 
problematic, resulting in the retrieval of tremendous numbers 
of low relevance documents. Standardized metadata upon which 
to support structured searches in a billion document universe 
is similarly problematic, both because a common standard is 
unlikely and because even simple metadata may be inadequate 
for providing meaningful retrieval. There does not exist an 
architecture that will support effective and widespread networked 
resource discovery in a scalable manner. 

The development of a scalable, networked retrieval architecture 
remains a significant area of research and development. There is 
some expectation that standards developments such as XML/RDF will 
be able to address some, certainly not all, of the requirements for 
enhanced search and retrieval in a networked environment.

The metadata community is moving away from a "catalog everything" 
approach to a domain-specific metadata approach. This is an appropriate, 
and much needed, shift in perspective, since it is widely held that 
generalized Internet search engines will probably remain largely 
ineffective at retrieval even if metadata were 
widely adopted. 

If metadata is going to work, we must give up on AltaVista, Excite, 
Yahoo, etc. and develop search engines that serve our own needs (which 
include identifying, acquiring, selecting, describing, arranging, storing, 
and retrieving). Abandoning the "Star Trek" view of information retrieval 
("Computer: What is the gestation period of a blue Tribble?") is a much 
needed development. It can also be argued that this means abandoning 
or redirecting the "single window" metaphor that has become prevalent 
in various communities.

A sound approach to the problems of networked resource discovery 
is to recognize that information retrieval is an iterative process, 
and that metadata can be used most effectively when it is given a 
particular context for its use. For example, we do not go to a 
library OPAC to find local pizza restaurants. Local resource 
description communities must work to develop effective navigation 
and search services that make their own information resources more 
accessible through the development of domain-specific, tightly scoped, 
niche indexes and retrieval tools that support identifiable user 
communities. The strategy is to think local, act local. 
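
As an illustration of what a tightly scoped, metadata-enabled index can do 
that a full-text keyword index cannot, here is a minimal sketch of a fielded 
search over a small collection of records. The records, field names, and 
function names (build_field_index, search) are hypothetical; a real service 
would use whatever scheme suits its own community.

```python
from collections import defaultdict

# Hypothetical catalogue records for a small, locally managed collection.
RECORDS = [
    {"id": "rec1", "title": "Hamlet",
     "creator": "William Shakespeare", "subject": ["Tragedy"]},
    {"id": "rec2", "title": "Shakespeare and His Times",
     "creator": "Jane Doe", "subject": ["Shakespeare", "Criticism"]},
]


def build_field_index(records, fields):
    """Map (field, token) pairs to the set of record ids containing them."""
    index = defaultdict(set)
    for rec in records:
        for field in fields:
            values = rec.get(field, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                for token in value.lower().split():
                    index[(field, token)].add(rec["id"])
    return index


def search(index, **criteria):
    """Return sorted ids of records matching every field=term criterion."""
    result = None
    for field, term in criteria.items():
        ids = index.get((field, term.lower()), set())
        result = ids if result is None else result & ids
    return sorted(result or [])


INDEX = build_field_index(RECORDS, ["title", "creator", "subject"])

if __name__ == "__main__":
    # Works written by Shakespeare versus works about Shakespeare.
    print(search(INDEX, creator="shakespeare"))
    print(search(INDEX, subject="shakespeare"))
```

A plain keyword search for "shakespeare" would return both records. The 
fielded search can tell a work by Shakespeare from a work about him, but 
only because the collection is small, consistently described, and built 
for users who know what the fields mean.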

RECOMMENDATION

As a guidance document, the following recommendation may 
be considered as an appropriate response given the current 
Internet search engine environment:

Organizations should not implement metadata based on the belief that 
undertaking this will somehow enhance retrieval through Internet 
search engines at this time.

An organization may choose to implement a variety of different 
metadata-enabled retrieval services and functions, but these 
should each be addressed on its own merits and not with the 
assumption that they will increase visibility within the Internet 
search engines. Metadata implementations should have a localized 
context for their use and development.

The answer to the question of how to make an organization's website 
or information resources more visible or accessible 
does not lie in metadata. Improvements in making 
local web resources more usable and navigable can help 
considerably. But in the final analysis, it is organizational 
commitment, effective marketing and promotion, 
regular communications with user communities, and the provision 
of compelling contents and services, which will be the key 
determinants of success in making information accessible.