[LONG] A Note on Internet Search Engines and Metadata
Terry Kuny
Terry.Kuny at xist.com
Fri Jan 15 10:29:48 EST 1999
Here is a longish note I prepared to help guide a
couple of organizations' activities about this.
The META tag mythology is distracting since organizations
believe this will help people beat a path to their
door. This is wrong. Metadata is useful, but it
needs a context to make it so. Comments and criticisms
are welcomed as always.
-tk
-------------------------------------------------------
A NOTE ON INTERNET SEARCH ENGINES AND METADATA
(DRAFT: December 11, 1998)
Terry Kuny
email: terry.kuny at xist.com
INTRODUCTION
The use of metadata by Internet search engine developers
has been a recurring thread in numerous metadata and
library-related lists. At first glance, it seems apparent
that having the Internet search engine services use metadata
would be critical to uptake and successful deployment of
metadata schemas. As it turns out, this assertion is probably
wrong, but it is one that is frequently made.
The usual refrain that stimulates this discussion is:
1. An organization or company has the following objective:
"We want to make our information more accessible on
the Internet."
2. The above organization or company then addresses
this objective by wanting to use metadata, in the
hope that it will be picked up by Internet search
engines and that users will be able to access this
metadata in some meaningful manner.
Therefore the question put to the metadata
community has been: "What efforts are being made
to have search engine developers integrate metadata
support into their products?"
This note provides an update on the role of metadata,
the future of Internet search engines, and the state
of Internet information retrieval.
WHAT DO SEARCH ENGINES DO WITH METADATA?
None of the Internet search engines accessible on the
web (e.g. Yahoo, Alta Vista, Excite, Infoseek)
will actually read Dublin Core (DC) or any other metadata and do
meaningful things with this information. For example,
none provide structured metadata searches or appear to
use the metadata to rank metadata-enabled resources differently.
Some search engines use a non-standardized description
metadata tag to provide a text string describing a resource
in results listings. This is a weak mechanism for helping to
determine search relevance and is not proven to be effective
for scanning since the tag is used infrequently and/or poorly.
There are no search engines that specifically exploit
the advantages of structured search using any metadata scheme.
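
To make the distinction concrete, here is a minimal sketch of the kind of
embedded metadata under discussion. It assumes Python and its standard
html.parser module. The sample page, the MetaTagCollector class and the
extract_meta helper are hypothetical names invented for illustration, and
the "DC." prefix is shown only as one common convention for embedding
Dublin Core in HTML META tags. It does not describe how any search engine
actually processes pages.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from META tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # A name may repeat, so keep every value in a list.
            self.meta.setdefault(name.lower(), []).append(content)


def extract_meta(html_text):
    """Return a dict mapping lowercased META names to lists of values."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.meta


# Invented sample page; the agency and its report do not exist.
SAMPLE_PAGE = """
<html><head>
<title>Annual Report</title>
<meta name="description" content="Annual report of a hypothetical agency.">
<meta name="DC.Title" content="Annual Report 1998">
<meta name="DC.Creator" content="Hypothetical Agency">
<meta name="DC.Subject" content="government; reports">
</head><body>...</body></html>
"""

meta = extract_meta(SAMPLE_PAGE)

# The free-text description string that some engines show in result lists.
description = meta.get("description", [""])[0]

# The structured elements that, per this note, no public engine exploits.
dublin_core = {k: v for k, v in meta.items() if k.startswith("dc.")}

print(description)
print(dublin_core)
```

The description value is a single free-text string, while the DC elements
are separate named fields. It is the second kind of data that would be
needed for structured searching.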
WHAT IS THE FUTURE OF METADATA SUPPORT IN INTERNET SEARCH ENGINES?
Will this situation change? The answer is not encouraging -
at least from the perspective of those who believe that metadata
support in the Internet search services is key to the success
of metadata. All Internet search engines are currently under
considerable scrutiny and criticism for the manner in which
they provide their services. Some of the criticisms leveled
against search engines include:
· Deceptive claims about coverage.
· Lack of information about how the index is constructed.
· Commercialization of results sets.
· Increased hit counts and poor relevance ranking.
· Clumsy and non-standard search syntaxes.
Numerous individuals within different metadata communities
and within other interested communities have been in regular
communications with the major Internet search engine developers
(e.g. AltaVista, Excite, Yahoo). It is notable that none of
the major Internet search engine developers are participating
in metadata standards work. These companies have indicated
that they have no intention of providing metadata support
or functionality at this time, because of the following:
1. There is no clear business case for adding metadata support.
The search engine companies' commercial interest is in keeping
people at their sites for as long as possible to support their
commercial clients and advertisers. It is not in their business
plan to build a public domain search service that provides effective
results. The driving force behind the web right now, and this
is true of search engines, is primarily commerce and marketing.
Dog chow vendors and other companies have an interest in being
seen in the first couple of screens of a results set. Dog chow
vendors have the money and a commercial interest. Their interests
are congruent with the search engine provider's interest in
getting money to improve access to company information. The
commercial interests of search engine companies are to support
a paying audience, not the free riders. Internet search engines
are not public goods.
2. There is a justifiable fear that metadata will be used
inappropriately by users to exploit their position in search
results, i.e. index spamming or "metajacking." There are a
number of techniques for doing this kind of spam and there
are no simple technical solutions to guard against this kind
of activity. It should be noted that there are already a number
of legal cases involving the misuse of "metadata" and index
spamming.
3. The quality and consistency of managed collections of metadata
(provided by organizations with a clear interest in reliable
metadata, such as libraries, museums, government agencies,
publishers, etc.) is always likely to be higher than harvested
metadata of unknown provenance. More on this below.
4. There is also reluctance on the part of developers of Intranet
systems to implement metadata in their systems since the business
case has not been made for them either. A general consensus has
emerged that these vendors will remain uninterested until buyers
put significant pressure on them to develop metadata-enabled
search engines and provide meaningful metadata support tools.
It is these vendors that the metadata standards community is
particularly interested in getting on-board with metadata
development.
It should also be noted that many Internet search engines are
moving toward a "portal" strategy. At the heart of this strategy
is not the continuance of a generalized search index, but rather
a set of niche-oriented, focused, classified and categorized
information services, usually targeted at supporting
e-commerce imperatives.
According to an article in Science magazine in early 1998,
none of the current search systems cover more than, at best,
a third of publicly available web resources. It is also
interesting to note that the most popular service, Yahoo,
is also the one with the smallest number of resources - a
very tiny fraction of the others. Why is this so? Yahoo's
use of a classification system provides them with a retrieval
tool that is viewed by users as better than keywords in full-text
databases. It is Yahoo's success and model that other Internet
search services are trying to emulate.
The reality that Internet search engines are not the primary
sites for metadata activity or exploitation has been recognized
widely in the metadata community. In response, in Australia, Europe,
and the U.S., various organizations and "portals" have developed
their own metadata-enabled services. Examples include the various
EC/UK subject gateway services, the Internet Scout project,
work at OCLC, Australia's Government Locator Service (AGLS)
and BEP (Business Entry Point) service, and so forth.
The IFLA Metadata page has a list of many of these projects
<URL: http://www.ifla.org/II/metadata.htm>.
CHANGE IN PHILOSOPHY OF SEARCHING
What is apparent to researchers in distributed indexing is
that the current strategy of search engines - that is, to
indiscriminately harvest whatever they can find and then do
selective indexing on those contents - is an unsustainable
architecture for retrieval in a billion document universe.
Full-text searching on heterogeneous contents is intrinsically
problematic, resulting in the retrieval of tremendous numbers
of low relevance documents. Standardized metadata upon which
to support structured searches in a billion document universe
is similarly problematic, both because a common standard is
unlikely and because even simple metadata may be inadequate
for providing meaningful retrieval. There does not exist an
architecture that will support effective and widespread networked
resource discovery in a scalable manner.
The development of a scalable, networked retrieval architecture
remains a significant area of research and development. There is
some expectation that standards developments such as XML/RDF will
be able to address some, certainly not all, of the requirements for
enhanced search and retrieval in a networked environment.
The metadata community is moving away from a "catalog everything"
approach to a domain-specific metadata approach. This is an appropriate,
and much needed, shift in perspective, since it is widely held that
generalized Internet search engines will probably remain largely
ineffective at retrieval even if metadata were
widely adopted.
If metadata is going to work, we must give up on Alta Vista, Excite,
Yahoo, etc. and develop search engines that serve our own needs (which
include identifying, acquiring, selecting, describing, arranging, storing,
and retrieving). Abandoning the "Star Trek" view of information retrieval
("Computer: What is the gestation period of a blue Tribble?") is a much
needed development. It can also be argued that this means abandoning
or redirecting the "single window" metaphor that has become prevalent
in various communities.
A sound approach to the problems of networked resource discovery
is to recognize that information retrieval is an iterative process,
and that metadata can be used most effectively when it is given a
particular context for its use. For example, we do not go to a
library OPAC to find local pizza restaurants. Local resource
description communities must work to develop effective navigation
and search services that make their own information resources more
accessible through the development of domain-specific, tightly scoped,
niche indexes and retrieval tools that support identifiable user
communities. The strategy is to think local, act local.
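
As an illustration of what a tightly scoped, metadata-enabled retrieval
tool might look like at its very simplest, the following sketch runs a
fielded search over a small, locally managed collection. It assumes
Python. The Record fields, the search function and the sample records
are all hypothetical and chosen only for illustration. A real service
would sit on a proper index and a community-agreed element set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One locally catalogued resource (hypothetical element set)."""
    title: str
    creator: str
    subjects: tuple
    url: str


def search(records, **criteria):
    """Return records in which every named field contains its term.

    Matching is a case-insensitive substring test, the simplest
    possible stand-in for a real fielded query.
    """
    def field_text(record, field):
        value = getattr(record, field)
        return " ".join(value) if isinstance(value, tuple) else value

    results = []
    for record in records:
        if all(term.lower() in field_text(record, field).lower()
               for field, term in criteria.items()):
            results.append(record)
    return results


# Invented sample collection maintained by one community.
COLLECTION = [
    Record("Wetland Bird Survey", "Example Museum",
           ("birds", "wetlands"), "http://example.org/1"),
    Record("Birds of the Prairies", "Example Library",
           ("birds", "prairies"), "http://example.org/2"),
    Record("Prairie Soil Maps", "Example Library",
           ("soil", "prairies"), "http://example.org/3"),
]

# A structured query: creator AND subject, not a bag of keywords.
hits = search(COLLECTION, creator="library", subjects="birds")
for record in hits:
    print(record.title, record.url)
```

Because the collection is small and consistently described, a query on
creator and subject returns a precise answer. This is the kind of result
that is hard to get from keyword matching over harvested full text.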
RECOMMENDATION
As a guidance document, the following recommendation may
be considered as an appropriate response given the current
Internet search engine environment:
Organizations should not implement metadata based on the belief that
undertaking this will somehow enhance retrieval through Internet
search engines at this time.
An organization may choose to implement a variety of different
metadata-enabled retrieval services and functions, but these
should be addressed each on their own merits and not with the
assumption that they will increase visibility within the Internet
search engines. Metadata implementations should have a localized
context for their use and development.
The answer to the question of how to make an organization's website
or information resources more visible or accessible
is not an issue of metadata. Improvements in making
local web resources more usable and navigable can help
considerably. But in the final analysis, it is organizational
commitment, effective marketing and promotion,
regular communications with user communities, and the provision
of compelling contents and services, which will be the key
determinants of success in making information accessible.