[WEB4LIB] web characterization program - from CNI
Richard Wiggins
wiggins at mail.com
Wed Oct 25 19:36:29 EDT 2000
The BrightPlanet paper is worth a read. They have some thought-provoking analysis, but they make a LOT of assumptions. I find many of their conclusions unpersuasive. I suggest reading the paper -- but with a skeptical eye.
At my own institution, we learned firsthand that it can be very misleading to extrapolate from what search engines find. We run a local copy of AltaVista, and it reports on the order of 4 million indexed URLs. The global AltaVista no doubt has only a fraction of this many msu.edu pages in its index.
So far, so good. This meshes with the "shallow vs Deep Web" analysis. Global AltaVista is a shallow view of our domain. But when Google built an index of msu.edu, they counted only about 300,000 URLs. I talked to them and they explained that they deliberately do NOT index CGIs.
We built a test AltaVista index that doesn't index CGIs. URL count? About 300,000.
So the big difference -- a factor of 10 or more! -- is the dynamic content. And it turns out a lot of dynamic content is bogus -- virtual pages that correspond to days with zero listed events far in the future on a virtual calendar, for instance.
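To make the comparison concrete, here is a minimal sketch of the kind of filtering and arithmetic involved. The rule for what counts as "dynamic" (a query string, a cgi-bin directory, or a .cgi suffix) is my assumption, not what Google or AltaVista actually do, and the sample URLs are made up for illustration. Only the 4 million and 300,000 figures come from the counts above, and both are approximate.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Heuristic (an assumption): treat a URL as dynamic if it carries a
    query string, passes through a cgi-bin directory, or ends in .cgi."""
    parts = urlsplit(url)
    path = parts.path.lower()
    return bool(parts.query) or "/cgi-bin/" in path or path.endswith(".cgi")


def count_static_and_dynamic(urls):
    """Return (static_count, dynamic_count) for an iterable of URLs."""
    static = dynamic = 0
    for url in urls:
        if looks_dynamic(url):
            dynamic += 1
        else:
            static += 1
    return static, dynamic


# Hypothetical sample URLs, for illustration only.
sample = [
    "http://www.example.edu/index.html",
    "http://www.example.edu/cgi-bin/calendar?date=2010-01-01",
    "http://www.example.edu/cgi-bin/calendar?date=2010-01-02",
    "http://www.example.edu/events/list.cgi",
]
static, dynamic = count_static_and_dynamic(sample)
print(static, dynamic)

# Approximate index sizes quoted in this message.
full_index = 4_000_000    # local AltaVista, CGIs included
no_cgi_index = 300_000    # test index with CGIs excluded
ratio = full_index / no_cgi_index
print(round(ratio, 1))
```

With those rough counts the ratio comes out a little above 13, which is where "a factor of 10 or more" comes from.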
So when BrightPlanet says the deep Web is 500 times bigger than the surface Web, besides questioning the methodology, I also wonder how much of that deep content is meaningful, and how much is sparse nothingness representing virtual pages in databases.
Note also that Google reports over a billion URLs and growing on the global ("surface") Web, while AltaVista counts a few hundred million. The difference might come from the handling of duplicates and/or the effectiveness of crawling.
The BrightPlanet paper also makes some assertions about the relevance and quality of "deep Web" content that I just plain don't buy.
/rich
PS -- some of this is discussed in a piece I wrote for Library Journal's quarterly, which should be out soon.
------Original Message------
From: "GraceAnne A. DeCandido" <ladyhawk at well.com>
To: Multiple recipients of list <web4lib at webjunction.org>
Sent: October 25, 2000 12:24:26 PM GMT
Subject: [WEB4LIB] web characterization program - from CNI
I am forwarding these two URLs about the size of the web
and the deep web from the CNI list, fyi. Both articles are
fascinating even in summary form.
GraceAnne DeCandido
------- Forwarded message follows -------
Date sent: Wed, 18 Oct 2000 09:57:02 -0700
From: Clifford Lynch <cliff at cni.org>
To: Multiple recipients of list <cni-announce at cni.org>
Subject: OCLC web characterization program latest
The latest results from the OCLC Office of Research Web
Characterization project have just been released.
Interesting data. You can find a summary at:
http://www.oclc.org/oclc/press/20001016a.htm
For those interested in web characterization, there is also
an interesting report on what they call "the deep web" (ie
databases etc which are accessible via web interfaces;
others call this "dark matter" or "the invisible web") which
was released in July by a company called BrightPlanet. If
you haven't seen this it's at
http://www.completeplanet.com/Tutorials/DeepWeb/index.asp
Clifford Lynch
Director, CNI
Richard Wiggins
Consulting, Writing & Training on Internet Topics
www.netfact.com/rww wiggins at mail.com
517-349-6919 (home office) 517-353-4955 (work)