[WEB4LIB] RE: web characterization program - from CNI
Roy Tennant
roy.tennant at ucop.edu
Wed Oct 25 19:57:39 EDT 2000
Rich's piece is out now (and it's good) in NetConnect. Unfortunately
Library Journal has not put it online, but you can view the table of
contents at http://www.libraryjournal.com/netconnect.asp . NetConnect
is a supplement to Library Journal and the School Library Journal, so
anyone who has a subscription to one of those will get it.
Roy
At 4:43 PM -0700 10/25/00, Richard Wiggins wrote:
>The BrightPlanet paper is worth a read. They have some
>thought-provoking analysis, but they make a LOT of assumptions. I
>find many of their conclusions unpersuasive. I suggest reading the
>paper -- but with a skeptical eye.
>
>At my own institution, we learned firsthand that it can be very
>misleading to extrapolate from what search engines find. We run a
>local copy of AltaVista, and it reports on the order of 4 million
>indexed URLs. The global AltaVista no doubt has a fraction of this
>many msu.edu pages in its index.
>
>So far, so good. This meshes with the "shallow vs Deep Web"
>analysis. Global AltaVista is a shallow view of our domain. But
>when Google built an index of msu.edu, they counted only about
>300,000 URLs. I talked to them and they explained that they
>deliberately do NOT index CGIs.
>
>We built a test AltaVista index that doesn't index CGIs. URL count?
>About 300,000.
>
>So the big difference -- a factor of 10 or more! -- is the dynamic
>content. And it turns out a lot of dynamic content is bogus --
>virtual pages that correspond to days with zero listed events far in
>the future on a virtual calendar, for instance.
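
A rough sketch of the comparison described above: count a list of URLs with and without the dynamic ones. The `is_dynamic` heuristic, the suffix list, and the sample URLs are all illustrative assumptions; they are not the rules AltaVista or Google actually applied.

```python
from urllib.parse import urlsplit

# Assumed heuristic (not any search engine's real rule): a URL is "dynamic"
# if it carries a query string, sits under /cgi-bin/, or ends in a script suffix.
DYNAMIC_SUFFIXES = (".cgi", ".pl", ".asp")


def is_dynamic(url: str) -> bool:
    parts = urlsplit(url)
    path = parts.path.lower()
    return bool(parts.query) or "/cgi-bin/" in path or path.endswith(DYNAMIC_SUFFIXES)


def count_urls(urls):
    """Return (total URLs, URLs left after dropping dynamic ones)."""
    total = len(urls)
    static = sum(1 for u in urls if not is_dynamic(u))
    return total, static


# Made-up sample URLs, for illustration only.
sample = [
    "http://www.msu.edu/",
    "http://www.msu.edu/about/index.html",
    "http://www.msu.edu/cgi-bin/calendar?date=2010-01-01",
    "http://www.msu.edu/cgi-bin/calendar?date=2010-01-02",
    "http://www.msu.edu/events/list.asp",
]

total, static = count_urls(sample)
print(f"all URLs: {total}, without dynamic content: {static}")
```

With the figures quoted in the message, about 4 million indexed URLs against about 300,000 non-CGI URLs, the ratio is roughly 13 to 1, consistent with "a factor of 10 or more".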
>
>So when BrightPlanet says the deep Web is 500 times bigger than the
>surface Web, besides questioning the methodology, I also wonder how
>much of that deep content is meaningful, and how much is sparse
>nothingness representing virtual pages in databases.
>
>Note also that Google reports over a billion URLs and growing on the
>global ("surface") Web, and AltaVista counts a few hundred million.
>Difference might be handling of duplicates and/or effectiveness of
>crawling???
>
>The BrightPlanet paper also makes some assertions about the
>relevance and quality of "deep Web" content that I just plain don't
>buy.
>
>/rich
>
>PS -- some of this is discussed in a piece I wrote for Library
>Journal's quarterly, which should be out soon.
>
>
>------Original Message------
>
>From: "GraceAnne A. DeCandido" <ladyhawk at well.com>
>To: Multiple recipients of list <web4lib at webjunction.org>
>Sent: October 25, 2000 12:24:26 PM GMT
>Subject: [WEB4LIB] web characterization program - from CNI
>
>
>I am forwarding these two URLs about the size of the web
>and the deep web from the CNI list, fyi. Both articles are
>fascinating even in summary form.
>GraceAnne DeCandido
>
>------- Forwarded message follows -------
>Date sent: Wed, 18 Oct 2000 09:57:02 -0700
>From: Clifford Lynch <cliff at cni.org>
>To: Multiple recipients of list <cni-announce at cni.org>
>Subject: OCLC web characterization program latest
>
>The latest results from the OCLC Office of Research Web
>Characterization project have just been released.
>Interesting data. You can find a summary at:
>
>http://www.oclc.org/oclc/press/20001016a.htm
>
>For those interested in web characterization, there is also
>an interesting report on what they call "the deep web" (ie
>databases etc which are accessible via web interfaces;
>others call this "dark matter" or "the invisible web") which
>was released in July by a company called BrightPlanet. If
>you haven't seen this it's at
>
>http://www.completeplanet.com/Tutorials/DeepWeb/index.asp
>
>
>Clifford Lynch
>Director, CNI
>Richard Wiggins
>Consulting, Writing & Training on Internet Topics
>www.netfact.com/rww wiggins at mail.com
>517-349-6919 (home office) 517-353-4955 (work)