[WEB4LIB] RE: web characterization program - from CNI

Roy Tennant roy.tennant at ucop.edu
Wed Oct 25 19:57:39 EDT 2000


Rich's piece is out now (and it's good) in NetConnect. Unfortunately 
Library Journal has not put it online, but you can view the table of 
contents at http://www.libraryjournal.com/netconnect.asp . NetConnect 
is a supplement to Library Journal and the School Library Journal, so 
anyone who has a subscription to one of those will get it.
Roy


At 4:43 PM -0700 10/25/00, Richard Wiggins wrote:
>The BrightPlanet paper is worth a read.  They have some 
>thought-provoking analysis, but they make a LOT of assumptions.  I 
>find many of their conclusions unpersuasive.   I suggest reading the 
>paper -- but with a skeptical eye.
>
>At my own institution, we learned first hand that it can be very 
>misleading to extrapolate from what search engines find.  We run a 
>local copy of AltaVista, and it reports on the order of 4 million 
>indexed URLs.  The global AltaVista no doubt indexes only a fraction 
>of that many msu.edu pages.
>
>So far, so good.  This meshes with the "shallow vs Deep Web" 
>analysis.  Global AltaVista is a shallow view of our domain. But 
>when Google built an index of msu.edu, they counted only about 
>300,000 URLs.  I talked to them and they explained that they 
>deliberately do NOT index CGIs.
>
>We built a test AltaVista index that doesn't index CGIs.  URL count? 
>About 300,000. 
>
>So the big difference -- a factor of 10 or more! -- is the dynamic 
>content.  And it turns out a lot of dynamic content is bogus -- 
>virtual pages that correspond to days with zero listed events far in 
>the future on a virtual calendar, for instance.
>
>So when BrightPlanet says the deep Web is 500 times bigger than the 
>surface Web, besides questioning the methodology, I also wonder how 
>much of that deep content is meaningful, and how much is sparse 
>nothingness representing virtual pages in databases.
>
>Note also that Google reports over a billion URLs and growing on the 
>global ("surface") Web, while AltaVista counts a few hundred million. 
>The difference might be due to handling of duplicates and/or 
>effectiveness of crawling.
>
>The BrightPlanet paper also makes some assertions about the 
>relevance and quality of "deep Web" content that I just plain don't 
>buy.
>
>/rich
>
>PS -- some of this is discussed in a piece I wrote for Library 
>Journal's quarterly, should be out soon.
>
>
>------Original Message------
>
>From: "GraceAnne A. DeCandido" <ladyhawk at well.com>
>To: Multiple recipients of list <web4lib at webjunction.org>
>Sent: October 25, 2000 12:24:26 PM GMT
>Subject: [WEB4LIB] web characterization program - from CNI
>
>
>I am forwarding these two URLs about the size of the web
>and the deep web from the CNI list, fyi. Both articles are
>fascinating even in summary form.
>GraceAnne DeCandido
>
>------- Forwarded message follows -------
>Date sent:     	Wed, 18 Oct 2000 09:57:02 -0700
>From:          	Clifford Lynch <cliff at cni.org>
>To:            	Multiple recipients of list <cni-announce at cni.org>
>Subject:       	OCLC web characterization program latest
>
>The latest results from the OCLC Office of Research Web
>Characterization project have just been released.
>Interesting data. You can find a summary at:
>
>http://www.oclc.org/oclc/press/20001016a.htm
>
>For those interested in web characterization, there is also
>an interesting report on what they call "the deep web" (ie
>databases etc which are accessible via web interfaces;
>others call this "dark matter" or "the invisible web") which
>was released in July by a company called BrightPlanet. If
>you haven't seen this it's at
>
>http://www.completeplanet.com/Tutorials/DeepWeb/index.asp
>
>
>Clifford Lynch
>Director, CNI
>Richard Wiggins
>Consulting, Writing & Training on Internet Topics
>www.netfact.com/rww         wiggins at mail.com
>517-349-6919 (home office)  517-353-4955 (work)
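
The msu.edu comparison quoted above (about 4 million URLs with CGIs indexed, about 300,000 without) can be sketched roughly as follows. This is only an illustration: the `is_dynamic` heuristic, the suffix list, and the sample URLs are all assumptions made up for the example, not how Google or AltaVista actually decide what to index.

```python
from urllib.parse import urlsplit

# Hypothetical heuristic: suffixes that usually mean a page is generated
# on request. The list is an assumption, not any search engine's rule.
DYNAMIC_SUFFIXES = (".cgi", ".pl", ".asp")


def is_dynamic(url: str) -> bool:
    """Guess whether a URL points at dynamically generated content."""
    parts = urlsplit(url)
    path = parts.path.lower()
    return (
        bool(parts.query)              # has a ?query string
        or "/cgi-bin/" in path         # lives under a CGI directory
        or path.endswith(DYNAMIC_SUFFIXES)
    )


def summarize(urls):
    """Return (total URL count, count of URLs that look static)."""
    total = len(urls)
    static = sum(1 for u in urls if not is_dynamic(u))
    return total, static


# Invented sample URLs, including "empty calendar day" style pages.
sample = [
    "http://www.msu.edu/index.html",
    "http://www.msu.edu/cgi-bin/calendar?day=2031-01-01",
    "http://www.msu.edu/cgi-bin/calendar?day=2031-01-02",
    "http://www.msu.edu/events/list.asp",
]
total, static = summarize(sample)
print(total, static)  # 4 1

# The approximate figures from the message: all URLs vs. non-CGI URLs.
ratio = 4_000_000 / 300_000
print(round(ratio, 1))  # 13.3, i.e. "a factor of 10 or more"
```

A count like this says nothing about whether the dynamic pages are meaningful; it only shows how much of a raw URL total can come from generated pages.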


