[WEB4LIB] web characterization program - from CNI

Richard Wiggins wiggins at mail.com
Wed Oct 25 19:36:29 EDT 2000


The BrightPlanet paper is worth a read.  They have some thought-provoking analysis, but they make a LOT of assumptions.  I find many of their conclusions unpersuasive.   I suggest reading the paper -- but with a skeptical eye.

At my own institution, we learned firsthand that it can be very misleading to extrapolate from what search engines find.  We run a local copy of AltaVista, and it reports on the order of 4 million indexed URLs.  The global AltaVista no doubt has only a fraction of that many msu.edu pages in its index.

So far, so good.  This meshes with the "shallow vs. deep Web" analysis.  Global AltaVista is a shallow view of our domain.  But when Google built an index of msu.edu, they counted only about 300,000 URLs.  I talked to them and they explained that they deliberately do NOT index CGIs.

We built a test AltaVista index that doesn't index CGIs.  URL count?  About 300,000.  

So the big difference -- a factor of 10 or more! -- is the dynamic content.  And it turns out a lot of dynamic content is bogus -- virtual pages that correspond to days with zero listed events far in the future on a virtual calendar, for instance.
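
To make the comparison concrete: roughly 4,000,000 URLs with CGIs versus about 300,000 without is a ratio of about 13 to 1. Below is a minimal Python sketch of how one might split a crawl's URL list into "dynamic" and "static" and compute that ratio. The markers and extensions are illustrative assumptions on my part, not the rules Google or AltaVista actually apply, and the example.edu URLs are made up.

```python
from urllib.parse import urlsplit

# Hypothetical heuristics for spotting CGI-style (dynamic) URLs.
# These are assumptions for illustration, not any search engine's real rules.
DYNAMIC_PATH_MARKERS = ("/cgi-bin/", "/cgi/")
DYNAMIC_EXTENSIONS = (".cgi", ".pl", ".asp", ".php")


def is_dynamic(url):
    """Guess whether a URL points at generated content rather than a static page."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.query:  # anything after "?" suggests a script with parameters
        return True
    if any(marker in path for marker in DYNAMIC_PATH_MARKERS):
        return True
    return path.endswith(DYNAMIC_EXTENSIONS)


def summarize(urls):
    """Count static and dynamic URLs and report total/static as the inflation ratio."""
    total = 0
    dynamic = 0
    for url in urls:
        total += 1
        if is_dynamic(url):
            dynamic += 1
    static = total - dynamic
    ratio = total / static if static else None
    return {"total": total, "static": static, "dynamic": dynamic, "ratio": ratio}


if __name__ == "__main__":
    # Made-up sample: one static page plus calendar pages generated per day.
    sample = [
        "http://www.example.edu/index.html",
        "http://www.example.edu/cgi-bin/calendar?date=2031-01-01",
        "http://www.example.edu/cgi-bin/calendar?date=2031-01-02",
        "http://www.example.edu/events.pl",
    ]
    print(summarize(sample))
```

A count like this says nothing about whether the dynamic pages hold real content; an empty calendar day is tallied the same as a page with events on it.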

So when BrightPlanet says the deep Web is 500 times bigger than the surface Web, besides questioning the methodology, I also wonder how much of that deep content is meaningful, and how much is sparse nothingness representing virtual pages in databases.

Note also that Google reports over a billion URLs and growing on the global ("surface") Web, while AltaVista counts a few hundred million.  The difference might be due to handling of duplicates and/or effectiveness of crawling.

The BrightPlanet paper also makes some assertions about the relevance and quality of "deep Web" content that I just plain don't buy.

/rich

PS -- some of this is discussed in a piece I wrote for Library Journal's quarterly, which should be out soon.


------Original Message------

From: "GraceAnne A. DeCandido" <ladyhawk at well.com>
To: Multiple recipients of list <web4lib at webjunction.org>
Sent: October 25, 2000 12:24:26 PM GMT
Subject: [WEB4LIB] web characterization program - from CNI


I am forwarding these two URLs about the size of the web 
and the deep web from the CNI list, fyi. Both articles are 
fascinating even in summary form.
GraceAnne DeCandido

------- Forwarded message follows -------
Date sent:      	Wed, 18 Oct 2000 09:57:02 -0700
From:           	Clifford Lynch <cliff at cni.org>
To:             	Multiple recipients of list <cni-announce at cni.org>
Subject:        	OCLC web characterization program latest

The latest results from the OCLC Office of Research Web
Characterization project have just been released.
Interesting data. You can find a summary at:

http://www.oclc.org/oclc/press/20001016a.htm

For those interested in web characterization, there is also
an interesting report on what they call "the deep web" (ie
databases etc which are accessible via web interfaces;
others call this "dark matter" or "the invisible web") which
was released in July by a company called BrightPlanet. If
you haven't seen this it's at

http://www.completeplanet.com/Tutorials/DeepWeb/index.asp


Clifford Lynch
Director, CNI
Richard Wiggins
Consulting, Writing & Training on Internet Topics
www.netfact.com/rww         wiggins at mail.com
517-349-6919 (home office)  517-353-4955 (work)