[Web4lib] Huge increase of Google database

Richard Wiggins richard.wiggins at gmail.com
Fri May 19 07:32:02 EDT 2006


Google rolls out new versions of its database roughly every month, and those
rollouts can include tweaks to the search algorithms. If you search for "Big
Daddy" you'll find remarkably frank discussion of Big Daddy as a major
change to their crawling and indexing.  See:

http://www.mattcutts.com/blog/

A number of years ago, when AltaVista was king, we used it as our search
engine.  I discovered that it had crawled an events calendar on campus, and
by "clicking" on the drop-down menu for month, day, and year, it had indexed
dates centuries into the future.  Of course, each day's entry had no real
content, but each day got indexed.

Once you count every unique URL that could map to a database entry, all bets
are off.  The "page" metaphor just doesn't apply any more.

Soon afterwards I interviewed the then-chief scientist at AltaVista for a
magazine article, and he said that due to examples such as that, measuring
the size of a Web index would always be problematic, because you could
define the universe of URLs you crawl to be essentially infinite.
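
To make the calendar example concrete, here is a minimal sketch of how a
date-driven page turns into an open-ended set of URLs. The base URL and the
year/month/day query parameters are hypothetical; I don't recall the actual
calendar's URL scheme, so treat this as an illustration rather than a
reconstruction of what AltaVista crawled.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical calendar endpoint; the real campus calendar URL is not known.
BASE_URL = "http://events.example.edu/calendar"


def calendar_urls(start, end):
    """Yield one distinct URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        # Assumed parameter names, mimicking month/day/year drop-downs.
        query = urlencode({"year": day.year, "month": day.month, "day": day.day})
        yield f"{BASE_URL}?{query}"
        day += timedelta(days=1)


def count_calendar_urls(start, end):
    """Count the URLs without generating them: one per calendar day."""
    return (end - start).days + 1


# Three centuries of mostly empty days, each with its own crawlable URL.
first, last = date(2006, 1, 1), date(2305, 12, 31)
print(next(calendar_urls(first, last)))
print(count_calendar_urls(first, last))
```

A single script yields more than 100,000 distinct URLs over that range, and
nothing but the end date you pick limits the count, which is why a raw URL
tally says little about how much usable content an index holds.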

I think that's why Google decided to stop playing the game of "who has the
bigger index?"  Of course, it is possible that "Big Daddy" represents more
complete coverage of usable content.

/rich

On 5/19/06, Isidro F. Aguillo <isidro at cindoc.csic.es> wrote:
>
> Google no longer gives any hint on its homepage about the size of its
> database. I read that even when they were publishing a figure of just
> over 8 billion, the number of webpages covered had already passed the
> 11 billion mark. In recent months they appear to have changed their
> databases (something called Big Daddy??) and result counts are now
> significantly higher than before. I have noticed an increase in both
> "dead links" and "invisible web" (records from databases) results.
>
> Previous estimates put the global size of the Web at around 20-24
> billion webpages, but after running some experiments, our best estimate
> for the current size of Google's web database alone is close to 40
> billion. Even allowing for "invisible" records, I wonder whether Google
> is really covering more than half or two thirds of the Web.
>
> I would like to hear about any recent commentary on the coverage ratio
> of Google and other search engines, and any references on geographical
> or language biases in the engines' coverage.
>
> Thanks in advance,
>
> --
> ***************************************
> Isidro F. Aguillo
> isidro at cindoc.csic.es
> Ph:(+34) 91-5635482 ext. 313
>
> InternetLab. CINDOC-CSIC
> Joaquin Costa, 22
> 28002 Madrid. SPAIN
>
> http://www.webometrics.info
> http://www.cindoc.csic.es/cybermetrics
> http://internetlab.cindoc.csic.es
> ****************************************
>
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/
>

