[Web4lib] How completely are you crawled?
Danielle Plumer
dplumer at tsl.state.tx.us
Wed Jul 16 18:50:14 EDT 2008
Roy,
This is a topic that comes up frequently on the SEO lists and blogs. There's a good discussion at http://seo-theory.com/wordpress/2008/04/10/large-web-site-design-theory-and-crawl-management/
In a 2006 blog post, Google's Matt Cutts answered the question:
Q: "My sitemap has about 1350 urls in it. ... It's been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?"
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn't mean that we'll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That's what I would recommend looking at.
http://www.mattcutts.com/blog/q-a-thread-march-27-2006/
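One of the checks Cutts suggests -- looking for unindexed URLs with "a ton of parameters" -- is easy to automate. Here's a minimal sketch (the URLs and the hard_to_crawl helper are hypothetical, for illustration only) that flags URLs exceeding the 1-2 parameter range he mentions:

```python
from urllib.parse import urlparse, parse_qs

def count_params(url):
    """Number of distinct query-string parameters in a URL."""
    return len(parse_qs(urlparse(url).query))

def hard_to_crawl(urls, max_params=2):
    """Return the URLs whose parameter count exceeds the 1-2
    parameters Cutts says crawlers typically prefer."""
    return [u for u in urls if count_params(u) > max_params]

# Hypothetical database-driven URLs, for illustration:
urls = [
    "http://example.org/record?id=42",
    "http://example.org/search?q=maps&page=3&sort=date&lang=en",
]
print(hard_to_crawl(urls))  # flags only the 4-parameter search URL
```

Anything this flags would be a candidate for rewriting into shorter, more static-looking URLs.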
Last year, there was a lot of concern about Google's Supplemental Index (aka "Google Hell" in the SEO community), which absorbed many of the low-PageRank pages, but Google stopped labeling pages as supplemental back in December. Submitting sitemaps was sometimes associated with a large number of pages in the supplemental index -- in fact, there were cases where webmasters claimed that their pages were demoted to the supplemental index after submitting a sitemap.
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplumer at tsl.state.tx.us
-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org]On Behalf Of Roy Tennant
Sent: Wednesday, July 16, 2008 11:26 AM
To: web4lib
Subject: [Web4lib] How completely are you crawled?
In an email conversation with Debbie Campbell of the National Library of
Australia, it came out that although we both use Google Sitemaps[1] to
expose content to crawling that is behind a database wall (often termed the
"deep web"), neither of our sites was anywhere close to fully crawled by
Google despite this effort. Debbie reported something around 54% coverage of
Picture Australia's 1.5 million items while my coverage appeared to be about
37% of my 2,250 items. So collection size does not appear to determine the
percentage of coverage.
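Those coverage percentages are the ratio of indexed URLs to the URLs a sitemap submits, and the denominator is easy to get by counting the <loc> entries in the sitemap file itself. A minimal sketch (the sample sitemap and URLs are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_sitemap_urls(xml_text):
    """Return the number of <url><loc> entries a sitemap submits --
    the denominator of the indexed/submitted coverage figure."""
    root = ET.fromstring(xml_text)
    return len(root.findall("sm:url/sm:loc", SITEMAP_NS))

# Tiny illustrative sitemap, not a real site's file:
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/record/1</loc></url>
  <url><loc>http://example.org/record/2</loc></url>
</urlset>"""

submitted = count_sitemap_urls(sample)
print(submitted)  # 2
```

Dividing the indexed count Webmaster Tools reports by this number reproduces the percentages above (e.g. 833 of 2,250 is about 37%).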
This made me wonder what others have been experiencing regarding crawling
coverage of their database-driven sites even when providing a Google
Sitemap. Can anyone else report their crawling statistics? If you've
registered your sitemap at Google Webmaster Tools[2], you can find the
appropriate statistic by selecting "Sitemap" from the menu on the left, then
clicking on the "Details" link beside the appropriate sitemap. Thanks,
Roy
[1] http://tinyurl.com/224cuu
[2] https://www.google.com/webmasters/tools/
_______________________________________________
Web4lib mailing list
Web4lib at webjunction.org
http://lists.webjunction.org/web4lib/
More information about the Web4lib mailing list