[WEB4LIB] RE: Seattletimes.com: Public to taste life without its libraries

gary gprice at gwu.edu
Tue Aug 20 21:31:25 EDT 2002


Nancy: 
Google captures a copy of each page* it finds during its crawl and makes that copy 
available via the Google Cache. If a web site owner doesn't want a page (or pages) 
cached, as is the case with The Washington Post, Google needs to be contacted or the 
proper exclusion (a robots.txt rule or a "noarchive" meta tag) needs to be put in 
place on the server. 
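(A minimal sketch of those opt-out mechanisms, as I understand them; the path below 
is only an illustration. To keep a page in the index but out of the cache, the page 
itself can carry

    <meta name="robots" content="noarchive">

in its <head>. To keep Googlebot away from part of a site entirely, a robots.txt 
file at the server root along the lines of

    User-agent: Googlebot
    Disallow: /selectedsites/

does the job, though that also keeps the pages out of the index altogether.)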

At the moment, Google has approx. 1300 pages from the www.spl.org domain in its 
database. 
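
(For anyone who wants to check that count themselves, a site-restricted query such as

    site:www.spl.org library

is one way to see roughly how many pages Google currently holds from the domain; the 
total it reports fluctuates from crawl to crawl.)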

I browsed through a few pages of results and all had cached versions available. 

So, will SPL ask them to purge the cache? It's a good question. 
  
Another question: if a web searcher were to access the cached copy of the page that 
links to SPL's remotely accessible subscription databases 
(http://216.239.51.100/search?q=cache:9xLUIuY--C8C:www.spl.org/selectedsites/subscriptions.html+&hl=en&ie=UTF-8), 
will these links be disconnected? What about the OPAC?


Finally, Google isn't alone: other search engines cache pages as well. The very new 
Gigablast also caches content. http://www.gigablast.com 
Example of a cached result: 
http://www.gigablast.com/cgi/0.cgi?n=10&ns=2&sd=0&q=%22seattle+public+library%22 


*Google crawls and caches the first 110k of a web page. If a page is longer, 
it's truncated at the 110k mark.  According to Greg Notess, Google truncates most 
PDF files at "about 120k". http://www.searchengineshowdown.com/new.shtml#may18   

cheers, 
gary 




-- 

Gary D. Price, MLIS
Librarian
Gary Price Library Research and Internet Consulting
gary at freepint.com

The Virtual Acquisition Shelf and News Desk 
http://resourceshelf.freepint.com


