[Web4lib] Another Google question

Wed Jul 6 12:16:42 EDT 2005

And that one thing that Google does well is, in most cases, put SOMETHING 
apparently useful at the top of the hit list. They are not trying to be 
comprehensive, or exhaustive. That's why Google has a market capitalization 
of $85 billion right now -- several times more than GM. There's something 
useful at the top of the hit list. The percentage of people who crawl through 
hundreds of hit list entries is very, very low. 
 There are many things that Google does at the tail end of the hit list that 
don't make sense to trained searchers -- folks looking for obscure facts at 
the tail end of the Zipf curve.
 And PageRank can be just plain dumb at times.
 My wife's cat had to be treated for a thyroid condition a couple of years 
ago. The treatment is radiation. The cat becomes radioactive. You can't 
spend time close to it until the radioactivity has decayed, and if you dispose of 
the litter improperly, landfill radiation alarms go off.
 I posted a blog entry on that on Blogger, one correcting a self-styled 
expert in human thyroid disease. For a while, a search for "radioactive cat" 
showed the posting high on Google's hit list. Once the article fell into the 
archives, the article fell off Google's hit list.
 Recently I posted a rant about how bad ABC television's overnight news is. 
So if you search for "insomniac ABC" my silly throwaway comment is in the 
first couple dozen hits. It will vanish in a few weeks as the blog rotates. 
 The flaw here is that Google ranks Blogger postings based on links to the 
home page of the blog. Once an article goes into the archive, the home page 
URL doesn't point to it, so the article's PageRank tanks, unless there are 
numerous links to its permalink. 
 In many cases the links to new Blogger postings get a PageRank that's too 
high, and the links to postings in the archives get a PageRank of zero. 
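
 A rough sketch of the effect, for anyone who wants to see it in numbers. This 
is the textbook PageRank iteration, not whatever Google actually runs, and the 
page names, link graphs, and damping value are all made up for illustration. 
In this simplified model the archived posting drops to the minimum possible 
score rather than to exactly zero.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outlinks.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its score over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: two outside sites link only to the blog home page,
# and the home page links to the current posting.
links_before = {
    "fan_a": ["blog_home"],
    "fan_b": ["blog_home"],
    "blog_home": ["post"],
    "post": ["blog_home"],
}

# Same graph after the blog rotates: the home page now links to a newer
# posting, and nothing links to the old posting's permalink.
links_after = {
    "fan_a": ["blog_home"],
    "fan_b": ["blog_home"],
    "blog_home": ["newer_post"],
    "newer_post": ["blog_home"],
    "post": ["blog_home"],
}

if __name__ == "__main__":
    for label, graph in (("before", links_before), ("after", links_after)):
        ranks = pagerank(graph)
        print(label, {page: round(score, 3) for page, score in ranks.items()})
```

 While the home page points at the posting, the posting inherits most of the 
home page's score. Once the home page stops pointing at it, the posting scores 
no better than a page nobody links to, even though its content has not changed.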
 Now Google OWNS Blogger and could easily address this. But the same 
phenomenon plays out elsewhere; the home page of The New York Times has a 
PageRank that is very high, but the rank of an article that's got huge value 
to a researcher may be very low due to a lack of direct links to that 
article's specific URL.
 /rich
 PS -- 
http://wigblog.blogspot.com/2003/11/our-cat-is-radioactive-self-styled.html 
 On 7/6/05, Roy Tennant <roy.tennant at ucop.edu> wrote: 
> 
> Lars' question and Patricia's answer overlook the fact that Google
> is making a huge assumption about user needs, and creating a system
> that fulfills that assumption but provides no mechanisms for the user
> to change those assumptions. Allow me to be specific.
> 
> Sometimes I want to find, for example, brand new web pages -- pages
> that are so new I'm not even sure if Google has crawled them yet. But
> based on the PageRank algorithm as I understand it, these pages would
> naturally fall to the bottom of the search results. Does Google
> provide any method to reverse-sort the results? No. Does Google
> provide a mechanism to view results based on date added to the index?
> No. Does Google provide a mechanism to sort results based on the last
> change date of the page itself? No. So what are we left with? Trying
> to get to the "end" of the search results, wherever that may be.
> Sorry, but that's bad interface design. The fact that you can't,
> apparently, even do it given the system's own mechanisms is flat out
> indefensible. Or, if their numbers are in fact completely wrong and
> there are really only 900 items instead of 15,000 then I guess
> they're just lying to us.
> 
> Google does one thing, and it appears to do that one thing well. But
> let's not make the unfortunate assumption that it does more than that
> one, very specific, thing.
> Roy
>