Answer: On Mean time of Survival of URLs

Terry Kuny Terry.Kuny at xist.com
Tue Jul 29 12:29:08 EDT 1997


Hello all,

Brewster Kahle, in his article in Scientific American 
(March 1997), cites an estimate that puts the average 
lifetime of a URL at 44 days. 

URL: http://www.sciam.com/0397issue/0397kahle.html

Although the source is not given, the figure comes from
a paper done as part of the HARVEST networked resource
discovery research on caching.

Anawat Chankhunthod, Peter B. Danzig, Chuck Neerdaels, 
Michael F. Schwartz and Kurt J. Worrell. A Hierarchical
Internet Object Cache. Technical Report 95-611, Computer 
Science Department, University of Southern California, Los
Angeles, California, March 1995. Also, Technical Report 
CU-CS-766-95, Department of Computer Science, University of
Colorado, Boulder, Colorado. 

URL:
http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/papers.html

The research sampled the modification times of 4600 HTTP objects 
distributed across 2000 sites over a three month period. I am not sure, but
this may be somewhat different from Brewster Kahle's "lifetime
of a URL", since what the research seems to measure is not whether the
object "disappeared" or the URL was broken, but
rather whether the object was "modified" (of which broken or removed
are only two kinds of modification).
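
To make the distinction concrete, here is a rough sketch of how a
"mean lifetime" could be computed from sampled modification times.
This is only my reading of what such a measurement involves, not the
method used in the Harvest paper. The function name, the URLs and
the dates are all made up for illustration.

```python
from datetime import datetime

def mean_lifetime_days(mod_times):
    """Mean gap, in days, between successive modification times of one object."""
    times = sorted(mod_times)
    if len(times) < 2:
        return None  # no change observed, so no lifetime can be estimated
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical observations: modification times seen for two objects.
samples = {
    "http://example.org/index.html": [
        datetime(1995, 1, 1), datetime(1995, 2, 10), datetime(1995, 3, 22),
    ],
    "http://example.org/logo.gif": [
        datetime(1995, 1, 1), datetime(1995, 4, 11),
    ],
}

per_object = {url: mean_lifetime_days(t) for url, t in samples.items()}

# Average over the objects for which at least one change was observed.
known = [days for days in per_object.values() if days is not None]
overall = sum(known) / len(known)

print(per_object)
print(overall)
```

Note that nothing in this calculation says whether a URL still
resolves. It only measures how often the object behind it changed.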

The results, as reported in the paper, are:

- the mean lifetime of all objects was 44 days
- HTML text objects were 75 days
- images were 107 days
- objects of unknown types averaged 27 days

- 28% of the objects were updated at least every 10 days
- 1% of the objects were dynamically updated


I also recall that Michael Schwartz recently commented
that this estimate is probably too high now and 
that the mean lifetime has likely decreased.

If the number seems low, think about the number of URLs
that point to pages in large databases (read:
newspapers, CNN, Time-Warner, etc.), the number of 
sites "under construction" at any one time, and the 
amount of change that goes on within a site. Some
URLs may appear to be persistent but their contents
most certainly are not. And even if top-level pages at
lots of sites do not change much, my own purely 
informal survey suggests there is lots of churn below
the surface. 

When this result came out and I talked about it
with friends, there was some skepticism, as some
people thought the figure too low. I would suggest
that if the Web looks more stable to us than
this, it is because we are attracted
to places where stability and predictability
are more common. This would explain why our own view
of the Web makes the environment look less volatile
than the 44-day figure implies.

It would be interesting to see if anyone has redone this
research and what the results are. It would also be
good to see if the results can be refined somewhat,
i.e. how many URLs remained intact but had their contents
changed, how many URLs were broken and could not be
relocated, how many had no redirects, etc. 
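
As a rough illustration of the kind of refinement I mean, a
follow-up study could record two visits to each URL and sort the
outcome into categories. The sketch below is only one possible
scheme. The Observation record, the classify function and the
category names are hypothetical, and a real survey would need to
deal with timeouts, dynamically generated pages and the like.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    status: int                         # HTTP status code seen on the visit
    content_hash: Optional[str] = None  # digest of the body, if one was returned
    location: Optional[str] = None      # redirect target, if one was given

def classify(first: Observation, later: Observation) -> str:
    """Compare an earlier and a later visit to the same URL."""
    if 300 <= later.status < 400:
        # A redirect with no usable target cannot be followed.
        return "redirected" if later.location else "broken"
    if later.status >= 400:
        return "broken"
    if later.content_hash != first.content_hash:
        return "changed"      # URL intact, contents different
    return "unchanged"        # URL intact, contents the same

first = Observation(status=200, content_hash="aaa")
print(classify(first, Observation(status=200, content_hash="aaa")))
print(classify(first, Observation(status=200, content_hash="bbb")))
print(classify(first, Observation(status=301, location="http://example.org/new")))
print(classify(first, Observation(status=404)))
```

Counting URLs per category would separate "the object was modified"
from "the URL no longer works", which the single 44-day figure
lumps together.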

I don't know if this research has been done yet. It was
talked about at the Toronto IETF a couple of years ago
in a quality BOF, but nothing seems to have come of it.
Perhaps someone out there has information they 
can share about this.

-terry



---------------------------------------------------------------

Mr. Terry Kuny                  Home Office: 819-776-6602   
XIST Inc./                      Email: terry.kuny at xist.com
Global Village Research         URL:   http://xist.com/kuny/

Snail: Box 1141, St. B, Hull, Quebec, Canada

---------------------------------------------------------------

