[Web4lib] How do news aggregators know that a user has already read an item

Jonathan Gorman jtgorman at uiuc.edu
Thu Mar 15 09:32:28 EST 2007


> This raises the question how do News Aggregators achieve the
> effect of not displaying previously read items?

I would imagine that it varies by aggregator and feed.  Could you perhaps post a link to the feed in question?  First of all, RSS 2.0 does allow each item to have a unique id (guid) and a publication date (pubDate).  Those two should be included if at all possible.  The publication date would be when the item was first posted.  From your description these shouldn't be difficult to add at all.

If I were developing an aggregator, I'd probably go by these two elements in the item.  However, since they're optional, I'd probably also keep enough information about the individual items to know whether they've been seen already or not.  The extreme would be an MD5 hash of each post.  Then, when you get new items, compare their hashes to the ones you already have.  This would likely be slow though.  I've seen aggregators make mistakes or repost things, so my suspicion is that there are still flaws and cases where the aggregator is just guessing.
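
To make the idea concrete, here is a minimal sketch in Python of how an aggregator might decide whether it has seen an item.  This is only my guess at one approach, not how any particular aggregator works.  The dictionary keys, the function names and the choice of which fields to hash are all assumptions made for illustration.

```python
import hashlib


def item_key(item):
    """Return a key identifying a feed item (hypothetical helper).

    Assumes `item` is a dict with optional "guid", "title", "link" and
    "description" entries. The guid is preferred; when it is missing we
    fall back to an MD5 hash of the item's content.
    """
    guid = item.get("guid")
    if guid:
        return "guid:" + guid
    parts = [item.get("title", ""), item.get("link", ""), item.get("description", "")]
    digest = hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()
    return "md5:" + digest


def unseen_items(items, seen):
    """Return the items whose keys are not yet in `seen`, and record them."""
    fresh = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

Note the weakness of the hash fallback: if the publisher edits a post that has no guid, the hash changes and the item shows up as new again, which may explain some of the reposts I've seen.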

Of course, you also need to make sure to set your headers correctly.  Most aggregators will send an HTTP request header (typically If-Modified-Since) asking the web server whether the "file" has been modified since the last time they accessed it.  If the script generating the feed doesn't set its response headers correctly, the feed may appear never to be updated.  On the flip side, you might always be returning the full results when you don't want to be.
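
As a rough sketch of what the feed script would need to do on its side, here is a Python function that chooses between a full response and a "304 Not Modified".  The function name and its return shape are made up for illustration, and it assumes `last_modified` is a timezone-aware UTC datetime.  A real script would hook this into whatever CGI or framework layer it runs under.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def feed_response(if_modified_since, last_modified, body):
    """Return (status, headers, body) for a feed request (hypothetical helper).

    if_modified_since: value of the request's If-Modified-Since header, or None.
    last_modified: aware UTC datetime of the newest change to the feed.
    body: the serialized feed as bytes.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat as a plain request
    if since is not None and since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)

    if since is not None and last_modified <= since:
        # Nothing new since the aggregator's last visit.
        return 304, headers, b""

    headers["Content-Type"] = "application/rss+xml"
    return 200, headers, body
```

The important part is that Last-Modified reflects when the feed content really changed.  If the script stamps every response with the current time, the aggregator will be told the feed is new on every poll.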

In addition, a lot of RSS feeds I've seen expand on RSS by using the syndication module elements, usually written with the sy prefix (like sy:updatePeriod).  I can't seem to find the spec offhand.
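
For what it's worth, here is a small Python sketch of how an aggregator might read those elements to decide how often to poll.  It assumes the elements come from the RSS 1.0 Syndication module, whose namespace is conventionally bound to the sy prefix, and that an omitted updateFrequency means once per period.  The function name is hypothetical, and the month and year lengths are approximations.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS 1.0 Syndication module.
SY = "http://purl.org/rss/1.0/modules/syndication/"

# Approximate period lengths in seconds (month = 30 days, year = 365 days).
PERIOD_SECONDS = {
    "hourly": 3600,
    "daily": 86400,
    "weekly": 604800,
    "monthly": 2592000,
    "yearly": 31536000,
}


def suggested_poll_interval(feed_xml):
    """Return the publisher's suggested seconds between polls, or None."""
    root = ET.fromstring(feed_xml)
    period = root.findtext(".//{%s}updatePeriod" % SY)
    if period is None or period.strip().lower() not in PERIOD_SECONDS:
        return None
    frequency_text = root.findtext(".//{%s}updateFrequency" % SY) or "1"
    try:
        frequency = int(frequency_text.strip())
    except ValueError:
        return None
    if frequency <= 0:
        return None
    # updateFrequency is the number of updates expected within one period.
    return PERIOD_SECONDS[period.strip().lower()] // frequency
```

These values are only a hint from the publisher, so an aggregator is free to ignore them.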

Jon Gorman




