markmorgan
Posts: 38
Joined: Fri Sep 16, 2005 8:50 am
Location: UK

Item revisions

Post by markmorgan » Wed Apr 26, 2006 1:27 pm

This is a two-part problem really. Firstly, every time Awasu refreshes the results of a Google News Search RSS feed it identifies every single item as updated. Secondly, how do you switch off the item revisions?

Example Google feed URL for Tutankhamun news search
http://news.google.co.uk/news?svnum=10&as_scoring=d&num=100&hl=en&lr=&tab=wn&ie=UTF-8&q=egypt+tutankhamun+OR+tutankhamen+OR+tutanchamun+OR+tutankhamon+OR+toutanchamon+OR+toutankhamon+OR+%22king+tut%22&output=rss

This means that Awasu is storing hundreds of redundant item revisions in the database for these feeds, as they all look pretty much the same. The only difference I can spot is that the news stories are marked '1 hour ago', '2 hours ago' etc. as they get older; once they hit 24 hours they start getting dates, which then obviously stay the same. So with a 2-hour update frequency and a 10-hour blackout window I shouldn't get more than around 7 revisions. Yet looking at a news article from 05/04/2006, it has 95! Hmm, if my archive time is set to 2 weeks, how come there are items going back to early March in the feed?

Is there any way of switching off the revision storing and only store the latest? The 'disable item revisions' option on the properties page implies that if you switch it off then every time the feed is read it will treat every item as new... Not the desired result. I want only the latest revision. BTW are the revisions stored as deltas or full feed items?

The other obvious side effect is that items marked as read keep getting marked as unread, as they are classed as updated. Maybe a global or feed-level option should control this behaviour. For blog posts you would want to see updates to posts you have read; for site news feeds you probably don't.

Is there any way of lowering the updated item sensitivity? Using some sort of percentage change algorithm? Is there any way to spot this common text and filter it out? I don't recall this problem in RssReader and SharpReader that I used before (they were memory hogs which is why I moved to Awasu).
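
To illustrate the two ideas above (filtering out the common "N hours ago" text and applying a percentage-change threshold), here is a rough sketch in Python. It is not how Awasu works; the pattern, the function names and the 0.95 threshold are all assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical pattern for relative-age text such as "1 hour ago" or "2 hours ago".
AGE_PATTERN = re.compile(r"\b\d+\s+(?:minute|hour|day)s?\s+ago\b", re.IGNORECASE)


def strip_volatile_text(text: str) -> str:
    """Remove the relative-age text that changes on every refresh."""
    return AGE_PATTERN.sub("", text)


def is_significant_change(old: str, new: str, threshold: float = 0.95) -> bool:
    """Report a change only if the filtered texts are less similar than the threshold.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    old_clean = strip_volatile_text(old)
    new_clean = strip_volatile_text(new)
    similarity = difflib.SequenceMatcher(None, old_clean, new_clean).ratio()
    return similarity < threshold
```

With this sketch, two descriptions that differ only in their "1 hour ago" / "2 hours ago" text would not count as an update, while a genuinely rewritten description would.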

This may account for the degradation in my Awasu performance, as the user directory is now 268MB even though I have reduced the number of feeds and cut the archive time from 1 month to 2 weeks.

Mark Morgan

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Re: Item revisions

Post by support » Wed Apr 26, 2006 5:12 pm

markmorgan wrote:Firstly every time Awasu refreshes the results of a Google News Search RSS feed it identifies every single item as updated.

Yes, if they're changing the text from "1 hour ago" to "2 hours ago" and so on, Awasu will flag each one as a new revision. In fact, Awasu can be configured to determine if an item has been revised in two ways: (1) if *anything* in the feed item's XML has changed, including stuff that you don't see or that isn't even part of RSS (e.g. an extension), or (2) if only key things like the title, description, URL, etc. have changed.

However, this isn't going to help you in this particular case since Google is changing the description which will always trigger a revision.
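
The difference between the two modes can be sketched roughly as follows. This is only an illustration of the idea, not Awasu's actual implementation; the function names and the choice of key fields are assumptions.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed set of "key" fields; the real list used by Awasu is not given in this thread.
KEY_FIELDS = ("title", "description", "link")


def full_fingerprint(item_xml: str) -> str:
    """Mode 1: any change anywhere in the item's XML changes the fingerprint."""
    return hashlib.sha256(item_xml.encode("utf-8")).hexdigest()


def key_fingerprint(item_xml: str) -> str:
    """Mode 2: only changes to the key fields change the fingerprint."""
    item = ET.fromstring(item_xml)
    parts = [(item.findtext(name) or "").strip() for name in KEY_FIELDS]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

An item whose extension element changes would be flagged in mode 1 but not in mode 2. An item whose description changes from "1 hour ago" to "2 hours ago" is flagged in both, which is why neither setting helps with the Google News feed.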

markmorgan wrote:So with a 2-hour update frequency and a 10-hour blackout window I shouldn't get more than around 7 revisions. Yet looking at a news article from 05/04/2006, it has 95!

Actually, 95 is about right: you'll get around 7 revisions per day but with 14 days archived, 14 * 7 = 98
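
Spelled out as a quick calculation (a sketch that assumes the item is flagged as changed on every single refresh, and that the blackout window simply removes those hours from the day; the function name is made up for the example):

```python
def expected_revisions(update_hours: int, blackout_hours: int, archive_days: int) -> int:
    """Estimate how many revisions accumulate if every refresh creates one."""
    active_hours = 24 - blackout_hours          # hours per day in which updates run
    per_day = active_hours // update_hours      # refreshes (and so revisions) per day
    return per_day * archive_days


# 2-hour updates, 10-hour blackout, 14 days archived: 7 per day * 14 days
print(expected_revisions(2, 10, 14))  # 98
```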

markmorgan wrote:Is there any way of switching off the revision storing and only store the latest?

This is a good idea. I'll mull over it...

markmorgan wrote:The 'disable item revisions' option on the properties page implies that if you switch it off then every time the feed is read it will treat every item as new... Not the desired result. I want only the latest revision.

Turning this off causes Awasu to show each revision as a separate entry in the item pane. If it's on, they get collapsed down into a single entry with the "has-revisions" indicator.

markmorgan wrote:BTW are the revisions stored as deltas or full feed items?

Full items. However, everything is compressed so it's probably not going to be too bad.

markmorgan wrote:This may account for the degradation in my Awasu performance

I've got some strong suspicions that it's something else. I'll be doing some work on it for 2.2.2.


Post by markmorgan » Wed Apr 26, 2006 9:24 pm

markmorgan wrote:
So with a 2-hour update frequency and a 10-hour blackout window I shouldn't get more than around 7 revisions. Yet looking at a news article from 05/04/2006, it has 95!


support wrote:
Actually, 95 is about right: you'll get around 7 revisions per day but with 14 days archived, 14 * 7 = 98


Yeah, but after a news item is 24 hours old the description text switches to dates, so something else must be changing in the feed items for them to be marked as changed every day after the first day. I've sent a mail to Google about it as well.

markmorgan wrote:
This may account for the degradation in my Awasu performance


support wrote:
I've got some strong suspicions that it's something else. I'll be doing some work on it for 2.2.2.


I look forward to it. As I have mentioned in previous posts, I get a lot of problems with the processor hitting 100% for minutes on end, sometimes making the machine unusable for 20 minutes or so. I'll get a response off to the previous post...

Mark.


Post by support » Thu Apr 27, 2006 2:15 am

markmorgan wrote:Yeah, but after a news item is 24 hours old the description text switches to dates, so something else must be changing in the feed items for them to be marked as changed every day after the first day.

Yes, that's right. I'll check out the feed and take a look at it.

markmorgan wrote:I get a lot of problems with the processor hitting 100% for minutes on end sometimes making the machine unusable for 20 minutes or so.

Wow, I've never seen anything that bad. Awasu uses a lot of CPU but it's all at low priority, so it relinquishes the CPU if anyone else wants it. It's much more likely that your machine is having trouble because the disk is getting hammered.


Post by support » Thu Apr 27, 2006 5:24 am

markmorgan wrote:so something else must be changing in the feed items for them to be marked as changed every day after the first day. I've sent a mail to Google about it as well.


The URL they provide to link through to the underlying article has a different "ei" parameter on each update, which causes Awasu to flag the item as being different. It's unlikely Google are going to be willing to <strike>give up their ability to track what we're doing</strike> change this :-)

This is going to be more of an issue as the marketeers invade the syndication space and start inserting ads (that change on every update) and other user-monitoring artifacts. I'll definitely take a look at this for 2.2.2.
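
One way a reader could cope with a changing "ei" parameter is to normalise item URLs before comparing them. The following is only a sketch of that idea, not what Awasu does or will do; the function name, the parameter set and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters to ignore; only "ei" is mentioned in this thread.
TRACKING_PARAMS = {"ei"}


def normalize_url(url: str) -> str:
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two made-up URLs that differ only in their "ei" value compare equal once normalised.
first = normalize_url("http://example.com/news/url?sa=T&ei=AAA&cid=42")
second = normalize_url("http://example.com/news/url?sa=T&ei=BBB&cid=42")
print(first == second)
```

This only helps for parameters known in advance; ads or other content that changes on every update would need a different approach.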

Sigh.. :-(
