Optimize feed fetching

I'm working on a site that has to fetch users' feeds. How can I best optimize the fetching if I have a database with, let's say, 300 feeds? I'm going to set up a cron job that fetches the feeds, but should I do it like 5 every second minute or something?

Any ideas on how to do this the best way in PHP?

Answers


Based on the new information I think I would do something like this:

Let the "first" client initiate the update work and store a timestamp with it. Every other client that asks for the information gets the cached copy until that copy is too old. The next hit from a client then refreshes the cache, which all clients use until it is too old again.

The client that actually initiates the update work should not have to wait for it to finish; just serve the old cached version, and keep serving it until the work is done.

That way you don't have to update anything if no clients are requesting it.
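
The decision each client hit has to make can be sketched roughly like this. It is only a sketch: the function name, the shape of the `$cache` array, and the `refreshing` flag are assumptions made for illustration, and how the background refresh is actually started is left out.

```php
<?php
// Hypothetical helper: decide what to do with a cached feed on a client hit.
// $cache is an assumed structure:
//   ['fetched_at' => int (Unix timestamp), 'refreshing' => bool]
function cacheAction(array $cache, int $now, int $maxAge): string
{
    $isStale = ($now - $cache['fetched_at']) > $maxAge;

    if (!$isStale) {
        // Still fresh: serve the cached copy as-is.
        return 'serve';
    }
    if ($cache['refreshing']) {
        // Too old, but another client already started the update:
        // keep serving the old copy until that work is done.
        return 'serve';
    }
    // First client to hit a stale cache: serve the old copy
    // and kick off the update without waiting for it.
    return 'serve_and_refresh';
}
```

The caller would set the `refreshing` flag before starting the update and clear it (together with a new `fetched_at`) when the update finishes, so only one client triggers the work.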


If I understand your question, you are basically working on a feed aggregator site?

You can do the following: start by refreshing every hour (for example). When you have enough entries from a feed, calculate the average interval between entries, then use that as the interval for fetching that feed.

For example, if the site published 7 articles in the last 7 days, you can fetch its feed every 24 hours (1 day).

I use this algorithm with a few changes. When I calculate the average interval, I divide it by 2 (to be sure not to fetch too rarely). If the result is less than 60 minutes, I set the interval to 1 hour; if it is more than 24 hours, I set it to 24 hours.

For example, something like this:

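    public function updateRefreshInterval() {
            // Count the articles this feed published in the last 7 days.
            $sql = 'select count(*) _count ' .
                    'from article ' .
                    'where created > adddate(now(), interval -7 day) and feed_id = ' . (int) $this->getId();
            $array = Db::loadArray( $sql );

            $count = (int) $array[ '_count' ];

            // Average number of seconds between entries (+1 avoids division by zero).
            $interval = 7 * 24 * 60 * 60 / ( $count + 1 );
            // Halve it so the feed is not fetched too rarely; keep it a whole number.
            $interval = (int) ( $interval / 2 );
            // Clamp the result to the allowed range (1 hour to 24 hours, as described above).
            if( $interval < self::MIN_REFRESH_INTERVAL ) {
                    $interval = self::MIN_REFRESH_INTERVAL;
            }
            if( $interval > self::MAX_REFRESH_INTERVAL ) {
                    $interval = self::MAX_REFRESH_INTERVAL;
            }

            Db::execute( 'update feed set refresh_interval = ' . $interval . ' where id = ' . (int) $this->getId() );
    }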

The table is 'feed'; 'refreshed' is the timestamp of the last time the feed was refreshed, and 'refresh_interval' is the desired time interval between two fetches of the same feed.
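
For the cron side, picking the feeds that are due could look something like the sketch below. It assumes each row carries 'refreshed' as a Unix timestamp and 'refresh_interval' in seconds; the function name and the `$limit` batch size are made up for illustration and are not part of the code above.

```php
<?php
// Sketch: pick the ids of the feeds that are due for a fetch.
// Each row is assumed to look like a 'feed' table row:
//   ['id' => int, 'refreshed' => int (Unix timestamp), 'refresh_interval' => int (seconds)]
function dueFeeds(array $feeds, int $now, int $limit): array
{
    // A feed is due once its last refresh plus its interval has passed.
    $due = array_filter($feeds, function (array $f) use ($now): bool {
        return $f['refreshed'] + $f['refresh_interval'] <= $now;
    });

    // Most overdue first, so a small batch size never starves a feed.
    usort($due, function (array $a, array $b): int {
        return ($a['refreshed'] + $a['refresh_interval'])
            <=> ($b['refreshed'] + $b['refresh_interval']);
    });

    return array_column(array_slice($due, 0, $limit), 'id');
}
```

Each cron run would then fetch only the returned ids and update 'refreshed' for them, so a run that finds nothing due does no work.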


The best thing to do is to be 'nice' and not overload the feeds with lots of needless requests. I settled on a 1 hour update time for one of my webapps that monitors about 150 blogs for updates. I store the time they were last checked in the database and use that to decide when to update them. The feeds were added at random times so they aren't all updated at the same time.
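
If the feeds are all imported at once instead of being added at random times, a similar spread can be forced by hand. The sketch below is not what the answer above does; it swaps in even spacing across the update period, and the function name and parameters are hypothetical.

```php
<?php
// Sketch: give each feed its own first-check time, spread evenly over one period,
// so that a bulk import does not make every feed come due in the same cron run.
function staggeredTimes(array $feedIds, int $start, int $period): array
{
    $count = count($feedIds);
    $times = [];
    foreach (array_values($feedIds) as $i => $id) {
        // Feed number $i gets the start time plus its share of the period.
        $times[$id] = $start + intdiv($i * $period, $count);
    }
    return $times;
}
```

With 300 feeds and a one-hour period this works out to one feed coming due every 12 seconds, and after the first round each feed keeps its own slot as long as it is rechecked on a fixed interval.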


I wrote pfetch to do this for me. It's small, but has a couple of really important aspects:

  1. It's written in Twisted and can handle massive concurrency even when the network is slow.
  2. It doesn't require any cron jockeying or anything.

I actually wrote it because my cron-based fetchers were becoming a problem. Now I have it configured to fetch some random stuff I want from around the internet and then run scripts whenever things change to update parts of my own web site.

