Some thoughts on caching
Keith is thinking about caching. He’s drawn up a pretty interesting set of thoughts and requirements, such as support for conditional GET and fine-grained cache length control, and support for caching most of a page while leaving some small parts dynamic.
Personally, I’m considering moving away from dynamically generated content for the most part on this site and going with a page generation scheme something like Movable Type (or giving funky caching a go). My justification for generating everything dynamically when I built the software originally was that I’d never get a huge amount of traffic, so PHP and MySQL would be more than capable of keeping up with demand. While that’s still true, traffic to this site has risen to the point where it’s a little less certain that everything will hold together (I get failed database connection emails a few times a day). I’m already caching the front page once a minute and caching some internal things such as related entries, so moving to a full caching system for the next generation of the system seems like a logical progression. Besides, there’s not much point in dynamically generating the page for every hit on an archived entry from last year.
If you are internally caching parts of the page, it isn't a great leap to go to funky caching (although, I would not use a 404 handler, I would use mod_rewrite, and readfile()).
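To make the idea concrete, here is a minimal sketch of the "serve the static copy if it exists, otherwise build it once and save it" logic behind funky caching. It is written in Python rather than PHP, and the names serve, cache_dir and render are hypothetical. In a real setup Apache (via mod_rewrite or a 404 handler) would serve the saved file directly, and the path would need sanitising against ".." segments.

```python
import os


def serve(path, cache_dir, render):
    """Return the page body for `path`, building the cached copy on a miss."""
    # Map the request path onto a file inside the cache directory.
    cached = os.path.join(cache_dir, path.strip("/") or "index") + ".html"
    if os.path.exists(cached):
        # Hit: the equivalent of readfile() on the static copy.
        with open(cached, encoding="utf-8") as f:
            return f.read()
    # Miss: generate the page once, save it, then return it.
    body = render(path)
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    with open(cached, "w", encoding="utf-8") as f:
        f.write(body)
    return body
```

Deleting the saved file is then all it takes to force the page to be rebuilt on the next request.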
You also have to consider what happens if you dynamically generate content for individuals (if you let people create accounts so that they can hide certain topics, for example). You can still cache things, it's just a little trickier.
When deciding where best to cache things, I would use the same rule of thumb as when deciding which database fields to index. If the majority of operations are writes rather than reads (for instance, a page that gets lots of comments), then caching it is not going to be as effective as caching a relatively static page. Common sense, really.
Jim - 21st June 2003 12:14 - #
Even a page with lots of comments tends to be read more often than written.
It was surprisingly easy to implement funky caching on my site, and let Apache deal with e-tags, last modified headers, etc.
Sam Ruby - 21st June 2003 13:30 - #
Yeah you are right, I didn't think that bit through fully. There are situations though, where a page is written to more often than read (specifically, where there is a blogroll on the page like Simon has).
If you let funky caching deal with this, you get incorrect responses, unless you rebuild the page every time the dynamic content is updated. This isn't a problem for non-essential data like blogrolls, but can be in other circumstances.
I think I'd rather just check the metadata and handle the caching explicitly in PHP, than let Apache take over. Usually, I prefer to farm off stuff like this, but I reckon custom-written code can do a better job than generic Apache handlers.
What I would like to see, however, is Apache take notice of the ETags/Last-modified timestamps generated by a PHP script, and return a 304 Not Modified response in the presence of appropriate client headers, instead of forcing PHP authors to check these headers themselves and generate 304s themselves.
Jim - 21st June 2003 14:50 - #
This scheme is also good for assets that change only once in a while or not at all... like images, css, or javascript files. It tells the browser not to even check to see if there's a newer version available... preventing a 304 not modified response from even happening.
Am I missing something or is there more to what you're trying to accomplish?
Rich Manalang - 21st June 2003 18:30 - #
Hmm... I think browser caching solves a different problem. The concern here is the overhead of dynamically generating pages on the fly for every request. Even if a site is very supportive of browser caching you could still have problems if 100 different people all visit it within a minute of each other.
My blogroll is actually pretty much cached already - I grab an XML file from blo.gs once an hour and save it locally (it's still parsed for every page view but PHP's XML parser is so fast I hardly consider it overhead at all). Ideally though I would cache that and only have it displayed on the front page, as doing so prevents people from getting unexplained referrals from entry pages.
Simon Willison - 21st June 2003 19:08 - #
Rich,
There are two separate topics here. Firstly, you may wish to cache things on the server, to prevent cpu/memory/time-intensive operations. For instance, Simon doesn't go out and get the information in his blogroll every time somebody requests a page - I'd imagine it's cached on his server in a database somewhere. In a similar vein, if you are generating the pages from an XML source, transformed into XHTML via XSLT, this is an "expensive" operation - especially if you have lots of visitors. So you do the transformation once, and cache the result on the server until the page is updated.
Personally, I have a small, personal photo gallery for my friends & family, where I just dump the images into a directory via SFTP. The thumbnails are generated automatically on the first request for them, and after that, a cached copy is presented to the visitor. If I didn't cache the result, the website would be incredibly slow, or I would have to maintain the thumbnails when authoring.
Obviously, this is a different kind of caching to proxy/browser caching, and has different advantages. You can't cache individual parts of an HTML document in a browser, it's all-or-nothing.
The Expires header is a mixed blessing. If you are sure that your content will not change, or you are fine with people getting out of date objects, then by all means, go ahead. But what most people want is 'must-revalidate', which means that browsers/proxies can still cache the documents, but they need to validate their copy before presenting it to the end user. By 'validate', I mean "check it is up to date" (this is different to HTML validation).
To check that their copy is up to date, they can send a couple of extra headers when requesting the object. The server then has the freedom to send back a 304 Not Modified response. This is something that is done automatically by Apache for flat files (hence the appeal of "funky caching"). However, if you are generating your content via PHP, you have to explicitly code the check and response.
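As a rough illustration of the check a script has to code by hand, here is a minimal sketch in Python (not PHP). The function name conditional_response and the lower-case header dictionary are assumptions made for the example. It handles only exact ETag matches and ignores weak validators and "*".

```python
import hashlib
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime


def conditional_response(request_headers, body, last_modified):
    """Return (status, headers, body), answering 304 if the client's copy is current.

    `last_modified` is a Unix timestamp; `request_headers` uses lower-case keys.
    """
    etag = '"' + hashlib.md5(body).hexdigest() + '"'
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(last_modified, usegmt=True),
        "Cache-Control": "must-revalidate",
    }
    if_none_match = request_headers.get("if-none-match")
    if_modified_since = request_headers.get("if-modified-since")
    fresh = False
    if if_none_match is not None:
        # When the client sends an ETag, it takes precedence over the date.
        fresh = etag in [tag.strip() for tag in if_none_match.split(",")]
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            fresh = int(last_modified) <= since.timestamp()
        except (TypeError, ValueError):
            fresh = False  # Unparseable date: fall back to a full response.
    if fresh:
        return 304, headers, b""
    return 200, headers, body
```

The first request gets a 200 with both validators. A later request that echoes either one back gets an empty 304 for as long as the content is unchanged.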
I originally wrote supporting code for this and submitted it to the maintainer of PEAR::HTTP, without getting any response. However, I feel that a better place for it is within Apache, as the benefit would extend to all filters.
Also, remember that once their copy of the object expires, all the cache will do is revalidate the copy anyway, so you will still get benefits from proper 304 handling if you use the Expires header.
Jim - 21st June 2003 19:22 - #
jack - 21st June 2003 20:06 - #
FYI: I employ a variety of strategies.
For "important" changes, I simply delete all files which might be affected. Should writes outpace reads, there would not be a need to delete any given file twice.
For "non-important" changes (e.g., blogroll updates), I simply wait until an update is required for other reasons. To help things along, daily I remove all files that haven't been accessed in a day, or were created over a week ago.
Perfect? No. Effective? I think so.
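The daily clean-up described above might look something like the following sketch, written in Python with hypothetical names (purge, cache_dir). It assumes the file's modification time is a fair stand-in for its creation time, and that the filesystem actually records access times, which is not the case on volumes mounted with noatime.

```python
import os
import time

DAY = 24 * 60 * 60


def purge(cache_dir, now=None, max_idle=DAY, max_age=7 * DAY):
    """Delete cached files not read for over a day or written over a week ago."""
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            # st_atime is the last access; st_mtime stands in for creation time,
            # since a cached file is written once and then only read.
            if now - st.st_atime > max_idle or now - st.st_mtime > max_age:
                os.remove(path)
                removed.append(path)
    return removed
```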
Sam Ruby - 21st June 2003 20:56 - #
The method I use is to cache the content separately from the template:
Voila! Fresh looking pages. The biggest advantage of this is that template changes cascade through all pages without having to re-cache. Quite handy.
GaryF - 21st June 2003 23:19 - #
Another angle of attack in PHP is to take advantage of the output control functions and, rather than thinking about caching a complete page, cache it in sections, each with its own "expiry" conditions. The page is still rendered by PHP, but if it finds a cached version of a section it uses that instead of performing the related query. I posted an example here on Sitepointforums - sorry; the comments system was choking on the PHP.
PEAR::Cache_Lite makes a solid API to handle this for you BTW.
As well as having expiry times for cached files, you could "link" them to the comments interface so that if someone posts a new comment, the cache file for the comments section of the page is re-built while the rest of the cache files remain the same.
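A minimal sketch of that per-section scheme, in Python rather than PHP: each section has its own expiry time, and posting a comment throws away only the comments section. The FragmentCache class and its method names are hypothetical, and an in-memory dictionary stands in for the cache files. This is not the Cache_Lite API.

```python
import time


class FragmentCache:
    """Cache rendered page sections, each with its own expiry time."""

    def __init__(self):
        self._store = {}  # section name -> (expires_at, html)

    def get(self, name, ttl, build, now=None):
        """Return the cached section, rebuilding it if missing or expired."""
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]
        html = build()  # Only now do we run the expensive query.
        self._store[name] = (now + ttl, html)
        return html

    def invalidate(self, name):
        """Drop one section, e.g. 'comments' after a new comment is posted."""
        self._store.pop(name, None)
```

The comment-posting code would call invalidate("comments"), leaving the other sections to live out their own expiry times.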
For mod_python it looks like the formatter namespace contains something that might help do the same. Don't know - still on the Python learning curve.
Harry Fuecks - 21st June 2003 23:22 - #
What I want to know is how you rig up a system that deletes your cached files when a template changes but doesn't touch anything that you may have created yourself. A different filename extension, such as .cache?
Lach - 22nd June 2003 02:38 - #
Lach,
What I do is set up an .htaccess file for each website, that sets the WEBSITE_ROOT environment variable to /home/www/www.example.com (or whatever is appropriate for the server I am on).
This environment variable is available in PHP through $_SERVER["WEBSITE_ROOT"]. As long as I make sure that the cache directory exists and is writable by the user Apache runs under (which is all handled by the website setup scripts), I can dump cache files in there.
Generally speaking, I create a subdirectory for each different cache use - one for syndicated content, one for the photo gallery, and so on. I have a fairly sane URL scheme, so I can simply md5sum the URL and use that as the filename within the appropriate directory.
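For illustration, the filename scheme could be sketched like this in Python. The function name cache_path and the "cache" directory name are assumptions. Only the root path, the per-use subdirectories and the MD5-of-URL filename come from the description above.

```python
import hashlib
import os


def cache_path(website_root, use, url):
    """Build the cache filename for `url` inside the subdirectory for `use`."""
    # An MD5 hex digest is always 32 characters and safe to use as a filename.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(website_root, "cache", use, digest)
```

For example, cache_path("/home/www/www.example.com", "gallery", "/photos/2003/") always returns the same path for the same URL, so the lookup and the write agree without any index.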
Whenever the PHP file gets called, it checks to see if the content has been updated, and if so, regenerates the cached file. In the photo gallery example I gave above, that's simply a case of stat()ing the original image file and the cached thumbnail file, and comparing the last modified timestamp.
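The timestamp comparison for the photo gallery case can be sketched as follows, in Python rather than PHP. The names needs_rebuild, cached_thumbnail and make_thumbnail are hypothetical, and make_thumbnail is whatever routine actually resizes the image.

```python
import os


def needs_rebuild(source, cached):
    """True if the cached file is missing or older than its source."""
    try:
        cached_mtime = os.stat(cached).st_mtime
    except FileNotFoundError:
        return True  # Nothing cached yet, so build it.
    return os.stat(source).st_mtime > cached_mtime


def cached_thumbnail(source, cached, make_thumbnail):
    """Regenerate the thumbnail only when the original image has changed."""
    if needs_rebuild(source, cached):
        make_thumbnail(source, cached)
    return cached
```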
Jim - 22nd June 2003 03:28 - #
Simon: All JournURL blogs are dynamically generated and cached. It works like this:
So for most requests, the only work Coldfusion has to do is include the pre-rendered HTML, handle the textad, and update the user's referrers.xml file.
Roger Benningfield - 25th June 2003 23:29 - #