Chris Miller - [08/Jan/04 09:10 AM ]
Can you please try the current CVS version of OSCache to see if this resolves your problem?
Is there any news on this? I have noticed similar effects when running in production on a site serving ~300,000 pages per day.
I have built the latest version of OSCache from CVS today, and it appears to be running fine in my staging environment, but that is not under the same load as my production machine. I'd like to try and put this beta build into production, but I am worried about any unwanted side effects. Does anyone have any thoughts?

Sorry for the late reply. I've been very busy finishing our project for release and haven't been able to make time for testing before now.
The problem has not disappeared, but we've optimized the code so that rendering the page takes less time. This has helped, as the problem now occurs less frequently. My guess is that it's caused by the implementation of the AbstractConcurrentReadCache class. The API documentation says: "A version of Hashtable that supports mostly-concurrent reading, but exclusive writing. Because reads are not limited to periods without writes, a concurrent reader policy is weaker than a classic reader/writer policy, but is generally faster and allows more concurrency. This class is a good choice especially for tables that are mainly created by one thread during the start-up phase of a program, and from then on, are mainly read (with perhaps occasional additions or removals) in many threads. If you also need concurrency among writes, consider instead using ConcurrentHashMap."

We haven't really considered switching to ConcurrentHashMap, since I suspect this would affect performance too much. Our site takes up to 2,000,000 hits a day, so speed is of the utmost importance. Instead, we've set the timeout of every element to 30 minutes, to be sure that stale pages do not live too long: <oscache:cache ... time="1800">

We've also beefed up the application cluster with an extra cluster member, which means the load on each member will be less. OSCache runs standalone on each cluster member. Reducing the load on each cache instance reduces the possibility of stale pages, as the flush error only occurs during high load. Also, if the error occurs, it will probably occur on just one of the cluster members, not all of them, meaning that most users will still get information that is up to date.

This issue is a real problem to us, but speed is even more important. Since we've been able to decrease the risk of the problem occurring, we've chosen to live with it for now. I should add that the problem occurs very infrequently now. If page generation is fairly quick, we have a hard time reproducing it.
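To illustrate why the 30-minute timeout bounds the damage from a missed flush, here is a minimal sketch of a cache whose entries expire on read once they are older than a fixed time-to-live. It is not OSCache's API or implementation: the class `TtlCache`, its methods, and the injectable clock are hypothetical names chosen for this example, and it uses `ConcurrentHashMap` purely for simplicity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical illustration only; this is NOT OSCache code.
// Entries older than ttlMillis are treated as missing, so a page that
// escaped an explicit flush can be served stale for at most ttlMillis.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long storedAtMillis;

        Entry(V value, long storedAtMillis) {
            this.value = value;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong()));
    }

    // Returns the cached value, or null if it is absent or has expired.
    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() - e.storedAtMillis >= ttlMillis) {
            // Remove only if the entry has not been replaced in the meantime.
            entries.remove(key, e);
            return null;
        }
        return e.value;
    }

    // Explicit flush; the TTL above is the fallback if this is ever missed.
    void flush(K key) {
        entries.remove(key);
    }
}
```

With a TTL of 1,800,000 ms (the equivalent of time="1800"), a page whose flush was lost is regenerated on the first request after 30 minutes. That limits staleness but is far from a 10-second update guarantee, so it reduces the impact of the problem rather than fixing it.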
We have a requirement specification that demands a maximum delay of 10 seconds on updates, and this is why this situation is critical to us. In any other situation I would not hesitate to trust OSCache. I have yet to do testing with the newest build from CVS. We've had to focus on reducing the problem with the release version of OSCache.

We built a version from CVS on 20040218 and, after a short amount of testing in our staging environment, deployed it to our live server. It is running without a hitch (we served 500,000 pages the other day and it did not hiccup).
The issue seems to have been fixed on our live server, but the load has reduced a lot in recent days so it's hard to say for certain...