History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: CACHE-170
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Critical Critical
Assignee: Andres March
Reporter: Guillaume Berche
Votes: 0
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
OSCache

Data race handling Cache.updateStates results in Thread hangs when the blocking mode is used in concurrence

Created: 28/Apr/05 02:24 AM   Updated: 06/Oct/05 07:15 AM
Component/s: Base Classes
Affects Version/s: 2.1, 2.0.1
Fix Version/s: 2.1.1, 2.2 RC

File Attachments: 1. GZip Archive modified_sources.tar.gz (11 kb)
2. Text File oscache-2-0-1-modifs.patch (11 kb)
3. Text File test_output_2_1.txt (12 kb)

Issue Links:
Duplicate
 
This issue is duplicated by:
CACHE-184 Filter deadlock with external apps (m... Critical Closed
Required
This issue requires:
CACHE-174 Regression in fix of CACHE-170: Updat... Major Closed
 

Flags: Important


 Description  « Hide
The bug reproduces in the following scenario: the cache is configured in blocking mode. One thread T1 is currently "owning the EntryStateUpdate E1" (i.e. has caught the NeedRefereshException when calling Cache.getFromCache(K)) while N threads T2, T3, T4... are calling Cache.getFromCache(K) and get suspended waiting to be notified on E1.

There is a small data race window between the a call to Cache.cancelUpdate() by Thread T1 which removes the EntryStateUpdate E1 from the Cache.updateStates Map, and the reception of the notification on EntryStateUpdate E1 by another thread T2, which would reinsert the EntryStateUpdate E1 into Cache.updateStates for key K before throwing a NeedRefereshException.

During this data race window, another thread T3 might make a concurrent Cache.getFromCache(K) call and would as a result create a new EntryStateUpdate E2 instance and assign it into the Cache.updateStates for key K later on. As a result, if other threads T3, T4 where also waiting for a notification on E1, they would never receive such notification (because E1 is not accesible through the Cache.putInCache(K) (which would only find E2 into the Cache.updateStates accessed within the Cache.completeUpdate() method). Threads T3, T4 would therefore remain hanging for ever waiting to be notified.


The modified unit test is in the attached patch file: a new testConcurrentStaleGets() method was added to simulate more massive concurrent usage of cache.cancelUpdate()
The patch file was generated through the command "diff --exclude=*.class --exclude=*.html -Naur oscache-2-0-1-original/src/ oscache-2-0-1/src > oscache-2-0-1-modifs.patch" where the modifications were made into my local oscache-2-0-1 directory. I also include the full source for these two modified classes (built using the following command "tar cvfz modified_sources.tar.gz oscache-2-0-1/src/core/java/com/opensymphony/oscache/base/Cache.java oscache-2-0-1/src/core/test/com/opensymphony/oscache/base/TestConcurrency.java")


Additionally, I made a modification to the Cache.getFromCache() class to add the following error trace:


                            // We put the updateState object back into the updateStates map so
                            // any remaining threads waiting on this cache entry will be notified
                            // once this thread has done its thing (either updated the cache or
                            // cancelled the update). Without this code they'll get left hanging...
                            synchronized (updateStates) {
                                Object previousObject = updateStates.put(key, updateState);
                                if (previousObject != null && previousObject != updateState) {
                                 log.error("date race window triggered synchro bug: interlaced [" + previousObject+ "] instead of [" + updateState + "]");
                                }
                            }



When running this modified unit test, it works fine if few threads are configured and the number of repeat steps is low. However, when more massive concurrency is configured, then it generates the output on the 2-1 version which is reproduced in attachement and fails. This also fails on the 2-0-1 version and the latest CVS code (on Aprl 27, 2005). I hope this modified unit test will be helpfull to support the fix for this synchro bug.


I then tried to imagine ways to fix this bug. To remove the data race, I think the simpler thing would be to keep a same EntryStateUpdate instance associated to a given key value for the whole duration of the key value (i.e. through add/and updates requests to the cache with this same key value).

Concretely, I could only find the following coding alternatives:

1- Have the EntryStateUpdate be associated to the key into the Cache.cacheMap field (for instance within the CacheEntry). However the following comment suggests this is not possible.

/**
     * A set that holds keys of cache entries that are currently being built.
     * The cache checks against this map when a stale entry is requested.
     * If the requested key is in here, we know the entry is currently being
     * built by another thread and hence we can either block and wait or serve
     * the stale entry (depending on whether cache blocking is enabled or not).
     * <p>
     * We need to isolate these here since the actual CacheEntry
     * objects may not normally be held in memory at all (eg, if no
     * memory cache is configured).
     */
    private Map updateStates = new HashMap();

Is this comment correct? When the actual CacheEntry object is not held in memory, I imagine the CacheEntry is then restored from disk. Would it be possible to either restore the EntryStateUpdate from disk at the same time or atomically recreate one at this time?

2- Keep the EntryStateUpdate instance present into the Cache.updateStates until the key is removed from the cache. This has the drawback of having a larger Cache.updateStates map which may increase the memory usage significantly when the number of keys is large.


Did you already study other ways to fix this problem?

Thanks in advance for your help and for contributing this great package to the community!

Best regards,

Guillaume.


 All   Comments   Change History      Sort Order:
Guillaume Berche - [28/Apr/05 02:26 AM ]
Output of the contributed TestConcurrency.testConcurrentStaleGets() method

Guillaume Berche - [28/Apr/05 02:27 AM ]
The incremental changes (.patch file) and the full copy of modified source (.tar.gz file) for Cache and TestConcurrency files.

Guillaume Berche - [28/Apr/05 02:43 AM ]
I confirm that the following brutal fix of "not removing the EntryUpdateState from the Cache.updateStates" indeed fixes the symptom.

As previously discussed, a more precise fix could only remove the EntryUpdateState from the Cache.updateStates once Cache.removeEntry() is called (and by safety notify all threads waiting on the removed EntryUpdateState instance). However it has the drawback of using slightly more memory because the size of the Cache.updateStates Map would be roughly equals to the Cache.cacheMap

public void cancelUpdate(String key) {
        EntryUpdateState state;

if (key != null) {
            synchronized (updateStates) {
                //state = (EntryUpdateState) updateStates.remove(key);
                state = (EntryUpdateState) updateStates.get(key);

if (state != null) {
                    synchronized (state) {
                        state.cancelUpdate();
                        state.notify();
                    }
                }
            }
        }
    }

Andres March - [29/Apr/05 10:48 AM ]
This should be backported to 2.1.1 but let's see if this all works

Guillaume Berche - [02/May/05 01:30 AM ]
Andres,

Thanks for your prmpt work on this fix! Could you please describe the fix that was brought into 2-1-1 to help me understand the alternative that you selected?

I tried looking at the CVS repository log and could only find a fix on the Cache class:
https://oscache.dev.java.net/source/browse/oscache/src/core/java/com/opensymphony/oscache/base/Cache.java?r1=1.13&r2=1.13.2.1&only_with_tag=v2_1_1_release

Were there other modifications that I missed?


I wonder whether how the following points are handled in this fix:

1 - how are the EntryStateUpdate entries removed from the updateStates map? I understand they are now only removed in the completeUpdate() method called from putInCache(). Would'nt it necessary to call completeUpdate() method when a CacheEntry is removed from the cacheMap as to avoid dangling EntryStateUpdate instances that could represent a memory leak? I could find the following cases for this CacheEntry instance removal (there may be others?):
a) explicit removeEntry() method method call
b) implicit removal from the cache when it is full.


2- is the following block of code in Cache.getFromCache() still necessary? I understand that it can be removed since the EntryStateUpdate is not removed from the updateStates map anymore between cancelled updates. Consequently, this old workaround would only decrease concurrency.

// We put the updateState object back into the updateStates map so
                            // any remaining threads waiting on this cache entry will be notified
                            // once this thread has done its thing (either updated the cache or
                            // cancelled the update). Without this code they'll get left hanging...
                            synchronized (updateStates) {
                                updateStates.put(key, updateState);
                            }

3- was the modification on completeUpdate() related to this bug? I.e. the call to the EntryStateUpdate.startUpdate() if ! EntryStateUpdate.isUpdating()

Thanks in advance for your help,

Guillaume.

Andres March - [02/May/05 11:24 AM ]
I took a look and it seems you are correct. This fix could lead to a memory leak. That's what happens when I code late at night. I'm thinking of possible solutions but unfortunately the implicit removes have no ability to signal the Cache to remove the update state.

Guillaume Berche - [09/May/05 02:35 AM ]
Andres,

Should this bug reopened to account for the memory leak introduced by the fix? Should rather a new bug opened?

Would it be possible to move the EntryStateUpdate directly within the CacheEntry class and hence have it stored within Cache.cacheMap field. This way it can be removed the Cache.updateStates, and the implicit remove would automatically properly handle be clean up of the EntryStateUpdate associated with a key?


Could a contributed patch help on this issue?

Thanks in advance,

Guillaume.

Andres March - [09/May/05 09:38 AM ]
We should open a new issue for the leak. Your suggestion is one of the problems I have been contemplating. By putting the state in the cache map, we would need to synchronize on the get() method to make sure the state is reflected accurately, thereby significantly reducing concurrency.

I am always open to any suggestions or patches.

Guillaume Berche - [09/May/05 10:19 AM ]
Andres,

Thanks for your response. I then open a new issue for the leak if you're OK.

I don't understand why we would need to synchronize on the AbstractConcurrentReadCache.get() if the EntryStateUpdate reference is maintained as a final value within the CacheEntry?

The same synchronization optimization currently done in AbstractConcurrentReadCache.get() could be left unchanged: from a key instance, AbstractConcurrentReadCache.get() retreives a CacheEntry instance. The EntryStateUpdate instance is then accessed from the CacheEntry instance and the accessor does not require anymore synchronization if the reference to the EntryStateUpdate is final.

Rather, what worries me if the following comment on the current code for the "private Map updateStates = new HashMap();" declaration:

* We need to isolate these here since the actual CacheEntry
   * objects may not normally be held in memory at all (eg, if no
   * memory cache is configured).

Is this comment true? What would happen when "no memory cache is configured"? What would the Cache.getCacheEntry() method return in this case? Would that invalidate the solution to bug CACHE-170 based on moving the EntryStateUpdate within the EntryStateUpdate ?

Thanks for your help,

Guillaume.

Andres March - [09/May/05 07:28 PM ]
you might be right about the sync requirements but I need to look at this solution more closely. If you want to provide a patch, I will have a look.

But the comment is correct. This is a barrier to using the cachemap to store the update state.

The issues were having come up very often and I am leaning towards rewriting the internals to be more clear and cleaner. This would be a major rework but given the types of problems that crop up because of the complex implementation I think it is worth it.

What I'm suggesting is :
- to replace Cache with an interface that exposes the api.
- Create a base abstract implementation that gives basic functionality. It would use a single map to store the entry, its groups, and its state. This map would be a backport of the java.util.concurrent.ConcurrentHashMap. All entries would be in the map, although the entry's content may only exist on disk.
- Port the exisiting algorithm classes to extend the class above instead of AbstractConcurrentReadCache.

Guillaume Berche - [10/May/05 02:20 AM ]
Andres,

> But the comment is correct. This is a barrier to using the cachemap to store the update state.

Thanks for the confirmation. Looking in more details to the AbstractConcurrentReadCache.sput() method, I now better understand what would be the problem with storing the EntryStateUpdate within the CacheEntry instance (which is stored as a value within the AbstractConcurrentReadCache class:
- if no memory cache is used, then the CacheEntry instance is read from disk each time it is requested. Since a new ObjectInputStream instance is created for each call to AbstractDiskPersistence.retreive(File file), then a new CacheEntry instance would be returned for each call. As a result, two consecutive calls to Cache.get() would return a different CacheEntry instance and therefore a different EntryStateUpdate instance, making it unsuiteable to be used as a shared monitor among current threads. Consequently, the current design for the disk-only persistence mode indeed does not support using the CacheEntry to store and share the EntryStateUpdate.

> What I'm suggesting is :
> - to replace Cache with an interface that exposes the api.
> - Create a base abstract implementation that gives basic functionality. It would use a single map to store the entry,
> its groups, and its state. This map would be a backport of the java.util.concurrent.ConcurrentHashMap. All entries would
> be in the map, although the entry's content may only exist on disk.
> - Port the exisiting algorithm classes to extend the class above instead of AbstractConcurrentReadCache.

As a newcomer to the oscache source code, this seems a good direction to me :-)

In the modifications you describe, I imagine the Map could store a "composite" object as values. This composite object would include the group, the EntryStateUpdate and a "description" of where the content is stored (direct reference for memory cache or "persistent reference" for persistent cache).

However, wouldn't the backport of java.util.concurrent.ConcurrentHashMap be difficult because of its dependencies on JDK 1.5 synchronization constructs?

> you might be right about the sync requirements but I need to look at this solution more closely. If you want to provide a patch, I will have a look.

My best bet was to move the EntryStateUpdate into the CacheEntry, but we discussed the issues with that related to the "persistence-only cache mode". Do you know if the feature of "using a persistence-only cache" is widely used by oscache users? Could it be considered to make a special case for this feature (e.g. either temporary keep the updateState map only in this configuration, or temporary disable this feature until the major rework you describe is complete)?

Thanks,

Guillaume.

Andres March - [10/May/05 02:34 AM ]
Correct, the composite object would include the group(s), state, and either the content or a pointer to where it is stored.

The backport is done. I have been trying out the classes. It is endorsed by Doug Lea (the original creator) and does not depend on the memory model changes of 1.5.

I don't think providing a memory only fix at this point is a good idea. It would be more constructive to put my time towards a more comprehensive fix. I do consider this issue a major one but I feel if we don't address the core problems then more will crop up.

The question really is when could I get this all done. This weekend is out but I might be able to get to it next weekend.

I'd more than willing to accept help with this. You may feel new to the code but it won't matter much in regards to what we want to change. Let me know if you are interested. I may create a branch in cvs to work this stuff out.

Guillaume Berche - [10/May/05 03:30 AM ]
Thanks for the precisions. If I understand well this refactoring would consist of the following steps:

0- change the concrete Cache class into an interface with same API and probably same name (as to preserve compatibility). Separate from concrete implementation (e.g. CacheImpl)
1- replace AbstractConcurrentReadCache (based on a modified version of the old Doug Lea class from concurrent lib) with new java.util.concurrent.ConcurrentHashMap, and port the current algorithms implemented in AbstractConcurrentReadCache subclasses (FIFO, LRU, Unlimited caches) as ConcurrentHashMap subclasses
2- change the values stored within the map from directly the content to a composite object that would include group(s), state, and either the content or a pointer to where it is stored. Remove the updateStates field

While I think I understand steps #0 and #2 and the value they provide, I don't quite understand the details of step #1 and its benefits. Would the ConcurrentHashMap source code be modified to integrate oscache feature in the same way than AbstractConcurrentReadCache? What design can be imagined to make this simpler and clearer?

Thanks for the offer to help out. I'll try to see if I am able to free some time any time soon, and will let you know soon.

By the way, as discussed, I have opened bug CACHE-174 to track the memory leak issue.

Cheers,

Guillaume.

Andres March - [10/May/05 09:44 AM ]
0- right
1- the main difference would be instead of modifying the ConcurrentReadHashMap to use it as a private property of the new abstract cache impl. The map would be the backing store for the cache impl and hold the 3 things we mentioned.
2- right

> While I think I understand steps #0 and #2 and the value they provide, I don't quite understand the details of step #1 and >its benefits. Would the ConcurrentHashMap source code be modified to integrate oscache feature in the same way than >AbstractConcurrentReadCache?
No. The main purpose of this refactoring would be to consolidate the cache access code and detangle the cache logic from the concurrent hash map.

>What design can be imagined to make this simpler and clearer?
still working on it but this is what I have so far.

Guillaume Berche - [02/Jun/05 06:29 AM ]
Andres,

As we discussed, I have built an internal patch version (based on version 2-0-1, which I called internally 2-0-3) applying the following design:
---------------------------
Since the EntryStateUpdate instances are used to coordinate threads during the cache miss/cache stale/putInCache/cancelUpdate conditions, the EntryStateUpdate instances should not be removed from the map until all threads being coordinated are done dealing with the key associated with a given EntryStateUpdate instance.
As a result, an explicit usage count is maintained into the EntryStateUpdate, it represent the number of usages on a given EntryStateUpdate instance among the following:
- looking up a stale entry
- waiting for an update of an entry
- [about to be] performing an update

Note that a single Thread may potentially account for more than one usage count for a given EntryStateUpdate instance (this case was not optimized and usage is not bound to threads, clients are still responsible for making balanced number of calls to cancelUpdate() or putInCache()).

When the usage count reaches zero, then the EntryStateUpdate instance is removed from the updateMap.
--------------------------

I attach the diff ("cvs-diff-2-0-1_to_2-0-3.txt") along with a tgz containing the modified classes ("modified_sources_in_2-0-3.tgz"). It includes the modifications for bugs CACHE-170 and CACHE-177, the corresponding enhancements to the unit tests. I also modified the LRUMap and FIFOMap to support JDK 1.3 compilation (i.e. no static reference to 1.4 classes, pure introspected instancation).


I hope this can help and could be integrated in a future release.

Regards,

Guillaume.

Guillaume Berche - [02/Jun/05 06:33 AM ]
Note: I was unable to add attachements to this bug (maybe because it is CLOSED), I have therefore added them into the CACHE-174 issue.

Andres March - [02/Jun/05 08:35 PM ]
THe fix for this caused a small memory leak

Thorsten Gast - [06/Oct/05 07:15 AM ]
Hello Andres,

You write that there is now a small memory leak. What does small means in this context and how dows this effect the running application? We are using oscache in a web-application, which should, in the best case, never restart or even worser crash. With a memory leak in the system you can't say when the memory is exceeded.

Regards

Thorsten