[Mondrian] assumed memory leak causes oome in Mondrian 3.1.6

Thu Apr 8 13:36:31 EDT 2010

We put cached data into a thread-local to ensure that we do not thrash while
a query is being processed. If a query has a working set larger than main
memory, thrashing would be a pattern where a query loads the first 20% into
memory, loads the second 20% which pushes out the first 20%, loads the third
20% which pushes out the second 20%, then references the first 20%, which
causes the first 20% to be reloaded (an expensive operation) and pushes out
the third 20%. I call it thrashing because it is very similar to virtual
memory thrashing [see
http://en.wikipedia.org/wiki/Thrashing_(computer_science) ]. If thrashing
occurs, the query might be hundreds or thousands of times slower than if
adequate memory were available. We'd rather that it fail fast with an OOME
than thrash, and the thread-local achieves this.
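
To make the pattern concrete, here is a minimal Java sketch. It is not
Mondrian code: it assumes a plain LRU cache built on java.util.LinkedHashMap
and a query that scans its segments cyclically, and the names ThrashingSketch
and countLoads are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ThrashingSketch {
    /**
     * Counts how many (expensive) loads occur when a query scans
     * `segments` segments cyclically, `passes` times, through an LRU
     * cache that holds at most `capacity` segments.
     */
    static int countLoads(int segments, final int capacity, int passes) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        Map<Integer, Boolean> lru =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                    return size() > capacity;
                }
            };
        int loads = 0;
        for (int p = 0; p < passes; p++) {
            for (int s = 0; s < segments; s++) {
                if (lru.get(s) == null) {
                    loads++;                    // cache miss: segment must be (re)loaded
                    lru.put(s, Boolean.TRUE);   // may evict the oldest segment
                }
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        System.out.println(countLoads(5, 2, 10)); // working set larger than the cache
        System.out.println(countLoads(5, 5, 10)); // working set fits in the cache
    }
}
```

With five segments and room for only two, every access is a miss, so ten
passes cost 50 loads; with room for all five, the same ten passes cost 5.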

That said, you are correct that data should be removed from the thread-local
after the query has stopped processing. Either
RolapStar.pushAggregateModificationsToGlobalCache or
RolapStar.clearCacheAggregations must be called at the end of query
execution, whether the query succeeds or fails. If that is not happening, it
is a bug.
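
The cleanup contract described above can be sketched with a try/finally. This
is a sketch under assumptions, not Mondrian's implementation:
QueryCacheSketch, GLOBAL, LOCAL and execute are hypothetical stand-ins, and
the real signatures of the two RolapStar methods are not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class QueryCacheSketch {
    // Hypothetical stand-in for the shared (global) cache.
    static final Map<String, Object> GLOBAL = new ConcurrentHashMap<>();

    // Hypothetical stand-in for the per-thread cache used while a query runs.
    static final ThreadLocal<Map<String, Object>> LOCAL =
        ThreadLocal.withInitial(HashMap::new);

    /** Runs a query and always empties the thread-local afterwards. */
    static void execute(Consumer<Map<String, Object>> query) {
        boolean succeeded = false;
        try {
            query.accept(LOCAL.get());
            succeeded = true;
        } finally {
            if (succeeded) {
                // Plays the role of pushing local modifications to the global cache.
                GLOBAL.putAll(LOCAL.get());
            }
            // Plays the role of clearing the local aggregations; runs on
            // success and on failure alike.
            LOCAL.remove();
        }
    }

    /** True if the calling thread currently holds no locally cached data. */
    static boolean localIsEmpty() {
        return LOCAL.get().isEmpty();
    }

    public static void main(String[] args) {
        execute(local -> local.put("segment", "data"));
        System.out.println(GLOBAL.containsKey("segment"));
        System.out.println(localIsEmpty());
    }
}
```

The point of the finally block is that a failed query still releases its
thread-local data, which matters when the thread goes back to a pool.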

Julian

  _____  

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On
Behalf Of Plöger, Henning
Sent: Thursday, April 08, 2010 9:41 AM
To: mondrian at pentaho.org
Subject: [Mondrian] assumed memory leak causes oome in Mondrian 3.1.6

Dear all,

Recently we reproducibly got an OutOfMemoryError after executing many
different queries one after another.

The heap dump taken on OOME, or after a certain number of queries, showed a
suspicious schema object referencing several hundred megabytes and growing
with each query. The GC seems unable to remove these objects, although I
thought they were softly referenced in Mondrian’s cache.

Further investigation showed that the whole object tree is strongly
referenced from a thread-local variable. Since we run Mondrian within an
application server that uses a thread pool, there were many of these dangling
thread-locals, and the GC was never able to clean the cache.

To test our hypothesis further, we removed the thread-local after each
request/query and ran our test again. Now we could clearly see that the GC
freed the heap, the OOME disappeared, and a heap dump no longer showed any
strongly referenced objects from Mondrian.
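
The effect of the thread pool can be reproduced in isolation. The sketch
below is not our application code and does not use Mondrian; the class
PooledThreadLocalSketch and its method are hypothetical, and a
single-threaded executor stands in for the application server's pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PooledThreadLocalSketch {
    // Stand-in for a thread-local that holds cached query data.
    static final ThreadLocal<byte[]> LOCAL = new ThreadLocal<>();

    /**
     * Runs two "requests" on the same pooled thread and reports whether
     * the value set by the first is still reachable during the second.
     */
    static boolean valueSurvives(boolean removeAfterRequest) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First request: fills the thread-local, optionally cleans up.
            pool.submit(() -> {
                try {
                    LOCAL.set(new byte[1024]);
                } finally {
                    if (removeAfterRequest) {
                        LOCAL.remove();
                    }
                }
            }).get();
            // Second request: same worker thread, so it sees any leftover value.
            return pool.submit(() -> LOCAL.get() != null).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(valueSurvives(false));
        System.out.println(valueSurvives(true));
    }
}
```

Without the remove() call the value stays strongly reachable for as long as
the pooled thread lives; with it, the second request finds nothing.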

The problematic thread-local variable is
mondrian.rolap.RolapStar.localAggregations. The AggregationKey object holds a
reference path to the RolapStar object, which references the thread-local.
This reference chain prevents the GC from removing the thread-local (the
thread-local is referenced by its own value).

Is this thread-local variable intended to be removed after each request?

Does removing the thread-local variable after each request change the
behavior?

Are the query results guaranteed to be the same as before (our tests suggest
so)?

Kind regards,

Henning
