[Mondrian] hanging application: segment cache manager and schema load?

Jonathan Rand rand.jc at gmail.com
Tue Jun 28 13:39:04 EDT 2016


I see that MONDRIAN-2476 has been marked resolved with this commit:
https://github.com/pentaho/mondrian/pull/717/files. There's a problem with
that fix: a null pointer guard in the clear() method was deleted, but it
should have been kept to ensure that null values aren't inserted into the
list.

The change should be updated from this –

-                    RolapSchema schema = ref.get();
-                    if (schema != null) {
-                        schema.finalCleanUp();
-                    }
+                    schemas.add(ref.get());

To this –

                    RolapSchema schema = ref.get();
                    if (schema != null) {
-                        schema.finalCleanUp();
+                        schemas.add(schema);
                    }
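
For context, here is a minimal sketch of the corrected loop once the guard is
restored (the reference type and the map being iterated are assumptions based
on the rest of this thread, not the actual source):

    // Collect the live schemas under the write lock. ref.get() returns null
    // once a schema has been garbage collected, so the guard keeps nulls out
    // of the list that finalCleanUp() is later called on.
    List<RolapSchema> schemas = new ArrayList<RolapSchema>();
    for (Reference<RolapSchema> ref : mapKeyToSchema.values()) {
        RolapSchema schema = ref.get();
        if (schema != null) {
            schemas.add(schema);
        }
    }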

Full disclosure: I'm the developer that sent the proposed change to Jeff
Wright in the first place, so this was my own mistake.

Jon Rand

On Tue, Jun 14, 2016 at 10:35 AM, Wright, Jeff <
jeff.s.wright at truvenhealth.com> wrote:

> Regarding MONDRIAN-2476 deadlock, we were able to reproduce the deadlock
> in development with Mondrian 3.12.0.1-196. There’s a thread dump excerpt
> below.
>
>
>
> The engineer working on this with me has a code change he’s going to try
> locally. Here’s his description (it’s written against the older,
> synchronized RolapSchemaPool, but he’s actually testing a modified version
> that uses the reentrant lock, so read the “synchronized” comments as
> “lock”):
>
>
>
> The deadlock problem in RolapSchemaPool seems to be related to calling
> schema.finalCleanUp() while holding the write lock. I’ve made the following
> changes:
>
>
>
> 1.       The first change is to make the public remove() methods
> unsynchronized, relying instead on synchronization of lower-level methods,
> as described next.
>
>
>
> 2.       The second change is to refactor the private remove() method,
> separating the functionality that needs to be in a critical section
> (removing the schema from the mapKeyToSchema and mapMd5ToSchema maps) from
> the functionality that calls the schema.finalCleanUp() method.  The new
> method that removes the schema from the maps, removeFromSchemaMaps(), IS
> synchronized.
>
>
>
> 3.       The private remove() method is NOT synchronized.  This method is
> still where the call to schema.finalCleanUp() is made, but
> schema.finalCleanUp() is called after the synchronized
> removeFromSchemaMaps() method has returned.  This ensures that
> schema.finalCleanUp() will no longer be running inside the critical section
> and should no longer cause a deadlock.
>
>
>
> 4.       The same refactoring was applied to the RolapSchemaPool.clear()
> method (see the sketch below).
>
>
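> A rough sketch of that split (the helper name removeFromSchemaMaps() and the
> fields lock, mapKeyToSchema and mapMd5ToSchema follow the description above;
> the parameter type and the md5 lookup are placeholders, not the actual
> Mondrian signatures):
>
>     // Critical section: only the map updates happen under the write lock.
>     private RolapSchema removeFromSchemaMaps(Object key) {
>         lock.writeLock().lock();
>         try {
>             Reference<RolapSchema> ref = mapKeyToSchema.remove(key);
>             RolapSchema schema = (ref == null) ? null : ref.get();
>             if (schema != null) {
>                 // placeholder for dropping the corresponding md5 entry
>                 mapMd5ToSchema.values().remove(ref);
>             }
>             return schema;
>         } finally {
>             lock.writeLock().unlock();
>         }
>     }
>
>     // Not locked: finalCleanUp() runs only after the critical section has
>     // been released, so the Actor's pending readLock() can be granted.
>     private void remove(Object key) {
>         RolapSchema schema = removeFromSchemaMaps(key);
>         if (schema != null) {
>             schema.finalCleanUp();
>         }
>     }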
>
> Any reactions to that approach?
>
>
>
> Thread dump excerpt:
>
>
>
> *Name: mondrian.rolap.agg.SegmentCacheManager$ACTOR*
>
> State: WAITING on
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@1bad7c39
> owned by: mondrian.server.DynamicContentFinder$timer
>
> Total blocked: 0  Total waited: 290
>
>
>
> Stack trace:
>
> sun.misc.Unsafe.park(Native Method)
>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
>
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
>
> mondrian.rolap.RolapSchemaPool.getRolapSchemas(RolapSchemaPool.java:513) – attempting
> to acquire lock.readLock
>
> mondrian.rolap.RolapSchema.getRolapSchemas(RolapSchema.java:990)
>
>
> mondrian.rolap.agg.SegmentCacheManager.getStar(SegmentCacheManager.java:1592)
>
>
> mondrian.rolap.agg.SegmentCacheManager$Handler.visit(SegmentCacheManager.java:634)
>
>
> mondrian.rolap.agg.SegmentCacheManager$ExternalSegmentCreatedEvent.acceptWithoutResponse(SegmentCacheManager.java:1195)
>
>
> mondrian.rolap.agg.SegmentCacheManager$Actor.run(SegmentCacheManager.java:992)
>
> java.lang.Thread.run(Thread.java:744)
>
>
>
>
>
> *Name: mondrian.server.DynamicContentFinder$timer*
>
> State: WAITING on java.util.concurrent.CountDownLatch$Sync@330e0060
>
> Total blocked: 8  Total waited: 327
>
>
>
> Stack trace:
>
> sun.misc.Unsafe.park(Native Method)
>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
>
> mondrian.util.SlotFuture.get(SlotFuture.java:87)
>
> mondrian.olap.Util.safeGet(Util.java:2101)
>
> mondrian.util.BlockingHashMap.get(BlockingHashMap.java:63)
>
>
> mondrian.rolap.agg.SegmentCacheManager$Actor.execute(SegmentCacheManager.java:1017)
>
>
> mondrian.rolap.agg.SegmentCacheManager.execute(SegmentCacheManager.java:313)
>
>
> mondrian.rolap.agg.AggregationManager$1.flushNonUnion(AggregationManager.java:129)
>
> mondrian.rolap.CacheControlImpl.flushInternal(CacheControlImpl.java:204)
>
> mondrian.rolap.CacheControlImpl.access$000(CacheControlImpl.java:36)
>
> mondrian.rolap.CacheControlImpl$1.execute(CacheControlImpl.java:179)
>
> mondrian.rolap.CacheControlImpl$1.execute(CacheControlImpl.java:177)
>
> mondrian.server.Locus.execute(Locus.java:86)
>
> mondrian.server.Locus.execute(Locus.java:71)
>
> mondrian.rolap.CacheControlImpl.flush(CacheControlImpl.java:174)
>
> mondrian.rolap.agg.AggregationManager$1.flush(AggregationManager.java:152)
>
> mondrian.rolap.RolapSchema.flushSegments(RolapSchema.java:340)
>
> mondrian.rolap.RolapSchema.finalCleanUp(RolapSchema.java:362)
>
> mondrian.rolap.RolapSchemaPool.remove(RolapSchemaPool.java:474) – has
> acquired lock.writeLock at this point
>
> mondrian.rolap.RolapSchemaPool.remove(RolapSchemaPool.java:462)
>
> mondrian.rolap.CacheControlImpl.flushSchema(CacheControlImpl.java:300)
>
>
> mondrian.server.DynamicContentFinder.flushCatalog(DynamicContentFinder.java:184)
>
>
> mondrian.server.DynamicContentFinder.flushObsoleteCatalogs(DynamicContentFinder.java:175)
>
>    - locked com.truven.mondrian.server.DbContentFinder@58801091
>
>
> mondrian.server.DynamicContentFinder.reloadDataSources(DynamicContentFinder.java:103)
>
>    - locked com.truven.mondrian.server.DbContentFinder@58801091
>
> mondrian.server.DynamicContentFinder$1.run(DynamicContentFinder.java:73)
>
> java.util.TimerThread.mainLoop(Timer.java:555)
>
> java.util.TimerThread.run(Timer.java:505)
>
>
>
>
>
> --jeff
>
>
>
> *From:* mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org]
> *On Behalf Of *Wright, Jeff
> *Sent:* Thursday, March 31, 2016 8:57 AM
> *To:* Mondrian developer mailing list <mondrian at pentaho.org>
> *Subject:* Re: [Mondrian] hanging application: segment cache manager and
> schema load?
>
>
>
> http://jira.pentaho.com/browse/MONDRIAN-2476
>
>
>
> Thanks Luc! Based on your pointers, my interpretation is the deadlock can
> occur when:
>
>
>
> -          DynamicContentFinder is running (too bad I misspelled that in
> the JIRA title)
>
> -          A distributed cache is in place
>
>
>
> The environment where this occurred has about 50 catalogs loaded.
>
>
>
> --jeff
>
>
>
> *From:* mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org
> <mondrian-bounces at pentaho.org>] *On Behalf Of *Luc Boudreau
> *Sent:* Wednesday, March 30, 2016 11:46 AM
> *To:* Mondrian developer mailing list <mondrian at pentaho.org>
> *Subject:* Re: [Mondrian] hanging application: segment cache manager and
> schema load?
>
>
>
> Thanks for the file. Looking at it, I've found several threads waiting on
> lock 0x00000005e23c2028. Do a search for this value and you'll find them.
>
>
>
> Looking at other threads, I can also tell that the system was in the
> middle of a flush operation. Thread 0x00007f1b34146000 is currently
> flushing obsolete catalogs. This same thread owns the lock on the schema
> pool. It's a deadlock.
>
>
>
>  - Thread A tries to clear the cache of some elements. It first acquires a
> lock on the schema pool.
>
>  - Thread A then sends a message to the Actor to do the actual flush, so
> it waits for an answer.
>
>  - Thread B, the Actor, meanwhile receives an external event. To process
> it, it needs a lock on the pool.
>
>
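> A generic, self-contained sketch of that interleaving (plain JDK code, not
> Mondrian's; the flusher stands in for Thread A and a single-threaded
> executor for the Actor):
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.Future;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>
>     public class PoolActorDeadlockDemo {
>         public static void main(String[] args) throws Exception {
>             ReentrantReadWriteLock poolLock = new ReentrantReadWriteLock();
>             ExecutorService actor = Executors.newSingleThreadExecutor();
>
>             poolLock.writeLock().lock();         // Thread A starts the flush
>             try {
>                 // An external segment event arrives; the Actor starts
>                 // handling it and blocks waiting for the pool's read lock.
>                 actor.submit(() -> {
>                     poolLock.readLock().lock();  // Thread B blocks here
>                     try { /* index the new segment */ }
>                     finally { poolLock.readLock().unlock(); }
>                 });
>
>                 // Thread A now asks the Actor to do the actual flush and
>                 // waits for the answer, but the Actor is stuck above.
>                 Future<?> flushDone = actor.submit(() -> { /* flush */ });
>                 flushDone.get();                 // deadlock: never returns
>             } finally {
>                 poolLock.writeLock().unlock();
>                 actor.shutdownNow();
>             }
>         }
>     }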
>
> That's unfortunate timing. We'll need to think about how to fix this
> correctly. Please file a Jira bug with as much information as you can. If
> you can share some of the thread snippets, that's also useful to confirm
> our tests.
>
>
>
>
>
>
>
> On Wed, Mar 30, 2016 at 11:27 AM, Luc Boudreau <lucboudreau at gmail.com>
> wrote:
>
>
>
> Your description is exact.
>
>
>
> I'm a little surprised that no other threads are waiting on that lock.
> Would you mind sharing the full thread dump? Maybe privately?
>
>
>
> On Wed, Mar 30, 2016 at 11:21 AM, Wright, Jeff <
> jeff.s.wright at truvenhealth.com> wrote:
>
> There are no other threads waiting on the same monitor. I have 5 thread
> dumps from 5 different app servers, and all of them have this same pattern.
> There are two Actor threads; one of them is blocked on getRolapSchemas(),
> and no other thread references the same monitor.
>
>
>
> From your description, here’s how I imagine things working with a 5-node
> cluster and a distributed cache... If a user runs MDX against CatalogA on
> Node1, the results of that query will get added to the distributed segment
> cache on Node1. The distributed cache will send an event to Nodes 2-5. That
> event will allow Nodes 2-5 to index the new segments. Indexing will force
> CatalogA to be loaded on Nodes 2-5, if it’s not already loaded. Is that
> right?
>
>
>
> --jeff
>
>
>
> *From:* mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org]
> *On Behalf Of *Luc Boudreau
> *Sent:* Wednesday, March 30, 2016 10:34 AM
> *To:* Mondrian developer mailing list <mondrian at pentaho.org>
> *Subject:* Re: [Mondrian] hanging application: segment cache manager and
> schema load?
>
>
>
> This stack tells us that the Actor has received a notification from an
> external event. A new segment must be indexed. The Actor is waiting on the
> RolapSchemaPool to free up so that it can grab a Star instance and pin the
> segment.
>
>
>
> What other threads are waiting on that same monitor? (0x00007f1ad965f000)
>
>
>
> On Wed, Mar 30, 2016 at 9:14 AM, Wright, Jeff <
> jeff.s.wright at truvenhealth.com> wrote:
>
> We’ve seen our application hang a couple times. Looking at thread dumps,
> I’m suspicious of this excerpt:
>
>
>
> "mondrian.rolap.agg.SegmentCacheManager$ACTOR" daemon prio=10
> tid=0x00007f1b34170000 nid=0xf25 waiting for monitor entry
> [0x00007f1ad965f000]
>
>    java.lang.Thread.State: BLOCKED (on object monitor)
>
>                 at
> mondrian.rolap.RolapSchemaPool.getRolapSchemas(RolapSchemaPool.java:420)
>
>                 - waiting to lock <0x00000005e23c2028> (a
> mondrian.rolap.RolapSchemaPool)
>
>                 at
> mondrian.rolap.RolapSchema.getRolapSchemas(RolapSchema.java:930)
>
>                 at
> mondrian.rolap.agg.SegmentCacheManager.getStar(SegmentCacheManager.java:1621)
>
>                 at
> mondrian.rolap.agg.SegmentCacheManager$Handler.visit(SegmentCacheManager.java:661)
>
>                 at
> mondrian.rolap.agg.SegmentCacheManager$ExternalSegmentCreatedEvent.acceptWithoutResponse(SegmentCacheManager.java:1222)
>
>                 at
> mondrian.rolap.agg.SegmentCacheManager$Actor.run(SegmentCacheManager.java:1019)
>
>                 at java.lang.Thread.run(Thread.java:724)
>
>
>
> I don’t fully understand SegmentCacheManager, but based on Julian’s 2012
> blog post I get the impression the Actor thread is supposed to run very
> quickly. If that corresponds to the stack trace above, that’s a big
> problem - we see schema loads take minutes.
>
>
>
> I also see that there were some code changes in August last year for
> MONDRIAN-2390, to make locking for schema load lower level. We don’t have
> that code.
>
>
>
> Btw we have a distributed cache.
>
>
>
> Does it sound like I’m on to a problem in our environment? Maybe even a
> general problem?
>
>
>
> --Jeff Wright
>
>