[Mondrian] Possible race condition in 3.6.x SegmentCacheManager/Actor?

Joe Barnett thejoe at gmail.com
Wed Apr 30 11:30:56 EDT 2014


It looks like it happens ~3-5 times per day, on servers receiving ~15-25K
requests per day.  I haven't been able to narrow down specifically what
triggers it, but it does appear to be confined to a specific type of
request.  We instantiate separate MondrianServer instances for each of our
clients (whose data is kept in client-specific tables), as well as one
MondrianServer instance for data that is stored in a cross-client set of
tables.  The queries that get stuck all hit the cross-client instance.
They also tend to be fired roughly concurrently with another request for
the same data, because the data is rendered in both a chart and a grid on
the same page and the front-end code doesn't reuse a single data request
between the two widgets.  Approximately 600-700 requests per day are of
this type.
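
In case it helps reproduce it, here's a rough sketch of that access
pattern.  For brevity it uses the plain mondrian.olap connection API rather
than our MondrianServer setup, and the class name, connect string, cube,
and MDX are placeholders rather than our real ones:

// Sketch only: two near-simultaneous requests running the same MDX against
// the shared cross-client instance, which is the shape of traffic that hangs.
import mondrian.olap.Connection;
import mondrian.olap.DriverManager;
import mondrian.olap.Query;
import mondrian.olap.Result;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentQuerySketch {
    public static void main(String[] args) throws Exception {
        // Placeholders, not our real connect string, cube, or measure.
        final String connectString =
            "Provider=mondrian;Jdbc=jdbc:...;Catalog=/path/to/schema.xml";
        final String mdx =
            "SELECT {[Measures].[Row Count]} ON COLUMNS FROM [CrossClientCube]";

        Callable<Result> request = new Callable<Result>() {
            public Result call() throws Exception {
                // Each request opens its own connection, but both resolve to
                // the same underlying server and SegmentCacheManager.
                Connection connection =
                    DriverManager.getConnection(connectString, null);
                try {
                    Query query = connection.parseQuery(mdx);
                    return connection.execute(query);
                } finally {
                    connection.close();
                }
            }
        };

        // The chart and the grid each fire their own copy of the request.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(request);
        pool.submit(request);
        pool.shutdown();
    }
}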

Thanks,
-Joe



On Wed, Apr 30, 2014 at 5:41 AM, Matt Campbell <mcampbell at pentaho.com> wrote:

> I don't have any good guesses about what's going on.  From your
> description it does sound like the responseMap must be getting into a bad
> state.
>
> How reproducible is this for you?  Is there anything notable about level
> of concurrency or workload when it happens?
>
>
> ________________________________________
> From: mondrian-bounces at pentaho.org [mondrian-bounces at pentaho.org] On
> Behalf Of Joe Barnett [thejoe at gmail.com]
> Sent: Tuesday, April 29, 2014 4:08 PM
> To: Mondrian developer mailing list
> Subject: [Mondrian] Possible race condition in 3.6.x SegmentCacheManager/Actor?
>
> Hello All,
>
> We've been migrating our mondrian implementation from 3.2.x to 3.6.x for a
> while now, and are getting close to being ready to push to production.
>  Unfortunately, we're currently seeing what looks like some sort of race
> condition in our test environment.  I haven't been able to isolate the case
> that causes the problem, but after running for a while, we see threads
> locked up when examining the JVM thread state.  We see our query thread
> waiting for a RolapResultShepherd$executor thread[1] (no problem there),
> but that executor thread is waiting[2] for the SegmentCacheManager$ACTOR to
> get back to it, and the actor itself is waiting for work to do[3].  It
> seems that the command is getting submitted to the Actor but somehow
> dropped, or perhaps it is processed but the responseMap entry is cleaned
> up before the executor can .get() it (or there's some sort of hash/equals
> collision on the command objects somehow)?  Possibly these requests are
> happening concurrently with cache flushes / MondrianServer shutdowns, and
> the queue is not being fully processed?
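>
> To make the failure mode concrete, here is a stripped-down sketch of the
> blocking handoff as I understand it (my own illustrative names, not the
> actual SegmentCacheManager/BlockingHashMap code).  The query thread waits
> on a per-command latch with no timeout, so any command that never produces
> a response leaves it parked forever:
>
> // Sketch only: illustrative names, not the real Mondrian classes.
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.CountDownLatch;
>
> class BlockingResponseMap<K, V> {
>     private final ConcurrentMap<K, Entry<V>> map =
>         new ConcurrentHashMap<K, Entry<V>>();
>
>     private static final class Entry<V> {
>         final CountDownLatch latch = new CountDownLatch(1);
>         volatile V value;
>     }
>
>     private Entry<V> entryFor(K key) {
>         Entry<V> entry = new Entry<V>();
>         Entry<V> prev = map.putIfAbsent(key, entry);
>         return prev != null ? prev : entry;
>     }
>
>     // Actor side: publish the response and wake the waiting query thread.
>     public void put(K key, V value) {
>         Entry<V> entry = entryFor(key);
>         entry.value = value;
>         entry.latch.countDown();
>     }
>
>     // Query-thread side: blocks with no timeout.  If the command was
>     // dropped before a response was put, or the response was filed under a
>     // colliding hashCode/equals key, this parks in WAITING forever,
>     // matching stack [2] below.
>     public V get(K key) throws InterruptedException {
>         Entry<V> entry = entryFor(key);
>         entry.latch.await();
>         map.remove(key);
>         return entry.value;
>     }
> }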
>
> Does anyone with more knowledge of how the SegmentCacheManager works have
> any suggestions on things we should look into?  Our current implementation
> is based on git commit id 7498b4a4bcfb791e303de20407830aaa61753042 (3.6
> branch from Fri Jan 24 05:55:31 2014 -0800).  We had previously run into
> MONDRIAN-1726 when we were based on git commit id
> 609d3f16ee772456f405ca21f9d2a6053ea5cb7f from Wed Dec 4 09:29:51 2013
> -0500; pulling mkambol's change seems to have fixed that deadlock, but
> revealed this issue.
>
> Thanks,
> -Joe
>
>
> [1] "http-8080-8849951882670653653" daemon prio=10 tid=0x00007fe5c4334800
> nid=0x11ca waiting on condition [0x00007fe51e1df000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000006ab9877e8> (a
> java.util.concurrent.FutureTask$Sync)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>         at
> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>         at
> mondrian.rolap.RolapResultShepherd.shepherdExecution(RolapResultShepherd.java:146)
>         at mondrian.rolap.RolapConnection.execute(RolapConnection.java:616)
>         at mondrian.rolap.RolapConnection.execute(RolapConnection.java:600)
>         ....
>
> [2] "mondrian.rolap.RolapResultShepherd$executor_1" daemon prio=10
> tid=0x00007fe54c020800 nid=0x5139 waiting on condition [0x00007fe524ecd000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000005156ef888> (a
> java.util.concurrent.CountDownLatch$Sync)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>         at
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>         at mondrian.util.SlotFuture.get(SlotFuture.java:87)
>         at mondrian.olap.Util.safeGet(Util.java:2096)
>         at mondrian.util.BlockingHashMap.get(BlockingHashMap.java:63)
>         at
> mondrian.rolap.agg.SegmentCacheManager$Actor.execute(SegmentCacheManager.java:1025)
>         at
> mondrian.rolap.agg.SegmentCacheManager.execute(SegmentCacheManager.java:350)
>         at
> mondrian.rolap.FastBatchingCellReader.loadAggregations(FastBatchingCellReader.java:243)
>         at mondrian.rolap.RolapResult.phase(RolapResult.java:563)
>         at mondrian.rolap.RolapResult.executeBody(RolapResult.java:926)
>         at mondrian.rolap.RolapResult.<init>(RolapResult.java:469)
>         at
> mondrian.rolap.RolapConnection.executeInternal(RolapConnection.java:671)
>         at
> mondrian.rolap.RolapConnection.access$000(RolapConnection.java:51)
>         at mondrian.rolap.RolapConnection$1.call(RolapConnection.java:622)
>         at mondrian.rolap.RolapConnection$1.call(RolapConnection.java:620)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
>
> [3] "mondrian.rolap.agg.SegmentCacheManager$ACTOR" daemon prio=10
> tid=0x00007fe53c15d000 nid=0x524e waiting on condition [0x00007fe50ffa5000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000005ecc0cac8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at
> java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:322)
>         at
> mondrian.rolap.agg.SegmentCacheManager$Actor.run(SegmentCacheManager.java:967)
>         at java.lang.Thread.run(Thread.java:662)