[Mondrian] Possible race condition in 3.6.x SegmentCacheManager/Actor?

Joe Barnett thejoe at gmail.com
Fri May 16 14:46:51 EDT 2014


We create a new MondrianServer for each of our cubes because we were seeing
contention between them: the SegmentCacheManager$ACTOR thread is shared
between cubes/connections in the same server, and certain cubes are much
slower to load segments than others due to underlying db state / table size /
etc.  We shut the servers down programmatically when we're "done" with them,
so as not to leak resources.

I've implemented (a version of) the shutdown change here:
https://github.com/pentaho/mondrian/pull/310

And we've also modified our application stack to let queries finish before
shutting down a server (which we want to do anyway so that those queries can
return results).
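
Conceptually that just means tracking the in-flight queries per server and
deferring the shutdown() call until they've drained.  Something like the
simplified sketch below -- the class and method names are made up for
illustration, only MondrianServer#shutdown() is actual Mondrian API:

    import java.util.concurrent.Callable;
    import mondrian.olap.MondrianServer;

    // Sketch: defer MondrianServer.shutdown() until the last in-flight
    // query has completed, and refuse new queries once a shutdown has
    // been requested.
    class ServerLifecycle {
        private final MondrianServer server;
        private int inFlight;              // guarded by "this"
        private boolean shutdownRequested; // guarded by "this"
        private boolean shutDown;          // guarded by "this"

        ServerLifecycle(MondrianServer server) {
            this.server = server;
        }

        <T> T execute(Callable<T> query) throws Exception {
            synchronized (this) {
                if (shutdownRequested) {
                    throw new IllegalStateException("server is shutting down");
                }
                inFlight++;
            }
            try {
                return query.call();
            } finally {
                maybeShutdown(true);
            }
        }

        void requestShutdown() {
            maybeShutdown(false);
        }

        private void maybeShutdown(boolean queryFinished) {
            boolean doIt;
            synchronized (this) {
                if (queryFinished) {
                    inFlight--;
                } else {
                    shutdownRequested = true;
                }
                doIt = shutdownRequested && inFlight == 0 && !shutDown;
                if (doIt) {
                    shutDown = true;
                }
            }
            if (doIt) {
                server.shutdown();  // the only place shutdown() is called
            }
        }
    }

Callers wrap each Mondrian query in execute(), and the last query out turns
off the lights, so nothing can queue a command against a dead Actor.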

Thanks,
-Joe


On Fri, May 9, 2014 at 10:08 AM, Luc Boudreau <lucboudreau at gmail.com> wrote:

> Seems like you found the issue. A couple questions before proposing a
> solution. Why are you shutting down the MondrianServer instance? Is it
> getting GC'd, or is it your code that shuts it down?
>
> To make things better, I'd recommend that when a ShutdownCommand is
> processed, the Actor sets a flag telling it to finish its queue and not
> accept new events. Once the queue is empty, the graceful shutdown can be
> done.
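>
> Something like the following, purely as a sketch of the idea (this is not
> the actual SegmentCacheManager$Actor code, and the names are made up):
>
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>
>     // "Drain then exit" actor loop: a shutdown command flips a flag so
>     // execute() rejects new work, but the loop keeps answering whatever
>     // is already queued before it returns.
>     class DrainingActor implements Runnable {
>         private static final Runnable SHUTDOWN = new Runnable() {
>             public void run() { }
>         };
>         private final BlockingQueue<Runnable> eventQueue =
>             new LinkedBlockingQueue<Runnable>();
>         private volatile boolean shuttingDown;
>
>         public void run() {
>             try {
>                 while (true) {
>                     Runnable command = eventQueue.take();
>                     if (command == SHUTDOWN) {
>                         shuttingDown = true;   // accept no new commands...
>                     } else {
>                         command.run();         // ...but finish the queued ones
>                     }
>                     if (shuttingDown && eventQueue.isEmpty()) {
>                         return;                // graceful exit, queue drained
>                     }
>                 }
>             } catch (InterruptedException e) {
>                 Thread.currentThread().interrupt();
>             }
>         }
>
>         public void execute(Runnable command) {
>             if (shuttingDown) {                // fail fast instead of parking
>                 throw new IllegalStateException("actor is shut down");
>             }
>             eventQueue.add(command);
>         }
>
>         public void shutdown() {
>             eventQueue.add(SHUTDOWN);
>         }
>     }
>
> (There is still a small window between the shuttingDown check and the add()
> in execute(); closing it completely means making that check-and-enqueue
> atomic with respect to the drain, e.g. under a lock.)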
>
> Luc
>
>
> On Fri, May 9, 2014 at 12:53 PM, Joe Barnett <thejoe at gmail.com> wrote:
>
>> A little more progress here, as I've been able to narrow down the cause:
>>
>> A query is executing while MondrianServer#shutdown() is called.  That
>> query ends up queuing a SegmentCacheManager$PeekCommand after the shutdown
>> invocation queues a SegmentCacheManager$ShutdownCommand.  The Actor thread
>> processes the eventQueue in order, sees the ShutdownCommand, and exits its
>> loop.  Meanwhile, the eventQueue still holds a reference to the
>> PeekCommand, and the query execution thread is still waiting in the
>> responseMap for a response to the PeekCommand it submitted.
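>>
>> For context on why the submitting thread never wakes up: each command waits
>> on its own slot in the responseMap until the Actor posts a response for it,
>> so if the Actor exits first, nothing ever fills the slot.  A simplified
>> illustration of that blocking lookup (not the real
>> mondrian.util.BlockingHashMap):
>>
>>     import java.util.concurrent.ArrayBlockingQueue;
>>     import java.util.concurrent.BlockingQueue;
>>     import java.util.concurrent.ConcurrentHashMap;
>>     import java.util.concurrent.ConcurrentMap;
>>
>>     // get() parks until some other thread put()s a value for the same key;
>>     // with the Actor gone, that put() never happens and get() waits forever.
>>     class BlockingResponseMap<K, V> {
>>         private final ConcurrentMap<K, BlockingQueue<V>> slots =
>>             new ConcurrentHashMap<K, BlockingQueue<V>>();
>>
>>         private BlockingQueue<V> slot(K key) {
>>             BlockingQueue<V> q = new ArrayBlockingQueue<V>(1);
>>             BlockingQueue<V> prev = slots.putIfAbsent(key, q);
>>             return prev != null ? prev : q;
>>         }
>>
>>         public V get(K key) throws InterruptedException {
>>             return slot(key).take();   // the submitting thread parks here
>>         }
>>
>>         public void put(K key, V value) {
>>             slot(key).add(value);      // only the Actor thread calls this
>>         }
>>     }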
>>
>> I'm going to look into modifying the Actor such that it will:
>> a) make sure to not accept any new execute() commands after it sees a
>> ShutdownCommand (this may already be the case?), and
>> b) fully flush the queue before exiting its loop
>>
>> There may be some lifecycle issues to look into as well (or instead) in our
>> application code, possibly waiting until all query threads are done before
>> shutting down the MondrianServer.  We generally create a new
>> connection/MondrianServer while old queries may still be executing against
>> the old one, and ideally those queries would be allowed to finish before we
>> shut everything down.
>>
>> Thoughts?
>>
>> Thanks,
>> -Joe
>>
>>
>> On Wed, Apr 30, 2014 at 8:30 AM, Joe Barnett <thejoe at gmail.com> wrote:
>>
>>> It looks like it happens ~3-5 times per day, on servers receiving
>>> ~15-25K requests per day.  I haven't been able to narrow down specifically
>>> what triggers it, but it does appear to confine itself to a specific type
>>> of request.  We instantiate separate MondrianServer instances for each of
>>> our clients (whose data is kept in client-specific tables), as well as one
>>> MondrianServer instance for data that is stored in a cross-client set of
>>> tables.  The queries that get stuck all hit the cross-client instance.
>>> They also tend to be fired roughly concurrently with another request for
>>> the same data, because the data is rendered in both a chart and a grid on
>>> the same page and the front-end code doesn't reuse the data request between
>>> the two widgets.  Approximately 600-700 requests per day are of this type.
>>>
>>> Thanks,
>>> -Joe
>>>
>>>
>>>
>>> On Wed, Apr 30, 2014 at 5:41 AM, Matt Campbell <mcampbell at pentaho.com>wrote:
>>>
>>>> I don't have any good guesses about what's going on.  From your
>>>> description it does sound like the responseMap must be getting into a bad
>>>> state.
>>>>
>>>> How reproducible is this for you?  Is there anything notable about the
>>>> level of concurrency or workload when it happens?
>>>>
>>>>
>>>> ________________________________________
>>>> From: mondrian-bounces at pentaho.org [mondrian-bounces at pentaho.org] On
>>>> Behalf Of Joe Barnett [thejoe at gmail.com]
>>>> Sent: Tuesday, April 29, 2014 4:08 PM
>>>> To: Mondrian developer mailing list
>>>> Subject: [Mondrian] Possible race condition in 3.6.x
>>>>  SegmentCacheManager/Actor?
>>>>
>>>> Hello All,
>>>>
>>>> We've been migrating our mondrian implementation from 3.2.x to 3.6.x
>>>> for a while now, and are getting close to being ready to push to
>>>> production.  Unfortunately, we're currently seeing what looks like some
>>>> sort of race condition in our test environment.  I haven't been able to
>>>> isolate the case that causes the problem, but after running for a while, we
>>>> see threads locked up when examining the JVM thread state.  We see our
>>>> query thread waiting for a RolapResultShepherd$executor thread[1] (no
>>>> problem there), but that executor thread is waiting[2] for the
>>>> SegmentCacheManager$ACTOR to get back to it, and the actor itself is
>>>> waiting for work to do[3].  It seems that the command is getting submitted
>>>> to the Actor but somehow dropped, or perhaps it is processed but its entry
>>>> in the responseMap is getting cleaned up before the executor can .get() it
>>>> (or there's some sort of hash/equals collision on the command objects)?
>>>> Possibly these requests are happening concurrently with cache flushes /
>>>> MondrianServer shutdowns, and the queue is not being fully processed?
>>>>
>>>> Does anyone with more knowledge of how the SegmentCacheManager works
>>>> have any suggestions on things we should look into?  Our current
>>>> implementation is based on git commit id
>>>> 7498b4a4bcfb791e303de20407830aaa61753042 (3.6 branch, Fri Jan 24
>>>> 05:55:31 2014 -0800).  We had previously run into MONDRIAN-1726 when we
>>>> were based on git commit id 609d3f16ee772456f405ca21f9d2a6053ea5cb7f (Wed
>>>> Dec 4 09:29:51 2013 -0500); pulling mkambol's change seems to have fixed
>>>> that deadlock, but revealed this issue.
>>>>
>>>> Thanks,
>>>> -Joe
>>>>
>>>>
>>>> [1] "http-8080-8849951882670653653" daemon prio=10
>>>> tid=0x00007fe5c4334800 nid=0x11ca waiting on condition [0x00007fe51e1df000]
>>>>    java.lang.Thread.State: WAITING (parking)
>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>         - parking to wait for  <0x00000006ab9877e8> (a
>>>> java.util.concurrent.FutureTask$Sync)
>>>>         at
>>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>>>>         at
>>>> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>>>>         at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>>>>         at
>>>> mondrian.rolap.RolapResultShepherd.shepherdExecution(RolapResultShepherd.java:146)
>>>>         at
>>>> mondrian.rolap.RolapConnection.execute(RolapConnection.java:616)
>>>>         at
>>>> mondrian.rolap.RolapConnection.execute(RolapConnection.java:600)
>>>>         ....
>>>>
>>>> [2] "mondrian.rolap.RolapResultShepherd$executor_1" daemon prio=10
>>>> tid=0x00007fe54c020800 nid=0x5139 waiting on condition [0x00007fe524ecd000]
>>>>    java.lang.Thread.State: WAITING (parking)
>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>         - parking to wait for  <0x00000005156ef888> (a
>>>> java.util.concurrent.CountDownLatch$Sync)
>>>>         at
>>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>>>>         at
>>>> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>>>>         at mondrian.util.SlotFuture.get(SlotFuture.java:87)
>>>>         at mondrian.olap.Util.safeGet(Util.java:2096)
>>>>         at mondrian.util.BlockingHashMap.get(BlockingHashMap.java:63)
>>>>         at
>>>> mondrian.rolap.agg.SegmentCacheManager$Actor.execute(SegmentCacheManager.java:1025)
>>>>         at
>>>> mondrian.rolap.agg.SegmentCacheManager.execute(SegmentCacheManager.java:350)
>>>>         at
>>>> mondrian.rolap.FastBatchingCellReader.loadAggregations(FastBatchingCellReader.java:243)
>>>>         at mondrian.rolap.RolapResult.phase(RolapResult.java:563)
>>>>         at mondrian.rolap.RolapResult.executeBody(RolapResult.java:926)
>>>>         at mondrian.rolap.RolapResult.<init>(RolapResult.java:469)
>>>>         at
>>>> mondrian.rolap.RolapConnection.executeInternal(RolapConnection.java:671)
>>>>         at
>>>> mondrian.rolap.RolapConnection.access$000(RolapConnection.java:51)
>>>>         at
>>>> mondrian.rolap.RolapConnection$1.call(RolapConnection.java:622)
>>>>         at
>>>> mondrian.rolap.RolapConnection$1.call(RolapConnection.java:620)
>>>>         at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> [3] "mondrian.rolap.agg.SegmentCacheManager$ACTOR" daemon prio=10
>>>> tid=0x00007fe53c15d000 nid=0x524e waiting on condition [0x00007fe50ffa5000]
>>>>    java.lang.Thread.State: WAITING (parking)
>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>         - parking to wait for  <0x00000005ecc0cac8> (a
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>>         at
>>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>>>>         at
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>>>>         at
>>>> java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:322)
>>>>         at
>>>> mondrian.rolap.agg.SegmentCacheManager$Actor.run(SegmentCacheManager.java:967)
>>>>         at java.lang.Thread.run(Thread.java:662)