[Mondrian] CellBatchSize

Brandon Jackson usbrandon at gmail.com
Tue Jul 2 00:48:31 EDT 2013


It would be cool if we had a machine learning mode in Mondrian that we could enable to capture training data for finding the optimal set of adjustments, and then let the system tune itself with it.

We could then submit our training sets, and you could see whether there is convergence and therefore an ideal set of parameters.

It raises a curious question about the ways others use Mondrian.  I wonder whether some adaptive algorithm is possible.  So far, the whole crux of the conversation has been the question of consistency.  Prediction = Compression.



Sent from my iPhone

On Jul 1, 2013, at 1:31 PM, Luc Boudreau <lucboudreau at gmail.com> wrote:

> It's probably worth trying an optimization procedure that we can enable optionally, gathering some stats to figure out exactly how much of a difference it makes. We should also add a feedback loop to the algorithm so that it can be more or less greedy as it loops through the phases. The server monitor API should tell us how many cell requests are hits and how many are misses.
> 
> We should also use the Statistics SPI and take into account the estimated cardinality of the columns. I'm thinking that comparing the number of values requested in a given phase to the total number of values in the column is a cheap way to decide whether we want to cover the whole level in one swoop.
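> 
> A rough sketch of both ideas in one place (hypothetical names and thresholds, not existing Mondrian API):
> 
>     // Sketch only: not existing Mondrian code. Combines two heuristics:
>     // (1) use estimated column cardinality to decide whether to cover a
>     //     whole level in one swoop, and (2) a feedback loop that grows or
>     //     shrinks the batch size based on cell hit/miss counts reported
>     //     by the server monitor.
>     public class BatchSizingHeuristic {
>         private static final double COVERAGE_THRESHOLD = 0.5; // arbitrary
> 
>         /** True if the requested values already cover a large share of the
>          *  column's estimated cardinality; if so, dropping the IN-list and
>          *  loading the whole level is probably cheaper. */
>         public boolean coverWholeLevel(int requestedValueCount, long estimatedCardinality) {
>             if (estimatedCardinality <= 0) {
>                 return false; // no statistics available; keep the narrow predicate
>             }
>             return (double) requestedValueCount / estimatedCardinality >= COVERAGE_THRESHOLD;
>         }
> 
>         /** Grow the batch when most requests miss the cache (we are paying
>          *  for SQL round trips anyway); back off when most are hits. */
>         public int adjustBatchSize(int currentBatchSize, long cellHits, long cellMisses) {
>             long total = cellHits + cellMisses;
>             if (total == 0) {
>                 return currentBatchSize;
>             }
>             double missRatio = (double) cellMisses / total;
>             if (missRatio > 0.8) {
>                 return currentBatchSize * 2;
>             } else if (missRatio < 0.2) {
>                 return Math.max(1, currentBatchSize / 2);
>             }
>             return currentBatchSize;
>         }
>     }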
> 
> Luc
> 
> On Jul 1, 2013 2:18 PM, "Matt Campbell" <mcampbell at pentaho.com> wrote:
>> It seems like cellsets should usually have a predictable consistency.  If you’re crossjoining { [Customer].[1] : [Customer].[300000] } with [Measures].[MyMeasure], for example, once we’ve evaluated [MyMeasure] down to its component base measures for one intersection, it’s likely that the only thing that will differ from one set of cell requests to the next will be the level member constraint.  It’s possible that there will be no consistency—[MyMeasure] might change context out from under us based on the current Customer—but I’m guessing in practice that’s unusual.
>> 
>>  
>> 
>> I’m wondering if we could leverage that expected consistency to be more efficient in cell request creation.  We could do some limited sampling of cells up front, make a guess about which cell requests will ultimately be required, and then preload those cells.  If we guessed wrong, the worst that happens is that we’ve loaded cells into the cache that aren’t immediately needed. Best case, we get all required cells in a single pass.
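>> 
>> Something along these lines, as a sketch (every type and method here is hypothetical, not a Mondrian class): evaluate a few members for real to learn the shape of the cell requests, then assume the remaining cells differ only in the member constraint.
>> 
>>     import java.util.ArrayList;
>>     import java.util.List;
>>     import java.util.function.BiFunction;
>>     import java.util.function.Function;
>> 
>>     // Sketch only. MEMBER and REQUEST stand in for whatever the real member
>>     // and cell-request types would be; the callbacks hide the actual evaluator.
>>     public final class SpeculativePreloader<MEMBER, REQUEST> {
>> 
>>         private final Function<MEMBER, REQUEST> sampleEvaluator;     // evaluates one cell for real
>>         private final BiFunction<REQUEST, MEMBER, REQUEST> expander; // copies a sampled request, swapping the member
>> 
>>         public SpeculativePreloader(
>>                 Function<MEMBER, REQUEST> sampleEvaluator,
>>                 BiFunction<REQUEST, MEMBER, REQUEST> expander) {
>>             this.sampleEvaluator = sampleEvaluator;
>>             this.expander = expander;
>>         }
>> 
>>         /** Builds one batch of requests: the first few by real evaluation, the
>>          *  rest by guessing that only the member constraint changes. If the
>>          *  guess is wrong, we have merely warmed the cache with unneeded cells. */
>>         public List<REQUEST> buildBatch(List<MEMBER> members, int sampleSize) {
>>             List<REQUEST> batch = new ArrayList<>();
>>             int n = members.isEmpty() ? 0 : Math.max(1, Math.min(sampleSize, members.size()));
>>             REQUEST template = null;
>>             for (int i = 0; i < n; i++) {
>>                 template = sampleEvaluator.apply(members.get(i));
>>                 batch.add(template);
>>             }
>>             for (int i = n; i < members.size(); i++) {
>>                 batch.add(expander.apply(template, members.get(i)));
>>             }
>>             return batch;
>>         }
>>     }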
>> 
>>  
>> 
>>  
>> 
>> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Julian Hyde
>> Sent: Friday, June 28, 2013 7:14 PM
>> To: Mondrian developer mailing list
>> Subject: Re: [Mondrian] CellBatchSize
>> 
>>  
>> 
>> How can we do better? My best guess is that we should try to "compact" batches before sending them out. 
>> 
>>  
>> 
>> And I think we should do better at automatically tuning the batch size based on available resources. Every extra parameter is something else the user can screw up. :)
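>> 
>> For illustration only (nothing below is existing Mondrian code): compacting could start as simple de-duplication of requests before a batch is flushed, and the size limit could be derived from available heap instead of a fixed property.
>> 
>>     import java.util.ArrayList;
>>     import java.util.LinkedHashSet;
>>     import java.util.List;
>> 
>>     // Sketch: assumes the request type R has sensible equals()/hashCode().
>>     public class BatchCompactor<R> {
>> 
>>         /** "Compacts" a batch by dropping exact-duplicate requests, keeping order. */
>>         public List<R> compact(List<R> requests) {
>>             return new ArrayList<>(new LinkedHashSet<>(requests));
>>         }
>> 
>>         /** Guesses a batch size from free heap; the 1 KB per-request estimate
>>          *  and the 25% budget are arbitrary placeholders, not measured numbers. */
>>         public int suggestBatchSize() {
>>             long freeBytes = Runtime.getRuntime().freeMemory();
>>             long perRequestBytes = 1024;
>>             long budget = freeBytes / 4;
>>             return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, budget / perRequestBytes));
>>         }
>>     }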
>> 
>>  
>> 
>> Other suggestions welcome.
>> 
>>  
>> 
>> Julian
>> 
>>  
>> 
>>  
>> 
>> On Jun 28, 2013, at 7:06 AM, Matt Campbell <mcampbell at pentaho.com> wrote:
>> 
>> 
>> 
>> 
>>  
>> 
>> I definitely get your point about the resource cost of cell requests, and the issue of duplicate requests.  I imagine it’s a common scenario to have the same measure repeated many times in the same query (as the denominator of a set of ratio measures, for example).  And I’m sure there are other cases where cell requests can overlap, both within a single query and across concurrent queries.
>> 
>>  
>> 
>> Another benefit (intentional?) that I noticed is that if predicate optimization occurs after hitting the cell size threshold, it’s possible that all required cells will get loaded in a single batch anyway.   For example, if you’re crossjoining all members of multiple levels, hitting the batch size limit may allow short-circuiting cell requests that will be fulfilled by a single, predicate-optimized SQL query.  I think the flip side of that is where some performance complaints have been coming from.  If a query has a large number of non-duplicated cells, as you would with a big crossjoin, and predicate optimization doesn’t happen, then you’re stuck with many expensive SQL queries.
>> 
>>  
>> 
>> So yes, knowing what number between 1->infinity to use sounds tricky.  Even more so given that the best answer varies by MDX and concurrent activity.
>> 
>>  
>> 
>> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Julian Hyde
>> Sent: Thursday, June 27, 2013 7:38 PM
>> To: Mondrian developer mailing list
>> Subject: Re: [Mondrian] CellBatchSize
>> 
>>  
>> 
>> To see why we batch, consider the extremes.
>> 
>>  
>> 
>> 1. Don't batch at all. (Equivalent to batch size = 1.) Each time a cell is missing from the query's cell cache, we send a request to the cell manager and wait for a response. The cell manager is an agent running in another thread, so this round trip involves two queues. If the cell is missing from the shared cache, the cell manager will generate and execute a SQL statement. There will tend to be a lot of small SQL statements, each asking for a very specific cell. Answering each of them may still require significant effort from the database, e.g. a scan.
>> 
>>  
>> 
>> 2. Gather all requests generated by an MDX execution pass into a single batch. (Equivalent to batch size = INFINITY.) Each MDX execution pass won't ask for cells that have been successfully fetched by a previous pass. Each cell request requires a considerable amount of memory, so large batches might run out of memory. We don't make an effort to identify duplicate cell requests while we are building a batch, so a large batch might have many duplicates of the same request.
>> 
>>  
>> 
>> A large batch will tend to generate more efficient SQL.
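>> 
>> To make the contrast concrete, the two extremes produce roughly these SQL shapes (illustrative only, against a made-up star schema; not SQL actually emitted by Mondrian):
>> 
>>     // Illustrative only: not SQL generated by Mondrian.
>>     public class BatchSqlShapes {
>> 
>>         // Batch size 1: one tiny statement per missing cell; each one may
>>         // still cost the database a scan of the fact table.
>>         static final String PER_CELL =
>>             "select sum(f.unit_sales) from sales_fact f"
>>             + " join customer c on f.customer_id = c.customer_id"
>>             + " where c.customer_id = 42";
>> 
>>         // Large batch: one statement with a grouped IN-list (or no predicate
>>         // at all if the batch covers the whole level), answered in one pass.
>>         static final String BATCHED =
>>             "select c.customer_id, sum(f.unit_sales) from sales_fact f"
>>             + " join customer c on f.customer_id = c.customer_id"
>>             + " where c.customer_id in (1, 2, 3 /* ... */)"
>>             + " group by c.customer_id";
>>     }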
>> 
>>  
>> 
>> A small batch gets results sooner, and so allows Mondrian to resolve conditionals such as "Iif([Measures].[Unit Sales] > 10, [Measures].[Store Sales], [Measures].[Store Cost])" sooner, and therefore send fewer cell requests overall.
>> 
>>  
>> 
>> The conclusion: the ideal batch size is somewhere between 1 and infinity. The trouble is figuring out where.
>> 
>>  
>> 
>> Julian
>> 
>>  
>> 
>>  
>> 
>> On Jun 27, 2013, at 6:34 AM, Luc Boudreau <lucboudreau at gmail.com> wrote:
>> 
>> 
>> 
>> 
>> 
>> We have a couple of comments in the code about what we were foreseeing for the future. What we'd like to get to is a pluggable system for defining the batching rules, not just in terms of size, but also able to determine whether it is worth sharing a particular batch across threads or whether it would be cheaper to duplicate some cells. One concrete example is a big query that pulls a lot of cells and takes a long time to execute, while a second, smaller query comes in afterwards and has to wait for a subset of the big segment. Sometimes it is cheaper and more effective to fragment the cache.
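>> 
>> As a sketch, the pluggable piece might look something like this (hypothetical SPI, not existing code):
>> 
>>     // Hypothetical SPI sketch: a policy that decides both how big a batch may
>>     // grow and whether a waiting query should share an in-flight segment or
>>     // load its own (possibly duplicated) cells, i.e. fragment the cache.
>>     public interface BatchingPolicy {
>> 
>>         /** Maximum number of cell requests to accumulate before flushing a batch. */
>>         int maxBatchSize(QueryContext context);
>> 
>>         /** False means: don't wait for the big in-flight segment, load a
>>          *  smaller duplicate instead. */
>>         boolean shareInFlightSegment(QueryContext waitingQuery, SegmentDescriptor inFlight);
>> 
>>         /** Placeholder types; in practice these would carry query and segment stats. */
>>         interface QueryContext { }
>>         interface SegmentDescriptor { }
>>     }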
>> 
>>  
>> 
>> I'd prefer that we address this issue in its broader application, rather than focus solely on the number of cells.
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> On Thu, Jun 27, 2013 at 9:22 AM, Matt Campbell <mcampbell at pentaho.com> wrote:
>> 
>> There have been reports in the forum over the past few months of cases where performance is much worse in Mondrian 3.5/6 compared to 3.3.  What I think is going on is that some queries significantly exceed the cellBatchSize, causing a whole sequence of segment load queries, each with a different IN list for the items in that particular batch.  The benefits of batching cells in these cases are greatly outweighed by the cost of extra SQL queries.
>> 
>>  
>> 
>> A couple questions:
>> 
>> 1)      I notice that the default value of cellBatchSize is -1, which I would interpret as meaning that there is no hard limit on the number of cells batched together.  In FastBatchingCellReader, though, if cellBatchSize is less than 0 we set the limit at a hardcoded 100000.  Should we provide some way of truly having no hard limit for cellBatchSize?
>> 
>>  
>> 
>> 2)      More generally--what is the benefit of batching, and what can we do to balance that against the cost of extra queries?
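>> 
>> Regarding question 1), the behavior described there boils down to something like this (paraphrased sketch, not the verbatim FastBatchingCellReader source):
>> 
>>     // Paraphrase of the described behavior: a negative cellBatchSize means
>>     // "unset" and falls back to a hardcoded ceiling, so there is currently
>>     // no way to ask for a truly unlimited batch.
>>     class CellBatchSizeDefaults {
>>         static int effectiveLimit(int configuredCellBatchSize) {
>>             return configuredCellBatchSize > 0
>>                 ? configuredCellBatchSize
>>                 : 100000; // hardcoded fallback when the property is left at -1
>>         }
>>     }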
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> 
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian