[Mondrian] CellBatchSize

Julian Hyde jhyde at pentaho.com
Thu Jun 27 19:37:30 EDT 2013


To see why we batch, consider the extremes.

1. Don't batch at all. (Equivalent to batch size = 1.) Each time a cell is missing from the query's cell cache, we send a request to the cell manager and wait for a response. The cell manager is an agent running in another thread, so this round trip involves two queues. If the cell is missing from the shared cache, the cell manager will generate and execute a SQL statement. There will tend to be a lot of small SQL statements, each asking for a very specific cell. Answering each of them may still require significant effort from the database, e.g. a scan.

2. Gather all requests generated by an MDX execution pass into a single batch. (Equivalent to batch size = INFINITY.) Each MDX execution pass won't ask for cells that have been successfully fetched by a previous pass. Each cell request requires a considerable amount of memory, so large batches might run out of memory. We don't make an effort to identify duplicate cell requests while we are building a batch, so a large batch might have many duplicates of the same request.

A large batch will tend to generate more efficient SQL.

A small batch gets results sooner, and so allows Mondrian to resolve conditionals such as "Iif([Measures].[Unit Sales] > 10, [Measures].[Store Sales], [Measures].[Store Cost])" sooner, and therefore send fewer cell requests overall.
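
To make the two extremes concrete, here is a minimal Java sketch. It is not Mondrian code: the class BatchSizeSketch, its methods, and the strings standing in for cell requests are all hypothetical. It only counts how many batches (and hence round trips to the cell manager) a given batch size produces, and how many duplicate requests end up inside a batch when nothing de-duplicates them while the batch is being built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Mondrian.
final class BatchSizeSketch {

    /** Splits the requests, in order, into batches of at most batchSize entries. */
    static List<List<String>> toBatches(List<String> requests, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        int start = 0;
        while (start < requests.size()) {
            // Written this way so that a huge batchSize cannot overflow.
            int end = start + Math.min(batchSize, requests.size() - start);
            batches.add(new ArrayList<>(requests.subList(start, end)));
            start = end;
        }
        return batches;
    }

    /** Counts requests that repeat an earlier request in the same batch. */
    static int duplicatesWithin(List<String> batch) {
        return batch.size() - new HashSet<>(batch).size();
    }

    public static void main(String[] args) {
        // Five requests, three distinct cells.
        List<String> requests = Arrays.asList("a", "b", "a", "c", "a");

        // Batch size 1: one round trip per request.
        System.out.println(toBatches(requests, 1).size());

        // "Infinite" batch size: one round trip, but duplicates ride along.
        List<List<String>> single = toBatches(requests, Integer.MAX_VALUE);
        System.out.println(single.size());
        System.out.println(duplicatesWithin(single.get(0)));
    }
}
```

With these five requests, batch size 1 yields five batches, while the unbounded case yields a single batch carrying two duplicates. The sketch deliberately ignores the effect of early results on conditionals such as Iif, which is what makes small batches attractive in practice.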

The conclusion -- the ideal batch size is somewhere between 1 and infinity. The trouble is to figure out where.

Julian


On Jun 27, 2013, at 6:34 AM, Luc Boudreau <lucboudreau at gmail.com> wrote:

We have a couple of comments in the code about what we foresee for the future. What we'd like to get to is a pluggable system for defining the batching rules, not just in terms of size, but also able to determine whether it is worth sharing a particular batch across threads or whether it would be cheaper to duplicate some cells. One concrete example is a big query that pulls a lot of cells and takes a long time to execute, while a second, smaller query comes in afterwards and has to wait for a subset of the big segment. Sometimes it is cheaper and more effective to fragment the cache.

I'd prefer that we address this issue in its broader application, rather than focus solely on the number of cells.




On Thu, Jun 27, 2013 at 9:22 AM, Matt Campbell <mcampbell at pentaho.com> wrote:
There have been reports in the forum over the past few months of cases where performance is much worse in Mondrian 3.5 and 3.6 than in 3.3. What I think is going on is that some queries significantly exceed the cellBatchSize, causing a whole sequence of segment load queries, each with a different IN list for the items in that particular batch. The benefits of batching cells in these cases are greatly outweighed by the cost of the extra SQL queries.

A couple of questions:

1) I notice that the default value of cellBatchSize is -1, which I would interpret as meaning that there is no hard limit on the number of cells batched together. In FastBatchingCellReader, though, if cellBatchSize is less than 0 we set the limit at a hardcoded 100000. Should we provide some way of truly having no hard limit for cellBatchSize?



2) More generally, what is the benefit of batching, and what can we do to balance that against the cost of extra queries?
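
As a rough illustration of question 1, the behaviour described above can be sketched in Java as follows. This is not the FastBatchingCellReader source: the class CellBatchLimitSketch and its method names are hypothetical, and the only facts taken from this message are the default of -1 and the hardcoded fallback of 100000. The batchCount helper assumes, as a simplification, that each full batch triggers one further round of segment loads.

```java
// Hypothetical sketch; names are illustrative and do not come from Mondrian.
final class CellBatchLimitSketch {

    /** Fallback applied when the configured value is negative, as described above. */
    static final int HARDCODED_LIMIT = 100000;

    /** Returns the limit actually applied for a configured cellBatchSize. */
    static int effectiveLimit(int configured) {
        return configured < 0 ? HARDCODED_LIMIT : configured;
    }

    /** Number of batches needed for cellCount requests: ceil(cellCount / limit). */
    static long batchCount(long cellCount, int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        return (cellCount + limit - 1) / limit;
    }

    public static void main(String[] args) {
        int limit = effectiveLimit(-1);                 // default setting
        System.out.println(limit);                      // prints 100000
        System.out.println(batchCount(250000, limit));  // prints 3
    }
}
```

Under these assumptions, a query needing 250000 cells with the default setting is split into three batches, which matches the pattern of repeated segment load queries with differing IN lists.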




_______________________________________________
Mondrian mailing list
Mondrian at pentaho.org
http://lists.pentaho.org/mailman/listinfo/mondrian


