[Mondrian] CellBatchSize

Luc Boudreau lucboudreau at gmail.com
Mon Jul 1 14:31:02 EDT 2013


It's probably worth trying an optimization procedure that we can enable
optionally, and gathering some stats to figure out exactly how much of a
difference it makes. We should also add a feedback loop into the algorithm
so that it can be more or less greedy as it loops through the phases. The
server monitor API should tell us how many cells are hits and how many are
misses.
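
As a rough illustration of that feedback loop, here is a minimal Java sketch.
GreedinessTuner, its fields and the 50% miss threshold are all hypothetical,
and the hit and miss counts are passed in as plain numbers rather than read
from the server monitor API. It assumes that a high miss ratio means the
previous phase loaded too little.

```java
/** Hypothetical helper, not part of Mondrian: tunes how greedy the next phase is. */
final class GreedinessTuner {
    // 0.0 = load only the cells that were requested, 1.0 = preload as much as possible.
    private double greediness;
    private final double step;

    GreedinessTuner(double initialGreediness, double step) {
        this.greediness = initialGreediness;
        this.step = step;
    }

    /**
     * Feeds back the cache hits and misses observed during the phase that just
     * ended and returns the greediness to use for the next phase.
     */
    double update(long hits, long misses) {
        long total = hits + misses;
        if (total == 0) {
            return greediness; // nothing observed, keep the current setting
        }
        double missRatio = (double) misses / total;
        if (missRatio > 0.5) {
            // Assumption: many misses mean we loaded too little, so be greedier.
            greediness = Math.min(1.0, greediness + step);
        } else {
            // Assumption: mostly hits mean we can afford to load less eagerly.
            greediness = Math.max(0.0, greediness - step);
        }
        return greediness;
    }
}
```

The fixed step keeps the sketch simple; a real version would probably want
to damp oscillation between phases.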

We should also use the Statistics SPI and take into account the estimated
cardinality of the columns. I'm thinking that comparing the number of
values for a given phase with the total number of values is a cheap way
to decide whether we want to cover the whole level in one go.
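
A minimal sketch of that ratio check, in Java. LevelCoverage,
coverWholeLevel and the threshold parameter are hypothetical names; the
estimated cardinality is taken as a plain argument here, on the assumption
that it would be obtained through the Statistics SPI.

```java
/** Hypothetical helper, not part of Mondrian. */
final class LevelCoverage {
    /**
     * Decides whether to load the whole level instead of an explicit value list.
     *
     * @param valuesInPhase        number of distinct values the current phase asks for
     * @param estimatedCardinality estimated number of values in the column
     *                             (assumed to come from a statistics provider)
     * @param threshold            fraction of the level above which we load all of it
     */
    static boolean coverWholeLevel(
            long valuesInPhase, long estimatedCardinality, double threshold) {
        if (estimatedCardinality <= 0) {
            return false; // cardinality unknown: keep the explicit value list
        }
        return (double) valuesInPhase / estimatedCardinality >= threshold;
    }
}
```

For example, a phase asking for 250000 of an estimated 300000 customers
would cross a 0.8 threshold, while a phase asking for 300 would not.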

Luc
On Jul 1, 2013 2:18 PM, "Matt Campbell" <mcampbell at pentaho.com> wrote:

> It seems like cellsets should usually have a predictable consistency.  If
> you’re crossjoining { [Customer].[1] : [Customer].[300000] } with
> [Measures].[MyMeasure], for example, once we’ve evaluated [MyMeasure] down
> to its component base measures for one intersection, it’s likely that the
> only thing that will differ from one set of cell requests to the next will
> be the level member constraint.  It’s possible that there will be *no*
> consistency ([MyMeasure] might change context out from under us based on
> the current Customer), but I’m guessing in practice that’s unusual.
>
> I’m wondering if we could leverage that expected consistency to be more
> efficient in cell request creation.  We could do some limited sampling of
> cells up-front, make a guess about what cell requests will ultimately be
> required, and then preload those cells.  If we guessed wrong the worst that
> happens is we’ve loaded cells into cache that aren’t immediately needed.
> Best case we get all cells required in a single pass.
>
> *From:* mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org]
> *On Behalf Of *Julian Hyde
> *Sent:* Friday, June 28, 2013 7:14 PM
> *To:* Mondrian developer mailing list
> *Subject:* Re: [Mondrian] CellBatchSize
>
> How can we do better? My best guess is that we should try to "compact"
> batches before sending them out.
>
> And I think we should do better at automatically tuning the batch size
> based on available resources. Every extra parameter is something else the
> user can screw up. :)
>
> Other suggestions welcome.
>
> Julian
>
> On Jun 28, 2013, at 7:06 AM, Matt Campbell <mcampbell at pentaho.com> wrote:
>
> I definitely get your point about the resource cost of cell requests, and
> the issue of duplicate requests.  I imagine it’s a common scenario to have
> the same measure repeated many times in the same query (as the denominator
> of a set of ratio measures, for example).  And I’m sure there are other
> cases where cell requests can overlap, both within a single query and
> across concurrent queries.
>
> Another benefit (intentional?) that I noticed is that if predicate
> optimization occurs after hitting the cell size threshold, it’s possible
> that all required cells will get loaded in a single batch anyway.   For
> example, if you’re crossjoining all members of multiple levels, hitting the
> batch size limit may allow short-circuiting cell requests that will be
> fulfilled by a single, predicate-optimized SQL query.  I think the flip
> side of that is where some performance complaints have been coming from.
> If a query has a large number of non-duplicated cells, as you would with a
> big crossjoin, and predicate optimization doesn’t happen, then you’re stuck
> with many expensive SQL queries.
>
> So yes, knowing what number between 1->infinity to use sounds tricky.
> Even more so given that the best answer varies by MDX and concurrent
> activity.
>
> *From:* mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org]
>  *On Behalf Of *Julian Hyde
> *Sent:* Thursday, June 27, 2013 7:38 PM
> *To:* Mondrian developer mailing list
> *Subject:* Re: [Mondrian] CellBatchSize
>
> To see why we batch, consider the extremes.
>
> 1. Don't batch at all. (Equivalent to batch size = 1.) Each time a cell is
> missing from the query's cell cache, we send a request to the cell manager
> and wait for a response. The cell manager is an agent running in another
> thread, so this round trip involves two queues. If the cell is missing from
> the shared cache, the cell manager will generate and execute a SQL
> statement. There will tend to be a lot of small SQL statements each asking
> for a very specific cell. Answering each of them may still require
> significant effort from the database, e.g. a scan.
>
> 2. Gather all requests generated by an MDX execution pass into a single
> batch. (Equivalent to batch size = INFINITY.) Each MDX execution pass won't
> ask for cells that have been successfully fetched by a previous pass. Each
> cell request requires a considerable amount of memory, so large batches
> might run out of memory. We don't make an effort to identify duplicate cell
> requests while we are building a batch, so a large batch might have many
> duplicates of the same request.
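
To illustrate the duplicate problem in case 2, here is a minimal Java sketch
of a batch that drops duplicate requests as they are added. This is not how
Mondrian builds batches today (the text above says it makes no such effort);
DedupingBatch and its representation of a request as a measure name plus
member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical batch builder, not Mondrian's FastBatchingCellReader. */
final class DedupingBatch {
    // A set keeps one entry per distinct request; insertion order is preserved.
    private final Set<List<String>> requests = new LinkedHashSet<>();
    private final int limit;

    DedupingBatch(int limit) {
        this.limit = limit;
    }

    /** Records a request; returns true once the batch holds `limit` distinct requests. */
    boolean add(String measure, String... members) {
        List<String> key = new ArrayList<>();
        key.add(measure);
        key.addAll(Arrays.asList(members));
        requests.add(key); // a duplicate key leaves the set unchanged
        return requests.size() >= limit;
    }

    int size() {
        return requests.size();
    }
}
```

With this shape, a measure repeated many times in one query counts once
toward the limit, at the price of hashing every request.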
>
> A large batch will tend to generate more efficient SQL.
>
> A small batch gets results sooner, and so allows Mondrian to resolve
> conditionals such as "Iif([Measures].[Unit Sales] > 10, [Measures].[Store
> Sales], [Measures].[Store Cost])" sooner, and therefore send fewer cell
> requests overall.
>
> The conclusion -- the ideal batch size is somewhere between 1 and
> infinity. The trouble is to figure out where.
>
> Julian
>
> On Jun 27, 2013, at 6:34 AM, Luc Boudreau <lucboudreau at gmail.com> wrote:
>
> We have a couple of comments in the code about what we were foreseeing in
> the future. What we'd like to get to is a pluggable system to define the
> batching rules, not just in terms of size, but also able to determine
> whether it is worth sharing a particular batch across threads or whether
> it would be cheaper to duplicate some cells. One concrete example of this
> is a big query which pulls a lot of cells and takes a lot of time to
> execute, while a second, smaller query comes in afterwards and has to wait
> for a subset of the big segment. Sometimes, it is cheaper and more
> effective to fragment the cache.
>
> I'd prefer that we address this issue in its broader application, rather
> than focus solely on the number of cells.
>
> On Thu, Jun 27, 2013 at 9:22 AM, Matt Campbell <mcampbell at pentaho.com>
> wrote:
>
> There have been reports in the forum over the past few months of cases
> where performance is much worse in Mondrian 3.5/6 compared to 3.3.  What I
> think is going on is that some queries significantly exceed the
> cellBatchSize, causing a whole sequence of segment load queries, each with
> a different IN list for the items in that particular batch.  The benefits
> of batching cells in these cases are greatly outweighed by the cost of
> extra SQL queries.
>
> A couple of questions:
>
> 1) I notice that the default value of cellBatchSize is -1, which I
> would interpret as meaning that there is no hard limit on the number of
> cells batched together.  In FastBatchingCellReader, though, if
> cellBatchSize is less than 0 we set the limit at a hardcoded 100000.
> Should we provide some way of truly having no hard limit for cellBatchSize?
>
> 2) More generally, what is the benefit of batching, and what can we
> do to balance that against the cost of extra queries?
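
To make question 1 concrete, here is a minimal Java sketch of the limit
resolution as described above, next to one possible "no hard limit"
variant. CellBatchLimit and both method names are hypothetical; the sketch
only models the negative case mentioned in the question and is not a copy
of the code in FastBatchingCellReader.

```java
/** Hypothetical helper; the real logic lives in FastBatchingCellReader. */
final class CellBatchLimit {
    // Fallback described in this thread for a negative cellBatchSize.
    static final int HARDCODED_FALLBACK = 100_000;

    /** Behaviour as described above: a negative setting becomes 100000. */
    static int asDescribed(int cellBatchSize) {
        return cellBatchSize < 0 ? HARDCODED_FALLBACK : cellBatchSize;
    }

    /** Hypothetical alternative: a negative setting means no hard limit. */
    static int unlimitedWhenNegative(int cellBatchSize) {
        return cellBatchSize < 0 ? Integer.MAX_VALUE : cellBatchSize;
    }
}
```

Under the alternative, the default of -1 would behave the way its value
suggests, at the risk of very large batches.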
>
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian