[Mondrian] The perils of the crossjoin optimizer

Fri Nov 6 14:57:30 EST 2015

One of the reasons I'm following this closely is we're seeing performance issues with security roles. We didn't see anything in JIRA about that, so we're in the process of characterizing it more and coming up with a foodmart example.

--jeff

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Friday, November 06, 2015 2:54 PM
To: Mondrian developer mailing list <mondrian at pentaho.org>
Subject: Re: [Mondrian] The perils of the crossjoin optimizer

I guess the question is really whether to work on improving the cj optimizer, or to focus on making native evaluation better at handling queries like the grand-total one I pasted below.  There's a case for that--  MONDRIAN-2426.  If that could be handled in native eval we'd never hit the cj optimizer at all.

Either way it seems sensible to me to have some number larger than 0 for a default cj optimizer threshold.

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Wright, Jeff
Sent: Friday, November 06, 2015 2:20 PM
To: Mondrian developer mailing list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: Re: [Mondrian] The perils of the crossjoin optimizer

Thanks, that's easier to follow. Your logic makes sense to me, that attempting to use a fact table to optimize axis evaluation makes more sense when you expect the tuple count to be high.

Should we try to make the optimizer more efficient? Is that a trick question? I guess I'm curious if the calculated member issue would be different if it were an explicit slicer or a named set.

--jeff

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Friday, November 06, 2015 10:20 AM
To: Mondrian developer mailing list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: Re: [Mondrian] The perils of the crossjoin optimizer

Hi Jeff,

Consider the crossjoin from my example:

       CrossJoin(Store.[Store Total], Product.[Product Total]))

It will not use native evaluation due to the presence of calculated members, so the crossjoin optimizer will pick it up.  The optimizer will:

1)      Determine whether the size of the incoming tuple set exceeds the threshold (zero in this case).  1>0, so yes it does.

2)      Determine all the base measures used in the query.

3)      Determine what slicer context can be applied.  This will exclude the calculated member in the slicer, setting [Education Level] to [All].

4)      Iterate through the single input tuple, retrieving cell data for all base measures determined in (2) using the context from (3). This will require an unconstrained SQL query across the whole fact table (assuming the value wasn't cached).

5)      Add only those tuples with data (i.e. the single tuple we started with) to the resulting set.

All of this is part of axis evaluation, so reducing the tuple set means fewer intersections Mondrian has to evaluate downstream.  In this case the set is just a single tuple, though, so limited value.

As another example, consider

CrossJoin([Customer].[Name].members, Product.[Product Total])

We'll again go through the same sequence of steps, but if there are 10s or 100s of thousands of customers the value of reducing the set size may be greater than the cost of the extra SQL executed.

So I guess my question is should the default threshold be higher?  I'm curious if others can think of realistic counter examples where it's useful even with tiny set sizes.

A side question is should we try to make the optimizer more efficient?   I.e. can we make sure contraints from calculated members in the slicer get applied.

-matt

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Wright, Jeff
Sent: Friday, November 06, 2015 9:49 AM
To: Mondrian developer mailing list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: Re: [Mondrian] The perils of the crossjoin optimizer

I've read this several times and I'm just not following. I think you're asking for feedback on a property setting that will control whether Mondrian uses the crossjoin optimizer or does something else. And it sounds like you're looking for a way to make sure the slicer gets used in the cell loading sql.

Could you maybe make a stab at describing the two query execution plans, and how this optimizer threshold comes into play?

--jeff

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Friday, November 06, 2015 9:19 AM
To: Mondrian developer mailing list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: [Mondrian] The perils of the crossjoin optimizer

Mondrian's crossjoin optimizer acts as a fall back to native crossjoin, applying an alternative optimization strategy in cases where native evaluation was not possible or disabled.  It works by loading cell data for crossjoined tuples to eliminate empty intersections.  Loading these cells can be an added cost, but often that's okay since the cells may have been needed anyway, if not by this query then potentially by similar queries.

There are scenarios where the cost can be excessive, however.  The MDX below loads detail rows with a grand total.  In this case, the second crossjoin (of two "Total" calculated members) cannot be natively evaluated, so the crossjoin optimizer gets a shot.  Since the WHERE slicer is also a calculated member, however, this results in an unconstrained SQL query against the fact table.  That a big expense to reduce a tuple set that's already of size 1.

The default crossjoin optimizer threshold has a value of 0 tuples, meaning it always kicks in.  I'm not sure what the ideal default is, but it seems to me that in most cases with tiny sets the risk of expensive SQL outweighs the benefit.  A setting somewhere in the 10-100 region would eliminate many SQL queries with total/subtotal MDX like the one below.

Thoughts?

WITH
member Store.[Store Total] as
    'Aggregate([Store].[Store State].[WA].children)'
member Product.[Product Total] as
    'Aggregate({[Product].[All Products].[Drink], [Product].[All Products].[Food]})'
member [Education Level].[Education Filter] as
    'Aggregate({[Education Level].[All Education Level].[Bachelors Degree],
       [Education Level].[All Education Level].[Graduate Degree]})'

SELECT Measures.[Unit Sales] on 0,
  NON EMPTY
     UNION(
       CrossJoin([Store].[Store State].[WA].children,
                  {[Product].[All Products].[Drink], [Product].[All Products].[Food]}),
       CrossJoin(Store.[Store Total], Product.[Product Total])) on 1
FROM Sales
WHERE
   [Education Level].[Education Filter]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20151106/3afee4c0/attachment-0001.html