[Mondrian] virtual cubes, non-joining dims, performance

Wright, Jeff jeff.s.wright at truvenhealth.com
Tue Aug 23 08:30:22 EDT 2016


I can’t think of another way to get a native tuple list without a union. But my reactions on performance are:


·         It appeals to me to use a cardinality estimate to decide on native NECJ, versus just building the cartesian product.

·         I have an on-going gripe that Mondrian generates sql at a lower level of granularity than the MDX asks for. In my case the NECJ is in the WHERE clause, so the tuples aren’t really required. This behavior is also not governed by any kind of cardinality estimate, which can lead to slowness and lots of memory used. It also prevents TopCount() from being evaluated natively (e.g. if you have a slicer).

--jeff

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Monday, August 22, 2016 5:00 PM
To: Mondrian developer mailing list <mondrian at pentaho.org>
Subject: Re: [Mondrian] virtual cubes, non-joining dims, performance


One unfortunate fact is that native.crossjoin will be used even if the NECJ is tiny.  For example, crossjoining [Gender] and [Age] has a small cartesian product, and the cost of querying the fact table (or UNIONing multiple fact tables) outweighs the benefit of reducing the tuple size.  But Mondrian doesn’t check the cardinality of the crossjoined sets before doing native evaluation, so small and large get treated the same.  In some limited workloads it might be better to have native.crossjoin off, but in general I’d be worried about crossjoins with very large cartesian products, since they have the potential to run long and consume excessive resources.

Just to confirm- when you see these expensive UNION queries, do you have calculated measures in your MDX which use ValidMeasure, or otherwise shift dimensional context?  If not, the behavior should effectively be as it was before.  If you do have something like ValidMeasure, do you see another way of getting the correct set of tuples without the UNION query?

-matt


From: <mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org>> on behalf of "jeff.s.wright at truvenhealth.com<mailto:jeff.s.wright at truvenhealth.com>" <jeff.s.wright at truvenhealth.com<mailto:jeff.s.wright at truvenhealth.com>>
Reply-To: Mondrian list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Date: Monday, August 22, 2016 at 2:57 PM
To: Mondrian list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: Re: [Mondrian] virtual cubes, non-joining dims, performance

I’ve been experimenting with 2 flavors of “filter” MDX, and also turning native eval off with

mondrian.native.crossjoin.enable=false
mondrian.native.nonempty.enable=false

For the queries I’m looking at, it looks to me like writing filter MDX with NonEmptyCrossJoin() in the WHERE improves some queries, and turning off native eval avoids the NonEmptyCrossJoin() resulting in an expensive UNION sql from SqlTupleReader. So the UNION gets avoided by deferring constraints to the segment load.

Setting mondrian.rolap.aggregates.optimizePredicates=false also gives me results I’m happier with. With this property set to true, a lot of filters are more or less ignored.

I feel like I’m doing something wrong to turn off native and optimizePredicates…

--jeff

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Friday, August 19, 2016 11:38 AM
To: Mondrian developer mailing list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: Re: [Mondrian] virtual cubes, non-joining dims, performance


Hey Jeff-

The purpose of the UNION query with native evaluation of virtual cubes is to make sure that all of the tuples with NON EMPTY data from any of the relevant base cubes are included.  So if a query crossjoins Products and Stores from the [Warehouse and Sales] cube, native eval needs to query both the sales and warehouse fact tables in order to get the full set of NON EMPTY tuples.  It does that with queries to each fact table UNION’d together (that logic has been in place for years).

Prior to the changes for 1599/2280, SqlTupleReader would retrieve those base cubes relevant to the measures in the query, and then pare the list of cubes down to only those that join to all the dimensions in the crossjoin.  If any measures in the query might change context, however, that approach could give incorrect results.  There are some examples in the 2 jira cases.

The changes for 1599/2280 introduced grouping of native tuple queries, so we now retrieve tuples grouped by common joining base cubes.  STR only does this if one or more measures are determined to shift context (with ValidMeasure or otherwise).  But this does mean that we’re now more likely to execute UNION queries, since we’re evaluating in groups that share common cubes.

I’m definitely interested in hearing thoughts on improving performance in this scenario.  The trick is assuring correctness, which requires knowing which tuples have non-empty data across all relevant fact tables, without executing more expensive SQL than necessary.

-matt

From: <mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org>> on behalf of "jeff.s.wright at truvenhealth.com<mailto:jeff.s.wright at truvenhealth.com>" <jeff.s.wright at truvenhealth.com<mailto:jeff.s.wright at truvenhealth.com>>
Reply-To: Mondrian list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Date: Friday, August 19, 2016 at 10:26 AM
To: Mondrian list <mondrian at pentaho.org<mailto:mondrian at pentaho.org>>
Subject: [Mondrian] virtual cubes, non-joining dims, performance

I think this is primarily to Matt, but sending to mailing list in case there’s other interest or input…

We’ve been testing some new-ish code in an effort to get correct behavior of non-joining dimensions, and there seems to be a somewhat steep performance cost to the fix. What we’re seeing is that recent versions of SqlTupleReader generate an expensive UNION query across the fact tables in a virtual cube.

I haven’t carefully sorted through the code our team has tested, but looking at the revision history on github, I suspect we’re testing with the checkin for issue #1, but not issue #2:

#1 - http://jira.pentaho.com/browse/MONDRIAN-1599 - Some queries against Virtual cubes can give wrong results when dimensions don't fully join
#2 - http://jira.pentaho.com/browse/MONDRIAN-2280 - ValidMeasure not taken into account when Analyzer uses NonEmptyCrossjoin as it returns no data

Is the expensive UNION query something you’ve seen as well?

--jeff
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20160823/262eabc5/attachment-0001.html 


More information about the Mondrian mailing list