[Mondrian] Re: Handling High Cardinality queries

Matt Campbell mkambol at gmail.com
Tue Jun 2 17:51:54 EDT 2009


Here's an example query that demonstrates some of the limitations of the
existing native non empty:

SELECT NON EMPTY
 CrossJoin(  [Education Level].[Education Level].members,
     UNION( CrossJoin( [product].[product name].members, {gender.M}),
      { CrossJoin( { [Product].[All Products].[Food].[Frozen
Foods].[Breakfast Foods].[Waffles].[Golden].[Golden Waffles] },
        { gender.M}) } ) ) on 0
FROM SALES
WHERE ([Time].[1997].[Q1].[3])

Ideally this MDX would fire a single native query to get the crossjoined
members constrained by Gender.M and Time.1997.Q1.3.  With existing NNE,
however, the UNION part requires ExpandNonNative, which will result in the
nested CrossJoin args being evaluated separately.  If the nested crossjoins
contained features requiring ExpandNonNative, the logic would continue to
descend the expression, performing a bottom up evaluation.

ExpandNonNative will actually fail outright with the above query, since it
only works if the expanded set is a list of members.  Even if we corrected
this limitation, though, it would still fire additional native queries,
which can be very expensive.

The above query is a little contrived, but here's a query generated by our
reporting client (Cognos) that fails in part for the same reason--the
sub-expressions require expansion, but also contain CrossJoins.

WITH
MEMBER [Product].[COG_OQP_INT_umg1] AS 'IIF([Measures].CURRENTMEMBER IS
[Measures].[Unit Sales], ([Product].[COG_OQP_INT_m2], [Measures].[Unit
Sales]), AGGREGATE({[Product].[Product Name].MEMBERS}))', SOLVE_ORDER = 4
MEMBER [Product].[COG_OQP_INT_m2] AS 'AGGREGATE({[Product].[Product
Name].MEMBERS}, [Measures].[Unit Sales])', SOLVE_ORDER = 4 SET
[COG_OQP_INT_s5] AS 'CROSSJOIN({[Marital Status].[All Marital Status].[S]},
[COG_OQP_INT_s4])' SET [COG_OQP_INT_s4] AS 'CROSSJOIN({[Gender].[All
Gender].[F]}, [COG_OQP_INT_s2])'
SET [COG_OQP_INT_s3] AS 'CROSSJOIN({[Gender].[All Gender].[F]},
{[COG_OQP_INT_s2], [COG_OQP_INT_s1]})'
SET [COG_OQP_INT_s2] AS 'CROSSJOIN({[Product].[Product Name].MEMBERS},
{[Customers].[Name].MEMBERS})'
SET [COG_OQP_INT_s1] AS 'CROSSJOIN({[Product].[COG_OQP_INT_umg1]},
{[Customers].DEFAULTMEMBER})'
SELECT {[Measures].[Unit Sales]} ON AXIS(0),
GENERATE({[Education Level].[All Education Levels].[Graduate Degree]},
CROSSJOIN(HEAD({([Education Level].CURRENTMEMBER)},
IIF(COUNT([COG_OQP_INT_s5], INCLUDEEMPTY) > 0, 1, 0)), GENERATE({[Marital
Status].[All Marital Status].[S]}, CROSSJOIN(HEAD({([Marital
Status].CURRENTMEMBER)}, IIF(COUNT([COG_OQP_INT_s4], INCLUDEEMPTY) > 0, 1,
0)), [COG_OQP_INT_s3]), ALL)), ALL) ON AXIS(1)
FROM [Sales]

In addition to the problems with ExpandNonNative, this particular query also
has a calculated member on the columns.  Calculated members on a dimension
will also cause NNE to fail right now.

The approach I've been exploring attempts to tranform the starting, high
cardinality set to a simpler low cardinality version.  The results of
evaluating the low cardinality set can give us information for limiting the
high cardinality set.


On Fri, May 29, 2009 at 8:20 PM, Matt Campbell <mkambol at gmail.com> wrote:

> ExpandNonNative does increase the number of cases native non empty can
> handle, but it will fail if the expanded set (non-natively) exceeds
> some threshold (has greater than 1k members).  Also, expandNonNative
> can result in *several* native queries being fired, which is far less
> efficient than a single native query.
>
> I'll send out some specific examples next week.
>
> On 5/29/09, Zelaine Fong <zfong at lucidera.com> wrote:
> > Matt,
> >
> > Doesn't the existing mondrian property "mondrian.native.ExpandNonNative"
> > already do this?  The property is currently set to false by default.  I
> > set it true in my environment ran the following query:
> >
> > SELECT NonEmptyCrossjoin( Product.[Product Name].members,
> >     {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}) on 0
> > from sales;
> >
> > and the generated readTuples query is doing the cross join on Products
> > and Time natively.
> >
> > 5299 [main] DEBUG mondrian.rolap.RolapUtil  - SqlTupleReader.readTuples
> > [[Product].[Product Name], [Time].[Quarter]]: executing sql [select
> > "product_class"."product_family", "product_class"."product_department",
> > "product_class"."product_category",
> > "product_class"."product_subcategory", "product"."brand_name",
> > "product"."product_name", "time_by_day"."the_year",
> > "time_by_day"."quarter" from "product" as "product", "product_class" as
> > "product_class", "sales_fact_1997" as "sales_fact_1997", "time_by_day"
> > as "time_by_day" where "product"."product_class_id" =
> > "product_class"."product_class_id" and "sales_fact_1997"."product_id" =
> > "product"."product_id" and "sales_fact_1997"."time_id" =
> > "time_by_day"."time_id" and ("time_by_day"."quarter" in ('Q4', 'Q3') and
> > "time_by_day"."the_year" = 1997) group by
> > "product_class"."product_family", "product_class"."product_department",
> > "product_class"."product_category",
> > "product_class"."product_subcategory", "product"."brand_name",
> > "product"."product_name", "time_by_day"."the_year",
> > "time_by_day"."quarter" order by "product_class"."product_family" ASC,
> > "product_class"."product_department" ASC,
> > "product_class"."product_category" ASC,
> > "product_class"."product_subcategory" ASC, "product"."brand_name" ASC,
> > "product"."product_name" ASC, "time_by_day"."the_year" ASC,
> > "time_by_day"."quarter" ASC], exec 1469 ms
> >
> > -- Zelaine
> >
> > Matt Campbell wrote:
> >> I’m involved in the development of a reporting application for analyzing
> >> medical claims data.  Users of our application often want to run reports
> >> that generate very high cardinality queries.  This could mean queries
> >> with very large dimensions (e.g. “Patients”, which could have millions
> >> of members) or deeply crossjoined axes (e.g. 15 nested dimensions) or a
> >> combination of both.  Since they are executed NON EMPTY and usually are
> >> tightly constrained, the final results often involve only a few dozen
> >> tuples.  While Mondrian does have some facilities to help push NON EMPTY
> >> into native evaluation, we have run into a number of limitations that
> >> have prevented us from taking advantage of the feature.  In particular
> >> it fails if the CrossJoin contains complex expressions or calculated
> >> members (which occur in nearly all of our queries).
> >>
> >> The existing native code will look for certain features within the set
> >> it is evaluating.  For example, it will look for <Level>.members, or
> >> <Member>.children within a crossjoin arg.  If it finds a construct that
> >> it can’t recognize, or if any number of conditions fail to hold, then it
> >> will abort native evaluation.  If everything goes correctly, then it
> >> will use SqlTupleReader to load the set of tuples with non empty data
> >> using the individual members within the set as a constraint.  The key
> >> piece here is to come up with that list of individual members for the
> >> constraint.  So if I have a CrossJoin of ( Product.[Product
> >> Name].members, { Time.1997.Q1 } ), the native query should be
> >> constrained to Q1 1997.
> >>
> >> An alternative to get the set of applicable constraints would be to use
> >> a multi-pass approach, first transforming the query to a simpler form
> >> and using the results of that query to determine how to natively
> >> evaluate it.  So for example:
> >>
> >> 1)       Given an input set, modify any references to <Level>.members,
> >> <Member>.children, etc. to be a set consisting of a single calculated
> >> member with a scalar value (note that the scalar value is not
> >> important—we just want to replace it with something that needs no
> >> further evaluation).
> >>
> >> 2)       Evaluate the modified set.  This should produce a comparatively
> >> small set of tuples that captures the individual values of any other
> >> dimensions in the set.  Any dimension that contains only individual
> >> members at a single level can be used as constraints, since the
> >> <Level>.members intersects with them.
> >>
> >> 3)       Fire a native query to get the tuples that have non empty data
> >> with the constraints identified in (2), expanding back out the
> >> expressions that were modified in (1).  (or, take an approach like
> >> crossjoin optimizer and actually evaluate the remaining intersections).
> >>
> >> I implemented a quick and dirty proof-of-concept function which uses the
> >> above approach.  So with a query like the following:
> >>
> >>     SELECT NativizeSet( Crossjoin( Product.[Product Name].members,
> >>            {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}  ))
> >>     on 0
> >>     from sales
> >>
> >> (1)  The function will first transform the query to
> >>
> >>     with member [Product].[substitution-Product] as '101010.0'
> >>       set [setProduct] as '{[Product].[substitution-Product]}'
> >>     select NativizeSet(Crossjoin([setProduct], {[Time].[1997].LastChild,
> >>     [Time].[1997].LastChild.PrevMember})) ON COLUMNS
> >>     from [Sales]
> >>
> >> (2)  It will then evaluate the transformed set, which will return a list
> >> of tuples
> >>
> >>       ( product.[substitution-Product], Time.1997.Q4)
> >>       ( product.[substitution-Product], Time.1997.Q3)
> >>
> >> (3)  The function can then fire a native query to retrieve the actual
> >> [Product Name] members that intersect with Q4 and Q3.
> >>
> >> This works nicely with sets that have arbitrarily complex expressions,
> >> but with some caveats:
> >>
> >> * the set expression must contain an expression that can actually be
> >> modified (like <Level>.members).
> >>
> >> * expressions replaced in (1) can’t be inside of functions like filter,
> >> topcount, or head, since the expression actually needs to look at the
> >> set of members to determine what to return.  I’m sure there are other
> >> cases like this where the “substituted” set is dynamic and the
> >> transformation described above would not result in the desired result.
> >>
> >> Even with the caveats above I believe that an approach like the above
> >> will work for most of our use cases.
> >>
> >>
> >>
> >> Let me know if anyone has feedback.
> >>
> >>
> >>
> >>
> >>
> >>
> >> ------------------------------------------------------------------------
> >>
> >> _______________________________________________
> >> Mondrian mailing list
> >> Mondrian at pentaho.org
> >> http://lists.pentaho.org/mailman/listinfo/mondrian
> >
> >
> >
> > _______________________________________________
> > Mondrian mailing list
> > Mondrian at pentaho.org
> > http://lists.pentaho.org/mailman/listinfo/mondrian
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20090602/988de181/attachment.html 


More information about the Mondrian mailing list