[Mondrian] Re: Handling High Cardinality queries

Matt Campbell mkambol at gmail.com
Fri May 29 20:18:12 EDT 2009


ExpandNonNative does increase the number of cases native non empty can
handle, but it will fail if the expanded set (non-natively) exceeds
some threshold (has greater than 1k members).  Also, expandNonNative
can result in *several* native queries being fired, which is far less
efficient than a single native query.

I'll send out some specific examples next week.

On 5/29/09, Zelaine Fong <zfong at lucidera.com> wrote:
> Matt,
>
> Doesn't the existing mondrian property "mondrian.native.ExpandNonNative"
> already do this?  The property is currently set to false by default.  I
> set it true in my environment ran the following query:
>
> SELECT NonEmptyCrossjoin( Product.[Product Name].members,
>     {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}) on 0
> from sales;
>
> and the generated readTuples query is doing the cross join on Products
> and Time natively.
>
> 5299 [main] DEBUG mondrian.rolap.RolapUtil  - SqlTupleReader.readTuples
> [[Product].[Product Name], [Time].[Quarter]]: executing sql [select
> "product_class"."product_family", "product_class"."product_department",
> "product_class"."product_category",
> "product_class"."product_subcategory", "product"."brand_name",
> "product"."product_name", "time_by_day"."the_year",
> "time_by_day"."quarter" from "product" as "product", "product_class" as
> "product_class", "sales_fact_1997" as "sales_fact_1997", "time_by_day"
> as "time_by_day" where "product"."product_class_id" =
> "product_class"."product_class_id" and "sales_fact_1997"."product_id" =
> "product"."product_id" and "sales_fact_1997"."time_id" =
> "time_by_day"."time_id" and ("time_by_day"."quarter" in ('Q4', 'Q3') and
> "time_by_day"."the_year" = 1997) group by
> "product_class"."product_family", "product_class"."product_department",
> "product_class"."product_category",
> "product_class"."product_subcategory", "product"."brand_name",
> "product"."product_name", "time_by_day"."the_year",
> "time_by_day"."quarter" order by "product_class"."product_family" ASC,
> "product_class"."product_department" ASC,
> "product_class"."product_category" ASC,
> "product_class"."product_subcategory" ASC, "product"."brand_name" ASC,
> "product"."product_name" ASC, "time_by_day"."the_year" ASC,
> "time_by_day"."quarter" ASC], exec 1469 ms
>
> -- Zelaine
>
> Matt Campbell wrote:
>> I’m involved in the development of a reporting application for analyzing
>> medical claims data.  Users of our application often want to run reports
>> that generate very high cardinality queries.  This could mean queries
>> with very large dimensions (e.g. “Patients”, which could have millions
>> of members) or deeply crossjoined axes (e.g. 15 nested dimensions) or a
>> combination of both.  Since they are executed NON EMPTY and usually are
>> tightly constrained, the final results often involve only a few dozen
>> tuples.  While Mondrian does have some facilities to help push NON EMPTY
>> into native evaluation, we have run into a number of limitations that
>> have prevented us from taking advantage of the feature.  In particular
>> it fails if the CrossJoin contains complex expressions or calculated
>> members (which occur in nearly all of our queries).
>>
>> The existing native code will look for certain features within the set
>> it is evaluating.  For example, it will look for <Level>.members, or
>> <Member>.children within a crossjoin arg.  If it finds a construct that
>> it can’t recognize, or if any number of conditions fail to hold, then it
>> will abort native evaluation.  If everything goes correctly, then it
>> will use SqlTupleReader to load the set of tuples with non empty data
>> using the individual members within the set as a constraint.  The key
>> piece here is to come up with that list of individual members for the
>> constraint.  So if I have a CrossJoin of ( Product.[Product
>> Name].members, { Time.1997.Q1 } ), the native query should be
>> constrained to Q1 1997.
>>
>> An alternative to get the set of applicable constraints would be to use
>> a multi-pass approach, first transforming the query to a simpler form
>> and using the results of that query to determine how to natively
>> evaluate it.  So for example:
>>
>> 1)       Given an input set, modify any references to <Level>.members,
>> <Member>.children, etc. to be a set consisting of a single calculated
>> member with a scalar value (note that the scalar value is not
>> important—we just want to replace it with something that needs no
>> further evaluation).
>>
>> 2)       Evaluate the modified set.  This should produce a comparatively
>> small set of tuples that captures the individual values of any other
>> dimensions in the set.  Any dimension that contains only individual
>> members at a single level can be used as constraints, since the
>> <Level>.members intersects with them.
>>
>> 3)       Fire a native query to get the tuples that have non empty data
>> with the constraints identified in (2), expanding back out the
>> expressions that were modified in (1).  (or, take an approach like
>> crossjoin optimizer and actually evaluate the remaining intersections).
>>
>> I implemented a quick and dirty proof-of-concept function which uses the
>> above approach.  So with a query like the following:
>>
>>     SELECT NativizeSet( Crossjoin( Product.[Product Name].members,
>>            {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}  ))
>>     on 0
>>     from sales
>>
>> (1)  The function will first transform the query to
>>
>>     with member [Product].[substitution-Product] as '101010.0'
>>       set [setProduct] as '{[Product].[substitution-Product]}'
>>     select NativizeSet(Crossjoin([setProduct], {[Time].[1997].LastChild,
>>     [Time].[1997].LastChild.PrevMember})) ON COLUMNS
>>     from [Sales]
>>
>> (2)  It will then evaluate the transformed set, which will return a list
>> of tuples
>>
>>       ( product.[substitution-Product], Time.1997.Q4)
>>       ( product.[substitution-Product], Time.1997.Q3)
>>
>> (3)  The function can then fire a native query to retrieve the actual
>> [Product Name] members that intersect with Q4 and Q3.
>>
>> This works nicely with sets that have arbitrarily complex expressions,
>> but with some caveats:
>>
>> * the set expression must contain an expression that can actually be
>> modified (like <Level>.members).
>>
>> * expressions replaced in (1) can’t be inside of functions like filter,
>> topcount, or head, since the expression actually needs to look at the
>> set of members to determine what to return.  I’m sure there are other
>> cases like this where the “substituted” set is dynamic and the
>> transformation described above would not result in the desired result.
>>
>> Even with the caveats above I believe that an approach like the above
>> will work for most of our use cases.
>>
>>
>>
>> Let me know if anyone has feedback.
>>
>>
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Mondrian mailing list
>> Mondrian at pentaho.org
>> http://lists.pentaho.org/mailman/listinfo/mondrian
>
>
>
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>




More information about the Mondrian mailing list