[Mondrian] Handling High Cardinality queries

Zelaine Fong zfong at lucidera.com
Fri May 29 17:51:19 EDT 2009


Doesn't the existing mondrian property "mondrian.native.ExpandNonNative" 
already do this?  The property is currently set to false by default.  I 
set it true in my environment ran the following query:

SELECT NonEmptyCrossjoin( Product.[Product Name].members,
    {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}) on 0
from sales;

and the generated readTuples query is doing the cross join on Products 
and Time natively.

5299 [main] DEBUG mondrian.rolap.RolapUtil  - SqlTupleReader.readTuples 
[[Product].[Product Name], [Time].[Quarter]]: executing sql [select 
"product_class"."product_family", "product_class"."product_department", 
"product_class"."product_subcategory", "product"."brand_name", 
"product"."product_name", "time_by_day"."the_year", 
"time_by_day"."quarter" from "product" as "product", "product_class" as 
"product_class", "sales_fact_1997" as "sales_fact_1997", "time_by_day" 
as "time_by_day" where "product"."product_class_id" = 
"product_class"."product_class_id" and "sales_fact_1997"."product_id" = 
"product"."product_id" and "sales_fact_1997"."time_id" = 
"time_by_day"."time_id" and ("time_by_day"."quarter" in ('Q4', 'Q3') and 
"time_by_day"."the_year" = 1997) group by 
"product_class"."product_family", "product_class"."product_department", 
"product_class"."product_subcategory", "product"."brand_name", 
"product"."product_name", "time_by_day"."the_year", 
"time_by_day"."quarter" order by "product_class"."product_family" ASC, 
"product_class"."product_department" ASC, 
"product_class"."product_category" ASC, 
"product_class"."product_subcategory" ASC, "product"."brand_name" ASC, 
"product"."product_name" ASC, "time_by_day"."the_year" ASC, 
"time_by_day"."quarter" ASC], exec 1469 ms

-- Zelaine

Matt Campbell wrote:
> I’m involved in the development of a reporting application for analyzing 
> medical claims data.  Users of our application often want to run reports 
> that generate very high cardinality queries.  This could mean queries 
> with very large dimensions (e.g. “Patients”, which could have millions 
> of members) or deeply crossjoined axes (e.g. 15 nested dimensions) or a 
> combination of both.  Since they are executed NON EMPTY and usually are 
> tightly constrained, the final results often involve only a few dozen 
> tuples.  While Mondrian does have some facilities to help push NON EMPTY 
> into native evaluation, we have run into a number of limitations that 
> have prevented us from taking advantage of the feature.  In particular 
> it fails if the CrossJoin contains complex expressions or calculated 
> members (which occur in nearly all of our queries).
> The existing native code will look for certain features within the set 
> it is evaluating.  For example, it will look for <Level>.members, or 
> <Member>.children within a crossjoin arg.  If it finds a construct that 
> it can’t recognize, or if any number of conditions fail to hold, then it 
> will abort native evaluation.  If everything goes correctly, then it 
> will use SqlTupleReader to load the set of tuples with non empty data 
> using the individual members within the set as a constraint.  The key 
> piece here is to come up with that list of individual members for the 
> constraint.  So if I have a CrossJoin of ( Product.[Product 
> Name].members, { Time.1997.Q1 } ), the native query should be 
> constrained to Q1 1997.
> An alternative to get the set of applicable constraints would be to use 
> a multi-pass approach, first transforming the query to a simpler form 
> and using the results of that query to determine how to natively 
> evaluate it.  So for example:
> 1)       Given an input set, modify any references to <Level>.members, 
> <Member>.children, etc. to be a set consisting of a single calculated 
> member with a scalar value (note that the scalar value is not 
> important—we just want to replace it with something that needs no 
> further evaluation).
> 2)       Evaluate the modified set.  This should produce a comparatively 
> small set of tuples that captures the individual values of any other 
> dimensions in the set.  Any dimension that contains only individual 
> members at a single level can be used as constraints, since the 
> <Level>.members intersects with them.
> 3)       Fire a native query to get the tuples that have non empty data 
> with the constraints identified in (2), expanding back out the 
> expressions that were modified in (1).  (or, take an approach like 
> crossjoin optimizer and actually evaluate the remaining intersections).
> I implemented a quick and dirty proof-of-concept function which uses the 
> above approach.  So with a query like the following:
>     SELECT NativizeSet( Crossjoin( Product.[Product Name].members,
>            {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}  ))
>     on 0
>     from sales
> (1)  The function will first transform the query to
>     with member [Product].[substitution-Product] as '101010.0'
>       set [setProduct] as '{[Product].[substitution-Product]}'
>     select NativizeSet(Crossjoin([setProduct], {[Time].[1997].LastChild,
>     [Time].[1997].LastChild.PrevMember})) ON COLUMNS
>     from [Sales]
> (2)  It will then evaluate the transformed set, which will return a list 
> of tuples
>       ( product.[substitution-Product], Time.1997.Q4)
>       ( product.[substitution-Product], Time.1997.Q3)
> (3)  The function can then fire a native query to retrieve the actual 
> [Product Name] members that intersect with Q4 and Q3.
> This works nicely with sets that have arbitrarily complex expressions, 
> but with some caveats:
> * the set expression must contain an expression that can actually be 
> modified (like <Level>.members).
> * expressions replaced in (1) can’t be inside of functions like filter, 
> topcount, or head, since the expression actually needs to look at the 
> set of members to determine what to return.  I’m sure there are other 
> cases like this where the “substituted” set is dynamic and the 
> transformation described above would not result in the desired result.
> Even with the caveats above I believe that an approach like the above 
> will work for most of our use cases. 
> Let me know if anyone has feedback.
> ------------------------------------------------------------------------
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian

More information about the Mondrian mailing list