[Mondrian] Handling High Cardinality queries

Matt Campbell mkambol at gmail.com
Fri May 29 16:10:24 EDT 2009


I’m involved in the development of a reporting application for analyzing
medical claims data.  Users of our application often want to run reports
that generate very high cardinality queries.  This could mean queries with
very large dimensions (e.g. “Patients”, which could have millions of
members) or deeply crossjoined axes (e.g. 15 nested dimensions) or a
combination of both.  Since they are executed NON EMPTY and usually are
tightly constrained, the final results often involve only a few dozen
tuples.  While Mondrian does have some facilities to help push NON EMPTY
into native evaluation, we have run into a number of limitations that have
prevented us from taking advantage of the feature.  In particular it fails
if the CrossJoin contains complex expressions or calculated members (which
occur in nearly all of our queries).

The existing native code will look for certain features within the set it is
evaluating.  For example, it will look for <Level>.members, or
<Member>.children within a crossjoin arg.  If it finds a construct that it
can’t recognize, or if any number of conditions fail to hold, then it will
abort native evaluation.  If everything goes correctly, then it will use
SqlTupleReader to load the set of tuples with non empty data using the
individual members within the set as a constraint.  The key piece here is to
come up with that list of individual members for the constraint.  So if I
have a CrossJoin of ( Product.[Product Name].members, { Time.1997.Q1 } ),
the native query should be constrained to Q1 1997.

An alternative to get the set of applicable constraints would be to use a
multi-pass approach, first transforming the query to a simpler form and
using the results of that query to determine how to natively evaluate it.  So
for example:

1)       Given an input set, modify any references to <Level>.members,
<Member>.children, etc. to be a set consisting of a single calculated member
with a scalar value (note that the scalar value is not important—we just
want to replace it with something that needs no further evaluation).

2)       Evaluate the modified set.  This should produce a comparatively
small set of tuples that captures the individual values of any other
dimensions in the set.  Any dimension that contains only individual members
at a single level can be used as constraints, since the <Level>.members
intersects with them.

3)       Fire a native query to get the tuples that have non empty data with
the constraints identified in (2), expanding back out the expressions that
were modified in (1).  (or, take an approach like crossjoin optimizer and
actually evaluate the remaining intersections).

I implemented a quick and dirty proof-of-concept function which uses the
above approach.  So with a query like the following:

SELECT NativizeSet( Crossjoin( Product.[Product Name].members,
       {Time.[1997].LastChild, Time.[1997].LastChild.PrevMember}  )) on 0
from sales

(1)  The function will first transform the query to

with member [Product].[substitution-Product] as '101010.0'
  set [setProduct] as '{[Product].[substitution-Product]}'
select NativizeSet(Crossjoin([setProduct], {[Time].[1997].LastChild,
[Time].[1997].LastChild.PrevMember})) ON COLUMNS
from [Sales]

(2)  It will then evaluate the transformed set, which will return a list of
tuples

  ( product.[substitution-Product], Time.1997.Q4)
  ( product.[substitution-Product], Time.1997.Q3)

(3)  The function can then fire a native query to retrieve the actual
[Product Name] members that intersect with Q4 and Q3.

This works nicely with sets that have arbitrarily complex expressions, but
with some caveats:

* the set expression must contain an expression that can actually be
modified (like <Level>.members).

* expressions replaced in (1) can’t be inside of functions like filter,
topcount, or head, since the expression actually needs to look at the set of
members to determine what to return.  I’m sure there are other cases like
this where the “substituted” set is dynamic and the transformation described
above would not result in the desired result.

Even with the caveats above I believe that an approach like the above will
work for most of our use cases.



Let me know if anyone has feedback.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20090529/0a96331a/attachment.html 


More information about the Mondrian mailing list