[Mondrian] Adding Grouping Set support for Distinct Count measures

John V. Sichi jsichi at gmail.com
Thu Jan 31 05:21:13 EST 2008

John V. Sichi wrote:
> As Ajit points out, this means we have to be careful with pushing down 
> constraints for non-distinct aggs to SQL.  Rushan and I have been 
> thinking about how to build on the distinct-count optimization she put 
> in, with the next step being optimization and caching for non-visual 
> totals via calculated members.

Oops, I meant for visual totals.  Below is an example MDX query which 
demonstrates an extreme case (computing the visual grand total for some 
combination of large arbitrary subsets of three dimensions).

Set [*BASE_MEMBERS_Customer] as [Customers].[Name].members
Set [*BASE_MEMBERS_Store] as [Store].[Store Name].members
Set [*BASE_MEMBERS_Product] as [Product].[All Products].[Food].children
Member [Customers].[*SUBTOTAL] As Aggregate([*BASE_MEMBERS_Customer])
Member [Product].[*SUBTOTAL] As Aggregate([*BASE_MEMBERS_Product])
Member [Store].[*SUBTOTAL] As Aggregate([*BASE_MEMBERS_Store])
{[Measures].[Unit Sales]} on columns,
on rows
 From [Sales]
where ([Time].[1997]);

With latest Mondrian, on my laptop against Derby, it takes about 180 
seconds.  And if I repeat it, the time does not go down, since the total 
cannot be cached.

However, if I edit line 79 of AggregateFunDef.java to

             if ((aggregator == RolapAggregator.DistinctCount)
                 || (aggregator == RolapAggregator.Sum)) {

then the time goes down to 105s for the first execution, and 40s for 
subsequent executions.  The speedup comes from pushing the sum 
computation down to SQL instead of letting Mondrian iterate over the big 
sparse 3D space.

Derby is quite slow for the join/agg; if I run it on LucidDB, the 
improvement for the first MDX execution (where the SQL execution 
matters) is quite a bit larger, taking it down to 83s.  (Subsequent runs 
are 40s as with Derby since the result is already cached, so SQL 
performance doesn't come into play.)

I was wondering why the cached runs still took 40s, so I probed the 
stack and it was spending most of the time in 
AggregateFunDef.removeOverlappingTupleEntries.  I commented this out, 
and the LucidDB time went down to 15s for the first execution, and 8s 
for the cached execution.

To try this, you also have to set mondrian.rolap.maxConstraints=100000, 
because the IN list generated is huge.  (If your DBMS doesn't like large 
IN lists, this won't be an attractive approach.)  This implies cache 
bloat issues, since those make up the cache key.  And the overall 
approach raises optimization issues, since for many MDX queries, pushing 
down to SQL like this would be overkill.


More information about the Mondrian mailing list