[Mondrian] Aggregate over distinct-count measures

Julian Hyde julianhyde at speakeasy.net
Mon Aug 6 03:44:24 EDT 2007


I just checked in the last feature on the roadmap for mondrian 2.4, in
change 9710:  <http://p4web.eigenbase.org/@md=d&c=6PU@//9710?ac=10>
http://p4web.eigenbase.org/@md=d&c=6PU@//9710?ac=10
 
It was a big check-in to mondrian for what seems (on the face of it) a small
feature: the ability to roll up distinct-count measures to calculated
members which represent a span of time or a collection of customers.
 
A simple example: you know you had 1000 distinct customers in Q1 and 800
distinct customers in Q2. How many distinct customers did you have in Q1 and
Q2 combined? In MDX, we represent Q1 and Q2 combined as a member calculated
using the Aggregate function:

WITH MEMBER [Time].[1997 H1]
    AS 'Aggregate({[Time].[1997].[Q1], [Time].[1997].[Q2]})'
SELECT {[Measures].[Customer Count]} ON COLUMNS,
 {[Time].[1997].[Q1], [Time].[1997].[Q2], [Time].[1997 H1]} ON ROWS
FROM [Sales]

We know the answer is somewhere between 1000 (if all of the Q2 customers
were from Q1) and 1800 (if all of the Q2 customers were new), but finding
that answer is a difficult one for mondrian. Normally the MDX layer asks for
a cell from the cell cache, using a set of (column, value) pairs as the
coordinates of the cell, but the compound member [Time].[1997 H1] doesn't
fit into that coordinate scheme. In fact the answer cannot be computed from
cells in cache.
 
To answer queries like this, I needed to extend the data structure which
mondrian uses to record cache requests, and translate each cache request
into a single SQL statement. The SQL turns out to be quite complex,
especially if there are multiple compound members.
 
For example, the query

WITH MEMBER [Store].[CA plus USA] AS
        'AGGREGATE({[Store].[USA].[CA], [Store].[USA]})', solve_order=1
    MEMBER [Time].[Q1 plus July] AS
        'AGGREGATE({[Time].[1997].[Q1], [Time].[1997].[Q3].[7]})',
solve_order=2
SELECT {[Measures].[Customer Count], [Measures].[Unit Sales]} ON COLUMNS,
    {[Store].[CA plus USA]} * {[Time].[Q1 plus July]} ON ROWS
FROM Sales

generates

select count(`c`) as `c0`
from (
    select distinct `sales_fact_1997`.`customer_id` as `c`
    from `time_by_day` as `time_by_day`,
     `sales_fact_1997` as `sales_fact_1997`,
     `store` as `store`
    where `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
    and `time_by_day`.`the_year` = 1997
    and `sales_fact_1997`.`store_id` = `store`.`store_id`
    and (`store`.`store_state` = 'CA'
            or `store`.`store_country` = 'USA')
    and ((`time_by_day`.`quarter` = 'Q1' and `time_by_day`.`the_year` =
1997)
     or (`time_by_day`.`month_of_year` = 7 and `time_by_day`.`quarter` =
'Q3' and `time_by_day`.`the_year` = 1997))
) as `dummyname`

Besides representing these requests and generating this baroque SQL, I also
had to ensure that these unusual, more general requests did not compromise
the design simplicity and performance of ordinary dimensional requests. I
did this by storing the non-relational coordinates in a new field
RolapEvaluator.aggregationLists, and ensuring that most of the dimensional
expression process doesn't look at that field. To make things a bit simpler,
I issue a one SQL query per compound cell.
 
Please give this new feature a try, and run your application's regression
suite over the latest code base. Next stop is mondrian-2.4. Now this feature
is complete, I will make release candidate 1 in the next day or so, and one
RC per week until we have the quality required for a release.
 
Julian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20070806/88b9647e/attachment.html 


More information about the Mondrian mailing list