[Mondrian] Formalisation of your suggestion on first, last, none etc

michael bienstein mbienstein at yahoo.fr
Tue Nov 27 04:15:24 EST 2007


OK Julian, I understand your worry.  The short answer is "yes": the "none" aggregator could work for me with some modifications to what you were proposing.  What I'm going to do in this post is step back from my project requirements and look at what you are proposing.  To keep it short I've kept the text rigid; otherwise I'd meander far too long.  This makes it less readable and more like a maths paper.  Sorry for that.

I have to say first up that I am not even sure this project will use Mondrian yet - I'm trying to determine the risks and costs involved in doing it right now.  If it becomes too time-consuming, costs too much money, or if the maintenance risk is too high, the customer is likely to refuse this solution.
---
So I see the following proposals on the table:
1) Writeback.  Let's drop this.  My problem is essentially a writeback problem, but since it's not "write over", i.e. the user doesn't write the data in a way that creates more than one version of it, let's treat it as a measure in the cube rather than as "writeback".

2) Measures which aggregate differently along different levels/dimensions and potentially in a specific order.
First of all, consider how "count" works.  We use a SQL count to roll up from the fact table to an aggregate table, but to roll up from one aggregate table to another we use a SQL sum.  My point is that rolling up from the fact table is mathematically a separate step from rolling up along levels or dimensions.
Formalising this we get: (a) an aggregator from the fact table to the lowest level of aggregation, (b) an order of levels to roll up, and (c) an aggregator to apply to each level in that order.  Optimisation is possible if the order does not mathematically affect the result.  "none" cannot work as the aggregator from the fact table in step (a).  If "none" is the aggregation applied to a level then all levels after it in the rollup must be "none" as well, and the cells at members of higher levels will be EMPTY, i.e. none({a,b,c}) gives EMPTY every time.  "first" and "last" can be introduced as aggregators, but they cannot be applied to the fact table because the facts within a cell are unordered.
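As a minimal sketch of steps (a) and (c) for plain count (class and method names here are hypothetical, just to show that they are two distinct operations):

import java.util.List;

class CountRollupSketch {
    // (a) fact table -> lowest level of aggregation: each cell is the SQL COUNT of its fact rows.
    static long fromFacts(List<Object> factRowsInCell) {
        return factRowsInCell.size();
    }

    // (c) lower aggregate -> higher aggregate: the already-counted cells are combined with SUM.
    static long fromChildCells(List<Long> childCellCounts) {
        long total = 0;
        for (long count : childCellCounts) {
            total += count;
        }
        return total;
    }
}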

Now consider the two types of distinct count possible:
i) Distinct count of members of a level that is represented in the cube space.
Your example is distinct customers and customer is a level in the cube space.
ii) Distinct count on a measure column or other column that does not participate in the cube space.
A contrived example is the number of distinct prices.

In the second type of distinct count, aggregation tables can only be used if the rollup is an EXACT match; otherwise we apply the distinct count to the fact table.  (The only other way to do it is to store the lists of distinct values in intermediary tables, or to keep a bitmap representation of those values as the intermediary values.  Let's not do that, since it is not the main use case of distinct count.)  The annoying thing here is that the logic described above, (a) fact table aggregator - (b) order of levels - (c) per-level aggregation, is broken for this type of distinct count.
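To see why, take illustrative numbers: if one child cell contains prices {10, 20} and a sibling cell contains {20, 30}, each has a distinct count of 2, but their parent's distinct count is 3, not 2 + 2 = 4, so the parent cannot be derived from the children's values alone.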

In the first type of distinct count we have to introduce a different kind of aggregator which I will call "nonempty" for now.  "nonempty" returns either 1 or EMPTY depending on whether there is at least one non-NULL/EMPTY value in the set.  We then have the following (a)/(b)/(c) definition of distinct count: (a) fact table to finest aggregation level using nonempty; (b) roll up first along those levels that are not functionally dependent on the distinct count, and afterwards along those that are functionally dependent; (c) for levels that are not functionally dependent use nonempty, for levels that are dependent use sum.  This works because I just need to know whether there is at least one fact per customer in the selected slice of the cube, and then I roll up the 1's and EMPTYs using sum along the customer dimension.  The order is important because nonempty(sum(a,b), sum(c,d)) != sum(nonempty(a,b), nonempty(c,d)).
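To instantiate that inequality with illustrative numbers: take a = b = c = d = 1, i.e. two customers who each bought in both months of the slice.  Then nonempty(sum(1,1), sum(1,1)) = nonempty(2,2) = 1, whereas sum(nonempty(1,1), nonempty(1,1)) = sum(1,1) = 2, and only the latter is the correct distinct customer count.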

So in short the (a)/(b)/(c) formalisation works well for the first type of distinct count but not for the second.

As for mixing other types of aggregations, only sum mixed with min/max needs the order to be specified.
sum mixed with sum is ok.  min/max mixed with min/max is ok.  min/max with first/last is ok.  sum with first/last is ok.  first/last mixed with first/last is ok.
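An illustrative instance of the sum with min/max case: min(sum(1,2), sum(3,0)) = min(3,3) = 3, whereas sum(min(1,3), min(2,0)) = sum(1,0) = 1, so the result depends on which dimension is rolled up first.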

So the formalisation of a measure's aggregator is:
interface MeasureAggregator { /* details to be determined */ }

class NonLevelDistinctCountAggregator implements MeasureAggregator {}

class RegularMeasureAggregator implements MeasureAggregator {
    // To roll up an expression from the fact table.  E.g. a standard deviation can be
    // done by rolling up count(m), sum(m) and sum(m*m), since
    // stddev(m) = sqrt(sum(m*m)/count(m) - (sum(m)/count(m))^2).
    // For this we need to apply sum to "m*m" and not just "m".
    SQLExpression factExpression;
    FactAggregator factAgg;                   // count, sum, min, max, nonempty
    Comparator<Level> levelOrder;             // not a list ... this allows many levels to be interchangeable
    Map<Level, LevelAggregator> aggForLevels; // sum, min, max, first, last, nonempty
}

class SimpleCalculatedMeasureAggregator implements MeasureAggregator {
    // e.g. avg = sum/count.
    // I'm surprised that Mondrian currently does this by storing intermediary avg values in aggregation tables.
}
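
As a usage sketch, the first type of distinct count above could then be wired up as follows.  This reuses the hypothetical types from the sketch; timeLevel, productLevel and customerLevel stand in for the cube's Level objects, and the enum constants are assumptions, not existing Mondrian API:

// Hypothetical wiring for "distinct count of customers".
RegularMeasureAggregator distinctCustomers = new RegularMeasureAggregator();
distinctCustomers.factExpression = new SQLExpression("1");   // (a) any fact row marks its cell
distinctCustomers.factAgg = FactAggregator.NONEMPTY;
distinctCustomers.levelOrder =                               // (b) Customer is rolled up last
    Comparator.comparing((Level l) -> l == customerLevel ? 1 : 0);
distinctCustomers.aggForLevels = Map.of(                     // (c) nonempty first, then sum on Customer
    timeLevel, LevelAggregator.NONEMPTY,
    productLevel, LevelAggregator.NONEMPTY,
    customerLevel, LevelAggregator.SUM);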

So now there are three things to do:
1) How can a user specify the RegularMeasureAggregator's properties?  It should be easy to specify the most obvious use cases but flexible enough to handle the general ones.
2) How to change the rollup functionality to use it. (i.e. change Mondrian's innards).
3) How to map intermediary values to SQL on aggregate tables (and create those tables).

(1) Requires thought.  The general case is too general to be user-friendly.
Frankly, for distinct count as a use case, specifying functional dependency would be better done at the relational table level than at the Cube/Level level.  See in this regard your email to me of 3 June 2006, in which you wrote about an idea I had had on what I called "implication groups": "It's also similar to Functional Dependency -- database normalization theory, e.g. the writings of Chris Date in the late 80's."
(2) I haven't thought about this at all, except that after the aggregation from the fact table we have a value for each cell, and this can be stored either in the aggregate tables or in the cell cache.
(3) I think that in order to ensure that the SQL we generate doesn't break, we should not allow an aggregate table to omit columns that we know will only ever hold NULLs; we should force these redundant columns into the aggregate tables.  For example, suppose "none" is the aggregator for everything on a measure (my issue).  In this case any aggregate table that exists should still have the column for this measure, even if every row has a NULL value in that column.
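For illustration, assuming a hypothetical aggregate table agg_sales_by_month and a budget measure whose aggregator is "none", the generated DDL would keep the measure's column even though it is always NULL:

// Illustrative only: the "none"-aggregated measure keeps its column in the aggregate table.
String ddl =
    "CREATE TABLE agg_sales_by_month ("
    + " month_id    INTEGER NOT NULL,"
    + " store_id    INTEGER NOT NULL,"
    + " units_sum   INTEGER,"         // a regular summed measure
    + " budget_none DECIMAL(10,2),"   // the "none" measure: every row holds NULL here
    + " fact_count  INTEGER NOT NULL)";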

So what I propose that I can work on, if my budget is approved, is only point (4): a "none" aggregator for a measure that never uses the fact table and maps to a column in each existing aggregate table.  If Mondrian creates aggregate tables then this column will be created and populated with only NULL values.

Michael





