[Mondrian] Re: change 9710: aggregating count-distinct over compound cells

Mon Nov 19 20:33:57 EST 2007

I have drafted a design doc based on the "CellContext" idea to improve 
the performance of aggregate loading for cells with "compound" constraints.

http://www.eigenbase.org/wiki/index.php/MondrianDistinctCountAggregateImprovement

This proposal requires pretty far-reaching code change so I did some 
prototyping to make sure this idea would work. So far, despite the 
sizable changes required, the basic functionalities(batch loading, 
caching) are working with some careful extraction of code and 
streamlining of interfaces. This hopefully will make aggregate 
loading/caching more modular and using the new set of interfaces less 
error prone.

Since I have done just some prototyping, there could be design flaws 
lurking still. I would really appreciate your input and/or your comments 
on how to better test this improvement.

Thanks,

Rushan

Julian Hyde wrote:
>> Rushan Chen wrote:
>>
>> I was also looking at the caching problem you referred to. I was 
>> thinking of including in the aggregate cache key a "CellContext" to 
>> represent the compound member; so the new key will consist of 
>> ColumnBitKeys and CellContext. Usually CellContext is empty, 
>> except for 
>> aggregates such as count-distinct that are computed for 
>> compound members 
>> which cannot be rolled up from single cells computed for the members 
>> that are contained in the compound member.
>>
>> Here is an example of the aggregation cache content storing 
>> the product 
>> and store dimensions. Aggregations are represented by 
>> AggegationKeys(which will include CellContext, in {}), and Cells are 
>> represented by the CellKeys.
>>
>> Aggregation                                             Cell
>>
>> (store, {[beer], [wine].[Bordeaux]})       (south)  -- aggregate for 
>> store "south", over "beer" and "Bordeaux wine"
>>                                                                    
>> (east)     -- aggregate for store "east",  over "beer" and 
>> "Bordeaux wine"
>>
>> (store, beverage group)                          (east, beer)   -- 
>> aggregate for store "east" and  beverage group "beer"
>>                                                               
>>     (west, 
>> beer)  --  aggregate for store "west" and beverage group "beer"
>>                                                               
>>     (west, 
>> wine)  --   aggregate for store "west" and beverage group "wine"
>>
>> (store, beverage type)                            (west, Bordeaux 
>> wine)   -- aggregate for store "west" and beverage type 
>> "Bordeaux wine"
>>
>> CellRequest will have to include a similar field too to 
>> distinguish the 
>> CellContext from the members that constrain the cell. If 
>> CellContext and 
>> constraining columns match, cells can be accumulated along the 
>> constraining column, and be loaded in  a single SQL statement.
>>
>> Some implications of this idea:
>> (1) Aggregate cache loookup and loading will need to be changed quite 
>> extensively.
>> (2) Effectively segmenting the aggregate cache into areas that do not 
>> have a CellContext, and areas with CellContext. (Is this the 
>> one you are 
>> trying to avoid?)
>> (3) CellContext  contains sets of compound members, and have a less 
>> compact representation (unlike columnBitKey). The key portion of the 
>> aggregate cache will be a lot larger, and the lookup will be slower; 
>> however, once the matching Aggregation is located, the actual 
>> cell can 
>> be located using just the constraining column values. Also, the cell 
>> keys stored in the Segment will be the same as if there were 
>> no compound 
>> members.
>>
>> Like the original single cell loading scheme, this new one 
>> likely will 
>> take some effort to make it work. So I'd really appreciate 
>> comments on 
>> the design; or, if a simpler approach can be found.
>>     
>
> I thought of the 'CellContext' approach, and realised it had the same
> problem - the entire caching infrastructure needs to be aware of it, and
> it's easy to forget about it. Given that compound members are rarely used,
> we'd have problems finding all of the bugs. 
>
> I suggested the solution in my previous email - a special column value which
> lives in the regular bitmap. A client who is looking for [Beer and wine]
> looks in  the cache for
>
>   (product_special: 'Beer and wine',
>    product_type: null)
>
> Note that 'Beer' and 'Wine' are values of the product_type column, but the
> constraint on product_special makes them superfluous. This column would be
> part of the regular bitmap, so could not easily be ignored. The cache even
> obeys proper rollup semantics. The most difficult problem to overcome is
> that these columns need to be added on the fly when someone adds a calc
> member of a previously unused dimensionality.
>
> Julian
>
>   

-- 
Rushan Chen

rchen at lucidera.com

Read <http://tinyurl.com/ypc73a>  our customer reviews: "LucidEra is a
must have tool for any company that extensively uses salesforce.com"

Test drive <http://www.lucidera.com/test-drive.php>  LucidEra Revenue
Cycle Analysis

Comment <http://www.lucidera.com/blog/>  on our "Keep it Simple" blog