[Mondrian] Re: change 9710: aggregating count-distinct over compound cells

Rushan Chen rchen at lucidera.com
Wed Oct 10 16:25:29 EDT 2007


Hi Julian,

Thanks for the inputs.

I was also looking at the caching problem you referred to. I was 
thinking of including in the aggregate cache key a "CellContext" to 
represent the compound member; so the new key will consist of 
ColumnBitKeys and CellContext. Usually CellContext is empty, except for 
aggregates such as count-distinct that are computed for compound members 
which cannot be rolled up from single cells computed for the members 
that are contained in the compound member.

Here is an example of the aggregation cache content storing the product 
and store dimensions. Aggregations are represented by 
AggegationKeys(which will include CellContext, in {}), and Cells are 
represented by the CellKeys.

Aggregation                                             Cell

(store, {[beer], [wine].[Bordeaux]})       (south)  -- aggregate for 
store "south", over "beer" and "Bordeaux wine"
                                                                   
(east)     -- aggregate for store "east",  over "beer" and "Bordeaux wine"

(store, beverage group)                          (east, beer)   -- 
aggregate for store "east" and  beverage group "beer"
                                                                  (west, 
beer)  --  aggregate for store "west" and beverage group "beer"
                                                                  (west, 
wine)  --   aggregate for store "west" and beverage group "wine"

(store, beverage type)                            (west, Bordeaux 
wine)   -- aggregate for store "west" and beverage type "Bordeaux wine"

CellRequest will have to include a similar field too to distinguish the 
CellContext from the members that constrain the cell. If CellContext and 
constraining columns match, cells can be accumulated along the 
constraining column, and be loaded in  a single SQL statement.

Some implications of this idea:
(1) Aggregate cache loookup and loading will need to be changed quite 
extensively.
(2) Effectively segmenting the aggregate cache into areas that do not 
have a CellContext, and areas with CellContext. (Is this the one you are 
trying to avoid?)
(3) CellContext  contains sets of compound members, and have a less 
compact representation (unlike columnBitKey). The key portion of the 
aggregate cache will be a lot larger, and the lookup will be slower; 
however, once the matching Aggregation is located, the actual cell can 
be located using just the constraining column values. Also, the cell 
keys stored in the Segment will be the same as if there were no compound 
members.

Like the original single cell loading scheme, this new one likely will 
take some effort to make it work. So I'd really appreciate comments on 
the design; or, if a simpler approach can be found.

Rushan

Julian Hyde wrote:
> I knew when I wrote the feature that it would issue one SQL statement for
> each cell with a distinct-count measure and a compound member on any axis.
> This is not ideal, but I accepted it as a limitation in the first version.
>
> My rationale was that (a) it was incredibly difficult to get this feature to
> work at all - several months work on my part, (b) the queries that are
> running somewhat slowly now were not working at all or were returning
> incorrect results previously.  I'm glad we're talking how we can make it
> better.
>
> I agree that the compound slicer is a common case. However, compound members
> can occur on other axes too, and I'd like to optimize that too. The problem
> is to work within the current relational caching infrastructure, and that
> cache is based on tuples of (column, value) pairs - NOT based on members. (I
> don't consider it to be an option to create a new cache. We have enough
> caches already!)
>
> So, suppose we've invented a compound member,
>
>   [Product].[Beverage].[Beer and Wine] =
> Aggregate({[Product].[Beverage].[Beer], [Product].[Beverage].[Wine]})
>
> This would be represented by the SQL condition
>
>   category = 'Beverage' AND (type = 'Beer' OR type ='Wine')
>
> The approach I'm thinking of is to generate a new value for the type column
> so that cells belonging to this member can be represented in the cache.
>
> There's also a problem populating the cache. You can't write a single SQL
> statement which aggregates Beer, Wine, and Beer-and-wine in one round trip
> to the database, because each fact table row needs to end up in at most one
> group in the GROUP BY clause.
>
> Finally there's a problem with compound members which cross levels, e.g. 
>
>   [Product].[Beverage].[Beer and Bordeaux Wine] =
> Aggregate({[Product].[Beverage].[Beer],
> [Product].[Beverage].[Wine].[Bordeaux]})
>
> That scenario is less common, and I have no idea how to start solving it.
>
> Julian
>
> PS Forwarding to the mondrian list. Please address emails to the list
> wherever possible.
>
>   
>> -----Original Message-----
>> From: Rushan Chen [mailto:rchen at lucidera.com] 
>> Sent: Tuesday, October 09, 2007 3:30 PM
>> To: Julian Hyde
>> Subject: change 9710: aggregating count-distinct over compound cells
>>
>> Hi Julian,
>>
>> This is regarding the change you made in 9710 to compute 
>> count-distinct 
>> correctly over compound cells.
>>
>> http://lists.pentaho.org/pipermail/mondrian/2007-August/000690.html
>>
>> It used to compute count-distinct over a compound group as sum of the 
>> count-distinct over each subgroup, which is incorrect; After your 
>> change, the compound group is expressed as where clause 
>> predicates and 
>> cells are loaded one at a time. As expected, loading 
>> aggregates a cell 
>> per SQL will slow down queries that load every cells over dimensions 
>> where the members are not compounded. This commonly happens when the 
>> slicer axis of a query contains a compound member. For 
>> example, to find 
>> the count of customers of product P1 or P2 for each store, a 
>> SQL has to 
>> be issued for each store over product (P1, P2); while a single SQL 
>> filtering on the required products and grouping by store will be more 
>> efficient.
>>
>> Is this the right direction for optimizing such queries? If so, 
>> CellRequest seems to be the right class to distinguish between slicer 
>> axis(to generate where clause) and query axis(to generate group by 
>> clause).  It needs to include a new member field to remember 
>> the slicer 
>> context a cell is computed for, and provide interface to distinguish 
>> cells with the same query axis but different slicer axis. For 
>> example,count of customers of product P1 or P3 for each store is 
>> different from the count of product P1 or p2.
>>
>> Likely the change will be extensive; so I'd like to hear your 
>> inputs; or 
>> if you had thought about other possible optimizations  when you were 
>> implementing the compound distinct-count.
>>
>> Thanks,
>>
>> Rushan
>>
>> p.s.
>>
>> This Foodmart example shows one SQL is issued for each store 
>> state. Note 
>> that there's a predicate generation bug when forming SQL for the 
>> compound cell(should use OR instead of AND between all 
>> products). I will 
>> fix it.
>>
>> With
>> Set [Products] as '{[Product].[All Products].[Drink], [Product].[All 
>> Products].[Food], [Product].[All Products].[Non-Consumable]}'
>> Member [Product].[Selected Products] as 'Aggregate([Products])', 
>> SOLVE_ORDER=2
>> Select {[Store].[Store State].Members}  on rows, 
>> {[Measures].[Customer 
>> Count]} on columns
>>  From [Sales]
>> Where ([Product].[Selected Products]);
>>
>>
>>
>>     
>
>   





More information about the Mondrian mailing list