[Mondrian] RE: change 9710: aggregating count-distinct over compound cells

Julian Hyde julianhyde at speakeasy.net
Wed Oct 10 11:58:25 EDT 2007


I knew when I wrote the feature that it would issue one SQL statement for
each cell with a distinct-count measure and a compound member on any axis.
This is not ideal, but I accepted it as a limitation in the first version.

My rationale was that (a) it was incredibly difficult to get this feature to
work at all - several months work on my part, (b) the queries that are
running somewhat slowly now were not working at all or were returning
incorrect results previously.  I'm glad we're talking how we can make it
better.

I agree that the compound slicer is a common case. However, compound members
can occur on other axes too, and I'd like to optimize that too. The problem
is to work within the current relational caching infrastructure, and that
cache is based on tuples of (column, value) pairs - NOT based on members. (I
don't consider it to be an option to create a new cache. We have enough
caches already!)

So, suppose we've invented a compound member,

  [Product].[Beverage].[Beer and Wine] =
Aggregate({[Product].[Beverage].[Beer], [Product].[Beverage].[Wine]})

This would be represented by the SQL condition

  category = 'Beverage' AND (type = 'Beer' OR type ='Wine')

The approach I'm thinking of is to generate a new value for the type column
so that cells belonging to this member can be represented in the cache.

There's also a problem populating the cache. You can't write a single SQL
statement which aggregates Beer, Wine, and Beer-and-wine in one round trip
to the database, because each fact table row needs to end up in at most one
group in the GROUP BY clause.

Finally there's a problem with compound members which cross levels, e.g. 

  [Product].[Beverage].[Beer and Bordeaux Wine] =
Aggregate({[Product].[Beverage].[Beer],
[Product].[Beverage].[Wine].[Bordeaux]})

That scenario is less common, and I have no idea how to start solving it.

Julian

PS Forwarding to the mondrian list. Please address emails to the list
wherever possible.

> -----Original Message-----
> From: Rushan Chen [mailto:rchen at lucidera.com] 
> Sent: Tuesday, October 09, 2007 3:30 PM
> To: Julian Hyde
> Subject: change 9710: aggregating count-distinct over compound cells
> 
> Hi Julian,
> 
> This is regarding the change you made in 9710 to compute 
> count-distinct 
> correctly over compound cells.
> 
> http://lists.pentaho.org/pipermail/mondrian/2007-August/000690.html
> 
> It used to compute count-distinct over a compound group as sum of the 
> count-distinct over each subgroup, which is incorrect; After your 
> change, the compound group is expressed as where clause 
> predicates and 
> cells are loaded one at a time. As expected, loading 
> aggregates a cell 
> per SQL will slow down queries that load every cells over dimensions 
> where the members are not compounded. This commonly happens when the 
> slicer axis of a query contains a compound member. For 
> example, to find 
> the count of customers of product P1 or P2 for each store, a 
> SQL has to 
> be issued for each store over product (P1, P2); while a single SQL 
> filtering on the required products and grouping by store will be more 
> efficient.
> 
> Is this the right direction for optimizing such queries? If so, 
> CellRequest seems to be the right class to distinguish between slicer 
> axis(to generate where clause) and query axis(to generate group by 
> clause).  It needs to include a new member field to remember 
> the slicer 
> context a cell is computed for, and provide interface to distinguish 
> cells with the same query axis but different slicer axis. For 
> example,count of customers of product P1 or P3 for each store is 
> different from the count of product P1 or p2.
> 
> Likely the change will be extensive; so I'd like to hear your 
> inputs; or 
> if you had thought about other possible optimizations  when you were 
> implementing the compound distinct-count.
> 
> Thanks,
> 
> Rushan
> 
> p.s.
> 
> This Foodmart example shows one SQL is issued for each store 
> state. Note 
> that there's a predicate generation bug when forming SQL for the 
> compound cell(should use OR instead of AND between all 
> products). I will 
> fix it.
> 
> With
> Set [Products] as '{[Product].[All Products].[Drink], [Product].[All 
> Products].[Food], [Product].[All Products].[Non-Consumable]}'
> Member [Product].[Selected Products] as 'Aggregate([Products])', 
> SOLVE_ORDER=2
> Select {[Store].[Store State].Members}  on rows, 
> {[Measures].[Customer 
> Count]} on columns
>  From [Sales]
> Where ([Product].[Selected Products]);
> 
> 
> 




More information about the Mondrian mailing list