[Mondrian] Optimizing count distinct on degenerate dimensions

Julian Hyde jhyde at pentaho.com
Wed Oct 28 18:10:06 EDT 2009


Specify the approxRowCount attribute in the Level, then Mondrian will not
need to execute the distinct-count.
 
Julian


  _____  

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On
Behalf Of Wes Biggs
Sent: Wednesday, October 28, 2009 2:53 PM
To: mondrian at pentaho.org
Subject: [Mondrian] Optimizing count distinct on degenerate dimensions


The "Optimizing Mondrian Performance" article
(http://p4webhost.eigenbase.org:8080/@rev1=head@//open/mondrian/doc/optimizi
ng_performance.html) notes the following:

<snip>
Mondrian currently uses 'count(distinct ...)' queries to determine the
cardinality of dimensions and levels as it starts, and for your measures
that are counts, that is,aggregator="count". Indexes might speed up those
queries -- although performance is likely to vary between databases, because
optimizing count-distinct queries is a tricky problem.
</snip>


This is particularly relevant for degenerate dimensions, which require a
table scan of the fact table.  We are using MySQL (5.1) and have a fact
table with >50M rows, and a simple startup query like:

select count(distinct FIELD) from FACT_TABLE;


requires a table scan of the fact table (even though there is an index on
FIELD) and takes more than 4 minutes at present.


The same query can be rewritten (for MySQL, at least), as 


select count(*) from (select distinct FIELD from FACT_TABLE group by FIELD)
as anylabel;

...which is almost instantaneous ("explain" notes that this query is "Using
index for group-by").

The code in RolapStar.java appears to actually implement this approach, but
only as a fallback for the case where the database dialect does not support
count(distinct).  (Though I can't tell from my quick glance at the code if
it includes the "as anylabel" suffix which is required in MySQL).

(1) am I missing an obvious way to get this to work with the code as is?
(2) if not, is there an easy way I can hack this in?  I guess I would need
to know how to detect if a RolapStar.Column instance was degenerate, and
change the logic in getCardinality() accordingly.

For what it's worth (and for anyone else having this problem), in my case I
can get away with an inline table because the distinct values of FIELD are
known in advance (it is effectively an enum column), which also avoids the
table scan, but it might be useful to optimize in the more general case.
There was also some discussion in the archives of this list to use aggregate
tables in order to determine cardinality[1] but I don't think any work has
been done on that yet.

Thanks,

Wes


[1] http://forums.pentaho.org/showthread.php?t=70263


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20091028/1e507916/attachment.html 


More information about the Mondrian mailing list