[Mondrian] RE: Interest in aggregate tables and their utilization

Julian Hyde jhyde at pentaho.com
Sat Jun 12 03:32:26 EDT 2010


I'm pleased that you are implementing aggregate tables for MySQL. There is
definitely a huge need. Without some kind of materialization, any database
can only answer a top-level ROLAP query by doing a full table scan. That
effort tends to grow linearly with the data size, and MySQL runs out of
steam fast.

There are several difficult problems with aggregate tables:
1. Choosing the right set of aggregate tables to instantiate;
2. Generating the DDL for those aggregate tables;
3. Initially populating them;
4. Incrementally populating them;
5. Formulating queries so that the aggregate tables are used.

All of these problems are hard. The good news is that Mondrian's approach is
complementary to yours. We knew that MySQL's optimizer couldn't do #5 and
was unlikely to do so in the near future, so we explicitly included the
aggregate definitions in Mondrian's schema and changed the SQL we generate.
AggGen was a weak attempt at #1, #2 and #3, and you are doing an excellent
job at #4 (and also #3).
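The heart of #5 is a coverage test: a query can be answered from an
aggregate table when that table was grouped by a superset of the columns the
query needs (assuming distributive measures), and among the candidates you
want the smallest one. A minimal sketch, with invented table names and row
counts:

```python
# Hypothetical sketch of problem #5: given the grouping columns a query
# needs, pick the smallest registered aggregate table that contains them,
# falling back to the fact table otherwise. All names and row counts are
# made up for illustration.

FACT_TABLE = ("sales_fact", 100_000_000)

# (table name, columns it was grouped by, row count)
AGG_TABLES = [
    ("agg_by_day_product",   {"day", "product"},   5_000_000),
    ("agg_by_month_product", {"month", "product"},   200_000),
    ("agg_by_month",         {"month"},                 1_000),
]

def choose_table(query_columns):
    """Return the cheapest table that can answer a GROUP BY on query_columns."""
    candidates = [(rows, name) for name, cols, rows in AGG_TABLES
                  if query_columns <= cols]  # coverage test: subset of columns
    if candidates:
        return min(candidates)[1]           # smallest covering aggregate
    return FACT_TABLE[0]                    # no aggregate covers the query

print(choose_table({"month"}))             # agg_by_month
print(choose_table({"month", "product"}))  # agg_by_month_product
print(choose_table({"customer"}))          # sales_fact
```

Mondrian does this table selection itself and emits SQL against the chosen
table, since the optimizer underneath won't.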

Even better news: there is a superior tool for recommending and creating
aggregate tables (#1, #2, #3). It is called Pentaho Aggregate Designer [ see
http://ci.pentaho.com/view/Analysis/job/aggdesigner-core/ and
http://ci.pentaho.com/view/Analysis/job/PAD/ ].

PAD is an aggregate table recommender, exposed via a user interface. The
recommendation algorithm is inspired by "Implementing Data Cubes
Efficiently" (Harinarayan, Rajaraman, and Ullman, SIGMOD 1996;
http://ilpubs.stanford.edu:8090/102/), and I used a randomized algorithm to
tame the beastly polynomial running time.
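The paper's greedy heuristic is straightforward to sketch: treat every
subset of dimensions as a candidate view, and repeatedly materialize the one
whose total benefit (rows saved across all queries it can answer) is
largest. The lattice and sizes below are invented for illustration, and this
omits the randomization and space budgeting that a real recommender adds:

```python
# Toy version of the greedy view-selection idea from Harinarayan et al.
# Dimensions are t(ime), p(roduct), c(ustomer); a view v can answer a
# query q iff q's dimensions are a subset of v's. Sizes are invented.

SIZES = {  # frozenset of dimensions -> estimated row count
    frozenset("tpc"): 6_000_000,  # top view (fact-table granularity)
    frozenset("tp"):    800_000,
    frozenset("tc"):  6_000_000,
    frozenset("pc"):  6_000_000,
    frozenset("t"):         200,
    frozenset("p"):       1_000,
    frozenset("c"):     100_000,
    frozenset(""):            1,
}

def greedy(k):
    """Pick k views to materialize; the top view is always available."""
    top = max(SIZES, key=len)
    chosen = {top}
    for _ in range(k):
        def benefit(v):
            # Rows saved, summed over every query v could answer,
            # relative to the cheapest already-chosen view for that query.
            total = 0
            for q in SIZES:
                if q <= v:
                    cheapest = min(SIZES[m] for m in chosen if q <= m)
                    total += max(0, cheapest - SIZES[v])
            return total
        best = max((v for v in SIZES if v not in chosen), key=benefit)
        chosen.add(best)
    return chosen - {top}

# With two picks, the (time, product) view wins first (it speeds up four
# queries), then the customer view.
print(greedy(2))
```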

The algorithm is pluggable. You can drop in a completely different
algorithm, if you have the urge to write one, provided that it works on the
same inputs and produces a set of aggregate tables. The tool is also
extensible: you can write your own input plugin to read a schema, or an
output plugin to generate DDL. The algorithm can even run from the command
line, and has been embedded inside a stored procedure in LucidDB.

The algorithm works on a static analysis of the schema (the structure of the
star schema and the observed cardinalities of rows and column values) but
does not today take query patterns into account. Strictly speaking, if you
want to be driven purely by the actual queries, AggGen is superior, because
it generates the aggregates for precisely the queries the end user entered.
But PAD generates much better coverage.
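As one example of what that static analysis buys you: the size of a
candidate aggregate can be bounded without running any queries, since the
number of groups cannot exceed the product of the grouping columns'
distinct-value counts, nor the fact table's row count. A rough sketch, with
all cardinalities invented:

```python
# Crude static estimate of an aggregate table's row count from observed
# column cardinalities. Real recommenders refine this with sampling;
# the numbers here are illustrative only.

FACT_ROWS = 10_000_000
CARDINALITY = {"day": 730, "month": 24, "product": 400, "customer": 50_000}

def estimate_rows(grouping_columns):
    est = 1
    for col in grouping_columns:
        est *= CARDINALITY[col]      # upper bound: every combination occurs
    return min(est, FACT_ROWS)       # cannot exceed the fact table itself

print(estimate_rows(["month", "product"]))  # 9600
print(estimate_rows(["day", "customer"]))   # capped at 10,000,000
```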

I suggest you take a look at the PAD source code and see whether it would be
possible to write a plugin to generate stored procedure calls. Also download
and run PAD; playing with the UI gives you an idea of what it can do.

The algorithm could certainly do a good job at generating a near-optimal set
of aggregate tables for a given amount of available disk space.


> -----Original Message-----
> From: Justin Swanhart [mailto:greenlion at gmail.com] 
> Sent: Wednesday, June 09, 2010 12:35 PM
> To: jhyde at pentaho.com
> Cc: mondrian at pentaho.org
> Subject: Interest in aggregate tables and their utilization
> Hi,
> First, I should preface this with a note that I have very limited
> experience with Mondrian, but extensive experience with MySQL and SQL
> in general.  Also, sorry for the length of this post, but aggregate
> tables and maintaining them is a complex topic.
> I am interested in aggregate tables, not because I personally have a
> large data set which I am reporting on, but instead because I have a
> project called Flexviews (flexviews.sourceforge.net) which enables
> fast (incrementally) refreshable materialized views to MySQL.  These
> views work similarly to fast refresh views in Oracle, materialized query
> tables in DB2, or indexed views in SQL server.  Flexviews has three
> parts, a data dictionary (MySQL tables), an external CDC tool (which
> makes table change logs) and a set of stored procedures for
> manipulating the dictionary (to create or modify views) and for
> refreshing the materialized views from the table change logs.  It also
> supports completely rebuilding views from scratch.
> Flexviews supports the major aggregate functions: SUM, COUNT, AVG,
> MIN, MAX, and COUNT DISTINCT, as well as joins. It should be able to
> incrementally maintain any of the aggregate tables that are defined by
> AggGen.
> One of the difficulties with Flexviews is that incrementally
> maintaining aggregate tables is only one part of the problem.
> Effectively using them requires a query rewrite strategy to
> automatically use the aggregate tables when they are available.  As
> far as I know, Mondrian is the only open source software that has
> similar functionality.  This pairs well with Flexviews, which is the
> only open source (LGPL) project which supports incrementally
> refreshing materialized views.
> I am thinking about modifying AggGen to support the stored procedure
> API to create Flexviews materialized views directly.  I was wondering
> if you had any thoughts or suggestions about
> such modifications.  I was also wondering if any progress had been
> made on the other tools which are mentioned in the Aggregate Tables
> section of the manual such as the recommender.  Another challenge with
> Flexviews is of course, choosing the best views to materialize.
> For those curious how Flexviews works:
> Flexviews is based primarily on:
> http://pages.cs.wisc.edu/~beyer/papers/matview_sigmod00.pdf 
> (How to Roll a Join)
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
> (Maintenance of Data Cubes and Summary Tables in a Warehouse)
> The first paper covers the incremental computation for SPJ queries. It
> should be noted that Flexviews doesn't currently implement rolling
> propagation, but only the simpler propagate function (both are defined
> by pseudo code in Beyer's paper).  The better algorithm will be
> implemented in the future, but the lesser performing version is still
> orders of magnitude faster than a complete refresh.  Flexviews
> implements a recursive stored procedure version of the algorithm, and
> communicates with the external CDC program which is processing the
> database transaction logs (binary logs).  The paper does not address
> aggregate queries and instead points the reader to the second paper.
> One last note: the algorithm features "two phase propagation".  That
> is, computation of deltas is completely orthogonal to the application
> of the deltas, and they can be performed asynchronously.  This means
> that deltas can be calculated periodically during daily processing,
> and applied to the view at a later time in bulk.  This provides
> another nice benefit: multiple views can be refreshed to the exact
> same transactional point in time, providing a consistent
> aggregated data model.
> The second paper provides a method of using 'summary aggregate tables'
> and it goes into detail about how transformations on various aggregate
> functions can be performed in order to compute incremental results.
> For instance, COUNT(*) can be turned into SUM(1), etc.  The paper
> covers the general rewrites necessary to support maintaining aggregate
> tables.  The ideas in this paper pair pretty well with the rolling
> join propagation algorithm.
> Instead of handling non-distributable aggregate functions directly,
> Flexviews creates an ancillary view which projects out the expressions
> as group-by attributes:
> select a,b,sum(c),min(d)
> from xyz
> group by a,b;
> will create a second view:
> select a,b,d
> from xyz
> group by a,b,d;
> This has the side effect of making all single table views self
> maintainable, even those with non-distributable aggregate functions,
> at the cost of extra space and maintenance effort.
> --
> Justin Swanhart
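
For readers following along, the "two phase propagation" Justin describes is
easy to sketch in miniature: phase one folds the change log into per-group
deltas, and phase two applies the deltas to the materialized view in bulk,
possibly much later. The table layout and log format below are invented;
Flexviews itself does this with MySQL tables and stored procedures.

```python
# Toy two-phase incremental maintenance of a SUM/COUNT aggregate.
from collections import defaultdict

# Materialized view: (a, b) -> [sum_c, row_count]
view = {("x", 1): [10.0, 2], ("y", 2): [5.0, 1]}

def compute_delta(change_log):
    """Phase 1: fold row-level inserts/deletes into per-group deltas."""
    delta = defaultdict(lambda: [0.0, 0])
    for op, a, b, c in change_log:   # op is +1 for insert, -1 for delete
        d = delta[(a, b)]
        d[0] += op * c
        d[1] += op
    return delta

def apply_delta(view, delta):
    """Phase 2: apply deltas in bulk, updating, inserting, or deleting groups."""
    for key, (dsum, dcnt) in delta.items():
        s, n = view.get(key, [0.0, 0])
        s, n = s + dsum, n + dcnt
        if n == 0:
            view.pop(key, None)      # group vanished from the base table
        else:
            view[key] = [s, n]

log = [(+1, "x", 1, 4.0), (+1, "z", 3, 7.0), (-1, "y", 2, 5.0)]
apply_delta(view, compute_delta(log))
print(view)  # {('x', 1): [14.0, 3], ('z', 3): [7.0, 1]}
```

Because the two phases are decoupled, several views can be refreshed from
deltas cut at the same transactional point, which is exactly the consistency
property described above.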

More information about the Mondrian mailing list