[Mondrian] Re: Interest in aggregate tables and their utilization

Mon Jun 14 01:13:46 EDT 2010

Hi,

I haven't been haven't much luck working directly with the aggregate
designer.  I can't get the source code to compile in Eclipse, and when
I try to use the RC version on sourceforge, everything works great
until I try to preview or export the aggregate scripts (see below).
Exporting the modified mondrian schema XML file works though.

I tried this on both a Vista machine with the Sun JVM, and on a FC8
machine with the OpenJDK.

22:00:32,262 ERROR [SwingButton] Error calling oncommand event: exportHandler.sh
owPreview()
org.pentaho.ui.xul.XulException: Error invoking method: exportHandler.showPrevie
w()
        at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDom
Container.java:327)
        at org.pentaho.ui.xul.swing.tags.SwingButton$OnClickRunnable.run(SwingBu
tton.java:58)
        at java.awt.event.InvocationEvent.dispatch(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.Dialog$1.run(Unknown Source)
        at java.awt.Dialog$3.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.awt.Dialog.show(Unknown Source)
        at java.awt.Component.show(Unknown Source)
        at java.awt.Component.setVisible(Unknown Source)
        at java.awt.Window.setVisible(Unknown Source)
        at java.awt.Dialog.setVisible(Unknown Source)
        at org.pentaho.ui.xul.swing.tags.SwingDialog.show(SwingDialog.java:234)
        at org.pentaho.aggdes.ui.form.controller.ExportHandler.openDialog(Export
Handler.java:144)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDom
Container.java:323)
        at org.pentaho.ui.xul.swing.tags.SwingButton$OnClickRunnable.run(SwingBu
tton.java:58)
        at java.awt.event.InvocationEvent.dispatch(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDom
Container.java:323)
        ... 30 more
Caused by: java.lang.NullPointerException
        at mondrian.rolap.RolapStar$Column.generateExprString(RolapStar.java:116
8)
        at mondrian.rolap.RolapStar.generateSql(RolapStar.java:925)
        at org.pentaho.aggdes.model.mondrian.MondrianSchema.generateAggregateSql
(MondrianSchema.java:273)
        at org.pentaho.aggdes.output.impl.PopulateTableGenerator.generate(Popula
teTableGenerator.java:59)
        at org.pentaho.aggdes.output.impl.AbstractGenerator.generateFull(Abstrac
tGenerator.java:34)
        at org.pentaho.aggdes.output.impl.OutputServiceImpl.getFullArtifact(Outp
utServiceImpl.java:152)
        at org.pentaho.aggdes.ui.form.controller.ExportHandler.showPreview(Expor
tHandler.java:212)
        ... 35 more

Clicking 'execute' on the CREATE scripts doesn't generate an
exception, but it triggers a SQL error about a duplicate column on the
15th view:
StatementCallback; bad SQL grammar [CREATE TABLE `dw_Sales_Cube_15` (
    `customer_State` CHAR(2),
    `time_Year` YEAR(4),
    `time_Quarter` INT(11),
    `time_Month` INT(11),
    `product_Brand` CHAR(20),
    `sales_total_sale` DOUBLE,
    `sales_Fact_Count` INTEGER,
    `sales_fact_count` INTEGER)]; nested exception is
java.sql.BatchUpdateException: Duplicate column name
'sales_fact_count'

I can give you my database, mondrian schema or both.  The database is
randomly generated data for the snowflake schema described in the
Aggregate Tables portion of the manual: sales, mfr, product, customer,
day.  There could be a problem in my schema file.

--Justin

On Sat, Jun 12, 2010 at 12:32 AM, Julian Hyde <jhyde at pentaho.com> wrote:
> Justin,
>
> I'm pleased that you are implementing aggregate tables for MySQL. There is
> definitely a huge need. Without some kind of materialization, any database
> can only answer a top-level ROLAP query by doing a full table scan. That
> effort tends to grow linearly with the data size, and MySQL runs out of
> stream fast.
>
> There are several difficult problems with aggregate tables:
> 1. Choosing the right set of aggregate tables to instantiate;
> 2. Generating the DDL for those aggregate tables;
> 3. Initially populating them;
> 4. Incrementally populating them;
> 5. Formulating queries so that the aggregate tables are used.
>
> All of these problems are hard. The good news is that Mondrian's approach is
> complimentary to yours. We knew that MySQL's optimizer couldn't do #5 and
> was unlikely to do so in the near future, so we explicitly included the
> aggregate definitions in mondrian's schema and changed the SQL we generate.
> AggGen was a weak attempt at #1, #2 and #3, and you are doing an excellent
> job at #4 (also #3).
>
> Even better news, there is a superior tool for recommending and creating
> aggregate tables (#1, #2, #3). It is called Pentaho Aggregate Designer [ see
> http://ci.pentaho.com/view/Analysis/job/aggdesigner-core/ and
> http://ci.pentaho.com/view/Analysis/job/PAD/ ].
>
> PAD is an aggregate table recommender, exposed via a user interface. The
> recommendation algorithm is inspired by "Implementing Data Cubes
> Efficiently" (Harinarayan, Rajaraman, and Ullman, 1995).
> http://ilpubs.stanford.edu:8090/102/, and I used a randomized algorithm to
> tame the beastly polynomial running time.
>
> The algorithm is pluggable. You can drop in a completely different
> algorithm, if you have the urge to write one, provided that it works on the
> same inputs and produces a set of aggregate tables. It tool is also
> extensible: you can write your own input plugin to read a schema, or output
> plugin to generate DDL. The algorithm can even run from the command line,
> and has been embedded inside a stored procedure in LucidDB.
>
> The algorithm works on a static analysis of the schema (structure of the
> star schema, observed cardinalities of rows and column values) but does not
> today take query patterns into account. Strictly speaking if you want to be
> driven purely by the actual queries, AggGen is a superior algorithm, because
> it generated the aggregates for precisely the queries the end-user entered.
> But PAD generates much better coverage.
>
> I suggest you take a look at the PAD source code and see whether it would be
> possible to write a plugin to generate stored procedure calls. Also download
> and run PAD; playing with the UI gives you an idea what it can do.
>
> The algorithm could certainly do a good job at generating a near-optimal set
> of aggregate tables for a given amount of available disk space.
>
> Julian
>
>> -----Original Message-----
>> From: Justin Swanhart [mailto:greenlion at gmail.com]
>> Sent: Wednesday, June 09, 2010 12:35 PM
>> To: jhyde at pentaho.com
>> Cc: mondrian at pentaho.org
>> Subject: Interest in aggregate tables and their utilization
>>
>> Hi,
>>
>> First, I should preface this with a note that I have very limited
>> experience with Mondrian, but extensive experience with MySQL and SQL
>> in general.  Also, sorry for the length of this post, but aggregate
>> tables and maintaining them is a complex topic.
>>
>> I am interested in aggregate tables, not because I personally have a
>> large data set which I am reporting on, but instead because I have a
>> project called Flexviews (flexviews.sourceforge.net) which enables
>> fast (incrementally) refreshable materialized views to MySQL.  These
>> views work similar to fast refresh views in Oracle, materialized query
>> tables in DB2, or indexed views in SQL server.  Flexviews has three
>> parts, a data dictionary (MySQL tables), an external CDC tool (which
>> makes table change logs) and a set of stored procedures for
>> manipulating the dictionary (to create or modify views) and for
>> refreshing the materialized views from the table change logs.  It also
>> supports completely rebuilding views from scratch.
>>
>> Flexviews supports the major aggregate functions: SUM, COUNT, AVG,
>> MIN, MAX, COUNT DISTINCT, and joins. It should be able to
>> incrementally maintain any of the aggregate tables that are defined by
>> AggGen.
>>
>> One of the difficulties with Flexviews is that incrementally
>> maintaining aggregate tables is only one part of the problem.
>> Effectively using them requires a query rewrite strategy to
>> automatically use the aggregate tables when they are available.  As
>> far as I know, Mondrian is the only open source software that has
>> similar functionality.  This pairs well with Flexviews, which is the
>> only open source (LGPL) project which supports incrementally
>> refreshing materialized views.
>>
>> I am thinking about modifying AggGen to support the stored procedure
>> API to create Flexviews materialized views directly.  I was wondering
>> if you had any thoughts or suggestions about
>> such modifications.  I was also wondering if any progress had been
>> made on the other tools which are mentioned in the Aggregate Tables
>> section of the manual such as the recommender.  Another challenge with
>> Flexviews is of course, choosing the best views to materialize.
>>
>> For those curious how Flexviews works:
>>
>> Flexviews is based primarily on:
>> http://pages.cs.wisc.edu/~beyer/papers/matview_sigmod00.pdf
>> (How to Roll a Join)
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.8573
>> (Maintenance of Data Cubes and Summary Tables in a Warehouse)
>>
>> The first paper covers the incremental computation for SPJ queries. It
>> should be noted that Flexviews doesn't currently implement rolling
>> propagation, but only the simpler propagate function (both are defined
>> by pseudo code in Beyer's paper).  The better algorithm will be
>> implemented in the future, but the lesser performing version is still
>> orders of magnitude faster than a complete refresh.  Flexviews
>> implements a recursive stored procedure version of the algorithm, and
>> communicates with the external CDC program which is processing the
>> database transaction logs (binary logs).  The paper does not address
>> SPJ queries and instead points the researcher to the second paper.
>> One last note,the algorithm features "two phase propagation".  That
>> is, computation of deltas is completely orthogonal to the application
>> of the deltas, and they can be performed asynchronously.  This means
>> that deltas can be calculated periodically during daily processing,
>> and applied to the view at a later time in bulk.  This provides
>> another nice benefit in the multiple views can be refreshed to the
>> same exact transactional point in time providing a consistent
>> aggregated data model.
>>
>> The second paper provides a method of using 'summary aggregate tables'
>> and it goes into detail about how transformations on various aggregate
>> functions can be performed in order to compute incremental results.
>> For instance, COUNT(*) can be turned into SUM(1), etc.  The paper
>> covers the general rewrites necessary to support maintaining aggregate
>> tables.  The ideas in this paper pair pretty well with the the rolling
>> join propagation algorithm.
>>
>> Instead of handling non-distributable aggregate functions directly,
>> Flexviews creates an ancillary view which projects out the expressions
>> as group-by attributes:
>> select a,b,sum(c),min(d)
>> from xyz
>> group by a,b;
>>
>> will create a second view:
>> select a,b,d
>> from xyz
>> group by a,b,d
>>
>> This has the side effect of making all single table views self
>> maintainable, even those with non-distributable aggregate functions,
>> at the cost of extra space and maintenance effort.
>>
>>
>> --
>> Justin Swanhart
>>
>
>

-- 
--
Justin Swanhart, Consultant, Percona Inc.
Bay Area, USA (PDT, GMT -7)
+1 888 401 3401 | Justin x503 | 24x7 Help x911

Our Services:
http://www.percona.com/services.html

Our Blog: http://www.mysqlperformanceblog.com/