[Mondrian] Again on the non-aggregable measure

Julian Hyde jhyde at pentaho.org
Sun Nov 25 08:47:41 EST 2007


I am really nervous that the feature you are proposing will be perfect for
you but not very useful to anyone else. I cannot afford to add yet another
mechanism to mondrian unless it fulfills at least one major feature.
Measures with custom aggregation paths would be one such feature; writeback
would be another.
 
Writeback tables would, I think, have to be able to contain cells at
different levels of aggregation. If a measure was 'writeback enabled',
mondrian would look for values in the writeback table before trying to read
from the fact table or aggregate tables.
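
A minimal sketch of that lookup order (hypothetical names throughout; this
is not Mondrian's actual cell-reading code, just the precedence described
above):

```python
# Hypothetical sketch of the lookup order described above: the writeback
# table wins, then aggregate tables, then the fact table. None of these
# names come from the Mondrian code base.

def read_cell(coords, writeback, agg_tables, fact_table):
    """Return the value for a cell, preferring written-back values."""
    if coords in writeback:          # a user-entered value wins
        return writeback[coords]
    for agg in agg_tables:           # pre-aggregated rollups next
        if coords in agg:
            return agg[coords]
    return fact_table.get(coords)    # finally, fall back to the fact table

# Example: a written-back budget overrides the aggregated value.
writeback = {("2007", "Budget"): 1000}
agg = [{("2007", "Budget"): 800, ("2007", "Sales"): 500}]
fact = {("2007", "Sales"): 450}
print(read_cell(("2007", "Budget"), writeback, agg, fact))  # -> 1000
print(read_cell(("2007", "Sales"), writeback, agg, fact))   # -> 500
```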
 
I have already described how I think we could support measures with custom
aggregators. The system designer could choose whether the measure would
exist in the fact table, or only be in aggregate tables with the required
column name.
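
In schema terms, that might look something like this (hypothetical sketch
only - aggregator="none" is the proposed mechanism, not a value the current
Mondrian schema accepts, and the table and column names are invented):

```xml
<!-- Hypothetical sketch: aggregator="none" is the proposal, not current
     Mondrian syntax; all names below are invented for illustration. -->
<Cube name="Sales">
  <Table name="sales_fact"/>
  <!-- An ordinary measure, aggregated from the fact table as usual. -->
  <Measure name="Volume" column="volume" aggregator="sum"/>
  <!-- A non-aggregable measure: it need not exist in the fact table;
       values are read only from aggregate (or writeback) tables that
       carry a column with the required name. -->
  <Measure name="Budget" column="budget" aggregator="none"/>
</Cube>
```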
 
I agree that it would be awkward to have ETL data mixed in with writeback
data. But remember that aggregate tables are a mechanism, not a process. You
can design aggregate tables which contain only writeback data, and you can
design other aggregate tables which are populated by an ETL process.
 
I think that the "aggregator='none'" mechanism I proposed would achieve the
effect that you want, but it would be a bit more work. If I help you
implement this feature, are you prepared to do it this way?
 
Julian


  _____  

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On
Behalf Of michael bienstein
Sent: Friday, November 23, 2007 5:05 PM
To: Mondrian developer mailing list
Subject: [Mondrian] Again on the non-aggregable measure


Julian,

First off, I can see where you are coming from. Our major difference here
is that I have a project where I have to make an existing application work
faster, and my idea was simply to adapt Mondrian to fit *quickly*.  You
- completely understandably - want a much more general feature.  If we can
work out how to deal with this difference, the rest will flow easily.

BTW, the reason I want Mondrian is not the cool MDX stuff, it's the cache.
The technology I currently use has to go to disk each time to read the
rows; Mondrian can keep them in memory.  The other reason is that Java
servlets allow one OS process to handle multiple requests, whereas the
current technology has one process per request, and each one hits the disk
every time!  As a result I have an upper limit of about 50 concurrent users
before the OS has trouble and the CPU is at 100%, not doing computation but
thrashing between the jobs and giving the file system time to work out how
to handle the load.  I need more users in parallel, so I want to load the
data into memory and run it off one memory image for 250 users in parallel.

So, on your ideas:
We have two requirements that are both valid but essentially different.
You want more control in the schema over how to aggregate measures that use
out-of-the-ordinary aggregations, and to use this to leverage modern SQL to
generate aggregate tables from the fact table.  That is, data in the
aggregate tables is still *dependent* on the fact table - the tables just
speed up performance.  I, on the other hand, want aggregate tables that are
not derived from the fact table.  It is essentially "write-back" data.  In
my example the users write this data back in a separate application, and
the data is rolled into the nightly ETL job.  I haven't talked about
write-back because I don't need to go through Mondrian to get it, but
that's essentially what it is.  This data is independent because there is
explicitly no rollup possible from the fact table to the aggregate table.
As you can see, these are very different requirements.

Now it is technically possible to create aggregate tables that contain some
measure columns calculated from the fact table alongside columns already
there from the ETL job.  It's ugly, though: you would have to modify the
table created by the ETL job so the pre-calculated data could be appended
to it.  Adding columns to tables created by the ETL makes me shiver.  It's
doable, but you risk too much.  I can see a benefit in disk space, in that
the level columns don't have to be duplicated in separate tables.  But disk
drives are cheap, and write-back data can't be huge because it is created
only by employees, not external systems, so optimizing for disk here seems
like wasted effort - especially since, the cheaper memory gets, the more
likely Mondrian is to just read the tables into memory once, at which point
the disk space won't count.

If you want to keep the schema metadata simple by putting non-aggregable
measures into the same cube as the normal measures, then we should allow
measures to define different aggregate tables at each level of aggregation.
E.g. "Volume" at Region*Month is in table "normalagg_region_month", while
"Budget" at Region*Month is in table "budget_region_month".  That's doable
in the XML without making it too terrible, but after that, is it worth it?
I'm not sure it would be worth it for you; I know it won't be for my
project.  I therefore (re-)propose just having a whole RolapStar for
pre-prepared write-back cell data.  I am not considering more interesting
write-back cases such as "Budget for January for Stationery is 1000, so the
implied budget for pencils on 4 January is 1000/31/#expected ratio of
pencil cost to total stationery".  These sorts of more complicated
implications, which roll down rather than up, are completely out of my scope.
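
For concreteness, the per-measure, per-level mapping above might be spelled
something like this (entirely hypothetical syntax - Mondrian's aggregate
table elements have nothing like a per-measure table attribute today, and
the element names are invented):

```xml
<!-- Hypothetical syntax: one aggregate table per measure per level.
     "Volume" is rolled up from the fact table as usual, while "Budget"
     comes from its own independently loaded table. -->
<AggLevelTables levels="[Region].[Region], [Time].[Month]">
  <MeasureTable measure="Volume" table="normalagg_region_month"/>
  <MeasureTable measure="Budget" table="budget_region_month"/>
</AggLevelTables>
```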

Hoping you agree to keep the two concepts separate,

Michael




