[Mondrian] High Cardinality for Mondrian

Mon Feb 11 09:56:54 EST 2008

I have not looked at this code, but a similar approach was taken
in the RolapAxis code. When the Position Lists were really
Lists wrapping an iterable, calling methods like 'size' or 'get'
would cause the list to become 'materialized', meaning the
underlying implementation would be converted from an iterable
to a true list. This, of course, was expensive in both memory and
time, so users were encouraged not to call 'size'. I modified
JPivot so that it would not call 'size'.
I think some of the XMLA code still calls 'size'.

I found it useful to have debugging in the 'materialize'
methods so that I could find places in the code base where
methods such as 'size' and 'get' were being called and
then modify that code, if possible, so that it was
purely iterative.
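
As a rough illustration of the idea (a sketch only, not the actual
RolapAxis code; the class names MaterializingList and
MaterializingListDemo and the materializeCount field are made up for
this example), a list that wraps an iterable and reports who forced
it to materialize might look something like this:

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a list backed by an iterable; not Mondrian code. */
class MaterializingList<T> extends AbstractList<T> {
    private final Iterable<T> source;
    private List<T> materialized;   // null until a random-access call forces a full read
    int materializeCount = 0;       // assumption: a counter kept only for debugging

    MaterializingList(Iterable<T> source) {
        this.source = source;
    }

    private List<T> materialize(String caller) {
        if (materialized == null) {
            materializeCount++;
            // Debugging hook: the stack trace shows which code path forced the full read.
            new Throwable("list materialized by " + caller + "()").printStackTrace();
            List<T> all = new ArrayList<T>();
            for (T item : source) {
                all.add(item);
            }
            materialized = all;
        }
        return materialized;
    }

    // Cheap path: plain iteration never copies the underlying data.
    @Override public Iterator<T> iterator() {
        return materialized != null ? materialized.iterator() : source.iterator();
    }

    // Expensive paths: both need the whole collection in memory.
    @Override public int size() {
        return materialize("size").size();
    }

    @Override public T get(int index) {
        return materialize("get").get(index);
    }
}

class MaterializingListDemo {
    public static void main(String[] args) {
        MaterializingList<String> positions =
            new MaterializingList<String>(Arrays.asList("a", "b", "c"));
        for (String p : positions) {
            System.out.println(p);              // purely iterative, no materialization
        }
        System.out.println(positions.size());   // forces materialization and logs a trace
    }
}
```

With a hook like this, every stack trace points at a caller that could
perhaps be rewritten to iterate instead.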

I must admit though, what I did also had the flaw that
it was hidden under an API. A Position is simply a
List<Member> and I did not document that certain operations
could have large costs.

Richard

Julian Hyde wrote:
> Luis,
> 
> Thanks for contributing these changes (for a second time!). I will
> incorporate them tomorrow, and if all goes well, they will be in
> mondrian-3.0.
> 
> The approach you have used - lists backed by iterators - is very clever, but
> I have some philosophical issues with it because it misleads the programmer.
> A call to '.size()', for example, looks simple and people would imagine that
> it is cheap, but as you point out it may cause a large collection to be
> fetched into memory.
> 
> I guess you could call it the principle of Honest APIs. A few years ago I
> read a similar critique of remote procedure calls, arguing that RPCs are
> inherently expensive and unreliable, and that it was dishonest to wrap them
> in an API that makes them look like regular procedure calls.
> 
> So, my instinct would have been to tackle large dimensions using an explicit
> iterator API, so users are aware of the cost of what they are doing.
> 
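
To make the contrast concrete, here is a minimal sketch of what an
explicit iterator API could look like. This is only an assumption
about the shape of such an API; MemberStream, IterableMemberStream
and fetchAllIntoMemory are hypothetical names, not anything in
mondrian. Cheap iteration is the default, and the expensive operation
has a name that advertises its cost:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical interface: iteration is cheap, anything else is spelled out. */
interface MemberStream<T> extends Iterable<T> {
    /** Reads every element into memory; the method name makes the cost visible. */
    List<T> fetchAllIntoMemory();
}

/** Hypothetical implementation over any Iterable source. */
class IterableMemberStream<T> implements MemberStream<T> {
    private final Iterable<T> source;

    IterableMemberStream(Iterable<T> source) {
        this.source = source;
    }

    @Override public Iterator<T> iterator() {
        return source.iterator();
    }

    @Override public List<T> fetchAllIntoMemory() {
        List<T> all = new ArrayList<T>();
        for (T item : source) {
            all.add(item);
        }
        return all;
    }
}

class HonestApiDemo {
    public static void main(String[] args) {
        MemberStream<String> members =
            new IterableMemberStream<String>(Arrays.asList("Drink", "Food"));
        for (String m : members) {
            System.out.println(m);   // no size() or get() is available to call by accident
        }
        // The caller has to ask for the full copy explicitly.
        System.out.println(members.fetchAllIntoMemory().size());
    }
}
```

Because the type is not a List, there is no innocent-looking '.size()'
to call; anyone who needs a count has to opt in to the full fetch.
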
> That said, your approach works now, my preferred approach would require a
> major rework of the code base, and I am a pragmatist. If the high
> cardinality support results in some performance 'gotchas', we can try to
> devise incremental ways to make the whole mondrian API more predictable.
> 
> Julian
> 
>> -----Original Message-----
>> From: mondrian-bounces at pentaho.org 
>> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Luis F. Canals
>> Sent: Friday, February 08, 2008 10:09 AM
>> To: mondrian at pentaho.org
>> Cc: Javier Giménez Aznar; jorge López Mateo
>> Subject: [Mondrian] High Cardinality for Mondrian
>>
>> Dear Julian Hyde,
>>
>> after the hard task of reworking all the changes made for version
>> 2.4.2.9831 of Mondrian (the capability to manage high cardinality
>> dimensions) against the head version in Perforce, we can send
>> you the list of differences to be applied as a patch to the
>> mondrian version in Perforce.
>>
>> Since we have no access to commit changes to Perforce, we would be very
>> happy if you applied these changes and told us about any problems
>> that prevent the patch from being applied.
>>
>> All tests pass (using MySQL as the database, Windows and Linux as
>> operating systems, and Java 5 and 6).
>>
>> Some properties have been added to "mondrian.properties" to control
>> high cardinality and multithreading behaviour for queries:
>>     mondrian.result.highCardChunkSize indicates the number of elements
>> fetched at a time when a dimension is marked as "highCardinality"
>>     mondrian.rolap.MaximumParallelThreads indicates the maximum number
>> of threads used to perform a query (since independent queries are
>> now parallelized)
>>
>> In FoodMart.xml, we have made another change to identify "Promotions"
>> on cube "Sales Ragged" as high cardinality, to test the system in
>> this case.
>>
>> There are some other points that should be taken into account now that
>> Mondrian is going to be able to manage unlimited dimensions:
>>     - avoid calling ".size()" on the list of elements of a
>> potentially high cardinality dimension;
>>     - avoid copying elements by iterating over the complete list of a
>> potentially high cardinality dimension
>>         (for example, things like
>>             "for(Member m:list) {
>>                 ...
>>                 anotherList.add(m);
>>                 ...
>>             }")
>>     - instead, use the FilteredIterableList idea
>>     - don't try to get an earlier element after you have already got
>> a later one (i.e., doing "list.get(x)" after "list.get(y)" with y>>>x)
>> on a list of elements of a potentially high cardinality dimension
>>     - some functions need all the elements in memory (for example
>> "order by"); these functions will not run with high cardinality
>> dimensions, and an exception will be thrown
>>     - if you don't need a high cardinality dimension, simply don't set
>> the attribute "highCardinality" to true in the schema (FoodMart.xml)
>>
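
For illustration, the "filter lazily instead of copying" advice in the
list above could be sketched as follows. This is not the
FilteredIterableList class from the patch, which I have not looked at;
LazyFilter and LazyFilterDemo are hypothetical stand-ins, and the
member names are just sample strings.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical lazy view: filters while iterating, never copies or calls size(). */
class LazyFilter<T> implements Iterable<T> {
    private final Iterable<T> source;
    private final Predicate<T> keep;

    LazyFilter(Iterable<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
    }

    @Override public Iterator<T> iterator() {
        final Iterator<T> it = source.iterator();
        return new Iterator<T>() {
            private T pending;       // next matching element, if one has been found
            private boolean ready;   // true while 'pending' has not been returned yet

            @Override public boolean hasNext() {
                // Advance the source only as far as the next match.
                while (!ready && it.hasNext()) {
                    T candidate = it.next();
                    if (keep.test(candidate)) {
                        pending = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            @Override public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                ready = false;
                T result = pending;
                pending = null;
                return result;
            }
        };
    }
}

class LazyFilterDemo {
    public static void main(String[] args) {
        Iterable<String> members = Arrays.asList("Promo A", "No Promotion", "Promo B");
        // Only one element is held at a time; no second list is built.
        for (String m : new LazyFilter<String>(members, s -> s.startsWith("Promo"))) {
            System.out.println(m);
        }
    }
}
```

The view holds a single pending element, so memory use stays flat
however many members the underlying dimension has.
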
>> That's all!
>>
>> Since we think this is quite a powerful improvement (very useful for
>> our clients), we would like these changes to be included in the next
>> release of Mondrian. Would that be possible?
>>
>> Best regards.
>>
>> - Jorge/Javier/Luis
>>
> 

-- 
Quis custodiet ipsos custodes: