[Mondrian] High Cardinality for Mondrian

Thu Feb 14 01:48:48 EST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

since code is not clear about the use of Member[] as a "little tuples"
and Member[] as "big list of elements" is not possible (at this
moment) the use of two differents.

If Julian wants this distinction, it will be done, in future. But as
we belive in things like early integration, and we have to start with
a first step, these issues (using Member[] in specific cases instead
of a List -I remember that a List has an array inside, and an array is
an object too, and the use of List avoids memory overhead with
operations like System.arraycopy: take a look at the new class
TraversalList or FilteredList, for instance and how them avoid copying
arrays-) would it be added in future: it's not a trauma.

But using the collaborative model of FLOSS, not the "integration inferno".

API issue. Mondrian API should be changed to manage with
Lists/Iterators instead of static arrays. It's not optional. The only
way to get backward compatibility is to move "Mondrian-with-Lists" to
another different package and convert current mondrian classes into
proxies to "Mondrian-with-Lists". I will waste my time creating these
proxies to provide a smooth change from different Mondrian versions.

Performance. Absolutly true: in my little laptop, version 2.4.2.9831
with arrays takes 379 seconds to complete the tests, and 352 seconds
with List modifications over 2.4.2.9831. Same set of tests (copying
Main.java from 2.4.2.9831 to head version) applied over head version,
takes 205 seconds to complete the first 1525 tests and 215 with
version with the patch applied.
Maybe in some tests, that we have had to modify to be able to run with
parallel queries -and wait for the completion of the set of SQL
queries involved in the execution of a single MDX query-, and high
cardinality -retrying the query twice to get first elements twice-
time is "aparently" more than in original version... and, of course,
there will be places in which we have included "performance mistakes"
Of course! We aren't the creators of Mondrian and there are many many
places that we don't understand exactly! Maybe some operations should
be reviewed be other eyes.
In the other hand, we have detected a degradation dealing with high
cardinality dimensions, not present in previous version, and we will
work to make it as fast as the version without Lists/Iterators. More
homework!

Best regards!

- - luis

Julian Hyde wrote:
> There are two distinct collections here. Consider a method that
> returns a list of tuples.  It could return Member[][], or
> List<Member[]>, or List<List<Member>>, or even List<Member>[].
>
> When we are talking about collections of members or tuples, I
> totally agree that we should be returning a list (or in certain
> cases an iterator). These collections are often large (thousands of
> members) and dynamic. Luis and I were in agreement about that.
>
> However, the other collection is to represent a tuple. Tuples
> typically have low cardinality, and the cardinality is fixed: you
> almost never want to increase a tuple from cardinality 2 to
> cardinality 3. There are a lot of instances of tuples, so it is
> important that they can be created cheaply and use the minimum of
> memory.
>
> My use case is a crossjoin that returns 10,000 tuples each with 2
> members. If we represent each tuple as a list, that's at least one
> extra object per tuple, if we use the cheapest list available (the
> one created by Arrays.asList()). The memory and CPU overhead is
> unacceptable.
>
> That's what I want to represent a tuple as a Member[] rather than
> List<Member>, and I was clear with Luis that's what I wanted.
>
> Julian
>
>> -----Original Message----- From: Richard Emberson
>> [mailto:remberson at edgedynamics.com] Sent: Wednesday, February 13,
>> 2008 6:59 AM To: jhyde at pentaho.org; Mondrian developer mailing
>> list Cc: luis.canals at stratebi.com; 'Javier Giménez Aznar'; 'jorge
>>  López Mateo' Subject: Re: [Mondrian] High Cardinality for
>> Mondrian
>>
>> Julian,
>>
>> As long as I worked on Mondrian I always wondered why some of the
>>  API returned arrays rather than lists (Member[] rather than
>> List<? extends Member>). I would submit that arrays are less
>> flexible, more expensive to create and no faster to use.
>>
>> What are the advantages of returning arrays?
>>
>> I granted that there are a number of clients which would have to
>> migrate to any new API and backward compatibility could the the
>> sole reason.
>>
>> Thanks.
>>
>> Richard
>>
>>
>> Julian Hyde wrote:
>>> Luis,
>>>
>>> I have spent the past day or so integrating your changes
>> with the latest
>>> mondrian source code. I am not ready to check in yet, for
>> two reasons.
>>> First, there is a serious performance degradation. For
>> instance, the test
>>> suite run against oracle on jdk 1.6 takes 260 seconds
>> before the change, 540
>>> seconds after it.
>>>
>>> Second, the code is not in a state to check in. I sent you
>> an email in
>>> January describing the changes I needed you to make to the
>> code to make it
>>> ready for submission, and you have done very few of them.
>> In particular, you
>>> have changed the format of tuples from Member[] to List<?
>> extends Member>,
>>> something I explicitly requested you did not do.
>>>
>>> I know that this is a major change and you and your
>> colleagues have spent a
>>> lot of time working on it, but it is a waste of my time if
>> I give you a
>>> detailed list of issues and you ignore it.
>>>
>>> I would like this change to make it into the mondrian code
>> base, so I intend
>>> to carry on working on it to see if I can resolve the
>> performance and other
>>> issues so that I can check it in. However, mondrian 3.0 is
>> due very soon and
>>> I don't think this change will make it in before then.
>>>
>>> Julian
>>>
>>>
>>> ----
>>>
>>> My email from 2008/1/15, with my comments on what is left
>> to do in the code
>>> drop of 2008/2/8.
>>>
>>>> Thanks for the contribution. I'll accept the contribution and
>>>>  put it into the release, but there are quite a few issues
>>>> I'd like to resolve before I do that.
>>>>
>>>> * Please make a pass through your changes again and remove
>>>> any garbage code (e.g. code for debugging) or files you have
>>>>  edited but don't intend to check in (e.g.
>>>> mondrian.properties)
>>> NOT DONE - mondrian.properties is included in highcard.diff
>>>
>>>> * The code has moved on a lot since 2.4.2. Can you send me
>>>> the diffs with respect to the current perforce head? That
>>>> would help me a lot. To do this, sync your perforce client to
>>>>  2.4.2 ('p4 sync ... at 9831'), copy in your files, 'p4 edit
>>>> <filename>' for each file you have modified, then get the
>>>> latest ('p4 sync ...'), and resolve the changes ('p4
>>>> resolve').
>>> You sent a patch - I appreciate that. But it wasn't based
>> on perforce head.
>>> After about an hour of experiments, I discovered that it
>> was based on
>>> perforce change 10467 (Jan 24th) or thereabouts. You should
>> have mentioned
>>> this.
>>>
>>>> * Does the code pass the regression suite? And on what
>>>> DBMS/JDK/Operating system? Have you tried running 'megatest'
>>>>  (which sets various properties). The code needs to pass all
>>>>  tests before I can release it.
>>> You say that it passes tests on Oracle and MySQL. I believe
>> you. I am seeing
>>> 2 failures and 3 errors, but maybe this is due to a merge
>> mistake on my
>>> part.
>>>
>>>> * As I said before, I want to keep the tuple representation
>>>> as a member array, not a member list. In particular,
>>>> TupleCalc.evaluateTuple(Evaluator) should return Member[] not
>>>>  List<Member>.
>>> NOT DONE. This is major change, because it negatively
>> impacts memory and
>>> performance, and does not even contribute to your goal of
>>> making high-cardinality dimensions work better. I asked you
>>> twice
>> to do this and
>>> you ignored me. Did you think I wasn't going to notice?!
>>>
>>>> * In AbstractXxxCalc, there are a few toString() methods on a
>>>>  single line - make multiple lines or eliminate them.
>>> NOT DONE
>>>
>>>> * Likewise hashCode() in ConcatenableList
>>> NOT DONE
>>>
>>>> * ConcatenableList needs javadoc and a unit test
>>> NOT DONE
>>>
>>>> * In CmdRunner.java, remove ex.printStackTrace lines. I don't
>>>>  think you intended to check these in.
>>> NOT DONE
>>>
>>>> * Dimension.isHighCardinality() method needs javadoc
>>> Done - thank you.
>>>
>>>> * MondrianProperties.java - property descriptions have
>>>> spelling mistakes, should start with "Integer property which
>>>> ...".
>>> NOT DONE
>>>
>>>> * Document properties in configuration.html and
>>>> mondrian.properties. I need a MUCH better description of the
>>>>  MaxParallelThreads property: something which would be
>>>> understood by a DBA, not just a mondrian developer. I had to
>>>>  delve into the code to find out that this related to loading
>>>>  aggregations using GROUPING SETS.
>>> NOT DONE
>>>
>>>> * There are some files included which I guess you don't want
>>>>  to commit, e.g. mondrian.properties,
>>>> demo/foodmart.properties. Please remove these from the patch
>>>> file.
>>>>
>>>> * When breaking long lines, please use continuation indent 4
>>>> not 8.
>>> NOT DONE
>>>
>>>> * Make sure lines don't have trailing spaces. These may cause
>>>>  spurious diffs later.
>>> NOT DONE - I counted 191 new or modified lines with trailing
>>> spaces.
>>>
>>>> * Uses of UnsupportedList in HeadTailFunDef: could you use
>>>> AbstractList instead? I think you implement all of the
>>>> required methods.
>>> NOT DONE
>>>
>>>> * Aggregation.java and other places: code formatting; control
>>>>  flow constructs like 'if' and 'while' always use braces and
>>>>  span multiple lines.
>>> NOT DONE
>>>
>>>> * Aggregation.java: you've made the load method asynchronous.
>>>>  I'm not convinced that this is safe. In particular, how does
>>>>  the caller know when the load has completed? Need to update
>>>>  the javadoc of that method with these important details.
>>> NOT DONE
>>>
>>>> * If we're using Threads, let's get them from a thread pool.
>>>>  Can we use the JDK 1.5 concurrency features like Future and
>>>>  ThreadPoolExecutor? That will give us more control over how
>>>>  the threads are created, and save the effort of creating new
>>>>  threads; all important in a container. In JDK 1.4 it's OK if
>>>>  the code works single-threaded.
>>> NOT DONE
>>>
>>>> * CellRequest.java contains garbage code - please remove.
>>> Done - thanks.
>>>
>>>> * CacheMapTest.java - should end with '// End
>>>> CacheMapTest.java'
>>> NOT DONE
>>>
>>> _______________________________________________ Mondrian
>>> mailing list Mondrian at pentaho.org
>>> http://lists.pentaho.org/mailman/listinfo/mondrian
>>>
>>
>> -- Quis custodiet ipsos custodes: This email message is for the
>> sole use of the intended recipient(s) and may contain
>> confidential information.  Any unauthorized review, use,
>> disclosure or distribution is prohibited.  If you are not the
>> intended recipient, please contact the sender by reply email and
>> destroy all copies of the original message.
>>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHs+RP6XofO2yaIbARAhbsAJwKAFGHFqvv1wAZQGDcbqqonq9tfACgjtHT
CuDZID4KoASTpJ7Bedhn7mo=
=Ouzj
-----END PGP SIGNATURE-----