[Mondrian] Eliminating unnecessary "group by" clauses in dimension queries

Eric McDermid mcdermid at stonecreek.com
Tue Apr 14 17:26:47 EDT 2009


Thanks, Julian.  I realize now I'd misunderstood what  
uniqueMembers="true" actually implies, and that even if all levels  
have that attribute set, it doesn't guarantee multiple rows won't  
exist with the same value in the associated column.

That said, I think I may be even more confused now.  The generated SQL  
in my first email does not include the primary key column as part of  
the GROUP BY.  In that case, why would eliminating the GROUP BY still  
be safe?

Parent/child closure tables aside, it seems like the only cases in  
which GROUP BY can be eliminated are those where one of the following  
is true:

  (a) One of the levels is defined to use the primary key as its column
  (b) We know our data set well enough to guarantee that no two rows  
will ever share the same combination of level column values.

As it turns out, my dataset fits the latter case (the data is derived  
from a hierarchy of persistent objects).  This seems to leave me with  
two options:

One is to force the issue by altering the schema, adding a new level  
to the bottom of the hierarchy that is associated with the primary key  
column. This seems a bit clumsy, since it introduces a level we'd ever  
want to use or see in the query results, but may be the right answer  
nonetheless.

The other is to find some other way to tell Mondrian that the  
underlying data set fits case (b).  The obvious place for this would  
be an additional (and optional) attribute to the hierarchy schema  
element, but I'm not sure what the tradeoffs are.

Either way, of course, I'll have to modify addLevelMemberSql() to  
avoid generating the GROUP BY.

Is there another approach I've missed?  This doesn't seem like it  
would be an unusual case.

  -- Eric

On Apr 14, 2009, at 12:14 AM, Julian Hyde wrote:

> The GROUP BY will have no effect if the rows going into it are already
> unique. That basically occurs if the table's primary key is among  
> the GROUP
> BY columns.
>
> It is not sufficient to that the LEVEL's key is among the GROUP BY  
> columns.
> Levels are not necessarily unique. (Consider, for example, the year  
> level in
> the time_by_day table.) Even the month level of the time hierarchy  
> is not
> unique, even though it is the lowest level of the hierarchy.
>
> However, I think it is safe to assume that the primary key of a  
> hierarchy is
> unique. Hopefully no one is playing games with non-unique dimension  
> tables -
> that would cause double accounting. The one case where double- 
> counting is
> necessary is in the hierarchy that is generated implicitly to map a
> parent-child closure table.
>
> The FoodMart schema has examples of all of these things, so it  
> should be
> straightforward to write some unit tests.
>
> See further comments inline.
>
> Julian
>
>
>
>
>> -----Original Message-----
>> From: mondrian-bounces at pentaho.org
>> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Eric McDermid
>> Sent: Monday, April 13, 2009 10:47 PM
>> To: Mondrian developer mailing list
>> Subject: [Mondrian] Eliminating unnecessary "group by"
>> clauses in dimensionqueries
>>
>> We're encountering a situation in which Mondrian generates an
>> apparently unnecessary GROUP BY clause when querying a dimension
>> table, resulting in unnecessarily slow performance on MySQL.   We're
>> actually seeing this on Mondrian 2.4, but per inspection the same
>> pattern appears to exist in the current codebase as well.
>>
>> Here's a (slightly simplified) example.  Assume the query is defined
>> as follows:
>>
>>   <Dimension foreignKey="customer_dim_id" name="CIs">
>> 	<Hierarchy hasAll="true" allMemberName="All CIs"
>> primaryKey="customer_dim_id">
>> 		<Table name="customer_dim"/>
>> 			<Level column="provider_id"
>> levelType="Regular" name="Provider"
>> type="Numeric" uniqueMembers="true"/>
>> 			<Level column="provider_payment_id"
>> levelType="Regular"
>> name="Payment" type="Numeric" uniqueMembers="true">
>> 					<Property name="Status"
>> column="provider_payment_status"/>
>> 					<Property name="Payment
>> Description"
>> column="provider_payment_desc" type="String"/>
>> 			</Level>
>> 			<Level column="provider_group_id"
>> levelType="Regular" name="Group"
>> type="Numeric" uniqueMembers="true">
>> 					<Property name="Group
>> Name" column="provider_group_name"
>> type="String"/>
>> 			</Level>
>>                		<Level column="customer_id"
>> levelType="Regular"
>> name="Creative" type="Numeric" uniqueMembers="true">
>> 					<Property name="Status"
>> column="customer_status" type="String"/>
>>                 	</Level>
>> 	</Hierarchy>
>>   </Dimension>
>>
>> The SQL eventually generated from an MDX query to select from
>> customer_dim looks like this:
>>
>>   select
>>     `customer_dim`.`provider_id ` as `c0`,
>>     `customer_dim`.`provider_payment_id ` as `c1`,
>>     `customer_dim`.`provider_payment_status ` as `c2`,
>>     `customer_dim`.`provider_payment_desc ` as `c3`,
>>     `customer_dim`.`provider_group_id ` as `c4`,
>>     `customer_dim`.`provider_group_name ` as `c5`,
>>     `customer_dim`.`customer_id ` as `c6`,
>>     `customer_dim`.`customer_status ` as `c7`
>>   from
>>     `customer_dim` as `customer_dim`
>>   group by
>>     `customer_dim`.`provider_id `,
>>     `customer_dim`.`provider_payment_id `,
>>     `customer_dim`.`provider_payment_status `,
>>     `customer_dim`.`provider_payment_desc `,
>>     `customer_dim`.`provider_group_id `,
>>     `customer_dim`.`provider_group_name `,
>>     `customer_dim`.`customer_id `,
>>     `customer_dim`.`customer_status `
>>   order by
>>     `customer_dim`.`provider_id ` ASC,
>>     `customer_dim`.`provider_payment_id ` ASC,
>>     `customer_dim`.`provider_group_id ` ASC,
>>     `customer_dim`.`customer_id ` ASC
>>
>> Since each item in the SQL select is a simple column reference, the
>> GROUP BY appears to be unnecessary.  Worse, including it can slow
>> performance of the query on MySQL by more than an order of magnitude.
>>
>> The relevant code is in SqlQuery.addSelectGroupBy() which is
>> called in
>> multiple circumstances by SqlTupleReader.addLevelMemberSql().  I'm
>> still in the process of understanding some test failures
>> related to my
>> experiments, as well as working out how this might apply to
>> RolapLevel.parentExp, but wanted to double-check a couple of
>> the more
>> basic items:
>>
>> 1.  Am I correct in thinking the GROUP BY can and should be
>> eliminated
>> in this case?
>
> Yes.
>
>>
>> 2.  When determining whether each element of the SELECT is a simple
>> column reference, is a check that each level's keyexp and properties
>> are instances of MondrianDef.Column sufficient?  Seems like
>> it should
>> be, but I don't know if there might be other cases I'm missing.
>
> No, as I noted above, level keys are not necessarily unique, even  
> the lowest
> level in the hierarchy. You can assume that a hierarchy's primary  
> key is
> unique, except if it is the implicit hierarchy for a parent-child
> hierarchy's closure table.
>
>>
>> 3.  Assuming the answer to the previous question is "yes", am I
>> correct in thinking that if we don't avoid generating a group by for
>> the key expressions, we should also avoid generating it for the
>> ordinal and caption expressions?
>
> N/A
>
> Julian
>
>
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>





More information about the Mondrian mailing list