[Mondrian] Eliminating unnecessary "group by" clauses in dimension queries

Tue Apr 14 01:46:35 EDT 2009

We're encountering a situation in which Mondrian generates an  
apparently unnecessary GROUP BY clause when querying a dimension  
table, resulting in unnecessarily slow performance on MySQL.   We're  
actually seeing this on Mondrian 2.4, but per inspection the same  
pattern appears to exist in the current codebase as well.

Here's a (slightly simplified) example.  Assume the query is defined  
as follows:

   <Dimension foreignKey="customer_dim_id" name="CIs">
	<Hierarchy hasAll="true" allMemberName="All CIs"  
primaryKey="customer_dim_id">
		<Table name="customer_dim"/>
			<Level column="provider_id" levelType="Regular" name="Provider"  
type="Numeric" uniqueMembers="true"/>
			<Level column="provider_payment_id" levelType="Regular"  
name="Payment" type="Numeric" uniqueMembers="true">
					<Property name="Status" column="provider_payment_status"/>
					<Property name="Payment Description"  
column="provider_payment_desc" type="String"/>
			</Level>
			<Level column="provider_group_id" levelType="Regular" name="Group"  
type="Numeric" uniqueMembers="true">
					<Property name="Group Name" column="provider_group_name"  
type="String"/>
			</Level>
                		<Level column="customer_id" levelType="Regular"  
name="Creative" type="Numeric" uniqueMembers="true">
					<Property name="Status" column="customer_status" type="String"/>
                 	</Level>
	</Hierarchy>
   </Dimension>

The SQL eventually generated from an MDX query to select from  
customer_dim looks like this:

   select
     `customer_dim`.`provider_id ` as `c0`,
     `customer_dim`.`provider_payment_id ` as `c1`,
     `customer_dim`.`provider_payment_status ` as `c2`,
     `customer_dim`.`provider_payment_desc ` as `c3`,
     `customer_dim`.`provider_group_id ` as `c4`,
     `customer_dim`.`provider_group_name ` as `c5`,
     `customer_dim`.`customer_id ` as `c6`,
     `customer_dim`.`customer_status ` as `c7`
   from
     `customer_dim` as `customer_dim`
   group by
     `customer_dim`.`provider_id `,
     `customer_dim`.`provider_payment_id `,
     `customer_dim`.`provider_payment_status `,
     `customer_dim`.`provider_payment_desc `,
     `customer_dim`.`provider_group_id `,
     `customer_dim`.`provider_group_name `,
     `customer_dim`.`customer_id `,
     `customer_dim`.`customer_status `
   order by
     `customer_dim`.`provider_id ` ASC,
     `customer_dim`.`provider_payment_id ` ASC,
     `customer_dim`.`provider_group_id ` ASC,
     `customer_dim`.`customer_id ` ASC

Since each item in the SQL select is a simple column reference, the  
GROUP BY appears to be unnecessary.  Worse, including it can slow  
performance of the query on MySQL by more than an order of magnitude.

The relevant code is in SqlQuery.addSelectGroupBy() which is called in  
multiple circumstances by SqlTupleReader.addLevelMemberSql().  I'm  
still in the process of understanding some test failures related to my  
experiments, as well as working out how this might apply to  
RolapLevel.parentExp, but wanted to double-check a couple of the more  
basic items:

1.  Am I correct in thinking the GROUP BY can and should be eliminated  
in this case?

2.  When determining whether each element of the SELECT is a simple  
column reference, is a check that each level's keyexp and properties  
are instances of MondrianDef.Column sufficient?  Seems like it should  
be, but I don't know if there might be other cases I'm missing.

3.  Assuming the answer to the previous question is "yes", am I  
correct in thinking that if we don't avoid generating a group by for  
the key expressions, we should also avoid generating it for the  
ordinal and caption expressions?

  -- Eric