[Mondrian] Eliminating unnecessary "group by" clauses in dimensionqueries

Julian Hyde jhyde at pentaho.com
Tue Apr 14 03:14:58 EDT 2009


The GROUP BY will have no effect if the rows going into it are already
unique. That basically occurs if the table's primary key is among the GROUP
BY columns.

It is not sufficient to that the LEVEL's key is among the GROUP BY columns.
Levels are not necessarily unique. (Consider, for example, the year level in
the time_by_day table.) Even the month level of the time hierarchy is not
unique, even though it is the lowest level of the hierarchy.

However, I think it is safe to assume that the primary key of a hierarchy is
unique. Hopefully no one is playing games with non-unique dimension tables -
that would cause double accounting. The one case where double-counting is
necessary is in the hierarchy that is generated implicitly to map a
parent-child closure table.

The FoodMart schema has examples of all of these things, so it should be
straightforward to write some unit tests.

See further comments inline.

Julian




> -----Original Message-----
> From: mondrian-bounces at pentaho.org 
> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Eric McDermid
> Sent: Monday, April 13, 2009 10:47 PM
> To: Mondrian developer mailing list
> Subject: [Mondrian] Eliminating unnecessary "group by" 
> clauses in dimensionqueries
> 
> We're encountering a situation in which Mondrian generates an  
> apparently unnecessary GROUP BY clause when querying a dimension  
> table, resulting in unnecessarily slow performance on MySQL.   We're  
> actually seeing this on Mondrian 2.4, but per inspection the same  
> pattern appears to exist in the current codebase as well.
> 
> Here's a (slightly simplified) example.  Assume the query is defined  
> as follows:
> 
>    <Dimension foreignKey="customer_dim_id" name="CIs">
> 	<Hierarchy hasAll="true" allMemberName="All CIs"  
> primaryKey="customer_dim_id">
> 		<Table name="customer_dim"/>
> 			<Level column="provider_id" 
> levelType="Regular" name="Provider"  
> type="Numeric" uniqueMembers="true"/>
> 			<Level column="provider_payment_id" 
> levelType="Regular"  
> name="Payment" type="Numeric" uniqueMembers="true">
> 					<Property name="Status" 
> column="provider_payment_status"/>
> 					<Property name="Payment 
> Description"  
> column="provider_payment_desc" type="String"/>
> 			</Level>
> 			<Level column="provider_group_id" 
> levelType="Regular" name="Group"  
> type="Numeric" uniqueMembers="true">
> 					<Property name="Group 
> Name" column="provider_group_name"  
> type="String"/>
> 			</Level>
>                 		<Level column="customer_id" 
> levelType="Regular"  
> name="Creative" type="Numeric" uniqueMembers="true">
> 					<Property name="Status" 
> column="customer_status" type="String"/>
>                  	</Level>
> 	</Hierarchy>
>    </Dimension>
> 
> The SQL eventually generated from an MDX query to select from  
> customer_dim looks like this:
> 
>    select
>      `customer_dim`.`provider_id ` as `c0`,
>      `customer_dim`.`provider_payment_id ` as `c1`,
>      `customer_dim`.`provider_payment_status ` as `c2`,
>      `customer_dim`.`provider_payment_desc ` as `c3`,
>      `customer_dim`.`provider_group_id ` as `c4`,
>      `customer_dim`.`provider_group_name ` as `c5`,
>      `customer_dim`.`customer_id ` as `c6`,
>      `customer_dim`.`customer_status ` as `c7`
>    from
>      `customer_dim` as `customer_dim`
>    group by
>      `customer_dim`.`provider_id `,
>      `customer_dim`.`provider_payment_id `,
>      `customer_dim`.`provider_payment_status `,
>      `customer_dim`.`provider_payment_desc `,
>      `customer_dim`.`provider_group_id `,
>      `customer_dim`.`provider_group_name `,
>      `customer_dim`.`customer_id `,
>      `customer_dim`.`customer_status `
>    order by
>      `customer_dim`.`provider_id ` ASC,
>      `customer_dim`.`provider_payment_id ` ASC,
>      `customer_dim`.`provider_group_id ` ASC,
>      `customer_dim`.`customer_id ` ASC
> 
> Since each item in the SQL select is a simple column reference, the  
> GROUP BY appears to be unnecessary.  Worse, including it can slow  
> performance of the query on MySQL by more than an order of magnitude.
> 
> The relevant code is in SqlQuery.addSelectGroupBy() which is 
> called in  
> multiple circumstances by SqlTupleReader.addLevelMemberSql().  I'm  
> still in the process of understanding some test failures 
> related to my  
> experiments, as well as working out how this might apply to  
> RolapLevel.parentExp, but wanted to double-check a couple of 
> the more  
> basic items:
> 
> 1.  Am I correct in thinking the GROUP BY can and should be 
> eliminated  
> in this case?

Yes.

> 
> 2.  When determining whether each element of the SELECT is a simple  
> column reference, is a check that each level's keyexp and properties  
> are instances of MondrianDef.Column sufficient?  Seems like 
> it should  
> be, but I don't know if there might be other cases I'm missing.

No, as I noted above, level keys are not necessarily unique, even the lowest
level in the hierarchy. You can assume that a hierarchy's primary key is
unique, except if it is the implicit hierarchy for a parent-child
hierarchy's closure table.

> 
> 3.  Assuming the answer to the previous question is "yes", am I  
> correct in thinking that if we don't avoid generating a group by for  
> the key expressions, we should also avoid generating it for the  
> ordinal and caption expressions?

N/A

Julian





More information about the Mondrian mailing list