[Mondrian] Re-implementing native TopCount

Andrey Khayrutdinov andrey.khayrutdinov at gmail.com
Tue Sep 8 15:23:16 EDT 2015


Dear all,

I've picked up MONDRIAN-1272 <http://jira.pentaho.com/browse/MONDRIAN-1272>.
It is about shrinking tuples having no corresponding data in the fact table.
The approach to avoid such behaviour is to use OUTER JOIN.

The general routine is the following:

Let current SQL query be called SUBSELECT_Q. It is created by combining
different levels and then linking them with a constraint through the fact
table. Let's call these combined levels CROSSJOIN_Q.
So we can execute this query:

SELECT
FROM SUBSELECT_Q
RIGHT JOIN CROSSJOIN_Q


And this is the general schema.
Level should be combined not by linking via the fact table (that will
change nothing), but with CROSS JOIN operator.

As an example, here is an MDX query:

SELECT [Measures].[Unit Sales] on columns,
TopCount(Crossjoin([Product].[Product Category].Members,
[Customers].[City].Members), 10, [Measures].[Unit Sales]) on rows
FROM [Sales]
WHERE [Time].[1997].[Q3]

and here is the monstrous SQL query, produced via the schema above:

select
          crossj.`cj0` as `c0`,
          crossj.`cj1` as `c1`,
          crossj.`cj2` as `c2`,
          crossj.`cj3` as `c3`,
          crossj.`cj4` as `c4`,
          crossj.`cj5` as `c5`,
          subsel.`c6` as `c6`
from (
          select
                   `product_class`.`product_family` as `c0`,
                   `product_class`.`product_department` as `c1`,
                   `product_class`.`product_category` as `c2`,
                   `customer`.`country` as `c3`,
                   `customer`.`state_province` as `c4`,
                   `customer`.`city` as `c5`,
                   sum(`sales_fact_1997`.`unit_sales`) as `c6`
          from
                   `product` as `product`,
                   `product_class` as `product_class`,
                   `sales_fact_1997` as `sales_fact_1997`,
                   `customer` as `customer`,
                   `time_by_day` as `time_by_day`
          where
                   `product`.`product_class_id` =
`product_class`.`product_class_id` and
                   `sales_fact_1997`.`product_id` = `product`.`product_id`
and
                   `sales_fact_1997`.`customer_id` =
`customer`.`customer_id` and
                   `sales_fact_1997`.`time_id` = `time_by_day`.`time_id` and
                   `time_by_day`.`the_year` = 1997 and
                   `time_by_day`.`quarter` = 'Q3'
          group by
                   `product_class`.`product_family`,
                   `product_class`.`product_department`,
                   `product_class`.`product_category`,
                   `customer`.`country`,
                   `customer`.`state_province`,
                   `customer`.`city`
          order by
                   sum(`sales_fact_1997`.`unit_sales`) DESC,
                   ISNULL(`product_class`.`product_family`) ASC,
`product_class`.`product_family` ASC,
                   ISNULL(`product_class`.`product_department`) ASC,
`product_class`.`product_department` ASC,
                   ISNULL(`product_class`.`product_category`) ASC,
`product_class`.`product_category` ASC,
                   ISNULL(`customer`.`country`) ASC, `customer`.`country`
ASC,
                   ISNULL(`customer`.`state_province`) ASC,
`customer`.`state_province` ASC,
                   ISNULL(`customer`.`city`) ASC, `customer`.`city` ASC
) as `subsel`
right join (
          select
                   `t0`.`product_family` as `cj0` ,
                   `t0`.`product_department` as `cj1` ,
                   `t0`.`product_category` as `cj2` ,
                   `t1`.`country` as `cj3` ,
                   `t1`.`state_province` as `cj4` ,
                   `t1`.`city` as `cj5`
          from (
                   select
                             `product_class`.`product_family` as
`product_family`,
                             `product_class`.`product_department` as
`product_department`,
                             `product_class`.`product_category` as
`product_category`
                   from
                             `product` as `product`,
                             `product_class` as `product_class`
                   where
                             `product`.`product_class_id` =
`product_class`.`product_class_id`
                   group by
                             `product_class`.`product_family`,
                             `product_class`.`product_department`,
                             `product_class`.`product_category`
                   order by
                             ISNULL(`product_class`.`product_family`) ASC,
`product_class`.`product_family` ASC,
                             ISNULL(`product_class`.`product_department`)
ASC, `product_class`.`product_department` ASC,
                             ISNULL(`product_class`.`product_category`)
ASC, `product_class`.`product_category` ASC
          ) as t0
          cross join (
                   select
                             `customer`.`country` as `country`,
                             `customer`.`state_province` as
`state_province`,
                             `customer`.`city` as `city`
                   from
                             `customer` as `customer`
                   group by
                             `customer`.`country`,
                             `customer`.`state_province`,
                             `customer`.`city`
                   order by
                             ISNULL(`customer`.`country`) ASC,
`customer`.`country` ASC,
                             ISNULL(`customer`.`state_province`) ASC,
`customer`.`state_province` ASC,
                             ISNULL(`customer`.`city`) ASC,
`customer`.`city` ASC
          ) as t1
) as `crossj` on
          (`subsel`.`c0` = `crossj`.`cj0`) and
          (`subsel`.`c1` = `crossj`.`cj1`) and
          (`subsel`.`c2` = `crossj`.`cj2`) and
          (`subsel`.`c3` = `crossj`.`cj3`) and
          (`subsel`.`c4` = `crossj`.`cj4`) and
          (`subsel`.`c5` = `crossj`.`cj5`)
order by
          subsel.`c6` DESC,
          ISNULL(crossj.`cj0`) ASC, crossj.`cj0` ASC,
          ISNULL(crossj.`cj1`) ASC, crossj.`cj1` ASC,
          ISNULL(crossj.`cj2`) ASC, crossj.`cj2` ASC,
          ISNULL(crossj.`cj3`) ASC, crossj.`cj3` ASC,
          ISNULL(crossj.`cj4`) ASC, crossj.`cj4` ASC,
          ISNULL(crossj.`cj5`) ASC, crossj.`cj5` ASC

Unfortunately, Mondrian does not have many tools for manipulating with SQL,
thus I made my own.
I have completed the POC, it passes almost all prepared tests and before
starting polishing it I asked Matt C if he had any objection against the
approach. He told me to ask the community, and here I am with this thread.

My current work was pushed here:
https://github.com/akhayrutdinov/mondrian/tree/NativeTopCount.
It consists of two commits:

   1. Tests for TopCount:
   https://github.com/akhayrutdinov/mondrian/commit/3963fb37a7da8284bb5ed1ddb79825c6bf7018cf
   2. My changes:
   https://github.com/akhayrutdinov/mondrian/commit/2436bfbd6cca007f0f3fd866b8e61e7f4c2adb5c

I want to ask your opinion whether the approach is acceptable not to spend
time for improving the code that will be rejected.

Best regards,
Andrey Khayrutdinov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20150908/bd1e91b8/attachment-0001.html 


More information about the Mondrian mailing list