[Mondrian] duplicate foreign keys in foodmart

Julian Hyde jhyde at pentaho.com
Sun Sep 27 22:57:56 EDT 2009


There are two questions here. 1. Is it bad practice to define a fact table
where rows are not uniquely identified by a combination of foreign keys. 2.
Will Mondrian do the right thing?

Regarding question 1. I don't recall exactly what Dr Kimball had to say on
the matter ofr duplicate foreign keys. (As it happens I had a meeting with
him on Thursday, but that particular topic didn't come up!) The way I see
it, a fact table represents a business process, and while it's not common to
get two identical instances of the business process (transactions), it's not
out of the question. It is reasonable for the same customer to buy the same
product, at the same supermarket, on the same day, using the same promotion
code. (Who hasn't hosted a party where they had to go back to the
supermarket for another six-pack of beer or another pack of ice?)

Unlike fact tables, aggregate tables are unique on their foreign key values.
Mondrian assumes this; note in particular that it does not use a GROUP BY
clause if the set of cells being queried have an aggregate table of exactly
the same dimensionality.

Some people like to structure their fact table so that it is in effect an
aggregate table at the lowest possible granularity. I'm fine with that, and
so is Mondrian.

With such a fact table, if you have a COUNT measure, what would you expect
it to return for my two beer purchases on the same day? I would expect it to
return 2, because there are two transactions. To achieve this, I would
create a integer column called FACT_COUNT, usually 1, but 2 for the record
representing my beer purchases, and my COUNT measure would be the sum of the
values in that column.

Regarding question 2. Yes, Mondrian will do the right thing. Mondrian never
assumes that the fact table has a primary key, so if you have two identical
rows each row will be counted once and you will get the desired effect. It
will never create a cartesian product that would give you 4 or 8 copies of
the row.

Mondrian does assume the primary and foreign key property for dimension
tables; that is, for each foreign key value in the fact table, Mondrian
assumes that there is exactly one row in the dimension table with that
value. If not, you will get under- and over-counts in some circumstances.
Those circumstances have been discussed elsewhere. Mondrian's behavior in
those circumstances as you would expect, if you know how Mondrian works, and
can even be exploited. For example, parent-child closure tables sum the
salaries of all employees who report to Frank Black by creating multiple
records in the dimension table for Frank Black.

Julian

> -----Original Message-----
> From: mondrian-bounces at pentaho.org 
> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Etienne Dube
> Sent: Friday, September 25, 2009 7:34 AM
> To: mondrian at pentaho.org
> Subject: [Mondrian] duplicate foreign keys in foodmart
> 
> Hi Mondrian developers,
> 
> I posted this a few days ago in the Pentaho forums but I 
> haven't got an 
> answer yet. Maybe one of you could enlighten me on that.
> 
> ------------------------------
> While toying around with the foodmart DB, I noticed that some of the 
> facts in the sales_fact_1997 and sales_fact_1998 tables have 
> duplicate 
> foreign keys. For example, if I run the following SQL query 
> to retrieve 
> rows for which there is more than one occurrence of a combination of 
> values for the foreign keys:
> 
> select a.* from sales_fact_1997 as a,
> (select product_id, time_id, customer_id, promotion_id, 
> store_id, count(*)
> from sales_fact_1997
> group by product_id, time_id, customer_id, promotion_id, store_id
> having count(*) > 1) as b
> where a.product_id = b.product_id and a.time_id = b.time_id and 
> a.customer_id = b.customer_id
> and a.promotion_id = b.promotion_id and a.store_id = b.store_id;
> 
> this yields these rows (sorry for the ugly formatting):
> 
> product_id    time_id    customer_id    promotion_id    store_id   
>  store_sales    store_cost    unit_sales
> ----------    -------    -----------    ------------    --------   
>  -----------    ----------    ----------
> 290            488        1690        0                16           
>  5.1000        2.2440        3.0000
> 290            488        1690        0                16           
>  5.1000        1.8870        3.0000
> 485            692        1850        1235            16           
>  6.5700        2.3652        3.0000
> 485            692        1850        1235            16           
>  8.7600        4.2924        4.0000
> 800            725        2244        0                17           
>  9.7800        4.1076        3.0000
> 800            725        2244        0                17           
>  6.5200        2.3472        2.0000
> 1262        381        3386        1845            24           
>  2.2800        1.0716        3.0000
> 1262        381        3386        1845            24           
>  3.0400        1.2160        4.0000
> 1539        482        3491        1220            16           
>  14.3200        6.7304        4.0000
> 1539        482        3491        1220            16           
>  10.7400        4.4034        3.0000
> 1381        655        4903        0                3           
>  6.5100        2.5389        3.0000
> 1381        655        4903        0                3           
>  4.3400        2.0832        2.0000
> 42            630        5026        0                17           
>  11.2800        3.7224        4.0000
> 42            630        5026        0                17           
>  5.6400        2.3124        2.0000
> 883            466        9535        824                11           
>  8.7000        3.1320        3.0000
> 883            466        9535        824                11           
>  14.5000        4.4950        5.0000
> 
> Notice the rows with duplicate foreign keys values, but 
> having different 
> values in the measures columns. I thought that usually (according to 
> Kimball's Data Warehouse Toolkit), foreign keys in a fact 
> table should 
> form a composite primary key. Of course in the case of that 
> table, this 
> doesn't hold true and creation of a primary key fails due to the 
> duplicate values.
> 
> Is that a normal and sane design choice, or is there 
> something wrong or 
> unusual with this fact table? Will Mondrian do the right thing, i.e. 
> aggregate the measure values, when there are duplicate facts 
> (w/ same keys)?
> 
> Thanks,
> 
> Etienne
> 
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian
> 




More information about the Mondrian mailing list