[Mondrian] duplicate foreign keys in foodmart
Julian Hyde
jhyde at pentaho.com
Sun Sep 27 22:57:56 EDT 2009
There are two questions here. 1. Is it bad practice to define a fact table
where rows are not uniquely identified by a combination of foreign keys. 2.
Will Mondrian do the right thing?
Regarding question 1. I don't recall exactly what Dr Kimball had to say on
the matter ofr duplicate foreign keys. (As it happens I had a meeting with
him on Thursday, but that particular topic didn't come up!) The way I see
it, a fact table represents a business process, and while it's not common to
get two identical instances of the business process (transactions), it's not
out of the question. It is reasonable for the same customer to buy the same
product, at the same supermarket, on the same day, using the same promotion
code. (Who hasn't hosted a party where they had to go back to the
supermarket for another six-pack of beer or another pack of ice?)
Unlike fact tables, aggregate tables are unique on their foreign key values.
Mondrian assumes this; note in particular that it does not use a GROUP BY
clause if the set of cells being queried have an aggregate table of exactly
the same dimensionality.
Some people like to structure their fact table so that it is in effect an
aggregate table at the lowest possible granularity. I'm fine with that, and
so is Mondrian.
With such a fact table, if you have a COUNT measure, what would you expect
it to return for my two beer purchases on the same day? I would expect it to
return 2, because there are two transactions. To achieve this, I would
create a integer column called FACT_COUNT, usually 1, but 2 for the record
representing my beer purchases, and my COUNT measure would be the sum of the
values in that column.
Regarding question 2. Yes, Mondrian will do the right thing. Mondrian never
assumes that the fact table has a primary key, so if you have two identical
rows each row will be counted once and you will get the desired effect. It
will never create a cartesian product that would give you 4 or 8 copies of
the row.
Mondrian does assume the primary and foreign key property for dimension
tables; that is, for each foreign key value in the fact table, Mondrian
assumes that there is exactly one row in the dimension table with that
value. If not, you will get under- and over-counts in some circumstances.
Those circumstances have been discussed elsewhere. Mondrian's behavior in
those circumstances as you would expect, if you know how Mondrian works, and
can even be exploited. For example, parent-child closure tables sum the
salaries of all employees who report to Frank Black by creating multiple
records in the dimension table for Frank Black.
Julian
> -----Original Message-----
> From: mondrian-bounces at pentaho.org
> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Etienne Dube
> Sent: Friday, September 25, 2009 7:34 AM
> To: mondrian at pentaho.org
> Subject: [Mondrian] duplicate foreign keys in foodmart
>
> Hi Mondrian developers,
>
> I posted this a few days ago in the Pentaho forums but I
> haven't got an
> answer yet. Maybe one of you could enlighten me on that.
>
> ------------------------------
> While toying around with the foodmart DB, I noticed that some of the
> facts in the sales_fact_1997 and sales_fact_1998 tables have
> duplicate
> foreign keys. For example, if I run the following SQL query
> to retrieve
> rows for which there is more than one occurrence of a combination of
> values for the foreign keys:
>
> select a.* from sales_fact_1997 as a,
> (select product_id, time_id, customer_id, promotion_id,
> store_id, count(*)
> from sales_fact_1997
> group by product_id, time_id, customer_id, promotion_id, store_id
> having count(*) > 1) as b
> where a.product_id = b.product_id and a.time_id = b.time_id and
> a.customer_id = b.customer_id
> and a.promotion_id = b.promotion_id and a.store_id = b.store_id;
>
> this yields these rows (sorry for the ugly formatting):
>
> product_id time_id customer_id promotion_id store_id
> store_sales store_cost unit_sales
> ---------- ------- ----------- ------------ --------
> ----------- ---------- ----------
> 290 488 1690 0 16
> 5.1000 2.2440 3.0000
> 290 488 1690 0 16
> 5.1000 1.8870 3.0000
> 485 692 1850 1235 16
> 6.5700 2.3652 3.0000
> 485 692 1850 1235 16
> 8.7600 4.2924 4.0000
> 800 725 2244 0 17
> 9.7800 4.1076 3.0000
> 800 725 2244 0 17
> 6.5200 2.3472 2.0000
> 1262 381 3386 1845 24
> 2.2800 1.0716 3.0000
> 1262 381 3386 1845 24
> 3.0400 1.2160 4.0000
> 1539 482 3491 1220 16
> 14.3200 6.7304 4.0000
> 1539 482 3491 1220 16
> 10.7400 4.4034 3.0000
> 1381 655 4903 0 3
> 6.5100 2.5389 3.0000
> 1381 655 4903 0 3
> 4.3400 2.0832 2.0000
> 42 630 5026 0 17
> 11.2800 3.7224 4.0000
> 42 630 5026 0 17
> 5.6400 2.3124 2.0000
> 883 466 9535 824 11
> 8.7000 3.1320 3.0000
> 883 466 9535 824 11
> 14.5000 4.4950 5.0000
>
> Notice the rows with duplicate foreign keys values, but
> having different
> values in the measures columns. I thought that usually (according to
> Kimball's Data Warehouse Toolkit), foreign keys in a fact
> table should
> form a composite primary key. Of course in the case of that
> table, this
> doesn't hold true and creation of a primary key fails due to the
> duplicate values.
>
> Is that a normal and sane design choice, or is there
> something wrong or
> unusual with this fact table? Will Mondrian do the right thing, i.e.
> aggregate the measure values, when there are duplicate facts
> (w/ same keys)?
>
> Thanks,
>
> Etienne
>
> _______________________________________________
> Mondrian mailing list
> Mondrian at pentaho.org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>
More information about the Mondrian
mailing list