[Mondrian] Shared degenerate dimensions

Matt Campbell mcampbell at pentaho.com
Wed Oct 22 09:50:42 EDT 2014


Consider a concrete example in mongodb.  I've seen a customer collection with sensor readings of wattage, including Year/Month/Day/Hour/Minute attributes for the time of the reading.  I could easily imagine them wanting combine these metrics with information from a separate collection containing the times when alerts are issued to customers so that they can assess whether alerts impact electricity usage.

To support this sort of combined analysis the only option is to create a physically separate dimension that can be shared between the two sources, possibly by

1)      Extracting the data and loading it into fact/dim tables.

2)      Creating materialized views in Calcite (or additional collections) and relying on Calcite to handle joining.

It would be a whole lot simpler if we had a way to say that [Time].[2014].[February] from the sensor collection should be equated with the [Time].[2014].[February] member from the alerts collection.  There ideally should not be a need to join the collections to something else-they already have the member values directly present.  And even though the time members from the two collections come from different places they are the same entities and could be treated as equivalent or "aligned".

The MDX function LookupCube() comes to mind- where you can grab a cell value from a specified cube given a tuple expression.  If Mondrian supported LookupCube() you could define a calculated measure in one of the two cubes which does the lookup for the corresponding date member intersection in the other cube.  I guess I'm imagining something like an automatic way of handling this alignment.




From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Julian Hyde
Sent: Tuesday, October 21, 2014 3:45 PM
To: Mondrian developer mailing list
Subject: Re: [Mondrian] Shared degenerate dimensions

The tricky thing about degenerate dimensions is that they have duplicate values. So you can't join to one without affecting totals.

Intermediate levels in star dimension tables (e.g. the state level of the customer table) has the same problem. The problem is compounded when you have an aggregate table keyed on that level that wants to join to the dimension table to get more attributes.

Eliminating duplicates makes it more complicated for Mondrian to generate SQL (read: there may be bugs) and incurs a big performance hit on pretty much all DBs. So, on those grounds, I'd say that kind of schema design was a bad idea.

In my work on Calcite I am interested in being able to create materialized views, which are basically copies of data managed by the DB. With Calcite, a sensible physical design would include:

CREATE MATERIALIZED VIEW states AS SELECT DISTINCT country, state FROM customers;
CREATE MATERIALIZED VIEW paymentTypes AS SELECT DISTINCT paymentType FROM sales;

Note that "states" and "paymentTypes" are very small tables, probably best cached in memory. You can think of them as shrunken aggregate tables. Then Mondrian would be able to generate queries that join to these non-distinct columns, and behind the scenes Calcite would translate queries to use the materialized views.

Julian


On Oct 21, 2014, at 10:41 AM, Matt Campbell <mcampbell at pentaho.com<mailto:mcampbell at pentaho.com>> wrote:


I'm not aware of any limitations in M3 or M4 that prevents multi-level hierarchies from being defined using attributes on the fact table.

Historically I'd say you're right-it would be odd to have a degenerate dim with >1 columns associated with it.  But with mongodb use cases, as well as columnar dbs, it may be a much more common pattern.


From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf OfWright, Jeff
Sent: Tuesday, October 21, 2014 11:55 AM
To: Mondrian developer mailing list
Subject: Re: [Mondrian] Shared degenerate dimensions

Thinking out loud, wondering what you lose when you use degenerate dimensions instead of a standard star schema...

-          For M3 I think you lose the ability to participate in a multi-level hierarchy, not sure if that holds for M4.
-          You lose the ability to specify order via a different column than values. And maybe you don't really lose that, but it doesn't seem typical for a degenerate dimension to have >1 column associated to it.

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf OfWright, Jeff
Sent: Monday, October 20, 2014 12:56 PM
To: Mondrian developer mailing list
Subject: Re: [Mondrian] Shared degenerate dimensions

Very interested in shared degenerate dimensions. Besides the BFT use case, I've seen examples in a conventional star schema, where users wanted to treat a degenerate dimension that occurred on >1 fact table as the same dimension.

I've seen both these cases:
1.       At least one fact table can be treated as "complete", listing all members of the shared dimension.
2.       The set of members is a union across all fact tables.

It would be nice if the mondrian schema could define that the members of a degenerate dimension could be obtained from:
-          Distinct values in the fact table
-          Distinct values across multiple fact tables
-          Lookup from a table of distinct values created by ETL

That last option could even help performance of member queries for non-shared degenerate dimensions.

--jeff

From: mondrian-bounces at pentaho.org<mailto:mondrian-bounces at pentaho.org> [mailto:mondrian-bounces at pentaho.org] On Behalf Of Matt Campbell
Sent: Monday, October 20, 2014 12:16 PM
To: Mondrian developer mailing list
Subject: [Mondrian] Shared degenerate dimensions


Data models with "Big Flat Tables" instead of star schemas seem to be getting more common, driven in part by columnar databases where use of separate dimension tables has little or no performance/storage benefit.  Similarly, with document stores like mongodb it often makes sense to store dimensional data together with facts in the same collection.  Use of the Thin Kettle JDBC with Mondrian also results in BFT models, since Thin Kettle can only expose a single table with no ability to join.

This data model is far simpler, but loses important power and flexibility.  Dimensions in cubes built on a BFT are degenerate and are un-sharable in Mondrian.  But even with BFTs users will still want to combine data with overlapping but non-identical dimensionalities.  I.e. they still have motivations to use Virtual Cubes (or multiple measure groups in M4).  I'm wondering what options might be worth considering to support this use case.  Should there be a new dimension link type to allow sharing degenerate dimensions?  The idea of a shared degen dimension feels a bit like a CopyLink to me.  It also seems like there would still need to be a single "official" source for the full set of members (level.members needs to return a consistent set).

Curious if others have thoughts.


_______________________________________________
Mondrian mailing list
Mondrian at pentaho.org<mailto:Mondrian at pentaho.org>
http://lists.pentaho.org/mailman/listinfo/mondrian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20141022/17b85123/attachment-0001.html 


More information about the Mondrian mailing list