[Mondrian] Preview of hybrid fact/aggregate tables coming in mondrian-4

Julian Hyde jhyde at pentaho.com
Wed Mar 28 20:52:55 EDT 2012


On our most recent trip to Florida, Luc & I did some brainstorming and came up with a really nice idea I call hybrid fact/aggregate tables. These are dual-purpose tables that are aggregate tables but also hold measures that don't exist at a lower granularity. They are are also going to be a nice building block for us to improve our support for semi- and non-additive measures (see http://jira.pentaho.com/browse/MONDRIAN-962 and http://jira.pentaho.com/browse/MONDRIAN-1100).

I've fleshed out those designs, and turned them into XML syntax. I thought it would be interesting to share with you all. (It uses mondrian-4 concepts that I've not officially explained yet... but I figure, if it's a choice between beta product and complete documentation, you'd all rather have the product.)

I'm using the <MeasureGroup> syntax that we introduced to allow a cube to have multiple fact tables (effectively replacing the virtual cube concept). But I added the attribute "type=aggregate" to distinguish these hybrid "fact-aggregate" tables from regular fact tables. Many of the same concepts hold (<Measure>, <ForeignKeyLink>); I added <MeasureRef> to signal that a column holds a roll-up of a measure defined in another measure group (hence you don't need to re-specify the formatString or aggregator attributes).

I also added <CopyLink> element to define a dimension that is denormalized into the agg table. (You can't do that on a base measure group; the nearest equivalent is degenerate dimensions, specified using the <FactLink> element, but these require that the dimension is based on the same table that is the fact table of the measure group.)

To give you some idea of what MeasureGroups are, here is a regular fact table in mondrian-4 syntax:

<Cube name="Sales">
 ...
    <MeasureGroups>
        <MeasureGroup name='Sales' table='sales_fact_1997' type='base'>
            <Measures>
                <Measure name='Unit Sales' column='unit_sales' aggregator='sum' formatString='Standard'/>
                <Measure name='Store Cost' column='store_cost' aggregator='sum' formatString='#,###.00'/>
                <Measure name='Store Sales' column='store_sales' aggregator='sum' formatString='#,###.00'/>
                <Measure name='Sales Count' column='product_id' aggregator='count' formatString='#,###'/>
                <Measure name='Customer Count' column='customer_id' aggregator='distinct-count' formatString='#,###'/>
                <Measure name='Promotion Sales' column='promotion_sales' aggregator='sum' formatString='#,###.00' datatype='Numeric'/>
            </Measures>
            <DimensionLinks>
                <ForeignKeyLink dimension='Store' foreignKeyColumn='store_id'/>
                <ForeignKeyLink dimension='Time' foreignKeyColumn='time_id'/>
                <ForeignKeyLink dimension='Product' foreignKeyColumn='product_id'/>
                <ForeignKeyLink dimension='Promotion' foreignKeyColumn='promotion_id'/>
                <ForeignKeyLink dimension='Customer' foreignKeyColumn='customer_id'/>
            </DimensionLinks>
        </MeasureGroup>
    </MeasureGroups>
</Cube>

As you can see, it defines measures and how the fact table links to each of the cube's dimensions.

Here is an aggregate table in mondrian-3 syntax:

<AggName name='agg_c_special_sales_fact_1997'>
    <AggFactCount column='FACT_COUNT'/>
    <AggIgnoreColumn column='foo'/>
    <AggIgnoreColumn column='bar'/>
    <AggForeignKey factColumn='product_id' aggColumn='PRODUCT_ID' />
    <AggForeignKey factColumn='customer_id' aggColumn='CUSTOMER_ID' />
    <AggForeignKey factColumn='promotion_id' aggColumn='PROMOTION_ID' />
    <AggForeignKey factColumn='store_id' aggColumn='STORE_ID' />
    <AggMeasure name='[Measures].[Unit Sales]' column='UNIT_SALES_SUM' />
    <AggMeasure name='[Measures].[Store Cost]' column='STORE_COST_SUM' />
    <AggMeasure name='[Measures].[Store Sales]' column='STORE_SALES_SUM' />
    <AggLevel name='[Time].[Year]' column='TIME_YEAR' />
    <AggLevel name='[Time].[Quarter]' column='TIME_QUARTER' />
    <AggLevel name='[Time].[Month]' column='TIME_MONTH' />
</AggName>

And here is the same aggregate table expressed as a MeasureGroup in mondrian-4 syntax.

<MeasureGroup table='agg_c_special_sales_fact_1997' type='aggregation'>
    <Measures>
        <Measure name='Projected Store Sales' column='PROJECTED_STORE_SALES' aggregator='sum' formatString='#,###.00'/>
        <MeasureRef name='Fact Count' column='STORE_COST_SUM'/>
        <MeasureRef name='Unit Sales' column='UNIT_SALES_SUM'/>
        <MeasureRef name='Store Cost' column='STORE_COST_SUM'/>
        <MeasureRef name='Store Sales' column='STORE_SALES_SUM'/>
    </Measures>
    <DimensionLinks>
        <ForeignKeyLink dimension='Store' foreignKeyColumn='store_id'/>
        <ForeignKeyLink dimension='Product' foreignKeyColumn='product_id'/>
        <ForeignKeyLink dimension='Promotion' foreignKeyColumn='promotion_id'/>
        <ForeignKeyLink dimension='Customer' foreignKeyColumn='customer_id'/>
        <CopyLink dimension='Time' attribute='Month'>
            <ColumnRef aggColumn='TIME_YEAR' table='time_by_day' column='the_year'/>
            <ColumnRef aggColumn='TIME_QUARTER' table='time_by_day' column='quarter'/>
            <ColumnRef aggColumn='TIME_MONTH' table='time_by_day' column='month_of_year'/>
        </CopyLink>
    </DimensionLinks>
</MeasureGroup>

It defines one additional measure, [Measures].[Projected Store Sales], that is not available in the base fact table.

At a high level, the concepts AggName becomes MeasureGroup, AggForeignKey becomes ForeignKeyLink, AggMeasure becomes MeasureRef, and AggLevel becomes CopyLink. AggFactCount is subsumed by MeasureRef, and AggIgnoreColumn is obsolete. But there is a big semantic shift going on here. In mondrian-3, aggregate tables were defined in terms of dimensional concepts (e.g. levels) but this was difficult because in mondrian-4 levels can have composite keys. Therefore mondrian-4 aggregate tables are defined in terms of tables and columns. Or more precisely, usages of tables and columns that were defined in the physical schema.

Currently, you can create rules to recognize aggregate tables and figure out the roles of their columns. But it's never worked that well, so I want to drop pattern-matching in mondrian-4. We'll provide some other way of doing the same thing. We haven't figured out exactly how, but the leading contender -- suggested by Will Gorman -- is as follows. Mondrian would provide an API (REST, or Java, or both) to register, enable, and disable an aggregate table. Then have an "on-schema-load" trigger, that scans the catalog, finds aggregate tables, registers and enables them. This script is pluggable, so people can devise their own rules.

The REST API would solves a problem that a lot of people have been complaining about. ETL jobs take a long time, and populating aggregate tables takes a significant portion of that. You can shorten your ETL load window by allowing queries as soon as the fact and dimension tables are populated, if there is an API to tell Mondrian when particular aggregate tables are populated and safe to bring online. The REST API would provide that. But as I said, it's no more than a glimmer in our eye at this point.

Give us feedback on what you think of these designs, and help us prioritize. The simpler we can make this, the sooner mondrian-4 will hit the streets.

Julian

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.pentaho.org/pipermail/mondrian/attachments/20120328/536a3265/attachment.html 


More information about the Mondrian mailing list