[Mondrian] SSB Schema

jeff.s.wright at thomsonreuters.com
Wed Dec 15 16:54:47 EST 2010


Here’s what I got from the fall semester project:

 

- MySQL DDL to create warehouse tables based on the TPC-DS benchmark

- A Mondrian schema describing the data warehouse

- A JMeter test script with 12 simple queries

- An installation guide that describes how to set this up, including running the TPC-DS data generator

 

I’ve arranged for another semester’s worth of work. What I was picturing as the focus would be:

 

- Some ramp-up activities, such as porting to Oracle

- Automation to build the test script on the fly from a directory of queries, to make it easier to grow the query base over time (see the sketch below)

- “Fuzzy logic” to convert results to a pass/fail verdict on non-dedicated hardware
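
For the second item, here’s a rough sketch of the direction I’m picturing: collect every .mdx file under a queries/ directory and turn each one into a test case. The paths and class names are placeholders, not the students’ actual code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class QueryDirLoader {
    /** Collects the text of every .mdx file under dir, one test case each. */
    static List<String> loadQueries(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(p -> p.toString().endsWith(".mdx"))
                        .sorted()                       // stable test ordering
                        .map(QueryDirLoader::read)
                        .collect(Collectors.toList());
        }
    }

    private static String read(Path p) {
        try {
            return Files.readString(p);
        } catch (IOException e) {
            throw new RuntimeException(p.toString(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each query would become one sampler in the generated JMeter plan.
        loadQueries(Path.of("queries")).forEach(System.out::println);
    }
}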

 

We will kick off the spring semester project in January, and I’m writing a high-level description this week. I’m happy to get input; I’d like to see this integrated into the official environment.

 

I was hoping to reproduce it myself, at least with MySQL, sometime in the next two weeks. I’ll put the question back to you: what’s the best way for me to deliver something for you to try out?

 

--jeff

 

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Luc Boudreau
Sent: Wednesday, December 15, 2010 4:03 PM
To: Mondrian developer mailing list
Subject: Re: [Mondrian] SSB Schema

 


Hello everyone,

We started to integrate Joe's patches using our rudimentary performance test suite. We should be done integrating them by the end of the week.

Jeff, how are things at NCSU? Do we have an ETA on the performance benchmark? Is there anything we can do to help you and your people?

Cheers!

_____________________________
Luc Boudreau



On Fri, Dec 3, 2010 at 12:43 PM, Joe Barnett <thejoe at gmail.com> wrote:

Hi all,

Development of this benchmark suite is very exciting to see, and we'd
love to help once it gets off the ground.  We've recently been working
to port our application from Mondrian 2.4 to the latest 3.2 release,
and have been running into issues due to performance regressions.  I've
just filed a number of JIRAs with patches that get our queries in 3.2
to 2.4-level speeds:

http://jira.pentaho.com/browse/MONDRIAN-842
http://jira.pentaho.com/browse/MONDRIAN-843
http://jira.pentaho.com/browse/MONDRIAN-844
http://jira.pentaho.com/browse/MONDRIAN-845
http://jira.pentaho.com/browse/MONDRIAN-846
http://jira.pentaho.com/browse/MONDRIAN-847

Further description is in each individual JIRA, but I'd also like to
mention two areas where we've noticed degradation vs. 2.4 that these
patches do not fully address:

1)  RolapEvaluator#setContext() is still slower in 3.2, though not as
bad as it was without the somewhat crazy 847 patches.  I believe this
is still related to the polymorphism issues described in
http://www.javaspecialists.eu/archive/Issue158.html, resulting in the
calls to member#isEvaluated() being slower to dispatch than they were
with the 2.4 class hierarchy.
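
For anyone who wants to see the dispatch effect in isolation, here's a
standalone sketch. The Member classes are illustrative stand-ins, not
Mondrian's real hierarchy; a call site that has only ever seen one
receiver type can be inlined by the JIT, while one that has seen three
or more becomes megamorphic and falls back to slower vtable dispatch.
Run each mode in a separate JVM so one profile doesn't pollute the
other's call site:

interface Member { boolean isEvaluated(); }
class StoredMember implements Member { public boolean isEvaluated() { return false; } }
class CalculatedMember implements Member { public boolean isEvaluated() { return true; } }
class ParentChildMember implements Member { public boolean isEvaluated() { return false; } }

public class DispatchBench {
    static int countEvaluated(Member[] members) {
        int n = 0;
        for (Member m : members) {
            if (m.isEvaluated()) n++;   // the virtual call under test
        }
        return n;
    }

    public static void main(String[] args) {
        boolean mega = args.length > 0 && args[0].equals("mega");
        Member[] members = new Member[1_000_000];
        for (int i = 0; i < members.length; i++) {
            members[i] = !mega ? new StoredMember()
                    : (i % 3 == 0) ? new StoredMember()
                    : (i % 3 == 1) ? new CalculatedMember()
                    : new ParentChildMember();
        }
        for (int i = 0; i < 20; i++) countEvaluated(members);  // JIT warm-up
        long t0 = System.nanoTime();
        int n = countEvaluated(members);
        long t1 = System.nanoTime();
        System.out.printf("%s: %d us (%d evaluated)%n",
                mega ? "megamorphic" : "monomorphic", (t1 - t0) / 1000, n);
    }
}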

2)  get()-ing and put()-ing AggregationKeys into/out of the
RolapStar#sharedAggregations and RolapStar#localAggregations maps
takes about twice as long as the same operations did when these maps
were keyed by BitKeys instead of AggregationKeys.  I set up some
microbenchmarks that stripped as much as I could out of AggregationKey,
and then compared BitKey#hashCode vs AggregationKey#hashCode,
BitKey#equals vs AggregationKey#equals, and
Map<AggregationKey,?>#get() vs Map<BitKey,?>#get().  Even with
AggregationKey simplified to just compare the BitKeys and ignore the
star and compoundPredicateMap, the map#get() calls were always ~2x
slower, even though #hashCode() times were equivalent and #equals()
times were inconclusive (sometimes equivalent, sometimes BitKey
faster, sometimes AggregationKey faster).
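
The skeleton of that map-lookup comparison looked roughly like this
(simplified stand-ins, not the real BitKey/AggregationKey; AggKey
delegates hashCode/equals to its BitKey, mirroring the "simplified
AggregationKey" case above):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class KeyLookupBench {
    static final class BitKey {
        final BitSet bits;
        BitKey(BitSet bits) { this.bits = bits; }
        @Override public int hashCode() { return bits.hashCode(); }
        @Override public boolean equals(Object o) {
            return o instanceof BitKey && bits.equals(((BitKey) o).bits);
        }
    }

    // Wrapper that ignores its extra fields, like the simplified AggregationKey.
    static final class AggKey {
        final BitKey bitKey;
        AggKey(BitKey bitKey) { this.bitKey = bitKey; }
        @Override public int hashCode() { return bitKey.hashCode(); }
        @Override public boolean equals(Object o) {
            return o instanceof AggKey && bitKey.equals(((AggKey) o).bitKey);
        }
    }

    static <K> long time(Map<K, Integer> map, K[] keys) {
        long t0 = System.nanoTime();
        long sum = 0;
        for (int rep = 0; rep < 1000; rep++)
            for (K k : keys) sum += map.get(k);
        long t = System.nanoTime() - t0;
        if (sum == 42) System.out.println();  // keep the loop from being eliminated
        return t;
    }

    public static void main(String[] args) {
        int n = 1000;
        BitKey[] bitKeys = new BitKey[n];
        AggKey[] aggKeys = new AggKey[n];
        Map<BitKey, Integer> byBit = new HashMap<>();
        Map<AggKey, Integer> byAgg = new HashMap<>();
        for (int i = 0; i < n; i++) {
            BitSet bs = new BitSet();
            bs.set(i); bs.set(i + 7);
            bitKeys[i] = new BitKey(bs);
            aggKeys[i] = new AggKey(bitKeys[i]);
            byBit.put(bitKeys[i], i);
            byAgg.put(aggKeys[i], i);
        }
        time(byBit, bitKeys); time(byAgg, aggKeys);  // warm-up
        System.out.printf("BitKey get: %d us, AggKey get: %d us%n",
                time(byBit, bitKeys) / 1000, time(byAgg, aggKeys) / 1000);
    }
}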

I'm not sure how feasible it is to speed these up while maintaining
the functionality changes that caused them, but it seemed worth
bringing to everybody's attention.

-Joe

PS, a typical query that we've been benchmarking looks something like
this, where [Dimension].[Level].Members can be quite large:

WITH
    MEMBER [Dimension].[Aggregated] AS
        IIF(COUNT([Available]) = COUNT([Requested]),
            [Dimension].[All Dimension],
            AGGREGATE(UNION([Requested],
                { [Dimension].[Special], [Dimension].[Unattributed],
                  [Dimension].[OtherSpecial] })))
    MEMBER [Dimension].[Special] AS [Dimension].[#Null]
    MEMBER [Dimension].[Unattributed] AS [Dimension].[#Null]
    MEMBER [Dimension].[OtherSpecial] AS [Dimension].[#Null]
    MEMBER [Measures].[total_available_count] AS
        Format(COUNT([Available]), "######")
    MEMBER [Measures].[total_result_count] AS
        Format(COUNT([Requested]), "######")
    MEMBER [Measures].[id] AS [Dimension].CurrentMember.Name
    MEMBER [Time].[Base] AS [Time].[Epoch].[1288483200]
    MEMBER [Time].[Slicer] AS [Time].[Epoch].[1289088000]
    MEMBER [OtherDimension].[Slicer] AS [OtherDimension].[1]
    SET [Available] AS
        FILTER([Dimension].[Level].Members,
               CAST([Dimension].CurrentMember.Key AS NUMERIC) > 0)
    SET [Filtered] AS
        CustomFilteringUDF([Available], "customfilteringstring")
    SET [Requested] AS [Filtered]
SELECT
    { [Dimension].[Aggregated], [Dimension].[Special],
      [Dimension].[Unattributed], [Dimension].[OtherSpecial],
      TopCount([Requested], 20, [Measures].[revenue]) } ON ROWS,
    { [Measures].[id], [Measures].[total_available_count],
      [Measures].[total_result_count], [Measures].[profit],
      [Measures].[roi], [Measures].[cost], [Measures].[profit_margin],
      [Measures].[revenue] } ON COLUMNS
FROM
    Our_Cube
WHERE
    {(
        [Time].[Slicer],
        [OtherDimension].[Slicer]
    )}



On Fri, Dec 3, 2010 at 5:37 AM, <jeff.s.wright at thomsonreuters.com> wrote:
>
> >1. Before we get it working in Hudson, I'd like to just get it working. If we document the process of setting up, it can be done as a manual test.
>
>
>
> A documented procedure is part of the students’ assignment.
>
>
>
> >2. Making it an automated process will be a different challenge.
>
>
>
> I like the idea of a moving average and a tolerance. It would be interesting to try to monitor system load and use that to calibrate the pass/fail criteria, but I suspect that gets into the realm of PhD research when you start looking at distributed environments and VMs.
>
>
>
> >3. MySQL and Oracle are good dual platforms. I like the idea of using Oracle as the main database. Not politically correct in the open source world, but its performance is broadly representative of other databases (whereas MySQL has some weak points for BI queries).
>
>
>
> Delighted to hear that, because I’ve seen issues with MySQL too.
>
>
>
> A few more ideas:
>
>
>
> 6. None of the students had prior knowledge of MDX, so the initial set of queries is going to be simplistic. I’d like to see this ultimately based on a directory of queries, similar to the Mondrian JUnit tests. That would give us a chance to develop queries that exercise features that are important to us (like native evaluation, large schemas, large result sets, virtual cubes, grouping sets) and others could contribute queries in the areas where they have interest.
>
>
>
> 7. I’d like to see both a single-user query performance test and a multi-user throughput test. Multi-user has some interesting problems because of caching. The brute-force solution is to do multi-user testing with the cache off, but there might be other approaches that enable a mix of cached/non-cached queries. I’m thinking about parameterizing queries with random parameters, or randomizing the order of queries in different threads (see the sketch below). It would be interesting to see how consistent (or not) throughput measurements would be in those cases.
>
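> A sketch of the per-thread randomization I’m picturing; the thread count and query list are placeholders, and the XMLA call is left as a comment:
>
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
>
> public class ThroughputDriver {
>     public static void main(String[] args) throws InterruptedException {
>         List<String> queries = List.of("q01.mdx", "q02.mdx", "q03.mdx");
>         int users = 8;
>         List<Thread> threads = new ArrayList<>();
>         for (int u = 0; u < users; u++) {
>             threads.add(new Thread(() -> {
>                 List<String> order = new ArrayList<>(queries);
>                 Collections.shuffle(order);  // each simulated user gets its own order
>                 for (String q : order) {
>                     // execute q via XMLA here and record the elapsed time
>                 }
>             }));
>         }
>         for (Thread t : threads) t.start();
>         for (Thread t : threads) t.join();
>         // Throughput = total queries executed / wall-clock time.
>     }
> }
>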
>
>
> --jeff
>
>
>
> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Julian Hyde
> Sent: Thursday, December 02, 2010 8:51 PM
>
> To: 'Mondrian developer mailing list'
> Subject: RE: [Mondrian] SSB Schema
>
>
>
> A few points:
>
>
>
> 1. Before we get it working in Hudson, I'd like to just get it working. If we document the process of setting up, it can be done as a manual test.
>
>
>
> 2. Making it an automated process will be a different challenge. We will need to cope with natural variance in the results. The variance will be greater if we run on a VM, especially one with other tenants, but there will always be variance. We can discuss approaches to dealing with variance; one approach that springs to mind is Bollinger bands (report whenever a result, or a short-term moving average, moves 2 or 3 standard deviations from a longer-term moving average). We can discuss others. I would also be inclined to do it under the auspices of a project independent of Mondrian, maybe an extension to JUnit for performance regression testing.
>
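> A minimal sketch of that band check, assuming one timing sample per build; the class name, window size, and k are whatever we tune them to:
>
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> public class BollingerGate {
>     private final Deque<Double> window = new ArrayDeque<>();
>     private final int size;   // samples in the long-term moving average
>     private final double k;   // how many standard deviations to tolerate
>
>     BollingerGate(int size, double k) { this.size = size; this.k = k; }
>
>     /** Returns false when the sample drifts k sigmas from the moving average. */
>     boolean accept(double sample) {
>         boolean ok = true;
>         if (window.size() == size) {
>             double mean = window.stream()
>                     .mapToDouble(Double::doubleValue).average().orElse(0);
>             double var = window.stream()
>                     .mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0);
>             ok = Math.abs(sample - mean) <= k * Math.sqrt(var);
>             window.removeFirst();  // slide the window forward
>         }
>         window.addLast(sample);
>         return ok;
>     }
> }
>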
>
>
> 3. MySQL and Oracle are good dual platforms. I like the idea of using Oracle as the main database. Not politically correct in the open source world, but its performance is broadly representative of other databases (whereas MySQL has some weak points for BI queries). We have Oracle available in Pentaho's Hudson environment.
>
>
>
> 4. The smallest size sounds good. (If we document the process, per 1, it would be easy to run with other sizes.)
>
>
>
> 5. We can manage without a Java generator. The task of generating data sets and loading them only has to be done once, and it can be a manual process.
>
>
>
> Julian
>
>
>
> ________________________________
>
> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of jeff.s.wright at thomsonreuters.com
> Sent: Wednesday, December 01, 2010 12:31 PM
> To: mondrian at pentaho.org
> Subject: RE: [Mondrian] SSB Schema
>
> I think it’s useful to discuss what it would take to set this up for the Mondrian Hudson server.
>
>
>
> I’m not familiar with the continuous integration environment other than the Hudson emails I see posted to the mailing list. Here are my assumptions and guesses, please correct and add as needed:
>
>
>
> · I assume this is a Linux environment, maybe even virtual.
>
> · I assume this is a shared environment, meaning there are other loads on the hardware, so a performance regression test would have to have some way of self-calibrating, or at least have tolerances.
>
> · I assume an open source database is preferred for testing. I had the students work with MySQL. That is a little less than ideal from my selfish point of view: one performance tweak that’s important to us is grouping sets, which I know is available in Oracle but, I was assuming, not in MySQL.
>
> · We should agree on a target database size. The smallest TPC-DS database size is 1GB, and the data model includes at least one dimension with > 1M rows. The students have been working with the smallest size. I was hoping to kick the tires some with their final test setup and see if that is indeed big enough to get interesting query performance. I suspect it is.
>
>
>
> I’m actually trying to see if I can arrange with my company and the university to sponsor another semester of work on the performance test harness. Most of the fall semester was consumed by the mechanics of the technology stack: MySQL, Mondrian, JMeter, MDX. I think another semester’s work could take this to an actual regression test that converts results to a pass/fail.
>
>
>
> By the way, the data generator generated some discussion before. The TPC-DS data generator is written in C. The students spent some time investigating an automated Java port, but hit dead ends. It looks like that would be a manual effort, and I’m not really considering it right now.
>
>
>
> --jeff
>
>
>
> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Luc Boudreau
> Sent: Tuesday, November 30, 2010 9:09 AM
> To: Mondrian developer mailing list
> Subject: Re: [Mondrian] SSB Schema
>
>
>
> Hi Jeff,
>
> I saw your emails but figured I'd reach out at large. I'm very happy to hear that your project went forward. Would your students like to contribute the test suite back to the Mondrian project? I could serve as a contact for them so we can get this thing done once their work is finished.
>
> Cheers!
>
> Luc
>
> On Tue, Nov 30, 2010 at 9:01 AM, <jeff.s.wright at thomsonreuters.com> wrote:
>
> I’ve been working with some CS students at NCSU to create a Mondrian schema for the TPC-DS database for use in performance testing. We chose TPC-DS over SSB because it’s a larger data model that enables us to test virtual cubes. They’re also creating a set of test queries and a JMeter test script to execute them via XMLA.
>
>
>
> They’re scheduled to be done with their project on Dec 10.
>
>
>
> --Jeff Wright
>
>
>
> From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On Behalf Of Luc Boudreau
> Sent: Tuesday, November 30, 2010 8:54 AM
> To: Mondrian developer mailing list
> Subject: [Mondrian] SSB Schema
>
>
>
> Hi everyone,
>
> We are looking at adding a performance benchmark to Mondrian's test suite. Has anyone written a Mondrian schema for the Star Schema Benchmark (SSB)? If so, would you share it with the community?
>
> Thanks and please pass the word around!
>
_______________________________________________
Mondrian mailing list
Mondrian at pentaho.org
http://lists.pentaho.org/mailman/listinfo/mondrian

 
