[Mondrian] Multithreading etc

Julian Hyde julianhyde at speakeasy.net
Sun Mar 11 03:24:00 EDT 2007


 




From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org]
On Behalf Of michael bienstein
Sent: Friday, March 09, 2007 2:36 AM
To: Mondrian developer mailing list
Subject: [Mondrian] Multithreading etc


I sent this to the list but it gets bounced because I attached the code
in a zip file.  How do I send code through without checking it in
because it is still orthogonal to the codebase?

Have you tried attaching a zip file to a forum thread? 
 
Alternatively, you could send to mondrian-devel. That list still works,
although it's not much used anymore.



Well, I have code that works for multi-threading infrastructure so I
would like to know if it is worth continuing with this or not.

As for ROLLUP/CUBE my thoughts are:
1) Either we keep the codebase simple by sticking to a standard
(SQL2003) even if this standard is not yet implemented widely and
certain databases have better special features than others, or we allow
a per-database SQL generation system.  The argument for the second makes
sense only if the developer resources to write and maintain each dialect
come from the database vendor or their community.  Mondrian is probably
at a stage where such discussions can be undertaken with the database
vendors. 
 

I've been talking with Matt Campbell (mkambol) about how this could be
implemented. Apparently Oracle, DB2 and Teradata (the main platforms of
interest to Matt) implement the "GROUP BY GROUPING SETS" construct which
we will need, with the same syntax.
 
Grouping sets are good because they allow us to specify exactly which
groups we want the DBMS to return. If we had used the ROLLUP construct,
we would have had to write logic in mondrian to figure out which
aggregations could be grouped together in the same query. But with
GROUPING SETS, the DBMS can figure out which aggregations can be
computed by rolling up others.
 
We will also need the GROUPING function.
 
Since these three databases support what we need, I am inclined to stick
to the standard. I haven't checked whether other databases support this
syntax, but I am hopeful that they do, or soon will.
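
To make the shape of such a query concrete, here is a minimal sketch of
generating one statement that loads several aggregations at once. The
class GroupingSetsSql, its build method, the g_ alias prefix and the
table and column names are all hypothetical and are not part of
mondrian; the sketch also assumes simple, unqualified column names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of mondrian: builds a single query that
// returns several groupings using GROUP BY GROUPING SETS.
final class GroupingSetsSql {
    static String build(String table, String measureExpr, List<List<String>> groupings) {
        // Union of every column used by any grouping set, in first-seen order.
        Set<String> columns = new LinkedHashSet<>();
        for (List<String> grouping : groupings) {
            columns.addAll(grouping);
        }

        StringBuilder sql = new StringBuilder("SELECT ");
        for (String column : columns) {
            sql.append(column).append(", ");
        }
        // GROUPING(col) is 1 in rows where col was aggregated away, else 0,
        // which lets the caller tell which grouping set a row belongs to.
        for (String column : columns) {
            sql.append("GROUPING(").append(column).append(") AS g_")
               .append(column).append(", ");
        }
        sql.append(measureExpr).append(" AS m FROM ").append(table);

        List<String> sets = new ArrayList<>();
        for (List<String> grouping : groupings) {
            sets.add("(" + String.join(", ", grouping) + ")");
        }
        sql.append(" GROUP BY GROUPING SETS (")
           .append(String.join(", ", sets))
           .append(")");
        return sql.toString();
    }
}
```

Asking for the groupings (year, state) and (year) would yield one query
whose rows can be routed to two aggregations by inspecting the g_year
and g_state columns.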

 
 
2) Architecturally this implies loading multiple Aggregations from one
SQL query.  That requires a rethink of the way the cell cache loading is
done because at the moment Aggregations are loaded one at a time, each
in a synchronized block on the Aggregation.  Similar concerns have to be
dealt with for in-memory rollups.  I think that synchronized is too
forceful.  We need something more like a Lock from java.util.concurrent
so we can do tryLock().  Look at the TxLock idea I have in the code I'm
attaching. 

Yes, this issue came to light in our design discussions also.
 
I look forward to reading your code, but it occurs to me that we can
leverage aggregations' state of 'ready' or 'loading'. We could upgrade
this to a lock, so another thread can wait for a loading aggregation to
become ready.
 
Synchronized will still need to be used, and carefully, to ensure that
no thread ever sees the system in an inconsistent state.
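
As an illustration of combining the two ideas (a tryLock() for the
loader, and a 'loading'/'ready' state that other threads can wait on),
here is a minimal sketch. LoadableAggregation and its methods are
hypothetical names, not mondrian classes and not the TxLock from the
attached code; only the java.util.concurrent calls are real.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical stand-in for an aggregation that is either loading or ready.
final class LoadableAggregation {
    private final ReentrantLock loadLock = new ReentrantLock();
    private final CountDownLatch ready = new CountDownLatch(1);
    private Object data; // written before countDown(), read after await()

    /** Returns false at once if another thread is loading; otherwise loads if needed. */
    boolean tryLoad(Supplier<Object> loader) {
        if (!loadLock.tryLock()) {
            return false; // fail quick instead of blocking like synchronized
        }
        try {
            if (ready.getCount() > 0) {
                data = loader.get();
                ready.countDown(); // state moves from 'loading' to 'ready'
            }
            return true;
        } finally {
            loadLock.unlock();
        }
    }

    /** Waits for the 'ready' state, up to the given timeout. */
    Object awaitReady(long timeout, TimeUnit unit) {
        try {
            if (!ready.await(timeout, unit)) {
                throw new IllegalStateException("aggregation not ready in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return data;
    }

    boolean isReady() {
        return ready.getCount() == 0;
    }
}
```

A thread whose tryLoad() returns false can go and do other work, or
call awaitReady() if it needs the cells. Note that in this sketch a
loader that throws leaves the aggregation in the 'loading' state, so
waiters only find out through the timeout.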

 

As for multi-threading:
I have only written most of the base infrastructure, not the cell
loading.  To integrate would require a significant amount of work in
Mondrian's code to pass all interaction with Mondrian through
TxSystem.runWithTx().  

Basic concerns are:

1)      Threads should be able to share data related to the request
across the threads.
2)      A Thread should be loaned to a request and returned in a way
that is well-nigh fail-safe (i.e. the thread shouldn't keep running if
the request fails in some way).
3)      We should be able, via a parameter of some sort, to decide NOT
to use threads at all.
4)      The number of threads should be configurable.
5)      There should be an independence from the rest of the code base.
6)      We should be able to make use of custom thread pools or use
managed thread pools from the application server.
7)      Then there is a relatively minor issue with read-consistency for
near-real-time data that turns out to be a real head-ache.  This can be
done by either: using the transaction semantics of the underlying data
store or modifying all SQL requests and cache interactions with a
timestamp and/or transaction id of some sort.  E.g. when an MDX request
begins it asks the underlying data store for the id of the last
completed transaction that modified data and keeps this in a
request-scope available to all threads.  Then it appends "changedTxId <=
${lastTxIdWhenFirstEntered}" to each WHERE clause.  If however we use
the underlying data store's transactions then we must keep open the JDBC
Connection for the duration of the request reusing it on the same thread
for each interaction with that data store.
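
The transaction-id variant of point 7 can be sketched as follows. The
class RequestSnapshot and its method are hypothetical; the column name
changedTxId is taken from the example above, and the sketch assumes the
SQL generator assembles its WHERE clause from a list of conditions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical request-scoped object shared by all threads serving one
// MDX request; it remembers the last completed transaction id that was
// read from the data store when the request began.
final class RequestSnapshot {
    private final long lastTxIdWhenFirstEntered;

    RequestSnapshot(long lastCompletedTxId) {
        this.lastTxIdWhenFirstEntered = lastCompletedTxId;
    }

    /** Builds a WHERE clause from the given conditions plus the snapshot predicate. */
    String whereClause(List<String> conditions) {
        List<String> all = new ArrayList<>(conditions);
        // Rows changed by later transactions are invisible to this request.
        all.add("changedTxId <= " + lastTxIdWhenFirstEntered);
        return "WHERE " + String.join(" AND ", all);
    }
}
```

Because the id is fixed when the request starts, every SQL statement
issued on any thread for that request sees the same version of the
data without holding a JDBC Connection open.
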

Now, I think that the best way to take advantage of multiple threads in
the storage system is NOT to launch multiple SQL statements against
different aggregations of the same star schema, but rather to partition
the data.  That is, to segment the cell data (and maybe dimension data)
based on the values of certain columns.  For example, year<2007 and
year=2007 in two different partitions.  This can be introduced slowly by
simply making a RolapStar one Partition for the moment.  Having said
that, aggregation tables are also a type of Partition, and hitting two of
them at once should be quite easy.
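
A minimal sketch of loading partitions in parallel, with a configurable
thread count and the option of using no extra threads at all, might
look like this. PartitionedLoader and loadAll are hypothetical names;
a real integration would presumably accept a pool supplied by the
application server rather than create its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs one load task per partition.
final class PartitionedLoader {
    /** threadCount <= 0 means "do not use threads": run on the caller's thread. */
    static <T> List<T> loadAll(List<Callable<T>> partitionTasks, int threadCount) {
        List<T> results = new ArrayList<>();
        try {
            if (threadCount <= 0) {
                for (Callable<T> task : partitionTasks) {
                    results.add(task.call());
                }
                return results;
            }
            ExecutorService pool = Executors.newFixedThreadPool(threadCount);
            try {
                // invokeAll waits for all tasks and returns futures in task order.
                for (Future<T> future : pool.invokeAll(partitionTasks)) {
                    results.add(future.get());
                }
            } finally {
                // Threads are returned even if a partition fails.
                pool.shutdownNow();
            }
            return results;
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException("partition load failed", e);
        }
    }
}
```

With a single partition per RolapStar this degenerates to today's
behaviour; adding a second partition, such as year<2007 and year=2007,
would change only the list of tasks.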

So the design I am introducing has the following features:
1) A scope for "request" or "interaction" that is larger than the Thread
that begins it.  Since this is similar to a transaction I've called it a
Tx.  See the mondrian.tx package.  Each sub-system in Mondrian can
enlist a representation of itself in the Tx.
2) Break up the different tasks performed into Task objects that can be
run potentially in parallel.  Allow a set of Tasks to be tied to the
same Thread so that the same JDBC Connection can be used for all of them
for read-consistency and cleaned up at the end of the Tx.  This is done
declaratively so the implementation can be changed easily.  The
implementation can also ensure that the J2EE context is passed onto
separate threads (JNDI, context class loader etc).
3) A system of fail-quick locks at the Tx scope rather than just Thread
scope.  
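
To illustrate point 3, here is a minimal sketch of a fail-quick lock
whose owner is a Tx rather than a Thread, so that any thread working on
behalf of the same Tx may enter. TxScopedLock is a hypothetical name and
is not the TxLock from the mondrian.tx package; the Tx is represented
by a plain Object for brevity.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lock owned by a Tx (request scope), not by a thread.
final class TxScopedLock {
    private final AtomicReference<Object> owner = new AtomicReference<>();

    /** Never blocks: true if this Tx now holds (or already held) the lock. */
    boolean tryLock(Object tx) {
        Objects.requireNonNull(tx, "tx");
        return owner.get() == tx || owner.compareAndSet(null, tx);
    }

    /** Releases the lock; only the owning Tx may do so. */
    void unlock(Object tx) {
        if (!owner.compareAndSet(tx, null)) {
            throw new IllegalMonitorStateException("lock not held by this Tx");
        }
    }
}
```

Unlike ReentrantLock, this sketch does not count re-entries: a single
unlock() releases the lock for the whole Tx, which suits cleanup at the
end of the Tx.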

If this is worth pursuing as a design for the next version then good.
If not I'll stop now. 
 

This definitely sounds plausible... I'd like to read through your code
before I answer in detail.
 
Julian