[Mondrian] Mondrian SegmentCache SPI

Julian Hyde jhyde at pentaho.com
Thu Feb 3 18:04:10 EST 2011


Luc,
 
This is a great start to the pluggable-cache project. The SPI is clear and
high-level. Thanks for seizing the initiative.
 
I think we will need one more method:
 
    Future<List<SegmentHeader>> listSegments()
 
This will allow Mondrian to connect to a cache and see what it contains. (An
external cache may have been running longer than the Mondrian server.)
 
This method is necessary because of a peculiarity of Mondrian's caching
strategy, wherein there is not a simple mapping from a cell to the segment
that contains it. For example, consider a more conventional cache: a CPU's L3
(level 3) cache, which caches 64K blocks of RAM. The byte at address
0xABCD1234 belongs to one and only one cache block: the one that starts at
0xABCD0000 and ends at 0xABCDFFFF.
 
Now consider Mondrian's cache. On one day, the cell ([Unit Sales], [CA],
[2010]) might be in the segment ([Unit Sales], {[CA], [OR]}, {[2009],
[2010]}); the next day, it might be in ([Unit Sales], {[CA]}, {year=*}).
Which segments exist depends on what queries have run earlier in the day. This
is different from a typical cache, but it works well, and is absolutely
appropriate for a ROLAP system. The listSegments method is what makes such a
cache workable; Mondrian can then index the segments and quickly find the
segment, if any, that contains a given cell.
 
If a cache can add and remove segments without Mondrian knowing about it, we
may also need to give the cache some way to notify Mondrian about changes to
the list of segments.
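To make that concrete, here is a rough sketch of what the extended SPI might
look like. The first three methods are the ones from Luc's proposal below,
listSegments() is the addition I am suggesting, and the listener interface
with its method names is purely hypothetical, just to illustrate the
notification idea:

    import java.util.List;
    import java.util.concurrent.Future;

    // SegmentHeader and SegmentBody are the types already defined by the SPI;
    // their imports are omitted here.
    public interface SegmentCache {
        Future<Boolean> contains(SegmentHeader header);
        Future<SegmentBody> get(SegmentHeader header);
        Future<Boolean> put(SegmentHeader header, SegmentBody body);

        // Proposed addition: enumerate what the cache already holds, so that
        // Mondrian can index it when it connects.
        Future<List<SegmentHeader>> listSegments();

        // Hypothetical: let the cache notify Mondrian of changes made behind
        // its back, e.g. by another Mondrian instance or by an eviction.
        void addListener(SegmentCacheListener listener);
    }

    // Hypothetical callback interface, named here only for illustration.
    interface SegmentCacheListener {
        void segmentAdded(SegmentHeader header);
        void segmentRemoved(SegmentHeader header);
    }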
 
Julian
 
 


  _____  

From: mondrian-bounces at pentaho.org [mailto:mondrian-bounces at pentaho.org] On
Behalf Of Luc Boudreau
Sent: Thursday, February 03, 2011 1:31 PM
To: Mondrian developer mailing list
Subject: [Mondrian] Mondrian SegmentCache SPI


Fellow Mondrian developers and users,

One month has already passed since the new year festivities, and while most
of you have been trying to renew your gym memberships or hold on to your New
Year's resolutions as best you could, so has the Mondrian team. Our
resolutions, although not requiring personal sacrifices, are nonetheless
starting to bear fruit.

For you see, our resolution for the year was to provide Mondrian developers
and integrators with the means to achieve better understanding, scalability
and control. We have many ideas on how to reach those goals. Some of them are
still in their infancy, yet some have already been committed to the source.
Last month, we worked on the first phase: we added a way for system
architects to externalize and share the segment cache through a pluggable
interface. What does this mean exactly? Let's take a step back in order to
better understand.

Internally, Mondrian splits tuples into segments. A typical segment could be
described as a measure crossjoined by a series of predicates. As an example,
a textual representation of a segment's contents could be:


Measure = [ Sales ]
Predicates = {
    [ Products = * ],
    [ State = California ],
    [ Gender = Male ] }
Data = [ 1346.34, 234.00, ... ]

In the case above, the segment represents the Sales data of all males in
California, for all products. It is a lot more efficient to deal with those
data structures. If Mondrian were to represent each data cell individually,
the unique identifier of a cell would be larger than the data itself, which
would create a whole lot of problems in terms of data efficiency. This is why
Mondrian deals with groups of cells, which it loads in batches rather than
individually. There is a lot of voodoo magic and heuristics in the background
trying to figure out how best to group those segments and how to reduce the
number of segments to load, ultimately reducing the number of SQL queries to
be executed. Mondrian will group all segments that have the same predicates
but a different measure into a segment group. Mondrian will also tend to
remove as many predicates as it can in order to optimize the data payload.
Let's say a segment covers all products except a single one: Mondrian will
still include that product in the segment, but filter it out when a specific
query requires it.
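
As a rough illustration of the grouping step, here is a small sketch. The
SegmentStub class and everything else in it is invented for the example; only
the idea of keying segments by their predicate list comes from the
description above:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: SegmentStub stands in for Mondrian's real segment
    // class; the point is just the grouping by identical predicate lists.
    final class SegmentStub {
        final String measure;
        final List<String> predicates; // e.g. ["State=California", "Gender=Male"]
        SegmentStub(String measure, List<String> predicates) {
            this.measure = measure;
            this.predicates = predicates;
        }
    }

    final class SegmentGrouping {
        // Segments with identical predicates but different measures fall into
        // the same group, so a single SQL statement can populate all of them.
        static Map<List<String>, List<SegmentStub>> group(List<SegmentStub> toLoad) {
            Map<List<String>, List<SegmentStub>> groups =
                new HashMap<List<String>, List<SegmentStub>>();
            for (SegmentStub s : toLoad) {
                List<SegmentStub> group = groups.get(s.predicates);
                if (group == null) {
                    group = new ArrayList<SegmentStub>();
                    groups.put(s.predicates, group);
                }
                group.add(s);
            }
            return groups;
        }
    }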

Once those segments are populated, Mondrian keeps them in a collection of
weak references in local memory. All required segment references are pinned
down while a particular query is being resolved, but as soon as the query is
done executing, the references return to their weak state, ready to be
garbage collected if needed. This simple mechanism allows Mondrian to answer
just about any query, as long as the allocated memory is big enough for that
particular query. In fact this works really well, since in most small
deployments the maximum amount of memory is never reached. And if memory ever
does fill up, old segments are evicted to make room for the new ones.
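
To illustrate the mechanism, here is a minimal sketch of a weak-reference map
with pinning. The class and method names are made up for the example and are
not Mondrian's actual internals:

    import java.lang.ref.WeakReference;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: a map of weak references lets the garbage collector
    // reclaim segments under memory pressure; "pinning" simply means taking a
    // strong reference for the duration of a query so the segment cannot be
    // collected while it is being used.
    final class LocalSegmentCache<K, V> {
        private final Map<K, WeakReference<V>> cache =
            new HashMap<K, WeakReference<V>>();

        synchronized void put(K key, V segment) {
            cache.put(key, new WeakReference<V>(segment));
        }

        // Returns a strong reference (the pin). Hold it while the query runs,
        // then drop it so the segment becomes weakly reachable again.
        synchronized V pin(K key) {
            WeakReference<V> ref = cache.get(key);
            return ref == null ? null : ref.get();
        }
    }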

Now, there are obvious gotchas. First off, what if it takes a long time for
the RDBMS to populate a segment? If that segment ever gets picked up by the
garbage collector, the next MDX query sent to Mondrian *might* take longer to
execute, depending on whether the segment was still in the cache or not. This
is not acceptable, simply because it makes performance predictions
impossible.

This is where the SegmentCache SPI comes in. It is essentially a pluggable
cache for segments. The algorithm behind the segment loader becomes this:


*   Look up segments in the local cache and pin those required.
*   Optimize / group the segments.
*   Look up segments in the SPI cache.
*   Load the segments found in the SPI cache.
*   Populate the remaining unloaded segments from the RDBMS.
*   Put the segments which came from the RDBMS into the SPI cache.
*   Pin all loaded segments.
*   Resolve the query.
*   Unpin all segments in the local cache.

But wait! There is more! The SegmentCache SPI is trivial to implement.


Future<Boolean> contains(SegmentHeader header);
Future<SegmentBody> get(SegmentHeader header);
Future<Boolean> put(SegmentHeader header, SegmentBody body);

There are two assumptions made about the implementation. The first, obvious
one is that the cache must assume that many Mondrian instances might access
it concurrently, from different threads. We therefore recommend using the
Actor pattern, or something similar, to enforce thread safety. The second is
that SegmentCache implementations will be instantiated very often. We
therefore recommend using a facade object which relays calls to the actual
segment cache code.
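
As an illustration of both recommendations, here is a minimal, purely
in-memory sketch. The SegmentHeader and SegmentBody types are the ones from
the SPI; the rest is made up for the example. A single-threaded executor
stands in for the actor, serializing all access to the map, and a real
deployment would make this a facade relaying calls to a shared backing store
rather than a local map:

    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // A minimal sketch, not a reference implementation. SegmentHeader and
    // SegmentBody come from the Mondrian SPI (their import is omitted here).
    // The single-threaded executor plays the role of the actor: every cache
    // operation is funnelled through one thread.
    public class InMemorySegmentCache {
        private final Map<SegmentHeader, SegmentBody> store =
            new ConcurrentHashMap<SegmentHeader, SegmentBody>();
        private final ExecutorService actor = Executors.newSingleThreadExecutor();

        public Future<Boolean> contains(final SegmentHeader header) {
            return actor.submit(new Callable<Boolean>() {
                public Boolean call() { return store.containsKey(header); }
            });
        }

        public Future<SegmentBody> get(final SegmentHeader header) {
            return actor.submit(new Callable<SegmentBody>() {
                public SegmentBody call() { return store.get(header); }
            });
        }

        public Future<Boolean> put(final SegmentHeader header,
                                   final SegmentBody body) {
            return actor.submit(new Callable<Boolean>() {
                public Boolean call() { store.put(header, body); return true; }
            });
        }
    }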

As for the storage of the SegmentHeader and SegmentBody objects, we tried to
make it as simple and flexible as possible. Both objects are fully
serializable and immutable. They are also specially crafted to use dense
arrays of primitive data types, and we tried to make extensive use of Java's
native functions when copying data to and from the cache within Mondrian's
internals.

The bottom line is that, from now on, the Mondrian community will be free to
implement segment caches to fit its needs. We will, of course, be rolling out
a few default implementations and examples. One neat implementation could
page the segments to a super-fast array of SSD drives. Another could store
the segments in Terracotta, Ehcache or Infinispan, or just about any scalable
caching system out there. So if any of you are interested in implementing
this SPI for your business and would like to share your experiences or
contribute those implementations, don't hesitate to contact us. Or me
directly.

There is more goodness to come, but that's it for now. Stay tuned! 


