[Mondrian] Mondrian cache sharing - Hacks and Proper Solutions (tm)
jhyde at pentaho.com
Thu Aug 30 20:29:19 EDT 2012
On Aug 27, 2012, at 10:25 AM, Pedro Alves <pedro.alves at webdetails.pt> wrote:
> I know that "mondrian" and "cache" are 2 words that when used together
> make everyone's eyes roll. We've been talking about it for so long, and
> it's still an issue.
> In the end, mondrian users are the ones that mostly suffer about it.
> Paul and I (I as in the webdetails team that actually does the work, I
> just write emails) went ahead and "fixed" it by turning the cache key
> into an SPI:
As I promised, I talked to the BI Server guys about this.
They are committing to move BI Server to olap4j by the Sugar timeframe (currently April 2013 -- and everyone knows this date will probably move). That will improve a lot of things. It's not as early as we'd all like, but it's a start.
It's important to me that connection factories (the means by which Mondrian gets JDBC connections to the underlying databases, which include instances of javax.sql.DataSource, or (URL, username) credentials) can be represented as strings. It was a mistake to allow javax.sql.DataSource objects to be passed into Mondrian when creating a connection via the legacy API. olap4j made it more difficult to pass in non-Strings, and that made life painful for some people. I thought it would be possible to just register DataSources in JNDI and pass in the JNDI name, but as Marc pointed out, Pentaho has to run in containers (such as Tomcat) with read-only JNDI environments.
Mondrian already has a DataSourceResolver SPI. This is important, and this works. The one thing it doesn't do is tell Mondrian whether two data sources point to the same database.
Consider setting up a distributed cache. It's important that all of the participating instances of Mondrian know that they are looking at the same database instance. If they don't know it's the same database, they can't safely share their cache. If we used an SPI to determine equality, it would be difficult to ensure that the same SPI implementation is being used on all machines. When I'm answering a support call, it's easy to forget to ask whether someone has overridden the default implementation of the SPI.
So, how to tell whether two connection factories are the same, without introducing an SPI? We introduce a new connect string parameter, JdbcConnectionUuid. (This complements the existing parameters Jdbc, JdbcUser, JdbcPassword and DataSource.) If two Mondrian connections have the same JdbcConnectionUuid, Mondrian will take the client at its word that the back-end databases are identical. It will not consider the other parameters in determining equality.
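To make that concrete, here is a minimal sketch of how a client might assemble a connect string carrying the proposed parameter. JdbcConnectionUuid is only a proposal at this point, so nothing here is an existing Mondrian API: the ConnectStringSketch class, the catalog path and the values passed in are all placeholders for illustration. The sketch also does no quoting, so it assumes no value contains a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration; not part of Mondrian. */
final class ConnectStringSketch {

    /** Joins name=value pairs with ';' in insertion order (no quoting). */
    static String build(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner(";");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    /** Builds an example connect string; the catalog path is a placeholder. */
    static String example(String jdbcUrl, String connectionUuid) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("Provider", "mondrian");
        params.put("Jdbc", jdbcUrl);
        params.put("Catalog", "/path/to/schema.xml");
        // Proposed parameter: two connections with the same value would be
        // treated as pointing at the same back-end database.
        params.put("JdbcConnectionUuid", connectionUuid);
        return build(params);
    }
}
```

Two servers that pass the same JdbcConnectionUuid value would be asserting that they talk to the same database, even if their Jdbc URLs are spelled differently.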
Determining whether two schemas are equal, and therefore candidates for sharing a cache, comes down to two parts: Are the connection factories equal (using JdbcConnectionUuid etc. as described above)? And are the contents of the XML schema files equal (using UseContentChecksum, Catalog, CatalogContent, DynamicSchemaProcessor, as today)? Both of these questions are answered by looking at a string.
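The two-part test can be sketched as a key made of two strings. This is an illustration only, not Mondrian's internal schema-pool key: the SharingKeySketch class is hypothetical, and it assumes the schema side is reduced to an MD5 checksum of the XML content (roughly what UseContentChecksum implies), ignoring the other catalog parameters.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

/** Hypothetical cache-sharing key for illustration; not a Mondrian class. */
final class SharingKeySketch {
    private final String connectionKey; // e.g. the JdbcConnectionUuid value
    private final String schemaKey;     // checksum of the schema XML content

    SharingKeySketch(String connectionKey, String schemaXml) {
        this.connectionKey = Objects.requireNonNull(connectionKey);
        this.schemaKey = md5Hex(Objects.requireNonNull(schemaXml));
    }

    /** Returns the MD5 digest of the text as lowercase hex. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Treat a JDK without MD5 as a configuration error.
            throw new IllegalStateException(e);
        }
    }

    /** Equal only if both the connection key and the schema checksum match. */
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SharingKeySketch)) return false;
        SharingKeySketch that = (SharingKeySketch) o;
        return connectionKey.equals(that.connectionKey)
            && schemaKey.equals(that.schemaKey);
    }

    @Override public int hashCode() {
        return Objects.hash(connectionKey, schemaKey);
    }
}
```

Because both halves are plain strings, any two Mondrian instances can compare keys without sharing code or object identity.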
JdbcConnectionUuid is optional in the connection parameters. If not specified, Mondrian would use the same connection factory matching rules as today. (Internally, Mondrian will generate a Uuid so that all connections have one.)
As its name suggests, it's a good idea if JdbcConnectionUuid is a UUID. But it doesn't need to be. It could be an MD5 hash. It could be anything the user likes. They should just make damn sure that it is unique.
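One way a deployment could get the same identifier on every machine is to derive it from values all nodes already agree on. The sketch below assumes, purely for illustration, that the JDBC URL plus user name is enough to identify a database in a given deployment; that is a decision for whoever configures the servers, not something Mondrian would check. ConnectionIdSketch is a hypothetical helper; UUID.nameUUIDFromBytes is the standard JDK method for MD5-based (type 3) UUIDs.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper for illustration; not part of Mondrian. */
final class ConnectionIdSketch {

    /**
     * Derives a name-based (type 3, MD5) UUID from the JDBC URL and user.
     * The same inputs yield the same UUID string on every machine.
     */
    static String deriveId(String jdbcUrl, String jdbcUser) {
        // The newline separator keeps ("ab", "c") distinct from ("a", "bc").
        String name = jdbcUrl + "\n" + jdbcUser;
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8))
            .toString();
    }
}
```

The result would be passed as the JdbcConnectionUuid value. A randomly generated UUID stored alongside the connection credentials works just as well, provided every node reads the same stored value.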
In conclusion: I am going to reject the SPI you have implemented, even on the master branch (3.4). Sorry! I believe that JdbcConnectionUuid is the right solution, for both the short and the long term, so let's start using it as soon as possible. If someone implements it as I have described above, I will accept the patch into both the master and lagunitas branches.
When we implement http://jira.pentaho.com/browse/MONDRIAN-1177, we will provide a means to define the UUID alongside the connection credentials.