The false optimism of GORM and Hibernate

Posted on Nov 12, 2012

There are a bunch of problems with using Hibernate for concurrent access to the same rows in the database.

In Grails applications you could be forgiven for ignoring these, because in general the GORM Hibernate examples and documentation err on the side of simplicity rather than completeness.

In addition, you may think this does not affect you, but in all likelihood that is because you either have an application with no real “edit/update” functionality, you have an insignificant number of users, or you have just been lucky so far.

What causes concurrent DB problems in web apps?

The concurrent DB access problem itself is pretty obvious. The real trouble comes from the asynchronous nature of web applications, multiple concurrent users, the limitations of web UIs, and multiple access methods such as REST APIs running alongside the web UI.

One good example is a web form submission or ajax interaction that is not “de-bounced” – i.e. a confused user may double click an item, or due to UI lag they click the item a second or third time to try to prod it into action. Yes this can often be handled by JS code in the client to block the UI element while the submission is in progress, but it highlights one of the less obvious areas where stale object exceptions can bite you.

Another interesting scenario is mobile devices and syncing. It is wrong to assume users have just one device accessing your service. If any of your clients have a poll mechanism, push via web sockets, or receive platform push notifications, you have same-user concurrency. For example a user’s laptop, iPad and iPhone may all attempt to access your service at the same time to get an update. If this update has side-effects such as storing the date of last access, you could be exploding all over the place – but only for users with this multi-client scenario.

No problem, I’m already checking for changes to the object version

Even if you pre-emptively check for version changes in an object before applying the new values supplied by a user, you can still get a stale object failure when the session flushes. That’s where the luck part comes in.

If there’s one thing we all despise, it’s code that works by chance.

Optimistic Locking is enabled by default in Grails, to prevent user A overwriting changes from user B. In actuality this is very complex to deal with and is arguably pointless in many applications.

What compounds this situation is that Hibernate is trying to prematurely optimise SQL queries by defaulting dynamic updates to OFF. This means that every update to an object updates every column in the row even if you didn’t change it – leading to data loss in concurrent situations if optimistic locking is not used. This is because Hibernate caches the SQL generated for UPDATE statements such that it updates all the columns. One can only imagine how expensive this is in network and DB terms with large objects.

There’s an argument to be had that if you turn dynamic updates ON, so that only modified columns are included in the SQL UPDATE statement then maybe you can live without optimistic locking, provided changes to properties are not interdependent. Otherwise you could end up with inconsistent changes.

However, even with transactions you could still experience data loss if you did this: the object may be loaded in one thread, then modified and flushed in another, and then the original thread flushes. Even if each thread only updates the properties it changed, if both made a decision based on the existing values – e.g. count += 1 – then one of those updates will be lost.
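
To make that race concrete, here is a minimal sketch (hypothetical Book domain with a counter; the two threads’ steps are shown interleaved):

```groovy
def a = Book.get(1)    // thread A reads count == 10 in its own session
def b = Book.get(1)    // thread B reads count == 10 in its own session
a.count += 1           // A computes 11 from its copy
b.count += 1           // B computes 11 from its copy
a.save(flush: true)    // writes count = 11
b.save(flush: true)    // also writes count = 11 – one increment is silently lost
```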

Just change your model stupid!

There are plenty of legitimate use cases for concurrent edits where altering the model design to work around the restrictions of Hibernate optimistic locking compromises the design in unacceptable ways.

Here’s a really simple but surprisingly painful example:

  1. Photo has “name” and “url” properties
  2. A controller action loads object 1 and sets “name” only
  3. A background task loads object 1 and sets “url” only
  4. The controller action flushes
  5. The background task flushes
  6. The background task crashes with a StaleObjectException

This seems rather nonsensical because the two updates did not actually affect each other – yes, in some usage scenarios there might be interdependent values or constraints – but I’ll wager that’s not so common.

Some would say the model should be refactored to avoid this. Do we really think the URL should be pulled out into a separate domain class just so it can be updated concurrently? Taken to its extreme this would imply one table per property and everything would be a join from hell.

One can even imagine the above scenario with two data-entry operators editing different fields of the same record. Like the above it would fail and you’d have to tell the user that there was a change… but a lot of the time they’d scratch their heads because they did not change fields that affect each other.

Optimistic locking fails us in this case – because it is too painful to decide whether or not the user should be told about the problem, and if so what they should be shown. Do you show them both entries side by side and let them cherry pick changes? Forcing people to diff their data? Good luck with that.

Another, more tricky example:

  1. Photo has “name” and “url” properties and a “comments” collection of PhotoComment objects
  2. A controller action loads object 1 and adds a new PhotoComment to the comments collection
  3. A background task loads object 1 and sets “url” only
  4. The controller action flushes
  5. The background task flushes
  6. The background task crashes with a StaleObjectException

Why do we still see a staleness problem here? Because adding items to an association causes the version of the “parent” to change – rightly so.

However, if you step back and look at this, it is a nonsense for most use cases. These are non-colliding edits. If it really mattered to me that name and url somehow correlated, I would write code to check this, or update both at the same time – which forces me to consider and handle this issue – or let last writer win.

The fact is that the optimistic locking approach taken by Hibernate is killing us in these situations. What’s worse, unless you are always explicitly flushing or using transactions, your staleness failure could happen at the implicit session flush after your response view rendering is complete. In which case it is game over: you can’t reasonably catch this and inform the user.

So lesson 1 is: always wrap all potentially concurrent database activity in a transaction (or explicit session flush)

Why not just refresh if it’s found to be stale?

You can try that, but depending on the nature of your changes it could be very complex, as you have to re-apply your changes. If you’re just updating one object and a couple of properties that’s trivial. But what if you are adding 3 associations to a collection, creating some other objects that associate to the object you are editing etc?

Oh, and when you retry it could just happen again. And again. You have to loop until it works or you give up.

Your only real choice in complex cases is to use transactions, and then all your modified objects are likely invalid/detached after a rollback. You have to start from scratch all over again whatever happens.

And yet optimistic locking is apparently there to save us from the performance ills of transactions.

But with optimistic locking I can tell the user/client they have to retry

In situations like this, it would be a nonsensical approach from a UX point of view because the background change did not affect what the user was doing.

Forcing the user to deal with it is just plain wrong if it is not a user-level data conflict.

Imagine if you wanted to edit an unfulfilled Amazon order and because some background job just happened to update the estimated dispatch date, it forced you back onto the edit form for no good reason?

Lesson 2: testing for and reporting staleness is only useful if it materially affects what the user is doing in terms of your business rules

So what can we do?

As you have seen, optimistic locking is a false idol. She has been hiding the true inherent complexities of concurrent DB access from us – all she does is fail fast without providing remedies. In most cases we have failed to handle the staleness information she provides correctly, and as such there has been close to zero benefit to date.

The only benefits have been: (a) thinking we don’t need to worry about concurrency and (b) never losing any data by accident through concurrent writes.

Let’s recap the concurrency options currently available in GORM Hibernate:

  • Turn off optimistic locking, globally or per domain class. Code changes required: none. Outcome: “last writer wins”, and this includes all properties of objects, not just those modified in your current session. This results in data loss – e.g. if a background job changed one property, the concurrent user request would reset that property.
  • Turn on dynamic updates, globally or per domain class. Code changes required: none. Outcome: updates only the dirty properties, removing one of the main causes of data loss when optimistic locking is OFF. Adds an unknown performance cost to create the SQL statements.
  • Pessimistic lock(). Code changes required: add a call to lock() for every domain object that will be modified, even those referenced through associations; every update to domains must be inside a transaction to avoid data loss or incorrect results. Outcome: the SQL will lock the rows so that no other request can lock them for update until the transaction is committed. If you don’t use the static lock(id) to load the object you can still suffer staleness when optimistic locking is ON – calling the instance lock() method can throw, and you have to recover from that. If optimistic locking is OFF you may overwrite others’ edits made since you loaded the object. The transaction isolation level may affect what is read.
  • Use executeUpdate. Code changes required: call executeUpdate with HQL and values, and do not change properties of the objects in the session. Outcome: does not update the version used for optimistic locking by default, but can if you wish – doing so would cause concurrent access to fail. Using this mechanism, you might as well not use an ORM.

Lesson 3 is therefore: how I wish we didn’t even have to think about this

Playing fast and loose. Let the last edit to any given property win

This may be a workable strategy for you if you don’t have much in the way of concurrent editing by end users. It simply means that the last change to a property is what hits the database.

The caveat is that things like updating numerical values can lead to critical data loss where e.g. book.sales += 10 updates using an old value for “sales”. If your updates are simple property replacements without such manipulations, you may be fine.

Let’s work it through in code. Here’s the trivial domain class:
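
Something like this (a minimal sketch – the property names match the SQL output below):

```groovy
class Photo {
    String ref
    String title
    String url
}
```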

So when you have a piece of code like this:
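
For example (illustrative values):

```groovy
def photo = Photo.get(1)
photo.url = 'http://example.com/new.jpg'
// no explicit save/flush – the open session flushes the dirty object for us
```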

… when the session flushes, if you have Hibernate logging on you will see this:
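
Something like this (exact formatting depends on your logging settings):

```sql
update photo set version=?, ref=?, title=?, url=? where id=? and version=?
```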

What?! I only changed url! All of the properties: version, ref, title and url were updated.

This is the “dynamic updates” problem mentioned earlier. In Grails, thankfully, you can easily turn this behaviour off so that the SQL statements are tailored to your dirty properties. I would imagine they could cache these kinds of SQL statements too, hashed on the sorted dirty-property list… but I presume they don’t, or this setting would surely be the default?

Turning this off is done by setting dynamicUpdate true in the mapping block on domain classes, or in the grails.gorm.default.mapping Config variable. See the docs.
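
For example, a minimal sketch of both forms:

```groovy
// Per domain class:
class Photo {
    static mapping = {
        dynamicUpdate true   // only include dirty properties in UPDATE statements
    }
}

// Or globally, in Config.groovy:
grails.gorm.default.mapping = {
    dynamicUpdate true
}
```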

When you do so, the SQL logging shows:
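
Approximately:

```sql
update photo set version=?, url=? where id=? and version=?
```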

That’s certainly better. However it won’t help with concurrent updates to e.g. title because the optimistic locking version is still being checked.

You can turn that off with the GORM mapping option version false which again can be global or domain-class specific.
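
Again this is just a mapping setting:

```groovy
class Photo {
    static mapping = {
        dynamicUpdate true
        version false        // disable the optimistic locking version column
    }
}
```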

With this set, your SQL is:
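
Approximately:

```sql
update photo set url=? where id=?
```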

Now that is much better if you’re trying to achieve some concurrency of non-overlapping edits.

However now we are in a situation where data loss can occur. There may be situations in some applications where you want to prevent this from happening.

I don’t like the sound of that

OK so you do want to defend against concurrent updates in some situations, say two human users editing the same Photo at the same time?

The options currently are:

  1. Use explicit lock() on objects before making changes and re-enable optimistic locking. This means everything you do with that object has to be in a transaction, and your chosen transaction isolation level can affect the outcome. This will ensure that your code’s assumptions are valid for the duration of your transaction, and that no other request can update that row until it completes. You may need to catch staleness exceptions even if you check the version, if you do not lock the object at the point of loading it (see later).
  2. Implement your own “object version” mechanism and update that when it makes sense, and check it when updating data – and include a check for it in your SQL update which means all updates have to use executeUpdate instead of normal Hibernate SQL. Tricky.

Option (1) is at least workable, if unattractive; option (2) makes using an ORM rather pointless.

Using explicit row locks

If we reuse our Photo example, imagine we have a transactional service method to add a comment to a photo, while a background job runs to update the url of the photo:
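
As a sketch (the service and job names here are illustrative):

```groovy
class PhotoService {
    static transactional = true

    // assumes Photo has: static hasMany = [comments: PhotoComment]
    void addComment(Photo photo, String text) {
        photo.addToComments(new PhotoComment(text: text))
        photo.save()
    }

    void uploadPhoto(Long photoId, String url) {
        def photo = Photo.get(photoId)
        photo.url = url
        photo.save()
    }
}

// The background job simply calls the service:
class UpdatePhotoUrlJob {
    def photoService

    def execute() {
        photoService.uploadPhoto(1L, generateNewUrl()) // hypothetical helper
    }
}
```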

In the above case, if we have optimistic locking OFF, and dynamic update ON, things will be ok in the sense that everything that is meant to be written will be. However this may not be what we want in the rest of the app. Let’s say we want to leave dynamicUpdate OFF so that we guarantee consistency of objects that we write and avoid Hibernate calculating SQL statements all the time. If we use pessimistic lock() would we gain anything?
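
For example, adding lock() after the object has been loaded:

```groovy
void uploadPhoto(Long photoId, String url) {
    def photo = Photo.get(photoId)
    // ... any other logic ...
    photo.lock()   // the row is only locked here – it may have changed since get()
    photo.url = url
    photo.save()
}
```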

What have we gained? The answer is not much. Between the get and lock calls on Photo the data could have changed in the DB. So with dynamicUpdate OFF we could lose data that we didn’t change. With optimistic locking ON we could still get stale object exceptions when we call lock().

However if we lock the object on load, we can avoid this:
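
A sketch:

```groovy
void uploadPhoto(Long photoId, String url) {
    def photo = Photo.lock(photoId)  // issues SELECT ... FOR UPDATE at load time
    photo.url = url
    photo.save()
}
```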

Now in this scenario, the uploadPhoto method cannot get a stale object exception. BUT this means that all other code in your application that updates Photo objects must call lock() in this way. Otherwise unlocked updates will still occur, and you will experience data loss because the UPDATE may have been based on stale data that has since changed in the DB. Only explicit lock() will issue a SELECT FOR UPDATE which guarantees serialisation of reads & updates to the row in question.

So how might we write this to handle staleness in the addComment call, given that the photo may not have been locked when first loaded? Here we have been passed a photo that is not yet locked, and the underlying data may have changed by the job since it was read. Therefore when we lock it, it will throw.
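
The exception may surface as Hibernate’s StaleObjectStateException or a Spring-wrapped equivalent depending on how the lock is invoked; the sketch below assumes the former:

```groovy
import org.hibernate.StaleObjectStateException

void addComment(Photo photo, String text) {
    try {
        photo.lock()  // the version is checked as the row lock is taken
    } catch (StaleObjectStateException e) {
        // our copy was stale – reload the latest state, taking the lock this time
        photo = Photo.lock(photo.id)
    }
    photo.addToComments(new PhotoComment(text: text))
    photo.save()
}
```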

This is one way in which you could handle this. Obviously it would be nice to add a meta-method to make this less onerous.

There is another problem here however. The photo passed in to the method does not get refreshed by this approach. The caller would in this case have to perform the lock() on the Photo, which can get tricky if you don’t know until some logic has passed that you must lock the object. The caller or addComment methods could also refresh() on the photo after the changes have been flushed.

The final and most insidious problem here is that the addToComments call modifies a collection. Collections themselves cannot be locked, only individual objects can.

As a result you cannot make the change to the collection safe, and therefore when the collection is flushed you may still lose changes to it that were made concurrently.

That is, unless you use explicit locking and optimistic locking together. Changing a collection changes the version of the object that owns the collection.

Some suggestions about which techniques to use

Let’s be clear: the goal here is to eliminate those pesky StaleObjectExceptions. This means making compromises between data coherence, performance and coding effort.

Here are what I believe are the available trade-offs and solutions.

  • Fast and loose: I want a simple solution and am happy with last-property-wins and possibly inconsistent property values. So I have to: turn off optimistic locking and turn on dynamic update. No code changes. High risk.
  • Coherent: I want coherent property values from the DB, but don’t care about concurrent edits getting lost. So I have to: turn off optimistic locking, use transactions everywhere, and always call Domain.lock(id) to retrieve objects or call object.lock(). The dynamic update setting is purely an optimisation issue.
  • Bullet proof: I want to detect staleness and ensure consistent properties. So I have to: turn on optimistic locking, use transactions everywhere, call Domain.lock(id) to retrieve objects, and handle staleness wherever object.lock() is called. The dynamic update setting is purely an optimisation issue.

The problem with the “Fast and loose” approach is that due to the object and property based nature of the ORM, it is really quite hard to work out if this is a safe enough option for you. Through association dereferencing and Grails Plugins that you use, it can be incredibly hard to know whether or not this could have catastrophic problems for you – such as security plugins losing password updates for user objects.

There are also a few rules that I believe to be correct:

  1. The only foolproof way to avoid StaleObjectException(s) while keeping optimistic locking ON is to do all GORM work in transactions, and always load objects with Domain.lock(id). When using dynamic finders or criteria you will need to specify the “lock” option to have the results pre-locked (see the sketch after this list)
  2. The dynamicUpdate setting is off by default for likely dubious “performance reasons”. The setting does however materially affect data coherence if optimistic locking is also off.
  3. Always use transactions to scope your updates if you have concurrent access – never rely on implicit Hibernate session flushing.
  4. Calling lock() silently fails if you are not in a transaction. I think this should be changed in GORM to throw an exception. It is clearly a problem if you lock() and it does not lock.
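
For the pre-locked loads mentioned in rule 1, that means something like:

```groovy
// Static lock by id – issues SELECT ... FOR UPDATE
def photo = Photo.lock(params.id)

// Dynamic finder with the lock option
def photo2 = Photo.findByTitle('Holiday', [lock: true])
```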

Beware the plugins

After writing all this and starting to use this information to fix up my client’s app so that it handles these issues, I hit upon the realisation that very few plugins – if any – are written in this way. I am pretty certain that I personally have some work to do now to update the Email Confirmation plugin at the very least, and Weceem will definitely need some work.

Thankfully not many plugins out there do updates to GORM domains, but any that do will have to be updated to use locking too. This typically becomes evident when you start running code that uses these plugins in production across multiple JVMs accessing the same data – especially if you haven’t configured Quartz to avoid duplicating jobs across JVMs in the cluster.

Some plugin authors might not want to update their plugins for DB concurrency, but hopefully this article makes it clear that there is no other choice for correctness than to use explicit locking.

Note also that plugins that want to detect staleness with explicit locking will have to forcefully set version true on the domain classes that they supply, because the app developer might be “smart” and turn versioning off by default. If they work with application-supplied domain classes, the developer will likely need to do the same on their own domains to avoid data loss when objects are retrospectively locked. Let’s hope nobody decides to go down this route, however; it would be a bit strange.

The problem is that without having your plugins updated to handle staleness correctly, you will not get rid of all your stale object exceptions.

Can’t we make this better?

I remain convinced that there is scope for improvement here. However it will require some changes to the general object update paradigm, and getting this to work with the results of queries would be a major challenge. It would likely mean a new ORM API.

In my view, knowing what I know now, we should not be using unlocked access and optimistic locking out of the box in Grails. It’s basically a lie.

The patterns should be inverted to be reliable and exception-free by default, and going off-piste for performance reasons should be the less-trodden path. The programming model should reflect this.

Some observations from Core Data APIs in iOS / OS X

Core Data is Apple’s ORM for Objective-C. It shares many concepts with Hibernate and GORM, and its NSManagedObjectContext is roughly equivalent to the Hibernate session.

What is interesting is that you can nest contexts in Core Data, which is in effect like creating a (nested) transaction in memory.

You can then have Core Data merge your nested context back into its parent context – and this has a merge conflict resolution policy that you define.

Now Core Data is a different beast to Hibernate, in that it is not really geared towards multiple processes (i.e. multiple servers) accessing the same shared DB – so it is privileged in that it can track things that are harder to track for our web server grade ORMs. However it is designed to work concurrently on the local machine, and is used like this extensively in apps.

The secret it has is that it knows exactly which properties on which objects were changed, and in which context.

This enables its merge policies to do useful things.

The most important thing here is that merge conflicts are caused by changed properties, not by the version of the object. A property value that differs from the one in another context implies that the object has changed. That is the key difference between it and Hibernate optimistic locking. Core Data calls this mechanism optimistic locking too, and by default NSErrorMergePolicyType is used, which causes concurrent changes to the same object to fail to be saved to the store.

It’s worth looking at the policy type enums to see what kind of thing you can do:

  • NSErrorMergePolicyType – Persisting the context will fail. This is like StaleObjectException, except you get a list of the specific properties in conflict – not just the list of dirty properties that you changed.
  • NSMergeByPropertyStoreTrumpMergePolicyType – data is merged, and if a property is different in both contexts, the one from the store wins
  • NSMergeByPropertyObjectTrumpMergePolicyType – as above but the inverse. Changes in the store will lose out if your context also changed them
  • NSOverwriteMergePolicyType – your changes are forced to overwrite. This is like having no optimistic locking in effect
  • NSRollbackMergePolicyType – your changes are thrown away and the already persisted ones preferred

Already you can see that this is much richer than the current options available in GORM Hibernate.

This policy can only be set per context, but as mentioned you can nest contexts, so effectively this is like setting it per transaction.

What this does is give you the freedom and speed of optimistic locking, and when it comes time to persist stuff to the store and something is stale, it can be automatically resolved for you according to your policy, or you can deal with it manually.

Safe updating with current GORM

For the sake of clarity, here’s what we can do today to ensure correct and robust updates.

These solutions all involve locking at the row level.

You would be well advised to have optimistic locking ON to catch cases in your code where you have forgotten to lock() the object. In this scenario, the code that forgot to lock() the object will get a stale object exception when it tries to flush, so you can find that place and make it lock the object.

When you know you are going to update and can lock in advance

It sounds silly, but you won’t necessarily know this in a complex application. There may be some code path that causes a property to change after you originally loaded the object somewhere else in the processing chain.

Hopefully you do know, because this is the easiest case:
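
A sketch, assuming a transactional service method:

```groovy
class PhotoService {
    static transactional = true

    void rename(Long photoId, String newTitle) {
        def photo = Photo.lock(photoId)  // row locked before any changes are made
        photo.title = newTitle
        photo.save()
    }
}
```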

This code does not have to handle staleness because you lock the object before you make any changes to it. If you need to warn the user that a concurrent change occurred, you can do so but the only safe way to do this is to include the previous version in the data they submitted.

Any other code attempting to update the same row will block until this transaction completes. Any non-locked update attempts will result in stale object errors if you have optimistic locking ON. As mentioned above, this is a useful thing.

When you don’t know if you are going to update objects…

In this case you need to load the object and perform some logic based on its properties, and only then do you know that actually you need to change some properties too.

The approach you take in this scenario is largely dependent on whether you want to detect staleness or not for user experience reasons.

If you want to know if the object you are retrospectively locking is actually stale, use the following:
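
Roughly this, again assuming the staleness surfaces as StaleObjectStateException:

```groovy
import org.hibernate.StaleObjectStateException

def photo = Photo.get(photoId)
// ... logic runs, and only now do we discover we must update the photo ...
try {
    photo.lock()          // throws if our in-memory copy is stale
    photo.title = newTitle
    photo.save()
} catch (StaleObjectStateException e) {
    // staleness detected – re-run your business rules or inform the user
}
```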

On the other hand, if you don’t care about staleness but do want to use the latest data, you can treat it much like the case where you knew you would lock it in advance:
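
For example:

```groovy
def photo = Photo.get(photoId)
// ... logic runs ...
photo = Photo.lock(photo.id)  // discard our copy; take the latest row, locked
photo.title = newTitle
photo.save()
```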

This obviously gets more complicated for collections. I’ll leave that as an exercise for the reader.

A glimpse of the future?

The kind of merging/locking behaviour you want depends on what is currently going on. What is the intention of the code, what was the user (if any) trying to achieve?

One approach would be to mimic Core Data, and always “lock” objects in hindsight rather than in advance, and use helper methods to deal with staleness (by merging usually) when locking. This could yield much better DB performance because you are only locking rows at the end of your transaction. One can imagine this is why Apple chose this mechanism for Core Data. Take for example:
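
Something like the following, which is purely hypothetical API – none of these methods exist in GORM today:

```groovy
photo.lockAndMergeMine()      // lock the row; where properties conflict, my values win
photo.lockAndMergeTheirs()    // lock the row; where properties conflict, stored values win
photo.lockAndMerge { conflicts ->
    // lock the row and resolve each conflicting property by hand
}
```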

These give the caller control over how conflicting property changes should be resolved, and failure to use any of the lock methods would result in a stale object exception. Unfortunately you would only find these at runtime.

Developers currently have to handle this themselves if the row cannot be locked and is now stale, so we need to make this easier. These methods could be monkey-patched in or provided by a plugin. Here’s a gist containing simple code to do this.
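
The general idea is something like this rough sketch (the lockFresh helper name is illustrative):

```groovy
import org.hibernate.StaleObjectStateException

// Lock the instance, falling back to a fresh, locked reload if it is stale.
def lockFresh(domainInstance) {
    try {
        domainInstance.lock()
        return domainInstance
    } catch (StaleObjectStateException e) {
        // note: getClass() assumes a plain domain instance, not a proxy
        return domainInstance.getClass().lock(domainInstance.id)
    }
}

// Usage:
photo = lockFresh(photo)
photo.title = 'New title'
photo.save()
```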

This retrospective locking is not a “normal” approach. Pre-emptive locking is arguably easier, but then the objects are locked for the duration of your transaction. With retrospective locking you could lock them just at the end of the transaction – provided you know which objects they are! If your transaction includes expensive tasks like sending mails and querying other services, this could have a major effect on your concurrency.

This lock and merge could be totally automated, much like Core Data, so that there is a per-object or per-transaction conflict resolution strategy that you set – which is really just setting the relative importance of the updates you have just performed. When it comes time to flush, all the modified rows are locked and merged according to the policy. One may rightly start to worry here about the ordering of locks and deadlocks.

I also wonder if Grails should remove the implicit session in controller actions, so that one has to be created with a withUpdateTransaction call. This would also solve the problems that “Open Session In View” causes, where the non-explicit commit/flush produces uncatchable validation and staleness exceptions. People do not clearly understand this mechanism and its drawbacks, and having GORM operations fail unless the correct wrapper method is called or annotation added would help.

Using a non-transactional session should require a call to withoutTransaction or similar.

If not using the smart merging mooted above, updates to any domain object properties within a withUpdateTransaction should raise an exception unless lock() has already been called, or a new beginUnsafeUpdate() has been called on the object to bypass this protection. Or, more controversially, have code in the withUpdateTransaction block automatically lock all get() and query results.

These could be reflected in new annotations for service methods also, to make clear the intent and to fail fast on incompatible requirements e.g. a “withoutTransaction” controller action would not be able to call a @TransactionalForUpdate service method.

I think this would make the task of identifying exactly what is happening in code that accesses the ORM much easier, and would default to correct first, while allowing a switch to highest performance if you know you can live without transactions:
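
A sketch of how that might look (withUpdateTransaction is hypothetical, proposed API):

```groovy
def update() {
    withUpdateTransaction {
        def photo = Photo.get(params.id)
        photo.title = params.title
        photo.save()
    }
}
```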

Safe out of the box. And if I want to be unsafe:
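
Again hypothetical:

```groovy
def update() {
    withoutTransaction {
        def photo = Photo.get(params.id)
        photo.title = params.title
        photo.save()
    }
}
```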

Simple. Notice the only change is in the method call; the code is identical. I don’t think this could be monkey-patched in, however, due to the integration required with criteria, dynamic finders, where queries etc., which would all need the “lock” parameter set implicitly.

I think this would go some way to alleviating some of the problems we have integrating smarter / hybrid update models. Ultimately, however, a whole raft of other problems exists relating to when errors will occur and how to handle them all correctly… but here the answer is simple and unequivocal:

Explicit transactions demarcate your updates and ensure you know if flush fails. Never, ever rely on the Open Session In View filter to flush for you.

Problem is, this is not the default pattern shown in Grails examples.

Note: associations will still present problems. Arguably Hibernate needs a version column for each collection in the object.

Can I go home now?

It has been a horrible few days crawling through the issues here, writing and rewriting this post several times, and trying to get to the bottom of what the best practice should be for Grails apps and plugins.

I think I have managed to establish this, but it is not great news. A lot of work needs to be done.

I’d like to thank the Grails Core team and especially Burt Beckwith for their advice and patience, it was a big help.