The other day I saw that Sun released JDK 1.6 Update 14, which includes version 14 of their HotSpot Virtual Machine. I think this is the first time the release notes have included a section mentioning that the VM now accepts a DoEscapeAnalysis flag. Escape analysis has been a much-talked-about optimization technique, and while it's been around for a while, this is the first time I've seen Sun officially document it.
My coworker then pointed me to an article that questioned whether escape analysis actually did anything (bing "did escape analysis escape Java"). The article is quite informative, but I was skeptical of the benchmark test and its conclusions. I mean, there are really smart people at Sun (not just guys like me with blogs who try to sound smart), and I found it hard to believe that -XX:+DoEscapeAnalysis did nothing. In fairness, there's a second part to the blog, also quite informative, that digs a little deeper into the benchmarks and clarifies some things.
Playing around a bit (with a Java scientific benchmark) I think -XX:+DoEscapeAnalysis does offer a performance gain, but it's not as much as some of us had hoped (some of us are always hoping the next "breakthrough" will solve all our problems). Also, escape analysis seems less predictable than the other concurrency optimization techniques. One run saw little gain, the next saw more substantial gains; whereas biased locking and lock coarsening predictably, consistently saw big performance improvements. Escape analysis just doesn't seem to kick in sometimes. This is all guessing on my part; I have no idea how these techniques are implemented at a low level.
At this point some might be asking: what are escape analysis, biased locking, and lock coarsening, anyway? I wrote a little bit about biased locking here last year. But now that I have OmniGraffle Pro I'll explain things better with diagrams.
Let's start with lock coarsening. You enable lock coarsening with -XX:+EliminateLocks. This does not mean locks are completely removed/eliminated. Rather, instead of a thread repeatedly acquiring and releasing the same lock, the VM will combine these calls.
In figure 1 you can see two critical sections guarded by the same lock. Lock coarsening will merge the two critical sections, along with the non-critical section between them, into one larger synchronized block. Instead of two acquire/release pairs there's only one. Moreover, with a larger block of code to work with, other optimizations can be better applied. While this reduces locking overhead, it should be noted that larger critical sections may make the application less responsive. Some may notice the irony, given that we often try to break up large synchronized blocks.
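To make figure 1 concrete, here's a minimal sketch (my own example, not from Sun's docs) of the kind of code lock coarsening targets. The "before" method is what you'd write; the "after" method shows what the JIT may effectively produce when it merges the two critical sections.

```java
public class CoarseningExample {
    static final Object lock = new Object();
    static int a, b;

    // As written: two adjacent critical sections on the same lock,
    // separated by a short non-critical section.
    static void beforeCoarsening() {
        synchronized (lock) { a++; }        // acquire/release #1
        int local = a * 2;                  // non-critical section
        synchronized (lock) { b += local; } // acquire/release #2
    }

    // What coarsening may effectively produce: one merged critical
    // section with a single acquire/release pair.
    static void afterCoarsening() {
        synchronized (lock) {
            a++;
            int local = a * 2;
            b += local;
        }
    }

    public static void main(String[] args) {
        beforeCoarsening();
        afterCoarsening();
        System.out.println(a + "," + b); // prints 2,6
    }
}
```

Both methods compute the same thing; the coarsened version just pays the locking cost once, which is exactly why the transformation is safe but can keep the lock held longer.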
Up next is biased locking. Biased locking stems from the real-world observation that most locks are uncontended and at most one thread ever tries to acquire them. Here's how two threads locking two different objects work without biased locking.
As each thread enters its critical section it tries to acquire the (different, uncontended) lock. This involves expensive operations: operating system mutexes, condition variables, compare-and-swaps, etc. Given the real-world observation above, we can optimize by biasing objects toward threads based on some heuristic (e.g., which thread created the object). The owning thread can then lock/unlock without the expensive operations.
The downside is that if another thread comes along and attempts to acquire the same lock, the bias has to be revoked. If I remember correctly, the VM does some sort of bulk rebiasing to amortize the overhead.
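Here's a small sketch (my own example) of the pattern biased locking is aimed at: a synchronized object that only one thread ever touches. Vector's methods are synchronized, but if the instance never leaves its creating thread, every one of those lock acquisitions is uncontended, and (assuming -XX:+UseBiasedLocking is in effect) the lock can stay biased toward that thread.

```java
import java.util.Vector;

public class BiasedLockingExample {
    // Vector.add() is synchronized, but this instance is only ever used
    // by the thread that created it -- the lock is never contended, so
    // it can remain biased toward that one thread.
    public static int sumOwnedByOneThread(int n) {
        Vector<Integer> owned = new Vector<>();
        for (int i = 0; i < n; i++) {
            owned.add(i); // synchronized, but always uncontended
        }
        int sum = 0;
        for (int v : owned) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOwnedByOneThread(100)); // prints 4950
    }
}
```

The moment a second thread tried to lock that Vector, the bias would have to be revoked, which is the cost the heuristics try to avoid paying too often.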
Finally, escape analysis. First, the virtual machine analyzes execution to see which objects escape into the unknown and which are confined to a thread or stack frame. Objects that are confined can be optimized, and the VM performs two kinds of optimization on them. First, since the objects are confined to a single thread, it is safe to remove locking (no other thread can possibly come along and use the confined objects). The fancy term is lock elision. The most basic example is a local StringBuffer, but there are actually some really interesting examples that aren't so obvious, i.e., examples that can't be optimized statically at compile time.
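The basic StringBuffer case looks something like this (my own minimal example). Every append() is synchronized, but the buffer never escapes the method, so with escape analysis the VM can prove no other thread can ever see it and elide the locking entirely.

```java
public class LockElisionExample {
    // StringBuffer.append() is synchronized, but sb is confined to this
    // stack frame -- it never escapes -- so with -XX:+DoEscapeAnalysis
    // the VM is free to elide the lock operations.
    public static String greet(String name) {
        StringBuffer sb = new StringBuffer();
        sb.append("Hello, ");
        sb.append(name);
        return sb.toString(); // only the immutable String escapes
    }

    public static void main(String[] args) {
        System.out.println(greet("world")); // prints Hello, world
    }
}
```

Note that only the String produced by toString() leaves the method; the StringBuffer itself, and therefore its lock, stays private to the frame.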
The second optimization that can be done after escape analysis is allocation optimization. In Java, when you do a new FooBar() the object is allocated on the heap. Heap allocation is actually really fast, contrary to what many might say, but then you have to deal with garbage collection (the de-allocation part). There's also an interesting issue with latency from cache misses, due to the fact that memory is hierarchical. (Google Brian Goetz; he explains it better.)
However, since the VM knows these objects won't escape, it can allocate them on the stack or keep them in registers, known as stack allocation and scalar replacement, respectively. For example, if FooBar were a simple class containing just a single int, the VM could avoid the allocation entirely and place the int field in a register. This avoids allocation, garbage collection, and cache misses.
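Here's what that FooBar case might look like in code (my own sketch; FooBar is a hypothetical class, and whether the JIT actually scalar-replaces it depends on the VM and its heuristics). The wrapper object never escapes the loop, so the VM may never allocate it at all.

```java
public class ScalarReplacementExample {
    // A trivial wrapper around a single int -- the classic candidate
    // for scalar replacement.
    static class FooBar {
        final int value;
        FooBar(int value) { this.value = value; }
    }

    // FooBar never escapes sum(): with escape analysis the JIT may skip
    // the heap allocation and keep 'value' in a register instead.
    public static int sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            FooBar f = new FooBar(i); // candidate: no allocation, no GC
            total += f.value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(10)); // prints 45
    }
}
```

The result is identical either way; the optimization only changes where (or whether) the short-lived FooBar objects physically live.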
Just as with lock coarsening, escape analysis leads to further optimization opportunities.
But as I've said before, my dream is to one day never have to deal with locks. I'm still waiting for transactional memory, which might arrive soon in the form of hardware transactional memory.