For large shapes it does improve things quite a bit (on my Core2Duo notebook):

Hey, in one case jules now even beats ductus :)
However, it reminded me again how hard it is to write well-performing multi-threaded code in the real world.
The outcome is a 2-thread producer/consumer implementation, where the consumer can produce for itself if it has caught up with the producer.
The first implementation (multi-unopt) simply fetched idle tiles from one pool and added rasterized tiles to another, with a volatile variable for communication. That makes 4 uncontended monitorenter/exit + volatile read/write.
With that approach the synchronization overhead ate up almost all of the speedup, and in some cases I even saw ugly regressions.
I now batch the fetches from and stores to the idle/completed tile lists in order to minimize synchronization costs, and I let the worker start a bit ahead of the consumer.
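To illustrate the batching idea, here is a minimal sketch (not the actual jules code). The class name BatchedTileList, its methods and the use of plain Integer tile ids are all made up for this example. The point is only that the worker collects tiles in a thread-local list and touches the shared list once per batch, so the monitor is entered once per 16 tiles instead of once per tile.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a shared tile list that is only touched in batches. */
final class BatchedTileList {
    private final ArrayDeque<Integer> shared = new ArrayDeque<>(); // guarded by "this"
    private int lockAcquisitions = 0; // guarded by "this"; illustration only

    /** Publishes a whole batch under a single monitor enter/exit. */
    public synchronized void putAll(List<Integer> batch) {
        lockAcquisitions++;
        shared.addAll(batch);
    }

    /** Moves up to max tiles into dest under a single monitor enter/exit. */
    public synchronized int takeUpTo(int max, List<Integer> dest) {
        lockAcquisitions++;
        int n = 0;
        while (n < max && !shared.isEmpty()) {
            dest.add(shared.poll());
            n++;
        }
        return n;
    }

    /** Number of batched operations so far (reading it is not counted). */
    public synchronized int lockAcquisitions() {
        return lockAcquisitions;
    }

    public static void main(String[] args) throws InterruptedException {
        final int tiles = 64;
        final int batchSize = 16; // assumed batch size, would need tuning
        BatchedTileList completed = new BatchedTileList();

        Thread worker = new Thread(() -> {
            List<Integer> local = new ArrayList<>(batchSize);
            for (int id = 0; id < tiles; id++) {
                local.add(id); // stands in for rasterizing a tile, no lock held
                if (local.size() == batchSize) {
                    completed.putAll(local); // one lock round trip per batch
                    local.clear();
                }
            }
            if (!local.isEmpty()) {
                completed.putAll(local); // flush a partial last batch
            }
        });
        worker.start();

        List<Integer> got = new ArrayList<>();
        while (got.size() < tiles) {
            if (completed.takeUpTo(batchSize, got) == 0) {
                Thread.yield(); // nothing ready yet, let the worker run
            }
        }
        worker.join();
        System.out.println("consumed " + got.size() + " tiles");
    }
}
```

The consumer here just yields when nothing is ready. In the scheme described above it would rasterize a tile itself at that point instead of waiting, which this sketch leaves out.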
I am still not sure whether the whole effort makes much sense, since it makes the whole thing quite complex.
After all, it was fun ;)
I wonder how much speedup can be achieved by using profile-driven optimizations and -O3 when compiling pixman and jules. I guess around 10-20%, which could be enough to push threaded jules in front of ductus.

