TransformedBlit was one of the ugliest parts of the pipeline and hurt its performance, so I finally sat down, tinkered for a few days and rewrote that stuff.
It's far from perfect and the coding style is even uglier than before, but image interpolation is now done the same way Java's software pipelines do it, and performance improved, especially for non-accelerated composition.
I did this before the pure-Java rewrite so I could test against different drivers; that way the drivers should be in good shape by the time the rewritten pipeline is done.
I'll soon start distributing pre-built OpenJDK packages containing the pipeline. It would be really great if there were interest in testing it.
I recently compiled xorg-server-1.6/intel from Git. Intel seems to be fighting some performance regressions caused by their GEM rewrite (I only tested on a non-GEM system), and with some workloads I saw throughput halved. However, there have been enhancements in the X server I was interested in:
The good news is that the XRender pipeline (XR) beats the X11 pipeline for this specific workload in every case, on every acceleration architecture (EXA/XAA) on my system.
The bad news is that the pipeline is still faster on XAA than on EXA.
This could have to do with the performance regressions in the intel-driver mentioned before, but I still doubt that EXA will beat XAA even once they are fixed.
My theory for why XAA is faster than Software is that my dual-core CPU benefits from the additional asynchronicity brought by the client-server design of X.
While the X-client (java) is able to run nimbus/pipeline code on one CPU, the xserver can run its software rendering loops on the other core.
The Software-Pipe is synchronous however, waiting until a software-rendering operation has finished.
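To make that theory a bit more concrete, here is a rough Java sketch of the two situations. It is only a toy model and not pipeline code: prepare() and render() are made-up stand-ins for the client-side work and the software rendering loop, and the bounded queue is just my assumed substitute for the X protocol buffer between client and server.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncPipelineSketch {
    // Hypothetical stand-in for client-side work (preparing one render operation).
    public static int prepare(int i) {
        return i * 31 + 7;
    }

    // Hypothetical stand-in for a software rendering loop.
    public static long render(int op) {
        long acc = 0;
        for (int k = 0; k < 1000; k++) {
            acc += (op ^ k) & 0xff;
        }
        return acc;
    }

    // Synchronous case: the caller waits for every rendering operation to finish.
    public static long runSynchronous(int ops) {
        long total = 0;
        for (int i = 0; i < ops; i++) {
            total += render(prepare(i));
        }
        return total;
    }

    // Client-server case: the "client" queues operations while a second
    // thread (the "server") renders them, so two cores can work at once.
    public static long runClientServer(int ops) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(64);
        long[] total = new long[1];
        Thread server = new Thread(() -> {
            try {
                for (int i = 0; i < ops; i++) {
                    total[0] += render(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();
        try {
            for (int i = 0; i < ops; i++) {
                queue.put(prepare(i)); // blocks only when the buffer is full
            }
            server.join(); // join() makes the server's result visible here
        } catch (InterruptedException e) {
            server.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total[0];
    }

    public static void main(String[] args) {
        int ops = 200_000;
        long t0 = System.nanoTime();
        long a = runSynchronous(ops);
        long t1 = System.nanoTime();
        long b = runClientServer(ops);
        long t2 = System.nanoTime();
        System.out.println("same result: " + (a == b));
        System.out.println("synchronous ms:   " + (t1 - t0) / 1_000_000);
        System.out.println("client-server ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both variants do exactly the same work and return the same total; the only difference is whether the caller has to wait for each operation. Whether the second one actually comes out ahead depends on the machine and on how the work splits between the two sides, so don't read the timings as a real benchmark.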
In the end, once the GEM rework is done, I want EXA to beat XAA's result. I guess this means filing many performance bugs, like this one: https://bugs.freedesktop.org/show_bug.cgi?id=18075