Today I replaced lock/getrasinfo/unlock with a single call to XPutImage (no SHM support for now), to get rid of some unnecessary overhead in MaskFill.
To my surprise the LineAnim demo got even worse on Xorg-1.5: according to top, Xorg was using ~50% and the Java process 150% of my CPUs. That reminded me that I had seen similar behaviour with the X11 pipeline when Java2Demo's delay was set to 0. After setting the delay to 1ms, the demo jumped from 150fps to 250fps, and Java was now using 90% CPU, roughly in line with Xorg.
I don't know exactly what the cause is, but I guess it's some locking problem in Java2Demo or in Java2D itself.
I wrote a small benchmark that draws 10000 antialiased lines with a width of 100 and a y1/y2 difference between 0 and 100. Here are the results:
X11/Xorg-1.3/XAA : 1050ms
XR/Xorg-1.3/EXA : 1250ms
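For reference, a benchmark of this kind could look roughly like the sketch below. This is not the code I actually ran: the class name AALineBench, the drawLines helper, the 800x600 surface and the random seed are all made up for illustration. It also renders into a BufferedImage so that it runs anywhere, which only exercises the software loops. To measure the X11 or XR pipeline you would draw to a VolatileImage or an on-screen window instead, and call Toolkit.getDefaultToolkit().sync() before stopping the timer.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.util.Random;

public class AALineBench {
    // Draws `count` antialiased lines with the given stroke width onto `g`
    // and returns the elapsed wall-clock time in milliseconds.
    // The y1/y2 difference of each line is between 0 and 100.
    static long drawLines(Graphics2D g, int count, float strokeWidth, int w, int h) {
        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                           RenderingHints.VALUE_ANTIALIAS_ON);
        g.setStroke(new BasicStroke(strokeWidth));
        g.setColor(Color.BLUE);
        Random rnd = new Random(42); // fixed seed (arbitrary) for repeatable runs
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            int x1 = rnd.nextInt(w);
            int x2 = rnd.nextInt(w);
            int y1 = rnd.nextInt(h);
            int y2 = y1 + rnd.nextInt(101); // difference in 0..100
            g.drawLine(x1, y1, x2, y2);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Assumption: an offscreen ARGB image stands in for the real surface.
        BufferedImage img = new BufferedImage(800, 600, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        long ms = drawLines(g, 10000, 100f, 800, 600);
        g.dispose();
        System.out.println("10000 AA lines: " + ms + "ms");
    }
}
```

Running it with -Dsun.java2d.xrender=true (on builds that have the XRender pipeline) or the default X11 pipeline only makes a difference once the target is an accelerated surface.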
The X11 pipeline performs quite well on XAA, most likely because it uses SHM (maybe even SHM pixmaps). Xorg-1.5 is slower than Xorg-1.3, most likely due to a performance bug I've already reported in a different context.
On Xorg-1.5 the profile currently looks like this:
72584 9.7587 Xorg Xorg dixLookupPrivate
51372 6.9068 libdcpr.so libdcpr.so writeAlpha8NZ
47983 6.4511 libdcpr.so libdcpr.so processSubBufferInTile
39541 5.3161 libc-2.8.90.so libc-2.8.90.so memcpy
32230 4.3332 libmawt.so libmawt.so prepareMaskPM
31531 4.2392 intel_drv.so intel_drv.so i915_prepare_composite
So once the dixLookupPrivate issue is resolved, I guess performance will be better than on Xorg-1.3 :)
I am quite curious how much SHM can help here.
Well, of course using SHM images for MaskFill was a stupid idea, because it forces MaskFill to sync with the server every time. However, I still found a small optimization that avoids copying data around unnecessarily.
I guess a somewhat larger Xlib buffer size would help quite a bit here, but as far as I know it's currently hardcoded to 4kb and can't be changed (for xlib-xcb).