Performance: Today I played with the ideas I had accumulated over the past two weeks or so (already mentioned earlier) to improve the performance of scanline-based drawing and to implement the "Extra-Alpha" concept with XRender.
The first attempts were quite frustrating, because performance was nowhere near as good as I had hoped.
I ended up somewhere in Xorg's software loops, spending 85% of total time in memcpy - which made me think FillRects to A8 is not accelerated by EXA or the intel driver (that would have pretty much killed my ideas).
However, it turned out to be an Xorg performance bug (at least as far as I can tell): https://bugs.freedesktop.org/show_bug.cgi?id=16600
Good to know about this limitation, maybe it can help improve the existing code a bit - thanks a lot to the guys at #xorg-devel for being so helpful and friendly :) (Still glad that I did not have to fire up GDB^^)
The results were as pleasant as expected. For the micro-benchmark I ran before (filling 6250 spans, 250 at a time) I got:
Rendering to explicit mask using XRenderFillRectangles: ~10ms
Rendering to destination, compositing scanline per scanline: ~50ms
As always when using masks there is a worst case, a 45° line. However, as far as I can guess, the low performance in the scanline-per-scanline case is not caused by limited fill rate but by the per-operation overhead, which seems far lower with FillRectangles. I guess the worse the driver, the more the new approach "benefits" compared to the old one - as long as it does accelerate FillRects to an A8 mask everything should be fine, with the exception of very old pre-3D cards, which are not really able to run EXA anyway.
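To make the per-operation argument a bit more concrete, here is a toy count of requests for the benchmark above. It is only a sketch: it does not talk to an X server, the function names are made up for illustration, and the request counts rest on two assumptions of mine (one composite per span in the scanline-per-scanline case, one FillRectangles call plus one composite per batch in the explicit-mask case). It says nothing about the actual cost of each request.

```python
def batch_spans(spans, batch_size):
    """Split a list of (x, y, width) spans into batches of at most batch_size."""
    return [spans[i:i + batch_size] for i in range(0, len(spans), batch_size)]


def spans_to_rects(batch):
    """Turn each span into a one-pixel-high rectangle (x, y, width, height)."""
    return [(x, y, w, 1) for (x, y, w) in batch]


def count_requests(num_spans, batch_size):
    """Return (per_scanline_requests, explicit_mask_requests) under the stated assumptions."""
    # Hypothetical spans: one 100-pixel-wide span per scanline.
    spans = [(0, y, 100) for y in range(num_spans)]
    batches = batch_spans(spans, batch_size)
    # Assumption: the old approach issues one composite per span.
    per_scanline = num_spans
    # Assumption: the mask approach issues one fill and one composite per batch.
    explicit_mask = 2 * len(batches)
    return per_scanline, explicit_mask


if __name__ == "__main__":
    # 6250 spans, 250 at a time, as in the micro-benchmark.
    print(count_requests(6250, 250))
```

With 6250 spans in batches of 250 that is 25 batches, so the model counts 50 requests against 6250, which is at least consistent with the overhead being per operation rather than per pixel.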
Another advantage of the new explicit-mask approach is that Extra-Alpha can be done easily and is "for free" in the scanline/fillSpans case; in the line/path case there is one additional (usually accelerated) composition step compared to plain XRenderCompositeTrapezoids without extra alpha.
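Why Extra-Alpha is "for free" in the fillSpans case can be shown with a small software model. Again this is just a sketch under assumptions, not the real pipeline: the mask is a plain list of floats in 0.0..1.0 instead of an A8 picture, there is a single colour channel, and fill_mask and composite_over are hypothetical helpers standing in for the fill and composite steps. The point is that the extra alpha is simply the value written into the mask, so the composite step is the same as without it.

```python
def fill_mask(width, height, rects, alpha):
    """Fill rectangles (x, y, w, h) into a zeroed mask, writing `alpha` into each covered pixel."""
    mask = [[0.0] * width for _ in range(height)]
    for (x, y, w, h) in rects:
        for row in range(y, min(y + h, height)):
            for col in range(x, min(x + w, width)):
                mask[row][col] = alpha
    return mask


def composite_over(src_color, src_alpha, mask, dst):
    """Porter-Duff OVER of a solid, non-premultiplied source through the mask onto dst."""
    out = []
    for mask_row, dst_row in zip(mask, dst):
        out_row = []
        for m, d in zip(mask_row, dst_row):
            a = src_alpha * m  # effective alpha: source alpha times mask value
            out_row.append(src_color * a + d * (1.0 - a))
        out.append(out_row)
    return out


if __name__ == "__main__":
    dst = [[0.0, 0.0, 0.0, 0.0]]
    # One span covering pixels 1..2, with an extra alpha of 0.5 baked into the mask.
    mask = fill_mask(4, 1, [(1, 0, 2, 1)], 0.5)
    print(composite_over(1.0, 1.0, mask, dst))
```

Filling the mask with 1.0 instead of 0.5 gives the plain opaque case with exactly the same number of steps.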