Today I further investigated the performance problems when Gradients are used. Although its not really the the pipeline's fault something had to be done - even Metal/Ocean spent significant time drawing gradients.
The problem is how EXA does handle fallbacks:
For EXA the only concern is wether something can be accessed by the CPU or GPU, so if it has to draw a gradient which is pinned to sysmem (because gradients for now are not accalerated) it has to access the destination and mask surface with the CPU, because composition is done in software.
The problem is that for now the only way to make the surface CPU-addressable is to copy the content from vram to sysram - so mask and destination contents are copied from vram to sysram and composition is done by the CPU.
Now imagine what happens if directly after that e.g. fillRect is called. EXA realizes that this operation can be done by the GPU, but the image content is now in sysram. So it copies mask/dest pixmap to vram again, and the fills the single rect.
This made the X-Server spend more that 50% of its cycles in memcpy, most of the time reading back contents from vram.
The workarround for this problem was to allocate a 256x256 gradient-buffer-tile. So if a gradient would be used for composition it is drawn into this tile and then the tile is used instead of the "real" gradient as source.
The trick is that the tile-pixmap is never altered by the GPU, so the only copying is sysram->vram, once when the composition happens.
This workarround makes Netbeans-main-window-resizing a lot faster and also helps Nimbus which uses gradients even more. The drawback is once gradients will be accalerated, we will do a useless composition step - and the memory wasted for the buffer-tile.
Here are some benchmarks before and after that change:
BezierAnim: ~55fps -> 355-600fps
GradAnin: ~30fps -> 180fps