Monday, September 22, 2008

Mask upload performance

Antialiased rendering is currently done by uploading mask tiles to the X server, followed by a composite operation with that mask.
Besides the fact that performance is not very good compared to the D3D pipeline, I saw an awfully high context-switch rate when running J2DBench demos like lineanim (30,000/s), which is ... well ... not pretty.

* The problem is that xlib/xcb's buffer is only 4 KB in size (it was 16 KB by default in the "old" xlib implementation), and an AA tile is between 0 and 1 KB large, so after maybe 6-8 tiles the command buffer is flushed, which results in a context switch. The ugly detail here is that it's not possible to adjust the buffer size at runtime, not even before startup (which was possible with the old xlib) or at compile time. Hopefully this will change.
* Another performance limiter is that the mask data has to be copied through the command buffer, over unix domain sockets, to the X server.
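
To get a feeling for these numbers, here is a rough Python model of the flush behaviour described above. It is only a sketch: the tile size of 600 bytes, the tile count and the name `count_flushes` are made up for illustration, and request headers are ignored.

```python
def count_flushes(tile_sizes, buffer_size=4096):
    """Count how often a fixed-size command buffer has to be flushed."""
    flushes = 0
    used = 0  # bytes currently queued in the buffer
    for size in tile_sizes:
        # Flush first if the next tile would not fit any more.
        if used and used + size > buffer_size:
            flushes += 1
            used = 0
        used += size
    if used:
        flushes += 1  # final flush for whatever is still queued
    return flushes


# Hypothetical workload: 700 tiles of 600 bytes each.
tiles = [600] * 700
flushes_4k = count_flushes(tiles, 4096)    # 4 KB buffer
flushes_16k = count_flushes(tiles, 16384)  # 16 KB buffer of the "old" xlib
print(flushes_4k, flushes_16k)
```

With 600-byte tiles the 4 KB buffer holds 6 tiles per flush, while a 16 KB buffer would hold 27, so the same workload needs roughly a quarter of the flushes.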

Xorg supports the Shm extension; however, earlier benchmarks I did show the penalty of having to wait until the X server has copied the data before the shared memory region can be used again.
The X11 pipeline also does Shm transfers only if the amount of data to be transferred is >= 64 KB; otherwise it's not worth the additional round-trip.
One round-trip per tile is of course far worse than one flush for every ~7 tiles.

The solution could be to use more than one shared memory segment and only sync once all of them have been used. I did some benchmarks and the results look promising.
Uploading a 32x32x8 mask 10,000 times and doing a composite operation with it takes:

So when using 4 shm masks and syncing after those have been used, performance is the same as with the traditional mask-upload path. One mask consumes about 1 KB for the pixmap and 1 KB for the shared memory area, plus all the overhead associated with it, so preallocating 32 or even 64 masks should still be no problem.
Allocating one large pixmap, and maybe also one large shared memory area, may reduce the overhead.
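
The multi-segment idea can be sketched as a small Python simulation. This is not the benchmark code, just a model under assumptions: `ShmMaskRing`, `syncs_for` and `footprint_bytes` are hypothetical names, `_sync` stands in for a blocking round-trip (something like XSync), and the footprint only counts the pixmap plus the shared memory area per 32x32x8 mask, not the overhead around them.

```python
class ShmMaskRing:
    """Round-robin pool of mask segments that syncs only when all are in use."""

    def __init__(self, n_segments):
        self.n_segments = n_segments
        self.next_index = 0   # segment handed out next
        self.in_flight = 0    # segments the server may still be reading
        self.sync_count = 0   # blocking round-trips so far

    def _sync(self):
        # Stand-in for a blocking round-trip; afterwards every segment is free.
        self.sync_count += 1
        self.in_flight = 0

    def acquire(self):
        """Return the index of a segment that is safe to overwrite."""
        if self.in_flight == self.n_segments:
            self._sync()
        index = self.next_index
        self.next_index = (self.next_index + 1) % self.n_segments
        self.in_flight += 1
        return index


def syncs_for(uploads, n_segments):
    """Number of round-trips needed for a given number of mask uploads."""
    ring = ShmMaskRing(n_segments)
    for _ in range(uploads):
        ring.acquire()
    return ring.sync_count


def footprint_bytes(n_masks, width=32, height=32):
    """Pixmap plus shared memory area, one byte per pixel each."""
    return n_masks * 2 * width * height


for n in (1, 4, 32):
    print(n, syncs_for(10000, n), footprint_bytes(n))
```

In this model 10,000 uploads need 9,999 round-trips with a single segment, 2,499 with four and 312 with 32, while 32 masks cost 64 KB and 64 masks 128 KB before overhead.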

The cool thing is that this code still forces round-trips by syncing; Xlib, however, provides an event-based system that notifies the client when an image transfer has completed, which should make the shared-memory approach even faster.
So for now I see a 2x improvement for the upload path. This will surely speed up antialiased rendering quite a bit, and I am quite curious how much.
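
Here is a rough sketch of how the event-based variant could work, again as a Python model rather than real Xlib code. In Xlib the notification would presumably be the XShmCompletionEvent that XShmPutImage can request through its send_event argument; `EventDrivenPool` and `on_completion` are hypothetical stand-ins, and completions are assumed to arrive in submission order.

```python
from collections import deque


class EventDrivenPool:
    """Segments become reusable as completion events arrive, without a full sync."""

    def __init__(self, n_segments):
        self.free = deque(range(n_segments))  # segments safe to overwrite
        self.pending = deque()                # segments the server still reads
        self.blocking_waits = 0               # times the client had to stall

    def on_completion(self):
        """Model the arrival of a completion event for the oldest transfer."""
        if self.pending:
            self.free.append(self.pending.popleft())

    def acquire(self):
        """Return a free segment, blocking only if none is available."""
        if not self.free:
            # Nothing free: wait for the oldest pending transfer to finish.
            self.blocking_waits += 1
            self.on_completion()
        segment = self.free.popleft()
        self.pending.append(segment)
        return segment


# Server keeps up: an event arrives after every upload, so the client never stalls.
fast = EventDrivenPool(4)
for _ in range(100):
    fast.acquire()
    fast.on_completion()

# Server never reports back in time: every upload after the first two stalls.
slow = EventDrivenPool(2)
for _ in range(10):
    slow.acquire()

print(fast.blocking_waits, slow.blocking_waits)
```

The point of the design is that the client only waits when it really runs out of segments, instead of paying a round-trip every time the ring wraps around.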

In the benchmark above I only tested Xorg-1.3/XAA. I repeated the tests with Xorg-1.5 and EXA:

Xorg-1.3/NVIDIA: 80ms (tested on my old 2.6 GHz P4 notebook)
Xorg-1.3/XAA: 85ms
Xorg-1.3/EXA: 1000ms
Xorg-1.5/EXA: 250ms

So EXA seems to struggle a lot with that kind of workload :-/
Although a lot of time is spent in dixLookupPrivate, it only adds up to 20% of total runtime. I definitely need to build Xorg-Master and see how that performs.

NVIDIA does quite well, although this was an old legacy release. Looking at oprofile, it seems to be done in hardware; almost no time is spent in libfb :)
Also, the old laptop still used traditional Xlib, and a slightly disappointing result is that using SHM or not does not seem to make a lot of difference (>10%) there - so maybe using SHM in this case is just working around Xlib/XCB's problems.


Dmitri said…

that looks pretty good!


Linuxhippy said…

Great to hear from you, Dmitri :)

Well, unfortunately this was with XAA; EXA itself does not perform that well.

However, the Xorg devs seem to be quite responsive to test cases where XAA clearly outperforms EXA, so maybe that can be resolved soon.