Tuesday, 30 September 2008

OpenJDK Challenge Results

Finally the waiting is over and the Challenge winners have been announced. Congratulations to all winners!
The projects are really cool and I am quite curious to see how they will evolve.
Of course I was quite surprised that the XRender project won the Gold Medal :)

Thanks a lot to Dmitri Trembovetski, who supported the project from the very beginning by offering to act as contact point, who was always patient and very helpful ... and even had a minute or two to talk about non-technical stuff :)
Thanks of course also to the 2d-dev team who, especially in the beginning, tolerated all my never-ending newbie questions, and to everyone involved in the Challenge.
And last but of course not least, thanks to Sun for open-sourcing Java and for sponsoring the Challenge.

Of course I'll continue to work on the project, and this is what I hope will happen in the short term:
  1. Build binaries so that it's easy to test it without spending a day (or two^^) compiling and patching OpenJDK. If you intend to use ordinary Swing apps with the Ocean LnF, chances are good that you will see really good performance even with the current drivers :)
  2. Implement some outstanding optimizations and fix remaining bugs.
  3. Test on Xorg git to see if showstoppers and performance problems are present, test on as much hardware as possible and report all bugs. The goal would be to have infrastructure available which is able to run the XRender pipeline really well.

After so much excitement it's time for a cold beer and Star Trek ;)

Wednesday, 24 September 2008

NVidia Driver Bug

Finally I was able to hunt down the bug that caused ugly artifacts with the new NVidia beta drivers, things like this:

It seems to be a race condition in the driver. Suppose we do something like this (this exact sequence would actually work, because we would use another, optimized code path which doesn't trigger the bug):
g.setColor(yellow); g.fillRect(); g.setColor(red); g.fillRect();
It seems the driver doesn't make sure the setColor has really finished before it starts with the fillRect, so the second rect may come out yellow instead of red.
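To make the expected result concrete, here is a minimal sketch of that sequence rendered into an off-screen BufferedImage. This goes through Java2D's software loops, not the XRender pipeline, so it does not reproduce the driver bug; it only shows what the two rectangles should look like. The class name and the rectangle coordinates are made up for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class FillRectOrder {
    // Draws a yellow rectangle on the left and a red one on the right.
    // With correct ordering the right half must be red, never yellow.
    static BufferedImage render() {
        BufferedImage img = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(Color.YELLOW);
            g.fillRect(0, 0, 20, 20);
            g.setColor(Color.RED);
            g.fillRect(20, 0, 20, 20);
        } finally {
            g.dispose();
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = render();
        // Sample one pixel from each rectangle and print it as ARGB hex.
        System.out.printf("left=%08x right=%08x%n", img.getRGB(5, 5), img.getRGB(25, 5));
    }
}
```

With the buggy driver, the symptom described above would correspond to the right-hand sample also reading back as yellow.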


Hopefully all that bug-hunting will soon lead to good drivers ;)


There were some doubts whether Xorg's software implementation is correct. As it differs from what the Intel driver currently does, I filed a bug some time ago. Luckily the Intel driver will follow the software implementation; the Radeon driver already does, as far as I know.
This change means that we can implement transformed blits a bit more efficiently if no filtering is requested.
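On the Java side, a transformed blit without filtering looks roughly like the sketch below. I am assuming here that "no filtering" corresponds to the nearest-neighbor interpolation hint; the class and method names are hypothetical, and the code runs on Java2D's software loops rather than on the XRender pipeline.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

class NearestNeighborBlit {
    // Scales src by a factor of two in both directions without any filtering,
    // so every source pixel simply becomes a 2x2 block in the result.
    static BufferedImage scale2x(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth() * 2, src.getHeight() * 2, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
            g.drawImage(src, AffineTransform.getScaleInstance(2, 2), null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_INT_ARGB);
        src.setRGB(0, 0, 0xFFFF0000); // opaque red
        src.setRGB(1, 0, 0xFF0000FF); // opaque blue
        BufferedImage dst = scale2x(src);
        System.out.printf("%08x %08x%n", dst.getRGB(0, 0), dst.getRGB(2, 0));
    }
}
```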


Java Gears

To be able to compare the XRender pipeline's performance with other libraries like Qt or cairo, I ported qgears2 (a port of the original cairo-gears program to Qt4) to Java:

Sure, it's nothing to rely on, but it is a nice demo and gives at least some indication of where we are when it comes to shape rendering.

qgears2:
EXA: 32/85 (No AA / AA)
XAA: 120/100

Java Gears:
EXA: 220/60
XAA: 200/82

So for aliased rendering, Java running the XRender pipeline is quite a good deal faster, but we are behind when it comes to antialiased rendering.
I guess a large share of the cycles goes to xlib/xcb: we hit 17500 context switches per second. I filed a bug about the problem discussed recently: https://bugs.freedesktop.org/show_bug.cgi?id=17735
This was on Xorg-, so the EXA results are influenced by some performance problems that version has.

I'll have to ask zack; if he agrees, the source will be available soon.

Monday, 22 September 2008

Mask upload performance

Antialiased rendering is currently done by uploading mask tiles to the XServer, followed by a composite operation with that mask.
Besides the fact that performance is not very good compared to the D3D pipeline, I saw an awfully high context switch rate running J2DBench demos like lineanim (30,000/s) which is ... well ... not pretty.

* The problem is that xlib/xcb's buffer is only 4kb in size (it was 16kb by default in the "old" xlib implementation), and an AA tile is between 0 and 1kb large, so after maybe 6-8 tiles the command buffer is flushed, which results in a context switch. The ugly detail here is that it's not possible to adjust the buffer size at runtime, not even before startup (which was possible with old xlib) or at compile time; hopefully this will change.
* Another performance limiter is that the mask data has to be copied through the command buffer, over Unix domain sockets, to the XServer.

Xorg supports the Shm extension; however, earlier benchmarks I did showed the penalty of having to wait until the XServer has copied the data before the shared memory region can be used again.
The X11 pipeline also does Shm transfers only if the amount of data to be transferred is >=64kb, otherwise it's not worth the additional round trip.
One round trip for 1 tile is way worse than one flush for every ~7 tiles of course.
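The arithmetic behind those numbers can be sketched in a few lines. The 4kb and 16kb buffer sizes and the 64kb Shm threshold come from the text above; the 600-byte average tile size is purely my assumption (the text only says tiles are between 0 and 1kb), request headers are ignored, and all names are hypothetical.

```java
class UploadPathEstimate {
    // Threshold from the text: the X11 pipeline uses Shm only for >= 64kb.
    static final int SHM_THRESHOLD_BYTES = 64 * 1024;

    // Rough number of mask tiles that fit into the command buffer before
    // it has to be flushed (each flush costs a context switch).
    static int tilesPerFlush(int bufferBytes, int tileBytes) {
        if (bufferBytes <= 0 || tileBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return Math.max(1, bufferBytes / tileBytes);
    }

    // Decides between a Shm transfer and the plain command-buffer path.
    static boolean useShm(int payloadBytes) {
        return payloadBytes >= SHM_THRESHOLD_BYTES;
    }

    public static void main(String[] args) {
        int assumedTileBytes = 600; // assumption, not a measured value
        System.out.println("4kb buffer:  " + tilesPerFlush(4 * 1024, assumedTileBytes) + " tiles per flush");
        System.out.println("16kb buffer: " + tilesPerFlush(16 * 1024, assumedTileBytes) + " tiles per flush");
        System.out.println("1kb tile via Shm? " + useShm(1024));
    }
}
```

Under that assumption a 4kb buffer holds 6 tiles and a 16kb buffer 27, which is why the smaller buffer hurts so much here, and a single 1kb tile stays far below the Shm threshold.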

The solution could be to use more than one shared memory segment and only sync when all of them have been used. I did some benchmarks and the results look promising.
Uploading a 32x32x8 mask 10,000 times and doing a composite operation with it takes:

So when using 4 shm masks and syncing after those have been used, performance is the same as when using the traditional mask-upload path. 1 mask consumes about 1kb for the pixmap and 1kb for the shared memory area, plus all the overhead associated with it, so it should still be no problem to preallocate 32 or even 64 masks.
Allocating one large pixmap, and maybe also one large shared memory area, may reduce the overhead.
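The idea of cycling through several segments and syncing only when all are in flight can be modelled without touching X at all. The sketch below just counts the forced round trips; it is a toy model with hypothetical names, not the code used in the benchmark, and the sync() method merely stands in for a real round trip such as XSync.

```java
class ShmMaskPool {
    private final int segmentCount;
    private int inFlight = 0; // segments handed out since the last sync
    private int syncs = 0;    // number of forced round trips so far

    ShmMaskPool(int segmentCount) {
        if (segmentCount < 1) {
            throw new IllegalArgumentException("need at least one segment");
        }
        this.segmentCount = segmentCount;
    }

    // Returns the index of the next usable segment. If every segment is
    // still in flight, sync once, which makes all of them reusable again.
    int acquire() {
        if (inFlight == segmentCount) {
            sync();
        }
        return inFlight++;
    }

    // Stand-in for the round trip to the XServer.
    private void sync() {
        syncs++;
        inFlight = 0;
    }

    int syncCount() {
        return syncs;
    }

    // Counts the round trips needed for a given number of mask uploads.
    static int syncsFor(int uploads, int segments) {
        ShmMaskPool pool = new ShmMaskPool(segments);
        for (int i = 0; i < uploads; i++) {
            pool.acquire();
        }
        return pool.syncCount();
    }

    public static void main(String[] args) {
        System.out.println("1 segment:  " + syncsFor(10000, 1) + " syncs");
        System.out.println("4 segments: " + syncsFor(10000, 4) + " syncs");
    }
}
```

In this model 10,000 uploads need 9999 syncs with a single segment but only 2499 with four, so each additional segment directly cuts the number of round trips.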

The cool thing is that this code still forces round trips by syncing; however, Xlib provides an event-based system which notifies the client when the image transfer has completed, which should make the shared-memory approach even faster.
So for now I see a 2x improvement for the upload path. For sure this will speed up antialiased rendering quite a bit, and I am quite curious how much.

In the benchmark above I only tested Xorg-1.3/XAA, so I repeated the tests with Xorg-1.5 and EXA:

Xorg-1.3/NVIDIA: 80ms (tested on my old 2.6GHz P4 notebook)
Xorg-1.3/XAA: 85ms
Xorg-1.3/EXA: 1000ms
Xorg-1.5/EXA: 250ms

So EXA seems to struggle a lot with that kind of workload :-/
Although a lot of time is spent in dixLookupPrivate, it only adds up to 20% of total runtime. I definitely need to build Xorg-Master and see how that performs.

NVidia does quite well, although this was an old legacy release. Looking at oprofile it seems to be done in hardware, almost no time is spent in libfb :)
Also, the old laptop still used traditional Xlib, and a slightly disappointing result is that using SHM or not does not seem to make a lot of difference (>10%) there, so maybe using SHM in this case is just working around Xlib/XCB's problems.

Tuesday, 9 September 2008

Laptop repair

Soon my laptop, a Toshiba Tecra A8, will be on its 4th "vacation".
I tend to do all my work on a laptop, so I bought a model which was praised for its mechanical stability and durability. I guess those beasts simply are not built for my type of "workload", so like all its predecessors it needs frequent repairs.
I'm glad that the 2-to-4-year warranty extension was so cheap and that support is quite ok.

OpenSolaris support will have to wait until I live in Vienna again, I simply can't get it working. I also plan to create some pre-compiled binaries, so people can test the pipeline without compiling OpenJDK themselves.
I would also like my Deflater/Inflater improvements to get into OpenJDK, but I guess I have to get used to the code again and write some tests to verify it.

Monday, 1 September 2008

NVidia Binary Driver 177.70

NVidia has released a few small bugfix releases following 177.67, and I gave the 177.70 beta release a try.
Their proprietary driver is especially interesting because it does not depend on EXA (which still has some performance bugs in xorg-server-1.5) and all the memory manager stuff is already in place.
Most Java2Demo tests are accelerated really well now; however, the driver seems to struggle with text, so Nimbus performance is worse than with the Nouveau driver.
Another weird thing is a software fallback hit in Java2Demo when using TexturePaint, which I was not able to reproduce with J2DBench at all.

More about it here: http://www.nvnews.net/vbulletin/showthread.php?t=118801

Despite the problems reported, NVidia is definitely doing well.
Hopefully the problems will be fixed in the next few releases, widening the range of hardware the XRender pipeline can run well on.