Friday, 24 October 2008

Precompiled packages available

I've built precompiled packages and revised the project pages a bit.
Both are available for x86:

I would be really happy about feedback, especially on how it works on different hardware.
There's quite a lot of hardware still untested (see "Driver Status"), so if you have an untested card with recent drivers it would be great if you could give it a try and report back whether it worked and which bugs you ran across.
Feedback via the mailing list is preferred, or simply leave a comment here.

Wednesday, 22 October 2008

TransformedBlit rewrite (and new benchmarks ;) )

TransformedBlit rewrite:
TransformedBlit was one of the ugliest parts of the pipeline and hurt its performance, so I finally sat down, tinkered for a few days and rewrote that stuff.
It's far from perfect and the coding style is even uglier than before, but image interpolation is now done the same way as in Java's software pipelines, and performance improved, especially for non-accelerated composition.

I did this before the pure-Java rewrite to be able to test against different drivers, so that by the time the rewritten pipeline is done the drivers are in good shape.
I'll soon start to distribute pre-built OpenJDK packages containing the pipeline; it would be really great if there were interest in testing it.

Swing Benchmarks:

I recently compiled xorg-server-1.6/intel from git. Intel seems to be fighting some performance regressions caused by their GEM rewrite (I only tested on a non-GEM system); with some workloads I saw halved throughput. However, there have been enhancements in the X server that I was interested in:

"Nimbus - Metal" -> "Lightbeam - Metal"

The good news is that the XRender pipeline (XR) beats the X11 pipeline for this specific workload in every case, on every acceleration architecture (EXA/XAA) on my system.
The bad news is that the pipeline is still faster on XAA than on EXA.
This could have to do with the performance regressions in the intel driver mentioned before, but I still doubt that EXA will beat XAA even once they are fixed.

My theory why XAA is faster than Software is that my dual-core CPU is able to benefit from the additional asynchronicity brought by the client-server design of X.
While the X client (Java) runs Nimbus/pipeline code on one core, the X server can run its software rendering loops on the other core.
The Software-Pipe is synchronous, however: it waits until each software-rendering operation has finished.
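The overlap described above can be sketched in plain Java. This is only a toy model under stated assumptions: the queue stands in for the X socket, the second thread for the X server, and the class and method names are invented for illustration. It shows the structure (the client keeps issuing requests while another thread renders), not a measurement.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class AsyncRenderSketch {
    private static final int STOP = -1; // hypothetical end-of-stream marker

    /** Sends `requests` fake render requests and returns how many the "server" processed. */
    static int run(int requests) {
        BlockingQueue<Integer> socket = new ArrayBlockingQueue<>(64);
        int[] rendered = {0};

        // Plays the role of the X server: renders on its own thread (ideally its own core).
        Thread server = new Thread(() -> {
            try {
                for (int req = socket.take(); req != STOP; req = socket.take()) {
                    rendered[0]++; // stand-in for a software rendering loop
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();

        try {
            for (int i = 0; i < requests; i++) {
                socket.put(i); // the client only blocks when the queue is full
                // ...and is free to prepare the next primitive right away
            }
            socket.put(STOP);
            server.join(); // join() makes the server's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rendered[0];
    }

    public static void main(String[] args) {
        System.out.println("rendered " + run(1000) + " requests");
    }
}
```

A synchronous software pipe corresponds to doing the rendering work inline in the client loop, so the second core never gets anything to do.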

After all, once the GEM rework is done, I want EXA to beat XAA's result. I guess this means filing many performance bugs, like this one:

Tuesday, 14 October 2008

Cool XCB improvements

Recently there's a lot of interesting work going on in XCB, which is used indirectly by the XRender pipeline via Xlib.

1. Output Buffer enlarged:
In one of my last posts I complained that the output buffer was limited to 4 KB; now the default has been raised to 16 KB. Although I had hoped for an API-adjustable buffer size, that's way better than 4 KB, and it's also adjustable at compile time. So chances are good that desktop distributions will set it to ~64 KB to squeeze out some additional performance for their users.

2. Socket-Hand-Off:
Besides improving performance a bit, this will allow applications to write directly to xcb's socket instead of having to go through xcb's protocol generator (as far as I've understood).

This would really be *extremely* helpful for the pipeline rewrite.
I would like to write the pipeline in pure Java, but instead of going through JNI for each primitive I would prefer to buffer them, very much like the D3D/OGL pipelines do.
The new pipelines, however, have to generate an opcode stream which is later re-interpreted on the native side (something like switch(getNextOpcode(buffer))) - and without the socket hand-off mechanism we would have to do exactly the same.
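To make the opcode-stream approach concrete, here is a minimal sketch. The opcodes, class name and method names are invented for illustration and are not the ones the D3D/OGL pipelines actually use, and the "native side" is simulated in Java so the example runs on its own.

```java
import java.nio.ByteBuffer;

final class OpcodeStreamSketch {
    // Hypothetical opcodes, made up for this sketch.
    static final int OP_FILL_RECT = 1;
    static final int OP_DRAW_LINE = 2;

    // A real pipeline would flush when the buffer runs full; omitted here.
    private final ByteBuffer buf = ByteBuffer.allocate(4096);

    // Java side: primitives are only recorded, no JNI call per primitive.
    void fillRect(int x, int y, int w, int h) {
        buf.putInt(OP_FILL_RECT).putInt(x).putInt(y).putInt(w).putInt(h);
    }

    void drawLine(int x1, int y1, int x2, int y2) {
        buf.putInt(OP_DRAW_LINE).putInt(x1).putInt(y1).putInt(x2).putInt(y2);
    }

    /** Stand-in for the native side: re-interprets the stream, returns the number of ops. */
    int flush(StringBuilder log) {
        buf.flip();
        int count = 0;
        while (buf.hasRemaining()) {
            switch (buf.getInt()) { // the switch(getNextOpcode(buffer)) step
                case OP_FILL_RECT:
                    log.append(String.format("fillRect(%d,%d,%d,%d)\n",
                            buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt()));
                    break;
                case OP_DRAW_LINE:
                    log.append(String.format("drawLine(%d,%d,%d,%d)\n",
                            buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt()));
                    break;
                default:
                    throw new IllegalStateException("unknown opcode");
            }
            count++;
        }
        buf.clear();
        return count;
    }

    public static void main(String[] args) {
        OpcodeStreamSketch s = new OpcodeStreamSketch();
        s.fillRect(0, 0, 10, 10);
        s.drawLine(1, 2, 3, 4);
        StringBuilder log = new StringBuilder();
        s.flush(log);
        System.out.print(log);
    }
}
```

The cost this design pays is visible in flush(): every primitive is encoded once on the Java side and decoded again before the real request can be generated.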

With the socket hand-off stuff in place, it should be possible to write X11 protocol directly to a large (NIO?) buffer, and when it's time we flush that data directly through xcb's socket, which means:
- No per-primitive JNI cost
- No additional command-stream generation / interpretation
- The buffer is ours, so we can size the buffer ;) (really?)
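As a rough illustration of what "writing X11 protocol directly" could look like, the sketch below encodes a core-protocol PolyFillRectangle request (opcode 70) with one rectangle into a direct NIO buffer. The class name is invented, the drawable and GC IDs are placeholders passed in by the caller, and the actual hand-off to xcb's socket is not shown; drain() merely returns the bytes that would be written there.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class X11RequestBufferSketch {
    static final int X_POLY_FILL_RECTANGLE = 70; // core protocol major opcode

    private final ByteBuffer buf;

    // The byte order must match the one announced at connection setup.
    X11RequestBufferSketch(int capacity, ByteOrder order) {
        buf = ByteBuffer.allocateDirect(capacity).order(order);
    }

    /** Appends one PolyFillRectangle request carrying a single rectangle (20 bytes). */
    void fillRect(int drawable, int gc, int x, int y, int width, int height) {
        buf.put((byte) X_POLY_FILL_RECTANGLE);
        buf.put((byte) 0);          // unused
        buf.putShort((short) 5);    // request length in 4-byte units: 3 + 2 * 1 rectangle
        buf.putInt(drawable);       // DRAWABLE id
        buf.putInt(gc);             // GCONTEXT id
        buf.putShort((short) x);
        buf.putShort((short) y);
        buf.putShort((short) width);
        buf.putShort((short) height);
    }

    /** Returns the buffered bytes and resets the buffer; a real pipeline would write them to the socket. */
    byte[] drain() {
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        buf.clear();
        return out;
    }

    public static void main(String[] args) {
        X11RequestBufferSketch p = new X11RequestBufferSketch(64 * 1024, ByteOrder.nativeOrder());
        p.fillRect(0x00400001, 0x00400002, 1, 2, 3, 4); // placeholder IDs
        System.out.println(p.drain().length + " bytes buffered");
    }
}
```

Everything that makes this hard in practice is left out: sequence numbers, replies and errors, and the resource IDs themselves.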

One thing I am not sure about is how IDs, such as the IDs of server-side resources, could be kept in sync with xcb. Hopefully there's a solution.

Well, for now ... university has me again *argh!*, and there are a few things I would like to fix in the old pipeline - in order to test driver compatibility and submit bug reports - before I can start working on that stuff.

Saturday, 11 October 2008

NVidia 178.80 :)

Finally the new NVidia drivers (178.80) went gold, and they are impressive :)
They perform extremely well over a broad range of RENDER operations and they don't seem to have any performance weaknesses that the pipeline hits (except subpixel AA text on GF7 and below).
One bug which affects TexturePaints shows up in some Java2Demo demos, but I guess this small problem should be fixed soon.

Furthermore, the git version of the intel driver now achieves 960,000 glyphs/s on my 945GM, even without GEM, so text should no longer be a bottleneck for the Swing benchmarks. Oddly, I am experiencing other slowdowns with this driver and I was not able to locate their cause - they are not caused by fallbacks, and a lot of time is spent in the kernel. Maybe sysprof can help here...

Prebuilt packages with the pipeline should be available soon, I am just waiting to get my vServer account activated.
Currently I am struggling with the transformed blit rewrite ... hmm ... anybody out there with a good background in affine transformations? ;)

Saturday, 4 October 2008

Driver bugs and software fallbacks

A bug I reported to the nvidia and intel driver developers some time ago seems to be harder to fix than I thought.
For composition where the source is read outside of its surface bounds, the result should be transparent (not touching the background), but instead it's black for SRC when the source is an RGB24 picture:

In the short term both the intel and the nvidia driver will fall back to software in this case, because the hardware does not seem to support it directly. I hope workarounds will be implemented soon (in theory the driver would only have to allocate an ARGB32 picture, blit to that and use that one instead - maybe it could be done even more efficiently with shaders). In the meantime I am fighting with building xorg from git ;)

Thanks a lot to Carl Worth for looking deeper into this issue! At least i965 and higher support the required behaviour and Carl fixed the driver; I filed a bug about it for 915 and lower. There are workarounds which would solve it on that hardware too (if there really is no support for it), however they are more than a 1:1 mapping and only apply in special cases. Using a temporary ARGB32 surface (maybe with tiling for very large source pictures) is somewhat inefficient but should solve all cases.
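The temporary-ARGB32 idea can be illustrated with a few lines of plain Java working on pixel arrays. This is only a model of the intended semantics, not driver code: the class and method names are made up, and real drivers would of course do the copy on the GPU.

```java
final class Argb32FallbackSketch {
    /** Copies RGB24 (xRGB) pixels into an ARGB32 buffer, setting alpha to fully opaque. */
    static int[] toArgb32(int[] rgb24) {
        int[] argb = new int[rgb24.length];
        for (int i = 0; i < rgb24.length; i++) {
            argb[i] = 0xFF000000 | (rgb24[i] & 0x00FFFFFF);
        }
        return argb;
    }

    /** Reads one source pixel; outside the surface bounds the result is transparent, not black. */
    static int sample(int[] argb, int width, int height, int x, int y) {
        if (x < 0 || y < 0 || x >= width || y >= height) {
            return 0x00000000; // alpha 0: fully transparent
        }
        return argb[y * width + x];
    }

    public static void main(String[] args) {
        int[] src = toArgb32(new int[] {0x123456, 0xABCDEF}); // 2x1 source picture
        System.out.println(Integer.toHexString(sample(src, 2, 1, 0, 0))); // inside the bounds
        System.out.println(Integer.toHexString(sample(src, 2, 1, 2, 0))); // outside the bounds
    }
}
```

With an explicit alpha channel, "inside the picture" (alpha 0xFF) and "outside the picture" (alpha 0) become distinguishable, which is exactly what the RGB24 path loses.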

JVM improvements:
A lot of interesting stuff is happening in the JVM space:

- The new G1 "Garbage First" garbage collector has been open-sourced
- IBM's Java got a cache for JITed and AOT code, I hope HotSpot will soon have a comparable feature too.
- The 64-bit client JVM is almost ready
- Tiered compilers are working
- CompressedOops reduce overhead on 64-bit platforms when the heap is < 32 GB
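Where the 32 GB limit for CompressedOops comes from can be shown with a line of arithmetic, assuming HotSpot's default 8-byte object alignment: a 32-bit reference that is scaled by the alignment can address 2^32 * 8 bytes. The class below is just that calculation, with invented names.

```java
final class CompressedOopsSketch {
    /** Bytes addressable by a 32-bit reference that is scaled by the object alignment. */
    static long addressableBytes(int alignmentBytes) {
        return (1L << 32) * alignmentBytes;
    }

    public static void main(String[] args) {
        // Assumes the default 8-byte object alignment.
        System.out.println(addressableBytes(8) / (1L << 30) + " GB");
    }
}
```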

No idea what will be ready for JDK7, but this will definitely be an interesting release again :)