Monday, 22 December 2008

Resistance against open standards (almost everywhere)

Disclaimer: This entry is full of half-facts and undirected sweeping swipes at everything and everybody ... all of this is only my personal opinion.
Furthermore I am ill and have a fever ... take everything with a grain of salt, please.

As you may have heard, JavaFX is bundled with ON2 video codecs. I have no idea who decided, or why, to get this proprietary stuff into JavaFX. Even if ON2 is better than Vorbis - why not give users/developers the choice of what to use?
After all, it's just another proprietary codec - they could have used Vorbis-based stuff without breaking any compatibility, as far as I know.
It's like integrating just another binary plug, one that cannot simply be replaced by a community-developed open-source version due to all the patent stuff. They did this after all the trouble they had with the binary plugs when OpenJDK was opened. Maybe the company behind ON2 decided to pay money, or to share the revenue generated by the success of their codecs once boosted by JavaFX. Who knows.
I really like JavaFX, but looking at the video codec decision, I still doubt large companies are able to fully understand open source. It isn't just a magical buzzword to increase your revenue.

Another example is Nokia - everything is well in their open-source universe as long as you don't dig deeper.
The bug with the most votes has been there since the 770 was released: adding support for Ogg Vorbis audio playback to the device by default.
The N800 supports all the proprietary stuff like WMA, RealAudio, MP3, AAC ... but when MANY users ask for Ogg the company is silent.
So what do users of Nokia's internet tablet get? They'll get support for Silverlight, I bet the stuff developed by Novell. Nobody opened a bug report asking for Silverlight support, but many asked for Java support on the maemo mailing lists.
Well, of course Novell's pushing of .NET into Linux doesn't have anything to do with their Microsoft deals.
And some formerly important developer now anxiously awaits the advent of "his" technology conquering the Linux world by pushing it into Gnome. I doubt he will be celebrated, even if that succeeds.

There seems to be huge resistance against open standards; everybody likes to squeeze money out of users by not letting them decide and forcing them into a format trap. And no, installing a WMA encoder with Windows by default is not choice in my eyes.
I am not against proprietary code, nor am I against large companies. I just would like to have the choice, and choice implies open standards.

By the way ... Merry Christmas :)

Tuesday, 16 December 2008

Almost Pure Java(2D)

Over the last few days I ported (almost) all the XRender-specific C code to Java, and although there are still some bugs left (e.g. it deadlocks from time to time) it works quite well:


All the functionality/features of the C-based pipeline have been ported, except text rendering, which is a dirty hack for now because I would need some data that is so far only available in native data structures.
So all the rendering is now done without JNI calls, resulting in ultra-low per-primitive overhead :)

Once all the stuff is working I guess it's time for another cleanup ... however without structural changes ;)

Tuesday, 9 December 2008

Java-level protocol generation

Over the last few days I have been experimenting with xcb's new socket-handoff mechanism. It allows mixing self-generated protocol (java2d) with protocol generated by xcb/xlib (all the awt/motif stuff) - with a callback notifying the native side when it should give away control over the socket to the native libraries again.
I had quite a hard time with JNI, not knowing that local object references are only valid for the duration of the native method call in which they were created - but once that was fixed everything worked out quite fine.

The main advantages are no per-primitive JNI overhead and almost pure Java code, which simplifies maintenance and improves portability as well as performance.
Of course there are also disadvantages, like the dependency on a very recent version of libxcb (not in OpenSolaris for now) and maybe a higher protocol generation overhead due to Java's managed nature.
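
To give an idea of what "generating protocol in Java" can look like, here is a minimal sketch that encodes a single rectangle request into an NIO buffer. It uses the core X11 PolyFillRectangle request (opcode 70) purely as an illustration; the pipeline itself may well emit RENDER requests instead, and the class name, method name and resource IDs below are made up for the example. Handing the bytes over to xcb's socket is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RectProtocolSketch {
    // Core X11 PolyFillRectangle request opcode.
    static final int POLY_FILL_RECTANGLE = 70;

    /** Appends one PolyFillRectangle request carrying a single rectangle to buf. */
    static void fillRect(ByteBuffer buf, int drawable, int gc,
                         int x, int y, int width, int height) {
        buf.put((byte) POLY_FILL_RECTANGLE); // major opcode
        buf.put((byte) 0);                   // unused padding byte
        buf.putShort((short) 5);             // length in 4-byte units: 3 + 2 * (1 rectangle)
        buf.putInt(drawable);                // target drawable ID
        buf.putInt(gc);                      // graphics context ID
        buf.putShort((short) x);
        buf.putShort((short) y);
        buf.putShort((short) width);
        buf.putShort((short) height);
    }

    public static void main(String[] args) {
        // The byte order has to match what was announced at connection setup;
        // native order is an assumption here.
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024).order(ByteOrder.nativeOrder());
        // Hypothetical resource IDs, only for illustration.
        fillRect(buf, 0x2000001, 0x2000002, 10, 20, 100, 50);
        System.out.println("bytes queued: " + buf.position());
    }
}
```

Each rectangle costs 20 bytes and no JNI call; only the eventual flush of the whole buffer has to cross into native code.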

For now only rectangles are supported. In the screenshot below, the red rect was rendered by the "traditional" Java->C->libX11 call, and all other rects were generated directly in Java:


Friday, 5 December 2008

JXRenderMark

JXRenderMark:
I've released JXRenderMark-0.6, an XRender benchmark written in C which stresses the functionality the pipeline depends on. The idea is to be able to demonstrate acceleration problems with the benchmark rather than having to build one-off testcases for each incident.

Source, binary and results can be found at:
http://78.31.67.79:8080/jxrender/RenderMark.html

Update:
JXRenderMark-0.6 has been integrated into the Phoronix benchmark suite :)

NVidia Driver:
NVidia published the second release of the 180.* beta drivers. This release is impressive, especially with the fixes I made to the pipeline last week.
Almost the whole Java2Demo is accelerated, with gradients as the only exception - all animations just fly :)
A Lightbeam-Nimbus run takes 6200ms, 5200ms with gradient rendering disabled in the pipeline, and 4700ms with Mask* operations and gradients turned off.
Hopefully gradients will soon be accelerated, and not only for GF8+ but also GF6/7.

Offscreen pixmaps:
Offscreen pixmaps in the X11 pipeline are no more, if you're running on an EXA-based driver.
I've written a patch which disables them when Java is running locally and SHMPixmaps are not available, to avoid the 5-10x slowdown Linux users were experiencing after distributions defaulted to EXA.
Better no acceleration than partial acceleration. It will be in JDK6u12 :)

Sunday, 30 November 2008

IcedTea6 integration

IcedTea Integration:
Thanks to Mark Wielaard, the XRender pipeline has been merged with IcedTea6 [link].
The pipeline will be built by default but disabled at runtime; to enable it simply pass -Dsun.java2d.xrender=True.
This way the pre-rewrite version can be used as a driver-testing vehicle; hopefully this will lead to stable and fast drivers.
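
For test programs that want to log whether the flag was passed, a small sketch like the following could be used. The helper is hypothetical and is not how the pipeline itself evaluates the property; it only reads the same system property name.

```java
public class XRenderFlagSketch {
    /** Hypothetical helper: true if -Dsun.java2d.xrender=True (any letter case) was passed. */
    static boolean xrenderRequested() {
        return Boolean.parseBoolean(System.getProperty("sun.java2d.xrender", "false"));
    }

    public static void main(String[] args) {
        System.out.println("XRender pipeline requested: " + xrenderRequested());
    }
}
```

Run it as `java -Dsun.java2d.xrender=True XRenderFlagSketch` to see the flag being picked up.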

Small enhancements & bug-fixes:
I've just pushed a new version:
* Optimized line rendering, so it no longer triggers fallbacks on GPUs without A8->A8 composition support.
This should only be relevant for i830 (every Intel chipset below the 915) as well as GeForce6/7.
* Fixed a bug which caused an unnecessary mask copy at every non-solid operation (e.g. paints, or translucent colors). For some Java2Demo demos this improved performance a *lot* :)
* I've fixed a bug Mark found, where I'd mixed up xrender and x11-pipeline initialization.

Friday, 21 November 2008

JGears2 / RenderMark

1.) JGears2, a simple Java2D shape rendering benchmark (Zack Rusin's QGears2 ported to Java) is available on the project page.
You can give it a try using Webstart.
Update: Somebody else did the port almost simultaneously, and his version even implements the "fancy" mode: http://trac-hg.assembla.com/jgears/wiki#Java2D
Well, this clearly deprecates my version ;)

2.) I am currently developing RenderMark, a simple XRender benchmark written in C. It tests the areas the pipeline uses heavily, to allow driver/Xorg developers to find regressions and optimize their drivers for our type of workload.
A first version, only testing geometry processing, can also be found on the project pages.
Hopefully it will help AMD to optimize their drivers, and will be integrated into the Phoronix Unix benchmarking suite.

3.) Google decided to build a new datacenter located 2km away from my home. Strange ;)

4.) Today I got Catalyst-8.11 working on my HD3850.
As I suspected, it's almost software-only - so the results are not bad, but also not that good.
Lightbeam/Nimbus took 9300ms vs. 15500ms with the X11 pipeline, and 6800ms with the NVidia driver.
However the nvidia driver hit fallbacks, so once these issues are resolved things should be even better.
By the way, the Radeon does the job in 4000ms on Windows XP with the D3D pipeline - that's definitely where I would like to be after the rewrite (and with usable drivers!)

Monday, 17 November 2008

university practical

1.) I am currently trying to find a professor who accepts the xrender-pipeline-rewrite project as a university practical. The problem seems to be that the pipeline does not fit well into any one institute's field, because it covers many areas but none in depth.
So if a professor at TU or the University of Vienna is reading this ... ;)

2.) I gave the LightBeam/Nimbus benchmark another try with the new 180.06 Beta-drivers on my GF6600.
XRender: 6800ms / Software: 7600ms / X11: 16000ms

Although there are still performance problems left (we depend on solid operations on A8 destinations, but for now the binary driver doesn't support them on GF6/7), the new pipeline is already faster than software-only.
And this was on a single-core CPU, so there is no benefit from X & Java running simultaneously ;)

The new beta release also disables SHMPixmaps by default now, which explains the bad performance of the X11 pipeline. So there are not many drivers left that support SHMPixmaps - I guess Linux users will soon start to complain about the low performance of Swing interfaces.

Sunday, 9 November 2008

KDE4.1 rant

I am a long-time KDE user, and always enjoyed the high functionality and pretty good UI performance of KDE3/Qt3.

Since KDE-4.0.1 I have tried KDE4-based distributions from time to time, and I am quite unhappy.
What really hurts is that Qt4's window redrawing is as slow as GTK's, maybe even slower - even though Trolltech advertises it as x times faster.
Furthermore it seems design and eye-candy have (had?) higher priority than functionality - a lot of common stuff is not implemented or doesn't work as expected. They even maintain their own HTML engine, but are not able to get the basic desktop right.

KDE-3.5 is being phased out by many distributions, so I'll have a look at Gnome. However I hope the Gnome project will go a different way than its founder and innovate instead of just cloning others' technologies.

After so many harsh words, I would like to say thanks to everyone who made Linux on the desktop a reality. My discomfort with KDE4 is only caused by the high standards (your work) I've become used to in the past.

Wednesday, 5 November 2008

ATI - Catalyst Linux Drivers

I recently tried ATI's Catalyst 8.10 driver to see how well it accelerates XRender on my Radeon HD3850.

To be honest it was quite a disappointment:
  • RENDER acceleration has to be enabled manually, using the Textured2D/TexturedXRender options.
  • It locks up on my machine immediately w/o those options.
  • Other users report screen corruption when enabling TexturedXRender; overall it does not seem to work very well.
I really like AMD for releasing their specs, but when it comes to drivers they are currently last.
Novell developers paid by AMD work on RadeonHD, which doesn't provide acceleration for R600/700 (HD2/3/4), and its R500 (X1x00) acceleration is just copy & paste from the open-source radeon driver.

Their proprietary driver doesn't do well at all at 2D acceleration, at least from what I have heard and experienced.
So for now there is no usable RENDER acceleration for ATI's latest GPU generation (released in 2006), neither from the closed-source driver nor from any existing open-source driver.

I opened a (rather lobby-oriented) bug-report: http://ati.cchtml.com/show_bug.cgi?id=1338
It would be great if you could post a comment and show that there are users who care about 2D acceleration, and not only 3DMark points.

Friday, 24 October 2008

Precompiled packages available

I've built precompiled packages as well as revised the project pages a bit.
Both are available for x86: http://78.31.67.79:8080/jxrender/

I would be really happy about feedback, especially on how it works on different hardware.
There's quite a lot of hardware untested (see "Driver Status"), so if you have an untested card with recent drivers it would be great if you could give it a try and report back whether it worked and which bugs you ran across.
Feedback via the mailing list is preferred, or simply leave a comment here.

Wednesday, 22 October 2008

TransformedBlit rewrite (and new benchmarks ;) )

TransformedBlit rewrite:
TransformedBlit was one of the ugliest parts of the pipeline and had a negative influence on its performance, so I finally sat down, tinkered for a few days and rewrote that stuff.
It's far from perfect and the coding style is even uglier than before, but image interpolation is now done the same way as in Java's software pipelines, and performance improved, especially for non-accelerated composition.

I did this before the pure-Java rewrite to be able to test against different drivers, so that by the time the rewritten pipeline is done the drivers are in good shape.
I'll soon start to distribute pre-built OpenJDK packages containing the pipeline; it would be really great if there were interest in testing it.


Swing Benchmarks:

I recently compiled xorg-server-1.6/intel from git. Intel seems to be fighting some performance regressions because of their GEM rewrite (I only tested on a non-GEM system); with some workloads I saw halved throughput. However, there have been enhancements in the X server I was interested in:




"Nimbus - Metal" -> "Lightbeam - Metal"

The good news is that the XRender pipeline (XR) beats the X11 pipeline for this specific workload in every case, on every acceleration architecture (EXA/XAA) on my system.
The bad news is that the pipeline is still faster on XAA than on EXA.
This could have to do with the performance regressions in the intel driver mentioned before, but I still doubt that EXA will beat XAA even once they are fixed.

My theory why XAA is faster than Software is that my dual-core CPU is able to benefit from the additional asynchronicity brought by the client-server design of X.
While the X client (Java) runs Nimbus/pipeline code on one core, the X server can run its software rendering loops on the other.
The software pipeline is synchronous, however, and waits until each software-rendering operation has finished.

After all, once the GEM rework is done, I want EXA to beat XAA's result. I guess this means filing many performance bugs, like this one: https://bugs.freedesktop.org/show_bug.cgi?id=18075

Tuesday, 14 October 2008

Cool XCB improvements

Recently there's a lot of interesting work going on in XCB, which is used indirectly by the XRender pipeline via Xlib.

1. Output Buffer enlarged:
In one of my last posts I complained that the output buffer was limited to 4kb; now the default has been raised to 16kb. Although I had hoped for a buffer size adjustable through the API, that's way better than the 4kb, and it's furthermore adjustable at compile time. So chances are good that desktop distributions will set it to ~64kb to squeeze out some additional performance for their users.

2. Socket-Hand-Off:
Besides the fact that this will improve performance a bit, it will allow applications to write directly to xcb's socket instead of having to go through xcb's protocol generator (as far as I've understood).

This would really be *extremely* helpful for the pipeline rewrite.
I would like to write the pipeline in pure Java, but instead of going through JNI for each primitive I would prefer to buffer it, very much like the D3D/OGL pipelines do.
The new pipelines, however, have to generate an opcode stream which is later re-interpreted on the native side (something like switch(getNextOpcode(buffer))) - and without the socket-handoff mechanism we would have to do exactly the same.
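
To illustrate the "generate an opcode stream, then re-interpret it" pattern described above, here is a minimal sketch. The opcodes, class and method names are invented for this example and are not the ones used by the D3D/OGL pipelines; the interpreter is written in Java only to keep the sketch self-contained, whereas in the real pipelines that part lives on the native side.

```java
import java.nio.ByteBuffer;

public class OpcodeStreamSketch {
    // Hypothetical opcodes, invented for this illustration.
    static final int OP_SET_COLOR = 1;
    static final int OP_FILL_RECT = 2;

    /** Producer side: queue a color change without leaving Java. */
    static void setColor(ByteBuffer buf, int argb) {
        buf.putInt(OP_SET_COLOR).putInt(argb);
    }

    /** Producer side: queue a rectangle fill without leaving Java. */
    static void fillRect(ByteBuffer buf, int x, int y, int w, int h) {
        buf.putInt(OP_FILL_RECT).putInt(x).putInt(y).putInt(w).putInt(h);
    }

    /** Stand-in for the native consumer: decodes the stream, returns the number of operations. */
    static int interpret(ByteBuffer buf) {
        buf.flip();
        int ops = 0;
        while (buf.hasRemaining()) {
            switch (buf.getInt()) {
                case OP_SET_COLOR:
                    buf.getInt(); // argb; a real consumer would issue the X request here
                    break;
                case OP_FILL_RECT:
                    buf.getInt(); buf.getInt(); buf.getInt(); buf.getInt(); // x, y, w, h
                    break;
                default:
                    throw new IllegalStateException("unknown opcode");
            }
            ops++;
        }
        buf.clear(); // buffer is ready for the next batch
        return ops;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        setColor(buf, 0xFFFF0000);
        fillRect(buf, 0, 0, 20, 20);
        fillRect(buf, 30, 0, 20, 20);
        System.out.println("operations decoded: " + interpret(buf));
    }
}
```

The extra decode step in interpret() is exactly the work that writing X11 protocol straight into the buffer would avoid.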

With the socket-handoff stuff in place, it should be possible to write X11 protocol directly to a large (NIO?) buffer, and when it's time we flush that data directly through xcb's socket, which means:
- No per-primitive JNI cost
- No additional command-stream generation / interpretation
- The buffer is ours, so we can size the buffer ;) (really?)

One thing I am not sure about is how IDs, such as those of server-side resources, could be synchronized with xcb; hopefully there's a solution.

Well, for now ... university has me again *argh!*, and there are a few things I would like to fix in the old pipeline - in order to test driver compatibility and submit bug reports - before I can start working on that stuff.

Saturday, 11 October 2008

NVidia 178.80 :)

Finally the new NVidia drivers (178.80) went gold, and they are impressive :)
They perform extremely well over a broad range of RENDER operations and they don't seem to have any performance weaknesses that the pipeline hits (except subpixel AA text on GF7 and below).
One bug which affects TexturePaints shows up in some Java2Demo demos, but I guess this small problem should be fixed soon.

Furthermore, the git version of the intel driver now achieves 960,000 glyphs/s on my 945GM, even without GEM, so text should no longer be a bottleneck for the Swing benchmarks. Oddly, I am experiencing other slowdowns with this driver and I was not able to locate their cause - they are not caused by fallbacks, and a lot of time is spent in the kernel. Maybe sysprof can help here...

Prebuilt packages with the pipeline should be available soon, I am just waiting to get my vServer account activated.
Currently I am struggling with the transformed blit rewrite ... hmm ... is there somebody with a good background in affine transformations out there? ;)

Saturday, 4 October 2008

Driver bugs and software fallbacks

A bug I reported to the nvidia and intel driver developers some time ago seems to be harder to fix than I thought.
For composition where the source is read outside of its surface bounds, the result should be transparent (not touching the background), but instead it's black for SRC when the source is an RGB24 picture:


In the short term both the intel and the nvidia driver will fall back to software in this case, because the hardware does not seem to support it directly. I hope workarounds will be implemented soon (in theory the driver would only have to allocate an ARGB32 picture, blit to that and use that one instead - maybe it could be done even more efficiently with shaders). In the meantime I fight with building Xorg from git ;)

Update:
Thanks a lot to Carl Worth for looking deeper into this issue! At least i965 and higher support the required behaviour and Carl fixed the driver; I filed a bug about it for 915 and lower. There are workarounds which would solve it on that hardware too (if there should really be no support for it), however they are more than a 1:1 mapping and only apply in special cases. Using a temporary ARGB32 surface (maybe with tiling for very large source pictures) is somewhat inefficient but should solve all cases.

JVM improvements:
A lot of interesting stuff is happening in the JVM space:

- The new G1 "Garbage First" garbage collector has been open-sourced
- IBM's Java got a cache for JITed and AOT code, I hope Hotspot will soon have a comparable feature too.
- 64-bit client JVM is almost ready
- Tiered compilers are working
- CompressedOops reduce overhead on 64-bit platforms, when heap < 32GB

No idea what will be ready for JDK7, but this will definitely be an interesting release again :)

Tuesday, 30 September 2008

OpenJDK Challenge Results

Finally the waiting is over and the Challenge winners have been announced. Congratulations to all winners!
The projects are really cool and I am quite curious to see how they will evolve.
Of course I was quite surprised that the XRender project won the Gold Medal :)



Thanks a lot to Dmitri Trembovetski, who supported the project from the very beginning by offering to act as contact point, who was always patient and very helpful ... and even had a minute or two to talk about non-technical stuff :)
Thanks of course also to the 2d-dev team who, especially in the beginning, tolerated all my never-ending newbie questions, and to all the people involved in the Challenge.
And last but of course not least, thanks to Sun for open-sourcing Java and for sponsoring the Challenge.

Of course I'll continue to work on the project, and this is what I hope will happen in the short term:
  1. Build binaries so that it's easy to test without spending a day (or two^^) compiling and patching OpenJDK. If you intend to use ordinary Swing apps with the Ocean LnF, chances are good that you will see really good performance even with the current drivers :)
  2. Implement some outstanding optimizations and fix remaining bugs.
  3. Test on Xorg git to see if showstoppers and performance problems are present, test as much hardware as possible and report all bugs. The goal would be to have infrastructure available which is able to run the XRender pipeline really well.

After so much excitement it's time for a cold beer and Star Trek ;)

Wednesday, 24 September 2008

NVidia Driver Bug

Finally I was able to hunt down the bug that caused ugly artifacts with the new NVidia beta drivers, things like this:

It seems to be a race condition in the driver that shows up if we do something like this (this exact sequence would actually work, however, because we would use another, optimized code path which doesn't trigger the bug):
g.setColor(yellow); g.fillRect(); g.setColor(red); g.fillRect();
It seems the driver doesn't make sure the setColor has really finished before it starts with fillRect, so the second rect could end up yellow instead of red.
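
A standalone check for this class of bug could look roughly like the sketch below. Note the assumption: it renders into a BufferedImage, which goes through Java's software loops rather than the XRender pipeline, so it only shows the shape of such a test. To exercise a driver, the same drawing calls would have to target an accelerated surface and be read back. The class name and coordinates are made up for the example.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class ColorOrderCheck {
    /** Draws a yellow rect followed by a red rect and returns the resulting image. */
    static BufferedImage render() {
        BufferedImage img = new BufferedImage(120, 50, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.YELLOW);
        g.fillRect(0, 0, 50, 50);
        g.setColor(Color.RED);
        g.fillRect(60, 0, 50, 50);
        g.dispose();
        return img;
    }

    /** True if each rect has the color that was set immediately before it was filled. */
    static boolean colorsCorrect(BufferedImage img) {
        return img.getRGB(25, 25) == Color.YELLOW.getRGB()
            && img.getRGB(85, 25) == Color.RED.getRGB();
    }

    public static void main(String[] args) {
        System.out.println("colors correct: " + colorsCorrect(render()));
    }
}
```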

http://www.nvnews.net/vbulletin/showthread.php?t=119910

Hopefully all that bug-hunting will soon lead to good drivers ;)

Update:

There were some doubts whether Xorg's software implementation is correct; as it differs from what the Intel driver currently does, I filed a bug some time ago. Luckily the Intel driver will follow the software implementation, and the Radeon driver already does as far as I know.
This change means that we can implement transformed blits a bit more efficiently if no filtering is requested.

http://bugs.freedesktop.org/show_bug.cgi?id=16820

Java Gears

To be able to compare the XRender pipeline's performance with other libraries like Qt or cairo, I ported qgears2 (a port of the original cairo-gears program to Qt4) to Java:


Sure, it's nothing to rely on, but it's a nice demo and at least gives some indication of where we are when it comes to shape rendering.

QGears2:
EXA: 32/85 (No AA / AA)
XAA: 120/100

Java Gears:
EXA: 220/60
XAA: 200/82

So for aliased rendering Java running the XRender pipeline is quite a good deal faster, but we are behind when it comes to antialiased rendering.
I guess a large share of the cycles goes to xlib/xcb; we hit 17,500 context switches per second. I filed a bug about the problem discussed recently: https://bugs.freedesktop.org/show_bug.cgi?id=17735
This was on Xorg-1.4.99.05, so the EXA results are influenced by some performance problems that version has.

I'll have to ask Zack; if he agrees, the source will be available soon.

Monday, 22 September 2008

Mask upload performance

Antialiased rendering is currently done by uploading mask-tiles to the XServer followed by a composite operation with that mask.
Besides the fact that performance is not very good compared to the D3D pipeline, I saw an awfully high context switch rate running J2DBench demos like lineanim (30,000/s) which is ... well ... not pretty.

* The problem is that xlib/xcb's buffer is only 4kb (it was 16kb by default in the "old" xlib implementation), and an AA tile is between 0 and 1kb, so after maybe 6-8 tiles the command buffer is flushed, which results in a context switch. The ugly detail here is that it's not possible to adjust the buffer size at runtime, not even before startup (which was possible with old xlib) or at compile time; hopefully this will change.
* Another performance limiter is that the mask-data has to be copied using the command-buffer, over unix domain sockets to the XServer.
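
As a rough back-of-the-envelope model of the first point, the sketch below estimates how many uploads fit into the output buffer before it has to be flushed. The 600-byte average request size is an assumption picked to match the "6-8 tiles" observation, not a measured value, and the model ignores the composite request that follows each upload.

```java
public class FlushEstimate {
    /** Number of requests of the given size that fit into the buffer before a flush is needed. */
    static int requestsPerFlush(int bufferBytes, int requestBytes) {
        return bufferBytes / requestBytes;
    }

    /** Estimated flushes per second at a given request rate. */
    static int flushesPerSecond(int requestsPerSecond, int bufferBytes, int requestBytes) {
        return requestsPerSecond / requestsPerFlush(bufferBytes, requestBytes);
    }

    public static void main(String[] args) {
        int assumedTileRequest = 600; // assumption: average bytes per mask upload
        System.out.println("4kb buffer:  " + requestsPerFlush(4 * 1024, assumedTileRequest) + " tiles per flush");
        System.out.println("16kb buffer: " + requestsPerFlush(16 * 1024, assumedTileRequest) + " tiles per flush");
    }
}
```

Under that assumption a 16kb buffer would cut the number of flushes to roughly a quarter.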

Xorg supports the Shm extension, however earlier benchmarks I did showed the penalty of having to wait until the XServer has copied the data before the shared memory region can be used again.
The X11 pipeline also does Shm transfers only if the amount of data to be transferred is >=64kb, otherwise it's not worth the additional round-trip.
One round-trip for 1 tile is of course way worse than one flush for every ~7 tiles.

The solution could be to use more than one shared memory segment, and only sync when all of them have been used. I did some benchmarks and the results look promising.
Uploading a 32x32x8 mask 10,000 times and doing a composition operation with it, takes:

So when using 4 shm masks, and syncing after those have been used, performance is the same as with the traditional mask-upload path. One mask consumes about 1kb for the pixmap and 1kb for the shared memory area, plus all the overhead associated with it, so it should still be no problem to preallocate 32 or even 64 masks.
Allocating one large pixmap, and maybe also one large shared memory area, may reduce the overhead.
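
The bookkeeping behind "sync only when all segments have been used" can be sketched as a simple ring. This is a pure simulation that only counts how many blocking syncs would be needed; the class and method names are invented, and the actual Shm upload and sync calls are left out.

```java
public class MaskRingSketch {
    private final int slots;  // number of preallocated shm masks
    private int next = 0;     // slot to hand out next
    private int inFlight = 0; // slots written since the last sync
    private int syncs = 0;    // number of blocking round-trips so far

    MaskRingSketch(int slots) {
        this.slots = slots;
    }

    /** Returns the slot for the next mask, syncing first if every slot is still in flight. */
    int acquire() {
        if (inFlight == slots) {
            syncs++;      // stand-in for a blocking round-trip to the X server
            inFlight = 0; // after the sync the server has consumed all slots
        }
        int slot = next;
        next = (next + 1) % slots;
        inFlight++;
        return slot;
    }

    int syncCount() {
        return syncs;
    }

    /** Counts the syncs needed for the given number of uploads with the given ring size. */
    static int syncsFor(int uploads, int slots) {
        MaskRingSketch ring = new MaskRingSketch(slots);
        for (int i = 0; i < uploads; i++) {
            ring.acquire();
        }
        return ring.syncCount();
    }

    public static void main(String[] args) {
        System.out.println("1 mask:   " + syncsFor(10000, 1) + " syncs");
        System.out.println("4 masks:  " + syncsFor(10000, 4) + " syncs");
        System.out.println("32 masks: " + syncsFor(10000, 32) + " syncs");
    }
}
```

The number of round-trips drops roughly in proportion to the ring size, which is why preallocating 32 or 64 masks looks attractive.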

The cool thing is that this code still forces round-trips by syncing; however, Xlib provides an event-based system which notifies the client when the image transfer has completed - which should make the shared-memory approach even faster.
So for now I see a 2x improvement for the upload path. For sure this will speed up antialiased rendering quite a bit, and I am quite curious how much.

Update:
In the benchmark above I only tested Xorg-1.3/XAA, I repeated the tests with Xorg-1.5 and EXA:

Xorg-1.3/NVIDIA: 80ms (tested on my old 2.6ghz/P4 notebook)
Xorg-1.3/XAA: 85ms
Xorg-1.3/EXA: 1000ms
Xorg-1.5/EXA: 250ms

So EXA seems to struggle a lot with that kind of workload :-/
Although a lot of time is spent in dixLookupPrivate, it only adds up to 20% of the total runtime. I definitely need to build Xorg master and see how that performs.

NVidia does quite well, although this was an old legacy release. Looking at oprofile it seems to be done in hardware, almost no time is spent in libfb :)
Also, the old laptop still used traditional Xlib, and a slightly disappointing result is that using SHM or not does not seem to make a lot of difference (>10%) there - so maybe using SHM in this case is just working around Xlib/XCB's problems.

Tuesday, 9 September 2008

Laptop repair

Soon my laptop, a Toshiba Tecra A8, will be on its 4th "vacation".
I tend to do all my work on a laptop, so I bought a model which was praised for its mechanical stability and durability. I guess those beasts simply are not built for my type of "workload", so like all its predecessors it needs frequent repairs.
I'm glad that the 2-to-4-year warranty extension was so cheap and that support is quite ok.

OpenSolaris support will have to wait until I live in Vienna again, I simply can't get it working. I also plan to create some pre-compiled binaries, so people can test the pipeline without compiling OpenJDK themselves.
I would also like to get my Deflater/Inflater improvements into OpenJDK, but I guess I have to get used to the code again and write some tests to verify it.

Monday, 1 September 2008

NVidia Binary Driver 177.70

NVidia has released a few small bugfix releases following 177.67, and I gave the 177.70 beta release a try.
Their proprietary driver is especially interesting because it does not depend on EXA (which still has some performance bugs in xorg-server-1.5) and all the memory manager stuff is already in place.
Most Java2Demo tests are accelerated really well now, however the driver seems to struggle with text, so Nimbus performance is worse than with the Nouveau driver.
Another weird thing is a software fallback hit in Java2Demo when using TexturePaint, which I was not able to reproduce with J2DBench at all.

More about it here: http://www.nvnews.net/vbulletin/showthread.php?t=118801

Despite the problems reported, NVidia is definitely doing well.
Hopefully the problems will be fixed in the next few releases, widening the range of hardware the XRender pipeline runs well on.

Friday, 29 August 2008

OpenSolaris

My flat-rate train ticket will expire soon, so tomorrow I'll take another trip to Vienna to download a new OpenSolaris build.
I haven't been very successful with 2008.05 (it does not even boot on my laptop), hopefully newer versions will work more smoothly.

Update:
Build95 seems way better than 2008.05. The package manager worked out of the box, the new shell seems to be bash-compatible and changing the desktop resolution no longer crashes the system running in VirtualBox.

However it fails building OpenJDK, as far as I can see the SunStudio compiler dies:
CC: Fatal error in ccfe: Killed
sun.font/fontmanager/obj/CursiveAttachmentSubtables.o Error 1
Hopefully this is a known problem with a known workaround, because it is starting to become frustrating.

Tuesday, 19 August 2008

NVidia Binary Driver 177.67

NVidia driver improvements:
Today NVidia made a new beta driver available, addressing many of the performance problems I saw when running the XRender pipeline on my GeForce6600 using the proprietary drivers - problems caused by the quite limited Render acceleration.

Taken from the release notes:
Improved support for RENDER masks, as well as RENDER repeating modes and transformations, for video memory pixmaps.
I haven't tested it yet, but this should mean Nimbus runs quite well using the proprietary drivers.
They are especially interesting because they don't suffer from some of the EXA performance problems I am experiencing, so I am quite curious how they will perform :)

Another statement taken from the release notes:
Added an 'AllowSHMPixmaps' X configuration option, which can be used to prevent applications from using shared memory pixmaps; the latter may cause some optimizations in the NVIDIA X driver to be disabled.
So it seems the time has come for SHMPixmaps to go away. EXA already removed support for SHMPixmaps (which hurts the X11 pipeline quite a bit) and the NVidia driver seems to be going the same route.

The interesting thing is that those features were implemented because of the pressure generated by the community for better performing drivers - at least that's my impression.


After all, I am also curious what will be done about the proprietary AMD driver.
For now it still relies on XAA, and it seems AMD is the only major GPU manufacturer not providing drivers which are able to accelerate XRender properly.
Hopefully, when RadeonHD is able to accelerate XRender well on the HD series, the proprietary 2D part will be removed and replaced by RadeonHD. Maybe the proprietary driver could be installed as an add-on to RadeonHD, this would be my favourite distribution model.

Challenge:
The result of the challenge is still unknown ... wow I am quite nervous right now.

I hope that the missing Solaris support wasn't too negative, as it should be fixable in a couple of days.
I ... well, just forgot about it. What a shame, for now my stuff does not run (at least in VirtualBox) on the Unix of the sponsor :-/

Sunday, 10 August 2008

Sharp edges..

Benchmarks:
I did some further analysis of why the Swing benchmarks (and some others) don't perform that well.
Of course there's still lots of room for improvement in the pipeline itself, but some Xorg performance bugs turned out to be show-stoppers here.
The updated benchmarks should soon be available at: ...

Blits / sharp edges:
Xorg-1.5 now supports RepeatPad (however Intel still falls back to software for now), which is needed to avoid smeared borders when scaling images.
With some tricks the mask's rectangular geometry can itself be used to clip off the repeated edges (maps to GL_CLAMP), so we don't have to generate the rotated image's geometry anymore :)

The following screenshot shows a rotated image with a 2px black line on its edges - sharp edges and interpolation inside the image ... just as it should be:
.... :)



In theory it should be possible to use a 1x1 mask and simply adjust the scale to get the expected size - however I don't know how rounding errors could influence the result. Most likely I can use the existing mask-buffer pixmap and tile if the area is larger.
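
To get a feeling for the rounding question, here is a small sketch that computes where the far edge of the destination area would land in a 1x1 source once the scale factor is converted to the 16.16 fixed-point format used by XRender transforms. It assumes the conversion truncates, as Xlib's XDoubleToFixed macro does, and that the transform maps destination to source coordinates; the class and method names are invented for the example.

```java
public class MaskScaleSketch {
    /** Converts a double to 16.16 fixed point by truncation (as XDoubleToFixed does). */
    static int toFixed(double d) {
        return (int) (d * 65536);
    }

    /**
     * Source coordinate (16.16 fixed point) reached at the far edge of a destination
     * span of destSize pixels when a 1x1 mask is stretched over it. 65536 would be exact.
     */
    static int farEdgeInSource(int destSize) {
        return destSize * toFixed(1.0 / destSize);
    }

    public static void main(String[] args) {
        for (int size : new int[] {3, 256, 1000}) {
            int edge = farEdgeInSource(size);
            System.out.println(size + "px -> " + edge + " (short by " + (65536 - edge) + "/65536 of a pixel)");
        }
    }
}
```

With truncation the far edge never overshoots the mask in this model, it only falls slightly short, but whether a given driver samples the same way is something that would have to be tested.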

Tuesday, 5 August 2008

Over

Two hours ago the deadline passed and 6 projects have been submitted - wow, there've been some really cool things happening over the past few months :)
Well, now the challenge is over ... but of course the project will live on.

The next few days I'll be recovering a bit; I guess the only good thing about deadlines is that everything is just cut off afterwards.
Wait ... maybe I'll profile Nimbus and fillrect performance first ;)

Update: After stubbing out all MaskFills, MaskBlits and Paints as well as transformed blits, performance improved by only ~18% for the Lightbeam-Nimbus test published on the project page.
The profile looks very similar to MigLayout's. Most of the time is spent in the X server in glyph composition, so once the glyph enhancements are there Nimbus should also benefit a lot :)

Also, a stand-alone test case shows XAA being twice as fast as EXA for 20x20 rects.

In general EXA seems to suffer, at least with xorg-server-1.5, from quite a high per-primitive overhead; hopefully this will be better when 1.6 is released. For example, for the 20x20 rect 10% of CPU time is spent in the driver, the rest in Xorg.

My goal would be to have all functionality required to run the XRender pipeline (with the image-scale fix in place) accelerated in xorg-server-1.6 as well as intel-2.6.

Monday, 4 August 2008

Solaris...

I finally got the XRender pipeline to build on Solaris, however it does not work correctly on my installation in VirtualBox on the local X-Server. It does work however if I re-direct it to my X-Server running on Linux.
I need to get a "real" Solaris installation soon and fix it, I am a bit ashamed it does not work :-/

Saturday, 2 August 2008

Solaris, Radeon HD3850

Recent progress:
The past few days I worked further on the project page (will be online soon), fixed some bugs and continued to work on getting it to build on Solaris.
For now it fails at the linking stage: I don't know why, but the linker complains it can't find a symbol located in libmawt.so when linking libfontmanager.so. The same stuff works on Linux :-(

All the code is in the repository, so if you are curious give it a try :)
Just build it and specify -Dsun.java2d.xrender=True at the command line, and highest graphic performance will be yours ;)


Radeon 3850:



I also bought a Radeon HD3850. The card was extremely cheap, probably because it's being sold off in favour of its 4850 successor - it's a nice upgrade for my brother's computer and gives me the opportunity to test against the RadeonHD driver.

This means I can test the following hardware-driver combinations at home:
Nvidia: proprietary / nouveau / nv (uninteresting)
AMD: RadeonHD / radeon (on RV420) / proprietary (uninteresting)
Intel: 945GM

Maybe a collection of IGPs like VIA Chrome or SiS GPUs would be nice to have; I hope I can find some testers :)

Thursday, 31 July 2008

Building on OpenSolaris

Today I installed OpenSolaris-2008.05 in VirtualBox; it took some time but works quite well.
Bridging the network to allow Solaris to download the Sun Studio compilers took even more time, and making OpenJDK build ... well ;)

1.) You need both gcc and Sun-Studio compilers
2.) You'll need the CUPS headers, in package SUNWcups. If "pkg install SUNWcups" can't find anything, run "pkg refresh" first
3.) If compiling the freetype-test fails with some cryptic relocation errors, install SUNWtoo
4.) If it bails out because it can't find sys/audio.h and other stuff, install SUNWaudh (this is not sanity checked).
5.) Install X11 headers (also not sanity checked).

If you don't know what the package you need is called, just run
pkg search -lr file_contained_in_package
to get a list of packages containing the file.

Well, at least I can now also compile the hibernated patches on Deflater/Inflater after the challenge, don't know if I would have gone past this just for those patches ;)

Update:
Just tried to re-boot because I got some strange error messages when I shut the system down.
Well at least I know now that I should update the whole system when installing SUNWtoo. ARRG!

Wednesday, 30 July 2008

Yet another Intel driver bug

1.) Today I had a great idea how transformed images could be implemented *efficiently* using XRender once RepeatPad is implemented and accelerated.

The main problem was the need to generate the geometry of the transformed image to clip away the PAD surrounding the image (GL_CLAMP).
However, if a mask with the same size as the source image (or a larger mask with clip rectangles) is used and filtered with nearest, this already represents the final geometry.
If a mask with the scaled size and bilinear filtering is used, I guess it could be used for implementing antialiased image rendering.


2.) Today I discovered the 3rd intel-driver bug:


At the top you can see what the result should look like and at the bottom what it actually looks like ;)
The problem is that if a mask has a transformation set, its clip rectangles are ignored.

Of course I'll test it on nouveau tomorrow ;)

Documentation

The last two days I was working on cleaning up and reformatting the existing code and writing documentation.

Furthermore I've installed Fedora9 on a USB stick, so I can test the open-source nouveau driver on my brother's computer. It works surprisingly well; I've seen no artifacts running SwingSet2 with Nimbus :)

Tomorrow I'll travel to Vienna again and try to get my code compiling and working on OpenSolaris.
I'll also start benchmarking - I almost forgot about an area where the XRender pipeline should shine :)

Monday, 28 July 2008

Java2Demo is not a benchmark

Today I replaced lock/getrasinfo/unlock with one single call to XPutImage (no shm support for now), to get rid of some unnecessary overhead for MaskFill.

To my surprise the LineAnim demo got even worse on Xorg-1.5, and according to top Xorg was using ~50% and the java process 150% of my CPUs, which reminded me that I saw similar things with the X11 pipeline when Java2Demo's delay was set to 0. After setting it to 1ms, the demo jumped from 150fps to 250fps, and java was now using 90% CPU, roughly in line with Xorg.
I don't know exactly what the cause is, but I guess it's some locking problem in Java2Demo or Java2D itself.

I wrote a small benchmark drawing 10000 antialiased lines with a width of 100 and a y1/y2 difference between 0 and 100; here are the results:
X11/Xorg-1.3/XAA : 1050ms
XR/Xorg-1.3/EXA : 1250ms
XR/Xorg-1.3/XAA: 1350ms
XR/Xorg-1.5/EXA: 1500ms
X11/Xorg-1.3/XAA: 15500ms

The X11 pipeline performs quite well on XAA, most likely due to using SHM (maybe even shm pixmaps), and Xorg-1.5 is slower than Xorg-1.3, most likely due to a performance bug I've already reported in a different context.
On Xorg-1.5 the profile looks for now like this:
72584 9.7587 Xorg Xorg dixLookupPrivate
51372 6.9068 libdcpr.so libdcpr.so writeAlpha8NZ
47983 6.4511 libdcpr.so libdcpr.so processSubBufferInTile
39541 5.3161 libc-2.8.90.so libc-2.8.90.so memcpy
32230 4.3332 libmawt.so libmawt.so prepareMaskPM
31531 4.2392 intel_drv.so intel_drv.so i915_prepare_composite

So after the dixLookupPrivate issue is resolved I guess performance will be better than Xorg-1.3 :)
I am quite curious how much SHM can help here.

Update:
Well, of course using shm images for MaskFill was a stupid idea, because it forces MaskFill to sync with the server every time; however, I still found a small optimization to avoid unnecessarily copying data around.
I guess a somewhat larger xlib buffer size would help quite a bit here, but as far as I know it's hardcoded to 4kb for now and can't be changed (for xlib-xcb).

Saturday, 26 July 2008

MaskFill again

Yesterday I had a cool idea about MaskFills, which are currently not as fast as I would like them to be ;)

Although the mask upload path is far from ideal for now (this can be improved, however), a lot of overhead is caused by the fact that only tiles of 32x32 (or smaller) are processed, one after another.
So there is significant overhead in setting up all the composition parameters for such a small operation - however, I hope that drivers will also improve in that area.

One idea would be to use the mask-buffer-composition-tile, which currently is 256x256 by default. The small tiles could be SHM-put into that mask and then composited with a single composition operation (or more if the area is larger than the tile). The worst case would again be a diagonal line - where a lot of empty space would be composited - however, even in that case I think the new approach would be significantly faster.
The reduced number of composition operations should easily compensate for the wasted fillrate.
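To get a feeling for that trade-off, here is a rough back-of-the-envelope model in Python. It is only a sketch, not code from the pipeline: the function name is made up, and it assumes that a 45° line touches exactly one small tile per tile-width of its length. The 32x32 and 256x256 sizes are the ones mentioned above.

```python
import math

def diagonal_line_cost(length, small_tile=32, buffer_tile=256):
    """Compare per-tile composition with batching into one buffer tile.

    Hypothetical model: a 45-degree line of `length` pixels touches one
    small tile per `small_tile` pixels (only the tiles on the diagonal).
    Returns (number of compositions, pixels composited) for both paths.
    """
    # Current path: one composition per small tile.
    per_tile_ops = math.ceil(length / small_tile)
    per_tile_pixels = per_tile_ops * small_tile * small_tile

    # Batched path: the small tiles are copied into the buffer tile, which
    # is then composited as a whole (mostly empty for a diagonal line).
    batched_ops = math.ceil(length / buffer_tile)
    batched_pixels = batched_ops * buffer_tile * buffer_tile

    return {
        "per_tile": (per_tile_ops, per_tile_pixels),
        "batched": (batched_ops, batched_pixels),
    }

print(diagonal_line_cost(1024))
```

For a 1024 pixel diagonal this model gives 32 compositions over 32768 pixels versus 4 compositions over 262144 pixels: eight times fewer operations for eight times the fill. Whether that pays off depends on how large the per-operation overhead really is.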
My goal would be 250-300fps in the LineAnim Java2D test.

So that would be one cool thing to do next week, if I won't get flooded with enhancement requests from the j2d guys ;)

Friday, 25 July 2008

Jippie :)

After a long fight with Mercurial over the past two or three days, it's finally done: the code resides in its Mercurial repository, and this is the initial changeset:
http://hg.openjdk.java.net/xrender/xrender/jdk/rev/6d294fa2bd42

Thanks a lot to Dmitri for his help and neverending patience ... I'll re-read the mercurial docs, I promised ;)

Thanks to Flickr user rileyroxx for the nice photograph

It's still very much a proof of concept and has many quirks I would like to change before the deadline; however, I guess there are even more problems I don't know about ;)

I guess this weekend I'll have a break, my laptop feels quite burned out ;)

Thursday, 24 July 2008

Vienna

Today and tomorrow I am in Vienna; I have a free train ticket which lasts two months.
My internet connection at home is extremely expensive ( http://www.aon.at/ ), so I'll change to a UMTS connection soon.
Hopefully the Mercurial stuff will work out fine, and soon the XRender pipeline will be in its place ;)

Today I did cleanups, removed tons of warnings from the native code and did some small fixes, but nothing exciting.

Wednesday, 23 July 2008

Benchmarking again...

MigLayout Benchmark:

Again I could not resist benchmarking my pipeline when I remembered the MigLayout benchmark.
I used it a few years ago to prove SWT's poor performance (inherited from GTK+), and I think it's quite a good Swing benchmark.

These were the results I got:

               Nimbus    Ocean
XRender/EXA    2100ms    1000ms
X11/EXA        8000ms    5800ms
XRender/XAA    2800ms    750ms
X11/XAA        2600ms    800ms

The X11 pipeline is really fast when using XAA, because if something falls back to software it can directly manipulate the target pixmap using shm pixmaps.
On EXA, however, pixmaps are stored in VRAM and shm pixmaps are not supported - the sysprof profile is completely dominated by moving data from/to VRAM.

The real surprise was how well the XRender pipeline does when running on XAA, and how little EXA helps when running Nimbus. However, Xorg's profile looks quite good and top says a lot of time is spent in the java process itself (65% java, 30% Xorg), so either I am using JNI too much or some validation/transformation stuff eats up all the cycles.

Ocean on EXA spends most of its time in gradients and text; at least text will improve a lot once Owen Taylor's glyph patches are in Xorg.


Update:


I profiled my pipeline running the benchmark and only a little time was spent in the pipeline, so I ran the same benchmark on my brother's computer (Sempron64 1.8GHz, Geforce6600, WinXP, ForceWare 9371 driver):


                Nimbus     Ocean
Linux/X11       4400ms     800ms
Linux/OpenGL    6535ms*    --------
Windows-D3D     4800ms     500ms

* OpenGL on Windows did not work at all with fbobject=true, and got stuck after 5s with fbobject=false.
* OpenGL on Linux showed artifacts, and became slower each run (6535->33000ms). Latest nvidia binary driver was installed.

The benchmark seems to stress Nimbus in a way it doesn't like, no matter which pipeline is used.
I am totally impressed by the Ocean result running on D3D; keep in mind this CPU is ~50% slower than mine.


MaskFill Performance:
Another topic I am not happy about is poor MaskFill performance.


                  BezierAnim    LineAnim
All               175fps        150fps
no mask upload    300fps        200fps
no composition    300fps        200fps
nothing           750fps        440fps

I thought mask upload (XPutImage) would be the slow part, because it uses suboptimal upload paths with quite some overhead and furthermore the X server has to migrate data from sysmem->vram.
It turned out that composition (with a mask in vram) and mask uploading (+migration) are both about equally slow/fast.
Antialiasing relies a lot on MaskFill/MaskBlit (Nimbus); however, I am not sure how much room for improvement is left - it would surely help if the no-mask operations could be accumulated in the MaskBuffer, but this would require an API change.

Tuesday, 22 July 2008

Rewrite

Today I started the "rewrite" with the goal of a clear separation between the existing X11 pipeline and the new XRender pipeline.
It works almost as well as the old "hack" pipeline, but some pieces are still missing and a few bugs did not suddenly disappear as I had hoped ;)

For now I won't create the "pure"-java pipeline I dreamed about a few days ago; it's less than two weeks till the 4th of August and I am not sure how much effort this route would take.
I'll try it out later with enough time to think about a clean design and an efficient implementation ... I am quite curious how well this design would play together with caciocavallo.

Monday, 21 July 2008

Driver bugs...

Thanks to Dmitri, who suggested that both bugs I am experiencing are probably due to incorrect pixel/texel mapping. XRender itself is pixel-based, so the driver is responsible for the mapping.

The Nimbus corruption shown below is fixed in Xorg-1.5/Intel-2.3.2; however, there is another bug where scaling on the x-axis only also blurs pixels on the y-axis when bilinear filtering is used, and it's still there in the setup mentioned above.
Switching to XAA (software scaling using pixman) did not show the problem.

Thanks again!

Sunday, 20 July 2008

Nimbus

After the gradient work I profiled Nimbus again and noticed there were still some fallbacks in action.
I had forgotten to register MaskFill primitives for argb-pre surfaces, and after adding those ... I faced the usual problems:
Note that they are not equally scaled, so in reality the size should be quite the same.
The two visual problems are the white area below the knob, and the different outline of the up/down buttons.
Also the gradient looks a bit different; I am not sure whether this is a problem in the pipeline or just XRender's different gradient implementation.

Overall performance is quite good now and with some planned enhancements for MaskFill (fast upload paths) it should become even better - well, at least something encouraging.

Nimbus seems to be an excellent "tool" for testing the more advanced features of the pipeline ... it seems there is no bug it does not catch ;)

Gradient performance

Today I further investigated the performance problems when gradients are used. Although it's not really the pipeline's fault, something had to be done - even Metal/Ocean spent significant time drawing gradients.

The problem is how EXA handles fallbacks:
For EXA the only concern is whether something can be accessed by the CPU or the GPU, so if it has to draw a gradient which is pinned to sysmem (because gradients are not accelerated for now) it has to access the destination and mask surface with the CPU, because composition is done in software.
The problem is that for now the only way to make the surface CPU-addressable is to copy the content from vram to sysram - so mask and destination contents are copied from vram to sysram and composition is done by the CPU.
Now imagine what happens if directly after that e.g. fillRect is called. EXA realizes that this operation can be done by the GPU, but the image content is now in sysram. So it copies the mask/dest pixmap to vram again, and then fills the single rect.

This made the X server spend more than 50% of its cycles in memcpy, most of the time reading back contents from vram.

The workaround for this problem was to allocate a 256x256 gradient-buffer-tile. So if a gradient would be used for composition, it is drawn into this tile and then the tile is used as source instead of the "real" gradient.
The trick is that the tile pixmap is never altered by the GPU, so the only copying is sysram->vram, once when the composition happens.
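The effect of the ping-pong and of the tile workaround can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not EXA's real migration logic: only the destination pixmap is tracked (the mask is ignored), the names are made up, and the 800x600 ARGB destination is just an example size.

```python
def bytes_migrated(ops, dest_bytes, tile_bytes, use_gradient_tile):
    """Count bytes copied between vram and sysram for a list of operations."""
    location = "vram"  # where the destination pixmap currently lives
    copied = 0
    for op in ops:
        if op == "gradient":
            if use_gradient_tile:
                # Gradient is drawn in software into the sysram tile; only
                # the tile is uploaded, the destination stays in vram.
                copied += tile_bytes
            elif location == "vram":
                # Software composition needs the destination in sysram.
                copied += dest_bytes
                location = "sysram"
        elif op == "fillrect":
            if location == "sysram":
                # The GPU can fill, so the destination moves back to vram.
                copied += dest_bytes
                location = "vram"
        else:
            raise ValueError("unknown operation: " + op)
    return copied

ops = ["gradient", "fillrect"] * 10
dest = 800 * 600 * 4   # hypothetical 800x600 ARGB destination
tile = 256 * 256 * 4   # 256x256 ARGB gradient tile
print(bytes_migrated(ops, dest, tile, use_gradient_tile=False))
print(bytes_migrated(ops, dest, tile, use_gradient_tile=True))
```

In this model ten gradient/fillRect pairs move about 38 MB without the tile and about 2.6 MB with it, and the copies that remain are uploads only, never readbacks.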

This workaround makes resizing the Netbeans main window a lot faster and also helps Nimbus, which uses gradients even more. The drawback is that once gradients are accelerated, we will do a useless composition step - plus the memory wasted for the buffer tile.

Here are some benchmarks before and after that change:
BezierAnim: ~55fps -> 355-600fps
GradAnim: ~30fps -> 180fps

Thursday, 17 July 2008

Netbeans...

Sorry for the flood of screenshots....


I profiled it a bit running on Xorg-1.5 and, for rendering, about:
- 30% was spent in text rendering
- 45% was spent in compositing (with about 30% spent in ram->vram migration caused by gradients)
- 2% was spent in lines (which are unfortunately unaccelerated)

Not that bad, keeping in mind that text rendering is currently being heavily worked on; however, I guess I have to work around the horrible gradient performance.
Scrolling in the editor already feels faster than with the X11 pipeline using XAA.

Better ...

Today I found two bugs which led to the corruption with Nimbus.
It looks quite good now; the visual differences are because I am running it on the OpenJDK7 codebase - maybe Synth has not been updated in JDK7 to work with the new Nimbus releases:




The first bug was when drawing images with transformations. The solution works (kind of) but is in my opinion very sub-optimal.
I am also fighting with something which could be an xorg-bug, I am not sure ... I prefer to blame my code before others ;)

Nimbus

Nimbus:
Yesterday I started to make my pipeline work well with Nimbus.
The first test results were mixed - rendering was correct but performance was really bad.
The reason was that Nimbus uses translucent VIs, and I completely forgot to allow this.
Well performance is a lot better now, but the result is not that appealing:


The Xorg profile looks quite good with the exception of gradients, which are currently not accelerated.
EXA seems to move all pixmaps out of vram if a single pixmap cannot be migrated to vram, which leads to about 70% of Xorg's time being spent moving pixmaps around.
Even if they don't accelerate gradients in hw, I hope they will do better than that in future.


Xorg Bug or feature:
I am not sure, but I guess I ran into another Xorg bug - or it's just a feature I don't see the use for.
In fact it makes my "life" quite hard :-/
http://thread.gmane.org/gmane.comp.freedesktop.xorg/30450/focus=30476

Tuesday, 15 July 2008

Artifact free Swing

After quite a lot of debugging I was able to find the cause of the Swing artifacts.
The problem was that Render also honors the clip mask when a pixmap is used as source; of course this completely messed up the double-buffering RepaintManager.

Now Swing applications look almost like they should:


(Boinc was crunching on both cores, so performance is not as bad as it seems ^^)

The last remaining piece to make common Swing apps usable is copyArea; I hope I get it done in the next few days. There are also other annoying and serious bugs, but this one will make Swing apps just work :)

After the most serious bugs I know about are fixed, I hope I can upload that "hack"-release of my pipeline to my OpenJDK repository, and start working on the final design.

Monday, 14 July 2008

Swing artifacts

Today I was working on removing show-stoppers from the existing pipeline.
I also investigated the problem with the artifacts I see when running Swing applications (unusable, most components only paint at mouseover).
Although I still don't know what's causing this bug, at least I now know where to start.
Clipping seems to be the problem: if I comment out clip validation almost all elements paint properly - I hope I can find some short code to reproduce the problem.

Well, here is a short overview of what has to be done for the "hack" pipeline:
- Transformed blit (to be consistent with Java2D's rendering)
- Scale-quality adjustment (for now everything is smooth-scaled)
- Swing artifacts
- Extra alpha for mask and texture fills doesn't work
- CopyArea unimplemented

Of course even more bugs will pop up when I solve the problem with corrupted Swing, but I hope that at the end of the week I can release a pipeline which works ok for most uses.
After this release I'll work on rewriting it, so that it can exist beside the X11 pipeline.

Sunday, 13 July 2008

Rewrite in Java?

Bugs and Glitches:
Well there are still some open issues I know about, and many small glitches which have to be fixed - however I hope that this will be done in the following two days and I'll start implementing the rewrite which hopefully will not be that ugly ;)

Rewrite in Java:
The current hack is more or less completely written in C (like the old pipeline); however, with the mask-batchbuffer work I actually realized how little interaction I really need with the C libraries.
Some time ago I had an email conversation with Roman Kennke about making the new XRender pipeline work with his java-only AWT implementation and the idea seemed really nice - however, the way I started with the pipeline made it rather impossible.

In theory all I need to do is draw lines and rects and play a bit with dest and source parameters; in fact it would also make batching much more useful ... bringing the design closer to the STR approach again ;)

Well, I still don't know whether I should go this route; the "deadline" is pretty close and any design mistakes could put me way behind schedule. After all ... it's only 3 weeks till then. If I could only have a little more time ;)

Friday, 11 July 2008

Line performance

Lines:
After a lot of profiling and tuning, line/draw performance is now quite ok; however, for single diagonal lines the overhead is quite high.
I have some further ideas for optimizations, but the project isn't really on schedule, so I'll focus on making things work ;) - and do those optimizations later.

The challenge was that EXA always ping-ponged the mask pixmap between sysram and vram, because it thought accelerating the fillRects() from drawScanline() was a good idea; actually it was a pretty bad one.
For now I have two masks, one I render the scanlines to and another for rendering the lines.
I first render the lines, blit the line-mask to the rect-mask, clean the line-mask by drawing the same lines again in black, and draw the rects to the rect-mask.
Sounds pretty horrible, but in the end this gave me the best performance at the expense of one additional accelerated blit.
It's pretty hard to keep the line-mask in sysram (not even a fillRect for clearing the pixmap is allowed); after studying EXA's source I found it uses a "dirty" concept. If I use fillRect, the native surface is marked "dirty" and the next time a software fallback happens EXA does a vram readback ... well, or something like this ;)

Yes, this relies pretty much on expected behaviour; however, it also works quite well with NVidia's proprietary driver - which is capable of accelerating lines. However, there is overhead of course.

Xorg-Bugs:
Well, after those optimizations were done, it worked well on Xorg-1.3, but completely sucked on Xorg-1.5, because of the performance bug already mentioned: http://bugs.freedesktop.org/show_bug.cgi?id=16647
and because of a new bug (so the report in fact covers two regressions).
A big thanks to Michel Dänzer who immediately replied and fixed the second bug, and to Eamon Walsh for working on the remaining one.

What's going on currently:
Well, I am trying to fix all the stuff I know about, especially the things where I know what causes the bug ;)
However, I still don't know where the Swing corruptions come from, maybe because copyArea is not implemented for now.

Well, here's what's going on for now:


Most of the Java2Demos already work fine, and with the hacks performance is ~ok.
I also worked around some performance bugs in Xorg when using Pictures without drawable: I simply don't use them anymore.
A kingdom for GL_LINE or whatever it's called ;)


What's next?:

Well, after some finishing work on the buffer code, I'll try to get a clean pipeline implemented that can finally be uploaded into the repository.

Thursday, 10 July 2008

fillOval performance

Today I started integrating the mask-batch-buffer bits I wrote over the last few days into the pipeline and did some benchmarking.

I tested with fillOval, because the old pipeline "simply" used XFillArc for it, whereas with XRender it is done by a general drawPath routine, so this should show how much the batching is worth:


       20x20     100x100    250x250    1000x1000
X11    2.44E7    1.44E8     3.7E8      1.92E9
XR     2.8E7     1.8E8      4.8E8      1.78E9


Except for very large ovals, the XRender pipeline is slightly faster, even in an area where the old X11 pipeline is pretty good. For lines and draw() I guess the picture will be different :-/
Using EXA or XAA makes no (large) difference; I guess the intel driver does not accelerate arcs/ovals anyway.

Wednesday, 9 July 2008

About Gradients and Traps...

Radial and Linear Gradients:


Today I started Linear- and RadialGradients (that stuff that was introduced in JDK6). As far as I know LinearGradients are quite complete, Radial Gradients are still missing some functionality.
I guess the Gradient-Stuff is almost complete now :)


No more Traps!
Lines cause a lot of trouble because Java relies on fast lines for shapes, whereas XRender simply does not support lines at all.
A line can be composed out of 3 trapezoids (and a huge amount of code and runtime work is needed to get it right); the traps generate a mask which can be used for composition - however, that's really slow for something which is expected to be almost a no-op. For now there is not even a single driver which is able to rasterize traps in hardware.
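Why 3 trapezoids? XRender trapezoids have horizontal top and bottom edges, so a wide line (which is just a rotated rectangle) has to be cut at the height of each of its four corners. The following Python sketch only computes those horizontal bands; the function name is made up, it assumes a line of non-zero length, and it ignores everything that makes the real code large (fixed-point coordinates, caps, pixel-exact rounding).

```python
import math

def trapezoid_bands(x1, y1, x2, y2, width):
    """Return the (top, bottom) y-ranges of the trapezoids covering a wide line."""
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)  # assumed to be non-zero
    # Normal vector scaled to half the line width.
    nx = -dy / length * width / 2.0
    ny = dx / length * width / 2.0
    corners = [
        (x1 + nx, y1 + ny), (x1 - nx, y1 - ny),
        (x2 - nx, y2 - ny), (x2 + nx, y2 + ny),
    ]
    # Every distinct corner height starts a new horizontal band.
    ys = sorted({round(y, 6) for _, y in corners})
    return list(zip(ys, ys[1:]))

print(len(trapezoid_bands(0, 0, 100, 50, 4)))  # sloped line
print(len(trapezoid_bands(0, 0, 100, 0, 4)))   # horizontal line
```

A sloped line has four distinct corner heights and therefore three bands (the top and bottom ones are triangles, i.e. degenerate trapezoids); a horizontal or vertical line collapses to a single rectangle.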

While playing with the mask-tile approach I had the idea to draw to my mask with core drawing, which has support for pretty lines.
This would remove the burden of all those trapezoids :)
In theory it should work very well; in reality it's not that bad: EXA doesn't provide driver hooks for lines at all, but because traps aren't accelerated either, nothing is lost - it's therefore as fast as the old pipeline on EXA.

The sad news is that it's quite slow with Xorg-1.5, but worked ok on 1.3, and it really flies with the proprietary nvidia driver (I guess it has hw line acceleration).
I filed a bug report, hoping that the situation can be improved.

I really hope I get the mask-tile stuff integrated soon, I am quite curious how it will perform.

Sunday, 6 July 2008

Java.net Project

Java.net Project:

I've created a java.net project, to allow participation and reviewing.
For now I don't use subversion, the "releases" are uploaded as tarballs - however I plan to change this soon.
The first release, called "0.0.1", is really a development-only preview; it will be completely rewritten soon, so I recommend not wasting your time reading it ;)
The project-page can be found here: https://xrpipeline.dev.java.net/


Text benchmarks:
Last time I benchmarked, Dmitri asked about benchmarks done with J2DBench, however I was not familiar with it and only had a look at it recently. What a great and helpful tool :)

I've done some text-benchmarks with it, here are the results:

           Plain      AA         LCD
X11        561,758    42,955     41,678
XRender    148,757    134,761    95,184

Results are in chars/s with a 12pt font drawn directly to the screen in 16-char blocks.
Currently there is some progress on speeding up Xorg's glyph compositing; I guess with the patches it should be possible to reach ~500,000 chars/s for all three cases: Link
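To make the table easier to read, the relative throughput can be worked out with a few lines of Python. This is plain arithmetic on the J2DBench numbers above, reading the dots in the original figures as thousands separators (an assumption, but consistent with the chars/s unit); the helper name is made up.

```python
# chars/s from the J2DBench table above
x11 = {"Plain": 561758, "AA": 42955, "LCD": 41678}
xrender = {"Plain": 148757, "AA": 134761, "LCD": 95184}

def relative_throughput(new, old):
    """Return new/old throughput per text mode."""
    return {mode: new[mode] / old[mode] for mode in old}

ratios = relative_throughput(xrender, x11)
for mode, ratio in ratios.items():
    print(f"{mode}: XRender reaches {ratio:.2f}x the X11 throughput")
```

So XRender is roughly three times faster for antialiased text and a bit more than twice as fast for LCD text, but reaches only about a quarter of the X11 throughput for plain text.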

I did some other benchmarks and with the old "architecture" it seems text is really the only piece which is faster when a solid color is used. Although I expected some slowdowns, they were sometimes horrible :-/
Just another sign which underlines the need for the batched-mask-rendering approach.

Thursday, 3 July 2008

Performance improvements

Performance: Today I played with the ideas I accumulated over the past two weeks or so (already mentioned earlier) to improve performance of scanline based drawing, and implementing the "Extra-Alpha" concept with XRender.

The first attempts were quite frustrating, because performance was not nearly as good as I had hoped.
I ended up somewhere in Xorg's software loops, spending 85% of total time in memcpy - which made me think FillRects to A8 is not accelerated by EXA or the intel driver (which would have pretty much killed my ideas).
However, it turned out to be an Xorg performance bug (at least as far as I can tell): https://bugs.freedesktop.org/show_bug.cgi?id=16600
Good to know about this limitation, maybe it can help improve the existing code a bit - thanks a lot to the guys at #xorg-devel for being that helpful and friendly :) (Still glad that I did not have to fire up GDB^^)

The results were as pleasant as expected; for the micro-benchmark I ran before (filling 6250 spans, 250 at a time) I got:
Rendering to explicit mask using XRenderFillRectangles: ~10ms
Rendering to destination, compositing scanline per scanline: ~50ms

As always when using masks there is a worst case, a 45° line - however, as far as I can guess the low performance in the scanline-per-scanline case is not caused by limited fill rate, but by the per-operation overhead, which seems far lower with FillRectangles. I guess the worse the driver, the more the new approach "benefits" compared to the old one - as long as it does accelerate FillRects to an A8 mask everything should be fine, with the exception of very old pre-3D cards which are not really able to run EXA anyway.
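Broken down per span, the two timings above look like this. It is just a division, sketched in Python with a made-up helper name; it says nothing about where the time goes, only how much each span costs on average.

```python
import math

def per_span_microseconds(total_ms, spans):
    """Average cost of one span in microseconds."""
    return total_ms * 1000.0 / spans

SPANS = 6250
mask_us = per_span_microseconds(10, SPANS)      # explicit mask, XRenderFillRectangles
scanline_us = per_span_microseconds(50, SPANS)  # one composition per scanline
print(f"explicit mask: {mask_us:.1f} us/span, per scanline: {scanline_us:.1f} us/span")
```

That is about 1.6 µs per span with the explicit mask against about 8 µs when every scanline is composited on its own, which fits the guess that per-operation overhead and not fill rate dominates here.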

Extra Alpha:
Another advantage of the new explicit-mask approach is that extra alpha can be done easily and is "for free" in the scanline/fillSpans case; in the line/path case there is one additional (usually accelerated) composition step compared to plain XRenderCompositeTrapezoids without extra alpha.

Tuesday, 1 July 2008

Over..

- All exams are done, university is finally over and I can concentrate on my project.

- Rawhide (the development version of Fedora) almost works ... using an old kernel it's even able to run sysprof.
Performance is much better than with the Fedora8 based system I am running currently; however, EXA still moves pixmaps around like a mad cow.
I hope I can get a pre-release of the GEM intel driver working soon, so that I will be able to target the latest improvements and not optimize for already dead code.

- I plan to create a java.net project where I will upload all the code which currently exists. Well, it's mostly an unusable hack, but I hope this will change soon.


Looking forward to a few exciting (and hopefully not too short) weeks :)

Wednesday, 25 June 2008

Fighting with performance...

The current pipeline performs a lot better than the old one when running on EXA (the new default acceleration architecture of Xorg); however, the old pipeline running on XAA (almost no acceleration, pixmaps always in sysram) is faster than everything else.
Of course XAA won't count for much longer, as some distributions have already switched to EXA by default for many drivers (Ubuntu, Fedora9); however, it looks a bit odd to have an old unaccelerated pipeline which is faster than the new accelerated one ;)

I already have some ideas how to speed up fills in general (for strokes it's not that easy unfortunately), and I am quite interested in how it will work out. Furthermore it would also solve the problem with extra alpha.

For fills the approach could be like this:
- Have a mask pixmap with a fixed size (e.g. 512x512), A8 format
- Render geometry to this pixmap using XRenderFillRectangles, which is itself hw-accelerated and *really* fast
- (apply extra alpha to the mask image if necessary)
- Composite with the specified Texture- or GradientPaint. (For colors we can directly paint to the surface).

This approach would introduce some tiling (if the shape is larger than the fixed-size mask pixmap), but it has quite some benefits:
- The mask is explicit, so we have control over its content (helpful for extra alpha)
- Always using the same mask removes the need to allocate implicit masks every time (but I guess Xorg optimizes this anyway)
- Rendering rectangles is really fast, and EXA supports this operation in HW. Trapezoids are currently rendered in software and then sent down to VRAM.
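The tiling mentioned above could be sketched roughly like this. It is only an illustration in Python, not pipeline code: the function name `mask_tiles` is made up, and it simply splits a shape's bounding box into pieces that each fit into the fixed-size mask.

```python
MASK_SIZE = 512  # fixed mask-pixmap edge length, taken from the example above


def mask_tiles(x, y, width, height, tile=MASK_SIZE):
    """Yield (tx, ty, tw, th) rectangles covering the given bounds,
    each no larger than tile x tile pixels (hypothetical helper)."""
    for ty in range(y, y + height, tile):
        th = min(tile, y + height - ty)
        for tx in range(x, x + width, tile):
            tw = min(tile, x + width - tx)
            yield (tx, ty, tw, th)


# A 1000x600 shape needs four passes through the same 512x512 mask.
tiles = list(mask_tiles(0, 0, 1000, 600))
for t in tiles:
    print(t)
```

Each tile would then go through the render-to-mask and composite steps separately, re-using the same mask pixmap every time.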

For MaskFills I will experiment with uploading the image to X itself; it seems the lock/getrasinfo/unlock functions introduce quite some overhead (C->JNI->Java calls, locking and an XSync).
Furthermore ~30% of CPU cycles when doing MaskFills are spent in malloc although I don't malloc anything.
I hope I get sysprof working to see where all the cycles are wasted.

Monday, 23 June 2008

University

The next week or so I'll be quite busy with university :-/
However there will be almost unlimited time to code afterwards ;)

Friday, 20 June 2008

Java2D uses scanlines internally to fill transformed shapes, and because those shapes are used a lot I did some benchmarking today on my laptop (C2D, i945GM) to investigate which way of processing those scanlines is most efficient.

The two things I compared, each rendering a 100 pixel wide, 1 pixel high scanline 2500 times:
  • Draw the scanlines as trapezoids to an alpha mask, and later do one
    composition step using that mask (XRender generates an implicit alpha
    mask internally).
  • Render the scanlines one-by-one, without any mask.
I also tested how much batching solid color fills is worth, to see whether it would be worth optimizing in this direction. I compared Fedora 8 to Fedora 9, because EXA and the Intel driver were in quite a bad shape in Xorg-1.3.

1.) Fedora9 x86_64, Xorg-1.5, intel-master with TTM:

Intermediate mask type   Time
A1 (1 bit alpha)         14ms
A8 (8 bit alpha)         4ms
Separate rendering*      16ms

* Separate rendering = Many width/1 independent calls to XRenderComposite
* No mask = XRenderCompositeTrapezoids with masktype None


On this machine rendering to an A8 mask and compositing with that yields
best results.

2.) Fedora 8 i386, Xorg-1.3, Intel-2.1.1, EXA:


Intermediate mask type   Time
A1 (1 bit alpha)         100ms
A8 (8 bit alpha)         120ms
No mask*                 40ms
Separate rendering*      13ms

Way of filling           Time
Batched Solid FillRects  1ms
Batched Alpha FillRects  14ms
Single Solid FillRects   8ms
Single Alpha FillRects   35ms


Conclusions:
The results for composition are quite surprising.
* The mask-based approach performs terribly on Fedora 8, although I thought this technique should be quite hardware/driver independent. Most of the time is spent inside libfb.so (arrrg, no symbols); maybe the driver is completely falling back to software, or mask generation is really that slow.
* The many-and-small area composition approach performs quite similarly on both systems.
* On Fedora 9, as expected, using an A8 intermediate mask yields the best results, being 4 times faster than rendering many small pieces one-by-one.
* Batching solid color fills seems to speed things up a lot.
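
To put numbers on these conclusions, the ratios can be worked out directly from the tables. The snippet below is just arithmetic on the measured timings; the dictionary and variable names are only for illustration.

```python
# Timings in milliseconds, copied from the tables above.
fedora9 = {"A1": 14, "A8": 4, "separate": 16}
fedora8 = {"A1": 100, "A8": 120, "no mask": 40, "separate": 13}
fills = {"batched solid": 1, "batched alpha": 14,
         "single solid": 8, "single alpha": 35}


def speedup(slow_ms, fast_ms):
    """How many times faster the second timing is than the first."""
    return slow_ms / fast_ms


# Fedora 9: A8 mask versus rendering the pieces one-by-one.
f9_mask_gain = speedup(fedora9["separate"], fedora9["A8"])
# Fedora 8: here one-by-one rendering wins against the A8 mask.
f8_separate_gain = speedup(fedora8["A8"], fedora8["separate"])
# Batching versus single calls, for solid and for alpha fills.
solid_batch_gain = speedup(fills["single solid"], fills["batched solid"])
alpha_batch_gain = speedup(fills["single alpha"], fills["batched alpha"])

print(f9_mask_gain, round(f8_separate_gain, 1), solid_batch_gain, alpha_batch_gain)
```

So the mask wins by a factor of 4 on Fedora 9 but loses by roughly a factor of 9 on Fedora 8, and batching is worth a factor of 8 for solid fills and 2.5 for alpha fills.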

I'll try the benchmark on other HW to see whether Xorg-1.3 is the reason for the slow masking (no problem), or whether the driver has to be highly tuned (quite bad).
I would prefer high performance across different drivers (not all GPUs will have highly optimized EXA drivers) over peak performance on some cards.
Hopefully I'll find some recent live CD with Xorg-1.5 and Nouveau included to test some NVidia hw ;)

Thursday, 19 June 2008

JavaDeus (and Gradients)

JavaDeus:

... was really great!
The sessions were quite interesting, the location was nice and catering was excellent ;)
I took some photos, however not really good ones: http://picasaweb.google.com/linuxhippy/JavaDeus
(It was lunch-time when I took those pictures, the room was not half empty during the sessions ;) )


Gradients:
During the Swing-App-Framework session I couldn't resist and started to implement Gradients.
For now only old-school gradients without transformation are supported (hey, but with repeat and without ;)). Adding the rest should not cause much trouble, if XRender provides compatible implementations for the other gradient types too:



Left = XRender, Right = RI
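
What "with repeat and without" means for a two-color gradient can be sketched as follows. This is a Python illustration only, and it assumes that "repeat" refers to the cyclic mode of java.awt.GradientPaint, where the colors run back and forth between the two points; the function names are made up.

```python
def gradient_position(px, py, x1, y1, x2, y2, cyclic):
    """Return t in [0, 1]: 0 means the first color, 1 the second."""
    dx, dy = x2 - x1, y2 - y1
    # Project the pixel onto the line through the two gradient points.
    t = ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)
    if cyclic:
        # Triangle wave: 0 -> 1 -> 0 -> 1 ... (assumed cyclic behaviour)
        t = t % 2.0
        return 2.0 - t if t > 1.0 else t
    # Acyclic: outside the two points the end colors are used.
    return min(1.0, max(0.0, t))


def mix(c1, c2, t):
    """Linearly interpolate two (r, g, b) colors."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))


# A pixel 50 units past the second point of a horizontal gradient:
pad_t = gradient_position(150, 0, 0, 0, 100, 0, cyclic=False)
cyc_t = gradient_position(150, 0, 0, 0, 100, 0, cyclic=True)
print(pad_t, cyc_t, mix((0, 0, 0), (200, 100, 50), cyc_t))
```

Without repeat the pixel simply gets the second color; with repeat it is already halfway back towards the first one.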



I also had a nice idea how to improve performance in the transformed (>scale) case without changing the way Java2D itself works - after all XRender is all about creating masks and blitting with them. So I could just call XCompositeRectangles instead of just Composite.
This however introduces the extra-alpha implementation problem I mentioned in even more situations, and I still don't know how to fix that. Time hopefully will show ;)

Wednesday, 18 June 2008

TexturePaint

Today I got the first bits of TexturePaint working:

It performs quite OKish with XAA, I haven't tried it with EXA so far. It currently only works for fills; draw() and lines still give me quite a struggle:



I am also sure it's not 100% correct ... the work needed to make it pixel-perfect compatible with Sun's software implementation quite frightens me ;)
The current Java2D approach for transformed (>scale) TexturePaints is to fill them using strides, which means drawing one-pixel-wide strips one after another, however on real hardware many small operations are quite expensive. I am curious how it would perform using a MaskFill based approach.

I'll try to get GradientPaints working too, and then find all the stuff which still does not work.


JavaDeus 08:
Tomorrow I'll attend JavaDeus '08 in St.Pölten. It's like a very small JavaOne for Austria, and according to Sun some guys from JavaOne will give their talks there too.
Sun provides a free shuttle bus from Vienna to the FH St.Pölten, the only drawback is that the bus leaves at 8:00 am ;)
I am sure it will be really cool :)

Sunday, 15 June 2008

What's left ... and further project plan

Project Plan:
Well, it's now mid-June and it's not that long until August 4th.
I've written down some thoughts on how I plan to continue with the project.

First large pieces which are still left:
- Paints (other than Color/AlphaColor)
- XOR (scares me^^)
- Nice, and correct lines
- Maybe some fast-paths for often-used primitives
- Bugs, bugs, bugs, bug implementation, special cases, .....

When those things are done, the pipeline is basically feature-complete as far as I can see, but of course there's still a lot to do.
For now it's really a dirty hack, a typical proof-of-concept - it simply does not care about corner cases and is full of bugs and missing pieces.
I simply hacked on, totally screwed up and broke the existing X11 pipeline, which is bad because ... well ... this is *working* code and mine isn't for now ;)

Short term project plan:
My plan is to re-implement everything done so far starting with a new, untouched OpenJDK source tree, and to implement it, at least on the Java side, independently from the existing X11 pipeline; the native side will share quite a lot of code.
I'll upload the resulting code to a java.net project, the current code is so messy and broken that it's not useful for anything more than implementing new features.
I hope I can start with the new, clean implementation at the end of June, that's when all my exams at university are behind me.

Goal: The resulting code should be able to run many (most?) important applications with good performance, and it should be clean.
I also hope Dmitri will be patient in helping me find the "correct" way for the things which I have only hacked together for now. Thanks for all your help so far, Dmitri!

Long term project plan:
My long-term goal (of course) would be the integration into (Open-)JDK7.
I don't think that the code will be ready for integration on August 4th and I hope nobody expects that.
After all it's a large project which interfaces with a lot of project-"external" code, and there's a lot of testing, corner-case fixing and dealing with bad driver/Xorg implementations and so on. I simply would like to get everything done "right", with a lot of testing and performance evaluation on different hw platforms.

So after 4th of August I'll ..... go on holidays for at least two weeks ;)
I guess I'll need some recovery to not lose the fun of working on this; after that I'll of course work on finishing the pipeline.

I hope the code will be ready soon enough for JDK7, because I think the timing would be perfect: EXA has become really useful with Xorg-7.4 (a prerelease ships with Fedora 9) and drivers finally start to provide good EXA performance:
- The Intel driver works very well when one of the experimental branches is used ;)
- The open-source Radeon drivers also seem to be in good shape now.
- NVidia is working on improving their XRender implementation after many user complaints about bad performance.

So for now the only driver which doesn't accelerate XRender on modern hardware (and won't soon) is the proprietary AMD/ATI driver. I don't expect the new pipeline to be that slow without acceleration, however there won't (except for text) be large speedups (and there will be some slowdowns) compared to the old pipeline. Furthermore I expect the proprietary AMD drivers to be able to run the OGL pipeline anyway, so those users won't be sad at all ;)

Wow that was quite a lot off-topic stuff...

-----------------------------------------------------------------------------------------------------------------------------------
Compositing:
Last but not least the usual image:

Compositing works :)
As you can see Swing is really broken, so while most demos look kind of OK, typical Swing apps are totally unusable for now :-/
I still have to figure out an efficient way to implement Extra-Alpha for operations which require an implicit mask (lines and line-related stuff like stroking).
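
For a mask whose pixels are accessible, applying extra alpha is just a per-pixel multiply. The sketch below shows that arithmetic on the CPU in Python for an A8 mask; the function name is hypothetical, and it is not a solution for the implicit masks XRender builds internally, since those cannot be touched this way.

```python
def apply_extra_alpha(mask, extra_alpha):
    """Scale every A8 coverage value (0..255) by extra_alpha (0.0..1.0)."""
    ea = round(extra_alpha * 255)
    # (v * ea + 127) // 255 is an integer multiply with rounding.
    return bytes((v * ea + 127) // 255 for v in mask)


mask = bytes([0, 64, 128, 255])     # one row of coverage values
half = apply_extra_alpha(mask, 0.5)
print(list(half))
```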

Friday, 13 June 2008

Positions work now too...

Absolute glyph positions work now too ... even better, it's consistent with the software pipelines ;)
And as always ... a boring picture with some random positions:

Thursday, 12 June 2008

subpixel antialiased text:

1. Thanks to Phil Race I was able to get subpixel-antialiased text mostly working:

1. Subpixel antialiased - Reference Implementation
2. Subpixel antialiased - XRender
3. Grayscale antialiased - XRender
4. No antialiasing - XRender

When you compare the RI (first line) and XRender (second one) you'll notice that XRender looks a bit bolder. This is due to missing gamma correction, which is done by the software loops to match MS Windows' behaviour, but I am unsure whether XRender supports this feature.
If it does not, I hope quality will be good enough this way to enable it by default (at least it's good enough for all other X-based apps); with accelerating drivers performance is for sure >500% better than with the software-based approach.

Thanks again Phil for being that helpful :)

2. Another thing I did today was making my glyphcache play nice with other JDK-internal classes.
When I implemented it I chose an array-based approach over the linked-list one, however I use realloc, and GlyphInfo holds a direct pointer to its cache entry, which is then left dangling.
The array-based approach also has its benefits (e.g. better cache locality), so I now simply update those pointers after doing a realloc (which should almost never happen).
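
The pointer fix-up can be illustrated with a small simulation. This is Python with plain integers standing in for addresses, not the actual native code; the class names mirror the description above, while `ENTRY_SIZE` and the addresses are invented for the example.

```python
ENTRY_SIZE = 16  # hypothetical size of one cache entry in bytes


class GlyphInfo:
    def __init__(self, code):
        self.code = code
        self.cache_addr = None  # simulated pointer to this glyph's cache entry


class GlyphCache:
    def __init__(self, base, capacity):
        self.base = base        # simulated address of the entry array
        self.capacity = capacity
        self.glyphs = []        # glyphs[i] owns entry i

    def entry_addr(self, index):
        return self.base + index * ENTRY_SIZE

    def add(self, glyph):
        if len(self.glyphs) >= self.capacity:
            raise MemoryError("cache full, grow() first")
        self.glyphs.append(glyph)
        glyph.cache_addr = self.entry_addr(len(self.glyphs) - 1)

    def grow(self, new_base):
        """Simulate realloc moving the array, then repair all back-pointers."""
        self.base = new_base
        self.capacity *= 2
        for i, glyph in enumerate(self.glyphs):
            glyph.cache_addr = self.entry_addr(i)


cache = GlyphCache(base=0x1000, capacity=2)
a, b = GlyphInfo("a"), GlyphInfo("b")
cache.add(a)
cache.add(b)
stale = a.cache_addr            # where entry 0 lived before the move
cache.grow(new_base=0x8000)     # the array "moves"; pointers get updated
print(hex(stale), hex(a.cache_addr), hex(b.cache_addr))
```

Without the loop in `grow()`, `a.cache_addr` would still hold the stale address after the move, which is exactly the dangling pointer described above.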

What's missing for text acceleration:
  1. Support for absolute positioning, which should not be too much work because most of the existing code is prepared to support it.
  2. Allocation checks, so that nothing breaks if a memory allocation fails.
  3. Gamma correction, although I am unsure if this can be implemented on top of XRender at all.

MaskFill working

Yesterday I finally got MaskFills working:


It took me quite some time until I figured out how to upload the mask tiles using the lock/getrasinfo/unlock functions, and I guess my code is full of wrong assumptions ;) ... but it's working quite well.

Performance with XAA (no acceleration) is most of the time on par with the existing implementation (there seem to be some performance bugs, sometimes it's fast and sometimes not), while performance on EXA is far better. That is the big benefit of this approach for accelerating drivers (EXA), where the old approach had a large penalty because of the readbacks it used.

Although intermediate mask images are still used to render AA shapes and fills (instead of using XRender's AA capabilities), there's now no download from the X server, and the upload is almost 1/3 of what it was. Furthermore there is currently no EXA driver which is able to accelerate geometry (they use an approach similar to Java's to render AA traps).

Today I'll play with LCD text again and have a look at MaskBlit.

Tuesday, 10 June 2008

MaskFill / MaskBlit

LCD-Text:
Today I postponed the LCD text stuff. I don't know how to get any further, so I posted a question to the 2d-dev list.
I hope I don't bother them too much. Often something is not clear, but after asking about it I spend some time with the code in a less stressed way and find the answer myself.
However I am quite sure this time I won't be able to answer it myself ;)

State-Management:
Today I implemented some simple state-management.
It differs a lot from how X11 did it (there the state was held by the GC), because with XRender the state is spread over all possible places.
Solid color is sometimes used as XRenderColor, sometimes as Picture (for composition operations), clip is on the dst surface, transformation only on src.

For now I implemented only color values, however I guess later I'll have to do some kind of solid-fill-src-surface caching.
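
As an illustration of what tracking "color values" involves: XRenderColor carries four 16-bit channels, while Java hands over a packed 32-bit ARGB value. The conversion could look roughly like the Python sketch below. The function name is hypothetical, and the premultiplication step is an assumption based on Render working with premultiplied alpha.

```python
def argb_to_xrender_color(argb):
    """Convert a 32-bit non-premultiplied ARGB int into 16-bit
    premultiplied (red, green, blue, alpha) channel values."""
    a = (argb >> 24) & 0xFF
    r = (argb >> 16) & 0xFF
    g = (argb >> 8) & 0xFF
    b = argb & 0xFF

    def premul16(c):
        # Premultiply in 8 bits with rounding, then widen 0..255 to 0..65535.
        return ((c * a + 127) // 255) * 257

    return (premul16(r), premul16(g), premul16(b), a * 257)


opaque_red = argb_to_xrender_color(0xFFFF0000)
half_red = argb_to_xrender_color(0x80FF0000)
print(opaque_red, half_red)
```

Multiplying by 257 maps 0..255 exactly onto 0..65535, so fully opaque stays fully opaque after widening.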

MaskFill / Blit:
MaskFill / Blit are really important pieces for antialiased rendering.
Today I looked a bit at how it was done in the OGL pipeline and fought with creating my own instance of XRPixmapSurfaceData without allocating a VI.

The reason why this is helpful is that for X11 the up/download of data is more complicated than for OGL (where a simple glTexSubImage2D is all you need), and there are several ways of passing data down to X.
Of course the whole code is already there from the old X11 pipeline, but to be able to re-use it I need the XSDO structure it depends on.
Well, it was easier than I thought, and I removed some ugly hacks I did some time ago.

So tomorrow will be MaskFill day :)