Tuesday, 8 December 2009

Large fill optimizations

Frustrated with studying for exams, I fixed and enabled an optimization for large fills which I had implemented some time ago but which wasn't working properly.

The results are quite pleasing:


At least for large operations Jules is now really close to ductus, although it's burning significantly more CPU cycles by using two threads.
I am still not sure the whole multi-threading is really a good idea; however, the fill optimization mentioned above also applies to the single-threaded version already released :)

Tuesday, 10 November 2009

Multi-Threading Jules

I played with the idea of making Jules multi-threaded, by rasterizing the trapezoids using multiple threads.
For large shapes it does improve things quite a bit (on my Core2Duo notebook):



Hey, in one case Jules now even beats ductus :)

However it reminded me again how hard it is to write well-performing multi-threaded code in the real world.
The outcome is a 2-thread producer/consumer implementation, where the consumer can produce for itself if it has caught up with the producer.
The first implementation (multi-unopt) simply fetched idle tiles from one pool and added rasterized tiles to another one, with a volatile variable for communication. That makes 4 uncontended monitorenter/exit + volatile read/write.
With that approach the synchronization overhead ate up almost all of the speedup; in some cases I even saw ugly regressions.

I now batch fetch/store from the idle/completed tile lists in order to minimize synchronization costs, as well as let the worker start a bit ahead of the consumer.
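
To illustrate why batching helps, here is a minimal Java sketch. It is not the actual Jules code: the class BatchedTilePool and its methods are made up for this example, and tiles are simply int arrays. The point is that a whole batch of tiles moves under one monitor enter/exit instead of one per tile:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tile pool: one lock acquisition moves a whole batch of tiles. */
final class BatchedTilePool {
    private final ArrayDeque<int[]> tiles = new ArrayDeque<>();
    private int lockAcquisitions; // counts put/take calls, for illustration only

    /** Adds all tiles of a batch under a single lock. */
    synchronized void putAll(List<int[]> batch) {
        lockAcquisitions++;
        tiles.addAll(batch);
    }

    /** Removes up to max tiles under a single lock; may return fewer. */
    synchronized List<int[]> takeUpTo(int max) {
        lockAcquisitions++;
        List<int[]> batch = new ArrayList<>(max);
        while (batch.size() < max && !tiles.isEmpty()) {
            batch.add(tiles.poll());
        }
        return batch;
    }

    /** Number of putAll/takeUpTo calls so far. */
    synchronized int lockAcquisitions() {
        return lockAcquisitions;
    }

    synchronized int size() {
        return tiles.size();
    }
}
```

With 64 tiles and a batch size of 16, this pool needs 5 put/take lock acquisitions (1 put, 4 takes), where a per-tile put and take would need 128.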

I am still not sure whether the whole effort makes much sense; it makes the whole thing quite complex.
Still, it was fun ;)

I wonder how much speedup can be achieved by using profile-driven optimizations and -O3 when compiling pixman and Jules. I guess around 10-20%, which could be enough to push threaded Jules in front of ductus.

Thursday, 5 November 2009

"Jules" cairo binding 0.0.1 release

I've prepared a webrev/patch of the XRender Java2D backend based on jdk7-master. It also includes a preview of "Jules", a cairo based RenderingEngine implementation, for faster antialiased rendering.

Please note Jules is more or less a proof of concept; especially the native code is ugly, full of dirty assumptions, and it's probably not 64-bit clean.
However it runs Java2Demo quite well and is usually a good deal faster than pisces even when rendering to software surfaces (and a *lot* faster when rendering to XRender surfaces), especially with the client JVM.

Because of the need for a modified version of cairo and its complex build system, I've separated the native components and build them independently from OpenJDK - falling back to pisces when loading fails.

Howto:

1.) Download
The webrev can be downloaded from:
http://93.83.133.214/webrev-xrender-jules-0.0.1.zip

The native jules/cairo code is located at:
http://93.83.133.214/jules-0.0.1.tar.gz

2.) Patch
To patch your local jdk7 repository, just do
> hg import ~/webrev/jdk.patch
or if your hg has that option:
> hg import --no-commit ~/webrev/*.patch

3.) Build
- Build your JDK as usual
- Compile Jules by running build_jules.sh (some makefile guru here?) ;)
- Copy libjules.so into build/linux-i586/lib/i386 or elsewhere, where it can be found by the JVM

4.) Use it
Simply adding "-Dsun.java2d.renderer=sun.java2d.jules.JulesRenderingEngine" should do the trick :)
In addition you can enable the xrender-backend too, "-Dsun.java2d.xrender=True".
If everything works you should see:
XRender pipeline enabled
Jules library successfully loaded


Known problems:
- Some clipping problems when rendering to software surfaces
- Paints get wrong transformation when rendering to XRender surfaces

Last but not least the usual screenshot:


Update: Unfortunately my Nokia 770, where the stuff is hosted, went down. It's up and running again.

Sunday, 1 November 2009

cairo integration continued...

The cairo guys made it clear that the functions I use are subject to change at any time and won't be exported.
So I'll build a custom cairo version as a shared library under a different name, which will include the JNI binding code.

That will:
- Not change the OpenJDK build process (just add a few pure-java files to it)
- Provide native code that fulfills all the assumptions I need.
- Allow distributors to package it as an extension library, if they wish to enhance performance of OpenJDK.
- Allow us to detect at runtime whether the native rasterizer is available, and fall back to pisces if not.
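
The runtime detection in the last point could look roughly like the following Java sketch. The class RasterizerSelector is hypothetical, and "jules" and "pisces" are used only as labels here; the real selection code may work differently:

```java
/** Hypothetical selector: prefer the native rasterizer, fall back to pisces. */
final class RasterizerSelector {

    /** Returns true if the named native library could be loaded. */
    static boolean nativeAvailable(String libName) {
        try {
            // Throws UnsatisfiedLinkError if the library is not on java.library.path.
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    /** Picks a rasterizer label depending on whether the native library loads. */
    static String select(String libName) {
        return nativeAvailable(libName) ? "jules" : "pisces";
    }
}
```

Because a failed load only results in the fallback label, a missing library costs nothing more than the slower pure-Java path.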

Unfortunately it will also waste a few KB of your disk ;)

Thursday, 29 October 2009

What to do with the Cairo rasterizer?

I have the cairo-based rasterizer on my disk, and to be honest I don't really know what to do with it.
I'll release it hopefully soon ;)

Pros:

- A lot faster than pisces, and could be sped up further by some optimizations as well as multithreading.
- Runs Java2Demo perfectly, even gets some things right pisces fails at.
- Better image quality than pisces

Cons:
- Uses private cairo symbols
- Uses a modified version of cairo (one line commented out).
- Writes into C structures from Java-code
- Currently depends on the XRender pipeline code a bit.

There are some really dirty assumptions in that code that could break from one cairo version to another, but they are there for performance reasons. The cleaner you make it, the more performance you lose.

Actually I don't see the modified cairo version as a big problem in itself: if I hadn't used cairo and had written a rasterizer from scratch in native code, that code would be there too - and nobody would care about it. The problem, however, is that cairo's codebase is huge, and of course nobody wants to integrate a private version of it.

Any ideas how to "solve" that dilemma? I doubt the cairo guys would open up their interface, at least not to the level I am using it.

Tuesday, 20 October 2009

Mercurial struggles

Today I fixed the cairo bindings to correctly generate cubic Bézier curves from quadratic ones.
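
For reference, the conversion is needed because cairo's path API only offers cubic curves. The Java sketch below shows the standard degree-elevation formula (control points at two thirds of the way from each end point towards the quadratic control point). It is an illustration only, not the binding code, and the class and method names are made up:

```java
/** Sketch: exact conversion of a quadratic Bezier segment into a cubic one. */
final class QuadToCubic {

    /**
     * Input:  {x0, y0, qx, qy, x2, y2}, where (qx, qy) is the quadratic control point.
     * Output: {x0, y0, c1x, c1y, c2x, c2y, x2, y2}.
     */
    static double[] elevate(double[] q) {
        double x0 = q[0], y0 = q[1], qx = q[2], qy = q[3], x2 = q[4], y2 = q[5];
        double k = 2.0 / 3.0;
        return new double[] {
            x0, y0,
            x0 + k * (qx - x0), y0 + k * (qy - y0), // C1 = P0 + 2/3 (Q - P0)
            x2 + k * (qx - x2), y2 + k * (qy - y2), // C2 = P2 + 2/3 (Q - P2)
            x2, y2
        };
    }

    /** Evaluates one coordinate of a quadratic Bezier at parameter t. */
    static double quad(double p0, double p1, double p2, double t) {
        double u = 1 - t;
        return u * u * p0 + 2 * u * t * p1 + t * t * p2;
    }

    /** Evaluates one coordinate of a cubic Bezier at parameter t. */
    static double cubic(double p0, double p1, double p2, double p3, double t) {
        double u = 1 - t;
        return u * u * u * p0 + 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t * p3;
    }
}
```

Evaluating both curves at the same parameter values gives the same points up to rounding, which is a handy sanity check for this kind of binding code.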

I don't see any correctness bugs running Java2Demo anymore, at least when rendering to BufferedImages - rendering to VolatileImages triggers some problems in an optimized path when using Texture/Gradient paints.

However, I have real trouble publishing the code. I am working on OpenJDK revision 1421, but how do I get the latest JDK without destroying my work? I executed "hg update -C" some time ago, and all files I had altered were overwritten by the files in the repo.
I've already given up maintaining my personal xrender repo, because it caused me even more trouble.
I would be glad if some Mercurial gurus could assist me.

Saturday, 12 September 2009

Ductus vs Cairo vs Pisces

Today I got the AATileGenerator based on cairo into a usable state. This way software rendering is done by cairo instead of the pisces rasterizer.

The relative results are quite pleasing:

On the positive side, most benchmarks improved by a factor of 2-3x :).
However cairo is no cure-all. The ductus rasterizer used in Sun's proprietary builds is still the winner in all tests.

Most likely this is the result of cairo being a two-step design:
First a path is tessellated into trapezoids, and later those trapezoids are rasterized, whereas ductus rasterizes directly.
However, a positive side effect of this design is that trapezoid rasterization can be parallelized, so on dual-core systems it should be possible to get close to ductus.

Tuesday, 8 September 2009

Running Java2Demo on cairo's tessellator

I've just sorted out the worst problems, and it's now possible to run Java2Demo using the cairo-based rasterizer. The hardest part was a small bug in cairo, triggered by the unusual way cairo is used here.

Performance is excellent compared to pisces, and quite good compared to ductus.
By implementing more specific functionality (not only draw/fill) it should be possible to get even better performance for common cases like arcs or ovals.


A drawback is that this rasterizer backend needs a private version of cairo, because it's tapping a lot into cairo internals. Hopefully this will find its way into IcedTea some day...

Friday, 4 September 2009

Using cairo for antialiasing

I've written some glue code to use cairo's tessellator to generate trapezoids for antialiased rendering, which can be fed directly to the X server. With a little more work, pixman could be used to generate AA tiles which could also be fed to the software loops (e.g. BufferedImage rendering) - replacing the slow pisces rasterizer currently used by OpenJDK.

For now I've only implemented the (simple) fill operation; stroking/drawing is still missing. However, the results already look quite good:

Boring background:
Currently both OpenJDK and Sun's proprietary JDK builds implement antialiased rendering by computing coverage values using C/Java code on the CPU, uploading those coverage values to the GPU, and finally blending on the GPU.

D3D/OpenGL drivers are usually highly optimized, so this works quite well there. For XRender however, especially with the Intel drivers (*cough*), this results in quite low performance, because every operation involves a syscall + a full GPU flush.
Furthermore when using OpenJDK, the rasterizer generating those coverage values is horribly slow, so AA rendering to any surface suffers.

Saturday, 1 August 2009

Traffic shaping on the Nokia 770

Thanks to the Mer project (Ubuntu-9.04 for ARM5TEJ + some tablet specific optimizations) I can now revive my Nokia 770 to replace the old P90 server. Because of its low power consumption I can keep it running 24/7, hosting a tor relay to make use of otherwise unused traffic, and using traffic shaping so that tor doesn't interfere with the "real" workload.

However, the 770's WLAN driver is a binary blob (umac.ko), and enabling certain networking features (like the QoS needed for traffic shaping) changes the size of the sk_buff structure, which breaks it.

To be able to enable traffic shaping (and other features), you can steal some memory from the control buffer (cb):
include/linux/skbuff.h, line 254:
char cb[46]; // -2: shrunk by two bytes to make room for tc_index
__u16 tc_index; // two bytes, taking the space freed in cb


This is really ugly and unsafe, may lead to real trouble and could cause whatever disease you can think of.
However, it seems both WLAN networking and traffic shaping work perfectly fine now :)

Wednesday, 29 July 2009

Shame on HP

Shame on HP for not phasing out PVC and BFR, although other manufacturers clearly demonstrate it can be done!
In 5 years all those computers built today will rot in China and Africa, polluting our environment.

Kudos to Captain Kirk ;)

Monday, 6 July 2009

Driver bugs (and how quickly they _can_ be fixed)

Recently I found exactly the same driver bug in RadeonHD on my RadeonHD3850 that I had reported for Intel <= i945 almost a year ago. The bug probably would not be worth mentioning, except that Tungsten Graphics' Roland Scheidegger fixed it 3 weeks after I reported it.

The problem seems to be an off-by-one-half problem, causing images which are scaled on the x-axis to be also blurred on the y-axis. The first line is rendered with pixman (reference), the second with Intel-2.8.0pre on my i945GM:
No big deal, but unfortunately it makes Nimbus look pretty ugly :-(

There is quite a long list of driver bugs and performance problems I reported against the Intel driver in the freedesktop bugzilla. So far a single pipeline-related bug has been fixed (and only for i965 and up, not for the i945 I call my own). Even for bugs causing image rendering corruption in Firefox nothing happens.
On top of the fact that the last two stable releases (2.6 & 2.7) can hardly be called "stable" and are driving distributors crazy, there are major performance regressions.

Hopefully those problems will be resolved soon, although the past has told a different story.

Thursday, 25 June 2009

Only good news ;)

Tomorrow my holidays begin, right after I give my bachelor thesis presentation at the computer graphics institute at the Technical University of Vienna and sit an exam two hours later.
It's basically the same presentation I gave at Fosdem09, spiced up a bit and with the most boring parts removed. Needless to say I much prefer a 15min talk over a 30min talk ;)
A big thank you to Hans-Peter Bernhard, my former informatics teacher, who made all this possible.

Furthermore the xcb guys seem to have found the reason why the pure-java backend blocked (thanks a lot to Jamey Sharp). I haven't had the time to try it out yet; hopefully it will solve the problems and finally make the pure-java backend usable.
If somebody told me that tiered compilation would be enabled in JDK7 for sure, along with saturated cast intrinsics (6850613), my day would be perfect ;)

So far ... only good news :)

Saturday, 6 June 2009

Rewrite ... finally

Finally the rewrite version is available in the repo.
It is not production ready and still has a few known bugs.
At least it builds ;)

It took me ages to start fighting with Mercurial, but it turned out to be a lot less horrible than expected.

The rewrite version features two different backends; for now only the native backend is functional.
You can find more details here.

Wednesday, 20 May 2009

JXRenderMark 1.0 released

JXRenderMark 1.0 has been released, you can download it here

Thanks to a contribution made by a major GPU vendor who uses JXRenderMark for nightly performance regression testing, the results are now more stable and the benchmark runs faster. Thanks again!
Furthermore the "Transformed Blit Billinear sharp edges" test has been corrected, and small adaptations to the blit tests have been made.

Please give it a try on your system, and submit the results along with a short description of your system.
I am especially interested in (with EXA enabled):
  • open-source radeon driver, R100-R500, R600/700 with DRM+EXA
  • Unichrome
  • Intel <= i865

Monday, 18 May 2009

Austria to quit CERN engagement

Austria plans to stop its CERN engagement to save ~26M € annually, because we are so tight on budget.

- Currently you get 1,500 € for your old car if you buy a new one, no matter where it was produced.
- We spend ~40M € per year on Euratom, although we don't have a single nuclear plant and decided never to build one. (well, thanks to ...., .... built us one in the '70s but we never switched it on)
- Austria just lost 1.5B € because a bank we lent money to went bankrupt.

In 2009 and 2010 Austria will set a new all-time record for new national debt, and all I saw this morning were government advertisements in the newspaper about how well they work.
It seems it's all about choice; the question is between what.

Saturday, 16 May 2009

Sw->Surface Blits

Recently I've implemented SW->Surface blits. Those code paths are used if an image is not cached (first use or not cacheable), e.g. when blitting a BufferedImage with a stolen raster, which some programs do quite frequently.
I avoided that stuff for a long time, because I didn't want to touch the well-tuned image-upload code which is shared with the X11 pipeline.
For now a new VolatileImage is created every time the image is blitted and is then used for composition - which is probably not an optimal path, but a lot less code ;)

> sun.java2d.xr.XrSwToPMTransformedBlit::TransformBlit(IntArgb, AnyAlpha, "Integer RGB Pixmap")
> sun.java2d.loops.Blit::Blit(IntArgb, SrcNoEa, IntRgb)


Update: OK, admitted, it now reuses the VI for better performance ^^

Sunday, 3 May 2009

XOR rendering

The last few days I've been busy with cleanups and refactoring; today I implemented acceleration for the infamous XOR mode. Modern applications don't use it very often, but many older apps rely on it a lot, and especially in the remote case fallbacks would kill performance. (It's not very fast in the local case either.)
XRender doesn't support bitwise XOR, so the pipeline has to use the X11 GC, which adds another few lines of code for state tracking and another bit of complexity.
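
As a reminder of what XOR mode does, here is a tiny Java sketch. It assumes the per-pixel rule XOR mode is usually described with (destination XOR source color XOR the configured XOR color); the class name is made up and this is not pipeline code:

```java
/** Sketch of XOR-mode compositing for a single packed pixel value. */
final class XorModeSketch {

    /** Combines a destination pixel with the source and XOR colors. */
    static int xorPixel(int dst, int src, int xorColor) {
        return dst ^ src ^ xorColor;
    }
}
```

Under that rule a pixel in the current color turns into the XOR color and vice versa, and drawing the same shape twice restores the original pixels. That reversibility is why older apps like it for rubber-band selections and cursors.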


Thursday, 23 April 2009

Native Backend

Because of the deadlock problem when using xcb's socket handoff mechanism, which is required for the "pure java" approach, I factored everything out into an interface and wrote a JNI-based backend. This will also make the pipeline work (when it works) on systems without a recent xcb-based xlib.

The plan is to have multiple backends:
  • Pure Java (self generated X protocol)
  • Native (using X11+libXrender through JNI)
  • Escher (planned, for integration with Caciocavallo)

The whole rewrite code is an ugly mess; hopefully I'll soon find some time to clean it up.

And of course ... Java2Demo on the native backend:

Wednesday, 22 April 2009

Cacao Jit Cache

Congrats again to Robert Schuster, who gave his diploma thesis presentation about implementing a JIT cache in Cacao at the TU Vienna yesterday.
Really cool and interesting work, and an excellent presentation :)

Hmm, by the way ... thanks for offering to let me sleep on the floor of your hotel room at Fosdem ;)

Wednesday, 18 March 2009

UXA ... unusably slow

I recently filed a bug about horrible performance when running the existing X11 pipeline using Intel's forked UXA acceleration architecture. AWT applications are extremely slow, hardly usable.

> We don't intend to fix this in UXA ourselves. If you're drawing unantialiased
> text in 2009, your software fails. If you're not using Xft or cairo for text
> drawing in 2009, your software fails even harder (it would perform fine if you
> were).

So, it seems Intel doesn't care that they broke your software :-/

Thursday, 5 March 2009

Shader Trapezoids

Lately I've been playing with shaders a bit, investigating the possibility of implementing XRender's more complex features (trapezoids and gradients) using shaders instead of falling back to software all the time.

I started with trapezoids because they're used a lot, and because pixman's strange implementation made me curious:



I used NVidia's Cg language because it can be compiled down to ARB_fragment_program, which is the best my i945GM can do. Performance could be better, and I am limited to 0.25px subpixel precision because of the i915's 64 ALU instruction limit.
But at least from what I can tell the traps should conform to the spec for imprecise trapezoid mode (the default). I'll try to feed some cairo output to it and see what goes wrong ;)

On my i945GM I get 25fps fullscreen; a RadeonHD3850 is able to do ~1000fps and is still not ALU limited :)

Hopefully some day that stuff will find its way into one of the open-source drivers. I guess on more powerful hardware like AMD's R600/700, GPU setup costs will dominate :)

Thursday, 12 February 2009

Fosdem

Fosdem was really cool and exciting.
Many interesting talks and a great way to meet all the people I've mailed or heard about ... and all those I haven't known before .... and drink some beer with them.
I am really looking forward to next year's Fosdem, thanks guys for the great time :)

The Caciocavallo talk was quite impressive (although I still have trouble with the pronunciation); I guess their work will boost Java's popularity for embedded systems. Unfortunately I missed the first part of Karl Helgason's talk about Gervill.
My talk went ~ok I guess, nobody fell asleep. I just talked a lot about not really related stuff like Java2D in general to get my 30min filled somehow. Thankfully nobody recorded videos ;)

Unfortunately the Java talks collided with some Xorg talks I had planned to attend; however, Michel Dänzer talked to me after my presentation about some problems I had with X :)
The open-source radeon/ati driver is in pretty good shape. Once Intel is stable and fast again and radeonhd pushes the r600/700 branch to mainline, we will have >80% of hardware with good drivers (hopefully in Fedora 11), with the only exception being AMD's Catalyst.

Michel had a great idea about how to handle fallbacks better. Currently there are several different heuristics for deciding when to move a pixmap to video memory and when it should stay in system memory.
The problem is that such heuristic decisions are based on what happened in the past, which often means it's hard to get them right - better would be a prognosticator to get an idea of what will happen in the future ;)
One way this could be achieved would be some kind of delayed rendering:
All rendering commands could be buffered in a queue of e.g. 50 entries, and the current operation could make migration decisions based on what it knows is coming. If the current operation could be done on the GPU, but the image contents are in RAM and the queue holds 5 operations triggering a fallback, it may be more efficient not to move the image to VRAM and to do the work on the CPU, avoiding the later readback to RAM altogether.
This would help drivers/GPUs which provide only basic RENDER acceleration, like Nouveau with <= GF5, or with frequent use of unaccelerated stuff (AddTraps, A1 masks, frequent XGetImage, ...). If the queue is not full, X processes painting commands faster than the client issues them, so less optimal decisions would not hurt (the smaller the queue, the more it behaves like the default "Always" heuristic). However, it would be a lot of work, especially without deep knowledge of the code. I'll have a look in the summer holidays :-/

Back to the pipeline ... I am still distracted by the problem with xcb's socket handoff mechanism, no idea what's going wrong. If somebody knows xcb (and unix sockets, ...) a bit, I would be really happy for any help.
It's not a lot of fun to work on a pipeline locking up after 5s, without knowing the cause :-/
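
Coming back to the delayed-rendering idea described earlier in this post, the lookahead decision could be sketched in Java roughly like this. Everything here is hypothetical: the class, the Command type and the threshold are made up for illustration, and a real X server would implement this in C with a proper cost model:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: decide on pixmap migration by looking at the queued commands. */
final class LookaheadMigration {

    /** A queued rendering command; needsFallback means it cannot run on the GPU. */
    static final class Command {
        final boolean needsFallback;

        Command(boolean needsFallback) {
            this.needsFallback = needsFallback;
        }
    }

    /**
     * Decides whether a pixmap currently in system RAM should be moved to VRAM
     * for a GPU-capable operation, given the commands still waiting in the queue.
     * maxFallbacks is an arbitrary illustration value, not a measured one.
     */
    static boolean shouldMoveToVram(Deque<Command> pending, int maxFallbacks) {
        int fallbacks = 0;
        for (Command c : pending) {
            if (c.needsFallback) {
                fallbacks++;
            }
        }
        // Too many upcoming software fallbacks: stay in RAM and skip the readback.
        return fallbacks <= maxFallbacks;
    }
}
```

With an empty queue the method always answers "move", which matches the observation that a small queue degenerates to the "Always" heuristic.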

Thursday, 5 February 2009

QT rocks :)

I investigated the poor performance Qt4 showed in QGears2 in the aliased case (20fps vs. 80fps antialiased).
Qt was using XRender in a quite inefficient way for the aliased case, and Xorg was falling back to software although it could clearly do better. I hope I can fix that, it shouldn't be too hard.

I sent them a mail, and after a short discussion Trolltech wrote a patch targeted at 4.5, improving the performance of aliased shapes by a factor of 2. Once Xorg is fixed performance should double again.

Sometimes it's really a big plus to have a company behind an open source project maintaining it, as is the case for OpenJDK or Qt. Some other projects I tried to contribute to are in a far worse state.

Monday, 26 January 2009

Fosdem, Mozilla, RadeonHD, Intel

Fosdem:
I'll talk a bit about the pipeline at Fosdem 09.
I got a 30min slot assigned, no idea what I should talk about for half an hour ;)

Mozilla / RepeatPad:
The XRender pipeline has to use RepeatPad for transformed and scaled images; however, this is only accelerated by NVidia and Intel for now.
Mozilla is now discussing the switch to RepeatPad (it eliminates some artifacts and allows bilinear scaling), which has already led to a series of bug reports.
Update: the open-source radeon driver already has experimental RepeatPad acceleration for R100-R300 :)

RadeonHD:
AMD is currently working on XRender acceleration for their Radeon HD series in a separate git tree.
Hopefully it will be merged into mainline soon, and in the long term I hope it will make its way into the proprietary Catalyst drivers.

Intel:
Intel has released 2.6.0 and soon after that 2.6.1. Hopefully their GEM refactoring will be finished soon; the driver is in fairly bad shape right now.
Performance is bad and non-GEM-related bug fixing seems to have pretty much halted.
There's still an off-by-one-half bug on i915/i865 which makes Nimbus look ugly on my 945GM, as well as several performance regressions which affect antialiasing :-/

Thursday, 22 January 2009

No Software Patents! (again!)

Please sign the Stop Software Patents petition: http://stopsoftwarepatents.eu/

Nobody needs software patents, except patent trolls.
Writing code should be no crime - but with those thousands of patents it's almost impossible not to infringe.

Examples of what can be patented in the EU:
- Selling over a mobile phone network - EP1090494
- Video streaming (segmented video on-demand) - EP633694
- Electronic shopping cart - EP807891

Thursday, 8 January 2009

fillRect overhead analysis

Ever since I started working on the pipeline I've been interested in how many cycles are spent in which parts.
Today I profiled fillRect a bit:
Protocol generation: 120 cycles (40%)
Pipeline overhead: 90 cycles (30%)
Locking/synchronization: 90 cycles (30%)
Total: 300 cycles with the server compiler (480 cycles with the client compiler; ~20,000 cycles interpreter-only)
Protocol generation is writing the X11 protocol data using sun.misc.Unsafe.
Pipeline overhead is all the work done to validate pipeline/surface state and decide which code path to use for the current operation, as well as all the abstraction from Graphics2D up to our XRender surface.
Locking means acquiring/releasing a ReentrantLock, which guards AWT access.
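
For anyone who wants to re-check the percentages, here is a small Java sketch that recomputes them from the cycle counts above. The class and method names are made up; only the numbers come from the measurements:

```java
/** Recomputes the fillRect cycle budget from the measured values. */
final class FillRectBudget {
    // Cycle counts per fillRect, taken from the profile above (server compiler).
    static final int PROTOCOL = 120;
    static final int PIPELINE = 90;
    static final int LOCKING = 90;

    /** Sum of all three parts. */
    static int total() {
        return PROTOCOL + PIPELINE + LOCKING;
    }

    /** Share of one part in the total, rounded to whole percent. */
    static int percent(int part) {
        return Math.round(100f * part / total());
    }

    /** How much more one cycle count is than another, in whole percent. */
    static int slowdownPercent(int slower, int faster) {
        return Math.round(100f * (slower - faster) / faster);
    }
}
```

The three parts add up to the 300 cycles measured with the server compiler, and 480 cycles with the client compiler works out to 60% more.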

Conclusions:
- 300 cycles is not that great, but we are probably generating rectangles faster than the X server can process them :). With specific optimizations as well as biased locking I guess 175 cycles is realistic, which is not that bad.
- The server compiler does pretty well; hopefully tiered compilation will be implemented for JDK7. In this case the client compiler produces 60% slower code :-/
- Locking is expensive, especially on older multi-core processors (like my Core2Duo). Biased locking could really help here; unfortunately it has a limitation which makes it hard to use for the pipeline.
Furthermore it seems some optimizations don't have any effect when locking is done, but show e.g. a 10-cycle improvement when no locking is done.
- The pipeline overhead could be lower, but it's not bad.