Thursday, 31 July 2008

Building on OpenSolaris

Today I installed OpenSolaris-2008.05 in VirtualBox; it took some time but works quite well.
Bridging the network to allow Solaris to download the Sun-Studio compilers took even more time and making OpenJDK build ... well ;)

1.) You need both gcc and Sun-Studio compilers
2.) You'll need the CUPS headers, in package SUNWcups. If "pkg install SUNWcups" can't find anything, run "pkg refresh" first
3.) If compiling the freetype-test fails with some cryptic relocation errors, install SUNWtoo
4.) If it bails out because it can't find sys/audio.h and other stuff, install SUNWaudh (this is not sanity checked).
5.) Install X11 headers (also not sanity checked).

If you don't know what the package you need is called, just run
pkg search -lr file_contained_in_package
to get a list with packages containing the file.

Well, at least I can now also compile the hibernated Deflater/Inflater patches after the challenge; I don't know if I would have gone through this just for those patches ;)

Update:
Just tried to re-boot because I got some strange error messages when I shut the system down.
Well at least I know now that I should update the whole system when installing SUNWtoo. ARRG!

Wednesday, 30 July 2008

Yet another Intel driver bug

1.) Today I had a great idea how transformed images could be implemented *efficiently* using XRender once RepeatPad is implemented and accelerated.

The main problem was the need to generate the geometry of the transformed image to clip away the PAD surrounding the image (GL_CLAMP).
However if a mask is used with the same size of the source-image (or a larger mask with clip-rectangles) and filtered with nearest, this already represents the final geometry.
If a mask with the scaled size is used together with bilinear filtering, I guess it could be used for implementing antialiased image rendering.


2.) Today I discovered the 3rd intel-driver bug:


On the top you can see how the result should look and on the bottom how it actually looks ;)
The problem is that if a mask has a transformation set, its clip-rectangles are ignored.

Of course I'll test it on nouveau tomorrow ;)

Documentation

The last two days I was working on cleaning up and reformatting the existing code and writing documentation.

Furthermore I've installed Fedora9 on a USB stick, so I can test the open-source nouveau driver on my brother's computer. It works surprisingly well, I've seen no artifacts running SwingSet2 with Nimbus :)

Tomorrow I'll travel to Vienna again and try to get my code compiling and working on OpenSolaris.
I'll also start benchmarking - I almost forgot about an area where the XRender pipeline should shine :)

Monday, 28 July 2008

Java2Demo is not a benchmark

Today I replaced lock/getrasinfo/unlock with one single call to XPutImage (no shm support for now), to get rid of some unnecessary overhead for MaskFill.

To my surprise the LineAnim demo got even worse on Xorg-1.5. According to top, Xorg was using ~50% and the Java process 150% of my CPUs, which reminded me that I had seen similar things with the X11 pipeline when Java2Demo's delay was set to 0. After setting it to 1ms the demo jumped from 150fps to 250fps, and Java was now using 90% CPU, on par with Xorg.
I don't know exactly what the cause is, but I guess it's some locking problem in Java2Demo or Java2D itself.

I wrote a small benchmark drawing 10000 antialiased lines with a width of 100 and a y1/y2 difference between 0 and 100, here are the results:
X11/Xorg-1.3/XAA : 1050ms
XR/Xorg-1.3/EXA : 1250ms
XR/Xorg-1.3/XAA: 1350ms
XR/Xorg-1.5/EXA: 1500ms
X11/Xorg-1.3/XAA: 15500ms

The X11 pipeline performs quite well on XAA, most likely due to using SHM (maybe even shm pixmaps), and Xorg-1.5 is slower than Xorg-1.3, most likely due to a performance bug I've already reported in a different context.
On Xorg-1.5 the profile looks for now like this:
72584 9.7587 Xorg Xorg dixLookupPrivate
51372 6.9068 libdcpr.so libdcpr.so writeAlpha8NZ
47983 6.4511 libdcpr.so libdcpr.so processSubBufferInTile
39541 5.3161 libc-2.8.90.so libc-2.8.90.so memcpy
32230 4.3332 libmawt.so libmawt.so prepareMaskPM
31531 4.2392 intel_drv.so intel_drv.so i915_prepare_composite

So after the dixLookupPrivate issue is resolved I guess performance will be better than Xorg-1.3 :)
I am quite curious how much SHM can help here.

Update:
Well of course using shm-images for MaskFill was a stupid idea, because it forces MaskFill to sync with the server every time. However, I still found a small optimization to avoid copying data around unnecessarily.
I guess a somewhat larger xlib buffer size would help quite a bit here, but as far as I know it's hardcoded to 4kb for now and can't be changed (for xlib-xcb).

Saturday, 26 July 2008

MaskFill again

Yesterday I had a cool idea about MaskFills, which are currently not as fast as I would like them to be ;)

Although the mask-upload path is far from ideal for now (this can be improved), a lot of overhead is caused by the fact that only tiles of 32x32 pixels (or smaller) are processed, one after another.
So there is significant overhead setting up all the composition parameters for such a small operation - however I hope that drivers will also improve in that area.

One idea would be to use the mask-buffer-composition-tile, which currently is 256x256 by default. The small tiles could be shmput'ed to that mask and then composited with a single composition operation (or more if the area is larger than the tile). The worst case would again be a diagonal line - where a lot of empty space would be composited - however, even in that case I think the new approach would be significantly faster.
The fewer composition operations should easily compensate the wasted fillrate.
My goal would be 250-300fps in the LineAnim Java2D test.
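
To get a feeling for that trade-off, here is a rough back-of-the-envelope sketch in Python (not the pipeline's actual code). It assumes a thin 45° line that only touches the tiles lying exactly on the diagonal, and that every touched tile costs exactly one composition operation; the function name and these simplifications are mine.

```python
import math

SMALL_TILE = 32   # MaskFill tile edge mentioned above
MASK_TILE = 256   # default edge of the mask-buffer composition tile

def diagonal_cost(length_px, tile):
    """Return (composite operations, pixels composited) for a thin 45-degree
    line spanning length_px pixels on each axis, assuming only the tiles on
    the diagonal are touched (a simplification)."""
    ops = math.ceil(length_px / tile)
    return ops, ops * tile * tile

if __name__ == "__main__":
    for tile in (SMALL_TILE, MASK_TILE):
        ops, pixels = diagonal_cost(512, tile)
        print(f"{tile}x{tile} tiles: {ops} composites, {pixels} pixels")
```

For a 512-pixel diagonal this gives 16 composites over 16384 pixels versus 2 composites over 131072 pixels: eight times the fill for an eighth of the operations. Whether that pays off depends on how large the per-operation overhead really is.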

So that would be one cool thing to do next week, if I won't get flooded with enhancement requests from the j2d guys ;)

Friday, 25 July 2008

Jippie :)

After a long fight with Mercurial over the past two or three days, it's finally done: the code now resides in its Mercurial repository. This is the initial changeset:
http://hg.openjdk.java.net/xrender/xrender/jdk/rev/6d294fa2bd42

Thanks a lot to Dmitri for his help and neverending patience ... I'll re-read the mercurial docs, I promised ;)

Thanks to flickr user rileyroxx for the nice photograph

It's still very much a proof of concept, and has many quirks I would like to change before the deadline, however I guess there are even more problems I don't know about ;)

I guess this weekend I'll have a break, my laptop feels quite burned out ;)

Thursday, 24 July 2008

Vienna

Today and tomorrow I am in Vienna; I have a free train ticket which lasts two months.
My internet connection at home is extremely expensive ( http://www.aon.at/ ), I'll change to a UMTS connection soon.
Hopefully the Mercurial stuff will work out fine, and soon the XRender pipeline will be in its place ;)

Today I did cleanups, removed tons of warnings from the native code and did some small fixes, but nothing exciting.

Wednesday, 23 July 2008

Benchmarking again...

MigLayout Benchmark:

Again I could not resist benchmarking my pipeline when I remembered the MigLayout benchmark.
I used it a few years ago to prove SWT's poor performance (inherited from GTK+), and I think it's quite a good Swing benchmark.

These were the results I got:

               Nimbus    Ocean
XRender/EXA    2100ms    1000ms
X11/EXA        8000ms    5800ms
XRender/XAA    2800ms    750ms
X11/XAA        2600ms    800ms

The X11 pipeline is really fast when using XAA, because if something falls back to software it can directly manipulate the target pixmap using shm pixmaps.
On EXA however pixmaps are stored in VRAM, and shm pixmaps are not supported - the sysprof profile is completely dominated by moving data from/to VRAM.

The real surprise was how well the XRender pipeline does when running on XAA, and how little EXA helps when running Nimbus. However, Xorg's profile looks quite good and top says a lot of time is spent in the Java process itself (65% java, 30% Xorg), so either I am using JNI too much or some validation/transformation stuff eats up all the cycles.

Ocean on EXA spends most time in gradients and text; at least text will improve a lot once Owen Taylor's glyph patches are in Xorg.


Update:


I profiled my pipeline running the benchmark and only little time was spent in the pipeline, so I ran the same benchmark on my brother's computer (Sempron64 1.8GHz, Geforce6600, WinXP, ForceWare 9371 driver):


                Nimbus     Ocean
Linux/X11       4400ms     800ms
Linux/OpenGL    6535ms*    --------
Windows-D3D     4800ms     500ms

* OpenGL on Windows did not work at all with fbobject=true, and got stuck after 5s with fbobject=false.
* OpenGL on Linux showed artifacts, and became slower each run (6535->33000ms). Latest nvidia binary driver was installed.

The benchmark seems to stress Nimbus in a way it doesn't like, no matter which pipeline was used.
I am totally impressed by the ocean result running on D3D, keep in mind this CPU is ~50% slower than mine.


MaskFill Performance:
Another topic I am not happy about is poor MaskFill performance.


                  BezierAnim    LineAnim
All               175fps        150fps
no mask upload    300fps        200fps
no composition    300fps        200fps
nothing           750fps        440fps

I thought mask upload (XPutImage) would be the slow part, because it uses suboptimal upload paths with quite some overhead and furthermore the X server has to migrate data from sysmem->vram.
It turned out that composition (with a mask in vram) and mask uploading (+migration) are almost equally slow.
Antialiasing relies a lot on MaskFill/MaskBlit (Nimbus), however I am not sure how much room for improvement is left - it would certainly help if the no-mask operations could be accumulated in the MaskBuffer, but that would require an API change.
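
To make the "almost equal" statement a bit more concrete, the frame rates from the MaskFill measurements above can be turned into per-frame times. This is only a rough sketch in Python: it assumes the stages simply add up (no overlap between upload, composition and the rest), which is my simplification, and the function names are made up for illustration.

```python
def frame_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps

def stage_cost_ms(fps_with_stage, fps_without_stage):
    """Per-frame time attributed to a stage, estimated from the frame rate
    with and without that stage (assumes the costs are additive)."""
    return frame_ms(fps_with_stage) - frame_ms(fps_without_stage)

# Frame rates from the measurements above: (all, no mask upload, no composition)
MEASURED = {
    "BezierAnim": (175, 300, 300),
    "LineAnim": (150, 200, 200),
}

if __name__ == "__main__":
    for demo, (all_fps, no_upload, no_comp) in MEASURED.items():
        print(demo,
              f"upload ~{stage_cost_ms(all_fps, no_upload):.2f} ms,",
              f"composition ~{stage_cost_ms(all_fps, no_comp):.2f} ms per frame")
```

Under that assumption both stages cost roughly 2.4 ms per frame in BezierAnim and roughly 1.7 ms in LineAnim.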

Tuesday, 22 July 2008

Rewrite

Today I started the "rewrite" with the goal of a clear separation between the existing X11 pipeline and the new XRender pipeline.
It almost works as well as the old "hack" pipeline, but some pieces are still missing and a few bugs did not suddenly disappear as I had hoped ;)

For now I won't create the "pure"-java pipeline I dreamed about a few days ago; it's less than two weeks till 4th of August and I am not sure how much effort this route would take.
I'll try it out later with enough time to think about a clean design and an efficient implementation ... I am quite curious how well this design would play together with caciocavallo.

Monday, 21 July 2008

Driver bugs...

Thanks to Dmitri, who suggested that both bugs I am experiencing are probably due to incorrect pixel/texel mapping. XRender itself is pixel-based, so the driver is responsible for the mapping.

The Nimbus corruption shown below is fixed in Xorg-1.5/Intel-2.3.2. However, there is another bug where scaling only on the x-axis also blurs pixels on the y-axis when bilinear filtering is used, and it's still there in the setup mentioned above.
Switching to XAA (software scaling using pixman) did not show the problem.

Thanks again!

Sunday, 20 July 2008

Nimbus

After the gradient work I profiled Nimbus again and noticed there were still some fallbacks in action.
I had forgotten to register MaskFill primitives for argb-pre surfaces, and after adding those ... I faced the usual problems:
Note that they are not equally scaled, so in reality the size should be quite the same.
The two visual problems are the white area below the knob, and the different outline of the up/down buttons.
Also the gradient looks a bit different, I am not sure whether this is a problem in the pipeline or just XRender's different gradient implementation.

Overall performance is quite good now and with some planned enhancements for MaskFill (fast upload paths) it should become even better - well, at least something encouraging.

Nimbus seems to be an excellent "tool" for testing the more advanced features of the pipeline ... it seems there is no bug it does not catch ;)

Gradient performance

Today I further investigated the performance problems when gradients are used. Although it's not really the pipeline's fault, something had to be done - even Metal/Ocean spent significant time drawing gradients.

The problem is how EXA handles fallbacks:
For EXA the only concern is whether something can be accessed by the CPU or the GPU. If it has to draw a gradient, which is pinned to sysmem (because gradients are not accelerated for now), it has to access the destination and mask surfaces with the CPU, because composition is done in software.
The problem is that for now the only way to make a surface CPU-addressable is to copy its content from vram to sysram - so mask and destination contents are copied from vram to sysram and composition is done by the CPU.
Now imagine what happens if directly after that e.g. fillRect is called. EXA realizes that this operation can be done by the GPU, but the image content is now in sysram. So it copies the mask/dest pixmaps to vram again, and then fills the single rect.

This made the X-Server spend more than 50% of its cycles in memcpy, most of the time reading back contents from vram.

The workaround for this problem was to allocate a 256x256 gradient-buffer-tile. If a gradient is to be used for composition, it is drawn into this tile and the tile is then used as source instead of the "real" gradient.
The trick is that the tile-pixmap is never altered by the GPU, so the only copying is sysram->vram, once when the composition happens.
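
The effect can be illustrated with a toy model in Python. This is not EXA's real migration logic, just a sketch under my own assumptions: every pixmap has a sysram and a vram copy, a CPU write invalidates the vram copy and vice versa, and any access to a stale copy costs a full-pixmap transfer. All class and function names are hypothetical.

```python
class Pixmap:
    def __init__(self, width, height, valid_in, bytes_per_pixel=4):
        self.size = width * height * bytes_per_pixel
        self.valid_in = {valid_in}       # where an up-to-date copy lives

class ToyExa:
    """Counts bytes moved between sysram and vram. Not real EXA logic."""
    def __init__(self):
        self.bytes_copied = 0

    def _ensure(self, pixmap, where):
        if where not in pixmap.valid_in:
            self.bytes_copied += pixmap.size
            pixmap.valid_in.add(where)

    def cpu_write(self, pixmap):
        self._ensure(pixmap, "sysram")   # readback if only vram is current
        pixmap.valid_in = {"sysram"}     # vram copy is now stale

    def gpu_read(self, pixmap):
        self._ensure(pixmap, "vram")     # upload if only sysram is current

    def gpu_write(self, pixmap):
        self._ensure(pixmap, "vram")
        pixmap.valid_in = {"vram"}       # sysram copy is now stale

def naive(frames, width, height):
    exa, dest = ToyExa(), Pixmap(width, height, "vram")
    for _ in range(frames):
        exa.cpu_write(dest)   # software gradient composite onto destination
        exa.gpu_write(dest)   # accelerated fillRect right afterwards
    return exa.bytes_copied

def with_gradient_tile(frames, width, height, tile=256):
    exa = ToyExa()
    dest = Pixmap(width, height, "vram")
    grad = Pixmap(tile, tile, "sysram")
    for _ in range(frames):
        exa.cpu_write(grad)   # gradient rendered into the tile in software
        exa.gpu_read(grad)    # tile uploaded and used as composite source
        exa.gpu_write(dest)   # composite and fillRect both stay on the GPU
    return exa.bytes_copied

if __name__ == "__main__":
    print(naive(10, 1000, 800), with_gradient_tile(10, 1000, 800))
```

In this model ten gradient-plus-fillRect rounds on a 1000x800 destination move 64,000,000 bytes without the tile and 2,621,440 bytes with it, and the remaining traffic is all sysram->vram.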

This workaround makes resizing the NetBeans main window a lot faster and also helps Nimbus, which uses gradients even more. The drawback is that once gradients are accelerated, we will do a useless composition step - plus the memory wasted for the buffer-tile.

Here are some benchmarks before and after that change:
BezierAnim: ~55fps -> 355-600fps
GradAnim: ~30fps -> 180fps

Thursday, 17 July 2008

Netbeans...

Sorry for the flood of screenshots....


I profiled it a bit running on Xorg-1.5 and for rendering about:
- 30% were spent in text-rendering
- 45% were spent in compositing (with about 30% spent in ram->vram migration caused by gradients)
- 2% were spent in lines (which are unfortunately unaccelerated)

Not that bad, keeping in mind that text-rendering is currently heavily worked on, however I guess I have to work around the horrible gradient performance.
Scrolling in the Editor is already feeling faster than with the X11 pipeline using XAA.

Better ...

Today I found two bugs which led to the corruptions with Nimbus.
It looks quite good now; the visual differences are because I am running it on the OpenJDK7 codebase, maybe Synth has not been updated in JDK7 to work with the new Nimbus releases:




The first bug was when drawing images with transformations. The solution works (kind of) but is in my opinion very sub-optimal.
I am also fighting with something which could be an xorg-bug, I am not sure ... I prefer to blame my code before others ;)

Nimbus

Nimbus:
Yesterday I started to make my pipeline work well with Nimbus.
The first test results were mixed - rendering was correct but performance was really bad.
The reason was that Nimbus uses translucent VIs, and I completely forgot to allow this.
Well performance is a lot better now, but the result is not that appealing:


The xorg-profile looks quite good with the exception of gradients, which are currently not accelerated.
EXA seems to move all pixmaps out of vram if a single pixmap cannot be migrated to vram, which leads to about 70% of xorg's time being spent in moving pixmaps around.
Even if they don't accelerate gradients in hw, I hope they will do better than that in future.


Xorg Bug or feature:
I am not sure, but I guess I ran into another Xorg-Bug - or it's just a feature I don't see the use for.
In fact it makes my "life" quite hard :-/
http://thread.gmane.org/gmane.comp.freedesktop.xorg/30450/focus=30476

Tuesday, 15 July 2008

Artifact free Swing

After quite a lot of debugging I was able to find the cause of the Swing artifacts.
The problem was that Render also honors the clip-mask when a pixmap is used as source, which of course completely messed up the double-buffering Repaint-Manager.

Now Swing applications look almost like they should:


(Boinc was crunching on both cores, so performance is not as bad as it seems ^^)

The last remaining piece to make common Swing apps usable is copyArea, I hope I get it done in the next few days. There are also other annoying and serious bugs, but this one will make Swing apps just work :)

After the most serious bugs I know about are fixed, I hope I can upload that "hack"-release of my pipeline to my OpenJDK repository, and start working on the final design.

Monday, 14 July 2008

Swing artifacts

Today I was working on removing show-stoppers from the existing pipeline.
I also investigated the problem with the artifacts I see when running Swing applications (unusable, most components only paint at mouseover).
I still don't know what's causing this bug, but at least I now know where to start.
Clipping seems to be the problem, if I comment out clip-validation almost all elements paint properly - I hope I can find some short code to reproduce the problem.

Well, here is a short overview of what has to be done for the "hack"-pipeline:
- Transformed blit (to be consistent with Java2D's rendering)
- Scale-quality adjustment (for now everything is smooth-scaled)
- Swing artifacts
- Extra alpha for mask and texture fills doesn't work
- CopyArea unimplemented

Of course even more bugs will pop up when I solve the problem with corrupted Swing, but I hope that at the end of the week I can release a pipeline which works ok for most uses.
After this release I'll work on rewriting it, so that it can exist beside the X11 pipeline.

Sunday, 13 July 2008

Rewrite in Java?

Bugs and Glitches:
Well there are still some open issues I know about, and many small glitches which have to be fixed - however I hope that this will be done in the following two days and I'll start implementing the rewrite which hopefully will not be that ugly ;)

Rewrite in Java:
The current hack is more or less completely written in C (like the old pipeline), however with the mask-batchbuffer work I actually realized how little interaction I really need with the C libraries.
Some time ago I had an email conversation with Roman Kennke about making the new XRender pipeline work with his java-only AWT implementation and the idea seemed really nice - however the way I started with the pipeline made it rather impossible.

In theory all I need to do is drawing lines and rects and playing a bit with dest and source parameters, in fact it would also make batching much more useful ... bringing the design closer to the STR approach again ;)

Well I still don't know whether I should go this route, the "deadline" is pretty close and any design mistakes could bring me way behind schedule. After all ... it's only 3 weeks till then. If I could only have a little more time ;)

Friday, 11 July 2008

Line performance

Lines:
After a lot of profiling and tuning line/draw performance is now quite ok, however for single diagonal lines the overhead is quite high.
I have some further ideas for optimizations, but the project isn't really on schedule, so I'll focus on making things work ;) - and do those optimizations later.

The challenge was that EXA always ping-ponged the mask pixmap between sysram and vram, because it thought accelerating the fillRects() from drawScanline() was a good idea; actually it was a pretty bad one.
For now I have two masks, one I render the scanlines to and another for rendering the lines.
I first render the lines, blit the line-mask to the rect-mask, clear the line-mask by drawing the same lines again in black, and then draw the rects to the rect-mask.
Sounds pretty horrible, but after all this gave me the best performance at the expense of one additional accelerated blit.
It's pretty hard to keep the line-mask in sysram (not even a fillRect for clearing the pixmap is allowed). After studying EXA's source I found it uses a "dirty" concept: if I use fillRect, the native surface is marked "dirty" and the next time a software fallback happens EXA does a vram readback ... well, or something like this ;)
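
The order of operations is easier to follow in code. The following Python sketch only mimics the sequence on plain 2D lists; it is not the pipeline's C code, the Bresenham routine merely stands in for a core X line request, and all function names are made up for illustration.

```python
def new_mask(width, height):
    return [[0] * width for _ in range(height)]

def draw_line(mask, x0, y0, x1, y1, value):
    """Plain Bresenham line, standing in for a core X line request."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        mask[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def fill_rect(mask, x, y, w, h, value):
    for row in range(y, y + h):
        for col in range(x, x + w):
            mask[row][col] = value

def render(lines, rects, width, height):
    line_mask = new_mask(width, height)
    rect_mask = new_mask(width, height)
    for line in lines:                  # 1. lines go to the line mask
        draw_line(line_mask, *line, 255)
    for y in range(height):             # 2. blit line mask -> rect mask
        rect_mask[y][:] = line_mask[y]
    for line in lines:                  # 3. erase by redrawing in black
        draw_line(line_mask, *line, 0)
    for rect in rects:                  # 4. rects go to the rect mask
        fill_rect(rect_mask, *rect, 255)
    return line_mask, rect_mask

if __name__ == "__main__":
    line_mask, rect_mask = render([(0, 0, 7, 7)], [(5, 0, 2, 2)], 8, 8)
    print(sum(map(sum, line_mask)), sum(map(sum, rect_mask)) // 255)
```

After one round the line mask is empty again without ever having seen a fillRect, while the rect mask holds both the lines and the rects and can be used for the final composition.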

Yes, this relies pretty much on expected behaviour, however it also works quite well with NVidia's proprietary driver - which is capable of accelerating lines. However, there is overhead of course.

Xorg-Bugs:
Well after those optimizations were done, it worked well on Xorg-1.3, but completely sucked on Xorg-1.5, because of the performance bug already mentioned: http://bugs.freedesktop.org/show_bug.cgi?id=16647
and because of a new bug (so the report in fact covers two regressions).
A big thanks to Michel Dänzer who immediately replied and fixed the second bug, and to Eamon Walsh for working on the remaining one.

What's going on currently:
Well I am trying to fix all the stuff I know about, especially the things where I know what causes the bug ;)
However, I still don't know where the Swing corruptions come from, maybe because copyArea is not implemented for now.

Well, here's what's going on for now:


Most of the Java2Demos work already fine, and with the hacks performance is ~ok.
I also worked around some performance bugs in Xorg when using Pictures without a drawable: I simply don't use them anymore.
A kingdom for GL_LINE or however its called ;)


Whats next?:

Well after some finishing on the buffer-code, I'll try to get a clean pipeline implemented, which can finally be uploaded into the repository.

Thursday, 10 July 2008

fillOval performance

Today I started integrating the mask-batch-buffer bits I wrote over the last few days into the pipeline and did some benchmarking.

I tested with fillOval, because in the old pipeline it "simply" used XFillArc, whereas with XRender it is done by a general drawPath routine, so this should show how much the batching is worth:


       20x20     100x100    250x250    1000x1000
X11    2.44E7    1.44E8     3.7E8      1.92E9
XR     2.8E7     1.8E8      4.8E8      1.78E9


Except for very large ovals, the XRender pipeline is slightly faster, even in an area where the old X11 pipeline is pretty good. For lines and draw() I guess the picture will be different :-/
Using EXA or XAA makes no (large) difference, I guess the intel driver does not accelerate arcs/ovals anyway.

Wednesday, 9 July 2008

About Gradients and Traps...

Radial and Linear Gradients:


Today I started Linear- and RadialGradients (that stuff that was introduced in JDK6). As far as I know LinearGradients are quite complete, Radial Gradients are still missing some functionality.
I guess the Gradient-Stuff is almost complete now :)


No more Traps!
Lines cause many troubles because Java relies on fast lines for shapes, however XRender simply does not support lines at all.
A line can be composed out of 3 trapezoids (and a huge amount of code and runtime work is needed to get it right). The traps generate a mask which can be used for composition - however that's really slow for something which is expected to be almost a no-op. For now there is not a single driver which is able to rasterize traps in hardware.
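
To illustrate where the three trapezoids come from, here is a minimal geometric sketch in Python (not the code of the pipeline). It assumes the line is treated as a rotated rectangle that is cut horizontally at each corner, since Render trapezoids have a horizontal top and bottom; the function names are hypothetical and the endpoints must differ.

```python
import math

def line_corners(x1, y1, x2, y2, width):
    """Four corners of the rectangle covered by a line of the given width."""
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    # Normal vector scaled to half the line width.
    nx = -dy / length * width / 2
    ny = dx / length * width / 2
    return [(x1 + nx, y1 + ny), (x2 + nx, y2 + ny),
            (x2 - nx, y2 - ny), (x1 - nx, y1 - ny)]

def trapezoid_bands(corners):
    """Split the quad at every distinct corner y; each (top, bottom) band is
    one trapezoid with horizontal top and bottom edges."""
    ys = sorted({round(y, 9) for _, y in corners})
    return list(zip(ys, ys[1:]))

if __name__ == "__main__":
    print(len(trapezoid_bands(line_corners(0, 0, 10, 5, 2))))   # slanted line
    print(len(trapezoid_bands(line_corners(0, 0, 10, 0, 2))))   # horizontal line
```

A slanted line yields three bands (a triangle at the top, a parallelogram in the middle, a triangle at the bottom), while an axis-aligned line collapses to a single rectangle.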

While playing with the mask-tile-approach I had the idea to draw to my mask with Core-Drawing, which has support for pretty lines.
This would remove the burden of all those trapezoids :)
In theory it should work very well, and in reality it's not that bad: EXA doesn't provide driver hooks for lines at all, but because traps aren't accelerated either, nothing is lost - it's therefore as fast as the old pipeline on EXA.

The sad news is that it's quite slow with Xorg-1.5, but worked ok on 1.3, and it really flies with the proprietary nvidia driver (I guess it has hw line acceleration).
I filed a bug-report, hoping that the situation can be improved.

I really hope I get the mask-tile stuff integrated soon, I am quite curious how it will perform.

Sunday, 6 July 2008

Java.net Project

Java.net Project:

I've created a java.net project, to allow participation and reviewing.
For now I don't use subversion, the "releases" are uploaded as tarballs - however I plan to change this soon.
The first release called "0.0.1" is really a development-only review, it will be completely rewritten soon so I recommend not wasting your time reading it ;)
The project-page can be found here: https://xrpipeline.dev.java.net/


Text benchmarks:
Last time I benchmarked, Dmitri asked about benchmarks done with J2DBench, however I was not familiar with it and only had a look at it recently. What a great and helpful tool :)

I've done some text-benchmarks with it, here are the results:

           Plain      AA         LCD
X11        561.758    42.955     41.678
XRender    148.757    134.761    95.184

Results are in chars/s with a 12pt font drawn directly to screen in 16-char blocks.
Currently there is some progress speeding up Xorg's glyph compositing, I guess with the patches it should be possible to reach ~500.000c/s for all three cases.

I did some other benchmarks and with the old "architecture" it seems text is really the only piece which is faster when a solid color is used. Although I expected some slowdowns, they were sometimes horrible :-/
Just another sign which underlines the need for the batched-mask-rendering approach.

Thursday, 3 July 2008

Performance improvements

Performance: Today I played with the ideas I accumulated over the past two weeks or so (already mentioned earlier) to improve performance of scanline based drawing, and implementing the "Extra-Alpha" concept with XRender.

The first attempts were quite frustrating, because performance was not nearly as good as I had hoped.
I ended up somewhere in Xorg's software loops, spending 85% of total time in memcpy - which made me think FillRects to A8 is not accelerated by EXA or the intel driver (which would have pretty much killed my ideas).
However it turned out to be a xorg-performance bug (at least as far as I can tell): https://bugs.freedesktop.org/show_bug.cgi?id=16600
Good to know about this limitation, maybe it can help improve the existing code a bit - thanks a lot to the guys at #xorg-devel for being that helpful and friendly :) (Still glad that I did not have to fire up GDB^^)

The results were as pleasant as expected, for the micro-benchmark I ran before (filling 6250 spans, 250 at a time) I got:
Rendering to explicit mask using XRenderFillRectangles: ~10ms
Rendering to destination, compositing scanline per scanline: ~50ms

As always when using masks there is a worst case, a 45° line. However, as far as I can guess the low performance in the scanline-per-scanline case does not come from limited fill-rate but from the per-operation overhead, which seems far lower with FillRectangles. I guess the worse the driver, the more the new approach "benefits" compared to the old one - as long as it does accelerate FillRects to an A8 mask everything should be fine, with the exception of very old pre-3d-cards which are not really able to run EXA anyway.
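
The per-operation overhead can be estimated from the two timings above. This is just arithmetic on the rough "~10ms" and "~50ms" figures, sketched in Python; the helper names are mine, and the result should be read as an order of magnitude only.

```python
def batched_calls(spans, batch_size):
    """Number of batched requests needed for the given number of spans."""
    return -(-spans // batch_size)   # ceiling division

def cost_per_span_us(total_ms, spans):
    """Average cost per span in microseconds."""
    return total_ms * 1000.0 / spans

SPANS, BATCH = 6250, 250   # micro-benchmark parameters from above

if __name__ == "__main__":
    print("batched requests:", batched_calls(SPANS, BATCH))
    print("explicit mask: ~%.1f us per span" % cost_per_span_us(10, SPANS))
    print("per scanline:  ~%.1f us per span" % cost_per_span_us(50, SPANS))
```

So the 6250 spans are sent in 25 batches at roughly 1.6 µs per span, compared to roughly 8 µs per span when every scanline is composited on its own.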

Extra Alpha:
Another advantage of the new explicit-mask-approach is that Extra-Alpha can be done easily and is "for free" in the scanline/fillSpans case, in the line/path case there is one additional (usually accelerated) composition step compared to plain XRenderCompositeTrapezoids without extra alpha.

Tuesday, 1 July 2008

Over..

- All exams are done, finally university is over and I can concentrate on my project.

- Rawhide (the development version of Fedora) almost works ... using an old kernel it's even able to run sysprof.
Performance is much better than with the Fedora8 based system I am running currently, however EXA still moves pixmaps around like a mad cow.
I hope I can get a pre-release of the GEM intel driver working soon, so that I will be able to target the latest improvements, and not optimize for already dead code.

- I plan to create a java.net project where I will upload all code which currently exists. Well it's mostly an unusable hack, but I hope this will change soon.


Looking forward to a few exciting (and hopefully not too short) weeks :)