Wednesday, 25 June 2008

Fighting with performance...

The current pipeline performs a lot better than the old one when running on EXA (the new default acceleration architecture of Xorg); however, the old pipeline running on XAA (almost no acceleration, pixmaps always in sysram) is faster than everything else.
Of course XAA won't count for much longer: some distributions (Ubuntu, Fedora 9) have already switched to EXA by default for many drivers. Still, it looks a bit odd to have an old unaccelerated pipeline which is faster than the new accelerated one ;)

I already have some ideas how to speed up fills in general (for strokes it's not that easy, unfortunately), and I am quite curious how they will work out. They would also solve the problem with extra alpha.

For fills the approach could be like this:
- Have a mask pixmap with a fixed size (e.g. 512x512) in A8 format
- Render the geometry to this pixmap using XRenderFillRectangles, which is itself hw-accelerated and *really* fast
- (Apply extra alpha to the mask image if necessary)
- Composite with the specified TexturePaint or GradientPaint. (For plain colors we can paint directly to the surface.)

This approach would introduce some tiling (if the shape is larger than the fixed-size mask pixmap), but it has quite some benefits:
- The mask is explicit, so we have control over its content (helpful for extra alpha)
- Always using the same mask removes the need to allocate implicit masks every time (but I guess Xorg optimizes this anyway)
- Rendering rectangles is really fast, and EXA supports this operation in hardware. Trapezoids are currently rendered in software and then sent down to VRAM.
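
The tiling and the extra-alpha step can be sketched roughly like this. It is a minimal illustration in Python, not the pipeline's actual C/Java code: the names mask_tiles and apply_extra_alpha are made up here, the mask is modelled as a flat list of A8 coverage values, and only the 512x512 mask size comes from the list above.

```python
from typing import Iterator, List, Tuple

MASK_SIZE = 512  # fixed mask edge length suggested above (512x512, A8)

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def mask_tiles(bounds: Rect, tile: int = MASK_SIZE) -> Iterator[Rect]:
    """Yield tiles of at most tile x tile pixels covering a bounding box."""
    bx, by, bw, bh = bounds
    for ty in range(by, by + bh, tile):
        for tx in range(bx, bx + bw, tile):
            # Tiles on the right and bottom edge are clipped to the bounds.
            yield (tx, ty, min(tile, bx + bw - tx), min(tile, by + bh - ty))


def apply_extra_alpha(mask: List[int], extra_alpha: float) -> List[int]:
    """Scale every A8 coverage value (0..255) by an extra alpha in 0.0..1.0."""
    return [int(round(a * extra_alpha)) for a in mask]


# A 1000x600 shape needs four passes through the fixed-size mask.
tiles = list(mask_tiles((0, 0, 1000, 600)))
```

Each tile would be rendered into the same mask pixmap and composited before moving on to the next one, so a shape larger than the mask costs one extra composite per tile.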

For MaskFills I will experiment with uploading the image to X myself; it seems the lock/getrasinfo/unlock functions introduce quite some overhead (C->JNI->Java calls, locking and an XSync).
Furthermore, ~30% of the CPU cycles spent on MaskFills go to malloc, although I don't malloc anything myself.
I hope I get sysprof working to see where all the cycles are wasted.

Monday, 23 June 2008

University

The next week or so I'll be quite busy with university :-/
However there will be almost unlimited time to code afterwards ;)

Friday, 20 June 2008

Java2D uses scanlines internally to fill transformed shapes, and because those shapes are used a lot, I did some benchmarking today on my laptop (C2D, i945GM) to find out which way of processing those scanlines is most efficient.

I compared two ways of rendering a 100 pixel wide, 1 pixel high scanline 2500 times:
  • Draw the scanline as trapezoids to an alpha-mask, and later do one
    composition step using that mask (XRender generates an implicit alpha
    mask internally).
  • Render the scanlines one-by-one, without any mask.
I also tested how much batching solid color fills is worth, to see whether it would pay off to optimize in this direction. I compared Fedora 8 to Fedora 9, because EXA and the intel driver were in quite bad shape in Xorg-1.3.

1.) Fedora9 x86_64, Xorg-1.5, intel-master with TTM:

Intermediate mask type    Time
A1 (1 bit alpha)          14ms
A8 (8 bit alpha)          4ms
Separate rendering*       16ms

* Separate rendering = Many width/1 independent calls to XRenderComposite
* No mask = XRenderCompositeTrapezoids with masktype None


On this machine rendering to an A8 mask and compositing with that yields
best results.

2.) Fedora 8 i386, Xorg-1.3, Intel-2.1.1, EXA:


Intermediate mask type    Time
A1 (1 bit alpha)          100ms
A8 (8 bit alpha)          120ms
No mask*                  40ms
Separate rendering*       13ms

Way of filling            Time
Batched solid FillRects   1ms
Batched alpha FillRects   14ms
Single solid FillRects    8ms
Single alpha FillRects    35ms


Conclusions:
The results for composition are quite surprising.
* The mask-based approach performs terribly on Fedora 8, although I thought this technique should be fairly hardware/driver independent. Most of the time is spent inside libfb.so (arrrg, no symbols); maybe the driver is falling back to software completely, or mask generation really is that slow.
* The many-small-areas composition approach performs about the same on both systems.
* On Fedora 9, as expected, using an A8 intermediate mask yields the best results, being 4 times faster than rendering many small pieces one by one.
* Batching solid color fills seems to speed things up a lot.
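
To illustrate what "batched" means here, a small Python sketch (not the benchmark code itself): rectangles are collected and handed to the backend as one request instead of one request each. The RectBatcher class, the send callback and the batch limit of 256 are all assumptions for illustration; in Xlib terms the callback would correspond to something like a single XRenderFillRectangles call taking an array of rectangles.

```python
from typing import Callable, List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


class RectBatcher:
    """Collect rectangles and hand them to the backend in one request."""

    def __init__(self, send: Callable[[List[Rect]], None], limit: int = 256):
        self._send = send    # hypothetical backend call, one request per invocation
        self._limit = limit  # assumed maximum number of rectangles per request
        self._pending: List[Rect] = []

    def fill(self, rect: Rect) -> None:
        self._pending.append(rect)
        if len(self._pending) >= self._limit:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(self._pending)
            self._pending = []  # start a fresh list, the sent one stays intact


# 2500 scanlines of 100x1 pixels, as in the benchmark above.
requests: List[List[Rect]] = []
batcher = RectBatcher(requests.append, limit=256)
for y in range(2500):
    batcher.fill((0, y, 100, 1))
batcher.flush()
```

With these assumed numbers the 2500 fills collapse into 10 requests, which is the kind of saving the batched vs. single rows in the table are measuring.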

I'll try the benchmark on other HW to see whether Xorg-1.3 is the reason for the slow masking (no problem), or whether the driver has to be highly tuned (quite bad).
I would prefer high performance across different drivers (not all GPUs will have highly optimized EXA drivers) over peak performance on some cards.
Hopefully I can find a recent live CD with Xorg-1.5 and Nouveau included to test some nvidia hw ;)

Thursday, 19 June 2008

JavaDeus (and Gradients)

JavaDeus:

... was really great!
The sessions were quite interesting, the location was nice and catering was excellent ;)
I took some photos, however not really good ones: http://picasaweb.google.com/linuxhippy/JavaDeus
(It was lunch-time when I took those pictures, the room was not half empty during the sessions ;) )


Gradients:
During the Swing-App-Framework session I couldn't resist and started to implement Gradients.
For now only old-school gradients without transformation are supported (hey, but both with and without repeat ;)), but the rest should not cause much trouble (provided XRender offers compatible implementations of the other gradient types too):



Left = XRender, Right = RI
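
For reference, this is roughly what a two-color gradient with and without repeat has to compute per pixel. It is a Python sketch under assumptions, not the XRender or Java2D code: I assume "with repeat" corresponds to Java's cyclic GradientPaint (colors run back and forth between C1 and C2) and "without" to the acyclic one (constant C1 before P1, constant C2 after P2), and that P1 and P2 differ. The function names are made up.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Color = Tuple[int, int, int]


def gradient_fraction(t: float, cyclic: bool) -> float:
    """Map the projected position t (0 at P1, 1 at P2) to a blend factor 0..1."""
    if not cyclic:
        return min(1.0, max(0.0, t))  # clamp outside the P1..P2 segment
    t = math.fmod(abs(t), 2.0)        # triangle wave: C1 -> C2 -> C1 -> ...
    return 2.0 - t if t > 1.0 else t


def gradient_color(p: Point, p1: Point, c1: Color,
                   p2: Point, c2: Color, cyclic: bool) -> Color:
    """Color at point p for a gradient from c1 at p1 to c2 at p2 (p1 != p2)."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    # Project p onto the P1->P2 line; t is the normalized distance along it.
    t = ((p[0] - p1[0]) * dx + (p[1] - p1[1]) * dy) / (dx * dx + dy * dy)
    f = gradient_fraction(t, cyclic)
    r, g, b = (int(round(a + (b_ - a) * f)) for a, b_ in zip(c1, c2))
    return (r, g, b)
```

Whether the other gradient types map onto XRender equally well is exactly the open question above; this sketch only covers the simple linear case.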



I also had a nice idea how to improve performance in the transformed (>scale) case without changing the way Java2D itself works; after all, XRender is all about creating masks and blitting with them. So I could just call XCompositeRectangles instead of plain Composite.
This however introduces the extra-alpha implementation problem I mentioned in even more situations, and I still don't know how to fix that. Hopefully time will tell ;)

Wednesday, 18 June 2008

TexturePaint

Today I got the first bits of TexturePaint working:

It performs quite OKish with XAA; I haven't tried with EXA yet. It currently only works for fills, draw() and lines are still quite a struggle for me:



I am also sure it's not 100% correct ... the work needed to make it pixel-perfect compatible with Sun's software implementation quite frightens me ;)
The current Java2D approach for transformed (>scale) TexturePaints is to fill them using strides, which means drawing one-pixel-wide pieces one after another; on real hardware, however, many small operations are quite expensive. I am curious how a MaskFill-based approach would perform.

I'll try to get GradientPaints working too, and then find all the stuff which still does not work.


JavaDeus 08:
Tomorrow I'll attend JavaDeus '08 in St.Pölten. It's like a very small JavaOne for Austria, and according to Sun some people from JavaOne will give their talks there too.
Sun provides a free shuttle bus from Vienna to the FH St.Pölten; the only drawback is that the bus leaves at 8:00 am ;)
I am sure it will be really cool :)

Sunday, 15 June 2008

What's left ... and further project plan

Project Plan:
Well, it's now mid-June and it's not that long until August 4th.
I've written down some thoughts on how I plan to continue with the project.

First large pieces which are still left:
- Paints (other than Color/AlphaColor)
- XOR (scares me^^)
- Nice, and correct lines
- Maybe some fast-paths for often-used primitives
- Bugs, bugs, bugs, bug implementation, special cases, .....

When those things are done, the pipeline is basically feature-complete as far as I can see, but of course there's still a lot to do.
For now it's really a dirty hack, a typical proof of concept: it simply does not care about corner cases and is full of bugs and missing pieces.
I simply hacked away and totally screwed up and broke the existing X11 pipeline, which is bad because ... well ... that is *working* code and mine isn't for now ;)

Short term project plan:
My plan is to re-implement everything done so far, starting from a new, untouched OpenJDK source tree, and to implement it, at least on the Java side, independently of the existing X11 pipeline; the native side will share quite a lot of code.
I'll upload the resulting code to a java.net project. The current code is so messy and broken that it's not useful for anything more than implementing new features.
I hope I can start on the new, clean implementation at the end of June, when all my exams at university are behind me.

Goal: The resulting code should be able to run many (most?) important applications with good performance, and it should be clean.
I also hope Dmitri will be patient in helping me find the "correct" way for the things which I have hacked together for now. Thanks for all your help so far, Dmitri!

Long term project plan:
My long-term goal (of course) would be the integration into (Open-)JDK7.
I don't think that the code will be ready for integration on August 4th and I hope nobody expects that.
After all, it's a large project which interfaces with a lot of project-"external" code, so there's a lot of testing, corner-case fixing and dealing with bad driver/Xorg implementations ahead. I simply would like to get everything done "right", which means a lot of testing and performance evaluation on different hw platforms.

So after the 4th of August I'll ..... go on holidays for at least two weeks ;)
I guess I'll need some recovery so I don't lose the fun of working on this; after that I'll of course work on finishing the pipeline.

I hope the code will be ready soon enough for JDK7, because I think the timing would be perfect: EXA has become really useful with Xorg-7.4 (prerelease shipping with Fedora 9) and drivers are finally starting to provide good EXA performance:
- The intel driver works very well when one of the experimental branches is used ;)
- The open-source radeon driver also seems to be in good shape now.
- NVidia is working on improving their XRender implementation after many user complaints about bad performance.

So for now the only driver which neither accelerates XRender on modern hardware nor will do so soon is the proprietary AMD/ATI driver. I don't expect the new pipeline to be that slow without acceleration, but (except for text) there won't be large speedups, and there will be some slowdowns, compared to the old pipeline. Furthermore, I expect the proprietary AMD drivers to be able to run the OGL pipeline anyway, so those users won't be sad at all ;)

Wow that was quite a lot off-topic stuff...

-----------------------------------------------------------------------------------------------------------------------------------
Compositing:
Last but not least the usual image:

Compositing works :)
As you can see, Swing is really broken, so while most demos look more or less OK, typical Swing apps are totally unusable for now :-/
I still have to figure out an efficient way to implement Extra-Alpha for operations which require an implicit mask (lines and line-related stuff like stroking).

Friday, 13 June 2008

Positions work now too...

Absolute glyph positions work now too ... even better, it's consistent with the software pipelines ;)
And as always ... a boring picture with some random positions:

Thursday, 12 June 2008

subpixel antialiased text:

1. Thanks to Phil Race I was able to get subpixel-antialiased text mostly working:

1. Subpixel antialiased - Reference Implementation
2. Subpixel antialiased - XRender
3. Grayscale antialiased - XRender
4. No antialiasing - XRender

When you compare the RI (first line) and XRender (second one) you'll notice XRender looks a bit bolder. This is due to missing gamma correction, which the software loops do to match MS Windows' behaviour, and I am unsure whether XRender supports this feature.
If it does not, I hope quality will be good enough this way to enable it by default (at least it's good enough for all other X-based apps); with accelerating drivers, performance is for sure >500% better than with the software-based approach.

Thanks again, Phil, for being that helpful :)

2. Another thing I did today was making my glyphcache play nice with other JDK-internal classes.
I chose an array-based approach over a linked-list one when I implemented it. However, I use realloc, and GlyphInfo holds a direct pointer to its cache entry, which is then left dangling.
The array-based approach also has its benefits (e.g. better cache locality), so I now simply update the pointer after doing a realloc (which should almost never happen).
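
The pointer fix-up can be pictured with a small Python model. This is only a sketch of the idea, not the native code: Python has no realloc, so growing the array is simulated by copying every entry to a new object, and I assume each cache entry keeps a back-reference to its GlyphInfo so the owner can be found again. All class and field names here are made up.

```python
from typing import List, Optional


class GlyphInfo:
    def __init__(self, code: int):
        self.code = code
        # Stands in for the direct C pointer to the cache entry.
        self.cache_entry: Optional["CacheEntry"] = None


class CacheEntry:
    def __init__(self, glyph_id: int, owner: GlyphInfo):
        self.glyph_id = glyph_id
        self.owner = owner  # assumed back-reference used for the fix-up


class ArrayGlyphCache:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.entries: List[CacheEntry] = []

    def _grow(self) -> None:
        # Simulate realloc() moving the block: every entry gets a new identity,
        # so all previously handed-out references are stale.
        self.capacity *= 2
        self.entries = [CacheEntry(e.glyph_id, e.owner) for e in self.entries]
        # The fix: re-point each GlyphInfo at its relocated entry.
        for entry in self.entries:
            entry.owner.cache_entry = entry

    def add(self, info: GlyphInfo) -> CacheEntry:
        if len(self.entries) == self.capacity:
            self._grow()
        entry = CacheEntry(len(self.entries), info)
        self.entries.append(entry)
        info.cache_entry = entry
        return entry
```

An alternative would be to store an array index instead of a pointer, which survives a move without any fix-up, at the cost of one extra lookup per access.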

What's missing for text acceleration:
  1. Support for absolute positioning, which should not be too much work because most of the existing code is prepared to support it.
  2. Allocation checks, so that things don't break if a memory allocation fails.
  3. Gamma correction, although I am unsure if this can be implemented on top of XRender at all.

MaskFill working

Yesterday I finally got MaskFills working:


It took me quite some time to figure out how to upload the mask tiles using the lock/getrasinfo/unlock functions, and I guess my code is full of wrong assumptions ;) ... but it's working quite well.

Performance with XAA (no acceleration) is on par with the existing implementation most of the time (there seem to be some performance bugs; sometimes it's fast and sometimes not), and performance on EXA is far better. This approach has a big benefit for accelerating drivers (EXA), where the old approach paid a large penalty because of the readbacks it used.

Although intermediate mask images are still used to render AA shapes and fills (instead of using XRender's AA capabilities), there's now no download from the X server, and the upload is down to almost 1/3. Furthermore, there is currently no EXA driver which is able to accelerate geometry (they use an approach similar to Java's to render AA traps).

Today I'll play with lcd-text again and have a look at MaskBlit.

Tuesday, 10 June 2008

MaskFill / MaskBlit

LCD-Text:
Today I put the LCD text stuff on hold; I don't know how to get any further, so I posted a question to the 2d-dev list.
I hope I don't bother them too much. Often something is not clear, but after asking about it I spend some time with the code in a less stressed way and find the answer myself.
However, I am quite sure that this time I won't be able to answer it myself ;)

State-Management:
Today I implemented some simple state-management.
It differs a lot from how X11 did it (there the state was held by the GC), because with XRender the state is spread over all possible places.
Solid color is sometimes used as XRenderColor, sometimes as Picture (for composition operations), clip is on the dst surface, transformation only on src.

For now I implemented only color values, however I guess later I'll have to do some kind of solid-fill-src-surface caching.
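
As an illustration of the color part of that state, here is a Python sketch of turning a Java-style ARGB value into the four 16-bit premultiplied channels an XRenderColor holds, plus a tiny cache for solid-fill sources. The names argb_to_xrender_color and SolidSourceCache are made up, the create callback is a hypothetical stand-in for whatever creates the server-side solid source, and I am assuming the incoming ARGB value is not premultiplied.

```python
from typing import Callable, Dict, Tuple

XRColor = Tuple[int, int, int, int]  # (red, green, blue, alpha), 16 bit each


def argb_to_xrender_color(argb: int) -> XRColor:
    """Convert non-premultiplied 0xAARRGGBB to premultiplied 16-bit channels."""
    a8 = (argb >> 24) & 0xFF
    r8 = (argb >> 16) & 0xFF
    g8 = (argb >> 8) & 0xFF
    b8 = argb & 0xFF

    def premul16(c8: int) -> int:
        # Multiply by alpha, then widen 8 -> 16 bit (0xFF maps to 0xFFFF).
        return (c8 * a8 * 257 + 127) // 255

    return (premul16(r8), premul16(g8), premul16(b8), a8 * 257)


class SolidSourceCache:
    """Create one solid-fill source per color and reuse it afterwards."""

    def __init__(self, create: Callable[[XRColor], object]):
        self._create = create  # hypothetical factory for a solid source
        self._by_color: Dict[int, object] = {}

    def get(self, argb: int) -> object:
        key = argb & 0xFFFFFFFF
        if key not in self._by_color:
            self._by_color[key] = self._create(argb_to_xrender_color(key))
        return self._by_color[key]
```

An unbounded dictionary is only fine for a sketch; a real cache would need some limit, since every entry would hold a server-side resource.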

MaskFill / Blit:
MaskFill / Blit are really important pieces for antialiased rendering.
Today I looked a bit into how it is done in the OGL pipeline and fought with creating my own instance of XRPixmapSurfaceData without allocating a VI.

The reason why this is helpful is that for X11 the up/download of data is more complicated than for OGL (where a simple glTexSubImage2D is all you need), and there are several ways of passing data down to X.
Of course the whole code is already there from the old X11 pipeline, but to be able to re-use it I need the XSDO structure it depends on.
Well, it was easier than I thought, and I removed some ugly hacks I had added some time before.

So tomorrow will be MaskFill day :)

Sunday, 8 June 2008

Glyph cache...

There aren't any nice images to show this time, however:

Over the weekend I implemented a glyphcache, so that I do not have to upload the glyph-images to the X-Server every time I draw some of them.
The whole thing was quite simple (besides the usual fighting with C), especially because XRender already provides a glyph-centric API, so I don't have to mess around with the pixmaps the glyphs are cached in; this is all done by X behind the scenes.

Although there is already a glyph cache in OpenJDK (jdk/src/share/native/sun/font/AccelGlyphCache), it focuses a lot on what's needed for D3D/OGL and does too much for XRender (just overhead), but on the other hand does not provide some features which are needed, especially the ability to guarantee that a set of glyphs is in the cache.

The OGL backend draws its glyphs one after another, however with XRender we first have to upload all glyphs and later draw them all with a single command. So if we have to draw 1000 individual glyphs, but the cache can only hold 512, well, it will break.
Another nice feature I implemented is that inserting a new glyph will never throw out a glyph which is known to be used in the following glyph blit.
All those things would be hard to implement with the existing cache, especially because OGL and D3D pipelines use a fixed-size texture for caching their glyphs, furthermore they would not benefit a lot from those changes.
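
The two guarantees (a whole set of glyphs is in the cache, and none of them is evicted by a sibling) can be sketched like this in Python. It is not the native cache: PinnedGlyphCache, draw_glyphs and the upload/draw callbacks are hypothetical, the LRU eviction order is my assumption, and splitting an oversized string into several runs is just one possible way to handle the 1000-glyphs-vs-512-slots case.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class PinnedGlyphCache:
    def __init__(self, capacity: int, upload: Callable[[int], None]):
        self.capacity = capacity
        self._upload = upload  # hypothetical: sends one glyph image to the server
        self._glyphs: "OrderedDict[int, None]" = OrderedDict()  # oldest first

    def __contains__(self, glyph: int) -> bool:
        return glyph in self._glyphs

    def ensure(self, needed: Iterable[int]) -> None:
        """Make sure every glyph in `needed` is cached; never evict one of them."""
        wanted = list(dict.fromkeys(needed))  # de-duplicate, keep order
        pinned = set(wanted)
        if len(pinned) > self.capacity:
            raise ValueError("more distinct glyphs than cache slots; split the run")
        for g in wanted:
            if g in self._glyphs:
                self._glyphs.move_to_end(g)  # mark as recently used
                continue
            if len(self._glyphs) >= self.capacity:
                # Oldest entry that is not part of the pending blit.
                victim = next(v for v in self._glyphs if v not in pinned)
                del self._glyphs[victim]
            self._glyphs[g] = None
            self._upload(g)


def draw_glyphs(cache: PinnedGlyphCache, glyphs: List[int],
                draw: Callable[[List[int]], None]) -> None:
    """Draw in runs whose distinct glyphs fit into the cache."""
    run: List[int] = []
    distinct = set()
    for g in glyphs:
        if g not in distinct and len(distinct) == cache.capacity:
            cache.ensure(run)
            draw(run)
            run, distinct = [], set()
        run.append(g)
        distinct.add(g)
    if run:
        cache.ensure(run)
        draw(run)
```

With a capacity of 512, a string of 1000 distinct glyphs would go out as two draw commands (512 and 488 glyphs) instead of breaking.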

Some things seem to be implemented in too complex a way, and some are ... well, not nice solutions (I currently squeeze an integer value into a pointer), but besides that it worked out quite well.

Time for drawing string rotated 50x:
X11: 6ms
X11-AA: 82ms
XR: 2ms
XR-AA: 2ms

X11 is the traditional X11 Java2D backend, XR the XRender backend I am currently working on. -AA just means with grayscale antialiasing enabled.

Note that this was done using XAA (the old acceleration architecture) and quite likely is not even hardware accelerated. With the recent improvements in EXA I hope for even better results and hw acceleration.
Please note that XR still had some logging code in it, so it would be even a bit faster :)

Thursday, 5 June 2008

All's well :)

Looks good :)

Today I got rid of enough positioning bugs to make default text look pretty good:

The first line is XRender-accelerated antialiased text, the second is the RI rendered to a BufferedImage.

For now no caching is done, and the code only works properly with no extra position information, however I think this can be fixed using XRenderCompositeText instead of XRenderCompositeString.

Tuesday, 3 June 2008

Text (and ugly lines)

Text:
After fighting a bit with it, I was able to draw my first glyphs using XRender.
For now a lot is missing, the most visible missing part is correct positioning of the glyphs.
I am still not 100% sure whether XRender is flexible enough to fulfill Java's requirements in every case, but for now there are other things to worry about ;)

After all, text is one of the critical areas where I expect the most improvement.
Currently lcd-antialiased text is implemented with readbacks (not sure about grayscale) ... well ... even normal non-antialiased text is rendered with some kind of software rendering, but without readbacks, so it's pretty fast.
With XRender, modern GPUs have the ability to accelerate even lcd-antialiased text :)

Ugly Lines:
A mistake made me upload the line-circle picture produced by the reference software implementation, so everything looked fine. In fact, XRender lines look quite ugly for now, as if they were not of consistent width.
Carl Worth pointed out a solution by John D. Hobby. I'll continue with text/glyph rendering for now, but that's definitely something I would like to get done.

Sunday, 1 June 2008

Way too much work :-/

It took me the whole weekend to get the line-drawing stuff working for real XRender.

The problem I had was that although Java2D correctly rendered what I thought were trapezoids, XRender did not like them. I had a look at the Cairo code, which of course has to do the same things, and thanks to the excellent comments I was able to figure out how to do it.
All in all, way too much work for something I expected to be "hard, but not that hard".
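
To give an idea of the geometry involved, here is a Python sketch of the first step: turning a line with a width into the four corners of the rectangle it covers, and converting coordinates to 16.16 fixed point, the format XRender uses for trapezoid coordinates. This is not the pipeline's code and not Cairo's; the function names are made up, and the harder part (splitting the rotated quad into trapezoids with horizontal top and bottom edges, which is what XRender's trapezoid structure expects) is deliberately left out.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def to_fixed(v: float) -> int:
    """Convert to 16.16 fixed point, truncating like a C integer cast."""
    return int(v * 65536)


def line_to_quad(p1: Point, p2: Point, width: float) -> List[Point]:
    """Return the four corners of the rectangle covering a line of given width."""
    (x1, y1), (x2, y2) = p1, p2
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0.0:
        raise ValueError("degenerate line: both end points are identical")
    # Unit normal of the line, scaled to half the line width.
    nx = -(y2 - y1) / length * width / 2.0
    ny = (x2 - x1) / length * width / 2.0
    return [(x1 + nx, y1 + ny), (x2 + nx, y2 + ny),
            (x2 - nx, y2 - ny), (x1 - nx, y1 - ny)]
```

For a horizontal or vertical line the quad already is a valid trapezoid; for every other angle it has to be cut at the y coordinates of its corners first.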

Performance is really bad, however I know something is wrong with my X-Server.
OProfile showed ~90% of the time spent in libfb.so; well, I plan to switch to Fedora 9 as soon as it's ready (maybe in a month or so).

Now the good news: the code is basically ready (with some modifications to meet Java2D's expected behaviour) for:
- Lines with translucent color AND
- AA'ed lines AND
- Lines with width > 1.0
and theoretically even texturepaint.