JIT Shaders For Better Performance

The subject is really vast and complex and I’ve been trying to write an article about this for quite some time now. Recently, I made a small patch to enhance this technique and I thought it was a good occasion to try to summarize how it works and the benefits of it. In order to talk about this new enhancement, I would like to draw the big picture first.

The Problem

That might look like a complicated post title… but it is more complex than really complicated. Here is how it starts: rendering a 3D object requires executing a graphics rendering program – or “shader” – on the GPU. To keep it simple, let’s just say this program computes the final color of each pixel on the screen. Thus, the operations performed by this shader vary according to how you want your object to look. For example, rendering with a solid flat color requires different operations than rendering with per-pixel lighting.

Any programming beginner will understand that such a program will test conditions – for example whether to use lighting or not – and perform some operations according to the result of this test. Yes: that’s pretty much exactly what an “if” statement is. It might look like programming trivia to you. And it would be if this program was not meant to be executed on the GPU…

You see, the GPU does not like branching. Not one bit (literally)! For the sake of parallelization, the GPU expects the program to have a fixed number of operations. This is the only efficient way to ensure computations can be distributed over a large number of pipelines without having to care too much about their synchronization. Thus, the GPU does not know branching and each program has a fixed number of instructions that will always be executed in the same order.

Conclusion: shader programs cannot use “if” statements. And of course, loops are out of the game too since they are pretty much pimped-out “if” statements. Can you imagine what such logic would imply for your daily programming tasks? If you simply try to, you will quickly understand that instead of writing one program that can handle many different situations, you will have to write many different programs that each handle a single situation. And then manually choose which one should be launched according to your initial setup…

Workarounds…

Mutables

The simplest workaround is to find “some way” to make sure useless computations do not affect the actual rendering operations. For example, you can “virtually disable” lighting by setting the diffuse/specular components of all lights to 0. As you can imagine, this is really a suboptimal option. Performance-wise, it’s actually the worst possible idea: a lot of computations happen and most of them are likely to be useless in most cases.

If/else shader intrinsic instructions

After a few years, shaders evolved and featured more and more instructions. Those instructions are now usable through higher level languages such as Cg or GLSL. Those languages feature “if” statements (and even loops too). How are they compiled into shader code that can run on a GPU? Do they overcome the challenges implied by parallelization? No. They actually fit in a very straightforward and simple way. As a shader program must feature a single fixed list of instructions, the two parts of an if/else statement will both be executed. The hardware will then decide which one should be muted according to the actual result of the test performed by the conditional instructions.

The bright side is that you can use this technique to have a single shader program that handles multiple scenarios. The dark side is that this shader is still very inefficient and might eventually exceed the maximum number of instructions allowed for a single program. On some older hardware, the corresponding hardware instructions simply do not exist… So even this “brand new” feature that will be introduced in Flash 11.7 and its “extended” profile is far from sufficient.
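To make the “both parts are executed” behavior concrete, here is a minimal sketch in Python (used only for illustration: this is not shader code, and the function names are made up for the example). It emulates a flattened if/else: both branches are always computed, and a 0/1 mask decides which result survives.

```python
def lit_color(base, light):
    """Branch A: modulate the base color by a light intensity."""
    return tuple(c * light for c in base)


def flat_color(base):
    """Branch B: return the base color untouched."""
    return base


def shade_flattened(base, light, use_lighting):
    """Emulate a flattened if/else: both branches always run."""
    a = lit_color(base, light)          # always computed
    b = flat_color(base)                # always computed
    k = 1.0 if use_lighting else 0.0    # the condition becomes a 0/1 mask
    # Blend the two results: the "muted" branch is multiplied by 0.
    return tuple(k * x + (1.0 - k) * y for x, y in zip(a, b))
```

Whatever the value of `use_lighting`, the cost is the sum of both branches, which is why this approach stays inefficient.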

Pre-compilation

Some engines will use high level shader programming languages (like Cg or GLSL) and a pre-compilation workflow to generate all the possible outcomes. Then, the right shader is loaded at runtime according to the rendering setup. This is the case of the Source Engine, created by Valve and used in famous games like Half-Life 2, Team Fortress 2 or Portal. This solution is efficient performance-wise: there is always a shader that will do exactly and strictly the required operations according to the rendering setup. Plus, it does not rely on the availability of specific hardware features. But pre-compilation implies a very heavy and inefficient asset workflow.
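A rough way to see why this workflow gets heavy is to count the variants. The sketch below is in Python and uses a hypothetical set of rendering options (the option names and value ranges are assumptions for the example, not the Source Engine's actual ones). It enumerates every combination a pre-compilation step would have to build.

```python
from itertools import product

# Hypothetical rendering options; a real engine has many more.
OPTIONS = {
    "lighting": (False, True),
    "skinning": (False, True),
    "fog": (False, True),
    "num_lights": (0, 1, 2, 3, 4),
}


def all_variants(options):
    """Enumerate every combination of option values."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]


variants = all_variants(OPTIONS)
print(len(variants))  # 2 * 2 * 2 * 5 = 40 shaders to compile and ship
```

Each new on/off option doubles that number, even when most combinations are never used at runtime.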

Minko’s Solution

We’ve seen the common workarounds and each of them has very strong cons. The most robust implementation seems to be the pre-compilation option despite the obvious workflow issues. Especially when we’re talking web/mobile applications! But the past 10 years have seen the rise of a technique that could solve this problem: Just In Time (JIT) compilation. This technique is mostly used by Virtual Machines – such as the JVM (Java Virtual Machine), the AVM2 (ActionScript Virtual Machine) or V8 (Chrome’s JavaScript virtual machine). Its purpose is to compile the virtual machine bytecode into actual machine opcodes at runtime in order to get better performance.

How would the same principle apply to shaders? If you consider your application as the VM and your shader code as this VM’s execution language, then it all falls into place! Indeed, your 3D application could simply compile some higher level language shader code into actual machine shader code according to the available data. For example, some shader might compile differently according to whether lighting is enabled or not or even according to the number of lights.

With Minko, we tried to keep it as simple as possible. Therefore, we worked very hard to find a way to be able to write shaders using AS3. As the figure above explains, the AS3 shader code you write is not executed on the GPU (because that’s simply not possible). Instead, the application acts as a Virtual Machine and, as it gets executed at runtime, this AS3 shader code transparently generates what we call an Abstract Shader Graph (ASG). You can see it as an Abstract Syntax Tree for shaders (you can even ask Minko to output ASGs in the console as they get generated using a debug flag). This ASG is then optimized and compiled into actual shader code for the GPU.

For example: every time you call the add() method in your AS3 shader code, it will create a corresponding ASG node. This very node will be linked with the rest of the ASG as you use it in other operations until it is finally used as the result of the shader. This result node becomes the “entry point” of the ASG.

Here is what a very simple ASG that just handles a solid flat color rendering looks like:

Here is what a (complicated) ASG that handles multiple lights looks like:

Your AS3 shader code is executed at runtime on the CPU to generate this ASG that will be compiled into actual shader code that will run on the GPU (in the case of Flash it will actually output AGAL bytecode that will be translated into shader machine code by the Flash Player). As such, you can easily perform “if” statements that will shape the ASG. You can even use loops, functions and OOP! You just have to make sure the shader is re-evaluated any time the output might be different (for example when the condition tested in an “if” changes). But that’s for another time…

Using JIT shaders, Minko can efficiently and dynamically compile shaders shaped by the actual rendering settings occurring at runtime. Thus, it combines the high performance of a pre-compilation solution with all the flexibility of JIT compilation. In my next articles, I will explain how JIT shader compilation can be efficiently automated and how multi-pass rendering can also be made more efficient thanks to this approach. If you have questions, hit the comments or post in the Minko category on Aerys Answers!
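The JIT idea described in this post can be sketched in a few lines. The example below is in Python rather than AS3, and every name in it (`Node`, `add`, `multiply`, `build_pixel_shader`, `compile_graph`) is hypothetical: it is not Minko's API, and the output is a made-up pseudo-assembly, not AGAL. It only illustrates the principle: CPU-side code with ordinary “if” and “for” statements builds a graph, and the graph is flattened into a branch-free instruction list.

```python
class Node:
    """One graph node: an operation and its argument nodes."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args


def constant(name):
    return Node("const", name)


def add(a, b):
    return Node("add", a, b)


def multiply(a, b):
    return Node("mul", a, b)


def build_pixel_shader(settings):
    """Runs on the CPU: plain 'if' and 'for' shape the graph."""
    color = constant("diffuseColor")
    if settings["lighting"]:
        total = constant("ambient")
        for i in range(settings["num_lights"]):
            total = add(total, constant("light%d" % i))
        color = multiply(color, total)
    return color  # the result node is the entry point of the graph


def compile_graph(root):
    """Flatten the graph into a fixed, branch-free instruction list."""
    code = []

    def visit(node):
        if node.op == "const":
            return node.args[0]
        operands = [visit(arg) for arg in node.args]
        target = "t%d" % len(code)  # next free temporary register
        code.append("%s %s, %s, %s" % (node.op, target, operands[0], operands[1]))
        return target

    code.append("mov out, %s" % visit(root))
    return code
```

With lighting disabled, the compiled program is a single `mov`; with two lights it contains two `add` and one `mul` before the final `mov`. Neither program contains a test: the condition was resolved on the CPU while the graph was being built.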

Teasing: Simple Minko Physics Stacking Stress Test

Minko physics is probably one of the most awaited features for this new year. I will not cover it extensively right now – I’d rather post a few demos in a later post – but if you must know, it was designed to be the fastest 3D physics engine for ActionScript 3 and the Flash platform. I will share extensive details about this new physics engine during the Stage3D online conference in February. Make sure you attend: it’s online and it’s free! :)

The Demo

Anyway, I just wanted to tease the community with one of the stress tests we’re using here to benchmark the engine. It’s a really simple test and it uses only a very very small subset of the available features. Here you go (just press “space” to throw some balls): Read more about performances after the jump…

3D Matrices Update Optimization

4×4 matrices are the heart of any 3D engine as far as math is concerned. And in any engine, how those matrices are computed and how they are made available through the API are two critical points regarding both performance and ease of development. Minko was quite generous regarding the second point, making it easy and simple to access and watch local to world (and world to local) matrices on any scene node. Yet, the update strategy of those matrices was… naïve, to say the least.

TL;DR

There is a new 3D transforms API available in the dev branch that provides a 25000% boost on scene nodes’ matrices update in the best cases, making it possible to display 25x more animated objects. You can read more about the changes on Answers.

Minko Weekly Roundup #4

Features

  • The JointsDebugController makes it possible to display the joints and bones of a skinned mesh in order to check if everything works as expected.
  • The VertexPositionDebugController will display the position of each vertex of a mesh.
  • The VertexNormalDebugController will display the normal of each vertex. You can use it with the VertexNormalShader to display and debug the normals of a mesh.

Examples

Answers

Tutorials

Fixes

  • minko-collada will now load the vertex RGBA data when it’s available
  • min/max computation is always possible upon creation of a VertexStream regardless of its StreamUsage
  • frustum culling will now also test the center of the bounding box and not only the 8 corners

Minko Weekly Roundup #3

Projects

Wieliczka Salt Mine website and apps

Created by the Polish agency GoldenSubmarine, the Wieliczka Salt Mine web site offers an interactive experience with HD 360° panoramas built with Minko. Thanks to Adobe AIR, the application is also available on mobile platforms such as the iPad, the iPhone and Android.
This awesome project was just rewarded with the “FWA Mobile of the Day” award! This is the very first FWA for a project built with Minko so we are very proud of course!

Videos

This video presents yet another example of a very cool shader built by developer Jeremy Sellam from Les Chinois, a French digital agency based in Paris, France. It was built with the public beta of the ShaderLab of course!

Features

ByteArray streams

Geometry is now stored in ByteArray objects. It reduces the memory consumption and should provide a significant performance boost, especially on mobile devices. You can read more about this feature in my previous blog post.

Parametric frustum culling

Minko now implements frustum culling in a very flexible fashion. You can choose on what planes and using what volume (sphere or box) frustum culling is computed. For example, you can use the bounding sphere on all planes but the box only on the near/far planes. Such flexibility should make it possible to optimize performance for every use case. It’s also nice to consider that computations are kept to a minimum by storing world space versions of the bounding sphere/box. You can view the entire code for this feature in the VisibilityController.
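The idea of choosing a volume per plane can be sketched as follows. This is an illustration in Python, not the VisibilityController code: the function names, the plane representation `(nx, ny, nz, d)` with inward-pointing normals, and the `TESTS` table are all assumptions made for the example.

```python
from itertools import product


def signed_distance(plane, point):
    """Plane is (nx, ny, nz, d), normal pointing inside the frustum."""
    nx, ny, nz, d = plane
    return nx * point[0] + ny * point[1] + nz * point[2] + d


def sphere_outside(plane, center, radius):
    """True when the whole sphere lies behind the plane."""
    return signed_distance(plane, center) < -radius


def box_outside(plane, box_min, box_max):
    """True when all 8 corners of the box lie behind the plane."""
    corners = product(*zip(box_min, box_max))
    return all(signed_distance(plane, corner) < 0.0 for corner in corners)


def is_visible(planes, tests, center, radius, box_min, box_max):
    """Run only the tests selected for each plane."""
    for name, plane in planes.items():
        volumes = tests.get(name, ())
        if "sphere" in volumes and sphere_outside(plane, center, radius):
            return False
        if "box" in volumes and box_outside(plane, box_min, box_max):
            return False
    return True


# A toy axis-aligned "frustum": 1 <= z <= 10 and -5 <= x <= 5.
PLANES = {
    "near": (0.0, 0.0, 1.0, -1.0),
    "far": (0.0, 0.0, -1.0, 10.0),
    "left": (1.0, 0.0, 0.0, 5.0),
    "right": (-1.0, 0.0, 0.0, 5.0),
}
# Sphere on every plane, box additionally on the near/far planes only.
TESTS = {
    "near": ("sphere", "box"),
    "far": ("sphere", "box"),
    "left": ("sphere",),
    "right": ("sphere",),
}
```

The sphere test is cheap but loose; the box test costs eight dot products per plane but is tighter, so restricting it to a couple of planes is one possible trade-off.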

Per-triangle ray casting

Geometry.cast() now makes it possible to get the triangle ID (the first index of the triangle) that crosses the specified ray. Combined with Mesh.cast() and Group.cast(), it makes it very easy to perform ray-casting at any level, from the root of the scene to the geometry level.
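For readers unfamiliar with per-triangle ray casting, here is a minimal sketch in Python. It uses the well-known Möller-Trumbore intersection test, which is a choice made for this example; the post does not say which algorithm Geometry.cast() uses. The sketch also assumes that the “triangle ID” is the offset of the triangle's first index in the index buffer, and the `cast` function below is a stand-in, not Minko's method.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance along the ray, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv     # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def cast(origin, direction, vertices, indices):
    """Offset of the closest hit triangle's first index, or -1."""
    best_id, best_t = -1, float("inf")
    for i in range(0, len(indices), 3):
        v0, v1, v2 = (vertices[j] for j in indices[i:i + 3])
        t = ray_triangle(origin, direction, v0, v1, v2)
        if t is not None and t < best_t:
            best_id, best_t = i, t
    return best_id
```

For a quad made of two triangles with indices `[0, 1, 2, 0, 2, 3]`, a ray hitting the second triangle returns 3, and a ray missing the quad returns -1.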

Local/world gizmos in Minko Studio

Gizmos are the key to any WYSIWYG editor. We’ve enhanced Minko Studio’s gizmos not only by adding scale/rotation gizmos but also the possibility to choose whether those gizmos should be used in local or world space.

Smooth shadows in Minko Studio

Shadows are a very important part of any real-time 3D immersive application. Thus, being able to control the quality of the shadows while keeping interactive framerates is very important. The minko-lighting plugin already introduced soft shadows a few weeks ago, and now the corresponding options are available right inside Minko Studio! It’s now easier than ever to fine tune the lighting engine in order to get the best performance/quality ratio.

Documentation

Minko now has a new, clean and fast growing wiki: the developers Hub. You will find lots of old and new tutorials, but also lots of new examples and projects done with Minko!

Fixes

  • Geometry.fillNormalsData() and Geometry.fillTangentsData() will reset the corresponding buffers to make sure normals/tangents are not incremented every time they are supposed to be updated.
  • Picking has been completely refactored: all the known bugs have been fixed and we are now using a 1*1 BitmapData/scissor rectangle to get better performance.
  • The Collada loader will now inspect 3D transforms in order to handle negative scales properly. It will also invert the z axis at the vertex level in order to perform right- to left-handed coordinates conversion. It fixes problems occurring when loading animations built using symmetry.
Don’t forget to “watch” Minko’s github repository to get your daily dose of new features and fixes!

New Minko Feature: ByteArray Streams

I’ve just pushed my work of the past few weeks on github and it’s a major update. But most of you should not have to change a single line of code in the best case. The two major changes are the activation of frustum culling – which now works perfectly well – and the use of ByteArray objects to store vertex/index streams data.

Using ByteArray instead of Vector, why are we doing this?

As you might know, Number is the equivalent of the “double” data type and as such each value is stored on 64 bits. As 32 bits is all a GPU can handle regarding vertex data, it is a big waste of RAM. Using ByteArray makes it possible to store floats as floats and avoid any memory waste. The same goes for indices stored as uint when they are actually shorts.

Another important optimization is the GPU upload. Using Number or uint requires the Flash player to re-interpret every value before upload: each 64-bit Number has to be turned into a 32-bit float, each 32-bit uint has to be turned into a 16-bit short. This process is slow by itself, but it also prevents the Flash player from simply memcopying the buffers into the GPU data. Thus, using ByteArray should really speed up the upload of the streams data to the GPU and make it as fast as possible. This difference should be even bigger on low-end and mobile devices.

Finally, it also makes it a lot faster to load external assets because it is now possible to memcopy chunks of binary files directly into vertex/index streams. It should also prove to be very very useful for a few exclusive – and quite honestly truly incredible – features we will add in the next few months.
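The memory argument is easy to check with a quick calculation. The sketch below uses Python's `struct` module as a stand-in for ByteArray (an analogy for the example, not Flash code; the little-endian layout is also an assumption here). It packs the same triangle as 64-bit doubles and 32-bit floats, and the same indices as 32-bit and 16-bit unsigned integers.

```python
import struct

# One triangle: 3 vertices * (x, y, z), plus 3 indices.
positions = [0.0, 1.0, 0.0, -1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
indices = [0, 1, 2]

# Like Vector.<Number>: every component is a 64-bit double.
as_doubles = struct.pack("<%dd" % len(positions), *positions)
# What the GPU actually consumes: 32-bit floats.
as_floats = struct.pack("<%df" % len(positions), *positions)

# Like Vector.<uint>: 32-bit indices, versus 16-bit shorts.
as_uints = struct.pack("<%dI" % len(indices), *indices)
as_shorts = struct.pack("<%dH" % len(indices), *indices)

print(len(as_doubles), len(as_floats))  # 72 36
print(len(as_uints), len(as_shorts))    # 12 6
```

Both buffers are halved, and a buffer that is already in the GPU's format can be copied as-is instead of being converted value by value.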

What does it change for you?

If you’ve never played around with the raw data of the vertex/index streams, it should not change a single thing in your code. For example, iterators such as VertexIterator and TriangleIterator will keep working just the way they did. A good example of this is the TerrainExample, which runs just fine without a single change. If you are relying on VertexStream.lock() or IndexStream.lock(), you will find that those methods now return a ByteArray instead of a Vector. You should update your code accordingly. If you want to see a good example of ByteArray manipulations for streams, you can read the code of the Geometry.fillNormalsData() and Geometry.fillTangentsData() methods.

What’s next?

This and some recent additions should make it much easier to keep streams data in the RAM without wasting too much memory and be able to restore it on context device loss. It’s not implemented yet but it’s a good first step on this complicated memory management path. Another possible feature would be to store streams data in compressed ByteArray. As LZMA compression is now available, it could save a lot of memory. The only price to pay would be to have to uncompress the data before being able to read/write it.
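The trade-off of compressed streams can be sketched with Python's standard `lzma` module. This is only an analogy: the data below is synthetic and deliberately repetitive, and Python's default container format is not necessarily the one Flash uses.

```python
import lzma
import struct

# Synthetic, repetitive vertex data (real meshes compress less well).
values = [float(i % 16) for i in range(3000)]
raw = struct.pack("<%df" % len(values), *values)

# Keep only the compressed copy in RAM...
compressed = lzma.compress(raw)

# ...and pay the decompression cost before any read/write or re-upload.
restored = lzma.decompress(compressed)
```

The round trip is lossless, so the restored buffer could be uploaded again after a context loss; the price is the decompression time on every access.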

Tutorial: Add pixel-perfect 3D mouse interactivity

In this tutorial we’re going to see how you can add pixel-perfect 3D mouse interactivity. I’ve already introduced a technique called “ray casting” in another article. But it works only with very basic static shapes. And sometimes, testing very complex shapes can be very painful performance-wise. It’s even more expensive when you want it to be very precise.

In this article, we will see a technique called “pixel picking”. This technique uses hardware acceleration to provide pixel-perfect mouse interactivity. It works very well for both static and animated models. The concept is very simple: we render the scene with one color per mesh. Then, we just have to get the pixel under the mouse cursor to know what mesh is “interactive”.

Of course, things are much more complicated in real life: this kind of stunt is pretty hard to pull off properly in a general purpose rendering pipeline. But Minko provides everything required out of the box! Even better, the minko-picking extension features a simple controller – the PickingController – that provides all the mouse signals we might need! This tutorial will explain how to set up the PickingController and listen for the mouse signals.
Pixel picking test application (sources)
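The “one color per mesh” concept can be sketched without any GPU. The Python example below is a toy model, not the minko-picking implementation: meshes are reduced to screen rectangles, the “frame buffer” is a list of lists, and all function names are made up for the illustration.

```python
def id_to_color(mesh_id):
    """Encode a mesh id (0 .. 2^24 - 1) as an (r, g, b) byte triple."""
    return ((mesh_id >> 16) & 0xFF, (mesh_id >> 8) & 0xFF, mesh_id & 0xFF)


def color_to_id(color):
    """Decode an (r, g, b) byte triple back into a mesh id."""
    r, g, b = color
    return (r << 16) | (g << 8) | b


def render_picking_pass(width, height, meshes):
    """Fake rasterizer: each mesh is a rectangle (id, x0, y0, x1, y1)."""
    background = id_to_color(0)  # id 0 is reserved for "nothing"
    frame = [[background] * width for _ in range(height)]
    for mesh_id, x0, y0, x1, y1 in meshes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                frame[y][x] = id_to_color(mesh_id)
    return frame


def pick(frame, mouse_x, mouse_y):
    """Read the pixel under the cursor and decode the mesh id."""
    mesh_id = color_to_id(frame[mouse_y][mouse_x])
    return mesh_id or None
```

Because the rasterizer resolves visibility, the pixel under the cursor always belongs to the front-most mesh, whatever its shape or animation state.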

Create and setup the PickingController

The first step is to instantiate a new PickingController. The constructor takes only one argument: the “picking rate” of the controller. This value determines how many times per second the controller will try to execute the picking pass and the relevant mouse signals. The lower the picking rate, the better the performance. A picking rate of 30 should be more than enough for 99% of applications. You can also set that value at any time using the PickingController.pickingRate property. Setting the picking rate to half of the frame rate will work just fine for most applications and should be completely painless performance-wise. By default, the picking rate is set to 15.
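What a picking rate means in practice can be sketched with a small throttle. This Python class is hypothetical and only illustrates the principle; it does not reproduce how the PickingController schedules its picking pass internally. Time is expressed in milliseconds.

```python
class PickingThrottle:
    """Allow at most `picking_rate` picking passes per second."""

    def __init__(self, picking_rate=15):
        self.picking_rate = picking_rate
        self._last_pick_ms = None

    def should_pick(self, now_ms):
        """Return True when enough time has passed since the last pass."""
        interval_ms = 1000.0 / self.picking_rate
        if self._last_pick_ms is None or now_ms - self._last_pick_ms >= interval_ms:
            self._last_pick_ms = now_ms
            return True
        return False
```

Called once per frame, such a throttle lets the expensive picking pass run only on some frames: with a rate of 10 and one frame every 50 ms, every other frame triggers a pass.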

Set the mouse events source

The job of the PickingController is to listen for the mouse events on one (or more) specific dispatcher(s) and re-dispatch them as mouse signals. The difference between the original events and the signals executed by the PickingController is that the signals are aware of the 3D scene. To set up the dispatcher to listen to, you just have to call PickingController.bindDefaultInputs() and provide the IDispatcher object to listen to.

Setup the PickingController on the 3D scene

In most cases, you don’t want the whole 3D scene to be mouse interactive. Sometimes it’s just a Mesh or a Group. The PickingController can be added to any Mesh/Group so it’s easy to target precisely what is interactive and what is not. The basic use case is to add mouse interactivity on a single Mesh.

But you also might want to listen for the mouse signals triggered by a whole sub-scene instead of a single mesh. For example, some skinned 3D assets have multiple meshes animated by a single skeleton. To do this, we can add the PickingController on a Group. In that case, the PickingController will execute mouse signals for all the Mesh descendants of the target group. You don’t have to worry about the descendants of the groups targeted by a PickingController: it will listen for the Group.descendantsAdded and Group.descendantsRemoved signals to start/stop tracking any descendant Mesh added to this part of the scene. Thus, if your whole 3D scene is interactive, you can add the PickingController directly on the Scene node.

Listen for the mouse signals

To catch 3D mouse events, you just have to add callback(s) to any of the PickingController.mouse* signals. The available signals are:
  • mouseClick, mouseDown, mouseUp: executed when the left button is clicked, down or up
  • mouseRightClick, mouseRightDown, mouseRightUp: executed when the right button is clicked, down or up
  • mouseMiddleClick, mouseMiddleDown, mouseMiddleUp: executed when the middle button is clicked, down or up
  • mouseDoubleClick: executed when the user makes a double click
  • mouseMove: executed when the mouse moves
  • mouseWheel: executed when the mouse wheel turns
  • mouseRollOver, mouseRollOut: executed when the mouse rolls over/out of a mesh
For example, you can add callbacks to catch the left and the right click signals. It would be too difficult to use the PickingController if the mouse signals were triggered only when an actual 3D object is under the cursor. For example, it would be pretty hard to select/unselect objects without listening to some actual 2D mouse events. The code would then quickly become very complicated, mixing both 2D mouse events and 3D mouse signals. Therefore, the mouse signals are triggered whenever the corresponding mouse event is dispatched (and when the picking rate allows it of course). As a direct consequence, the mesh : Mesh argument passed to the callbacks is null when there is no actual interactive 3D object under the mouse cursor.
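The consequence of that design, a callback that must handle an empty `mesh` argument, can be sketched as follows. The example is in Python with a hypothetical minimal `Signal` class and a made-up callback signature `(mesh, mouse_x, mouse_y)`; the real signature of the PickingController signals may differ.

```python
class Signal:
    """Minimal signal: a list of callbacks executed with the same arguments."""

    def __init__(self):
        self._callbacks = []

    def add(self, callback):
        self._callbacks.append(callback)

    def execute(self, *args):
        for callback in list(self._callbacks):
            callback(*args)


selection = []


def on_click(mesh, mouse_x, mouse_y):
    """Select the clicked mesh, or clear the selection on an empty click."""
    if mesh is None:        # nothing interactive under the cursor
        selection.clear()
    else:
        selection.append(mesh)


mouse_click = Signal()
mouse_click.add(on_click)
```

Because the signal fires even when nothing is under the cursor, a single callback covers both selecting and unselecting, with no extra 2D mouse listener.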

Conclusion

You can find the complete source code of the picking example demo in the minko-examples repository on github. If you have questions/suggestions regarding this tutorial, you can ask them in the comments or on Aerys Answers, the official support forum for Minko.